After reading it, I was shocked! I believe it is a perfect collection of bad
advice. So bad and so perfect that I decided to write a “reaction” post about
it and express my view on the problem.
Here are my favourite quotes and comments:
“They must have the ability to steer the data science team carefully to make
sure that the team stays on track toward an eventually useful business solution.
This is very difficult if the managers don’t really understand the principles.
Managers need to be able to ask probing questions of a data scientist, who
often can get lost in technical details.”
I believe this is an incredibly arrogant statement! What
data scientists need is not probing questions, but a clear vision of the
business needs (they can understand that, trust me). If you try to
interrogate a data scientist, they will end up “dumbing it down for a 3-year-old”
and all the meaning will get lost. This is the way of the manager from the Dilbert cartoons.
“You should learn either R or SAS, as they are the most
widely used programming languages by data scientists. Having the
intermediate skills in either of these languages will help you to communicate
confidently with data scientists.”
So, in other words, managers are people who are too dumb to
be employees. I expect a manager to be a professional in something (like,
actually, managing people!), not just to be able to fake understanding and
confidence, and not just to have intermediate skills in order to communicate. If
you are going to be a data scientist, even as a manager, you will need to learn how
to code. And please, do it well. Maybe not ninja level, but enough to be able
to do some tasks well and to read the code of others.
Also, why only R or SAS? I would prefer a manager to spend
time learning about different tools, languages, databases, etc. and their pros and
cons, in order to make proper decisions about architecture. A manager actually
could have some time to do that. :)
“For example, if the manager suspects that middle-aged
men living in San Francisco have some particular interesting churn behaviour,
they could compose a SQL query like this:”
SELECT * FROM CUSTOMERS WHERE AGE > 45 AND SEX = 'M' AND STATE = 'SANFRAN'
This example greatly oversimplifies the whole thing, so
much that it can be misleading. It
gives the impression that such analysis is simple and has little depth to it. The
way data scientists use SQL is very different from the way developers normally use it.
For that example, I would prefer a query that compares
“churn behavior” across various groups! So it’s a GROUP BY. Then you need a
metric of churn, say the average lifespan of a user. Then you need to decide
how to account for users who haven’t churned yet. And you need to control for
sample size, so each group should report how many users it contains. Here is a query:
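-- MySQL syntax assumed; `join` is a reserved word, so it is quoted with backticks.
SELECT
  SEX, STATE, IF(AGE > 45, '>45', '<=45') AS age1,
  COUNT(1) AS all_users,         -- sample size of the group
  COUNT(quit) AS churned_users,  -- COUNT skips NULLs, so only users with a quit date
  -- average lifespan in days of users who already quit
  AVG(IF(quit IS NOT NULL, DATEDIFF(quit, `join`), NULL)) AS avg_lifespan_churned,
  -- average age in days of accounts that are still active
  AVG(IF(quit IS NULL, DATEDIFF(NOW(), `join`), NULL)) AS avg_lifespan_live
FROM CUSTOMERS
GROUP BY SEX, STATE, age1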
And this is still a fake example that is greatly oversimplified.
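To make the group comparison concrete, here is a minimal runnable sketch of the same idea using Python's built-in sqlite3 module. Everything in it is an assumption for illustration: the table layout, the column name joined (used instead of join, which is a reserved word), the invented rows, and the min_users threshold, which is only a crude stand-in for a real sample size check. SQLite has no IF() or DATEDIFF(), so the sketch uses CASE and julianday() instead.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE customers (sex TEXT, state TEXT, age INTEGER, joined TEXT, quit TEXT)"
)
# Invented rows for illustration only; quit is NULL for users who are still active.
rows = [
    ("M", "CA", 50, "2015-01-01", "2015-03-02"),
    ("M", "CA", 52, "2015-01-01", None),
    ("M", "CA", 30, "2015-01-01", "2015-01-31"),
    ("F", "CA", 47, "2015-02-01", None),
    ("F", "NY", 29, "2015-02-01", "2015-02-11"),
]
conn.executemany("INSERT INTO customers VALUES (?, ?, ?, ?, ?)", rows)

QUERY = """
SELECT
  sex, state,
  CASE WHEN age > 45 THEN '>45' ELSE '<=45' END AS age_group,
  COUNT(*) AS all_users,
  COUNT(quit) AS churned_users,
  AVG(CASE WHEN quit IS NOT NULL
           THEN julianday(quit) - julianday(joined) END) AS avg_lifespan_churned,
  AVG(CASE WHEN quit IS NULL
           THEN julianday(:today) - julianday(joined) END) AS avg_lifespan_live
FROM customers
GROUP BY sex, state, age_group
HAVING COUNT(*) >= :min_users
ORDER BY sex, state, age_group
"""


def churn_by_group(connection, today, min_users=1):
    """Return one row per (sex, state, age group) that has at least min_users users."""
    params = {"today": today, "min_users": min_users}
    return connection.execute(QUERY, params).fetchall()


result = churn_by_group(conn, today="2015-04-01")
for row in result:
    print(row)
```

Raising min_users drops the groups that are too small to say anything about. Note that the average lifespan of live users is still a censored number (those users have not quit yet), so even this version is far from a proper survival analysis.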
“To understand data science, one must also know the basics of hypothesis testing
and experiment-design to comprehend the meaning and context of the data.”
This is obsolete statistics! P-values are crap! Just read something from this
guy: http://andrewgelman.com/. Bayesians are taking over, and that should at
least be mentioned.
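For a flavour of what the Bayesian alternative looks like in practice, here is a small sketch that compares churn rates of two groups with a Beta-Binomial model instead of a p-value. This is my own illustration, not something taken from the quoted article or from Gelman's blog. The flat Beta(1, 1) prior, the function name and the churn counts are all assumptions invented for the example.

```python
import random


def posterior_prob_b_lower(churn_a, n_a, churn_b, n_b, draws=20000, seed=0):
    """Monte Carlo estimate of P(rate_B < rate_A | data) under flat Beta(1, 1) priors."""
    rng = random.Random(seed)
    wins = 0
    for _ in range(draws):
        # With a Beta(1, 1) prior, the posterior of a binomial rate is
        # Beta(1 + churned, 1 + not churned).
        rate_a = rng.betavariate(1 + churn_a, 1 + n_a - churn_a)
        rate_b = rng.betavariate(1 + churn_b, 1 + n_b - churn_b)
        wins += rate_b < rate_a
    return wins / draws


# Hypothetical counts, invented for illustration only.
prob = posterior_prob_b_lower(churn_a=120, n_a=1000, churn_b=90, n_b=1000)
print(f"P(group B churns less than group A) is roughly {prob:.2f}")
```

The output is a direct statement about the question the business cares about (how likely is it that B is better than A), which is usually easier to discuss with non-statisticians than a p-value.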
“Visualization wise, it can be immensely helpful to be familiar with data
visualization tools like ggplot and d3.js.”
Why just those two? Better know the market!
“A data science manager will know the basic machine learning techniques like
regression, clustering, SVM, decision trees, etc. They will also understand how
these concepts are applied to real world big data problems.”
The same here. Better do research
on who is using what and why. There are tons of algorithms, and data scientists
tend to favor the ones they have already worked with. Learn from people who have
worked on similar problems and have already tried many techniques. This would be
useful, and it requires strong desk research and networking skills, so MBAs are
welcome.
“If “profitable” can be defined clearly based on existing data, this is a
straightforward database query.”
I would point out that ETL
(collecting, cleaning and preparing data in all ways) is 90% of the work. It is
a task that is never done well and always stays the major challenge. The manager
needs to plan for that.
Enough ranting! How to do it right:
- If you are a manager, be a
really good one: one who not only translates from one “language” to another but
also empowers people, turning outsiders into data users and data scientists into
active implementers.
- Be a really good specialist in
something! A hands-on specialist. You can never master everything. Statistical
inference is probably the closest to the business. Leave ML, ETL and DBA work to the pros. :)
- Remember the bigger picture.
Where is your “Digital Manager” in the marketing department? I guess nowhere to
be found. It used to be a hot role, but now digital has taken over, and the CMO is
that manager now. Programmatic is a new silo in the marketing department, btw. The
same will happen with Data Science. People are now realizing that making
decisions with data is better than going with gut feeling, and data will soon take
over all decision making in companies. Every manager will be a data science
manager: a manager of marketing data scientists, product data scientists, finance
data scientists, HR data scientists, etc. We are on a big trip right now!