Thoughts on statistical inference and big data: MBAs as Data Scientists

Recently a post called “How an MBA professional can become a data science savvy manager” made it to my feed. Well, I’m an MBA who graduated from INSEAD in 2011, and I am currently Chief Data Scientist at Wego. So, I was interested.

After reading it, I was shocked! I believe it is a perfect collection of bad advice. So bad and so perfect that I decided to write a “reaction” post about it and express my view on the problem.

Here are my favourite quotes and comments:

They must have the ability to steer the data science team carefully to make sure that the team stays on track toward an eventually useful business solution. This is very difficult if the managers don’t really understand the principles. Managers need to be able to ask probing questions of a data scientist, who often can get lost in technical details.

I believe this is an incredibly arrogant statement! What data scientists need is not probing questions, but a clear vision of the business needs (they can understand that, trust me). If one tries to interrogate a data scientist, he will end up “dumbing it down for a 3-year-old” and all the meaning will get lost. This is the way of the manager from the Dilbert cartoons.

You should learn either R or SAS, as they are the most widely used programming languages by data scientists. Having the intermediate skills in either of these languages will help you to communicate confidently with data scientists.

So, in other words, managers are people who are too dumb to be employees. I expect a manager to be a professional at something (like, actually, managing people!). Not just to have the ability to fake understanding and confidence. Not just to have intermediate skills in order to communicate. If you are going to be a data scientist, even as a manager, you will need to learn how to code. And please, do it well. Maybe not ninja level, but well enough to do some tasks properly and to read the code of others.

Also, why only R or SAS? I would prefer a manager to spend time learning about different tools, languages, databases, etc. and their pros and cons, to be able to make proper decisions about architecture. A manager could actually have some time to do that. :)

For example, if the manager suspects that middle-aged men living in San Francisco have some particular interesting churn behaviour, they could compose a SQL query like this: SELECT * FROM CUSTOMERS WHERE AGE > 45 AND SEX = 'M' AND STATE = 'SANFRAN'

This example greatly oversimplifies the whole thing. So much that it can be misleading. It gives the impression that such analysis is simple and has little depth to it. The data science way of using SQL is very different from what developers normally do.

For that example, I would prefer a query that compares “churn behavior” across various groups! So it’s a GROUP BY. Then you need a metric of churn, say the average lifespan of a user. Then you need to decide how to account for users that haven’t churned yet. And you need to control for sample size. Here is a query:

SELECT
  SEX, STATE, IF(AGE > 45, '>45', '<=45') AS age1,
  COUNT(1) AS all_users,
  COUNT(quit) AS churned_users,
  AVG(IF(quit IS NOT NULL, DATE_DIFF(quit, join), NULL)) AS avg_lifespan_churned,
  AVG(IF(quit IS NULL, DATE_DIFF(NOW(), join), NULL)) AS avg_lifespan_live
FROM CUSTOMERS
GROUP BY SEX, STATE, age1

And this is still a fake example that is greatly oversimplified.
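To make the point about users who haven’t churned yet a bit more concrete, here is a minimal Python sketch of one possible approach: count churn events per day of exposure, so that still-active users contribute the time they have survived so far instead of being dropped. Everything in it is an assumption made for illustration: the row layout, the sample data, the `churn_by_group` helper and the `min_users` cutoff are hypothetical, and a real analysis would more likely use proper survival methods.

```python
from collections import defaultdict
from datetime import date

# Hypothetical rows: (sex, state, age, join_date, quit_date or None if still active)
CUSTOMERS = [
    ("M", "CA", 52, date(2016, 1, 1), date(2016, 1, 31)),
    ("M", "CA", 47, date(2016, 1, 1), None),
    ("F", "CA", 30, date(2016, 3, 1), date(2016, 3, 11)),
    ("F", "CA", 28, date(2016, 2, 1), date(2016, 2, 21)),
]


def churn_by_group(rows, today, min_users=2):
    """Return churn events per day of exposure for each (sex, state, age band)."""
    stats = defaultdict(lambda: {"users": 0, "churned": 0, "days": 0})
    for sex, state, age, joined, quit_date in rows:
        band = ">45" if age > 45 else "<=45"
        group = stats[(sex, state, band)]
        group["users"] += 1
        if quit_date is not None:
            group["churned"] += 1
        # Active users are right-censored: they add exposure up to `today`.
        group["days"] += ((quit_date or today) - joined).days

    result = {}
    for key, group in stats.items():
        # Skip groups that are too small (or have no exposure) to say anything.
        if group["users"] < min_users or group["days"] == 0:
            continue
        result[key] = dict(group, churn_per_day=group["churned"] / group["days"])
    return result


if __name__ == "__main__":
    for key, group in churn_by_group(CUSTOMERS, today=date(2016, 3, 31)).items():
        print(key, group)
```

The rate is only comparable across groups if churn risk is roughly constant over a user’s lifetime, which is itself an assumption worth checking.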

To understand data science, one must also know the basics of hypothesis testing and experiment-design to comprehend the meaning and context of the data.

This is obsolete statistics! P-values are crap! Just read something by this guy: http://andrewgelman.com/. Bayesians are taking over, and that should at least be mentioned.
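As a rough illustration of what a Bayesian take on the churn comparison could look like, here is a small Python sketch. It is not taken from the blog linked above. It assumes churn is a simple yes/no outcome per user (ignoring the not-yet-churned problem), puts a uniform Beta(1, 1) prior on each group’s churn rate, and estimates by simulation the probability that group A churns more than group B. The function name `prob_a_churns_more` and its parameters are hypothetical.

```python
import random


def prob_a_churns_more(churned_a, users_a, churned_b, users_b,
                       draws=20000, seed=0):
    """Monte Carlo estimate of P(rate_a > rate_b) under Beta(1, 1) priors."""
    rng = random.Random(seed)
    wins = 0
    for _ in range(draws):
        # With a uniform prior, the posterior of a churn rate is
        # Beta(1 + churned, 1 + not_churned).
        rate_a = rng.betavariate(1 + churned_a, 1 + users_a - churned_a)
        rate_b = rng.betavariate(1 + churned_b, 1 + users_b - churned_b)
        if rate_a > rate_b:
            wins += 1
    return wins / draws


if __name__ == "__main__":
    # Hypothetical counts: 30 of 100 users churned in group A, 10 of 100 in B.
    print(prob_a_churns_more(30, 100, 10, 100))
```

The output reads directly as “how sure are we that A is worse than B”, which is usually the question the business is actually asking.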

Visualization wise, it can be immensely helpful to be familiar with data visualization tools like ggplot and d3.js.

Why just those two? Better know the market!

A data science manager will know the basic machine learning techniques like regression, clustering, SVM, decision trees, etc. They will also understand how these concepts are applied to real world big data problems.

The same here. Better to do research on who is using what and why. There are tons of algorithms, and data scientists tend to favor the ones they have already worked with. Learn from people who have tried solving similar problems and have already tried many techniques. This would be useful and requires strong desk analysis and networking skills, so MBAs are welcome.

If “profitable” can be defined clearly based on existing data, this is a straightforward database query.

I would point out that ETL (collecting, cleaning and preparing data in all ways) is 90% of the work. It is a task that is never done well and always remains a major challenge. The manager needs to plan for that.

Enough ranting! How to do it right:

If you are a manager, be a really good one: someone who not only translates from one “language” to another but also empowers people, turning people outside the team into data users and data scientists into active implementers.

Be a really good specialist in something! A hands-on specialist. You can never master everything. Statistical inference is probably closest to the business. Leave ML, ETL and DBA work to the pros. :)

Remember the bigger picture. Where is your “Digital Manager” in the marketing department? I guess nowhere to be found. It used to be a hot role, but now digital has taken over. The CMO is that manager now. Programmatic is a new silo in the marketing department, btw. The same will happen with Data Science. People are realizing now that making decisions with data is better than with gut feeling, and data will soon take over all decision making in companies. Every manager will be a data science manager. Manager of marketing data scientists, product data scientists, finance data scientists, HR data scientists, etc. We are on a big trip right now!


Sunday, December 11, 2016

