Monday, August 21, 2017

The Death Spiral of small retail

I work with retail now, and I started to notice interesting stuff every time I buy something.

My daughter just came back from Russia with a dream – a hoverboard (aka gyro-scooter). I wanted to try it before buying, so Qoo10 or Lazada was not an option. After a bit of searching, I found a shop nearby. We came there, and it was closed. On a Sunday! I already saw the signs of the retail “Death Spiral” as taught by Prof. Jack Cohen at INSEAD. So I was not surprised by their poor selection of models and colors when I got to the shop the next day.

Poor advertising, poor selection and failure to keep the lights on – these are very recognizable turns on the Death Spiral.

It is natural for a business to run in cycles. Good times pass and hard times come. When cash is low, it is simple to cut some “frills” like advertising. But this undermines sales – new customers don’t learn about you.

The cash gets lower, and the next thing to cut is stock. Why buy new stuff if sales are weak? The thing is – your A products will run out of stock first. And there you stand with only B and C products on the shelves, teaching every customer that you are a bad shop with bad products – not even worth checking out next time.

The cash gets lower, and you try to cut operating hours. Getting rid of employees seems reasonable because it’s pretty much the only flexible cost that you have left. You can do the math and cut the days or hours when sales don’t cover salaries. And yet you keep teaching people walking by that you are closed and not worth checking next time they need something. The end is near.

Is there a way out?

The deeper you go down the spiral, the harder it is to get out. But you can try certain things (in this order):
  • Cut costs that don’t undermine your effective advertising, your assortment of A and B products, or your opening hours.
  • Use financial reserves that you created during good times (you wish, nobody does that).
  • Try to delay payments to suppliers and your landlord. All my small business experience tells me – it is very possible and underused by SMEs.
  • Seek external funding. Yes, even if it costs you interest – still better than cutting advertising.


If nothing works – think about exiting. If you are forced onto the Death Spiral, then you are better off going out of business faster, with fewer losses.


PS By A, B and C products I refer to ABC-analysis.

Sunday, December 11, 2016

MBAs as Data Scientists – rant post

Recently a post called “How an MBA professional can become a data science savvy manager” made it to my feed. Well, I’m an MBA who graduated from INSEAD in 2011 and am currently the Chief Data Scientist at Wego. So I was interested.

After reading it, I was shocked! I believe it is a perfect collection of bad advice. So bad and so perfect that I decided to write a “reaction” post about it and express my view on the problem.

Here are my favourite quotes and comments:

They must have the ability to steer the data science team carefully to make sure that the team stays on track toward an eventually useful business solution. This is very difficult if the managers don’t really understand the principles. Managers need to be able to ask probing questions of a data scientist, who often can get lost in technical details.
I believe this is an incredibly arrogant statement! What data scientists need is not probing questions, but a clear vision of the business needs (they can understand that, trust me). If one tries to interrogate a data scientist, they will end up “dumbing it down for a 3-year-old” and all the meaning will get lost. This is the way of a manager from the Dilbert cartoons.
You should learn either R or SAS, as they are the most widely used programming languages by data scientists.  Having the intermediate skills in either of these languages will help you to communicate confidently with data scientists.
So, in other words, managers are people who are too dumb to be employees. I expect a manager to be a professional at something (like, actually, managing people!). Not just able to fake understanding and confidence. Not just having intermediate skills to be able to communicate. If you are going to be a data scientist – even a managing one – you will need to learn how to code. And please, do it well. Maybe not at ninja level, but well enough to do some tasks properly and to read other people's code.

Also, why only R or SAS? I would prefer a manager to spend time learning about different tools/languages/databases/etc. and their pros and cons, to be able to make proper decisions about architecture. A manager actually might have some time for that. :)

For example, if the manager suspects that middle-aged men living in San Francisco have some particular interesting churn behaviour, they could compose a SQL query like this:
SELECT * FROM CUSTOMERS WHERE AGE > 45 AND SEX = 'M' AND STATE = 'SANFRAN'
This example greatly oversimplifies the whole thing. So much that it can be misleading. It gives the impression that such analysis is simple and has little depth to it. The way data scientists use SQL is quite different from what developers normally write.

For that example, I would prefer a query that compares “churn behavior” across various groups! So it’s a GROUP BY. Then you need a metric of churn – say, the average lifespan of a user. Then you need to decide how to account for users that haven’t churned yet. And you need to control for sample size. Here is a query:

SELECT
  SEX, STATE, IF(AGE>45, ">45", "<=45") AS age1,
  COUNT(1) AS all_users,
  COUNT(quit) AS churned_users,
  AVG(IF(quit IS NOT NULL, DATE_DIFF(quit, join), NULL)) AS avg_lifespan_churned,
  AVG(IF(quit IS NULL, DATE_DIFF(NOW(), join), NULL)) AS avg_lifespan_live
FROM CUSTOMERS
GROUP BY SEX, STATE, age1

And this is still a fake example that is greatly oversimplified.   
To understand data science, one must also know the basics of hypothesis testing and experiment-design to comprehend the meaning and context of the data.
This is obsolete statistics! P-values are crap! Just read something from this guy – http://andrewgelman.com/. Bayesians are taking over, and that should at least be mentioned.
Visualization wise, it can be immensely helpful to be familiar with data visualization tools like ggplot and d3.js. 
 Why just those two? Better know the market!
A data science manager will know the basic machine learning techniques like regression, clustering, SVM, decision trees, etc. They will also understand how these concepts are applied to real world big data problems.
The same here. Better to do research on who is using what and why. There are tons of algorithms, and data scientists tend to favor the ones they have already worked with. Learn from people who tried solving similar problems and have already tried many techniques. This would be useful, and it requires strong desk analysis and networking skills, so MBAs are welcome.
If “profitable” can be defined clearly based on existing data, this is a straightforward database query.
I would point out that ETL (collecting, cleaning and preparing data in all ways) is 90% of the work. It is a task that is never done well and always remains the major challenge. The manager needs to plan for that.


Enough ranting! How to do it right:


  • If you are a manager – be a really good one. One who not only translates from one “language” to another but also empowers: turning people outside into data users and data scientists into active implementers.

  • Be a really good specialist in something! A hands-on specialist. You can never master everything. Statistical inference is probably the closest to the business. Leave ML, ETL and DBA work to the pros. :)

  • Remember the bigger picture. Where is the “Digital Manager” in your marketing department? I guess nowhere to be found. It used to be a hot role, but now digital has taken over and the CMO is that manager. (Programmatic is a new silo in the marketing department, btw.) The same will happen with Data Science. People are realizing that making decisions with data is better than with gut feeling, and data will soon take over all decision making in companies. Every manager will be a data science manager – a manager of marketing data scientists, product data scientists, finance data scientists, HR data scientists, etc. We are on a big trip right now!


Sunday, June 5, 2016

Tracking user events – dealing with two timestamps

A long time ago, a grey-haired man in data warehousing told me that dealing with timestamps is the most complicated thing. I didn’t believe him, but there really is a bit of “fun” in this area.
Here is a very common pitfall that many people hit (I see it in 3rd-party data a lot!). Suppose you decided to track user events/actions. Typically, you will have some code on the client (JS on the web or an SDK in apps) that sends data to your server.
Each event needs a timestamp. You have a choice of two different timestamps to use: your client code can take it on the user’s device, or your server can record the moment it receives the data from the client.
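If you end up recording both (which is what the title hints at), a quick sanity check on the gap between them might look like this – all table and column names below are made up for the illustration, not a real schema:

SELECT
  event_name,
  COUNT(1) AS events,
  -- client_ts is taken by the tracking code on the device,
  -- server_ts is recorded when the event reaches the server
  AVG(TIMESTAMP_TO_SEC(server_ts) - TIMESTAMP_TO_SEC(client_ts)) AS avg_gap_seconds
FROM [tracking.events]
GROUP BY event_name

The gap mixes clock error on the device with delivery delay, which is exactly where the “fun” starts.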
Both timestamps have advantages and disadvantages:

Thursday, March 17, 2016

Building robust real-time ETL on Google BigQuery. Lambda architecture inside BigQuery.



We have quite a standard setup with our data. Raw data is collected from several internal and external sources and is regularly uploaded into one place (in our case into Google BigQuery). Some sources (like our own production) do it hourly, others (often 3rd parties) do it daily.
The data is then transformed into many aggregated forms for consumption by end users: dashboards for internal users, reports for partners, inputs that go back to production and affect product behaviour. This transformation may be quite involved, with multiple dependencies. But mostly we cope with the task using BigQuery SQL+UDF and saving results as separate tables.
The obvious way to perform these transformations is to schedule them. If a data source finishes uploading every day at 1am, then schedule the job at 1.05am. If hourly data is normally uploaded by the 5th minute of the next hour, then schedule those jobs to run on the 10th minute of every hour. Small 5-minute gaps are not such a big problem anyway, and everything is supposed to work nicely.
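To give a feel for it, such a scheduled job is often nothing fancier than an aggregation query whose result is saved into a separate table that the dashboards read (the table and column names below are made up for the example; the scheduling itself lives outside the query, in cron or whatever scheduler you use):

SELECT
  DATE(created_at) AS report_date,
  source,
  COUNT(1) AS events
FROM [raw_data.daily_events]
GROUP BY report_date, source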
But the world is cruel! Data is not always delivered on time. Sometimes it is not delivered at all without manual intervention. If your hourly data made it to BigQuery on the 11th minute – here you go, please wait another hour to see it in the dashboards. And if your transformations require several data sources, the mess becomes even worse.
Moreover, the data is not always correct (always incorrect to be more correct!). Once in a while there is an issue and data gets re-uploaded or undergoes additional cleanups. Then you need to take care of all jobs that used this data and refresh them.
Well, all these issues are issues with the raw data, and we should be fighting them. But this is a war you can’t win in reality! Something will break anyway. If the data source is internal, your developers may have other prioritized tasks. And 3rd parties are simply not under your control. But it would at least be very nice if, once your raw data is in place, all end users had instant access to it without waiting for scheduled refreshes.
This is a big real-world problem. And what are possible solutions?
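As a rough sketch of where the title points – a “lambda architecture” inside BigQuery itself – you can union a pre-computed “batch” aggregate with the same aggregation computed on the fly over only the raw rows the batch has not covered yet (again, the names here are illustrative, not our actual setup):

SELECT report_date, source, events
FROM
  (SELECT report_date, source, events
   FROM [aggregates.daily_events_batch]
   WHERE report_date < CURRENT_DATE()),
  (SELECT DATE(created_at) AS report_date, source, COUNT(1) AS events
   FROM [raw_data.daily_events]
   WHERE DATE(created_at) >= CURRENT_DATE()
   GROUP BY report_date, source)

In legacy BigQuery SQL the comma in FROM acts as UNION ALL, so readers always get the batch results plus whatever fresh raw data has arrived since the last scheduled run.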

Thursday, November 26, 2015

BigQuery: don’t count(*) -> sum(1) !!!

I was always taught that the way to count rows in SQL is count(*). The Google tutorial for BigQuery also uses it – link.

But there is a problem with this approach once you start working with complicated queries that use sub-queries. Look at this code:


SELECT
  SUM(IF(year%2=0, children, 0))
FROM (
  SELECT year, COUNT(*) AS children
  FROM [publicdata:samples.natality]
  GROUP BY year 
)

It will return an error:

“Error: Argument type mismatch in function IF: 'children' is type uint64, '0' is type int32.”

Somewhere in the guts of BigQuery, count(*) and 0 have different types, so it throws an error. Of course, you can cast count(*) to an integer – this code works:


SELECT
  SUM(IF(year%2=0, INTEGER(children), 0))
FROM (
  SELECT year, COUNT(*) AS children
  FROM [publicdata:samples.natality]
  GROUP BY year
)

But this solution is a nightmare – you never remember it upfront, only after BigQuery throws the error. A better way is to do INTEGER(count(*)) in the very beginning.
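That is, put the cast in the inner query from the start, something like:

SELECT
  SUM(IF(year%2=0, children, 0))
FROM (
  SELECT year, INTEGER(COUNT(*)) AS children
  FROM [publicdata:samples.natality]
  GROUP BY year
)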

Today I learned that the best way is to do sum(1). It's shorter and it just works – use it everywhere and never get that error again:


SELECT
  SUM(IF(year%2=0, children, 0))
FROM (
  SELECT year, sum(1) AS children
  FROM [publicdata:samples.natality]
  GROUP BY year 
)

Tuesday, June 30, 2015