Thoughts on statistical inference and big data: Tracking user events

Long time ago a man with grey hair in data warehousing told me that dealing with timestamps is the most complicated thing. I didn’t believe him, but you can find a bit of “fun” in this area.

This is a very common pitfall that many people hit (I see it in 3^rd party data a lot!). Suppose you decided to track user events/actions. Typically, you will have some code on the client (JS on the web or SDK in apps) that will be sending data to your server.

Each event needs a timestamp. You have a choice of 2 different timestamps to use: your client code can get it on user’s device or your server can record the moment of getting the data from client.

Both timestamps have advantages and disadvantages:

Server timestamp:

Plus: You have full control over accuracy of your server’s time and even time zone. This will really help to keep your data clean and accurate.
Minus: Your timestamp is distorted by network latency. Moreover, in smartphone apps you are probably using queueing to reduce network usage. You would be queueing events up to 10, for example, and then flush all of them. Especially this is important if you are working with developing markets with lots of cheap Androids on EDGE. As a result, your events arrive in batches with the same server timestamp and it’s impossible to get their order and time passed in between of these events.

Client timestamp:

Plus: Very accurate representation of user events on small scale – easy to get accurate order of events and gaps (seconds/milliseconds) between them.
Minus: You will be surprised how many users have inaccurate time set in their devices!

From my experience 1% to 5% of users (different products and audiences) always live in 2011 or even in 2028. I have hard time understanding what is the reason for it. Why would they do it?

I noticed that Philippines and Japan are clearly leading in number of users like that. The map below shows countries with high concentration of weird timestamps.

This problem is actually a nightmare. It will break all requests where you want to know trends for your user behavior. And this is the most common use-case for such data.

One possible solution is to create your own timestamp on the client - call some time server on the internet from the client and save the difference with system timestamp. But maintaining it would be kind of odd, no guarantee what user does to his timestamp next.

So what is a good solution? It’s obvious! Just store both timestamps and use the most appropriate for different tasks. As usual “store all you can” approach works best. For all fellow data engineers to note.

P.S. Short questions to readers:

Is there any other, better solution?
Any idea what people in Philippines and Japan are doing to their timestamps?

Thoughts on statistical inference and big data

Sunday, June 5, 2016

Tracking user events – dealing with two timestamps

No comments:

Post a Comment