Long time ago a man with grey hair in data warehousing told me that
dealing with timestamps is the most complicated thing. I didn’t believe him,
but you can find a bit of “fun” in this area.
This is a very common pitfall that many people hit (I see it in 3rd
party data a lot!). Suppose you decided to track user events/actions. Typically,
you will have some code on the client (JS on the web or SDK in apps) that will
be sending data to your server.
Each event needs a timestamp. You have a choice of 2 different
timestamps to use: your client code can get it on user’s device or your server
can record the moment of getting the data from client.
Server timestamp:
- Plus: You have full control over accuracy of your server’s time and even time zone. This will really help to keep your data clean and accurate.
- Minus: Your timestamp is distorted by network latency. Moreover, in smartphone apps you are probably using queueing to reduce network usage. You would be queueing events up to 10, for example, and then flush all of them. Especially this is important if you are working with developing markets with lots of cheap Androids on EDGE. As a result, your events arrive in batches with the same server timestamp and it’s impossible to get their order and time passed in between of these events.
Client timestamp:
- Plus: Very accurate representation of user events on small scale – easy to get accurate order of events and gaps (seconds/milliseconds) between them.
- Minus: You will be surprised how many users have inaccurate time set in their devices!
From my experience 1%
to 5% of users (different products and audiences) always live in 2011 or even
in 2028. I have hard time understanding what is the reason for it. Why would
they do it?
I noticed that Philippines
and Japan are clearly leading in number of users like that. The map below shows countries with high concentration of weird timestamps.
This problem is
actually a nightmare. It will break all requests where you want to know trends
for your user behavior. And this is the most common use-case for such data.
One possible solution is to create your own timestamp on the client -
call some time server on the internet from the client and save the difference
with system timestamp. But maintaining it would be kind of odd, no guarantee
what user does to his timestamp next.
So what is a good solution? It’s obvious! Just store both timestamps
and use the most appropriate for different tasks. As usual “store all you
can” approach works best. For all fellow data engineers to note.
P.S. Short questions to
readers:
- Is there any other, better solution?
- Any idea what people in Philippines and Japan are doing to their timestamps?
No comments:
Post a Comment
Note: Only a member of this blog may post a comment.