Sunday, May 24, 2015

Analysing spurious correlations: Nicolas Cage is drowning people?


I recently noticed several funny pictures of spurious correlations in my Facebook timeline. The source is here: http://www.tylervigen.com/spurious-correlations. It has about 20 funny pictures like the one above. Spurious correlations are a widely known phenomenon, but I decided to do some statistical analysis to explain the human psychology behind them.





Removing trends


If two variables both grow steadily, they will correlate. Nothing special about that. People love to measure stuff that keeps growing. So let's take the trends out and see what happens. I fitted a linear regression to each variable, subtracted the fitted line from the variable, and looked at the correlation of the residuals.
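Here is a minimal sketch of that detrending step in plain Python. The series are made up for illustration (they are not taken from the website), and the helper names `linear_residuals` and `pearson` are my own. The toy data is built so that both series grow while their wiggles move in opposite directions.

```python
from math import sqrt

def linear_residuals(y):
    """Residuals of y after subtracting its least-squares line over t = 0..n-1."""
    n = len(y)
    t = list(range(n))
    t_mean = sum(t) / n
    y_mean = sum(y) / n
    slope = (sum((ti - t_mean) * (yi - y_mean) for ti, yi in zip(t, y))
             / sum((ti - t_mean) ** 2 for ti in t))
    intercept = y_mean - slope * t_mean
    return [yi - (intercept + slope * ti) for ti, yi in zip(t, y)]

def pearson(a, b):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(a)
    a_mean, b_mean = sum(a) / n, sum(b) / n
    cov = sum((x - a_mean) * (y - b_mean) for x, y in zip(a, b))
    var_a = sum((x - a_mean) ** 2 for x in a)
    var_b = sum((y - b_mean) ** 2 for y in b)
    return cov / sqrt(var_a * var_b)

# Hypothetical data: 11 yearly observations, both trending upwards,
# with an alternating wiggle added to x and subtracted from y.
wiggle = [1 if i % 2 == 0 else -1 for i in range(11)]
x = [i + wiggle[i] for i in range(11)]
y = [2 * i - wiggle[i] for i in range(11)]

raw_r = pearson(x, y)
detrended_r = pearson(linear_residuals(x), linear_residuals(y))
print(f"raw r = {raw_r:.2f}, detrended r = {detrended_r:.2f}")
```

In this toy case the raw correlation is strongly positive, while the residuals are almost perfectly negatively correlated, so the shared growth trend completely hides what the series do around it.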


In some cases it killed the correlation:




In others nothing changed:



There must be something else!


I also saw that the number of positive correlations on that website is much higher than the number of negative ones. Maybe the author just selected positive correlations. But maybe it happens because people like to measure growing stuff. There could be many pairs with a strong negative correlation of residuals where the overall growth trend was cancelling it out!



What is the probability of getting such a correlation at random?


Back to stats class, back to formulas. On average there are 11 observations (n), and after trend correction the correlation coefficient (r) is about 70%. Knowing this, we can compute a statistic that follows Student's t-distribution with n-2 degrees of freedom: t = r * sqrt(n - 2) / sqrt(1 - r^2).


We get t = 2.98, and the probability of getting a value this high at random is about 0.77% (one-sided). Looks like we need to compare many variables to find such correlations at random! It can make you think that some of these correlations probably have a cause-effect relationship. But don't rush to this conclusion!
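The numbers can be checked with a short standard-library sketch. It uses the usual test statistic for a correlation coefficient, t = r * sqrt(n - 2) / sqrt(1 - r^2), and integrates the Student's t density numerically with Simpson's rule. The integration is my own shortcut to avoid extra dependencies, and the function names are hypothetical. Note that r = 0.70 exactly gives t of about 2.94; t = 2.98 corresponds to r of about 0.705, which still fits "about 70%".

```python
from math import gamma, pi, sqrt

def t_statistic(r, n):
    """Test statistic for a sample correlation r over n observations."""
    return r * sqrt(n - 2) / sqrt(1 - r * r)

def t_pdf(x, df):
    """Density of Student's t-distribution with df degrees of freedom."""
    c = gamma((df + 1) / 2) / (sqrt(df * pi) * gamma(df / 2))
    return c * (1 + x * x / df) ** (-(df + 1) / 2)

def upper_tail(t, df, steps=2000):
    """P(T > t) for t >= 0, using Simpson's rule on [0, t] (steps must be even)."""
    h = t / steps
    total = t_pdf(0.0, df) + t_pdf(t, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, df)
    return 0.5 - total * h / 3

n = 11
t_from_r = t_statistic(0.70, n)        # t for r exactly 0.70
p_one_sided = upper_tail(2.98, n - 2)  # tail probability for the post's t
p_two_sided = 2 * p_one_sided          # counts strong negative r as well

print(f"t(r=0.70) = {t_from_r:.2f}")
print(f"one-sided p = {p_one_sided:.2%}, two-sided p = {p_two_sided:.2%}")
```

The one-sided tail comes out near 0.77%, and doubling it to cover negative correlations gives roughly 1.5%, or about 1 in 65.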



Birthday paradox


The probability of 0.77% looks very low, but the key is that we never care which pairs of variables we compare to each other! This is similar to the famous birthday paradox: https://en.wikipedia.org/wiki/Birthday_problem

The probability that 2 given people were born on the same day of the year is about 1/365. But among only 23 people you will find at least one match with a probability of about 50%. This happens because we don't care which pair it is and compare all people to each other. And the number of pairs grows quadratically: n people form n(n-1)/2 pairs, so 23 people already give 253 pairs. This is a very counterintuitive fact, and many people have problems believing it. But it's true.
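The birthday numbers are easy to verify. This is a small sketch assuming 365 equally likely birthdays and no leap years; the function name is my own.

```python
from math import comb

def shared_birthday_probability(people, days=365):
    """Probability that at least two of `people` share a birthday."""
    p_all_distinct = 1.0
    for i in range(people):
        # The (i+1)-th person must avoid the i birthdays already taken.
        p_all_distinct *= (days - i) / days
    return 1 - p_all_distinct

pairs_23 = comb(23, 2)                  # number of distinct pairs among 23 people
p_23 = shared_birthday_probability(23)  # chance of at least one shared birthday

print(f"{pairs_23} pairs, P(match) = {p_23:.1%}")
```

With 23 people there are 253 pairs, and the chance of at least one match is just over 50%.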
The same thing happens with our correlations. Two random variables will show a strong correlation with a probability of about 1/65 (2 x 0.77%, or about 1.5%, because I care about negative correlations as well: abs(r)>70%).

But if you take only about 10 random variables (11 observations in each), which gives 45 pairs, then with a probability of about 50% you will find at least one strong positive or negative correlation (abs(r)>70%).
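A rough way to reproduce this is to treat every pair as an independent trial with a 1/65 chance of a strong correlation. That independence is a simplifying assumption of this sketch (pairs that share a variable are not strictly independent), and the function name is hypothetical.

```python
from math import comb

def prob_at_least_one(variables, p_pair=1 / 65):
    """Chance that at least one of all variable pairs shows abs(r) > 70%,
    assuming each pair is an independent trial with probability p_pair."""
    pairs = comb(variables, 2)
    return 1 - (1 - p_pair) ** pairs

for k in (5, 9, 10, 20):
    print(f"{k:2d} variables, {comb(k, 2):3d} pairs: {prob_at_least_one(k):.0%}")
```

Under this approximation, 9 variables (36 pairs) give about 43% and 10 variables (45 pairs) give about 50%, so the break-even point sits at roughly 9 to 10 variables. With 20 variables a strong spurious correlation is very likely.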


I believe the author of this website took a database of different variables and just looked at all combinations. That would easily give him a lot of correlations. In practice the problem would be scrolling through all these pairs to find the funny ones. Most of the pairs would make too much sense and wouldn't entertain the readers. But the important point is that even among those "reasonable" correlations that the author presumably dropped, there were many spurious ones!


Takeaway: The correlation-causation fallacy is well known. I believe the psychological reason for it is similar to the birthday paradox: we underestimate the number of pairs that can be compared to each other.




