Robert Matthews
If you’ve got a swimming pool, take extra care this month. No silly dives, running round the edge or other tomfoolery. In fact, you might want to think about draining the pool and covering it over.
The reason: Nicolas Cage has several new movies due out in the coming months.
Say what? Why on earth should a spate of releases featuring the famously hardworking Hollywood star have any bearing on swimming-pool safety?
Who knows – but the statistics are clear. An analysis of a decade of data by Harvard law student Tyler Vigen has revealed a clear correlation between the number of movies Cage appears in each year and the number of people who drown in their swimming pools.
And it doesn’t seem to be a fluke, either: detailed analysis shows the correlation is statistically significant. Many scientists would take that as pretty good evidence that we’re not dealing with random chance.
Before all this gets traction on social media (or Cage calls his lawyers), it should be made clear that it’s all true – and all baloney.
That may sound paradoxical, and it highlights a problem in the use of statistics. There really is a clear, strong and statistically significant correlation between Mr Cage’s annual output and deaths by falling into swimming pools (at least, in the US). And there is also no reason whatever to believe that the one has anything to do with the other.
There is, however, every reason for Mr Vigen to win an award for the most entertaining demonstration of one of the most important lessons in statistical science: correlation is not causation.
Hardly a week goes by without headlines proclaiming some “correlation” between one thing and another. Some of these make sense. There is, for example, a correlation between lung-cancer risk and number of cigarettes smoked.
But statisticians routinely warn against mistaking mere correlation for a genuine, causative connection.
So to keep us all on the path of righteousness, Mr Vigen has devised software that trawls the web for ridiculous correlations between random data-sets.
Some of the more entertaining ones now appear on his personal website Spurious Correlations. They include correlations between divorce rates and per capita consumption of margarine, sales of German cars in the US and suicides by car-crashing, and consumption of cheese and death by getting tangled in bedsheets.
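For readers who like to see the machinery, here is a rough Python sketch of how such a trawl works in principle – my own illustration, not Mr Vigen’s actual code. Generate enough unrelated data series and some of them are bound to line up by chance.

    import numpy as np
    from itertools import combinations

    rng = np.random.default_rng(1)
    # 200 made-up, entirely unrelated ten-year data series
    series = {f"series_{i}": rng.normal(size=10) for i in range(200)}
    # find the pair that happens to line up best
    best = max(combinations(series, 2),
               key=lambda pair: abs(np.corrcoef(series[pair[0]], series[pair[1]])[0, 1]))
    r = np.corrcoef(series[best[0]], series[best[1]])[0, 1]
    print(best, round(float(r), 2))  # two unrelated series with a striking correlation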
But there’s a problem. Clearly only the most statistically naive would believe in a link between, say, the age of Miss America and murders involving a source of heat (another found by Mr Vigen’s software).
Yet it’s really hard to avoid trying to find some sense in some of the others.
Could it be, for example, that the correlation between divorce rates and margarine reflects the financial hardship caused by divorce, which thus drives people to buy marg instead of butter?
Or perhaps the correlation between cheese and “death by bedsheet” is proof of the age-old belief that eating cheese at night causes restless sleep – with potentially lethal results?
The trouble is, standard statistics doesn’t really have a good way of dealing with such possibilities.
When a correlation emerges from data, researchers typically test only for the possibility that it is just the result of a random fluke.
To do that, they perform a so-called significance test.
This takes into account the size of the data-set (the bigger, the stronger the evidence), and the strength of the correlation, measured by the “correlation coefficient”, which ranges from zero (no discernible pattern) to plus or minus one (perfect correlation or anti-correlation).
Plugged into a formula, these two figures give a so-called p-value, which many researchers think measures the risk of the correlation being just a random fluke. Clearly, the lower this is, the better – and anything below 1 in 20 is regarded as “statistically significant”.
In the case of that crazy link between drownings and Nic Cage movies, the correlation was based on just 11 data points, but had a high correlation coefficient of 0.67. That leads to a p-value of just 1 in 40, making the correlation statistically significant.
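For those who want to check the arithmetic, here is a short Python sketch of that standard test, using the figures above. The conversion of the correlation coefficient to a t-statistic is the textbook one; the code itself is my own illustration.

    import math
    from scipy import stats

    r, n = 0.67, 11                            # correlation coefficient and number of yearly data points
    t = r * math.sqrt((n - 2) / (1 - r ** 2))  # standard conversion of r to a t-statistic
    p = 2 * stats.t.sf(t, df=n - 2)            # two-sided p-value with n - 2 degrees of freedom
    print(f"t = {t:.2f}, p = {p:.3f}")         # p comes out near 0.025, i.e. about 1 in 40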
Many researchers then assume they’ve ruled out fluke, and so must look to another explanation for the spurious connection.
The most obvious is a so-called “confounder” – that is, some hidden factor influencing both quantities, making the link between them seem real.
For example, you can bet that the number of sunburn cases is correlated to tanning lotion sales. Yet tanning lotion patently doesn’t cause sunburn; the correlation is caused by the hidden (if obvious) confounder: intense sunlight.
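A toy simulation makes the point – the numbers here are invented purely for illustration. Give two quantities a common hidden driver and they will correlate strongly with each other, despite neither causing the other.

    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(0)
    sunlight = rng.normal(size=1000)                      # the hidden confounder
    lotion_sales = 2 * sunlight + rng.normal(size=1000)   # driven up by sunny weather
    sunburn_cases = 3 * sunlight + rng.normal(size=1000)  # also driven up by sunny weather
    r, p = stats.pearsonr(lotion_sales, sunburn_cases)
    print(f"r = {r:.2f}, p = {p:.1e}")  # a strong, 'significant' correlation with no causal link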
This is what leads statisticians to suggest confounding as an explanation for some crazy correlation.
There’s no test to prove one exists, however. And in many cases – such as the link between drownings and Nic Cage movies – mere fluke still seems the most plausible cause.
Which kind of leaves us nowhere – until one learns that, contrary to what many scientists think, p-values aren’t very good at spotting fluke results.
Put simply, p-values are calculated on the assumption that the observed effect really is a fluke. As such, they cannot also be used to test whether that assumption is valid – which, unfortunately, is just the question we want answered.
Worse still, using p-values in this way tends to underestimate the risk of falling for a random fluke when the finding is inherently implausible.
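A rough back-of-the-envelope calculation shows why – the figures below are illustrative assumptions, not measurements. If genuine links are rare among the correlations being tested, then even “statistically significant” results are mostly flukes.

    prior_real = 0.01  # assume only 1 in 100 correlations tested reflects a genuine link
    power = 0.8        # chance a genuine link passes the significance test
    alpha = 0.05       # chance a pure fluke passes it anyway
    true_hits = prior_real * power
    false_hits = (1 - prior_real) * alpha
    print(f"{false_hits / (true_hits + false_hits):.0%} of 'significant' findings are flukes")
    # with these assumed numbers, roughly 86 per cent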
Statisticians have issued warnings about this for decades, seemingly with little impact. As a result, the research literature in many disciplines is shot through with “statistically significant” correlations every bit as spurious as the idea that we should avoid swimming pools when Nic Cage has a movie out.
Mr Vigen’s treasure trove of statistical silliness is undoubtedly entertaining. But it highlights serious issues about understanding correlations that have been ignored for far too long.
newsdesk@thenational.ae
Robert Matthews is visiting reader in science at Aston University, Birmingham