A skeptic’s guide to statistical significance

“There’s no safe amount of alcohol,” CNN reported. This year the largest ever study on the health risks of alcohol was released, attracting mass media attention and igniting a science journalism furor over its interpretation.

In the study, researchers found a significant increase in risk of death for individuals who consume even one drink a day. “Significant” here means statistically significant, which, confusingly, has little to do with what most people think of as “significant.” Statisticians quickly went after the media’s interpretation of the study: even though the increased risk of death was statistically significant, it represented only a tiny increase in personal risk. Risks do, however, spike rapidly for binge drinkers.

A meme about statistical significance. "The media" is looking past "Statistics" and focusing on a "new scientific study" instead. — Original Photo by Antonio Guillem.

The problem, however, is not with the study but with the interpretation of statistical significance. While this measure is a powerful tool deployed throughout the sciences, misunderstandings of it are rampant, even among scientists themselves.

Statistical significance is defined by reference to a “p-value,” most commonly with a threshold of p < 0.05. These are precise technical concepts with some strengths but many limitations. To understand them, let’s imagine a different study finding: “People taking antidepressants report a statistically significant decrease in depression (p = 0.03).”

If we see something like this, what are we supposed to think?

First, before even worrying about the p-value, don’t assume a “statistically significant” effect is “clinically significant.” This problem is most acute in large-sample studies. It’s easy to think that a large study, with many participants, is better than a small study, with few participants; in most cases this is true. However, large studies often pick up extremely small effects, and we have to be careful not to confuse these “statistically significant” effects with meaningful effects. This is what happened with the alcohol study: there appears to be a real effect of alcohol on health at very low levels of drinking, but that effect is so small that most people shouldn’t worry about it. Such overstated conclusions are not rare. Antidepressants, which are tested in massive pharmaceutical trials, are notorious for having relatively minor effects, although this problem is not confined to psychiatric medication.
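To make the large-sample point concrete, here is a minimal sketch in Python. It rests on simplifying assumptions that are not from any real study: two equal-sized groups, a known standard deviation of 1 in each group, and an observed difference that happens to equal a tiny hypothetical effect of 0.02 standard deviations. The function name and all numbers are illustrative.

```python
import math

def two_sided_p_value(effect_size, n_per_group):
    """Two-sided p-value for a simple two-group z-test.

    effect_size: observed difference in group means, in standard-deviation
                 units (each group is assumed to have a standard deviation of 1).
    n_per_group: number of participants in each group.
    """
    # Standard error of the difference between two means of n_per_group people each.
    standard_error = math.sqrt(2.0 / n_per_group)
    z = abs(effect_size) / standard_error
    # erfc(z / sqrt(2)) equals 2 * (1 - Phi(z)) and stays precise for tiny p-values.
    return math.erfc(z / math.sqrt(2.0))

tiny_effect = 0.02  # hypothetical: 2% of a standard deviation

for n in (100, 10_000, 1_000_000):
    p = two_sided_p_value(tiny_effect, n)
    print(f"n per group = {n:>9,}: p = {p:.3g}")
```

Under these assumptions, the same tiny difference is nowhere near significant with 100 people per group and overwhelmingly significant with a million per group. The effect has not grown; only the sample has.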

Second, and trickier, is p = 0.03. A common, but wrong, interpretation is that a p-value of 0.03 means that the study’s finding has only a 3% chance of being false. Unfortunately, the correct interpretation of “p” is clunkier and based on the idea of a null hypothesis. A null hypothesis is the assumption that the intervention we are studying has no effect. In this example, the null hypothesis would be that the antidepressant has the same effect as a placebo (sugar pills).

P-values cannot tell you how likely your preferred hypothesis (e.g., antidepressants improve depression) is to be true given your data; they can only tell you how likely your data are given your null hypothesis. In other words, they tell you how likely your data would be IF your null hypothesis were true. In this case, the null hypothesis is that antidepressants are no better than placebos. IF antidepressants were no better than placebo, THEN we would get data at least as extreme as ours only 3% of the time. The smaller the p-value, the less compatible the data are with the null hypothesis, and the less confident we should be in the null hypothesis.
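One way to see this “IF the null were true” logic in action is a permutation test, sketched below in Python. This is an illustration, not the method of any particular study. The depression scores are invented for the example (lower means less depressed), and the function name, number of shuffles, and seed are arbitrary choices. If the drug were no better than placebo, the group labels would be interchangeable, so we shuffle them many times and count how often chance alone produces a gap at least as large as the one observed.

```python
import random

def permutation_p_value(group_a, group_b, n_shuffles=10_000, seed=0):
    """Estimate a two-sided p-value for a difference in means by shuffling labels."""
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible

    def mean(values):
        return sum(values) / len(values)

    observed = abs(mean(group_a) - mean(group_b))
    pooled = list(group_a) + list(group_b)
    n_a = len(group_a)

    at_least_as_extreme = 0
    for _ in range(n_shuffles):
        # Under the null hypothesis the labels carry no information,
        # so any reshuffling of them is equally plausible.
        rng.shuffle(pooled)
        shuffled_gap = abs(mean(pooled[:n_a]) - mean(pooled[n_a:]))
        if shuffled_gap >= observed:
            at_least_as_extreme += 1

    # The +1 counts the observed arrangement itself, so the estimate is never exactly 0.
    return (at_least_as_extreme + 1) / (n_shuffles + 1)

# Hypothetical depression scores, invented for illustration only.
antidepressant_scores = [12, 9, 14, 8, 11, 10, 7, 13]
placebo_scores = [15, 13, 16, 12, 14, 17, 11, 15]

p = permutation_p_value(antidepressant_scores, placebo_scores)
print(f"Estimated p-value: {p:.4f}")
```

The number printed is the share of label shufflings that look at least as extreme as the real data. It says nothing directly about how probable it is that the drug works.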

With the p-value in hand, scientists have a choice to make. One option is to conclude that the data are too far-fetched if we assume the null hypothesis, and that we should therefore “reject the null.” That is, we should not assume the real world is like the null hypothesis. The other option is to conclude that the null hypothesis does a sufficient job of accounting for the data, and that we should not reject it. This is a judgement call: there is no definitive answer about how much certainty is required. However, different fields have their conventions. In the social sciences, scientists usually reject the null when the p-value is less than 0.05.

Finally, our interpretation of statistical significance should be informed by current knowledge. There’s already evidence that antidepressants are at least mildly effective, so another study corroborating that is no big surprise. But what would happen if a new study came out that said, “Antidepressants actually make depression worse (p = 0.03)”? This is just the kind of shocking finding that makes it into top scientific journals and gets science journalists frothing at the mouth with excitement. Just one problem: there’s very little reason to think it’s right. Looked at in isolation, the single study may be compelling; but considered against the background of all the other studies that have been done on antidepressants, a fluke seems far more likely. The boring truth is, when a finding flies in the face of existing evidence, it’s probably wrong. Unfortunately, scientists and journalists don’t always play their part, and it can be left to the reader to exercise due skepticism and interpret the finding carefully. Reading science is like drinking: consume responsibly.
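The role of prior knowledge can be made rough-and-ready with Bayes’ rule. The Python sketch below is a simplification under assumed numbers, none of which come from a real study: a study has an 80% chance of detecting a real effect (its power), a 5% chance of a false alarm when there is no effect (the usual significance threshold, used here in place of the exact p-value), and a prior plausibility that we pick by hand. The function name and the two priors are hypothetical.

```python
def probability_effect_is_real(prior, power=0.8, alpha=0.05):
    """Chance that a statistically significant finding reflects a real effect.

    prior: how plausible the effect was before the study (0 to 1).
    power: assumed chance of a significant result if the effect is real.
    alpha: assumed chance of a significant result if there is no effect.
    """
    true_positives = power * prior
    false_positives = alpha * (1.0 - prior)
    return true_positives / (true_positives + false_positives)

# Hypothetical priors, chosen only for illustration.
plausible_claim = 0.5    # e.g. a finding in line with earlier evidence
surprising_claim = 0.01  # e.g. a finding that contradicts earlier evidence

print(f"Plausible claim:  {probability_effect_is_real(plausible_claim):.2f}")
print(f"Surprising claim: {probability_effect_is_real(surprising_claim):.2f}")
```

With these assumed inputs, a significant result for the plausible claim is real about 94% of the time, while the same result for the surprising claim is real only about 14% of the time. The statistics are identical; the background knowledge is what differs.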

Thanks to Dr. John Kruschke for helpful comments.

Edited by Abigail Kimmitt and Maria Tiongco
