We recently took a guided tour of statistical significance, in which we focused on how the media often fails to correctly interpret statistical information. But journalists are not the only group tripped up by statistics. The scientific community itself has been engaged in deep debate about the proper use of statistical methodology. These debates have been so intense that some have referred to them as the “Statistics Wars.” The latest salvo was a statement in Nature, the world’s premier scientific journal, signed by over 800 scientists and statisticians calling for the end of statistical significance.
Just what is it about statistical significance that gets scientists so worked up?
You can find a more detailed explanation via the link in the first sentence of this post, but the key thing to understand about statistical significance is that it is a benchmark that gives scientists some insight into the uncertainty of their study. To evaluate statistical significance, scientists first specify a threshold, often 0.05. Then they compute a “p-value.” A p-value is the probability of getting data at least as extreme as the data actually observed, assuming there is actually no effect. If the p-value is less than the threshold, then the study is considered statistically significant.
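To make the definition concrete, here is a minimal sketch in Python. The scenario is hypothetical and not drawn from any study: a coin is flipped 20 times and lands heads 15 times, and the “no effect” assumption is that the coin is fair. The function name `binomial_p_value` is made up for this illustration, and the test is one-sided (it only counts outcomes with at least as many heads).

```python
from math import comb

def binomial_p_value(successes: int, trials: int, null_rate: float = 0.5) -> float:
    """One-sided p-value: the probability of seeing at least `successes`
    in `trials` attempts if the null hypothesis (rate = null_rate) is true."""
    return sum(
        comb(trials, k) * null_rate**k * (1 - null_rate) ** (trials - k)
        for k in range(successes, trials + 1)
    )

THRESHOLD = 0.05  # the conventional cutoff described above

# Hypothetical data: 15 heads in 20 flips of a supposedly fair coin.
p = binomial_p_value(successes=15, trials=20)
print(f"p-value = {p:.4f}")
print("statistically significant" if p < THRESHOLD else "not statistically significant")
```

Here the p-value comes out to about 0.02, below the 0.05 threshold, so this made-up result would count as statistically significant. Note that the number says how surprising the data would be if the coin were fair; it does not say how likely it is that the coin is fair.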
The use of statistical significance and p-values is associated with a particular approach to science called “hypothesis testing.” In hypothesis testing, the role of scientific investigation is to decide between competing hypotheses. For example, let’s say you’re testing whether a new cancer drug for melanoma is better than the current leading treatment option. A hypothesis testing approach decides between two options: the null hypothesis, that the drugs work the same (i.e., there is no difference between the drugs);1 and the alternative hypothesis, that there is a difference between the drugs. A hypothesis testing approach differs from other approaches that, for example, may care more about how much of a difference there is between the two drugs, rather than merely whether there is a difference.
Support of a hypothesis is always only provisional. One can never be 100% certain what the correct answer is; that’s what statistics helps with, quantifying uncertainty. But nothing can tell a scientist how much uncertainty they should be comfortable with, which is precisely the allure of statistical significance. For decades, statistical significance provided a decision rule for hypothesis testing in a number of social science fields. If the p-value was below the specified level, usually 0.05 (from a tentative statement by the famous statistician Ronald Fisher decades ago), then the results were statistically significant and the null hypothesis would be rejected. Moreover, alongside this decision rule came an implicit publishing rule: statistically significant results merited publishing, whereas statistically non-significant results were a waste of everybody’s time and should be stuffed in a file drawer.
There were simmering concerns about the pervasive and incautious use of statistical significance for decades, but shit hit the fan with the so-called “replication crisis,” an ongoing methodological crisis in science concerning the failure to obtain the same results when noted studies are repeated. In 2012, scientists at the biotech company Amgen reported they could only reproduce the results of 6 out of 53 landmark cancer studies when they repeated these studies in their lab; and in 2015, a group of psychologists failed to reproduce 70 out of 100 social psychology studies that they repeated. Quickly, statistical significance was identified as one of the main villains, not only because it is confusing and often misleading to both scientific and general audiences, but also because it contributes to very real methodological problems within science.
For one, alongside statistical significance came a preference towards “positive” results, meaning that the results posit an effect. Drawing from our example above, a positive result would claim that there is a difference between the new melanoma drug and the existing treatment.2 A negative result, that there is not a statistically significant difference between the new melanoma drug and the existing treatment, would simply not get published. And, while p-values help scientists quantify uncertainty, they are imperfect and will deliver a false positive result some of the time. Imagine that there is in fact no difference between the current leading treatment and the new melanoma drug. If 10 different research groups are independently comparing these treatments and the threshold for statistical significance is 0.05, the chance that at least one of them gets a statistically significant finding by chance alone is over 40 percent, and of course, that’s the group that gets the publication! This phenomenon has led to the so-called file drawer problem: erroneous, statistically significant results get published, but the contradicting evidence is unpublishable and never sees the light of day (although scientists have made some strides on this problem).
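The “over 40 percent” figure can be checked with a few lines of Python. This is a minimal sketch that assumes the research groups run independent studies and that there is truly no difference between the treatments, so each group has a 0.05 chance of a false positive; the function name is made up for this illustration.

```python
def chance_of_any_false_positive(num_groups: int, threshold: float = 0.05) -> float:
    """Probability that at least one of `num_groups` independent studies
    crosses the significance threshold when there is truly no effect."""
    # Each study avoids a false positive with probability (1 - threshold);
    # all of them avoid it with that probability raised to num_groups.
    chance_of_no_false_positives = (1 - threshold) ** num_groups
    return 1 - chance_of_no_false_positives

print(f"{chance_of_any_false_positive(10):.1%}")
```

With 10 groups and a threshold of 0.05, this works out to 1 − 0.95^10, or roughly 40 percent.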
The existence of a fixed benchmark that makes the difference between publishable and unpublishable work has also motivated scientists to be more creative with data analysis than would probably be preferred. To quote the economist Ronald Coase, “if you torture the data long enough, it will confess.” Scientists are faced with many, many decisions when analyzing their data, from which data to include, to which statistical tests to run, to how many statistical tests to run (the more tests, the more likely scientists find something by chance alone). The abuse of this flexibility to intentionally or unintentionally achieve statistical significance when it is not merited is known as “p-hacking.”
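The point about running many tests can be illustrated with a small simulation. This is a sketch under stated assumptions, not a model of any real study: every “outcome” is pure random noise drawn from the same distribution for both groups, each comparison uses a simple two-sided z-test that assumes the variance is known to be 1, and all names and parameter values are invented for the example.

```python
import math
import random

def two_sided_p_value(group_a, group_b):
    """Two-sided z-test p-value for a difference in means,
    assuming both groups have a known variance of 1."""
    n_a, n_b = len(group_a), len(group_b)
    diff = sum(group_a) / n_a - sum(group_b) / n_b
    z = diff / math.sqrt(1 / n_a + 1 / n_b)
    return math.erfc(abs(z) / math.sqrt(2))

def share_of_analysts_with_a_hit(num_tests, num_analysts=1000, group_size=20,
                                 threshold=0.05, seed=0):
    """Fraction of simulated analysts who find at least one 'significant'
    result when each runs `num_tests` comparisons on pure noise."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(num_analysts):
        for _ in range(num_tests):
            # Both groups come from the same distribution: no real effect exists.
            a = [rng.gauss(0, 1) for _ in range(group_size)]
            b = [rng.gauss(0, 1) for _ in range(group_size)]
            if two_sided_p_value(a, b) < threshold:
                hits += 1
                break  # one "finding" is enough to write up
    return hits / num_analysts

for num_tests in (1, 5, 20):
    share = share_of_analysts_with_a_hit(num_tests)
    print(f"{num_tests:>2} tests: {share:.0%} of analysts find something")
```

With a single test, only about 1 in 20 simulated analysts finds something. With 20 tests, most of them do, even though there is nothing to find.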
The file drawer problem and p-hacking are widely recognized to have had a problematic influence on certain scientific fields, such as psychology and biomedical research. Scientists and statisticians are still negotiating how these negative implications of statistical significance should be dealt with. Some, as we saw, are trying to eliminate statistical significance completely. Others seek various degrees of more modest reform, such as providing venues to publish negative results and employing more nuanced statistical education.
However, for all these problems, the debate also highlights one of the most positive features of the scientific community. Unlike other communities, science openly discusses and seeks to resolve its methodological difficulties rather than simply hiring a slicker PR firm.
1 Technically, the null hypothesis refers to the hypothesis to be nullified (i.e., rejected) and not necessarily the hypothesis that there is no difference. However, in practice, the null hypothesis is usually framed as the hypothesis of no difference.
2 More precisely, it would “reject the null.” That is, there is not no difference.
Acknowledgements: I’d like to credit the philosopher Deborah Mayo for her fantastic appellation “Statistics Wars” (and for her excellent work on the topic).