
Dispatches from the statistics wars

Posted on August 24, 2019 by Evan Arnet

[Image: Four soldiers side by side in front of a low-flying helicopter.] The statistics wars are in full swing. Photo by Somchai Kongkamsri, licensed by Pexels.

We recently took a guided tour of statistical significance, in which we focused on how the media often fails to correctly interpret statistical information. But journalists are not the only group tripped up by statistics. The scientific community itself has been engaged in deep debate about the proper use of statistical methodology. These debates have been so intense that some have referred to them as the “Statistics Wars.” The latest salvo was a statement in Nature, the world’s premier scientific journal, signed by over 800 scientists and statisticians calling for the end of statistical significance.

Just what is it about statistical significance that gets scientists so worked up?

You can find a more detailed explanation via the link in the first sentence of this post, but the key thing to understand about statistical significance is that it is a benchmark that gives scientists some insight into the uncertainty of their study. To evaluate statistical significance, scientists first specify a threshold, often 0.05. Then they compute a “p-value.” A p-value is the probability of getting data at least as extreme as the observed data, assuming there is actually no effect. If the p-value is less than the threshold, the study is considered statistically significant.
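To make this concrete, here is a minimal sketch in Python of what a p-value represents. The numbers (an observed difference of 1.2, 30 measurements per group, a standard deviation of 2) are hypothetical, not from any real study; the idea is simply to simulate a world where the null hypothesis is true and count how often data at least as extreme as the observed data turns up.

```python
import numpy as np

rng = np.random.default_rng(0)

observed_diff = 1.2   # hypothetical difference in group means we measured
n_per_group = 30      # hypothetical sample size in each group
sigma = 2.0           # hypothetical standard deviation of the measurements
alpha = 0.05          # significance threshold chosen in advance

# Simulate many experiments in which the null hypothesis is true:
# both groups are drawn from the same distribution (no real effect).
n_sims = 100_000
group_a = rng.normal(0.0, sigma, (n_sims, n_per_group))
group_b = rng.normal(0.0, sigma, (n_sims, n_per_group))
null_diffs = group_a.mean(axis=1) - group_b.mean(axis=1)

# Two-sided p-value: how often does a difference at least as extreme
# as the one we observed arise when there is truly no effect?
p_value = np.mean(np.abs(null_diffs) >= abs(observed_diff))
print(f"p-value = {p_value:.3f}; significant at {alpha}: {p_value < alpha}")
```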

The use of statistical significance and p-values is associated with a particular approach to science called “hypothesis testing.” In hypothesis testing, the role of scientific investigation is to decide between competing hypotheses. For example, let’s say you’re testing whether a new cancer drug for melanoma is better than the current leading treatment option. A hypothesis testing approach decides between two options: the null hypothesis, that the drugs work the same (i.e., there is no difference between the drugs);1 and the alternative hypothesis, that there is a difference between the drugs. This differs from other approaches that may care more about how much of a difference there is between the two drugs, rather than merely whether there is a difference.
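To show the two competing hypotheses in code, here is a sketch of a standard two-sample t-test on made-up data. The outcome numbers and sample sizes are hypothetical, not from any real melanoma trial; the point is the structure: a null hypothesis of no difference, an alternative hypothesis of some difference, and a p-value used to decide between them.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)

# Hypothetical outcome measurements (say, months of progression-free
# survival) for patients on the current treatment vs. the new drug.
current_drug = rng.normal(11.0, 3.0, size=40)
new_drug = rng.normal(12.0, 3.0, size=40)

# Null hypothesis: the two drugs work the same (equal means).
# Alternative hypothesis: there is a difference between the drugs.
result = stats.ttest_ind(new_drug, current_drug)

print(f"t = {result.statistic:.2f}, p = {result.pvalue:.3f}")
if result.pvalue < 0.05:
    print("Reject the null: the data suggest a difference between the drugs.")
else:
    print("Fail to reject the null: no statistically significant difference.")
```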

Support for a hypothesis is always provisional. One can never be 100% certain of the correct answer; quantifying that uncertainty is what statistics helps with. But nothing can tell a scientist how much uncertainty they should be comfortable with, which is precisely the allure of statistical significance. For decades, statistical significance provided a decision rule for hypothesis testing in a number of social science fields. If the p-value was below the specified level, usually 0.05 (a convention tracing back to a tentative statement by the famous statistician Ronald Fisher decades ago), then the results were statistically significant and the null hypothesis was rejected. Along with this decision rule came an implicit publishing rule: statistically significant results merited publication, whereas statistically non-significant results were a waste of everybody’s time and should be stuffed in a file drawer.

[Image: A giant skeleton labeled “The need to get p < 0.05” bears down on a small boy labeled “reasonable, conservative data analysis.”] Original painting: “Night Walking” by Boris Groh.

There were simmering concerns about the pervasive and incautious use of statistical significance for decades, but shit hit the fan with the so-called “replication crisis,” an ongoing methodological crisis in science concerning the inability to reproduce the results of noted studies when the experiments are repeated. In 2012, scientists at the biotech company Amgen reported that they could reproduce the results of only 6 out of 53 landmark cancer studies when they repeated these studies in their lab; and in 2015, a group of psychologists failed to successfully reproduce 70 out of 100 repeated social psychology studies. Quickly, statistical significance was identified as one of the main villains, not only because it is confusing and often misleading to both scientific and general audiences, but also because it contributes to very real methodological problems within science.

For one, alongside statistical significance came a preference for “positive” results, meaning results that posit an effect. Drawing from our example above, a positive result would claim that there is a difference between the new melanoma drug and the existing treatment.2 A negative result, that there is not a statistically significant difference between the new melanoma drug and the existing treatment, would simply not get published. And while p-values help scientists quantify uncertainty, they are imperfect and will deliver a false positive result some of the time. Imagine that there is in fact no difference between the current leading treatment and the new melanoma drug. If 10 different research groups are comparing these treatments and the threshold for statistical significance is 0.05, the chance that at least one of them gets a statistically significant finding by chance alone is over 40 percent, and of course, that’s the group that gets the publication! This phenomenon has led to the so-called file drawer problem: erroneous, statistically significant results get published, but the contradicting evidence is unpublishable and never sees the light of day (although scientists have made some strides with this problem).
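The 40 percent figure follows from simple probability. Under the simplifying assumption that the ten studies are independent and each has a 5 percent false-positive rate, the chance that at least one crosses the threshold is 1 - (1 - 0.05)^10:

```python
# Quick check of the "over 40 percent" figure, assuming the 10 studies
# are independent and there is truly no difference between the drugs.
alpha = 0.05      # threshold for statistical significance
n_groups = 10     # hypothetical number of research groups

# Chance that at least one group gets a false positive:
p_at_least_one = 1 - (1 - alpha) ** n_groups
print(f"{p_at_least_one:.3f}")  # 0.401, i.e. just over 40 percent
```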

The existence of a fixed benchmark that makes the difference between publishable and unpublishable work has also motivated scientists to be more creative with data analysis than would probably be preferred. To quote the economist Ronald Coase, “if you torture the data long enough, it will confess.” Scientists face many, many decisions when analyzing their data, from which data to include, to which statistical tests to run, to how many statistical tests to run (the more tests, the more likely they are to find something by chance alone). The abuse of this flexibility to intentionally or unintentionally achieve statistical significance when it is not merited is known as “p-hacking.”
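To see how that flexibility inflates false positives, here is a rough sketch of one flavor of p-hacking: testing many outcome measures and reporting whichever one happens to clear the 0.05 bar. The data are simulated with no real effect anywhere, and the number of outcomes and patients are hypothetical.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)

n_outcomes = 20       # hypothetical number of outcome measures examined
n_per_group = 30      # hypothetical patients per group
n_experiments = 1000  # how many whole "studies" to simulate

# None of the outcomes truly differs between the groups, yet scanning
# many outcomes often turns up at least one "significant" p-value.
studies_with_a_hit = 0
for _ in range(n_experiments):
    p_values = [
        stats.ttest_ind(rng.normal(size=n_per_group),
                        rng.normal(size=n_per_group)).pvalue
        for _ in range(n_outcomes)
    ]
    if min(p_values) < 0.05:
        studies_with_a_hit += 1

print(f"Studies reporting a 'significant' result: "
      f"{studies_with_a_hit / n_experiments:.0%}")  # roughly 2 in 3
```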

The file drawer problem and p-hacking are widely recognized to have had a problematic influence on certain scientific fields, such as psychology and biomedical research. Scientists and statisticians are still negotiating how these negative implications of statistical significance should be dealt with. Some, as we saw, are trying to eliminate statistical significance completely. Others seek various degrees of more modest reform, such as providing venues to publish negative results and employing more nuanced statistical education. 

However, for all these problems, the debate also highlights one of the most positive features of the scientific community. Unlike other communities, science openly discusses and seeks to resolve its methodological difficulties rather than simply hiring a slicker PR firm.

1 Technically, the null hypothesis refers to the hypothesis to be nullified (i.e., rejected) and not necessarily the hypothesis that there is no difference. However, in practice, the null hypothesis is usually framed as the hypothesis of no difference.
2 More precisely, it would “reject the null.” That is, there is not no difference.

Acknowledgements: I’d like to credit the philosopher Deborah Mayo for her fantastic appellation “Statistics Wars” (and for her excellent work on the topic).

Edited by Riddhi Sood and Taylor Nicholas


