The Problem With Statistical Significance
March 28, 2019 | In an article published late last week in the acclaimed journal Nature, three scientists argued for the “entire concept of statistical significance to be abandoned.” Garnering 250 concurring signatories in the first 24 hours after releasing a draft, the article now has over 800 signatures from academics, clinicians, statisticians, psychologists, and biologists from around the world in support of the call. This paper comes on the heels of 43 other articles on the same topic that appeared in a special issue of The American Statistician; these testimonies have been met with both controversy and support.
The authors of these reports argue that statistical significance, once used as a tool, is now used as a tyrant, dominating all other statistical techniques and heralded as the be-all and end-all of research. They explain that the reliance on statistical significance has biased the literature and led researchers to dichotomize other statistical measures as well. Importantly, one advocate articulated the call as a “surgical strike against thoughtless testing of statistical significance.” But what exactly is statistical significance?
Seen as a sort of statistical monopoly, the p-value is the number used to decide statistical significance: when it falls below a preset threshold (conventionally 0.05), researchers reject the null hypothesis that there is no relationship between two variables; otherwise they fail to reject it. For those who never made it to Statistics 101 in college, a study result that produces a p-value of p = 0.04, for example, tells us that if there truly were no relationship, random sampling alone would produce data at least this extreme only 4% of the time. It is not the probability that the finding itself is due to chance. At first glance, 4% seems pretty small; a number of other factors, however, affect how much a p-value can be trusted, such as insufficient power, unreliable measures, and biases (e.g., sampling bias) that threaten a study's validity.
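One way to see what a p-value measures is to simulate the null hypothesis directly. The sketch below is a simple permutation test, not a method taken from the Nature article: it shuffles the group labels many times and counts how often chance alone produces a difference in means at least as large as the one observed. The function name, the default number of shuffles, and the example measurements are all hypothetical.

```python
import random
from statistics import mean


def permutation_p_value(group_a, group_b, n_shuffles=10_000, seed=0):
    """Estimate a two-sided p-value for a difference in group means.

    The null hypothesis is that group labels are unrelated to the values,
    so any relabeling of the pooled data is equally plausible.
    """
    rng = random.Random(seed)  # fixed seed so the estimate is repeatable
    observed = abs(mean(group_a) - mean(group_b))
    pooled = list(group_a) + list(group_b)
    n_a = len(group_a)

    at_least_as_extreme = 0
    for _ in range(n_shuffles):
        rng.shuffle(pooled)  # randomly reassign the group labels
        diff = abs(mean(pooled[:n_a]) - mean(pooled[n_a:]))
        if diff >= observed:
            at_least_as_extreme += 1

    # Add-one correction: the estimate can never be exactly zero.
    return (at_least_as_extreme + 1) / (n_shuffles + 1)


if __name__ == "__main__":
    # Hypothetical measurements, for illustration only.
    control = [4.1, 5.0, 4.7, 5.3, 4.4, 4.9, 5.1, 4.6]
    treated = [5.2, 5.9, 5.4, 6.1, 4.8, 5.7, 5.5, 6.0]
    print(permutation_p_value(control, treated))
```

The returned number answers only one question: how often would shuffled, relationship-free data look this extreme? It says nothing about how large or how important the difference is.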
Indeed, not all statistically significant results have the same worth even if they produce the same number: a small p-value from a study of 15 subjects does not mean the same thing as the identical p-value from a study of 1,500 subjects, and a p-value from widely scattered data does not mean the same thing as the identical p-value from a dataset with a smaller standard deviation. For these reasons and more, researchers must consider confidence intervals, effect sizes, and other statistical analyses in their efforts to understand the implications of their data.
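The sample-size point can be made concrete with a little arithmetic. The sketch below assumes the simplest possible setting, a two-sided one-sample z-test with a known standard deviation, which is an illustrative simplification (a t-test would be the more usual choice with only 15 subjects). Under that assumption it works backward from a p-value and a sample size to the standardized effect that must have produced it. The function names are hypothetical.

```python
from math import sqrt
from statistics import NormalDist


def implied_effect_size(p_value, n):
    """Mean shift, in standard-deviation units, that gives `p_value`
    in a two-sided one-sample z-test with `n` subjects."""
    z = NormalDist().inv_cdf(1 - p_value / 2)  # z-score matching the p-value
    return z / sqrt(n)


def ci_half_width(n, confidence=0.95):
    """Half-width of the confidence interval for the mean, in SD units."""
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    return z / sqrt(n)


if __name__ == "__main__":
    for n in (15, 1500):
        d = implied_effect_size(0.04, n)
        hw = ci_half_width(n)
        print(f"n={n}: effect = {d:.3f} SD, 95% CI = +/- {hw:.3f} SD")
```

Under these assumptions, p = 0.04 with 15 subjects corresponds to an effect of about half a standard deviation, while the same p-value with 1,500 subjects corresponds to an effect ten times smaller, about 0.05 standard deviations. The p-value is identical; the findings are not.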
The most pressing issue with the p-value is that it is currently used as a divisive and dichotomous measure. Many scientists throw out promising study results due to weak p-values even if there are salient trends. Others overstate statistically significant results from underpowered studies, which are especially prone to false positives. Almost ironically, in the effort to find an agreed-upon truth, we have eliminated the uncertainty that inherently exists in nature. Why should a result be either statistically significant or irrelevant? Why is a single threshold enough to make or break a research study?
The world of science in academia is so cut-throat now that if a result does not reach statistical significance, then it is very difficult to publish that study in a journal; if scientists are not able to publish, then it is much more challenging to be awarded federal grants; and if an academic does not get grants, then it is nearly impossible to conduct independent science in an academic institution at all. In a system where statistical significance is perhaps the supreme factor in shaping an entire career, it is no wonder that playing around with data to reach a significance threshold is too enticing for some scientists to resist.
Many worry that these pressures have spurred years of questionable science, leading to the recent replication crisis. Reproducibility in science is important in order to confirm and validate previous findings, but a recent five-year study that sought to replicate 21 studies published between 2010 and 2015 was able to successfully duplicate only 13. Why can some of the most cited and esteemed papers in leading journals such as Nature and Science not be reproduced?
The problem is not that researchers regularly commit acts of fraud (though that does happen), but rather that they overestimate effect sizes, overlook false positives, and rely on statistical significance to convince colleagues of their findings. An impressive result that has not yet appeared in the literature makes a much better story and will attract more funding than a study that simply reiterates a previous finding. This is a major problem in science. Federal agencies, journals, and scientists alike should remember that validation studies are just as important as flashy studies. Journals should publish research even if it does not meet the statistical significance threshold, and agencies like the NSF and NIH should fund projects that seek to corroborate previous research.
Above all, individual researchers should remain deeply skeptical of their data. It should take extensive analyses and tests, not just statistical significance, to convince themselves, let alone the lay public, of a finding. This level of diligence and caution would improve the standard of science and its reputation around the world.
Stephanie Reeves is a Research Assistant at Schepens Eye Research Institute in Boston and a graduate of the neuroscience program at Connecticut College.
The views expressed in this piece do not necessarily reflect the views of other Arbitror contributors or of Arbitror itself.