The p-value is simply a statistical probability. It is a measure of the value of experimental evidence for or against the null hypothesis of no effect in the real world. It is calculated on the assumption that the null hypothesis is true. Hypothesis testing is traditionally what we do in research to narrow or eliminate a possible explanation for an observation, whether health-related or environmental, using experimental data.
Calculating a p-value is a test of the null hypothesis that there is no effect of a proposed factor on a health problem in the real world. The p-values we obtain from analysis of our research data tell us how rare the results from our sample would be if the null hypothesis were true: the smaller the p-value, the less compatible our data are with the null of no effect. The p-value therefore only tells us whether we should stick with the null hypothesis or question it.
P-values are one of the most misused, abused, maligned, misunderstood, yet overly trusted statistics in biomedical research, and have generated much controversy and commentary for at least a century. Rather than assess the importance of the research ourselves, we have turned over the decision process to this one statistic. P-values have the power to influence our belief in therapies, make public health policy, drive business decisions, and a plethora of other matters, including what constitutes a successful research career.
The inappropriate focus on p-values is partially responsible for the substantial rise in reporting statistically significant results that cannot be replicated, particularly in biomedical literature. Leading experts have proposed lowering the p-value threshold (traditionally 0.05) in the hope that this improves reproducibility. This is not a novel idea, as several scientific fields have already taken this approach. More than a decade ago, researchers in human genetics adopted a p-value threshold of 5×10⁻⁸ (a probability of 0.00000005 or 1 in 20 million) for statistical significance in genome-wide association studies. This is appropriate when you consider the complexity of the human genome and how many variants are tested in genome-wide association studies, and the likelihood of hundreds of false positives. Over time, these studies have proven to be highly reproducible.
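To see where a threshold like this can come from, here is a small arithmetic sketch in Python. It assumes, purely for illustration, that about one million independent variants are tested and that a simple Bonferroni correction is applied; the variable names are my own, and real studies may justify the threshold differently.

```python
# Illustrative assumption: roughly one million independent tests per study.
family_wise_alpha = 0.05
n_tests = 1_000_000

# Bonferroni correction: share the overall error rate across all tests.
per_test_threshold = family_wise_alpha / n_tests

# Expected number of false positives if every null hypothesis were true.
expected_fp_uncorrected = family_wise_alpha * n_tests
expected_fp_corrected = per_test_threshold * n_tests

print(f"per-test threshold: {per_test_threshold:.0e}")
print(f"expected false positives at 0.05: {expected_fp_uncorrected:,.0f}")
print(f"expected false positives after correction: {expected_fp_corrected:.2f}")
```

Under these assumptions the per-test threshold works out to the familiar 5×10⁻⁸, and the uncorrected threshold would let through far more chance findings than any follow-up study could check.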
“Lowering the threshold…would work as a dam that could help gain time and prevent drowning by a flood of statistical significance while promoting better, more-durable solutions.” (JPA Ioannidis, JAMA 2018)
At the heart of the p-value debate is the simple question of what constitutes solid evidence. Leading journals have reported on the reproducibility crisis and how best to fix this problem so that today’s discoveries do not become tomorrow’s ‘fake news’. Considerable blame has been laid at the feet of the ‘publish or perish’ culture in academic institutions, but this is a multi-faceted problem that needs to be addressed. A debate in public health literature on the usefulness of p-values occurred in the mid-1980s, but the paradigm shift has been slow to take effect.
In 2015 the editors of the journal Basic and Applied Social Psychology announced they would not publish papers containing p-values because they had become a “crutch for scientists dealing with weak data,” and that it was too easy to pass the p<0.05 bar. Responses were mixed, some declaring ‘awesome’ while others felt it was ‘throwing away the baby…’ Admittedly this may seem rather extreme, but in the short term it may be better to stem the misuse of p-values than for most published research to go down in scientific infamy.
The p-value threshold of 0.05 goes back to the early 1900s when Ronald Aylmer Fisher designed agricultural experiments to take into account the natural variability of crop yields. The idea of significance testing was born out of assessing the influence of manure on crop yield. In his paper entitled ‘The arrangement of field experiments’ published in 1926, Fisher casually remarked that he “prefers to set the low standard of significance at the 5% point, and ignore entirely all results which fail to reach this level.” Fisher made few friends in the statistics community at the time, and since then scientists have grappled with the implications of Fisher’s logic, which was not based in mathematical theory.
Significance testing and p-values…“is an attempt to short-circuit the natural course of inductive inference…is surely the most bone-headed misguided procedure ever institutionalized in the rote training of science students” (William W. Rozeboom, 1997).
One of the most common misuses of p-values is the assumption that they represent the probability that the study hypothesis is true.
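A rough worked example may help show why this assumption fails. The Python sketch below assumes, purely for illustration, that studies have 80% power and use a 0.05 threshold, and it tries three made-up values for the share of tested hypotheses that are really true. None of these figures come from a real study, and the function name is my own.

```python
def share_of_significant_results_that_are_real(prior, power=0.80, alpha=0.05):
    """Fraction of 'significant' findings that reflect a real effect.

    prior: assumed fraction of tested hypotheses that are actually true.
    power: assumed chance that a study detects a real effect.
    alpha: significance threshold (false positive rate when the null is true).
    """
    true_positives = power * prior
    false_positives = alpha * (1 - prior)
    return true_positives / (true_positives + false_positives)


for prior in (0.5, 0.1, 0.01):
    share = share_of_significant_results_that_are_real(prior)
    print(f"prior {prior:.2f}: {share:.0%} of significant results are real effects")
```

Under these assumed inputs, when only one in ten tested hypotheses is true, a result with p<0.05 reflects a real effect about 64% of the time. The answer depends heavily on how plausible the hypothesis was to begin with, which the p-value alone knows nothing about.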
Strictly speaking, the p-value is the probability of obtaining data that are at least as extreme as what is observed in the study sample, if the null hypothesis (i.e. that there is no effect) were true. That’s a mouthful!! Here’s an attempt to try to deconstruct this.
Every research program should start with a research question or hypothesis.
Hypothesis testing is like having two little emoticons, one on each shoulder; on the left is Null (for the null hypothesis) who says ‘nope, there is no effect’. On the right is Alte (for the alternative hypothesis), who says ‘Oh but there is!!’
The question may be as follows: Can deodorants and antiperspirants increase your risk of breast cancer? (For those whose interest I’ve piqued, don’t throw away your deodorants!)
Null (H0) says: There is no difference in risk of breast cancer among those using deodorants/antiperspirants, compared to those who do not.
Alte (H1) says: There is a difference in risk of breast cancer among those using deodorants/antiperspirants, compared to those who do not.
In answering this question, the researcher will need to consider how best to design such a study, what data she needs, and how large a sample will adequately answer this question. This will involve sample size calculations for each study group based on assumptions about the effect size and the direction of effect (i.e. does deodorant use increase or decrease your risk), the likelihood of finding a true positive (the statistical power), and the likelihood of a false positive (the significance level). I’ve written on issues to do with sampling, random error and the play of chance in an earlier article.
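As a rough illustration of such a sample size calculation, here is a minimal Python sketch using a common normal-approximation formula for comparing two proportions. The baseline risks in the example (10% and 15%) are hypothetical placeholders, not estimates for deodorant use, and the function name is my own; a real study would use dedicated software and a statistician's input.

```python
from math import ceil
from statistics import NormalDist


def sample_size_per_group(p_control, p_exposed, alpha=0.05, power=0.80):
    """Approximate number of participants needed in each group.

    Uses the normal approximation for a two-sided comparison of two
    proportions, with equal group sizes.
    """
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # guards against false positives
    z_power = NormalDist().inv_cdf(power)          # chance of finding a true positive
    variance = p_control * (1 - p_control) + p_exposed * (1 - p_exposed)
    effect = p_exposed - p_control                 # assumed effect size
    return ceil((z_alpha + z_power) ** 2 * variance / effect ** 2)


# Hypothetical risks chosen only to show how the pieces interact.
print(sample_size_per_group(0.10, 0.15))
print(sample_size_per_group(0.10, 0.12))  # a smaller assumed effect needs more people
```

The design choice to note is that the required sample grows quickly as the assumed effect shrinks, which is why the effect size assumption deserves as much scrutiny as the threshold itself.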
P-values, which most statistical software will output along with effect estimates, basically tell us how probable it would be for a study sample to suggest a relationship between antiperspirant use and breast cancer at least as strong as the one we observed, when there really isn’t one in the real-world population.
Translated, p-values tell us how easily chance alone could have produced what we saw in the study sample. The smaller the p-value, the harder it is to explain the sample result as a fluke of sampling when nothing is going on in the real world.
If the p-value is small (traditionally less than 0.05), it tells us that the result we got from our sample would rarely happen by chance alone, and serves as evidence to reject the null.
If the p-value is large (traditionally more than 0.05), then there is a reasonable chance the effect observed in the study sample is a fluke, and we do not have enough evidence to reject the null. This is not the same as showing that the null is true.
It is worth emphasizing again: a p-value less than 0.05 only tells us there is a slim chance of seeing an effect at least as big as what you saw in your data, if there is no real effect. P-values are all about the null hypothesis, that little skeptic on our left shoulder. The calculation assumes that the null hypothesis is true.
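To make this concrete, here is a minimal Python sketch of what "at least as big as what you saw, if there is no real effect" means in practice. It uses a simple permutation approach: pretend the group labels are meaningless, reshuffle them many times, and count how often chance alone produces a difference as large as the observed one. The counts are made up for illustration (they are not from any real deodorant study), and the function name is my own.

```python
import random


def permutation_p_value(cases_a, n_a, cases_b, n_b, n_permutations=5000, seed=0):
    """Estimate a two-sided p-value for a difference in risk between two groups."""
    rng = random.Random(seed)
    total_cases = cases_a + cases_b
    # One entry per participant: 1 = developed the disease, 0 = did not.
    outcomes = [1] * total_cases + [0] * (n_a + n_b - total_cases)
    observed = abs(cases_a / n_a - cases_b / n_b)

    at_least_as_extreme = 0
    for _ in range(n_permutations):
        # Under the null, group membership is irrelevant, so reshuffle it.
        rng.shuffle(outcomes)
        sim_a = sum(outcomes[:n_a])
        sim_b = total_cases - sim_a
        simulated = abs(sim_a / n_a - sim_b / n_b)
        if simulated >= observed - 1e-12:  # small tolerance for float rounding
            at_least_as_extreme += 1
    return at_least_as_extreme / n_permutations


# Hypothetical study: 24 cases among 200 users, 12 cases among 200 non-users.
p = permutation_p_value(cases_a=24, n_a=200, cases_b=12, n_b=200)
print(f"estimated p-value: {p:.3f}")
```

Notice that nothing in the calculation asks whether the alternative hypothesis is true. Every reshuffle is carried out in a world where Null is right, which is exactly why the p-value can only speak for or against the null.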
Even if you set out to prove the alternative hypothesis, the p-value only gives you evidence for or against the null hypothesis. It does NOT allow you to accept the alternative hypothesis. That can only be done if you consider the real-world implications of your research, and assess all aspects of the study question. Accepting the alternative hypothesis essentially comes down to a causality question, which I have reviewed in a previous blog.
The jury remains out on the best approach to the problem of p-value misuse, and whether lowering the threshold is the answer. Suffice it to say, p-values have a place in science, but they have risen to a level of prominence that cannot be justified.
For those reading the scientific literature, p-values should not be used to make decisions or draw conclusions about the utility of the research, although a cursory regard for the p-value is in order. I would pay attention to the effect size and the confidence intervals, which are particularly important in reports of clinical trials and meta-analyses, and far more relevant to decisions about public health.
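For readers who want to see what an effect size and confidence interval add, here is a minimal Python sketch that computes a risk ratio with an approximate 95% confidence interval using the standard log-scale method. The two sets of counts are hypothetical and chosen so that the effect size is identical while the sample size differs; the function name is my own.

```python
from math import exp, log, sqrt
from statistics import NormalDist


def risk_ratio_with_ci(cases_exposed, n_exposed, cases_unexposed, n_unexposed, level=0.95):
    """Return (risk ratio, lower bound, upper bound) using a log-scale normal approximation."""
    rr = (cases_exposed / n_exposed) / (cases_unexposed / n_unexposed)
    se_log_rr = sqrt(
        1 / cases_exposed - 1 / n_exposed + 1 / cases_unexposed - 1 / n_unexposed
    )
    z = NormalDist().inv_cdf(1 - (1 - level) / 2)
    lower = exp(log(rr) - z * se_log_rr)
    upper = exp(log(rr) + z * se_log_rr)
    return rr, lower, upper


# Same hypothetical effect (risk ratio of 1.5) in a small and a large study.
small_study = risk_ratio_with_ci(30, 1_000, 20, 1_000)
large_study = risk_ratio_with_ci(3_000, 100_000, 2_000, 100_000)

for name, (rr, lower, upper) in (("small", small_study), ("large", large_study)):
    print(f"{name} study: RR = {rr:.2f}, 95% CI {lower:.2f} to {upper:.2f}")
```

Both made-up studies estimate the same risk ratio, but the small one has an interval wide enough to include no effect at all, while the large one pins the estimate down tightly. That is information a lone p-value hides.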
It is also important in reviewing a research paper to assess whether the authors clearly stated their aims, provided details of the design of the study and how large or representative the sample is in comparison to the stated conclusions. This would at least tell us most of what we need to know about the potential for real-world application, more so than the p-value.