• The p-value debate: lowering the threshold, or not

    April 22nd, 2018

    The p-value is simply a statistical probability. It is a measure of the strength of the experimental evidence against the null hypothesis of no effect in the real world, and it is calculated on the assumption that the null hypothesis is true. Hypothesis testing is traditionally what we do in research to narrow down or eliminate possible explanations for an observation, whether health-related or environmental, using experimental data.

    Calculating a p-value tests the null hypothesis that a proposed factor has no effect on a health problem in the real world. The p-values we obtain from analysis of our research data tell us how rare the results from our sample would be if the null hypothesis were true: the smaller the p-value, the less compatible our data are with the null of no effect. The p-value therefore only tells us whether we should stick with the null hypothesis or question it.

    P-values are one of the most misused, abused, maligned, misunderstood, yet overly trusted statistics in biomedical research, and have generated controversy and commentary for at least a century. Rather than assess the importance of the research ourselves, we have handed the decision process over to this one statistic. P-values have the power to influence our belief in therapies, shape public health policy, drive business decisions, and affect a plethora of other matters, including what constitutes a successful research career.

    The inappropriate focus on p-values is partially responsible for the substantial rise in reporting statistically significant results that cannot be replicated, particularly in biomedical literature. Leading experts have proposed lowering the p-value threshold (traditionally 0.05) in the hope that this improves reproducibility. This is not a novel idea, as several scientific fields have already taken this approach. More than a decade ago, researchers in human genetics adopted a p-value threshold of 5×10⁻⁸ (a probability of 0.00000005, or 1 in 20 million) for statistical significance in genome-wide association studies. This is appropriate when you consider the complexity of the human genome, how many variants are tested in genome-wide association studies, and the hundreds of false positives the traditional threshold would produce. Over time, these studies have proven to be highly reproducible.
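    The arithmetic behind that stricter threshold is easy to sketch. The Python snippet below uses a purely hypothetical figure of one million tested variants to show how many false positives chance alone would be expected to produce at each threshold.

```python
# Back-of-envelope: expected false positives when running many tests.
# The number of variants is hypothetical; real GWAS test counts vary by study.
n_tests = 1_000_000

# At the traditional threshold, chance alone produces a flood of 'hits'.
expected_at_005 = n_tests * 0.05
print(expected_at_005)       # 50000.0 expected false positives

# At the genome-wide threshold, chance hits become vanishingly rare.
expected_at_gwas = n_tests * 5e-8
print(expected_at_gwas)      # ~0.05 expected false positives
```

This is why a field testing a million hypotheses at once cannot sensibly use the same cut-off as a single pre-planned comparison.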

    “Lowering the threshold…would work as a dam that could help gain time and prevent drowning by a flood of statistical significance while promoting better, more-durable solutions.” (JPA Ioannidis, JAMA 2018)

    At the heart of the p-value debate is the simple question of what constitutes solid evidence. Leading journals have reported on the reproducibility crisis and how best to fix this problem so that today’s discoveries do not become tomorrow’s ‘fake news’. Considerable blame has been laid at the feet of the ‘publish or perish’ culture in academic institutions, but this is a multi-faceted problem that needs to be addressed on several fronts. A debate on the usefulness of p-values took place in the public health literature in the mid-1980s, but the paradigm shift has been slow to take effect.

    In 2015 the editors of the journal Basic and Applied Social Psychology announced they would no longer publish papers containing p-values, because they had become a “crutch for scientists dealing with weak data” and it was too easy to pass the p<0.05 bar. Responses were mixed, some declaring ‘awesome’ while others felt it was ‘throwing away the baby…’ Admittedly this may seem rather extreme, but in the short term it may be better to stem the misuse of p-values than to let most published research go down in scientific infamy.

    The p-value threshold of 0.05 goes back to the early 1900s, when Ronald Aylmer Fisher designed agricultural experiments to take into account the natural variability of crop yields. The idea of significance testing was born of assessing the influence of manure on crop yield. In his paper ‘The arrangement of field experiments’, published in 1926, Fisher casually remarked that he “prefers to set a low standard of significance at the 5% point, and ignore entirely all results which fail to reach this level.” Fisher made few friends in the statistics community at the time, and scientists have grappled ever since with the implications of Fisher’s logic, which was not grounded in mathematical theory.

    Significance testing and p-values…“is an attempt to short-circuit the natural course of inductive inference…is surely the most bone-headed misguided procedure ever institutionalized in the rote training of science students” (William W. Rozeboom, 1997).

    One of the most common misuses of p-values is the assumption that they represent the probability that the study hypothesis is true.

    Strictly speaking, the p-value is the probability of obtaining data at least as extreme as those observed in the study sample if the null hypothesis (i.e. that there is no effect) were true. That’s a mouthful!! Here’s an attempt to deconstruct it.
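    One way to make that definition concrete is a toy example with nothing to do with health: testing whether a coin is fair. The Python sketch below, with invented numbers, computes the exact probability of a result at least as extreme as the one observed, assuming the null of a fair coin is true.

```python
from math import comb

# 'At least as extreme as what we observed, if the null were true':
# Null hypothesis: the coin is fair (probability of heads = 0.5).
# Hypothetical observation: 60 heads in 100 flips.
n, observed = 100, 60

# One-sided p-value: the chance of 60 OR MORE heads from a fair coin,
# summing the exact binomial probabilities of every outcome as extreme
# as ours or more so.
p_value = sum(comb(n, k) for k in range(observed, n + 1)) / 2**n
print(round(p_value, 4))  # ~0.0284
```

A fair coin would give a result this lopsided only about 3% of the time, which is the sense in which a small p-value casts doubt on the null.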

    Every research program should start with a research question or hypothesis.

    Hypothesis testing is like having two little emoticons, one on each shoulder; on the left is Null (for the null hypothesis) who says ‘nope, there is no effect’. On the right is Alte (for the alternative hypothesis), who says ‘Oh but there is!!’ 

    The question may be as follows: Can deodorants and antiperspirants increase your risk of breast cancer? (For those whose interest I’ve piqued, don’t throw away your deodorants!)

    Null (H0) says: There is no difference in risk of breast cancer among people who use antiperspirants, compared to those who do not.

    Alte (H1) says: There is a difference in risk of breast cancer among those using deodorants/antiperspirants, compared to those who do not.

    In answering this question, the researcher will need to consider how best to design such a study, what data she needs, and how large a sample will adequately answer this question. This will involve sample size calculations for each study group, based on assumptions about the size and direction of the effect (i.e. does deodorant use increase or decrease your risk?), the likelihood of finding a true positive, and the likelihood of a false positive. I’ve written on issues to do with sampling, random error and the play of chance in an earlier article.
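    As a rough illustration of what such a sample size calculation looks like, here is a minimal Python sketch using a standard formula for comparing two proportions. Every number in it is an assumption chosen purely for illustration: a 10% baseline risk, a 15% risk in the exposed group, and the usual normal quantiles 1.96 and 0.84 for a 5% two-sided alpha and 80% power.

```python
from math import ceil

# Sample size per group for comparing two proportions (illustrative only).
p1, p2 = 0.10, 0.15      # assumed risk in unexposed vs exposed groups
z_alpha = 1.96           # two-sided alpha = 0.05
z_beta = 0.84            # power = 80% (beta = 0.20)

# n = (z_alpha + z_beta)^2 * [p1(1-p1) + p2(1-p2)] / (p1 - p2)^2,
# rounded up because you cannot recruit a fraction of a person.
n_per_group = ceil(
    (z_alpha + z_beta) ** 2 * (p1 * (1 - p1) + p2 * (1 - p2)) / (p1 - p2) ** 2
)
print(n_per_group)  # 683 participants per group
```

Notice how sensitive the answer is to the assumed effect size: halving the risk difference roughly quadruples the required sample.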

    P-values, which most statistical software will output along with effect estimates, basically tell us how probable our sample results would be if there were really no relationship between antiperspirant use and breast cancer in the real-world population.

    Translated, p-values gauge whether the study sample may have misled us: the smaller the p-value, the harder it is for chance alone to explain a sample that looks so different from a world in which there is no effect.

    If the p-value is small (traditionally less than 0.05), it tells us that the result we got from our sample would rarely happen by chance alone if the null were true, and this serves as evidence to reject the null.

    If the p-value is large (traditionally more than 0.05), then there is a reasonable chance the effect observed in the study sample is a fluke, and we fail to reject the null. Strictly speaking we do not ‘accept’ the null; we simply lack the evidence to question it.
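    To see these two outcomes in action, here is a sketch of a two-proportion z-test for the deodorant question, written in Python with entirely invented counts. It illustrates the mechanics only and is not a real analysis.

```python
from math import sqrt, erfc

# Hypothetical counts, invented solely for illustration.
cases_exposed, n_exposed = 40, 1000        # users of antiperspirant
cases_unexposed, n_unexposed = 30, 1000    # non-users

p1 = cases_exposed / n_exposed             # 4.0% risk in users
p2 = cases_unexposed / n_unexposed         # 3.0% risk in non-users
p_pool = (cases_exposed + cases_unexposed) / (n_exposed + n_unexposed)

# Standard error of the risk difference under the null of no effect.
se = sqrt(p_pool * (1 - p_pool) * (1 / n_exposed + 1 / n_unexposed))
z = (p1 - p2) / se

# Two-sided p-value from the standard normal tail.
p_value = erfc(abs(z) / sqrt(2))
print(round(p_value, 2))  # ~0.22: no grounds to reject the null here
```

With these made-up numbers the p-value is well above 0.05, so a 1% difference in risk of this size, in samples this small, is entirely compatible with chance.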

    It is worth emphasizing again: a p-value less than 0.05 only tells us there is a slim chance of seeing an effect at least as big as the one in your data if there is no real effect. P-values are all about the null hypothesis, that little skeptic on our left shoulder, and every calculation assumes that the null hypothesis is true.

    Even if you set out to prove the alternative hypothesis, the p-value only gives you evidence for or against the null hypothesis. It does NOT allow you to accept the alternative hypothesis. That can only be done if you consider the real-world implications of your research, and assess all aspects of the study question. Accepting the alternative hypothesis essentially comes down to a causality question, which I have reviewed in a previous blog.

    The jury remains out on the best approach to the problem of p-value misuse, and whether lowering the threshold is the answer. Suffice it to say, p-values have a place in science, but they have risen to a level of prominence that cannot be justified.

    For those reading the scientific literature, p-values should not be used to make decisions or conclusions about the utility of the research, although a cursory regard for the p-value is in order. I would pay attention to the effect size and the confidence intervals, which are particularly important in reports of clinical trials and meta-analyses, and far more relevant to decisions about public health.

    In reviewing a research paper, it is also important to assess whether the authors clearly stated their aims, provided details of the study design, and used a sample large and representative enough to support the stated conclusions. This would tell us most of what we need to know about the potential for real-world application, more so than the p-value.

    At SugarApple Communications we can help you find the best way to analyse and interpret your important data, and communicate it to your intended audience. Get in touch today and let’s talk.

  • Let the data speak truthfully: chance findings

    April 28th, 2017

    The dreaded topic of statistics is one that has both confounded and fascinated me as a scientist who enjoyed research and discovery but tolerated the ‘number-crunching’ part of it as a necessary evil. I have had to come to terms with a subject I avoided in my high-school days. In fact I still remember absolutely loathing it, but as with so many of life’s ironies, it came back to haunt me: in scientific research, if you are ‘fair dinkum’ about your work, you need to get acquainted with different statistical tests, what they tell us, and how to translate them into something meaningful.

    A prominent statistics professor of mine during my PhD candidacy, both feared and respected by students and faculty for his candour, bluntness and militant adherence to scientific rigor and discipline in health sciences research, is still the example I draw upon when considering the analysis output of any project I’m preparing for publication. I will not bore you with statistics-speak in this article, but will try to put into ordinary everyday language what the main statistics mean when we write articles for any audience, whether our scientific peers or the general public.

    In medical research, unless you are able to identify every single individual with the condition that you are studying, everywhere in the world, get their consent to join your study, and get every piece of information you need, including information you don’t know you need but suspect you might – then your research is essentially sample-based.

    As a researcher, after you have decided what the medical condition is that you wish to study, and what new knowledge is needed, you then need to decide who you will study. You may have a wide range of subsets of the population with the medical condition that you can draw from, and your choice will depend on the research question.

    A major compromise of sample-based research is that the individuals you choose to study (collectively, your study sample) could have characteristics widely different from other samples studied by other researchers, and could therefore generate different results. This variability, the differences in measurements across different samples of the same general population, can be the start of a problem known as ‘sampling error’. The key question then remains: how do I select a study sample that minimizes sampling error?

    When a researcher chooses a sample to study, she can use a number of schemes to select them. She can impose any number of restrictions according to gender, ethnicity, age, geographical location etc. But it is important to realize the only purpose in studying a sample is to represent the larger population we are interested in. So we must decide at the start of the research what relationships we want to identify between patient characteristics and the medical condition we’re studying. If our sample is distorted in any way and not representative of the larger population, then our results will likewise be distorted.

    We must therefore choose a sample that provides the clearest view of the population we want to study. Random sampling tends to be the least biased selection process. This means applying a scheme that gives each eligible individual the same chance of being selected for the study. However, it is not fool-proof, and even the best random sampling scheme can generate a ‘bad hand’, meaning what we see in the sample is not reflective of the general population. How do we decide whether we have been dealt a ‘bad hand’?
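    A small simulation makes the ‘bad hand’ idea tangible. The Python sketch below draws several random samples from one hypothetical population and shows that each sample yields a somewhat different estimate of the same underlying truth.

```python
import random

# Sampling variability in miniature: repeated random samples from the same
# population scatter around the true value.
random.seed(42)  # fixed seed so the sketch is reproducible

# Hypothetical population: 10,000 people, 20% of whom have the condition.
population = [1] * 2000 + [0] * 8000
true_prevalence = sum(population) / len(population)  # 0.2

# Draw five random samples of 100 people each and estimate the prevalence.
sample_prevalences = [
    sum(random.sample(population, 100)) / 100 for _ in range(5)
]
print(true_prevalence, sample_prevalences)
# Each sample gives a slightly different estimate; occasionally one lands
# far enough from 0.2 to be the 'bad hand' described above.
```

No sample estimate is exactly the truth, and that gap between sample and population is precisely what the statistics of the next paragraphs try to quantify.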

    The p-value helps answer this question. It is a measure that we see in almost every research effort. Put simply, the p-value gives us the probability that chance alone could hand us a sample as unusual as ours. In statistics-speak it is closely related to the alpha level, the pre-set probability of a type I error, against which it is compared. The p-value is the probability that a relationship at least as strong as the one we identified between certain characteristics of the sample population and the medical condition we are studying would appear purely through the play of chance; in other words, that our study sample misled us.

    The threshold that most studies use to decide that a finding is statistically significant is 5%, i.e. p<0.05 is considered significant. Most analysis approaches will automatically generate this statistic along with other measures of the relationship, which we will deal with in another article. The 0.05 cut-off is quite arbitrary and more of a tradition that goes back to the days of Sir Ronald Fisher (1890–1962). However, a researcher can and should set this threshold independently during the design stage, taking into consideration the number of statistical tests she intends to carry out and what she plans to do with her research findings, i.e. apply them to medical practice or use them as a clue to search further in a larger sample to confirm these initial findings.
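    For readers wondering how one might set a stricter threshold in advance when many tests are planned, one simple and conservative option is the Bonferroni correction: divide the overall alpha across all the planned tests. A minimal Python sketch, assuming Bonferroni is the chosen method (other corrections exist):

```python
def bonferroni_threshold(alpha: float, n_tests: int) -> float:
    """Split the overall type I error rate evenly across all planned tests."""
    return alpha / n_tests

# With 20 planned tests, each individual test must clear ~0.0025.
print(bonferroni_threshold(0.05, 20))

# With a million tests, the per-test bar is ~5e-8, which is how the
# genome-wide significance threshold mentioned earlier can be motivated.
print(bonferroni_threshold(0.05, 1_000_000))
```

The more questions you ask of the same data, the stricter each individual answer must be judged.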

    P-values are only one element of what the sample tells us, and by no means the most important. They simply give us a gauge of how well our study sample reflects the population. Their interpretation also depends on whether we followed our study protocol as outlined at the start of the study.

    Overall, we must be careful to interpret the estimates our analysis gives us in light of what we set out to look for, assuming we did not decide mid-way through the study to shift course because of how interesting the data looked. In the latter case, our p-value would in actual fact be uninterpretable, riddled with random error, and essentially meaningless. This is where study rigour and discipline are paramount to the validity of the research findings.

    We will deal with the other aspects of data analysis that are relevant to research in upcoming articles.

Unfog the science…ensure quality, clarity and accuracy.