Replication Crisis: Using P(Double False Positive) and P(Failed Replication) to Calculate Risk of Confusion
Some have claimed that using Bayes’ theorem instead of null hypothesis testing can help solve the replication crisis. The bad news is that simply shifting from null hypothesis testing to Bayesian statistical inference does NOT solve the replication problem.
The good news is that under both null hypothesis significance testing (NHST) and Bayesian inference, it is possible to adopt strategies that minimize the double false positive problem (false replication error) and maximize the likelihood of correct replication, through the data analysis decisions made for a single study and for a pair of studies.
I’ll publish the findings of the Bayesian inference framework in a later blog post.
The replication crisis exists due to many problems, such as low power (small sample sizes), analysis-to-result (data dredging), and publication bias (publication of only significant results). These factors can seriously mislead a line of research. For example, let’s say ten studies are conducted, 8 of which have Power >0.8, and two of which have Power <0.4. None of the 8 find a significant result, and they are never published. The two studies that ARE published are in the literature, and I think they are telling me something important. The studies neither computed nor reported an a priori power, so I have no idea what their power was; once a positive (significant) result is found, post-hoc power analysis is no help. So I decide to try to replicate the studies. I power my own study sufficiently, and the results fail to replicate. I wasted my time, and I wasted resource dollars.
A Bayesian analysis is misled by the missing prior information from the unpublished studies, and so a simple shift to it will not help. In fact, 100% of the published studies tell me that the null hypothesis is not true, and so I might set my prior probability to 0.9, or 0.95, when in reality, 80% of the evidence (rawly considered; I have a Corroboration Index that is less naive) tells me the null hypothesis cannot be rejected, and my prior might be closer to 0.2. So clearly, simply jumping into Bayesian analysis cannot resolve the issue with replication. Some would say it could make it worse, because individuals bring their subjective prior probabilities to the game; others say that’s a benefit because at least the priors are made explicit.
It can be difficult in either Neyman-Pearson/Fisherian NHST or in Bayesian analysis to escape the effects of publication bias. But let’s assume your field of inquiry is improving; journals are starting to require a priori power analyses and to publish negative results, and you have decided to proceed in a new direction, with a new question. What is the optimal way to proceed with a new question that allows you both to (1) minimize the probability of replication error (the double false positive problem) AND (2) maximize the probability of correct (eventual) replication? Many people think these are the same question. They are not, and that is part of the problem and why I spend my time #UnbreakingScience.
NHST Itself is Not to Blame for the Double False Positive Problem
One thing that could be done rather easily for any published study is to take a reasonable effect size and the fixed sample size of the study, N = N1+N2 (let’s use the t-test as an example, and a reasonable Cohen’s d for the purpose of discussion), and calculate not only the Power at a fixed alpha (traditionally α = 0.05) but also the alpha at which the null hypothesis has an 80% probability of being rejected if the alternative hypothesis is true (i.e., if the assumed value of Cohen’s d for the study is correct). This would be reported as follows:
“At our sample size N, the t-test would have 80% power at α = p; that is, a true null hypothesis had a p × 100% chance of being rejected at the assumed effect size (Cohen’s d = 0.05).”
Cohen’s d would vary based on available data.
The benefit of this explicit reporting is that the study “owns” the actual probability of rejecting the null, as limited by the study’s sample size, so hypotheses are less likely to persist due to the selective publication of only significant results.
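To make the idea of "the alpha at which the test has 80% power" concrete, here is a minimal Python sketch. It is only an illustration under stated assumptions: it uses a normal approximation to the two-sided, two-sample t-test rather than the exact noncentral t distribution, and the function names `approx_power` and `alpha_for_power` are hypothetical, not taken from any package. The effect size and per-group sample size are left as inputs so you can supply your own values.

```python
from math import sqrt
from statistics import NormalDist

_STD_NORMAL = NormalDist()  # standard normal, mean 0 and sd 1


def approx_power(d, n_per_group, alpha):
    """Approximate power of a two-sided, two-sample test.

    Assumption: normal approximation to the t-test with equal group sizes.
    """
    delta = d * sqrt(n_per_group / 2)            # expected test statistic under H1
    z_crit = _STD_NORMAL.inv_cdf(1 - alpha / 2)  # two-sided critical value
    return _STD_NORMAL.cdf(delta - z_crit) + _STD_NORMAL.cdf(-delta - z_crit)


def alpha_for_power(d, n_per_group, target_power=0.80, tol=1e-9):
    """Bisection search for the alpha at which power reaches target_power."""
    lo, hi = 1e-12, 1.0 - 1e-12
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if approx_power(d, n_per_group, mid) < target_power:
            lo = mid   # power too low: a more permissive alpha is needed
        else:
            hi = mid   # power already sufficient: try a stricter alpha
    return (lo + hi) / 2


# Illustrative inputs only (not the values behind the tables in this post).
alpha_80 = alpha_for_power(d=0.5, n_per_group=64)
print(f"alpha giving 80% power: {alpha_80:.4f}")
```

Because power rises steadily as alpha is relaxed, a simple bisection is enough here; an exact calculation would swap the normal approximation for the noncentral t distribution.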
Here’s a table that might help illustrate (again, for the t-test, Cohen’s d=0.05):
|P(Type 1 Error)||Power||P(Type 2)||Sample Size per group|
Let’s simplify and re-arrange the table for the purpose of making the inference about the Type 1 error a function of the sample size:
|Sample Size per group||P(Type 1)|
Assuming Power > 0.8, and given N, the resulting probability of a Type 1 error is then known (for a given study, given the a priori effect size estimate).
The probability of a double false positive, given the P(Type 1 Error) implied by the sample size of an existing study and the same parameters for a validation study, is the product of the two P(Type 1 Error) values, or
P(Double False Positive) = P(Type 1 Error of Study 1) × P(Type 1 Error of Study 2)
This P(DFP) can be mapped in the planning phase of a study to determine the sample size needed to keep P(DFP) below 0.05. For example, suppose a study was published with a per-group sample size of N1=N2 = 102, so that P(Type 1) = 0.1. If we then explore the use of any of the sample sizes in the table, we immediately exonerate NHST per se as the cause of the replication problem for every initial study over N=204, because all P(DFP) < 0.02!
|P(Type 1 S1)||P(Type 1 S2)||P(DFP)|
In fact, even at N1=N2=86, we would need a very high probability of a Type 1 error in the first study P(Type 1 Error S1) of >0.75 to risk a P(DFP) > 0.05.
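The product rule for P(DFP) is easy to explore in code. Here is a minimal Python sketch; the function names and the `dfp_ceiling` parameter are illustrative choices, and the Study 2 error rates in the usage lines are placeholders rather than entries from the table. Only the 0.1 for Study 1 comes from the example above.

```python
def p_double_false_positive(type1_s1, type1_s2):
    """P(DFP): product of the two studies' Type 1 error probabilities."""
    return type1_s1 * type1_s2


def max_type1_s1(type1_s2, dfp_ceiling=0.05):
    """Largest Study 1 Type 1 error rate that keeps P(DFP) at or below the ceiling."""
    return min(1.0, dfp_ceiling / type1_s2)


# Study 1 at P(Type 1) = 0.1, paired with hypothetical Study 2 error rates.
for type1_s2 in (0.2, 0.1, 0.05):
    print(type1_s2, p_double_false_positive(0.1, type1_s2))
```

The second function turns the question around: given the validation study's error rate, it reports how bad the first study's error rate would have to be before P(DFP) crosses the chosen ceiling.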
Thus, we can conclude that NHST itself is not the problem. Something else, and many something elses, are driving the P(Type 1 Error S1) higher. Many studies are published with very small samples, and I think this is because statistics professors teach students that as long as N>30, the t-test is “robust”. This is simply not true. Even at N=46 (per group), P(Type 1 Error) = 0.4, or 40% for the t-test with Cohen’s d=0.05. At N=30, the t-test has P(Type 1 Error) = 0.4 with Power=0.7.
We need to change the way Stats 101 is taught.
John Ioannidis and others found that many things other than NHST per se are amplifying the Type 1 error rate. In their analyses, they cite publication biases, favored hypotheses, over-analyzing the data repeatedly to achieve significance, and other factors that all conspire to drive P(Type 1 Error S1) higher than it should be.
But there are other types of replication errors, too. Consider that when studies are formally powered, they are typically powered so Power >0.80. That means that as many as 20% of the studies that are conducted will suffer a Type II error. Assuming Cohen’s d=0.05 is correct, we then factor in the Probability of False Failed Replication. It might appear that
P(FFR) = (1-Power of Study 1)*(1-Power of Study 2)
But this is incorrect. In reality,
P(FFR) = max(1-Power of Study 1, 1-Power of Study 2)
because only one of the studies has to fail for the replication opportunity to be missed.
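To see how the two candidate formulas differ, here is a small Python sketch. The function names are illustrative. The third function is an addition for comparison only: under the extra assumption that the two studies are statistically independent, the chance that at least one of them misses a true effect is 1 − Power1 × Power2, which is never smaller than the max() value used in this post.

```python
def p_both_miss(power_s1, power_s2):
    """Product form: both studies commit a Type II error."""
    return (1 - power_s1) * (1 - power_s2)


def p_ffr(power_s1, power_s2):
    """P(FFR) as defined in this post: the larger of the two Type II error rates."""
    return max(1 - power_s1, 1 - power_s2)


def p_any_miss_independent(power_s1, power_s2):
    """Comparison only: at least one study misses, assuming independent studies."""
    return 1 - power_s1 * power_s2


# Illustrative pair of studies with Power 0.9 and 0.8.
print(p_both_miss(0.9, 0.8), p_ffr(0.9, 0.8), p_any_miss_independent(0.9, 0.8))
```

For any pair of powers between 0 and 1, the product form gives the smallest number and the independence form the largest, with the max() definition in between.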
P(FFR), too, can be calculated during the planning phases of a study and of a replication study. How to consider the risks of P(FFR) and P(DFP) simultaneously is tricky, because in the ranges of sampling effort in which studies are conducted, P(DFP) << P(FFR). So we must, as a society, determine not which risk we prefer, but how much we value minimizing P(DFP) relative to P(FFR). Other than their shared cause of sample size, they are independent (one does not determine the other). And so to minimize both would be a good goal; however, we have to determine which risk harms us more.
I would argue that P(DFP) harms society more than P(FFR) (per event), even though DFP events are rare relative to FFR events. The effect of two or more false positive findings in the literature is lasting because it tends to create a paradigm (in Kuhnian terms) that may be resistant to falsification, whereas an FFR will eventually be overturned if the initial result is re-tested in multiple replication attempts. Also, even in the absence of publication bias, scientists, the media, and the public are more impressed by “significant” results; two falsely significant results will give an idea durability. Because DFPs are rare, we also do not expect them; thus our (at least historical) enhanced confidence when we see replication.
For this exploration I will assume (somewhat arbitrarily) that every DFP event costs society 10,000 times more units of cost (whatever units you care to imagine) than one FFR event. What can we say about minimizing overall risk with this assumption?
Pairing studies with all combinations of Power from the previous tables at all levels of α (0.9, 0.9, 0.05; 0.9, 0.8, 0.05… etc), and assuming each DFP is 10,000 times more important, we can rank the paired study designs (Study 1 and Study 2) by the summed weighted risk
SWR = P(DFP) × w_DFP + P(FFR) × w_FFR
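For readers who would rather script the ranking than use a spreadsheet, here is a minimal Python sketch of the same idea. The design grid below is made up for illustration (it is not the grid in the Excel file), each design is a hypothetical (power, alpha) pair, and the function names are illustrative. P(DFP) and P(FFR) follow the definitions given earlier in this post, with the 10,000:1 weighting as the default.

```python
from itertools import product

# Hypothetical (power, alpha) designs; placeholders, not the values in the Excel file.
DESIGNS = [(0.9, 0.01), (0.9, 0.025), (0.8, 0.05), (0.7, 0.10)]


def summed_weighted_risk(study1, study2, w_dfp=10_000, w_ffr=1):
    """SWR = P(DFP) * w_dfp + P(FFR) * w_ffr for one pair of study designs."""
    (power1, alpha1), (power2, alpha2) = study1, study2
    p_dfp = alpha1 * alpha2              # product of Type 1 error rates
    p_ffr = max(1 - power1, 1 - power2)  # larger of the two Type II error rates
    return p_dfp * w_dfp + p_ffr * w_ffr


def rank_pairs(designs, **weights):
    """All (Study 1, Study 2) pairs, sorted from lowest to highest SWR."""
    pairs = product(designs, repeat=2)
    return sorted(pairs, key=lambda pair: summed_weighted_risk(*pair, **weights))


# Show the three lowest-risk pairs under the default 10,000:1 weighting.
for s1, s2 in rank_pairs(DESIGNS)[:3]:
    print(s1, s2, round(summed_weighted_risk(s1, s2), 3))
```

Passing different weights, for example `rank_pairs(DESIGNS, w_dfp=100)`, shows how sensitive the ordering is to the relative cost assigned to the two kinds of error.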
Assuming the first study set α = 0.05, here are the results in Excel for you to play with; you can change the weights, for example.
Here are the results graphically, showing (not surprisingly) that the largest combinations of samples across the pairs of studies explored rank lowest in summed risk:
On the left in the graph are, not surprisingly, the best-powered pairs of studies of 0.9, 0.9; 0.9, 0.8; 0.9, 0.7 at the larger sample sizes. Note there is no fixed linear relationship between sample size across all of these study conditions. In the top 25 study combinations, the minimum sample size (both groups, both studies) was 342, but importantly, the median sample size (both groups, both studies) was 480.
We can learn from this some rules of thumb:
(1) Validation studies that give Power > 0.7 at α < 0.025 have a lower downstream cost to society by minimizing error propagation.
(2) There is no intermediate; any study or pair of studies in which one or both have Power < 0.7 and α > 0.025 will be likely to produce confusing results alone, and in the context of replication will have a higher risk of conflicting results due to chance.
(3) Initial studies could publish recommended designs for replication efforts based on their power, effect size and significance testing strategy, and those planning validation studies can also use P(FFR) and P(DFP) formally in their planning to ensure they are minimizing the weighted summed risk.
In sum, NHST does not necessarily lead to a replication crisis. Type 1 error inflation is due to factors outside the realm of best practices of NHST. These factors are very much under the control of the researcher, including those considering studies to replicate.
I am not attached to the rules of thumb that emerged from this exploration; in particular, the results shared are dependent on a 10,000:1 relative weighting of DFP relative to FFR. The Excel file is shared so others can manipulate the weighting on Sheet 1. Cut and paste the re-weighted results into a new sheet and rank sort to see if your weighting scheme gives different results.
Happily, the goals of minimizing both risks are aligned, so any step taken in the NHST framework to minimize one will tend to minimize the other (generally speaking).
James Lyons-Weiler, PhD
Allison Park, PA 15101