8.B Reliably Predicting Reproducibility
Those who are concerned about the current state of science invariably cite the statistical arguments of John Ioannidis, Katherine Button, and their colleagues that science is suffering a Reproducibility Crisis.
What gets less attention is that the work of these statisticians also implies that hypothesis-based work should, as a rule, be more reproducible than less structured research programs.
8.B.1 Positive Predictive Validity
Physically replicating studies to find out if they are reliable can be a formidable undertaking, as we've seen. As an analytical tool for forecasting reproducibility, statisticians introduced the concept of positive predictive value (PPV),1 which is a number that ranges between 0 and 1 (higher is better) and estimates the likelihood that a result will be reproducible. PPV combines three familiar factors: the p-value, statistical power, and the pre-study probability that a hypothesis is true. The conclusion that we're going to arrive at is that having a focused experimental hypothesis with only a few alternative hypotheses gives a much higher PPV than, say, gene search strategies where investigators are testing thousands of alternative “hypotheses.” This means that the chances of reproducing hypothesis-based experimental outcomes are much higher than the chances of reproducing the results of relatively unconstrained searches. I think it will be helpful to outline the reasoning behind the PPV, but if you're only interested in the bottom line, you can skip to Section 8.B.2 and take up the argument there.
Suppose that you've compared the heights of two groups of people and found a significant, p < 0.05, difference between them (we'll put aside objections to null hypothesis significance testing mode [NHST]2 for the moment). The p-value only tells you how likely it is that random chance alone would produce a difference at least as large as the one you observed, that is, how readily you could be fooled into thinking that there is a real difference between the experimental groups.
However, the 5% (p < 0.05) chance of being wrong does not mean that you have a 95% chance of being right. That is a major shortcoming. You want to know the truth: Do the two groups differ in height or not? And you can't get that from the p-value alone. Enter the PPV.
To make valid predictions, you want to know if your result is going to be reliable. The PPV depends on the p-value and the statistical power of your test. Power, remember, is the probability that you will correctly reject the null hypothesis when it is false and correctly conclude that there is a difference between the experimental groups. The more powerful the test, the more likely you are to correctly decide that one group of women is in fact taller than the other, and so on.
In addition to the p-value and statistical power, to calculate PPV you need to have a value of the pre-study probability or pre-study odds that a result you get will be correct. (In contrast, the PPV itself gives you the post-study odds of being correct.) The pre-study odds are analogous to the prior probability of Bayesian statistics. As with the Bayesian priors, you can't determine pre-study odds exactly; luckily, ballpark estimates are frequently good enough.
To estimate pre-study odds, you can work from informed guesses about the number of “true” (i.e., provisionally true) hypotheses that you might find (T) and the number of all hypotheses that you will evaluate (the universe of all true and false ones, T + F). Assuming that each hypothesis is either T or F, then the simple frequentist probability that your hypothesis is true is T/(T + F). (Yes, Karl Popper would no doubt frown at this whiff of “probable truth,” but never mind.) Obviously, when you have a focused hypothesis with one potentially true result and only a few false alternatives, T/(T + F) will be much greater than when you're doing an experiment where there is only one potentially true result and thousands of false ones. In that case, T + F will be huge and pre-study odds, T/(T + F), tiny.
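To make the contrast concrete, here is a minimal Python sketch of the T/(T + F) estimate. The function name `pre_study_probability` and the example counts (one true hypothesis against 3 or 999 false alternatives) are illustrative assumptions, not figures from any particular study.

```python
def pre_study_probability(n_true: int, n_false: int) -> float:
    """Return T / (T + F): the share of candidate hypotheses that are true."""
    total = n_true + n_false
    if total <= 0:
        raise ValueError("need at least one hypothesis")
    return n_true / total


# Focused hypothesis: one true result and three false alternatives (assumed counts).
focused = pre_study_probability(1, 3)

# Unconstrained search: one true result among 1,000 candidates (assumed counts).
screen = pre_study_probability(1, 999)

print(f"focused: {focused:.3f}, screen: {screen:.3f}")
# prints: focused: 0.250, screen: 0.001
```

Under these assumed counts, the focused design starts with a pre-study value 250 times larger than the screen, before a single measurement has been made.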
Pre-study odds will be critical because PPV depends directly on them.
There is a slight wrinkle: T/(T + F) gives the odds of occurrence of true results, not the odds of detecting true results. What's the difference? Because all tests are imperfect, you cannot detect all of the true results that might occur, but only the fraction that the power of your test allows you to detect. This is why you need to know about statistical power. If your statistical test had a power of 0.8, which would be very good, you could theoretically detect 80% of the true results, not 100% of them. PPV goes down if power goes down.
8.B.2 Calculating PPV
When you put it all together, the formula for PPV is:
$$
\mathrm{PPV} = \frac{\text{pre-study odds}}{\text{pre-study odds} + \dfrac{\text{p-value}}{\text{statistical power}}} \qquad \text{(Equation 8.1)}
$$
The formula is an arrangement of the three factors we've been discussing. It states that PPV increases toward 1 (the maximum) as statistical power or pre-study odds increase. PPV also increases as the p-value decreases because then the fraction, p-value/statistical power, goes toward 0. And, of course, combinations of these factors can occur.
The formula makes intuitive sense. Remember that we're trying to figure out the probability that experimental results will be reproducible. If you know going into an experiment that there is a good chance that you'll detect a true result because one hypothesis is based on a lot of prior information, then the pre-study odds will be high. If the power of your test is high, you'll be likely to reject the null hypothesis correctly. Finally, if the p-value is small, you'll be less likely to mistakenly reject the null hypothesis. All of these factors will make the PPV greater.
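The behavior described above can be checked with a short Python sketch of Equation 8.1. The function name `ppv` and the input values (pre-study odds of 0.25, power of 0.8, and so on) are assumptions chosen only to show the direction of each effect.

```python
def ppv(pre_study_odds: float, p_value: float, power: float) -> float:
    """Equation 8.1: odds / (odds + p_value / power)."""
    if pre_study_odds <= 0:
        raise ValueError("pre-study odds must be positive")
    if not 0 <= p_value <= 1:
        raise ValueError("p-value must lie between 0 and 1")
    if not 0 < power <= 1:
        raise ValueError("power must be greater than 0 and at most 1")
    return pre_study_odds / (pre_study_odds + p_value / power)


# Illustrative baseline (assumed values, not from a real study).
baseline = ppv(pre_study_odds=0.25, p_value=0.05, power=0.8)

# Change one factor at a time and compare with the baseline.
higher_odds = ppv(pre_study_odds=0.50, p_value=0.05, power=0.8)
lower_power = ppv(pre_study_odds=0.25, p_value=0.05, power=0.4)
smaller_p = ppv(pre_study_odds=0.25, p_value=0.01, power=0.8)

for label, value in [
    ("baseline", baseline),
    ("higher pre-study odds", higher_odds),
    ("lower power", lower_power),
    ("smaller p-value", smaller_p),
]:
    print(f"{label}: PPV = {value:.3f}")
```

Raising the pre-study odds or shrinking the p-value pushes the PPV above the baseline, while halving the power pulls it below, which is the pattern the formula predicts.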
Here is a numerical example: suppose that you suspect that a certain disease is caused by a genetic mutation, and you plan to screen 1,000 genes to try to find a genetic marker for the disease (i.e., a gene whose expression is reliably associated with the disease). You have no prior information, and you are blindly sifting through the group of genes, comparing their expression in normal and diseased individuals.
You're looking for changes that are significant at the p < 0.05 level, and, to keep the illustration simple, let's assume there is just one true marker gene for this disease. In a sense, you are evaluating 1,000 distinct “hypotheses” (this is how the statisticians analyze the experiments) and trying to find the right one by eliminating the rest. Notice, and this is the crucial point, that the pre-study odds that any particular hypothesis is true are extremely low: just 1/1000 = 0.001!
Now, at the significance level of p < 0.05, if you screen 1,000 genes, you expect to get about 50 positive hits in total (we'll pretend that the number is exactly 50 to keep it simple). That is, we expect about 50 of our tests to be significant at p < 0.05. If one of the hits is the right gene, then the rest (49/50) will be false positives; you'll be wrong 98% of the time. The odds of detecting the true gene are heavily stacked against you.
What would PPV be in such a case? With pre-study odds of 0.001, a p-value of 0.05, and, again to keep the numbers simple, an assumed medium statistical power of 0.5, the PPV, the probability of reproducing your finding, would be about 0.01, or 1 in 100. Even if the power of your test were maximal, 1.0, the PPV would only be about 0.02. Your finding would stand only a 2% chance of being replicated under the best of circumstances.
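The arithmetic of the gene-screen example can be reproduced in a few lines of Python. This is only a sketch: the helper `ppv` implements Equation 8.1 under the chapter's simplifying assumptions (exactly one true marker gene, exactly 50 significant hits), and the variable names are illustrative.

```python
def ppv(pre_study_odds: float, p_value: float, power: float) -> float:
    """Equation 8.1: odds / (odds + p_value / power)."""
    return pre_study_odds / (pre_study_odds + p_value / power)


n_genes = 1000   # genes screened
n_true = 1       # assumed number of true marker genes
alpha = 0.05     # significance level, p < 0.05

pre_study_odds = n_true / n_genes   # 1/1000 = 0.001
expected_hits = alpha * n_genes     # about 50 significant tests
false_positive_share = (expected_hits - n_true) / expected_hits   # 49/50

ppv_medium = ppv(pre_study_odds, alpha, power=0.5)
ppv_max = ppv(pre_study_odds, alpha, power=1.0)

print(f"expected hits: {expected_hits:.0f}")
print(f"share of hits that are false positives: {false_positive_share:.0%}")
print(f"PPV at power 0.5: {ppv_medium:.2f}")   # prints 0.01
print(f"PPV at power 1.0: {ppv_max:.2f}")      # prints 0.02
```

The unrounded values are roughly 0.0099 and 0.0196, so the 1-in-100 and 2% figures are rounded.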
You could tweak the outcomes a bit by making your significance threshold more stringent: The number of false positives would go down if you opted for p < 0.01, for example, but this would decrease your chances of finding true positives (i.e., your statistical power would go down). And if you were to do a search of 10,000 genes, the PPV would be much lower.
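A rough Python sketch shows how these adjustments play out in Equation 8.1. The reduced power of 0.3 for the stricter threshold is a purely hypothetical value, since the actual loss of power depends on the study, and the 10,000-gene case assumes there is still only one true marker.

```python
def ppv(pre_study_odds: float, p_value: float, power: float) -> float:
    """Equation 8.1: odds / (odds + p_value / power)."""
    return pre_study_odds / (pre_study_odds + p_value / power)


# Original screen: 1,000 genes, p < 0.05, power 0.5.
original = ppv(1 / 1000, 0.05, 0.5)

# Stricter threshold: p < 0.01, with power assumed to drop to 0.3 (hypothetical).
stricter = ppv(1 / 1000, 0.01, 0.3)

# Larger search: 10,000 genes, still one true marker (assumed), p < 0.05, power 0.5.
larger = ppv(1 / 10000, 0.05, 0.5)

for label, value in [
    ("1,000 genes, p < 0.05", original),
    ("1,000 genes, p < 0.01", stricter),
    ("10,000 genes, p < 0.05", larger),
]:
    print(f"{label}: PPV = {value:.4f}")
```

With these assumed numbers, the stricter threshold lifts the PPV only to about 3%, and the tenfold larger search drops it to roughly 1 in 1,000.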
These are the main concepts in working with the PPV. The formula combines everything we know about the p-value, power, and pre-study probability into a single number. The PPV is a driving force for Ioannidis's claim that “most research results are false,”3 and you can see why he's pessimistic about the reproducibility of this kind of experiment.
8.