C The Reproducibility Project: Psychology
The RPP was conducted by an international group of scientists calling themselves the Open Science Collaboration and was led by Brian Nosek. It was a meticulously planned and executed effort to assess the reliability of psychological science.
The Open Science Collaboration consisted of numerous teams of researchers who tried to duplicate the experimental findings reported in 100 papers published in top-line psychology journals in 2008. For the most part, the teams stuck closely to the conditions reported in the original papers, checking with the original investigators when necessary; thus, they mainly tested for direct reproducibility, although they did allow some variations in procedure, which contributed to the subsequent controversy about the RPP's own conclusions.
An especially important methodological feature of the RPP, one that I'll return to several times, was its selection of the particular experiments that the teams would try to replicate. Each paper they reviewed included an average of about 3 “studies,”36 and each study comprised several experiments that all contributed to the main conclusion of the paper. Each experiment was summarized by a single, statistically derived number (usually either a p-value or effect size).37,38 Obviously, for practical and logistical reasons, it was out of the question for the RPP teams to try to replicate all of the experiments in all of the papers, so the teams focused on only one “key result” per paper and, for the sake of uniformity, took the last one “based on the intuition that the first study in a multiple-study article (the obvious alternative selection strategy) was more frequently a preliminary demonstration.” As the story unfolds, we'll see that this apparently sensible strategy limits the applicability of the RPP findings.
What did the RPP teams learn? That precise reproducibility of original experiments was disappointingly hard to come by.
For example, the RPP teams had estimated that they should have been able to reproduce 89% of the original experiments,39 yet they could actually reproduce only 36-47% of them (depending on the particular metric of reproducibility they used). In addition, the replicated effects were, on average, only half as large as the original effects, and 80% of the replicates were numerically smaller than the originals. In the worst cases, an RPP repeat study obtained effects that were opposite to the reported ones (e.g., if the original report stated that X treatment significantly improved memory, the team found that X made memory significantly worse). There were a few bright spots: the more impressive the original results (smaller p-values, larger effect sizes), the more likely they were to be replicated. Nevertheless, the RPP approach was meticulous, and its results appear to constitute a serious indictment of the way science is currently done.
The RPP report is sobering, but does it imply that half of the original research papers came to invalid conclusions? That they were worthless? Bogus? Does the report fatally undermine their credibility or suggest that the money spent on them was wasted? Perhaps not. There are interpretations of the RPP data, and of the relationship of reproducibility to science, other than the one the RPP authors favor.
A published commentary by a group of psychologists led by Daniel Gilbert takes issue with the RPP report's main conclusions.40 Gilbert et al. looked for analytical reproducibility. According to Gilbert et al., the RPP teams overestimated how many studies should have produced positive results; in addition to the number expected to fail because of random sampling error, “infidelities” between the RPP teams' procedures and the reported ones precluded precise replicability in some cases. Furthermore, not all of the replicate studies were overtly testing for direct reproducibility; some were testing for systematic or conceptual reproducibility which, as I've argued, don't count as forms of reproducibility at all.
Gilbert et al. also point out that when the RPP teams' procedures got the original investigators' full stamp of approval, the rate of reproducibility went up, which hints that slight, perhaps easily overlooked differences in original and replicate protocols may have caused problems. I'll return to this issue shortly, and it will turn out that even apparently insignificant procedural discrepancies can doom efforts to reproduce published results. In summary, Gilbert et al. calculated that if all of the sources of variability, both random and systematic, between the original and replication studies of the RPP teams were completely accounted for, then 34% of the replicates should have failed, which is not far from the number of failures that the RPP actually found (36-47%). For Gilbert et al., the RPP results are no cause for alarm.
The RPP group rebutted Gilbert et al.'s charges,41 pointing to shortcomings in the critics' own reasoning but, in the end, conceded that the RPP data could support both pessimistic (theirs) and optimistic (Gilbert's) conclusions about the question of reproducibility in psychology research, adding that there is “no such thing as exact replicability.” Two subsequent statistical reanalyses of the RPP came to differing assessments about the RPP report: one held that “It really looks like [the RPP's statistical] power was very high,”42 and the other that “apparent failure of the RPP to replicate many target effects can be adequately explained by overestimation of effect sizes”43 (i.e., the reproducibility studies themselves were not always above reproach).
There is no disagreement that science needs to tighten up its standards and find ways to improve the reproducibility of its results. It is also the case, however, that reproducibility is not always of transcendent importance, and insisting that it is can generate its own problems. How can this be? A cautionary note from the RPP report provides a clue: “It is too easy to conclude that successful replication means that the theoretical understanding of the original finding is correct. Direct replication mainly provides evidence for the reliability of a result,” not that we have achieved good understanding of it.
The significance of any scientific project depends both on what it has found (what the data actually show) and what its findings mean (what it tells us about the world). Reproducibility is a valuable characteristic of a scientific result, and, ultimately, true results must be reproducible. However, reliability also involves a judgment about theoretical understanding. Basic science tries to learn the truth about the world, and raw reproducibility is at best a leading indicator of the truth value of a scientific result. According to the RPP authors, none (“Zero.”) of their positive replications establish the truth of any of the findings that they tested.
We should consider that reproducible results may be unreliable; they may in fact be untrue. Suppose you got a cold and several observations suggested that you contracted it from your cousin at the annual family barbecue: he was there; looked ill; gave you a big, breathy hug; and sneezed frequently. But the sneezing might have been brought on by his hay fever allergy and have had nothing to do with his cold. Perhaps your aunts, uncle, and sister all witnessed your cousin sneezing and drew the same conclusion from it as you did: he's got a cold. Despite the fact that your observation was reproducible, it was unreliable because it was based on your common ignorance of his hay fever allergy. Likewise, the countless reproducible observations that supported Newton's theory of mechanics did not guarantee that the theory itself would be the last word on the subject, as Einstein's theory showed.
What should we make of all this? Two lessons will carry forward: one is about the investigations that the RPP repeated. Given the quantity, quality, and variety of the complaints about irreproducible results, it is likely that there are genuine problems, although blanket claims of a “crisis” seem unwarranted.
It is worth recalling in this context the wide range of estimates of reproducibility made by respondents to the Nature survey44: from 80% (physics) to 30% (“other” sciences, presumably including psychology), which suggests that the extent of the problem varies across scientific fields.
However, another lesson from the RPP controversy is that we shouldn't take its conclusion, any more than we take the conclusion of any scientific report, at face value. How do we decide who is right, the RPP or its critics? This is the kind of intellectual skirmishing that goes on all of the time in science. Science is an inherently adversarial method of knowledge generation; ideas are treated roughly and have to prove themselves worthy of respect by standing up to severe challenges. In empirical sciences, better evidence, rather than nuanced technical debate, determines the outcomes of controversies. A comment supposedly made by the physicist Ernest Rutherford, “If your result needs a statistician, then you should design a better experiment,” is apropos.
What else can we learn from studying the RPP? In Chapter 1, we saw that all scientific projects have to assume that certain unproven ideas—implicit hypotheses—are true in order to proceed. In the next few sections, I will examine how some of the assumptions that the RPP scientists made could affect the applicability of their conclusions.
7. C.1 The Reproducibility Project as the Subject of Meta-Science
The RPP was a large, controlled experiment, meaning that it attempted to keep as many factors constant as possible to reduce the influence of random variability. Usually, controlled experiments are highly desirable, although their advantages can be lost if they hide meaningful variability. I mentioned the crucial tactic of the RPP: selecting a single result from each paper chosen for replication. It is important to stress that the principal “finding” of a multipart scientific paper is hardly ever based on one experiment; typically, the paper pulls together evidence from several experiments into an interpretation that makes sense of all of them.
Note that although the main conclusion of a paper depends on a group of interlinked experiments, one experiment may be absolutely essential and others secondary. The result from this experiment gives rise to the central hypothesis of the paper, and, if it were absent, then the secondary experiments might not stand alone, or at least they would not command the same degree of attention as the whole group would. Indeed, if the crucial result was not repeatable then, depending on its potential impact, many people would lose all interest in the report. And, to turn the argument around, no matter how crucial, a single experiment by itself is scarcely ever granted publication in a respected journal in most fields of biological or social science; the secondary experiments are generally tests of predictions of the central hypothesis, and, in my experience, reviewers and editors can be quite keen to have a few of these predictions tested before they’ll approve a piece of work.
The conclusion of each paper selected for replication by the RPP was supported by the outcomes of an average of three multipart experiments. How could the decision to focus on one experimental result affect the relevance of the RPP’s conclusions to everyday science? We’ll consider some possibilities next.
7. C.2 Reproducibility: All-or-None?
Reproducibility is not a binary condition: a scientific report may not be all right or all wrong. One part of a multipart report may not replicate, while other parts do. Or a replicated result might match the original result qualitatively though not quantitatively, as was the case in many of the RPP experiments. It is often hard to classify reports as “reproducible” or “irreproducible.”
“Reproducibility” may also lie in the eye of the beholder. In the blog Science Based Medicine, Managing Editor and physician-scientist David Gorski recounts his attempts to duplicate a finding made by the renowned biomedical researcher, Judah Folkman.45 Folkman and his team had found that a chemical, later named angiostatin, was secreted by solid primary tumors and almost completely suppressed the development of secondary tumors, metastases, by preventing the outgrowth of blood vessels necessary to feed the metastases. This was a novel and powerful hypothesis, with numerous clinical and theoretical implications. However, Gorski and colleagues observed that angiostatin was “never... as potent” as Folkman’s group had found. Still, they did find that angiostatin retarded secondary tumor growth. Did they reproduce Folkman’s results? Yes and no. For a drug company hoping to discover a new anti-cancer therapy, with a multimillion-dollar R&D project hanging in the balance, the answer could well be “no.” Yet angiostatin slowed tumor growth and, as Gorski and colleagues later showed,46 it augmented the benefits of radiation therapy.47 This effect of angiostatin might have remained undiscovered had it not been for their systematic extension of Folkman’s prior work. So, depending on the definition of reproducibility you're using, the answer to the question could be “yes.” Science is quite often like this: more nuanced than can be fit into a simple yes-or-no, reproducible or not-reproducible, type of pigeon hole.
7. C.3 Reasonable Assumptions and Unexpected Consequences
Let's take another look at the RPP group's rationale that choosing the last experiment in each paper was safer than choosing the first one. Again, this sounds sensible, but might be problematical. When scientists are just talking, you occasionally hear mutterings about “the last experiment” in a paper they've read, because they're suspicious that it was added at the request (insistence?) of one of the anonymous peer reviewers of the article.48 (It was once explained to me by a senior luminary in my field that the appropriate posture for an author seeking acceptance of a paper into a top journal is that of “a dog lying on its back, exposing its belly”; i.e., abject submissiveness.) However this may be, it is uncontroversial to say that investigators generally feel that they have little leeway; if one of the omnipotent peer reviewers of their work asks for more information, usually investigators will comply, even if it means tacking on an extraneous experiment. It is possible, therefore, that the last experiment in a paper is less essential to the main message than the others. At the other end of the spectrum, some investigators will say that they save the “best for last”; that is, they put their strongest experiment at the end of their paper. Finally, authors sometimes simply add on an experiment that was an interesting tangent from the main stream of the paper, not an integral part of it. My point is that last experiments may not be representative of the rest. Indeed, the RPP authors themselves caution that “it is not necessarily the case that the identified effect [i.e., the one from the last experiment] was central to the overall aims of the article.” Still, their conclusion as to whether or not a given report was reproduced was deliberately based on the last experiment.
None of this is meant to cast aspersions on the approach taken by the RPP group, merely to emphasize that their plausible strategy depended on an unproved assumption that could be wrong. A corollary, that all experiments in a paper are equally likely to be reproduced, is also unproved. It remains to be determined whether such assumptions affect the overarching conclusions coming out of the RPP.
7. C.4 One Bad Result May Not Be Fatal
Because the main message of a multipart report rarely hinges on one experimental result, the failure of one result does not completely undermine the report's validity. The fact that your cousin's sneezing was an unreliable sign of his cold does not invalidate your conclusion that you caught your cold from him.
In science, good individual tests of a hypothesis rest on independent background assumptions. Indeed, the primary reason that scientists do multiple tests49 in the first place is that, while they all have weaknesses, they are unlikely to have the same weaknesses! If the results of several truly independent tests support the same conclusion, you feel more confident that it is not the spurious result of any single test.
While your confidence in a scientific hypothesis based on multiple lines of evidence might be shaken if one of its predictions turned out to be wrong, you might not entirely lose faith in it. Wishing to avoid the negative consequences of basing conclusions on single tests is partly why the RPP authors warn us not to overinterpret their single-result replications. The authors don't emphasize that normally science relies on tests of multiple predictions for strong positive reasons as well.
7. C.5 Do We Know Enough to Be Able to Replicate Experiments?
A critical implicit assumption made by the RPP, and indeed, any scientist, is that you can correctly guess which environmental variables are going to affect your experiments. If you knew all of the factors that affected the original results, then you could take them into account when you try to reproduce those results. Though guessing the relevant variables sounds simple enough, in reality it may be anything but.
For instance, recently, a research group conducting behavioral experiments on laboratory rats found that a male researcher got a different result when he did an experiment than a female researcher got when she did the identical experiment.50 And they were in the same lab, studying the same animals, with the same equipment! The first hypothesis was that the experimenters themselves somehow influenced the rats; maybe the males were rougher with the rats? But this hypothesis predicted that actual people would have to be present to get the effect, and it turned out that clean, new T-shirts, each worn overnight by either a male or a female and placed near the rodents, were equally good at causing the alterations in rat behavior. What was going on?
The experimenters tested the hypothesis that something transferable to the T-shirts, probably a chemical, affected the animals. They tried various chemical agents, several species of non-human (as well as human) males, behavioral measures of stress, hormone analyses, etc. Only after testing and eliminating numerous hypotheses did they hit on the present explanation: secreted male hormones affect rat behavior by inducing stress responses in the animals. Among the many lessons from this elegant study is one that should make all scientists stop and think: How good are we really at intuiting what the important experimental variables are? The lessons for science in general, and reproducibility in particular, are obvious. If you didn't know about the male-specific effect, and men and women had alternated animal-testing duties, you'd miss detecting the hormonal effect. If two separate labs, one made up of men and one made up of women, had independently conducted the studies, the results of each group would have been irreproducible by the other.
Worse yet, nature can throw up roadblocks even when we do our best to anticipate and avoid them, as a report by Hines et al.51 reveals. These researchers describe the frustration they experienced in attempting to generate reproducible results. Two laboratories, one in Berkeley, California, and one in Cambridge, Massachusetts, were working on a long-distance collaborative project, and both were thoroughly experienced in all of the technical methods. But they were initially stumped when they could not duplicate each other's observations. Their project required them to identify the various kinds of cells in human breast tissue, and, to do that, both groups had to isolate individual cells from surgically obtained tissue samples. Colored antibody molecules were used to label specific proteins in the cells, and the plan was to use the antibodies in pairs to identify the target proteins in all cells. Each kind of cell had a distinctive combination of the proteins, and the kinds of cells could be sorted into groups according to their color patterns. You could use the same method to identify the different states in the United States according to their ethnic make-up. For example, 46.3% of the population of New Mexico is of Hispanic heritage,52 while Hispanics constitute only 2.5% of the population of North Dakota. Conversely, German descendants account for 9.8% of New Mexicans and 46.8% of North Dakotans. You could unambiguously identify each state by its ratio of residents of Hispanic heritage to those having German ancestors.
The scientific problem was that the two groups of investigators kept getting completely different proportions of the proteins in their cells despite starting with the same biological tissue. It was as if one group came up with the Hispanic/German ratio typifying New Mexico, while the other got the North Dakota ratio when they were both studying the New Mexicans. After they had tried for more than a year to pin down the discrepancy, it came down to a technical detail: to isolate the cells in their laboratory, the Cambridge group used a device that swirled their tissue samples in flasks, whereas the Berkeley team gently rocked theirs in a mechanical laboratory shaker. When they both used the same method, they got the same results. Now, just why the difference in devices had such big effects is a mystery; however, the lesson is plain. What looked like a minor difference in technique had major consequences for the results. Had the research teams been working independently, then, again, each team would have obtained perfectly valid results that the other team would have considered irreproducible.
The stories of the cell-sorters and the male and female experimenters are two of countless instances in which subtle and easily overlooked minutiae of the experimental environment could lead to wildly divergent results. Nobody knows how common such dramatic effects are, but probably they’re not million-to-one outliers. While more extensive lists of experimental details in scientific papers will alleviate some of the strains caused by irreproducibility, the examples highlight the potential subtleties of the problem. Will we really think to report the genders of our laboratory team members, for example? Even if extreme levels of disclosure were to prove necessary (and societal objections would surely arise at some point), it will be impossible to guess and correct for every conceivably significant variable.
These considerations alone should give us pause before accepting conclusions that most science is irreproducible and that therefore someone needs a scolding. And what about the implication that the replication team’s results are more valid than the originals? How do we know, if someone did make a mistake, that it was the originator, not the would-be replicator of the research?
Is irreproducible science invariably bad science? The dataset obtained from swirled tissue is not obviously inferior to that obtained from gently rocked tissue. The experimental examples show that irreproducibility may follow impeccable lab procedures carried out by careful investigators. We must keep in mind that, at first, an irreproducible result is just another observation in search of an explanation.
Finally, the examples raise other deep questions: If a result is so fragile that it is dramatically altered by apparently trivial details, how significant can it be? How can we tell which procedures give us the best insight into the natural systems that we are most interested in? What are the underlying mechanisms of the unanticipated effects—why do male hormones affect rats; how does the swirling or rocking affect human breast tissue cells? Irreproducible results that do not represent error, incompetence, misconduct etc., can be signs of potentially fertile scientific ground, or, in Stuart Firestein’s evocative phrase,53 they may represent areas of “productive ignorance.”
7. C.6 Multiple Tests
Intuitively, we feel that conclusions backed up by multiple tests are on firmer ground than those supported by only one test. Let’s say that you’re trying to improve rats’ ability to learn how to run a maze. Your hypothesis predicts that a drug, SmartRat, should be effective, so you give it to one group of animals and a harmless placebo to another group, and you compare how fast the groups learn to get through the maze. If the SmartRat group is significantly faster at the p < 0.05 level, you'd be tempted to infer that the drug was beneficial. However, there is a chance that SmartRat didn't do anything and you've been fooled by randomness.54 You wouldn't want to launch a costly drug development program if there were a 5% chance that you'd lose a lot of money.
How would doing multiple tests change things? Suppose you use SmartRat and a chemically distinct drug, RatHiQ, that has the same predicted effects on the rat nervous system. You give SmartRat to one group and RatHiQ to another and, again, compare them to placebo-dosed control animals. If you find that both SmartRat and RatHiQ boost the rats' performance, you feel more confident than when you had only the one-drug result. Your confidence could still be misplaced, but the odds of getting significant effects with two drugs by random chance surely would be much smaller than they would be with one. The more independent tests that a scientific hypothesis passes, the more well-corroborated it is, and the less likely its successes resulted from chance. (In Chapter 8, I introduce a way of combining individual test results so you can quantify the net probability of getting two [or more] of them.)
The case of the male and female researchers who got different results when they were studying rat behavior is an extreme example of multiple testing. A rough count indicates they did 17 distinct tests before reaching their final conclusion. Even if every result was only significant at the p < 0.05 level, the chance of doing 17 tests that would all be consistent with the hypothesis would have been extremely low. While it is true that this was an unusually thorough study, conclusions of most reports rely on multiple tests and are therefore probably more reliable than you would guess from considering only one. Yet, again, much of the concern about reproducibility is based on the examination of one result from multipart studies.
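The arithmetic behind this intuition can be sketched in a few lines. This is only a back-of-the-envelope illustration, not necessarily the combining method introduced in Chapter 8: it assumes that the null hypothesis is true for every test and that the tests are fully independent, which real tests of a single hypothesis rarely are. The function name `chance_all_significant` is hypothetical.

```python
def chance_all_significant(alpha: float, n_tests: int) -> float:
    """Probability that n_tests independent tests are ALL significant by chance.

    Assumes (for illustration only) that no real effect exists and that
    the tests are statistically independent, so the per-test false-positive
    rates simply multiply.
    """
    return alpha ** n_tests


# One drug, two drugs, and the roughly 17 tests counted in the rat study.
for k in (1, 2, 17):
    print(f"{k:>2} test(s): {chance_all_significant(0.05, k):.3g}")
```

If the tests share a weakness (the same animals, the same apparatus, the same experimenter), the true probability of a run of chance successes is higher than this product suggests, which is why the independence of the background assumptions matters so much.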
What about the RPP itself? Did the RPP teams try a multiple-tests approach? That is, did they use more than one method to test the generality of their own findings? Not exactly, but another group effort, the Many Labs Project, also led by Brian Nosek, did. In a sense, the Many Labs Project was the converse of the RPP. Rather than having many individual teams each test the replicability of a single published result, the Many Labs Project had a group of laboratories all try to replicate the same small number of published results. The results of the two studies were poles apart. Whereas the RPP suggested a reproducibility rate as low as 36%, the Many Labs Project concluded that 77% (10/13) of the studies that they looked at were reproducible. Although there is a dispute about the significance of this difference,55-57 its existence indicates that meta-science projects are not exempt from the uncertainties that plague ordinary science.
We have been looking broadly at the RPP as a prototypical scientific investigation and, in these last few sections, reviewing background assumptions that the RPP scientists made. When it comes to background assumptions, we always hope that they are innocuous and won't affect the outcomes of our experiments. Science continually faces a choice of whether to devote indefinite amounts of time to keep from being tripped up by an unjustified assumption, or to take a chance, make the assumption, and hope for the best. Science usually manages to strike a balance.
There are one or two other lessons about scientific thinking that we can draw from reviewing the reproducibility studies. They have to do with another kind of implicit assumption that is also quite common in science: that we know what we should do and why we should do it.
7. C.7 Truth, Reproducibility, and the RPP
Although the RPP authors were all well-versed in the principles of scientific thinking, in their report they didn't draw the distinction between seeking truth and seeking reproducibility as cleanly as they could have. The RPP was about data (i.e., direct reproducibility; getting the same numbers as the original report). The RPP authors would agree that the pinnacle of scientific achievement is not spreadsheets packed with reproducible numbers. They say, “Accumulation of evidence is the scientific community's method of self-correction and is the best available option for achieving that ultimate goal: truth” and “Understanding is achieved through multiple, diverse investigations that provide converging support for a theoretical interpretation and rule out alternative explanations.” Simply put, what science knows goes beyond mechanical, direct reproducibility to the weighty conceptual issues related to the hypothesis.
7. C.8 Bayes and the Reproducibility Crisis
The RPP group asked a straightforward question—were the original published results of 100 psychology studies replicable or not?—but the answer was not straightforward. Arguments on both sides of the reproducibility debate were based on conventional statistical testing methods, which have their own problems, as we saw in Chapter 5.
Against this backdrop, a paper by Etz and Vandekerckhove offers a novel perspective.58 Rather than engaging the topic of reproducibility directly, they asked if the original and replicate studies yielded the same degrees of evidence for or against the hypotheses of interest. Let's suppose that each study, original and replicate, was testing an experimental hypothesis, H1, against a null hypothesis, H0, where H0 is that the experimental treatment had no effect. If there was a failure of reproducibility because of experimenter sloppiness, incompetence, or worse, you'd expect the original data to report strong evidence for H1, and the replicate studies to reject H1 in favor of H0. Is this the case? Etz and Vandekerckhove turned to Bayesian methods to find out.
From a Bayesian perspective, each study in the RPP amounted to a comparison of the strengths of H0 and H1. Therefore, you can evaluate each study by its Bayes' factor, the ratio of the likelihoods that H0 or H1 accounted for the reported data (D); that is, the Bayes factor is [p(H0|D)/p(H1|D)]; see Chapter 6, Box 6.2. Etz and Vandekerckhove calculated Bayes' factors for all of the studies (originals and replicates) and rated a Bayes' factor of >10 as strong evidence in favor of one hypothesis over the other and a Bayes' factor of and tried to achieve high power levels in their experiments. Still, because their analyses were based on the overly large reported effect sizes, the replication sample sizes were too small (you don't need as many subjects to detect a large effect as you do a small one). In fact, Etz and Vandekerckhove noted that the larger the study, the higher the Bayes' factor, regardless of whether it was an original or a replicate. Hence, despite the best efforts of the RPP teams, the replication studies were underpowered, meaning they were insensitive to detecting real effects.
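The remark that a large effect needs fewer subjects than a small one can be made concrete with a rough power calculation. The sketch below is not taken from the RPP or from the Bayesian reanalysis; it uses the textbook normal approximation for a two-sided comparison of two equal-sized groups, and the effect sizes (0.6 as a hypothetical "reported" effect, 0.3 as a hypothetical "true" effect half that size) and the function names are assumptions chosen for illustration.

```python
from math import ceil, sqrt
from statistics import NormalDist

STD_NORMAL = NormalDist()  # mean 0, standard deviation 1


def n_per_group(effect_size: float, alpha: float = 0.05, power: float = 0.9) -> int:
    """Approximate subjects needed per group to detect a standardized effect.

    Normal approximation for a two-sided, two-sample comparison:
    n = 2 * ((z_{1-alpha/2} + z_{power}) / d)^2
    """
    z_alpha = STD_NORMAL.inv_cdf(1 - alpha / 2)
    z_power = STD_NORMAL.inv_cdf(power)
    return ceil(2 * ((z_alpha + z_power) / effect_size) ** 2)


def achieved_power(true_effect: float, n: int, alpha: float = 0.05) -> float:
    """Approximate power when n subjects per group face the true effect.

    Ignores the negligible chance of significance in the wrong direction.
    """
    z_alpha = STD_NORMAL.inv_cdf(1 - alpha / 2)
    return STD_NORMAL.cdf(true_effect * sqrt(n / 2) - z_alpha)


# Plan a replication around a hypothetical reported effect of 0.6 ...
n_planned = n_per_group(0.6)
# ... then ask how it fares if the true effect is only half as large.
print("planned n per group:", n_planned)
print("n needed if the effect is really 0.3:", n_per_group(0.3))
print("power of the planned study against 0.3:", round(achieved_power(0.3, n_planned), 2))
```

Because the required sample size scales with the inverse square of the effect size, halving the effect roughly quadruples the number of subjects needed, so a replication sized for an inflated published effect can be badly underpowered even when it was planned carefully.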
In a final touch, Etz and Vandekerckhove accounted quantitatively for essentially all of the difference between original and replication studies by taking into account publication bias “without recourse to hypothetical hidden moderators.” This implies that the RPP doesn't support the more alarmist concerns about questionable scientific practice as a cause for irreproducibility.
Details aside, Etz and Vandekerckhove's work represents a “good-news, bad-news” scenario. The good news is that science is not heavily tainted by careless error or sleazy behavior. The bad news is that many published studies are neither true nor false, but simply inconclusive. Science is suffering not so much from a “reproducibility” crisis as from a “producibility” crisis; it's not that we can't replicate solid findings, it's that we can't produce solid findings in the first place. Etz and Vandekerckhove favor policy measures to combat publication bias, and they call on scientists to pay greater attention to the importance of statistical matters than they frequently do.
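A toy calculation may help show what "inconclusive" means in Bayesian terms. The sketch below is not the analysis used in the reanalysis of the RPP; it is a deliberately simple, hypothetical example in which each subject either improves or does not, H0 fixes the success rate at 0.5, and H1 spreads its prior evenly over all rates from 0 to 1. The Bayes factor is written here as the ratio of the probabilities of the data under the two hypotheses, p(D|H1)/p(D|H0), which equals the ratio of the posterior probabilities when both hypotheses start with equal prior odds. The threshold of 10 for strong evidence follows the text; the data and function names are invented.

```python
from math import comb


def bayes_factor_10(successes: int, trials: int) -> float:
    """Bayes factor for H1 (any success rate, uniform prior) over H0 (rate = 0.5)."""
    # p(D | H0): binomial probability with the success rate fixed at 0.5
    p_data_h0 = comb(trials, successes) * 0.5 ** trials
    # p(D | H1): binomial probability averaged over a uniform prior on the
    # rate, which integrates to exactly 1 / (trials + 1)
    p_data_h1 = 1.0 / (trials + 1)
    return p_data_h1 / p_data_h0


def verdict(bf10: float, strong: float = 10.0) -> str:
    """Classify the evidence using a threshold of 10 for 'strong'."""
    if bf10 > strong:
        return "strong evidence for H1"
    if bf10 < 1.0 / strong:
        return "strong evidence for H0"
    return "inconclusive"


# The same 70% success rate, observed in a small and a large hypothetical study.
for successes, trials in ((14, 20), (140, 200), (100, 200)):
    bf = bayes_factor_10(successes, trials)
    print(f"{successes}/{trials}: BF10 = {bf:.3g} -> {verdict(bf)}")
```

In this toy setting the small study leaves the two hypotheses nearly tied even though 70% of its subjects improved, whereas the large study with the same proportion is decisive, which is the sense in which a small study can be neither true nor false but simply uninformative.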
And there is still the caveat that I mentioned earlier: the RPP groups tested only a single, statistically significant result in each paper they examined. The analysis of Etz and Vandekerckhove necessarily suffers because of this defect as well.
7.