7.B What Is Reproducibility?
The frantic tone of the newspaper headlines implies that reproducibility is the highest goal of science—that the main purpose of scientific inquiry is to generate reproducible results and that the more reproducible the result, the better the science.
This isn't true. As science tries to learn about the world, its objectives include acquiring and interpreting data, explaining its findings, and making predictions of how future experiments will turn out. Reproducibility is a scientific virtue, a core principle that helps science achieve its goals; it is a highly desirable property of research; it is not the main thing. Let's start by defining “reproducibility.”
7.B.1 What Are We Trying To Reproduce?
As Jon Lorsch, the Director of the National Institute of General Medical Sciences at the National Institutes of Health (NIH), has noted, “‘Reproducibility’ is short-hand for a lot of problems.”9 In this discussion, I assume that we're referring to the ability of investigators to repeat findings reported in a published scientific paper. Since the immediate practical concern is that irreproducible science is bad science, we'll have to be very specific about what reproducibility means. A task force of the American Society for Cell Biology (ASCB)10 distinguished four kinds of reproducibility: analytic reproducibility, which tries to duplicate original conclusions by reanalyzing original data; direct reproducibility, which tries to get the same experimental results using the same experimental conditions as in the original report; systematic reproducibility, which tries to get the same results as the original study under experimental conditions that differ from the original ones; and conceptual reproducibility, which uses new experimental approaches and aims “to demonstrate the validity of a concept or finding using a different paradigm.”
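The four categories can be read as a small decision table. The sketch below is only an illustration of that reading: the three yes/no questions and all of the names (`ReproducibilityKind`, `classify_replication`, and its parameters) are assumptions introduced here, not part of the ASCB report.

```python
from enum import Enum


class ReproducibilityKind(Enum):
    ANALYTIC = "analytic"
    DIRECT = "direct"
    SYSTEMATIC = "systematic"
    CONCEPTUAL = "conceptual"


def classify_replication(collects_new_data: bool,
                         same_conditions: bool,
                         same_approach: bool) -> ReproducibilityKind:
    """Map a replication attempt onto one of the four kinds (hypothetical helper)."""
    if not collects_new_data:
        # Reanalyzing the original dataset without running new experiments.
        return ReproducibilityKind.ANALYTIC
    if not same_approach:
        # A new experimental approach or paradigm tests the underlying concept.
        return ReproducibilityKind.CONCEPTUAL
    if same_conditions:
        # Same approach, same conditions: an attempt to duplicate the result.
        return ReproducibilityKind.DIRECT
    # Same approach, different conditions (for example, another species).
    return ReproducibilityKind.SYSTEMATIC


if __name__ == "__main__":
    print(classify_replication(True, False, True).value)
```

Only the analytic and direct branches ask whether the original result itself holds up; the other two branches test how far the finding extends.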
Analytic and direct reproducibility efforts try to confirm the basic facts of your study; they ask whether or not someone else would observe the same effects that you did. They are probably what come to most people's minds in this context, and indeed, they are the sources of the greatest anxiety. In contrast, systematic and conceptual reproducibility tests really seek to extend the applicability of your findings, not to duplicate them. The following example illustrates the distinctions among these concepts.
Imagine that, while investigating the neurophysiological basis of emotion in mammals, you discover a drug, CalmDown, that reduces anxiety behavior in mice. That is, if you give animals CalmDown, they readily venture out into the middle of a brightly lit (i.e., potentially dangerous) open area, which they wouldn't normally do. You report your findings and put forward the hypothesis that CalmDown decreases behavioral stress in mammals. If another investigator tries to check your conclusions by reanalyzing your raw dataset (which you dutifully posted online), she is testing your study for analytic reproducibility. Analytic reproducibility trials check whether you did your calculations correctly and whether your data support your conclusions. Indeed, one view of scientific “evidence” or “facts” is that they represent interpreted data, and reanalysis can help ensure that your interpretations of the data are reliable. A major part of the controversy regarding the Reproducibility Crisis hinges on questions of analytic reproducibility. We should also note here that interpretation almost always has a subjective dimension, and two people can disagree about what data mean without implying that somebody did something wrong.
The data themselves are the focus of interest when it comes to direct reproducibility. If another investigator gets a group of mice and a supply of CalmDown and duplicates your exact procedures as faithfully as possible, he is doing a test of direct reproducibility. Direct reproducibility is the most critical for the validity and integrity of the scientific record; it causes the biggest problems when it fails because the reasons for its failure may be subtle and elusive.
If yet another investigator were to test CalmDown on hamsters, rather than on mice as you did, and found that it didn't affect hamsters, this would be a failure of systematic reproducibility. Basically, she would have tested a prediction of your hypothesis, that CalmDown would affect all mammals, and falsified it. Although this result would be disappointing if you'd dreamed of having discovered the next Valium, it would not show that your finding was irreproducible. After all, the new experiments did not duplicate your conditions: hamsters are not mice, and it is entirely possible that the drug would affect one species and not another. Your hypothesis would have failed a test of generality, but there is nothing blameworthy in that.
Finally, imagine that, reasoning from your hypothesis that CalmDown reduced mouse stress, other investigators confirmed your behavioral experiments on mice and then, inferring that CalmDown should have lowered the animals' stress hormone levels, measured those levels. This would test your hypothesis of CalmDown's actions for conceptual reproducibility. If CalmDown did not alter the stress hormone levels, the results would falsify a prediction of the hypothesis, suggesting that the hypothesis was wrong. Again, this would not mean that your study was irreproducible; indeed, in this scenario, your data were actually reproduced. Falsifying your hypothesis would not be a sign that there was anything amiss; rather, it might set off a new inquiry to find out how CalmDown did affect the mice. Maybe it caused stressed-out mice to behave recklessly, without reducing their anxiety, for example. Again, nothing to get upset about.
Science progresses by testing both the generality and the scope of its hypotheses, which is what systematic and conceptual testing do. Because these tests aim to extend original findings by testing hypotheses, their failures probably mean that nature is more complicated than we thought it was. Their failures give us positive new information (our brilliant idea is incorrect!); they point to opportunity, not error, as Stuart Firestein emphasizes (Chapter 10.A).
Thus, there is no contradiction, in fact no problem, if we succeed in directly reproducing a result while failing to stretch it systematically or conceptually. The distinctions between the various meanings of “reproducibility” have an impact on the debate, and I will follow the ASCB's lead and concentrate on direct reproducibility when trying to decide whether there's a crisis or not.
7.B.2 Reproducibility: A Multipronged Challenge
A failure of direct reproducibility does not imply scientific misconduct,11 although that distressing issue is always with us.12 Remarkably enough, irreproducibility can frustrate scientists trying to reproduce their own work.13 There are many causes of irreproducibility, and learned societies, including the American Association for the Advancement of Science, the National Academy of Sciences, and the Federation of American Societies for Experimental Biology, have convened panels and proposed strategies to attack the problems.14 Others have offered technical recommendations for addressing weaknesses in analytical laboratory standards, testing methods, reporting requirements, and so on.15,16 Not all of the issues are relevant to scientific thinking, which is my main focus, so I propose to group and triage them.
Broadly speaking, the concerns fall into three categories: material, qualitative, and procedural. The material category includes the physical items that science uses and which, if unreliable in consistency, validity, quality, etc., can hamstring reproducibility efforts. Two frequent sources of material problems are antibodies17 and “cell lines.”18 Antibodies are biological molecules that, ideally, stick with great specificity only to their target proteins and are often used to pinpoint the location or interfere with the function of these proteins in biological tissue. Unselective antibodies stick to the wrong proteins, which leads to incorrect and irreproducible results.
Likewise, many biological experiments use in vitro populations of particular cells (cell lines), say kidney cells, which are sometimes obtained from tissue banks or commercial suppliers. Experiments done on contaminated or misidentified cell lines can be irreproducible because, for example, kidney cells and heart cells do not respond in the same way to experimental treatments. If you did experiments on heart cells while believing them to be kidney cells, your reported findings could hardly be replicated. Material-related problems are predominantly quality control and documentation matters that do not affect scientific thinking, and we can set them aside.
Reproducibility is also hampered by qualitative issues that include such intangibles as the “culture of science,” the “scientific reward system,” and cognitive biases, as well as matters of scientific practice governing how we do experiments and publish our reports.19,20 Even if identifying the remedies for some of these problems were straightforward (and it isn't), putting them into place would require the concerted effort of the scientific community. For example, when investigators do not include enough experimental detail in their reports, their colleagues naturally have difficulty reproducing their data. But, given that scientists have to apply for the money to do their experiments and satisfy editors who publish their papers, granting agencies and scientific journals can bring about change by demanding more information; and, indeed, many are beginning to do so. Scientists, especially mentors, peer reviewers, and journal editors,21 will exert a major influence on how quickly and thoroughly the community recognizes and corrects well-defined problems like this one.
Other qualitative problems present much greater challenges, either because we do not know much about them or because they are so tightly enmeshed with political and social considerations that they might not be solvable. This category includes various biases (confirmation and publication bias are frequently talked about) that affect how scientists carry out and communicate about science. Bias is unquestionably important, but since it mainly involves psychological and other cognitive factors that are not uniquely linked to the problem of reproducibility, I'll put off discussing it until Chapter 11.
This leaves procedural issues, which are closely related to scientific thinking and reproducibility, for us to consider here. These are matters relating to the evidence—the data and theoretical analyses—that can help determine whether reproducibility is a crisis, a problem, or something else.22 Grappling with procedural issues will give us a better understanding of the Reproducibility Crisis and also illustrate some of the subtleties of science that affect how we think about it.
7.B.3 Is There a Problem?
While there is universal agreement that scientists cannot always reproduce published reports, there is disagreement about the causes of irreproducibility and how common it is. The experiences of the Bayer and Amgen groups are worrisome but might not be typical: confidentiality agreements between the companies and investigators in the original research laboratories meant that not all of the details were disclosed to the public.
How big is the perceived problem? To get a sense of it, I did a PubMed search of scientific literature between January 2014 and March 2018 on “reproducibility crisis or irreproducibility crisis or replicability crisis” and found 156 articles that mention at least one of the terms. The same search on the website of the premier science journal Nature (www.nature.com) came up with 261 items (articles, editorials, commentary, etc.). In addition, in May 2016, Nature conducted a survey of its readers that drew 1,576 responses.23 When asked, “Is there a ‘Reproducibility Crisis’?” 90% of the respondents said that there was either a “significant” (52%) or “slight” (38%) crisis, and only 7% felt that there was “no crisis at all.” (Nature’s definition of “reproducibility” was essentially “direct reproducibility,” as defined earlier.) The readers' specialties ran the gamut from physics and chemistry to biology, medicine, and “other.” When asked to estimate the percentage of published work in their field that was reproducible, physicists and earth scientists said 80%; chemists and biologists, 70%; biomedical scientists, 50%; and other, 30%. When I posed the identical question to hundreds of biological scientists in an online survey in late 2017, I also found substantial, though slightly lower, levels of concern (Chapter 9). Nature is aimed at a scientifically trained audience, and I queried members of scientific societies, so even though neither sample was rigorously controlled, the results probably convey a fair sense of the community attitude: namely, that many scientists feel that a significant fraction of the published work in their field is not reproducible.
The Stanford University statistician John Ioannidis is convinced that 85% of published research is wrong24 and cites claims that the majority of research dollars are “wasted.”25 Statistician Katherine Button, with Ioannidis and colleagues, analyzed neuroscience studies and concluded that as many as 75% of them reported using procedures that were not rigorous enough to support the reported conclusions26; in particular, that the statistical power of the studies was much too low. Ioannidis and others argue forcefully that a crisis exists, yet their studies are not above criticism.27 For instance, Camilla Nord and colleagues28 reexamined the work that Button et al. had reported on and found that it was not a homogeneous group characterized by low statistical power. Nord et al. concluded that, although “on average” statistical power was too low, there was large and systematic variability in the degree of power, with some studies (notably “candidate gene studies”) having quite low power, while “many studies show acceptable or even exemplary power.” Nord et al. caution against broad-brush condemnations of scientific findings.
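Statistical power is the probability that a study detects an effect that really exists. As a rough illustration of why sample size matters so much in this argument, the sketch below computes the approximate power of a two-sided comparison of two group means using a normal approximation. The function name, the effect size of 0.5 standard deviations, and the group sizes are assumptions chosen for illustration; none of them is taken from Button et al. or Nord et al.

```python
from math import sqrt
from statistics import NormalDist


def two_sample_power(effect_size: float, n_per_group: int,
                     alpha: float = 0.05) -> float:
    """Approximate power of a two-sided, two-group comparison of means.

    Uses a normal (z-test) approximation; effect_size is the true difference
    between group means in units of the common standard deviation.
    """
    std_normal = NormalDist()
    z_crit = std_normal.inv_cdf(1 - alpha / 2)
    # Expected value of the test statistic when the effect is real.
    shift = effect_size * sqrt(n_per_group / 2)
    # Probability that the statistic lands beyond either critical value.
    return std_normal.cdf(shift - z_crit) + std_normal.cdf(-shift - z_crit)


if __name__ == "__main__":
    # Illustrative group sizes only.
    for n in (10, 20, 64):
        print(n, round(two_sample_power(0.5, n), 2))
```

Under these assumptions, a study with a few dozen animals per group would miss a moderate effect more often than it finds it, which is the kind of shortfall the power critiques describe. The same calculation also shows why power varies so widely between studies: it depends strongly on both the sample size and the size of the effect being sought.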
In fact, if you took the direst implications of the work of Ioannidis, Button, and colleagues at face value, you'd wonder how biological science can possibly know as much as it demonstrably does know. Take, for instance, brain studies. Basic neuroscience findings about the brain inform human brain surgery procedures, and, while brain surgery is not a perfected art, it is not as hit or miss as you might imagine it to be from the criticism of basic neuroscience research.
There are those who question the concept of a Reproducibility Crisis,29 pointing out that it can be extremely difficult to duplicate the exact conditions of a published study. For instance, in fields such as psychology, the precise social context constitutes an amalgam of potentially crucial features (interactions among the individual investigators, the particular environmental setting, etc.) that are probably impossible to duplicate. In general, cutting-edge research must necessarily confront unforeseeable complexity, and it's no surprise to find that scientists at the frontier can't at first guarantee that they've identified all of the pertinent variables. Moreover, an irreproducible result may lead the way to new discoveries, so irreproducibility is not invariably a barrier to scientific advance, as the example in Box 7.1 shows.
In any case, it is scientists, not nonscientists, who discover and rectify the errors that scientists make. This is the self-correcting property of science; finding and fixing mistakes are signs that all is well. Last, complaints about reproducibility are nothing new; they have been around for hundreds of years,34 and yet nobody denies that we have amassed a huge and rapidly expanding trove of genuine, reliable knowledge about the world. Many commentators think there is nothing to worry about.
Box 7.1 Irreproducibility Can Lead to Better Science
Among many other things, neuroscience wants to know how sensory systems (vision, hearing, touch, etc.) encode and decode information about the world. A group of sensory neuroscientists studies the rat muzzle whisker system. Rats are nocturnal animals. They have poor vision and detect the location and surface properties of objects by sweeping (“whisking”) their whiskers around as they explore their environment. This sensory system is accessible, well organized, and analogous in some ways to the human visual and touch systems. Hence, despite humans' lack of active, controllable facial hairs, we can learn about sensory information gathering and processing by studying rat whiskers. When they are whisked across an object, the whiskers vibrate more or less vigorously according to its roughness or smoothness. The sensory touch receptors associated with the whiskers send information via trains of electrical impulses (“action potentials,” somewhat like the 1's or 0's of a computer code) up to a specialized information processing center in the brain (the “barrel cortex”),30 which puts together an image of the animal's surroundings.
Researchers study the impulses that are triggered when the whiskers move, and a key question is, “How does the brain make sense of the impulses?” In particular, how does it decode the information about surface texture that is carried by the impulse trains? One prominent hypothesis31 held that the brain has a system of miniature neuronal oscillators, basically groups of neurons that act like tiny biological clocks. By comparing the frequency of the incoming impulses to the time kept by the internal clocks, the brain could figure out how fast the whiskers vibrated as they moved over an object's surface and, hence, what its texture was.
Intending to follow up on the oscillator hypothesis,32 Asaf Keller and colleagues could not observe the predicted correlations between whisker movements and impulse patterns33 (i.e., their test of direct reproducibility failed). Keller et al. then did systematic and conceptual tests of the predictions of the oscillator hypothesis. For instance, the hypothesis predicted that the tiny neural clocks should provide timing information whenever the whiskers were moving, not only when they contacted a rough surface. But Keller's group found that whiskers waving freely in the air did not activate the oscillators. This test of a key prediction of the hypothesis was much more informative than the test of direct reproducibility had been, and it implied that the hypothesis was false. This example shows that basic scientists, when confronted with irreproducible findings, are not at a dead end as the applied scientists at Amgen and Bayer had been.
Divorcing specific results (data) from a bigger picture (meaning) happens frequently in basic science because greater value is attached to being right in a larger sense than in merely making repeatable measurements. Indeed, at times it seems as if opposing groups avoid repeating each other's experiments directly. Perhaps it is more cost-effective not to, or perhaps they think that it will be more productive to get on with new experiments. In Box 7.2, I'll briefly review a famous basic science struggle in which reproducibility was not a deciding factor in the outcome.
So there are two sides to the debate, and we're left with the question, “Is there a crisis or not?” What actual evidence supports the conclusion that there is one? How is the reproducibility of scientific findings best assessed? The vantage point for approaching such questions is a branch of science that studies itself. This is “meta-science,” the scientific investigation of science. But meta-science is a form of science and, as such, is subject to critical examination (“meta-meta-science!”). How does background knowledge affect the ways in which a meta-science project is carried out or color its conclusions? What unproven assumptions must it take for granted?
The Reproducibility Project: Psychology35 (RPP) was a systematic attempt to inquire into the pervasiveness of irreproducible scientific findings. If the reports by the Bayer and Amgen scientists ignited the controversy over reproducibility, the conclusions of the RPP and of Ioannidis and colleagues have provided the fuel that sustains it.
7.