B Scientific Versus Statistical Hypotheses
The scientific hypothesis tries to explain a natural phenomenon. The statistical hypothesis is nonexplanatory; it is part of a mathematical, inferential, decision-making procedure that determines what your experimental results mean for your clever scientific hypothesis (i.e., whether the results could have occurred by random chance alone or whether they are consistent with the hypothesis).
The term “statistical hypothesis” is unfortunate because scientific and statistical hypotheses are so unalike. Why is this a problem? After all, words usually have multiple meanings, and this usually doesn't create difficulties. Typically, context helps to sort things out so that, for instance, the record industry executive, mob boss, boxer, and baseball player can all refer to a “hit” unambiguously. Scientists, however, test statistical hypotheses and scientific hypotheses in contexts that overlap, which makes it hard to tell them apart.
Statistical and scientific hypotheses share a couple of features: both are definite statements about the world that can be tested, and neither can be proved to be true, only false (or left in limbo). On the other hand, statistical and scientific hypotheses diverge in purpose, in conceptual content, and in the ways in which they are tested.
How are scientific and statistical hypotheses related? Hierarchically. The scientific hypothesis is a conjectural explanation of a natural phenomenon that strives for generality, even universality. We assess its truth indirectly by experimentally testing the logical predictions it makes, and, if they're false, we reject it. Scientific hypotheses predict how tests of statistical hypotheses will turn out.
Statistical hypotheses are tied to particular circumstances and datasets, not to universal truths. Hence, statistical hypotheses are subordinate to scientific hypotheses; in fact, statistical hypotheses are akin to scientific predictions.
We design an experiment to test a prediction, adopt a statistical hypothesis, collect data, run an appropriate statistical test, and interpret the results. Once its practical job of helping us evaluate the scientific hypothesis is over, the statistical hypothesis goes back into the analysis toolbox. Here's a quick example.
5. B.1 Statistical Versus Scientific Hypothesis: Example
For this example, I'll assume that we're in the null hypothesis significance testing mode (NHST) because it is in wide use and most of us are acquainted with it enough to get the gist of the argument. Soon I will step back and reexamine NHST and its shortcomings from a perspective you won't get from statistics class. (If a refresher of NHST would be helpful, please consult any introductory statistics textbook.)
Suppose that you've heard about the obesity epidemic that is sweeping the country1 and are worried about the effects of excess body weight, “metabolic syndrome,” on people's health. When you learn that there is a roughly parallel rise in obesity and in doughnut consumption,2 the scientific hypothesis occurs to you that “Obesity is caused in part by excessive doughnut consumption.”
To test your hypothesis, you recruit doughnut-seeking males (DSMs) from lists of “top consumers” at doughnut shops. (“Top consumers” amass bonus points—redeemable for more doughnuts—as rewards for their purchase records. I'm making this up.) For comparison, you recruit doughnut-avoiding males (DAMs) from “doughnut-eaters anonymous,” a support group for recovering doughnut addicts. One prediction that follows from your hypothesis is that “DSMs will weigh more than DAMs.”
To test your scientific hypothesis, you develop a statistical hypothesis, a null hypothesis, H0 that the “DSMs weigh about the same as DAMs.” Basically, you reformulate your scientific prediction as a statistical hypothesis. There is nothing explanatory about it. It concerns one possible consequence of doughnut eating and addresses the question: Are doughnut-eating habits associated with differences in male body weight or not?
We'll assume that your DAMs come from the general population of American males over the age of 20 and therefore have an average body weight of 195.7 ± 0.93 lbs (mean ± standard deviation).
If your DSMs weighed 275 ± 10 lbs and only approximately 5% of DAMs weigh that much or more, the probability of getting a group of such heavy DSMs by random chance alone would be small; p < 0.05. In line with the usual convention for biology, you'd reject H0.
In rejecting H0, you are saying that your statistical hypothesis of no significant difference between DSMs and DAMs was wrong. According to your test and the convention for significance testing that you adopted, the groups do differ in weight. And that's all you get from the statistical hypothesis; it tells you nothing specific about your scientific hypothesis. The information from your statistical test does allow you to conclude that the prediction of your scientific hypothesis was correct and that, therefore, your data were consistent with your scientific hypothesis.
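The decision rule in this example can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the body weights are invented, the function name `one_tailed_z_test` is hypothetical, and the test uses a normal approximation to the difference in sample means (with groups this small, a t-test would usually be the more appropriate choice).

```python
from math import sqrt
from statistics import NormalDist, mean, stdev

# Hypothetical body weights in pounds, invented purely for illustration.
dsm_weights = [268.0, 281.0, 275.0, 290.0, 262.0, 277.0, 284.0, 271.0]
dam_weights = [192.0, 201.0, 188.0, 197.0, 205.0, 190.0, 199.0, 194.0]

ALPHA = 0.05  # conventional significance threshold


def one_tailed_z_test(treated, control):
    """Return (z, p) for H0: mean(treated) is not greater than mean(control).

    Assumption: the difference in sample means is approximately normal.
    """
    standard_error = sqrt(
        stdev(treated) ** 2 / len(treated) + stdev(control) ** 2 / len(control)
    )
    z = (mean(treated) - mean(control)) / standard_error
    p = 1.0 - NormalDist().cdf(z)  # upper-tail probability under H0
    return z, p


z, p = one_tailed_z_test(dsm_weights, dam_weights)
reject_h0 = p < ALPHA
print(f"z = {z:.2f}, p = {p:.3g}, reject H0: {reject_h0}")
```

The output is a test statistic, a p-value, and a yes-or-no decision about H0. Nothing in it mentions doughnuts or obesity; connecting the decision back to the scientific hypothesis is a separate step.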
Incidentally, in putting forward this problem, I assumed that eating too many doughnuts would lead to weight gain (i.e., DSMs would be heavier than DAMs). It made sense given what I suspect about the nutritional properties of doughnuts. Therefore, I considered a one-tailed significance test, sometimes called a directional test, that essentially bets that any deviation from H0 can only be either greater than or less than the mean assumed by H0. In this case, we planned to reject H0 only if DSMs weighed more than DAMs. However, it is possible that compulsive doughnut consumption would trigger biochemical or neurological reactions (e.g., loss of vital nutrients, chronic nausea and diarrhea) that would have the opposite effect: DSMs could weigh less than DAMs. A one-tailed test of significance directed toward detecting only heavy DSMs would miss abnormally skinny DSMs. If you want to keep an open mind as to which way the chips will fall, you can do a two-tailed test that is sensitive to significant deviations in either the plus or minus direction with respect to H0. Basic scientists generally favor two-tailed tests, so they don't miss something unexpected.
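A small sketch may make the difference between the two kinds of test concrete. It assumes a test statistic z that follows a standard normal distribution under H0; the function names and the two example z values are hypothetical and chosen only for illustration.

```python
from statistics import NormalDist

STANDARD_NORMAL = NormalDist()  # mean 0, standard deviation 1
ALPHA = 0.05


def p_one_tailed_greater(z):
    """p-value for a directional test that only looks for z above zero."""
    return 1.0 - STANDARD_NORMAL.cdf(z)


def p_two_tailed(z):
    """p-value for a test that is sensitive to deviations in either direction."""
    return 2.0 * (1.0 - STANDARD_NORMAL.cdf(abs(z)))


# Hypothetical case 1: DSMs are moderately heavier than DAMs.
z_heavy = 1.8
# Hypothetical case 2: DSMs are unexpectedly lighter than DAMs.
z_skinny = -2.5

for label, z in [("heavier DSMs", z_heavy), ("lighter DSMs", z_skinny)]:
    one = p_one_tailed_greater(z)
    two = p_two_tailed(z)
    print(
        f"{label}: one-tailed p = {one:.4f} (reject: {one < ALPHA}), "
        f"two-tailed p = {two:.4f} (reject: {two < ALPHA})"
    )
```

In the first case the one-tailed test rejects H0 while the two-tailed test does not, which is the extra sensitivity gained by betting on a direction. In the second case the one-tailed test misses the abnormally skinny DSMs entirely, whereas the two-tailed test flags them.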
Note that we test the statistical hypothesis directly. We test the scientific hypothesis only indirectly, by testing the predictions that it makes. This is not hairsplitting; it is one of the chief distinctions between them, as it illustrates the linkage between the statistical hypothesis and the scientific prediction. Because it is long-established custom, I will refer to “statistical hypotheses” throughout the chapter; still, we should keep this distinction between the two kinds of hypothesis in mind. There are at least three further distinctions between them that we need to know about.
5. B.2 Essential Empirical Content
Statistical and scientific hypotheses also differ in their empirical content. Their empirical content refers to the things in the world that they refer to. The empirical content of a scientific hypothesis is essential because it determines how we understand and evaluate it; empirical content is not the same as numerical content. This point is probably obvious but is worth emphasizing. Consider the scientific hypothesis, “Obesity is caused in part by excessive doughnut consumption.” If you were to switch “excessive” to “occasional,” “consumption” to “avoidance,” or “doughnut” to “hot dog,” you'd have an entirely new hypothesis. Apart from substituting synonyms, changing any word in a scientific hypothesis alters what the hypothesis asserts, what predictions it implies, how we should test it, and, finally, how we decide whether to reject or accept it. The pivotal role of word meaning is what makes empirical content essential for a scientific hypothesis.
The statistical hypothesis, as part of a practical mathematical testing process, has no essential empirical content. The distinction between statistical and scientific hypotheses is like the distinction between numbers and the things you count with numbers. Once we have data and are ready to test a statistical hypothesis, its empirical elements—the terms that tell us what it is about—become irrelevant.
You can strip away the word meanings and still do the testing.
In the example, you can answer questions such as, “do their body-weights (of DSMs and DAMs) differ significantly or not?” without knowing anything about weights. You can freely alter the external referents of the words without affecting either which tests would be appropriate for the statistical comparison or the results themselves. If you took the same numbers and declared that they represented the counts of green scales on the bodies of space aliens from Planet Zeflon as compared with those from Planet Yorlou, your statistical test would dutifully spit out a significance level for the difference. You could state with probabilistic certainty whether or not Zeflonians are scalier than the Yorlouese. Of course, while the concepts expressed in the statistical hypothesis are inessential for testing it, or for interpreting the results, they do determine how seriously we take the outcome.
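The point that the labels are inessential can be shown directly in code. In the sketch below, the numbers are invented, the helper names `two_tailed_p` and `report` are hypothetical, and the comparison again relies on a normal approximation rather than any particular test named in the text.

```python
from math import sqrt
from statistics import NormalDist, mean, stdev


def two_tailed_p(group_a, group_b):
    """Two-tailed p-value for a difference in means (normal approximation)."""
    standard_error = sqrt(
        stdev(group_a) ** 2 / len(group_a) + stdev(group_b) ** 2 / len(group_b)
    )
    z = (mean(group_a) - mean(group_b)) / standard_error
    return 2.0 * (1.0 - NormalDist().cdf(abs(z)))


def report(label_a, label_b, group_a, group_b, alpha=0.05):
    """The labels only decorate the message; they never enter the arithmetic."""
    p = two_tailed_p(group_a, group_b)
    verdict = "differ" if p < alpha else "do not differ"
    return p, f"{label_a} and {label_b} {verdict} significantly (p = {p:.3g})"


# One set of invented numbers, used twice under different descriptions.
first_group = [12, 15, 11, 14, 16, 13, 15, 12]
second_group = [10, 12, 9, 13, 11, 10, 12, 11]

p_weights, message_weights = report("DSMs", "DAMs", first_group, second_group)
p_scales, message_scales = report("Zeflonians", "Yorlouese", first_group, second_group)

print(message_weights)
print(message_scales)
print("Same p-value under both descriptions:", p_weights == p_scales)
```

The test cannot tell body weights from scale counts. Whether the result deserves to be taken seriously depends entirely on what the numbers stand for, which the code never sees.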
The saying “garbage-in, garbage-out” originated in the early days of the computer revolution to remind users that powerful computers give silly answers when loaded up with silly data or silly programs. But the warning also applies to ordinary statistical testing. If the initial assumptions or the numbers are faulty, or the wrong test is chosen, statistical testing will generate a wrong answer as readily as a right one. And this, too, is because the statistical hypothesis itself is devoid of essential empirical content.
5. B.2 Testing Scientific and Statistical Hypotheses
A corollary to the preceding argument is that a statistical hypothesis can only be tested mathematically. No independent, nonmathematical criteria can be applied to evaluate its truth value, unlike a scientific hypothesis, which has an irreducible empirical foundation and logical consequences that open it up to testing in a myriad of ways. To test the doughnut hypothesis, we could assign groups of people randomly to include or avoid doughnuts in their diets, construct longitudinal dietary histories, compare the metabolic status of doughnut-seekers and doughnut-avoiders, perform controlled laboratory experiments to measure biochemical responses to doughnut consumption, and so on.
5. B.3 Relationship to the External World
In a sense, because of its fundamentally mathematical nature, the statistical hypothesis exists in an abstract world apart from the external world of the scientific hypothesis. You might object that the scientific hypothesis often refers to unobservable underlying mechanisms. Aren't these abstractions, too? They are. The distinction becomes clearer when we break down the process of testing a statistical hypothesis. As we'll review in detail later, when testing our statistical hypothesis, we compared the measured body weight data of DSMs and DAMs to imaginary, theoretical populations of body weights to look for possible untoward effects of habitually overindulging in doughnuts. We asked if both experimental sample groups were likely to have come from this same imaginary population. And, regardless of the outcome of testing the statistical hypothesis, we wouldn't learn anything about the external-world reasons for any differences between the experimental groups.
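One way to make the "same imaginary population" idea tangible is a permutation test, sketched below. This is a swapped-in technique, not the procedure described in the text: instead of a theoretical distribution, it builds the imaginary common population by pooling both samples and reshuffling the group labels. The data, the function name `permutation_p_value`, and its parameters are all assumptions made for illustration.

```python
import random


def permutation_p_value(group_a, group_b, n_shuffles=5000, seed=0):
    """Estimate how often a random relabeling of the pooled data yields a
    difference in means at least as large as the one actually observed."""
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    n_a = len(group_a)
    observed = sum(group_a) / n_a - sum(group_b) / len(group_b)

    pooled = list(group_a) + list(group_b)  # the imaginary common population
    at_least_as_extreme = 0
    for _ in range(n_shuffles):
        rng.shuffle(pooled)
        shuffled_a, shuffled_b = pooled[:n_a], pooled[n_a:]
        diff = sum(shuffled_a) / len(shuffled_a) - sum(shuffled_b) / len(shuffled_b)
        if diff >= observed:
            at_least_as_extreme += 1

    # Add-one correction keeps the estimate strictly above zero.
    return (at_least_as_extreme + 1) / (n_shuffles + 1)


# Hypothetical body weights in pounds, invented purely for illustration.
dsm_weights = [268.0, 281.0, 275.0, 290.0, 262.0, 277.0, 284.0, 271.0]
dam_weights = [192.0, 201.0, 188.0, 197.0, 205.0, 190.0, 199.0, 194.0]

p_different = permutation_p_value(dsm_weights, dam_weights)
p_identical = permutation_p_value(dam_weights, dam_weights)

print(f"DSMs vs DAMs: p = {p_different:.4f}")
print(f"DAMs vs DAMs: p = {p_identical:.4f}")
```

Whatever the p-values turn out to be, they say only how surprising the observed difference would be if both groups had been drawn from one pool. They are silent about why the groups differ.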
In contrast, the true test of a scientific hypothesis is how well it performs in the world and what it tells us about actual physical phenomena or events, not about mathematical abstractions. The scientific hypothesis may refer to unobservable entities or processes—electrons, force fields, “chemical imbalances in our brains,” etc. Nevertheless, as scientific realists, we believe that the unobservables in a scientific hypothesis do exist, and the test bed, the court of last resort, for a scientific hypothesis is empirical testing.4
If testing the statistical hypothesis revealed that DSMs were heavier than DAMs, the result would be consistent with our hypothesis. We might, therefore, be inspired to go beyond the data and infer something about the world. Since abnormally heavy body weights are associated with numerous health problems, maybe excessive doughnut-eating isn't good for you. Nonetheless, this conclusion would not follow from testing the statistical hypothesis alone. Instead, we would have to interpret the test results in a separate cognitive process that connects back to the scientific hypothesis. This is one example of how judgment enters into statistical testing—it is not a mechanical procedure that we can casually offload to a computer.