F The NHST Program: A Little of Both and a Lot of Shortcomings

By the middle of the twentieth century, many scientists realized that they needed a systematic, objective method of evaluating their experiments, but the conceptual foundations for Fisher's and for Neyman and Pearson's programs were barely compatible.

Rather than choose between them, textbook authors and researchers gradually settled on a merger of the two, which became NHST, even though NHST had critics from the start (one wanted to rename it “Statistical Hypothesis Inference Testing”²⁴; emphasis added).

From Fisher, we get null hypothesis testing, but we apply it for purposes that it was never intended for, and we do it in a rote, unthinking way that Fisher abhorred. We learn NHST as a “ritual,”²⁵ as a series of steps to be memorized and slavishly followed, not as a flexible tool for fostering informative insights. We presume that its dictates are meant to be obeyed, not judiciously considered. For these reasons, many of us lack a sophisticated appreciation of the strengths and limitations of statistical testing. Worse than that, because we learn to carry out NHST as an empty ritual, we are subject to a range of “delusions” that can contribute to the Reproducibility Crisis, as we'll see.

Many statistics textbooks directed at scientists essentially define H₀ as a nil hypothesis, which was not Fisher's intent; as noted earlier, we can specify a non-zero difference between groups as H₀. We use null hypothesis testing to make substantive “yes-no” decisions that it was not designed to make. And by misinterpreting Fisher, NHST ratifies the reliance on small samples; users of NHST seldom hesitate to draw sweeping conclusions from what he meant to be pilot studies. We ignore the Neyman-Pearson demand for large sample sizes and the need to estimate certain critical parameters. For example, we might guess that the variability in the distributions of the weights of DAMs and DSMs is the same, but since we don't know much about DSMs, our calculations may be only as accurate as the guess is right.

We rarely take the concepts of statistical errors and power seriously, and, in fact, many of us don't learn about statistical power or its importance.

In any case, we cannot meaningfully calculate the probabilities of α and β errors or statistical power without having substantive alternative hypotheses or adequate sample sizes. When it comes to alternative hypotheses, we have at most “H₁,” which simply states that “H₀ is wrong.” When we're testing a null hypothesis against an empty alternative, we're avoiding the deeper analysis of the scientific question and the need to formulate explicit alternative scientific hypotheses. If we only test a hypothesis of the deleterious effect of doughnuts by comparing DSMs and DAMs with H₀ of “no difference” versus H₁ of “difference,” we may be ducking the underlying biological (or social) issues involved. Testing H₀ might be a reasonable place to start if we have no information whatsoever, but we shouldn't consider it an end unto itself.
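To see why power cannot be computed against an empty alternative, here is a minimal sketch, assuming a two-sided, two-sample z-test with equal group sizes and a known common standard deviation. The function name `power_two_sample` and the effect sizes and group sizes in the loop are hypothetical illustrations, not values from the text. The point is only that the calculation has nowhere to start until H₁ names a specific difference `d` and the design names a sample size.

```python
from statistics import NormalDist

STD_NORMAL = NormalDist()  # standard normal: mean 0, standard deviation 1


def power_two_sample(d, n_per_group, alpha=0.05):
    """Approximate power of a two-sided, two-sample z-test.

    d           -- hypothetical standardized difference asserted by H1
    n_per_group -- observations in each group (equal sizes assumed)
    alpha       -- accepted probability of an alpha (false positive) error
    """
    z_crit = STD_NORMAL.inv_cdf(1 - alpha / 2)
    # Expected value of the z statistic if H1 is true.
    shift = d * (n_per_group / 2) ** 0.5
    # Probability that the statistic lands in either rejection region.
    upper = 1 - STD_NORMAL.cdf(z_crit - shift)
    lower = STD_NORMAL.cdf(-z_crit - shift)
    return upper + lower


# Hypothetical grid of alternatives and sample sizes.
for d in (0.2, 0.5, 0.8):
    for n in (10, 64, 200):
        power = power_two_sample(d, n)
        print(f"d={d:.1f}  n={n:>3}  power={power:.2f}  beta={1 - power:.2f}")
```

With `d` set to zero, the function simply returns `alpha`, which is one way of seeing that “H₀ is wrong” by itself gives the β error nothing to be calculated from.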

Null hypothesis testing can easily generate nonsensical conclusions because there is no such thing as absolute equivalence between two groups. With a sufficiently fine-grained analysis and large enough sample sizes, you can almost always find statistically significant differences between groups. Paul Meehl²⁶ first called attention to the flawed conclusions that can result from overlooking this fact. For example, do you think girls and boys have the same IQs? Get big enough groups of kids and test the null (nil) hypothesis with the most rigorous, finest-grained analysis possible and you are mathematically guaranteed to reject H₀ because their IQs will not be absolutely identical. Even if your hypothesis is directional—“Girls have higher IQs than boys”—you have a 50% chance of being right²⁷ since H₀ is certain to be rejected. So much for p < 0.05!
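The sample-size argument can be illustrated with a rough sketch, assuming a two-sided z-test on two equal groups and a true difference that is trivially small. The function name `p_at_expected_statistic` and the difference of 0.01 standard deviations are assumptions chosen for illustration. Rather than simulating data, the sketch evaluates the p-value at the test statistic you would expect on average.

```python
from statistics import NormalDist


def p_at_expected_statistic(d, n_per_group):
    """Two-sided p-value evaluated at the z statistic expected for a true
    standardized difference d with n_per_group observations per group."""
    z_stat = d * (n_per_group / 2) ** 0.5
    return 2 * (1 - NormalDist().cdf(abs(z_stat)))


tiny_difference = 0.01  # hypothetical: one hundredth of a standard deviation
for n in (1_000, 10_000, 100_000, 1_000_000):
    p = p_at_expected_statistic(tiny_difference, n)
    verdict = "reject H0" if p < 0.05 else "retain H0"
    print(f"n per group={n:>9,}  p={p:.3g}  {verdict}")
```

Under these assumptions the expected statistic crosses the 0.05 threshold at roughly 77,000 per group, even though a difference of that size would matter to no one.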

Conflating Fisher's and Neyman-Pearson's theories makes it hard to understand critical statistical principles.

Take, for example, the concept of the p-value itself; what exactly does it refer to? The p-value is the probability, calculated from the imaginary distribution of values that you expect if H₀ is true, of observing values at least as extreme as those in your sample; convention condones rejecting H₀ if the p-value falls below a preset threshold such as 0.05. If the mean weight of DSMs is 230 lbs, and this is significantly different from the population mean of 195.7 lbs at a p-value of [...]

[...] the substantial effect size between them, 0.52, is suggestive; you'd be tempted to think that they do differ. To take the next step, you can determine the 95% confidence interval for effect size (the calculation is less obvious than for the mean³⁸ but not difficult) and, in this case, you'd find that it ranges from 0.41 to 0.63. This means that you can be 95% confident that the true effect size is within this interval (i.e., that there is a noteworthy difference between American and Swedish women's heights).

Note that this confidence interval does not include zero. This fact is significant because it means that you can be 95% confident that there is indeed a genuine difference between the groups. If zero fell within your confidence interval, you could not exclude the possibility that there was zero difference between them (i.e., that the two groups are really the same). This is like rejecting the null hypothesis of no difference between the groups at p < 0.05.
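A minimal sketch of the interval calculation follows, assuming the common large-sample approximation for the standard error of a standardized mean difference. The source does not say which formula or sample sizes were used, so the function names and the group sizes of 650 are hypothetical; the sizes were picked only so that the output lands close to the interval quoted above.

```python
from statistics import NormalDist


def effect_size_ci(d, n1, n2, confidence=0.95):
    """Approximate confidence interval for a standardized mean difference d,
    observed in two groups of sizes n1 and n2."""
    # Large-sample approximation to the standard error of d (an assumption).
    se = ((n1 + n2) / (n1 * n2) + d ** 2 / (2 * (n1 + n2))) ** 0.5
    z_crit = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    return d - z_crit * se, d + z_crit * se


def excludes_zero(interval):
    """True if zero lies outside the interval, i.e. 'no difference' is excluded."""
    low, high = interval
    return low > 0 or high < 0


# Effect size 0.52 is from the text; the group sizes are hypothetical.
big_study = effect_size_ci(0.52, 650, 650)
print(f"{big_study[0]:.2f} to {big_study[1]:.2f}, excludes zero: {excludes_zero(big_study)}")

# A hypothetical small study of a weak effect, for contrast.
small_study = effect_size_ci(0.10, 50, 50)
print(f"{small_study[0]:.2f} to {small_study[1]:.2f}, excludes zero: {excludes_zero(small_study)}")
```

Checking whether the interval excludes zero plays the same role as the p < 0.05 decision, while the interval itself also reports how large the effect plausibly is.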

The grand conclusion here is that you can get the same kind of information from effect sizes and confidence intervals as you get from the typical p-value tests while steering clear of the deficiencies of NHST. And, as a bonus, you get an estimate of the actual magnitude of the effect you're interested in. Nevertheless, even with effect sizes and 95% confidence intervals, you have a 5% chance of being wrong. You could reduce this risk, which is analogous to an α error, by widening the confidence interval, say to 99%, which would decrease the chance of wrongly concluding that the groups differ (“α error”) to 1%, but, as you expect by now, this would simultaneously increase the chance of failing to pick up a genuine difference between them (“β error”). There are no magic shields to protect you against error.
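The trade-off between the two kinds of error can be made concrete with a small sketch, assuming a normally distributed effect estimate with a known standard error. The function name `miss_probability`, the true effect of 0.3, and the standard error of 0.1 are hypothetical values chosen so that the difference between the two confidence levels is visible.

```python
from statistics import NormalDist

STD_NORMAL = NormalDist()


def miss_probability(true_effect, se, confidence):
    """Probability that the confidence interval still contains zero even
    though the effect is real (the beta error), for an estimate whose
    standard error is se."""
    z_crit = STD_NORMAL.inv_cdf(1 - (1 - confidence) / 2)
    shift = true_effect / se  # expected z statistic under the true effect
    return STD_NORMAL.cdf(z_crit - shift) - STD_NORMAL.cdf(-z_crit - shift)


# Hypothetical true effect and standard error.
for confidence in (0.95, 0.99):
    alpha = 1 - confidence
    beta = miss_probability(true_effect=0.3, se=0.1, confidence=confidence)
    print(f"{confidence:.0%} interval: alpha={alpha:.2f}  beta={beta:.2f}")
```

Under these assumptions, moving from a 95% to a 99% interval cuts the α error from 5% to 1% but more than doubles the chance of missing the real effect.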

In summary, there are problems with, and possible fixes for, the frequentist testing program. Given the inertia built up over decades of using NHST, it will probably remain the default standard for a time, although there are moves afoot to diminish its prominence, if not abolish it altogether.³⁹ As an alternative to NHST, Bayesian methods are making inroads into experimental sciences in addition to sociology, as we'll see in Chapter 6.

Source: Alger, Bradley E. Defense of the Scientific Hypothesis: From Reproducibility Crisis to Big Data. Oxford University Press, 2020. 449 p.
