E Frequentist Statistics: The Rocky Road to NHST

Scientists are taught the NHST strategy as a natural and intuitively obvious statistical framework; they are not exposed to the messiness at its core.¹³ Although we learn it as a coherent subject, the NHST program is an “inconsistent hybrid” system¹⁴⁻¹⁶ that was kludged together from two rival schools of thought.

The founders of these schools, Sir Ronald A. Fisher, and Jerzy Neyman and Egon Pearson, disagreed sharply about fundamental principles. Their unresolved dispute continues to cause trouble for our comprehension of scientific procedures today.

5.E.1 Fisher and Null Hypothesis Testing

Fisher (1890-1962), who invented the concept of testing the null hypothesis, was a geneticist who was initially interested in answering questions such as, “does this manure really help the potatoes grow?” He was at a well-established agricultural station and had extensive records to work with. At the time, there was no systematic way of figuring out if a particular fertilizer mixture made a difference to plant growth. Fisher's insight was to put forward and test a null hypothesis, H₀ (i.e., the statistical hypothesis that there is nothing beneficial about a given mix). Basically, H₀ said that the same amount and quality of potatoes would grow in both treated and untreated fields. The null hypothesis, the one that may be nullified by your experiment, is technically distinct from a nil hypothesis (i.e., one that says that there is zero difference between your sample and the population you are testing it against). In fact, the null is not necessarily a nil (you could test for a definite, non-zero difference between the two groups); however, it often is, and we won't distinguish between them.

If you were following Fisher's procedure, you would start with the assumption that some biological parameter of the potatoes, say their weight, is distributed according to a normal distribution in an ideal, imaginary population of untreated potatoes.

You estimate the properties of this ideal population by taking a sample of untreated potatoes and comparing it with a treated sample. You randomly sample both groups so that potentially influential variables are scattered across them and do not bias the outcome. Your H₀ would be that your experimental sample of manure-treated potatoes could have come from the untreated population. If the treated potatoes weighed about the same as the untreated ones, you would not reject H₀ because you'd have no reason to think that the manure did anything. In contrast, if you'd expect to encounter potatoes as heavy as your sample's in an untreated population only very rarely, you'd suspect that the fertilizer did something and you'd reject H₀.
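The comparison described above can be sketched numerically. This minimal Python example assumes, purely for illustration, that the mean and standard deviation of the untreated population are known and that weights are normally distributed. The function name and every number in it are hypothetical; none of them come from Fisher's records.

```python
from math import sqrt
from statistics import NormalDist

def one_sided_p_value(sample_mean, n, null_mean, null_sd):
    """Probability, under H0, that the mean of n untreated potatoes
    is at least as large as the observed treated-sample mean."""
    standard_error = null_sd / sqrt(n)          # spread of a sample mean under H0
    z = (sample_mean - null_mean) / standard_error
    return 1.0 - NormalDist().cdf(z)            # upper-tail area of the standard normal

# Hypothetical data: 16 treated potatoes averaging 160 g, compared with an
# untreated population assumed to have mean 150 g and SD 20 g.
p = one_sided_p_value(sample_mean=160.0, n=16, null_mean=150.0, null_sd=20.0)
print(f"p = {p:.4f}")
```

A small p here says only that potatoes this heavy would be rare if H₀ were true; whether that is rare enough to reject H₀ remains a judgment call.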

Fisher invented the significance test to answer the question, “should we reject H₀ or not?” In his scheme there is no explicit alternative hypothesis, only H₀, and what we reject is the conclusion that H₀ can explain our data. The ideal population of values assumed to exist under H₀ wouldn't change whether we reject H₀ or not. Despite there being no explicit alternative hypothesis, the implicit alternative is “not H₀.” If we reject H₀ we might decide to do a full-scale investigation. Testing H₀ for Fisher is, effectively, a pilot study, a screening procedure that you do with small samples, just to see if there might be anything worth investigating, not to answer meaty research questions.

How do you know whether or not a result is unusual enough to justify rejecting H₀? In the doughnut-eating example, we asked whether DSMs and DAMs differed in their body weight. H₀ was that they wouldn't differ. Fisher said we should pick a level of improbability that seems “significant”; he called the significance level “sig,” but nobody else does anymore: it is the “p-value.” Since the DAMs' weights are distributed according to a normal distribution, and we know a lot about the normal distribution, we can choose a range of values away from the mean as our significance level.

Fisher believed that values above or below 95% of the population would be interesting for many purposes.

For Fisher, there was nothing sacred about a significance level; how to interpret the results of a significance test was a judgment call, not something that followed a fixed rule. Even though encountering a group like your sample of heavy DSMs by random chance alone would be unusual, it would not be impossible. Under Fisher's theory, rejecting H₀ simply means that you should consider the possibility that doughnut-seeking affects adult male body weight.

Each branch of science could adopt a convention for a significance level to reject H₀, but strict adherence to conventions should be avoided. Indeed, Fisher felt that we should report exact p-values (e.g., p = 0.0432) so that each reader could judge how important the results were. The exact probability is not the same as your significance level—if you base your decision to accept or reject H₀ on p < 0.05, then p < 0.05 is your significance level, not p = 0.0432. The smaller the p-value, the more unlikely the result would be and therefore the more significant the result. Fisher encouraged post facto analyses of the data because he considered data interpretation to be so important. For him, this would not introduce unhealthy bias because significance testing was only a preliminary exploratory step.

5.E.2 Neyman-Pearson Decision-Making

Neyman and Pearson thought that null hypothesis testing was meaningless; it was far less valuable to know that a statistical hypothesis was wrong than to know if it was right. Their program is about decision-making. In their scheme, there are always at least two hypotheses—usually H₁ and H₂—and you assume that one of them is correct before you do any testing. Statistics helps you decide which one it is. You test just one hypothesis, the one that you think is most likely to be right, and, given that one of the two has to be right, after the test you know by elimination which.

If you reject H₁, you accept H₂.

Let's look at the excessive doughnut-eating problem from the Neyman-Pearson perspective. Now your H₁ might be that “DSMs and DAMs have the same mean weight, 195.7 ± 0.9 lbs” and H₂ could be that “on average, DSMs weigh 30 lbs more than DAMs (225 ± 0.9 lbs).” These hypotheses presuppose that the body weights of males fall into the distributions of doughnut-seekers or of doughnut-avoiders, which characterize the two populations. In testing H₁ versus H₂ you are deciding whether your subjects came from the same or separate populations. Under Neyman-Pearson rules, you know the difference in the means of the two underlying distributions, which is called the effect size¹⁷ (we'll take up effect size in Section 5.G). If your statistical test shows the groups differ, then, unlike the case in Fisher's program, you also know by how much they differ on average.

5.E.2.a Decision-making according to Neyman-Pearson

Significance testing for the Neyman-Pearson program determines which hypothesis is likely to be correct. When you know which hypothesis to reject, you can then act as if the other is true. You do not have to believe it is necessarily true in every aspect or that it constitutes the whole story, but it is reasonable to act as if it were true. If you find that doughnut-eating is associated with heavier body weights, then you should take this into account when making health policy recommendations, even if you suspect that other factors might also be at work. Unlike Fisher, who saw significance testing as a preliminary tool, Neyman and Pearson made it a key factor in the conduct of research and decision-making. As such, the small sample sizes that sufficed for Fisher are unacceptable in their program; Neyman-Pearson theory demands large experimental samples to assure greater confidence in the results.

5.E.2.b Statistical Errors

Neyman and Pearson also originated the idea of analyzing errors and their consequences.

Statistical error, like sin, comes in at least two forms. There are errors of commission, commonly known as α, alpha, or Type I errors, and errors of omission, β, beta, or Type II errors. The α level determines the chance of wrongly rejecting your main hypothesis, H₁. If you reject H₁ when you shouldn't, you make an error of commission; you mistakenly conclude that the variable that you were investigating (e.g., seeking doughnuts) is associated with increased body weight when it isn't. The α level is superficially similar to Fisher's significance level or p-value but differs in crucial ways. Neyman-Pearson theory assumes that the experimenter will be doing multiple measurements and tests, and, therefore, α is the likelihood of committing the error of wrongly rejecting H₁ in the long run.

The converse error, the one you make if you don’t reject H₁ when you should, is a β error: you conclude that doughnut-seeking is not associated with heavier body weights when it actually is. A major feature in Neyman-Pearson experimental design is that these error levels must be decided on before doing the experiment because you're going to base meaningful, practical decisions on the outcome of the testing.
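The long-run reading of α can be illustrated with a small simulation. The sketch below repeats the same experiment many times in a world where H₁ is true and counts how often a one-sided test at α = 0.05 rejects it anyway. The mean of 195.7 lbs comes from the example above; the standard deviation, sample size, number of trials, and function name are assumptions chosen only for illustration.

```python
import random
from math import sqrt
from statistics import NormalDist, fmean

def long_run_rejection_rate(alpha, n, trials, seed=0):
    """Fraction of repeated experiments that wrongly reject a true H1."""
    rng = random.Random(seed)
    mu, sd = 195.7, 30.0  # mean from the text; the SD is an assumed value
    # One-sided cutoff for the sample mean, fixed before any data are drawn
    cutoff = mu + NormalDist().inv_cdf(1 - alpha) * sd / sqrt(n)
    rejections = 0
    for _ in range(trials):
        sample = [rng.gauss(mu, sd) for _ in range(n)]  # data generated under H1
        if fmean(sample) > cutoff:
            rejections += 1                             # an alpha (Type I) error
    return rejections / trials

rate = long_run_rejection_rate(alpha=0.05, n=10, trials=20000)
print(f"long-run false rejection rate: {rate:.3f}")
```

Any single experiment either commits the error or does not; it is only across many repetitions that the proportion of wrongful rejections settles near the chosen α.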

5.E.2.c Statistical Power

Neyman and Pearson thought that we should know if a test could correctly identify and accept a true H₂. The probability of correctly accepting H₂ is called statistical power, or just power, and is defined as 1 - β. This makes sense: the probability of incorrectly rejecting H₂, which is β, plus the probability of correctly accepting H₂, which is power, must equal 1. You've got to do one thing or the other (accept H₂ or reject it), so β + power = 1, and, rearranging things, power = 1 - β.

To assess statistical significance, you have to take into account how big a difference you're trying to detect and the chance of error you're willing to tolerate. And how much you can tolerate depends on what the error will cost you. The problem is that you can't avoid error altogether, and reducing the chance of making one kind of statistical error automatically increases the chance of making the other.

Although you can't escape statistical error, you do care which one is more likely because they come with varying costs. Fortunately, you are not totally helpless; you can influence how much each error will affect the results.

You could think of the situation like this: imagine you are sitting across from a colleague at one of those tiny cafe tables. Two hypotheses can account for control of the table top: H₁, “your space,” and H₂, “his space.” To claim the area you believe you are entitled to under H₁, you stake it out by subtly arranging your utensils, water glass, coffee cup, etc., along a border between you and him. Your colleague, acting under H₂, does the same until you both reach an equilibrium. The border is like your significance level; you accept that space beyond the border as not being within H₁. On the other hand, it is a small table; if he were not there, it would all be yours. With him there, your spatial claims overlap. Not staking out enough territory is like committing an α error: if, for example, you set your boundary so that you are 95% certain of claiming your actual territory, there is a 5% chance that you’re giving up control over area that is rightfully yours. In effect, you risk erroneously rejecting H₁. To reduce the chance of this mistake, you might expand your claim, nudging your utensils, water glass, etc., slightly forward, to encompass what you are 99% certain should be within your domain. This would decrease the possibility of understating the true extent of your H₁, and yet stretching your boundary could encroach on territory that is legitimately part of his H₂, thus potentially provoking his annoyance and retaliatory moves. Wrongfully claiming area for H₁ is like committing a β error by erroneously failing to accept H₂. Obviously, the harder you try to avoid making an α error, the greater your chance of making a β error. The balance you strike will depend on a cost-benefit analysis of the entire situation, including, in this case, the importance of remaining on speaking terms with your table mate.

Let’s see how this works experimentally. At the α < 0.05 level, you have a 1 in 20 chance of wrongly concluding that abnormal doughnut consumption has downsides. If this risk is unacceptably high (maybe denouncing doughnut-eating as unhealthy would threaten your state’s doughnut tax revenues, decrease doughnut industry employment, or unfairly stigmatize DSMs) and you want more assurance before drawing that conclusion, you can reduce the significance level of your test to α < 0.01; then your probability of incorrectly rejecting H₁ drops to 1 in 100. With this more stringent level you would expect to encounter fewer heavy DSMs by random chance alone. If your sample of DSMs weighed in with the heaviest 1% of the population, you could be pretty confident that they did not reflect normal variability. But your low α level means that you could miss detecting a genuine problem with doughnut-eating, a β error. If DSMs do gain weight, yet not enough to put them into the top 1%, then you’d incorrectly accept H₁ and infer that eating lots of doughnuts doesn’t affect body weight. If β error goes up, statistical power goes down because power = 1 - β. So, decreasing α error also decreases statistical power and, with it, your ability to identify a true H₂.
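The trade-off just described can be made concrete with a short calculation. The sketch below assumes a one-sided z-test in which the H₁ and H₂ distributions of the sample mean are both normal with the same spread, and in which the true difference between them equals 2.5 standard errors. That separation, like the function name, is a hypothetical value chosen for illustration.

```python
from statistics import NormalDist

STD_NORMAL = NormalDist()

def beta_and_power(alpha, shift_in_se):
    """Beta error and power of a one-sided z-test.

    shift_in_se is the true H2 - H1 difference in means, measured in
    standard errors of the sample mean.
    """
    critical_z = STD_NORMAL.inv_cdf(1 - alpha)        # rejection cutoff set under H1
    beta = STD_NORMAL.cdf(critical_z - shift_in_se)   # chance a true H2 falls below it
    return beta, 1 - beta

# Tightening alpha from 0.05 to 0.01 with everything else held fixed
for alpha in (0.05, 0.01):
    beta, power = beta_and_power(alpha, shift_in_se=2.5)
    print(f"alpha = {alpha:.2f}  beta = {beta:.3f}  power = {power:.3f}")
```

With the sample size and effect size held fixed, moving the cutoff to lower α pushes β up and power down, which is the pattern the paragraph above describes.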

5.E.2.d Errors, Power, and Convention

To be sure, there is still no such thing as a free lunch. While increased statistical power by itself would be a good thing, when power goes up, generally the chance of “finding” something that isn't there also goes up. When power drops, the chance of missing a real effect goes up. The best way to increase power while minimizing error is to increase the sample size, n. This, too, involves a balancing act, however. Increasing n is comparatively easy for physicists hunting the Higgs boson at the Large Hadron Collider who collect and analyze enormous masses of data automatically. It can be more difficult and costly for small-scale biology studies to increase sample sizes enough to meet optimal power requirements. In Chapter 7, we'll see that critics of neuroscience research¹⁸ blame tests with low statistical power for many of the woes of the Reproducibility Crisis.
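How sample size buys power can be sketched with the standard normal-approximation formula for a one-sided, one-sample z-test with known standard deviation. The 30 lb difference is taken from the doughnut example; the standard deviation of 40 lbs, the 0.8 power target, and the function names are assumptions made only for illustration.

```python
from math import ceil, sqrt
from statistics import NormalDist

STD_NORMAL = NormalDist()

def power_at(n, alpha, difference, sd):
    """Power of a one-sided, one-sample z-test with known SD."""
    shift_in_se = difference / (sd / sqrt(n))   # effect measured in standard errors
    return 1 - STD_NORMAL.cdf(STD_NORMAL.inv_cdf(1 - alpha) - shift_in_se)

def required_n(alpha, target_power, difference, sd):
    """Smallest n that reaches target_power under the same assumptions."""
    z_alpha = STD_NORMAL.inv_cdf(1 - alpha)
    z_power = STD_NORMAL.inv_cdf(target_power)
    return ceil(((z_alpha + z_power) * sd / difference) ** 2)

# 30 lb difference from the text; sd = 40 lbs is an assumed value.
for alpha in (0.05, 0.01):
    n = required_n(alpha, target_power=0.8, difference=30.0, sd=40.0)
    achieved = power_at(n, alpha, 30.0, 40.0)
    print(f"alpha = {alpha}: n = {n}, power = {achieved:.3f}")
```

Under these assumptions, a stricter α can be paired with the same power only by collecting more data, which is why the cost of increasing n matters so much in practice.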

Parenthetically, we note that Benjamin and colleagues¹⁹ propose lowering the p-value threshold in, for example, the biological sciences to p < 0.005 for “new discoveries” (not for “confirmatory or contradictory” effects). Chief among the anticipated advantages of this change would be a decrease in the false positive rate (i.e., the number of erroneous claims of discovery due to too lenient a significance level). Lakens et al.²⁰ respond that the drawbacks of decreasing the significance level across the board outweigh its advantages. Drastically reducing the p-value threshold (α level) will, among other things, increase the false negative rate (i.e., the number of genuine effects missed because of a too-conservative level). Rather than a one-size-fits-all approach, Lakens et al. favor flexibility, exhorting scientists to focus instead on rigorous, well-thought-out experimental design that encourages them to “justify your alpha” rather than see it as a goal in itself.

However this debate is resolved, it is clear that the level of α error that a given branch of science generally adopts is not purely arbitrary; there are strategic considerations. Much basic biological and social research tolerates a 1/20 (p < 0.05) chance of α error, whereas particle physicists won't announce the discovery of a new subatomic particle unless their chance of making an α error in identification is approximately 1/3,500,000 (“5 sigma”; i.e., p < 0.0000003). Biological scientists don't want to risk missing a new finding by being overly cautious at first, and the cost of an α error is often relatively low in their laboratory experiments. They may not want to miss possible harmful effects of, for example, doughnut consumption on human health, and consider the potential costs of erroneously telling people not to eat so many doughnuts to be acceptable.

On the other hand, particle physicists can't take the chance of erroneously accepting the existence of a phenomenon that could wreak havoc with their beautiful quantitative theories. As the Higgs boson is a cornerstone of the Standard Model of physics, physicists could not afford to be overeager in thinking they'd found it.²¹,²² The cost of an α error for physicists is high, and they choose an extremely low p-value threshold.
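The link between “sigma” and the probabilities quoted above can be checked with a few lines of Python. The sketch assumes the one-sided tail of a standard normal distribution, which is the convention that reproduces the roughly 1 in 3,500,000 figure; the function name is hypothetical.

```python
from math import erfc, sqrt

def one_sided_tail(sigmas):
    """P(Z > sigmas) for a standard normal variable Z."""
    # erfc keeps precision for very small tail areas
    return 0.5 * erfc(sigmas / sqrt(2))

for label, sigmas in (("about 1.645 sigma", 1.645), ("5 sigma", 5.0)):
    p = one_sided_tail(sigmas)
    print(f"{label}: p = {p:.2e}, about 1 in {round(1 / p):,}")
```

On this scale the conventional biological threshold of p < 0.05 sits at about 1.645 sigma, far more permissive than the physicists' 5 sigma.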

Box 5.1 Two Statistical Standards for Quality Control

Assume that a precision rotator gear is a key component in the small ultralight aircraft called “gyrocopters” that can be used by political protestors to land on the US Capitol lawn.²³ And assume that defective gears can fail without warning, thus imperiling pilots' lives and that, like every manufacturing process, gear-making is imperfect. Now, requiring acceptable gears to be perfect would be economically infeasible so, to stay in business, gear manufacturers balance the costs of defective products against the costs of achieving perfection. In this example, H₁ says that a given gear is good (i.e., that its properties are well within the variability of the population of acceptable gears); H₂ says that the gear is bad. The Quality Control Department in the gear manufacturing plant, encouraged by the attorneys in the Product Liability Department, may dictate a high α error rate for H₁, say α < 0.2 (two-tailed), and reject 1 gear in 5. In other words, to be on the safe side, they'll be conservative and reject some gears that would probably be all right in order to catch a high percentage of gears that are genuinely no good.

H₂ is set to identify definitely bad gears. Here, the Quality Control folks might call for a low β error rate, say β < 0.05, which translates to a power (1 - β) of the test of 1 - 0.05 = 0.95 (i.e., quite high given that a power of 0.8 is considered reasonable).²³ This high power would mean that there is an excellent chance of accepting H₂ (deciding that the gear is bad) when the gear really is bad. The properties of normal and defective gears still overlap, but having two standards for keeping bad gears out of the aircraft would build in a greater margin of safety thus saving lives, not to mention money for the gyrocopter company owners. And maybe they'll throw in a free parachute with each purchase, just to be on the safe side.

Note that the Neyman-Pearson approach, with its two independent hypotheses, H₁ and H₂, allows you to have separate α and β levels for each one and this, in turn, lets you design especially powerful tests. An example is given in Box 5.1.

5.E.2.f Summary: Fisher Versus Neyman-Pearson

Fisher:

• Null hypothesis testing for exploratory studies on small experimental samples

• Significance testing only on H₀, which you can only reject

• No alternative hypothesis

• Calculate and report exact probabilities so readers can determine the significance of results; lower p-values may signal more important outcomes

• Judgments based on experimental results; significance levels are only guidelines

• Post hoc analyses of the data are OK (studies are preliminary)

• No provision for α error, β error, or effect size

Neyman-Pearson:

• Decision-making program; significance testing determines which hypothesis to accept

• No null hypothesis; both hypotheses are substantive; one is assumed to be correct

• Testing reveals magnitude of the difference between hypotheses, the effect size

• Requires large samples and assumes repeated sampling

• Calculates the magnitudes of α and β errors and the statistical power of tests

• Details of analyses and α, β, and significance levels are set in advance of testing

• No post hoc analysis allowed

Source: Alger, Bradley E. Defense of the Scientific Hypothesis: From Reproducibility Crisis to Big Data. Oxford University Press, 2020. 449 p.
