C Why Statistics Matters

When you measure something, the measurement includes the true value that you're interested in, plus unknown variable factors that are usually combined and called error or noise. Remember trying to measure the pH of a solution in high school chemistry lab? You had to worry about whether the pH electrode was clean, what the room temperature was, whether you'd calibrated the meter correctly, whether your shaky hands had pipetted the correct amounts into the test tube, etc.

Variation in any of these factors could throw off the measurements from one trial to the next or explain why your lab partner's results did not agree with yours.

Scientists never get to study the entire population of the objects or events that they're interested in; they can only measure samples from the population and hope that their samples fairly represent the whole. Some fields have to cope with much wilder noise than others. In addition, the living things that biologists study vary intrinsically. Even identical twins are not exactly alike. Experimental physicists can exert great control over their experimental conditions and reduce noise to small values, which lets them make precise and accurate measurements. Regardless of the circumstances, noise is an unavoidable fact of scientific life.

Fortunately, much of the noise is not utterly unpredictable. It follows patterns, and we can often deal with it tolerably well with probability theory and statistics, the branch of mathematics that deals with noise. We use statistical reasoning to decide what a test shows and what its results mean. But that's getting ahead of the story. Let's start with one of the primal ideas in statistics: probability. Probability is a fundamental concept although it is not an easy one to grasp.

5.C.1 Frequentists and Bayesians

There are two major schools of statistics, the frequentists and the Bayesians, and they differ both practically and philosophically.

What you learned in science classes in school was almost certainly frequentist statistics: measures of central tendency (mean, median, mode), distributions of data, variance and standard deviation and how to compute these parameters, how to do t-tests, and the NHST procedures that I've mentioned. We are not going to worry about the procedural details, although they are important and are not always understood or properly used by scientists.⁵ Instead, we are going to consider the statistical concepts in the context of bigger ideas.

Gerd Gigerenzer⁶ makes the distinction between chance as “risk” or chance as “uncertainty.” (Be aware that “uncertainty” here has a narrower sense than it does elsewhere in the book.) Risk is what we face when we have a fixed set of alternative outcomes and can calculate their probabilities. Gambling in casinos is the classic case of engaging risk: you do not know exactly what will happen on any given turn of the wheel or roll of the dice, but, in the long run, you are guaranteed to lose money at a rate determined by the odds inherent in the particular game and, usually, also by the local government, which decides in advance how much it wants to be paid by the casino. After the government gets its take, the rest is left for the casino owners' profits and players' “winnings.” In West Virginia, for instance, odds of winning at the slot machines are adjusted so that enthusiasts can be assured of parting with, on average, about $20 of every $100 worth of quarters that they pour into the machines.⁷ The $80 is returned to them in the form of “jackpots” that the state, of course, then taxes. To calculate risk, we use the frequentist interpretation of probability.
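The slot-machine arithmetic can be written out in a few lines. This is only a rough sketch: the 80% return rate comes from the figure quoted above, while the function names and the idea of feeding all winnings back into the machine for several passes are assumptions added for illustration.

```python
RETURN_RATE = 0.80  # share of wagers paid back as "jackpots", per the figure above

def expected_loss(amount_wagered, return_rate=RETURN_RATE):
    """Average amount a player parts with for a given total wagered."""
    return amount_wagered * (1 - return_rate)

def expected_bankroll(start, passes, return_rate=RETURN_RATE):
    """Average money left if all winnings are fed back in `passes` times.

    Re-wagering everything on each pass is an assumed behavior, not a claim
    about how any real player or machine works.
    """
    return start * return_rate ** passes

print(round(expected_loss(100), 2))  # 20.0
for passes in range(1, 6):
    print(passes, round(expected_bankroll(100, passes), 2))
```

Under these assumptions the average bankroll shrinks by a fifth on every pass, which is what "risk" means here: any single pull is unpredictable, but the long-run rate of loss can be calculated in advance.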

In contrast, in many daily circumstances, there are no standard odds, no “risk” as defined above, and we are in a state of “uncertainty.” Consider this classic example: “There is a 30% chance of rain tomorrow.” What does that even mean? Or, supposing you knew what “chance of rain” meant, how would you go about determining what that uncertain chance actually was? It seems that we have to think about “probability” in other ways when dealing with risk and uncertainty.

There are statistical ways of interpreting and reckoning with uncertain statements, and they exist mainly within the branch of mathematics called Bayesian statistics that we'll get to in Chapter 6. For now, we'll stay with the frequentist interpretation and see where it leads.

5.C.2 Probability: A Frequentist Viewpoint

For frequentists, probability is a way of characterizing the real world. Events recur with a frequency which, in the long run, is their probability of occurrence. Properties of the world have definite values that, because of variability, are generally unknown. Under the frequentist interpretation, you can often calculate the exact value of a probability. If you know all of the elements of a group (the numbers of spots on the sides of a pair of dice, the percentage of females in a population, the incomes of American workers), you can calculate precise probabilities involving the elements. The chance of rolling a seven with a pair of dice (i.e., a total of seven spots showing on their upper surfaces when the dice come to a halt) is equal to the number of ways that the spots on the two dice can add up to seven divided by the total number of possible outcomes of rolling them. There are six such ways with two ordinary dice and 36 possible outcomes, so the chance of getting a seven with two dice is 6/36 (1/6). In this case, the sources of variability include countless unknown factors: the detailed contours of each die, the velocity and angle of the toss, the collisions the dice undergo, what they’re made of and what they land on, etc. Or, the chance of randomly picking an American worker with a very high income is equal to the number of workers in the very high income category divided by the total number of American workers.
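The dice calculation can be checked by brute force. The sketch below simply lists every outcome of two ordinary six-sided dice and counts the ones that total seven; the variable names are illustrative.

```python
from fractions import Fraction
from itertools import product

# Enumerate every equally likely outcome of rolling two ordinary six-sided dice.
outcomes = list(product(range(1, 7), repeat=2))

# Keep only the outcomes whose spots add up to seven.
sevens = [roll for roll in outcomes if sum(roll) == 7]

p_seven = Fraction(len(sevens), len(outcomes))
print(len(sevens), len(outcomes), p_seven)  # 6 36 1/6
```

The same counting approach works for any group whose elements you can list completely, which is the situation the frequentist calculation requires.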

It is important to stress, though, that you can only calculate precise probabilities in the long run (i.e., essentially, the infinitely long run); there is no guarantee that on any given trial or group of trials you’ll get an expected outcome.

Rolling two dice six times does not guarantee that a seven will show up on one of those rolls, and assuming that a seven is “due” just because you haven’t rolled one in a while is called the gambler’s fallacy. In the limit, however, sevens should show up on one-sixth of the total throws.
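A small simulation makes both points concrete. It is a sketch, not a proof: it assumes fair dice and independent rolls, and the seeds, the roll counts, and the choice of ten misses as a "drought" are arbitrary. The first calculation gives the exact chance that six rolls produce no seven at all; the rest compares short and long runs and checks whether a seven becomes more likely after a dry spell.

```python
import random

P_SEVEN = 6 / 36  # exact chance of a seven on one roll of two dice

# Exact chance that six rolls in a row produce no seven at all.
p_no_seven_in_six = (1 - P_SEVEN) ** 6
print(round(p_no_seven_in_six, 3))  # 0.335

def roll_is_seven(rng):
    """Roll two fair dice and report whether the spots total seven."""
    return rng.randint(1, 6) + rng.randint(1, 6) == 7

def seven_frequency(n_rolls, rng):
    """Fraction of n_rolls throws that came up seven."""
    return sum(roll_is_seven(rng) for _ in range(n_rolls)) / n_rolls

def seven_rate_after_drought(n_rolls, drought, rng):
    """Frequency of sevens on rolls that follow `drought` straight non-sevens."""
    misses = 0
    hits_after = trials_after = 0
    for _ in range(n_rolls):
        seven = roll_is_seven(rng)
        if misses >= drought:  # this roll comes right after a dry spell
            trials_after += 1
            hits_after += seven
        misses = 0 if seven else misses + 1
    return hits_after / trials_after

rng = random.Random(0)  # fixed seed (arbitrary) so the run is repeatable
for n in (6, 600, 60_000):
    print(n, seven_frequency(n, rng))
print("after 10 misses:", seven_rate_after_drought(200_000, 10, rng))
```

With only six rolls the observed frequency can land far from one-sixth, while the long runs settle close to it. The rate after a drought should also hover near one-sixth, since under these assumptions the dice have no memory of earlier rolls.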

While frequentists aspire to be entirely objective in their calculations, they do take some things for granted, and subjective judgment enters into their procedures. Frequentists assume that variability (in measurement, sampling, noise) is distributed according to a bell-shaped curve, meaning that it follows a mathematical Gaussian distribution or, as it is commonly known, a normal distribution. Statisticians know all about the mathematical properties of a normal distribution, and, if values are distributed normally, statisticians can carry out tests that are based on the distribution. This is especially handy for biologists since many biological parameters do follow a normal distribution. If you assume a normal distribution of human heights, for example, you can infer that the average professional basketball player in the National Basketball Association, who is 6'7", is taller than 99.95% of Americans.
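To see how an inference like the basketball example is made, here is a hedged sketch using Python's standard-library `NormalDist`. The mean of 69 inches and standard deviation of 3 inches are assumed, illustrative values, not figures given in the text; different assumptions would change the answer noticeably.

```python
from statistics import NormalDist

# ASSUMED, illustrative parameters (not from the text): heights in inches.
ASSUMED_MEAN_IN = 69.0
ASSUMED_SD_IN = 3.0

heights = NormalDist(mu=ASSUMED_MEAN_IN, sigma=ASSUMED_SD_IN)

player_height_in = 6 * 12 + 7  # 6'7" expressed in inches

# cdf(x) is the share of the modeled population that is shorter than x.
share_shorter = heights.cdf(player_height_in)
print(f"{share_shorter:.2%}")
```

Under these assumed parameters the player sits more than three standard deviations above the mean, so the model places him above well over 99.9% of the population. The result is only as good as the normality assumption and the parameters fed into it.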

It is true that your assumption of normality will often rest on sketchy information. Fortunately, frequentist statistics is backed up by the Central Limit Theorem of Statistics,⁸ which allows you to tap into the benefits of the normal distribution even if the underlying data that you are interested in aren't truly normal.

As an example of how the Central Limit Theorem simplifies things, assume that many people in the small town of Averageville, USA, are of Dutch descent, and, since the Dutch are a very tall people, the distribution of heights in the town is skewed toward taller than average (i.e., it is not a normal distribution). The Central Limit Theorem says that if you take many sample groups of Averagevillians, calculate the mean of each group, and plot all of the means, the plot will tend to follow a normal distribution.
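The Averageville example can be imitated with a simulation. Everything about the town below is invented for illustration: the floor of 165 cm, the exponential tail of taller residents, the group size of 50, and the seed are all assumptions chosen only to make the skew obvious. Skewness is used as a rough yardstick of asymmetry, with a value near zero expected for a normal distribution.

```python
import random
from statistics import mean, pstdev

def skewness(values):
    """Population skewness: 0 for a symmetric distribution, > 0 for a right tail."""
    m, s = mean(values), pstdev(values)
    return sum(((v - m) / s) ** 3 for v in values) / len(values)

rng = random.Random(1)  # fixed seed (arbitrary) so the run is repeatable

# Hypothetical town: heights (cm) start at a floor of 165 and have a long
# exponential tail of taller people. Deliberately NOT a normal distribution.
town = [165 + rng.expovariate(1 / 10) for _ in range(20_000)]

# Draw many sample groups and record the mean height of each group.
GROUP_SIZE = 50
group_means = [mean(rng.sample(town, GROUP_SIZE)) for _ in range(2_000)]

print("skewness of individual heights:", round(skewness(town), 2))
print("skewness of group means:       ", round(skewness(group_means), 2))
```

The individual heights should come out strongly skewed, while the group means should be far closer to symmetric. Larger groups pull the means closer still to a normal shape, which is the practical content of the theorem.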

The Central Limit Theorem is a pillar upholding many frequentist procedures, although we tend to take its applicability for granted if we're aware of it at all.

In summary, frequentists assume that a real-world parameter has a single, fixed-though-unknown value. They think about probability because variability obscures the true values that they want to find. To draw their inferences, frequentists compare actual observations with observations predicted by their scientific hypotheses. If this sounds familiar, it is because the frequentist school of thinking was the default mode in your education. The frequentist viewpoint is intuitively appealing and powerful in many contexts, such as when we want to know how fast the casinos will drain our wallets or to test the predictions of a scientific hypothesis. As Karl Popper is our touchstone for this topic, we should step back and look at the subject of probability in hypothesis testing as it first appeared to him.


Source: Alger, Bradley E. Defense of the Scientific Hypothesis: From Reproducibility Crisis to Big Data. Oxford University Press, 2020. 449 p.
