B The Big Data Mindset

What exactly is a Big Data Mindset? It is not a necessary consequence of working with Big Data. As we saw in Chapter 4, Big Data per se is compatible with the hypothesis testing procedures of conventional science.

In both Big and Small Science settings, researchers using Big Data try to interpret the world through causal relationships and explanations. In traditional contexts, we're constantly warned that “correlation is not causation.” In contrast, advocates of the Big Data Mindset believe that, while we may succeed in finding useful predictive correlations among variables, given millions or billions of variables, with trillions of conceivable interactions among them, it is futile to hope to get beyond correlation to causation.

And that's just fine in Chris Anderson's world where “Petabytes allow us to say, ‘correlation is enough.'” In this world, we give up on our conviction that science should help us understand nature, as well as on the quaint belief that we need to understand it. (A saying found on hippie T-shirts from the 1960s, “Reality Is a Crutch,” comes to mind.) For many scientists, it would be hard to accept such a seismic shift, but Anderson argues that scientific theories are likely to be wrong anyway, so why worry about not having them? Cut right to the chase: “Correlation supersedes causation, and science can advance even without coherent models, unified theories, or really any mechanistic explanation at all.”

Amazon, Inc., recommends books for you to read based on its vast storehouse of data on books that you and people like you have bought and liked in the past, together with algorithms that translate those data into specific recommendations. Driven by its Big Data Mindset, Amazon's recommendations are frequently helpful, despite being unguided by sophisticated literary advice. “If it produces usable results, what else is there?” could pretty well serve as the motto of the Big Data Mindset.

The Big Data Mindset has ripple effects that affect science in ways other than the decision-making process. The concept of “scientific data” itself will change because the Mindset does not demand the organized sets of precise measurements that scientists have always relied on.⁴ Vast quantities of messy data will come from all over: cameras, laboratory instruments, telescopes, web searches, on-line communications, etc. Sensors for every kind of energy will be embedded everywhere, creating and vacuuming up data where none existed before. Almost anything imaginable can be “datafied” and sent to computers for sorting out. For the Big Data Mindset, data don't have to be exact, neat, or carefully curated in order to be useful as long as there is a lot of it. Quantity trumps quality.

To get a feel for the advantages and disadvantages of the Big Data Mindset, we'll start with the notorious case of Google Flu Trends (GFT) and its “epic failure.”⁵

15. B.1 The Rise and Fall of Google Flu Trends

Google is not a medical services provider and does not conduct research in epidemiology or virology, and yet it created a computer algorithm, GFT,⁶ that, for a time, predicted the spread of the major 2009 flu epidemic in the United States weeks ahead of the Centers for Disease Control and Prevention (CDC), the US government agency that is charged with making such predictions. Google did not try to identify actual or potential sick people. Rather, it sifted through the masses of online search data that it routinely collects every day (40,000 searches per second; 3.5 billion per day in 2018) for patterns correlated with outbreaks of illness during the previous five flu seasons. Google scientists assumed nothing about the correlations they might find: instead, they looked for relationships among the search terms that people used and the spread of flu (as reflected in the corresponding CDC records of flu-related doctor visits). The Google scientists took the 50 million most common search terms that they found and tested them, one at a time, to identify the terms that were most highly correlated with previous flu epidemics.

They combined the 45 top search terms to create an algorithm to predict future outbreaks (algorithms that included more predictors were not as good). In the end, they tested 450 million different mathematical models to find the one that most accurately predicted (or “postdicted”; see Chapter 4) the spread of flu during the earlier years, and they called it “Google Flu Trends.” They did not worry about why the correlations between GFT and flu existed: predictions for the near future, based on data in the recent past, nowcasting—not understanding—was their goal.
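The screening step described above, testing candidate terms one at a time and keeping the most highly correlated ones, can be sketched in a few lines. This is only an illustration under stated assumptions: the weekly series are synthetic, the term names and the function `top_correlated_terms` are hypothetical, and nothing here reproduces Google's actual data or model.

```python
import math
import random


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / math.sqrt(var_x * var_y)


def top_correlated_terms(term_series, flu_series, k):
    """Score each term against the flu series, one at a time; keep the top k."""
    scored = [(pearson(series, flu_series), term)
              for term, series in term_series.items()]
    scored.sort(reverse=True)
    return [term for _, term in scored[:k]]


# Synthetic weekly data, invented for illustration (not Google's or the CDC's).
rng = random.Random(0)
weeks = 104
flu_visits = [50 + 40 * math.cos(2 * math.pi * w / 52) + rng.gauss(0, 5)
              for w in range(weeks)]

term_series = {
    # Two hypothetical terms whose search volume tracks flu visits plus noise.
    "flu symptoms": [2.0 * v + rng.gauss(0, 10) for v in flu_visits],
    "fever remedy": [1.5 * v + rng.gauss(0, 15) for v in flu_visits],
}
# Twenty hypothetical terms whose search volume is pure noise.
for i in range(20):
    term_series[f"noise term {i}"] = [rng.gauss(100, 20) for _ in range(weeks)]

selected = top_correlated_terms(term_series, flu_visits, k=2)
print(selected)
```

Notice that the procedure never asks why a term tracks the flu series. It ranks terms by correlation and nothing else, which is the point of the approach and, as we'll see, a source of its troubles.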

How did GFT do? At first, very well. It predicted the seasonal 2009 outbreak and tracked the 2010 and 2011 outbreaks as well as the CDC data did. The fact that it was not better than the CDC, however, was disconcerting, but things really began to fall apart in late summer 2011. For the next two years, from August 2011 onward, GFT overpredicted the number of flu cases in 100 of 108 weeks. And it overshot by more than double the real number during the 2012-2013 season. Google engineers repeatedly tweaked their algorithm to no avail, and eventually the company pulled the plug on GFT. Today, a visitor to the website https://www.google.org/flutrends/about/ reads that “Google Flu Trends... are no longer publishing current estimates of Flu.... It is still early days for nowcasting and similar tools for understanding the spread of diseases like flu.”

15. B.2 GFT: What Went Wrong?

According to data analyst David Lazer and colleagues,⁷ there were many problems with GFT, and they named “[B]ig [D]ata hubris” and “algorithm dynamics” as the most insidious ones. Big Data hubris (hubris is “excessive pride or self-confidence”) gives rise to the fiction that Big Data analyses of massive amounts of data alone, unconstrained by concerns for how to conduct investigations, are sufficient for making scientific progress. Blinded by hubris, you may downplay the validity of a test (i.e., whether it measures what it is supposed to measure) or not worry about whether your interpretation of the test results is consistent with well-established theoretical principles.

The problem caused by algorithm dynamics is that the initially high correlation between users' search behavior and actual illness may fade. For instance, searches for “flu” might bring up other terms (e.g., “sniffles”) that prompt people to do follow-up searches on “sniffles” which they had not initially done. If “sniffles” had been on the original 45-term master list that Google used in its algorithm (we don't know because Google never made it public), this would have created positive feedback. The new searches on “sniffles” could artificially inflate the number of unique flu-related searches that users initiated and cause overestimation of the prevalence of flu.
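A toy calculation makes the feedback problem concrete. Every number below is invented, and the fixed "searches per case" rate is an assumed stand-in for whatever calibration a GFT-like model might use. The sketch shows only the direction of the error: if the search engine starts prompting extra queries, a rate calibrated on older behavior overestimates illness.

```python
def estimate_cases(search_count, searches_per_case):
    """Estimate flu cases from search volume using a fixed, previously calibrated rate."""
    return search_count / searches_per_case


# All numbers are invented for illustration.
true_cases = 1000
searches_per_case = 3.0   # assumed rate, calibrated before the search engine changed
organic_searches = true_cases * searches_per_case

# Hypothetical change: suggested terms prompt 40% extra follow-up queries.
followup_fraction = 0.4
observed_searches = organic_searches * (1 + followup_fraction)

before = estimate_cases(organic_searches, searches_per_case)
after = estimate_cases(observed_searches, searches_per_case)
print(before, after)
```

The number of sick people never changed; only the searching behavior did, yet the estimate rises by the same 40%.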

Spurious correlations were also a concern because, with 50 million potential candidate search terms, the probability of finding correlations that were “structurally unrelated” to flu and hence misleading was quite high, according to Lazer et al. A potential source of spurious correlation that the Google scientists excluded from their algorithm was the term “high school basketball.” High school basketball games, like the flu, tend to occur during the winter months, and a correlation between high school basketball and flu would be spurious. An algorithm that included “high school basketball” would have been a “part-flu, part-winter detector.” Some spurious correlations are obvious, but you don't know how many others remain hidden within the model. Indeed, GFT missed an unusual out-of-season flu outbreak in 2009, conceivably because another “winter detector” remained buried in the algorithm.
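You can see how easily chance correlations arise with a small simulation. In this sketch, which uses purely random numbers and assumed sizes (20 weeks, 2,000 candidate series), none of the candidates has any structural relationship to the target, yet the best match among many candidates looks impressive.

```python
import math
import random


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / math.sqrt(var_x * var_y)


def max_abs_correlation(target, candidates):
    """Largest absolute correlation between the target and any candidate series."""
    return max(abs(pearson(c, target)) for c in candidates)


rng = random.Random(1)
n_weeks = 20

# The target and every candidate are independent random noise.
target = [rng.gauss(0, 1) for _ in range(n_weeks)]
candidates = [[rng.gauss(0, 1) for _ in range(n_weeks)] for _ in range(2000)]

best_of_few = max_abs_correlation(target, candidates[:10])
best_of_many = max_abs_correlation(target, candidates)
print(round(best_of_few, 2), round(best_of_many, 2))
```

The more candidates you screen, the stronger the best spurious correlation becomes, and with 50 million candidates the effect would be far larger than in this toy case.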

Big Data hubris, algorithm dynamics, and spurious correlations are not the only sources of concern stemming from Big Data Mindset strategies like GFT: overfitting is a major one.

15. B.3 Overfitting and GFT: The Bias-Variance Dilemma

You recall the problem of overfitting from Chapter 4: we imagined developing a model to account for the pattern in which a handful of coins tossed into the air landed on the ground.

Because of overfitting, a model that successfully postdicted one pattern would do a terrible job of predicting the landing pattern of a future group of tossed coins. In general, overfitted models fail because the exact data values used to build them are influenced by random variability that won't be duplicated in new datasets, and Big Data strategies are susceptible to this problem. How do we interpret our data in a simple and robust way without running the risk of overfitting?
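The coin example can be turned into a minimal sketch. The setup is assumed for illustration: coins land at random positions on a unit square, and the "model" simply memorizes where each coin landed on the first toss.

```python
import math
import random


def toss_coins(rng, n_coins):
    """Return random (x, y) landing positions for n_coins on a unit square."""
    return [(rng.random(), rng.random()) for _ in range(n_coins)]


def mean_miss_distance(predicted, actual):
    """Average distance between each predicted and actual landing position."""
    return sum(math.dist(p, a) for p, a in zip(predicted, actual)) / len(actual)


rng = random.Random(42)
first_toss = toss_coins(rng, 10)

# The overfitted "model" memorizes exactly where each coin landed the first time.
model = list(first_toss)

# Postdiction: compare the model with the same toss it was built from.
postdiction_error = mean_miss_distance(model, first_toss)

# Prediction: compare the model with a new toss.
second_toss = toss_coins(rng, 10)
prediction_error = mean_miss_distance(model, second_toss)

print(postdiction_error, prediction_error)
```

The memorized pattern postdicts perfectly and predicts badly, because everything it captured was random variability that the second toss does not repeat.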

15. B.4 The Bias-Variance Tradeoff

The bias-variance tradeoff (Chapter 11) is the statistical Achilles heel of mathematical models that try to go from correlations found within enormous datasets to predictions of future behavior. The bias-variance tradeoff first came to prominence as computer scientists began to develop machine learning algorithms,⁸ essentially computer programs that used feedback to detect regularities in data and to describe these regularities by rules. The algorithms were first tried out on samples of data, training sets, that followed a rule, and they would try to fit the data with a mathematical function—basically, to guess the rule. Guided by feedback on how well its guess matched the true rule, the algorithm would make adjustments and try to make better fits.

We saw (Chapter 12) that bias reflects a scientist’s decision about how to characterize data by a certain mathematical function. To get a good fit, she has to weight some points more heavily than others, in effect ignoring variability and introducing bias in the form of her expert intuition of the underlying reality, in order to make a better predictive model.

In the machine learning context, bias has a slightly different meaning: it is the quantitative difference between an algorithm-generated rule and the natural rule that generated the data in the training sets. If your algorithm is too sensitive to small fluctuations in the training sets, it will overfit the training data and predict poorly on later test sets.

(These two kinds of sets, incidentally, are often created by splitting one large, original dataset into two halves and using one half for training and the other for testing.)
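Here is a sketch of the split-in-half procedure, comparing a simple rule with one that is too sensitive to the training data. The "natural rule" (y = 2x + 1 plus noise), the dataset size, and both models are assumptions chosen for illustration; the memorizing model is a nearest-neighbor lookup, not anything used in GFT.

```python
import random


def fit_line(points):
    """Ordinary least-squares slope and intercept for (x, y) points."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def mse(predict, points):
    """Mean squared error of a prediction function on (x, y) points."""
    return sum((predict(x) - y) ** 2 for x, y in points) / len(points)


rng = random.Random(7)
# Synthetic data from an assumed natural rule, y = 2x + 1, plus random noise.
data = []
for _ in range(1000):
    x = rng.uniform(0, 10)
    data.append((x, 2 * x + 1 + rng.gauss(0, 1)))

# Split the one large dataset into two halves: training and testing.
rng.shuffle(data)
train, test = data[:500], data[500:]

slope, intercept = fit_line(train)


def predict_line(x):
    """Simple model: a straight line fitted to the training half."""
    return slope * x + intercept


def predict_memorized(x):
    """Overly sensitive model: return the y of the nearest training point."""
    return min(train, key=lambda p: abs(p[0] - x))[1]


for name, model in [("line", predict_line), ("memorizer", predict_memorized)]:
    print(name, round(mse(model, train), 3), round(mse(model, test), 3))
```

The memorizer fits the training half perfectly, hiccups and all, and then does worse than the plain line on the half it has never seen.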

Problems arise even if you don’t actually force your model to account for every hiccup in the data. Let’s take the GFT case as an example. Assuming that there is a true relationship between an individual’s web search terms and his state of illness, this relationship will be subtly different for each individual, and therefore your web-sifting correlational hunt will average across the searches of all users. This averaging introduces a kind of unpredictable bias error even as it reduces the variance error due to individual differences. The bottom line is that there is a tradeoff between variance and bias, and, since it can’t eliminate error, the machine learning solution is to minimize the total error from both sources. While reducing error is always advantageous, this process is primarily useful for applied science (for example, helping Amazon sell books by predicting ones you’ll like) rather than for understanding nature in the deep sense.
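For squared-error loss, the idea of a total error with two sources is usually written as the standard bias-variance decomposition. The notation is an assumption introduced here for illustration: $f$ is the natural rule, $\hat{f}$ is the rule an algorithm learns from a randomly drawn training set, $y = f(x) + \varepsilon$ is an observation whose noise $\varepsilon$ has mean zero and variance $\sigma^2$, and the expectation is taken over training sets and noise.

```latex
\mathbb{E}\big[(y - \hat{f}(x))^2\big]
  = \underbrace{\big(\mathbb{E}[\hat{f}(x)] - f(x)\big)^2}_{\text{bias}^2}
  + \underbrace{\mathbb{E}\big[(\hat{f}(x) - \mathbb{E}[\hat{f}(x)])^2\big]}_{\text{variance}}
  + \underbrace{\sigma^2}_{\text{irreducible noise}}
```

A very flexible model makes the first term small and the second large; a very rigid one does the opposite. The noise term cannot be reduced by any choice of model, which is why the practical goal is to minimize the sum of the first two.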

15. B.5 Could “Good Bias” Help Fix the GFT Program?

I’ve alluded to the idea that when scientists throw out information by fitting messy experimental data with a crisp mathematical function, they are exercising “good bias.” That is, de-emphasizing or ignoring some aspects of your data does not invariably reflect a moral or ethical lapse. You might wonder if you could improve the overall performance of GFT and similar programs by building in good bias that decreases the influence of the correlations between individual web search terms and observed flu cases. You'd loosen the ties between current data and past flu cases in order to improve your ability to predict future cases.
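One familiar way to "loosen the ties" in a fitted model is shrinkage: deliberately pulling a coefficient toward zero so that it leans less heavily on past data. The sketch below uses a ridge-style penalty on a single slope. This is a swapped-in textbook technique chosen for illustration, with invented numbers and a hypothetical function name; it is not a description of anything Google did.

```python
def fit_slope(points, penalty=0.0):
    """Slope of a through-the-origin fit y = slope * x with a ridge-style penalty.

    penalty=0 gives ordinary least squares; larger values shrink the slope toward 0.
    """
    sxy = sum(x * y for x, y in points)
    sxx = sum(x * x for x, _ in points)
    return sxy / (sxx + penalty)


# Invented numbers: volume of one hypothetical search term vs. observed flu cases.
history = [(1.0, 2.1), (2.0, 3.9), (3.0, 6.2), (4.0, 7.8)]

unshrunk = fit_slope(history)                # follows past data as closely as possible
shrunk = fit_slope(history, penalty=10.0)    # deliberately biased toward zero
print(unshrunk, shrunk)
```

The shrunken slope fits the historical data worse on purpose. Whether that kind of bias improves future predictions depends on the problem, and it is an empirical question rather than a guarantee.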

Loosening these ties, however, would contradict the way the Big Data Mindset approached the creation of GFT in the first place: after all, the Google scientists created GFT by deliberately seeking out search terms that were highly correlated with the numbers of flu cases. And in fact, tweaking the GFT algorithm with apparently reasonable fixes—essentially what we are considering “bias”—did not help at all. The predictive validity of GFT continued downhill despite active intervention. In a sense, this is unsurprising once we remember the impetus behind the Big Data movement: there is much too much complicated data for us to be able to grapple with rationally.

15. B.6 GFT and the Need to Understand

David Lazer et al. argue that a major drawback of the GFT approach is its focus on improving predictions, whereas “What is more valuable is to understand the prevalence of flu.” Big Data is not useless: to the contrary, it can markedly enhance the value of online surveys and health reporting, but its focus should be on providing “a deeper, clearer understanding of the world” rather than trying to supplant understanding with correlations. Of course, attempting to provide explanations and understanding is what hypothesis-based science is all about. In other words, the value of the GFT approach might be increased by merging it with traditional scientific methods (although we need to be clear on what GFT's scientific goal was; see Box 15.1). But is the Big Data Mindset compatible with conventional reasoning?

The GFT algorithm was not trying to explain sickness by looking at web searches. At best, you might argue that GFT was based on a prediction of the implicit hypothesis that sick people (or people worried about becoming sick) were looking for information that would help them get or feel better. This interpretation is dubious. As Lazer et al. point out, the failure of the GFT algorithm was due in part to the fact that it was not a valid measuring device and therefore couldn't say anything about either the prediction or the hypothesis. If your pH meter is broken, it can't tell you anything one way or another about your acid rain hypothesis.

Likewise, the Google algorithm was not a valid detector of flu. The Google team did not construct GFT by looking for meaningfully “flu-related” terms, but merely terms that were correlated with the incidence of flu. You can see that these objectives are not equivalent if you re-consider the correlation between searches for “high school basketball” and number of flu cases. This was plausibly accounted for as a spurious correlation (both high school basketball and flu occur in the winter months), but the fact that the correlation exists at all makes the point: terms that were useful in constructing GFT are not in any obvious way “flu-related.”

Box 15.1 Was GFT Testing an Implicit Hypothesis or a Prediction?

The authors of Google Flu Trends (GFT) don't take a stand on whether their work has anything to do with a hypothesis; the word doesn't appear in their paper. “Prediction” occurs only once, in “prediction intervals,” hence this prediction is also unrelated to a hypothesis. However, a Nature editorial on GFT says that it “tested the hypothesis that people will more frequently search the Internet using flu-related terms when they get sick.”⁹ If you accept the definitions of these terms in Chapter 1, then you may disagree. A hypothesis is an explanation. A prediction is a statement about a future state of the world that may or may not follow from a hypothesis. Even when it does, it does not explain anything.

The Nature editorial says that the GFT is testing the “hypothesis” that people will do flu-related searches when they get sick, but this is a prediction, not a hypothesis. It says only what the investigators expect sick people to do. Here's an analogy: “People with coughs will buy cough medicine.” This is a prediction about sick people's behavior. The statement that “People buy cough medicine because they have coughs” is a hypothesis that explains a particular kind of purchasing behavior. It predicts that people with coughs will buy cough medicine. Probably hypotheses like this seem so obvious that we don't bother to state them (they are implicit); nevertheless, it's better not to confuse an unstated hypothesis with a prediction.

The blind searches for correlations among variables within massive datasets that are driven by the Big Data Mindset are not themselves tests of hypotheses. Claims that the Big Data Mindset is about “millions of hypotheses” suggest that they, like GFT, are conceptually advancing scientific understanding of the world, when instead they only are trying to detect regularities in it. This is just a futuristic update of familiar old inductivist methods.

A hypothesis-friendly application of GFT-type Big Data searches would go from correlations to testable hypotheses. Essentially, the data analysis could serve as an “inference engine” that could aid scientists in generating hypotheses. Methods developed by Judea Pearl and his colleagues may help create such devices.


Source: Alger, Bradley E. Defense of the Scientific Hypothesis: From Reproducibility Crisis to Big Data. Oxford University Press, 2020. 449 p.
