4.E Big Data: What Is It?

While the origins of the term “Big Data” are murky, it gained academic respectability in 2003,⁴⁶ although in 2001 Laney⁴⁷ had already defined the “bigness” of data by the famous “3 Vs”: volume, velocity, and variety.

The alliteration provoked a stream of imitators, and the number of V's describing Big Data grew to 4, 7, and 10 within a few years, all the way to the current record: “The 42 V's of Big Data and Data Science,”⁴⁸ a list that includes the plausible “visualization” and “veracity” as well as the cynical “vagueness” and “varmint” (the number of software bugs increases with program size), apparently indicating that enough is enough. The point is that Big Data is not solely about dealing with large quantities of data.

Big Data is now a “phenomenon and a discipline.” Big Data has transformed science in both concrete and abstract ways. In this chapter, I want to take a quick look at how Big Data is influencing conventional science practice; in Chapter 15, we'll examine its disruptive, abstract influences on science and the philosophy of science.

4.E.1 Hypothesis-Based Big Science/Big Data

Although the Human Genome Project typifies Discovery Science in the context of Big Science and Big Data, Big Science is not confined to the realm of Discovery Science. The search for the Higgs boson carried out at the Large Hadron Collider is a good example of Big Science/Big Data testing of a quantitative hypothesis: the Standard Model of physics. Another recent example is the test of the existence of gravitational waves, which Einstein's General Theory of Relativity predicted.⁴⁹ These examples get a lot of popular coverage, so they are probably well known to many readers. The employment of Big Data in Small Science projects is probably not as familiar.

4.E.2 Hypothesis-Based Small Science/Big Data

Small Science also uses Big Data to test hypotheses.

Psychologist Morteza Dehghani and colleagues⁵⁰ were interested in the social “glue” that holds communities together and hypothesized that shared moral values, especially “moral purity,” which is related to feelings of disgust or uncleanliness, caused people to form social networks. That is, their hypothesis predicted that the closer two people were in their views on moral purity, the closer their social bonds. To test predictions of the hypothesis, the authors collected and analyzed 700,000 tweets from Twitter to determine the degree of homophily (“self-liking”) among groups of people. They investigated tweets occurring around the time of the government shutdown brought about by Congress in 2013 because of political disagreements over healthcare (“Obamacare”) funding. The public debate took place along party lines, and, indeed, Dehghani's group, with the help of a “community-detection algorithm,” identified clusters of tweets representing the two major US political parties. The investigators then used copious computer processing of the data to determine the social distances among individuals and the “moral content” of their tweets (identified by key words). People who were closely connected socially shared moral purity values but did not necessarily share other moral concerns (e.g., on fairness, loyalty, or authority). The authors also found that moral bonds were tighter than political ones and that, within political parties, people also sorted themselves into groups of shared value systems. Finally, the authors tested other predictions of their hypothesis in behavioral “laboratory studies” (actually online surveys conducted via Mechanical Turk⁵¹) with a few hundred people. In the laboratory studies, the authors directly asked the participants about their views of moral similarity and differences along with their other personal traits and compared them. The laboratory findings corroborated the conclusions of the Big Data analysis.

This example shows that Big Data can find a place in a Small Science setting. It also demonstrates that, whereas Big Data greatly expanded the universe of this social science study, it did not alter the study's conceptual framework: the authors used Big Data to test a conventional hypothesis in a straightforward way.
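The logic of the Twitter test, that a smaller social distance should go with a smaller difference in moral-purity language, can be sketched in a few lines of Python. This is only a toy illustration under my own assumptions, not the authors' pipeline: the keyword list `PURITY_WORDS`, the hop-count measure of social distance, and the function names are all hypothetical, and the community-detection step is omitted.

```python
from collections import deque
from itertools import combinations

# Hypothetical keyword list standing in for a moral-purity dictionary.
PURITY_WORDS = {"pure", "clean", "sacred", "disgusting", "filthy", "sin"}


def purity_score(tweets):
    """Fraction of a user's words that are purity keywords (0.0 if no words)."""
    words = [w.strip(".,!?").lower() for t in tweets for w in t.split()]
    if not words:
        return 0.0
    return sum(w in PURITY_WORDS for w in words) / len(words)


def social_distance(graph, start, goal):
    """Hop count between two users, found by breadth-first search.

    Returns None if the two users are not connected.
    """
    seen = {start}
    queue = deque([(start, 0)])
    while queue:
        node, dist = queue.popleft()
        if node == goal:
            return dist
        for neighbor in graph[node]:
            if neighbor not in seen:
                seen.add(neighbor)
                queue.append((neighbor, dist + 1))
    return None


def mean_purity_gap_by_distance(graph, tweets_by_user):
    """Average absolute difference in purity score, grouped by social distance."""
    scores = {user: purity_score(t) for user, t in tweets_by_user.items()}
    gaps = {}
    for a, b in combinations(sorted(graph), 2):
        d = social_distance(graph, a, b)
        if d is not None:
            gaps.setdefault(d, []).append(abs(scores[a] - scores[b]))
    return {d: sum(v) / len(v) for d, v in sorted(gaps.items())}


# Toy data (invented): a four-person chain ann - bob - cat - dan.
graph = {
    "ann": ["bob"],
    "bob": ["ann", "cat"],
    "cat": ["bob", "dan"],
    "dan": ["cat"],
}
tweets_by_user = {
    "ann": ["this is sacred, clean"],
    "bob": ["a pure and sacred"],
    "cat": ["budget vote is today"],
    "dan": ["the shutdown drags on"],
}

gap_by_distance = mean_purity_gap_by_distance(graph, tweets_by_user)
```

If the hypothesis holds in a data set, the mean gap for directly connected pairs should be smaller than the gap for more distant pairs. A result in the other direction would count against it, which is what makes the analysis a test rather than a mere description.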

4.E.3 Systems/Computational Biology

Complex systems have numerous component elements that interact in nearly limitless ways. You might know a great deal about an individual element—for example, a gene, a protein, an atmospheric aerosol particle—in isolation and still know nothing about how that element will behave when it is part of a huge population of elements, let alone how the population as a whole will behave. The “bland” reductionist assumption that all physical phenomena depend on simple physical entities doesn't mean that a constructionist program of building up to the more complex systems from simple ones will work. Novel behaviors emerge when the elements interact. “More is different,” says the physicist P. W. Anderson.⁵² Anderson, a committed reductionist, firmly rejects constructionism and uses a chemical example: both ammonia and sugars are made up of similar atoms; however, a detailed atomic-level comprehension of the atomic motions within the ammonia molecule does not help you understand the atomic motions in the larger sugar molecules.

Tackling the large assemblies of elements head-on is often referred to as a “systems” approach to emphasize that the goal is to understand the system as a whole. Systems biology (aka computational biology) tries to understand how its parts interact by developing computer models that reproduce the emergent properties of the system.

Does the interest in the system as a whole preclude a role for hypotheses? Is its success proof that science does not need hypotheses to advance? On the contrary, many systems biologists consider hypotheses to be integral parts of their field. For instance, the biochemists Daniel Beard and Martin Kushmerick⁵³ state that the high-level unifying generalizations, the hypotheses, and Platt's program of Strong Inference (Chapter 2) provide the logical and scientific structure that is necessary for understanding the formidable system complexities of biology.

However, Beard and Kushmerick disagreed with Platt's downplaying the importance of mathematical models in scientific reasoning.

Platt felt that compelling logical models were more powerful; he wanted to have a chain of nonmathematical reasoning that would decide in favor of one hypothesis by eliminating others. For Platt, a mathematical model was simply a convenient way of summarizing your conclusions. The systems biologists argue that you can't get a meaningful grasp of an entire system without using powerful mathematical tools; that “the simple qualitative framework of an earlier age of molecular biology” is no longer up to the task. Nowadays, they say, reasoning is not based on the crude box-and-arrow diagrams of 50 years ago, but on sophisticated computer models, and “the fundamental key is to recognize that a computational model, however simple or complex is a hypothesis [emphasis added].” Hypotheses cast as quantitative models in their view are precise and explicit, as they must be so you can test and potentially falsify them. Falsification is important because “successful disproof is the key to progress.” Systems biology lets us solve enormously complex problems without abandoning the fundamental logic-based procedures of Popper and Platt.
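To make the idea that a computational model is a hypothesis concrete, here is a minimal sketch of my own. Everything in it is an assumption chosen for illustration (the first-order decay model, the parameter values `c0` and `k`, the tolerance, and the data points); none of it comes from Beard and Kushmerick. The point is only that a quantitative model makes explicit predictions that observations can contradict.

```python
import math


def model_prediction(t, c0=10.0, k=0.5):
    """The hypothesis stated as a model: first-order decay, c(t) = c0 * exp(-k * t).

    c0 and k are illustrative values, not measured constants.
    """
    return c0 * math.exp(-k * t)


def is_falsified(observations, predict, tolerance):
    """True if any observed value differs from the prediction by more than tolerance.

    observations is a list of (time, measured_value) pairs.
    """
    return any(abs(value - predict(t)) > tolerance for t, value in observations)


# Invented measurements: the first set lies close to the model's curve,
# the second contains a point the model cannot accommodate.
consistent_data = [(0, 10.1), (1, 6.0), (2, 3.7)]
conflicting_data = [(0, 10.0), (1, 8.0)]

survives_first_test = not is_falsified(consistent_data, model_prediction, tolerance=0.3)
fails_second_test = is_falsified(conflicting_data, model_prediction, tolerance=0.3)
```

A real systems-biology model would have many more equations and parameters and a proper statistical criterion in place of a fixed tolerance, but the logic is the same: the model states what should be observed, and a sufficiently large mismatch disproves it.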

In these past sections, I've reviewed kinds of science that have been proposed as alternatives to hypothesis-based science. Before leaving the topic, I want to consider one final kind of science. So far, we've been reviewing forms of modern science. One overlooked class of knowledge does not fit into this category because it was not generated by modern societies. It is Indigenous science.


Source: Alger, Bradley E. Defense of the Scientific Hypothesis: From Reproducibility Crisis to Big Data. Oxford University Press, 2020. 449 p.
