Could big-data hype be leading us astray?

by John Naughton

The Observer

Concepts of enduring utility rarely emerge from the market-research business, but the Gartner hype cycle is an exception that proves the rule. It is a graph that describes the life cycle of a technological innovation in five phases.

First, there’s the “trigger” that kicks off the feverish excitement and a rapid escalation in public interest. That eventually leads to a “peak of inflated expectations” (phase two), after which there is a steep decline as further experimentation reveals that the innovation fails to deliver on the extravagant claims originally made for it. The curve then bottoms out in a “trough of disillusionment” (phase three), after which there is a slow but steady rise in interest (the “slope of enlightenment,” phase four) as companies discover applications that really do work. The final phase is the “plateau of productivity,” where useful applications of the idea finally become mainstream. The time between phases one and five can be several decades long.

As the “big data” bandwagon gathers steam, it is appropriate to ask where it currently sits on the hype cycle. The answer depends on which domain of application we’re talking about. If it’s the application of large-scale data analytics for commercial purposes, then many of the big corporations, especially the Internet giants, are already into phase four. The same holds if the domain consists of the data-intensive sciences such as genomics, astrophysics and particle physics: The torrents of data being generated in these fields lie far beyond the processing capabilities of mere humans.

But the big-data evangelists have wider horizons than science and business: They see the technology as a tool for increasing our understanding of society and human behavior and for improving public policymaking. After all, if your shtick is “evidence-based policymaking,” then the more evidence you have, the better.

So where on the hype cycle do societal applications of big-data technology currently sit? The answer is phase one, the rapid ascent to the peak of inflated expectations, that period when people believe every positive rumor they hear and are deaf to skeptics and critics.

It’s largely Google’s fault. Four years ago, its researchers caused a storm by revealing (in a paper published in Nature) that Web searches by Google users provided better and more timely information about the spread of influenza in the United States than did the data-gathering methods of the U.S. government’s Centers for Disease Control and Prevention. This paper triggered a frenzy of speculation about other possible public policy applications of massive-scale data analytics.

As the economist Tim Harford puts it: “Not only was Google Flu Trends quick, accurate and cheap, it was theory-free. Google’s engineers didn’t bother to develop a hypothesis about what search terms — ‘flu symptoms’ or ‘pharmacies near me’ — might be correlated with the spread of the disease itself. The Google team just took their top 50 million search terms and let the algorithms do the work.”

Thus was triggered the hype cycle. But in this particular case, the enthusiasm turned out to be premature. Nature recently reported that Google Flu Trends had gone astray. “After reliably providing a swift and accurate account of flu outbreaks for several winters,” reports Harford, “the theory-free, data-rich model had lost its nose for where flu was going. Google’s model pointed to a severe outbreak, but when the slow-and-steady data from the [Centers for Disease Control and Prevention] arrived, they showed that Google’s estimates of the spread of flu-like illnesses were overstated by almost a factor of two.”

So what went wrong? Simply this: Google doesn’t know anything about the causes of flu. It just knows about correlations between search terms and outbreaks. But as every GCSE student knows, correlation is quite different from causation. And causation is the only basis we have for real understanding.
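For readers who want to see how a correlation with no causal story behind it can look impressive and then fall apart, here is a minimal sketch in Python. It is not Google's method and makes no claim about why Flu Trends itself failed; everything in it is an assumption made for illustration. The "flu" series and the "search term" series are pure random noise, and the names pearson and best_spurious_term are invented for this example. The code screens a couple of thousand meaningless candidates, keeps the one that tracks the made-up flu numbers best over the first 20 weeks, and then checks how that same candidate does over the next 20.

```python
import random
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / sqrt(vx * vy)


def best_spurious_term(n_terms=2000, n_weeks=20, seed=0):
    """Pick the noise series that best matches noise 'flu' data in-sample.

    Returns (in-sample correlation, correlation over the later weeks)
    for the winning candidate. All data here is synthetic.
    """
    rng = random.Random(seed)
    # Hypothetical weekly flu counts: random noise, no real signal at all.
    flu = [rng.gauss(0, 1) for _ in range(2 * n_weeks)]
    train_flu, later_flu = flu[:n_weeks], flu[n_weeks:]

    best_train, best_later = -1.0, 0.0
    for _ in range(n_terms):
        # Hypothetical search-term volume: also noise, unrelated to flu.
        term = [rng.gauss(0, 1) for _ in range(2 * n_weeks)]
        r_train = pearson(term[:n_weeks], train_flu)
        if r_train > best_train:
            best_train = r_train
            best_later = pearson(term[n_weeks:], later_flu)
    return best_train, best_later


train_r, later_r = best_spurious_term()
print(f"best in-sample correlation: {train_r:.2f}")
print(f"same term, later weeks:     {later_r:.2f}")
```

Because the winner was chosen only for fitting the first stretch of data, its strong in-sample correlation says nothing about the weeks that follow. Screen enough candidates and something will always appear to "predict" flu, with no mechanism behind it.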

Big-data enthusiasts seem remarkably untroubled by this. In many cases, they say, knowing that two things are correlated is all you need to know. And indeed in commerce that may be reasonable. I buy stuff for both myself and my kids on Amazon, for example, which leads the company to conclude that I will be tempted not only by Hugh Trevor-Roper’s letters but also by new releases by hot rap artists. This is daft, but does no harm. Applying the kind of data analytics that produces such absurdities to public policy, however, would not be funny. But it’s where the more rabid big-data evangelists want to take us. We should tell them to get lost.