NEW YORK – There’s a common thread to the dubious claims and irreproducible results that have plagued some fields of science. In most of these cases, scientists saw illusory patterns in the randomness of our world. It’s the same error that led people to be captivated by an octopus named Paul who, in 2010, correctly predicted the outcomes of eight German World Cup matches.
It might seem surprising that a cephalopod could make so many correct predictions from mere random guesses. But as statistician David Hand explains in his book “The Improbability Principle,” if enough people ask enough animals to predict the outcomes of enough sporting events, it is likely that some will eventually guess correctly several times in a row.
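To put rough numbers on that argument, here is a minimal sketch in Python. It assumes each prediction is an independent coin flip between two outcomes (a simplification, since real matches can also end in draws), and the function name and the animal counts are made up for illustration.

```python
def chance_some_animal_gets_streak(n_animals, n_matches, p_correct=0.5):
    """Chance that at least one of n_animals calls every match correctly by luck.

    Assumes independent guesses, each right with probability p_correct.
    """
    # Probability that a single animal gets all n_matches right.
    p_one_animal = p_correct ** n_matches
    # Complement of "every animal fails at least once".
    return 1.0 - (1.0 - p_one_animal) ** n_animals


if __name__ == "__main__":
    for n in (1, 100, 1000):  # hypothetical numbers of animals being asked
        print(n, round(chance_some_animal_gets_streak(n, 8), 3))
```

Under these assumptions a single animal has a 1-in-256 chance of going eight for eight, but with a thousand animals guessing, at least one perfect record is close to certain.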
Something similar happens when thousands of scientists go looking for weird, headline-grabbing results — a common practice in psychology. For a while, it looked like you could make yourself happier by holding a pencil in your teeth, become more empathetic by spending three minutes reading classic literature, or gain power by assuming special “power poses.” All have gone the way of poor, discredited Paul the Octopus.
The good news is that tools to separate real results from random noise exist. The bad news is that many researchers in afflicted fields never learned to use them correctly — a conclusion echoed by reformers from within, and reflected in a statement issued last year by the American Statistical Association.
Statistical analysis falls into two major schools: frequentist and Bayesian statistics. The frequentist school of thought hinges on the idea that the probability of something happening corresponds to the fraction of times it would happen given many chances. Roll a die often enough and five will come up very close to one-sixth of the time.
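The frequentist idea can be illustrated with a small simulation. This is only a sketch: it uses Python's pseudo-random number generator as a stand-in for a fair die, and the function name, seed and roll counts are arbitrary choices made for the example.

```python
import random


def fraction_of_fives(n_rolls, seed=0):
    """Roll a simulated fair die n_rolls times and return the share of fives."""
    rng = random.Random(seed)  # fixed seed so repeated runs give the same result
    fives = sum(1 for _ in range(n_rolls) if rng.randint(1, 6) == 5)
    return fives / n_rolls


if __name__ == "__main__":
    for n in (100, 10_000, 100_000):
        print(n, fraction_of_fives(n), "target:", round(1 / 6, 4))
```

With only a hundred rolls the share of fives can wander noticeably from one-sixth; with more rolls it tends to settle closer to it.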
Back in 1865, mathematicians Benjamin and Charles Peirce, a father-and-son team, used the concept to help settle a dispute involving Hetty Green, who would later become known as the richest woman in America.
The story, told in detail by journalist Louis Menand in his book “The Metaphysical Club,” starts with the mathematicians getting called as expert witnesses to determine whether Green (then Hetty Robinson) had forged her aunt’s signature on an alternative will that would have bequeathed her a fortune of $2 million.
The signature in question was perfectly identical to a signature on the original will, suggesting that Robinson had traced it. Most authentic signatures vary a bit. What were the odds these two signatures would be identical by chance?
The Peirces noted that the signature had 30 down strokes. They found 44 other examples of the aunt’s signature, measured down strokes, and calculated that a given down stroke matched across two signatures 5 percent of the time. They calculated odds of one in 68 that three down strokes would match in two signatures, one in 144 that four would match, and odds of one in trillions that all 30 would match.
This calculation bears some resemblance to what scientists do to determine what they call statistical significance — which is expressed as something called a p-value. The technique was invented to help researchers separate real results from noise by giving them a sense of whether they should be surprised enough by their data to take a closer look.
But psychologists and medical researchers usually use an arbitrary cutoff point of .05 (1 in 20) to define what’s statistically significant, a standard far less stringent than the one-in-trillions calculated by the Peirces. This porous filter was originally intended to flag preliminary results that deserved a second look, not to serve as a proxy for truth.
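For a concrete sense of what a p-value measures, consider Paul's record again. The sketch below computes a one-sided exact binomial p-value: the chance of seeing at least this many correct calls if every guess were a fair coin flip. The coin-flip model and the function name are assumptions made for illustration, not a description of how any particular study was analyzed.

```python
from math import comb


def binomial_p_value(successes, trials, p_null=0.5):
    """One-sided p-value: chance of at least `successes` hits in `trials`
    attempts if each attempt succeeds with probability p_null."""
    return sum(
        comb(trials, k) * p_null**k * (1 - p_null) ** (trials - k)
        for k in range(successes, trials + 1)
    )


if __name__ == "__main__":
    cutoff = 0.05  # the conventional threshold mentioned above
    p = binomial_p_value(8, 8)  # eight correct calls out of eight
    print(round(p, 4), "below cutoff" if p < cutoff else "above cutoff")
```

By this yardstick Paul clears the .05 bar easily, which shows why the bar alone cannot tell a psychic octopus from a lucky one when many animals are being asked.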
The problem with p-values goes beyond that. Gerd Gigerenzer, a psychologist and longtime science critic at the Max Planck Institute for Human Development, points to a survey published in 2002, which indicated that most professors of psychology don’t know what p-values represent. They think they know, but they don’t. (In an interview, Gigerenzer said he didn’t think the situation had improved in the last 15 years.)
And because they don’t understand p-values, they routinely calculate them incorrectly, he said, allowing the publication of lots of high-profile noise under a veneer of statistical rigor.
In a 2004 paper titled “Mindless Statistics,” Gigerenzer illustrated the crux of the problem with an anecdote from the writings of physicist Richard Feynman.
In exploring various fields of science and pseudoscience, Feynman was working with a psychologist who was putting rats in a T-shaped maze, testing whether they preferred to turn right or left.
The psychologist didn’t get the result he was looking for, but he did notice that the animals alternated directions, right-left-right-left, and asked Feynman if he could calculate the statistical significance of that result.
Feynman tried to explain that the psychology professor would have to set up a new experiment to test his new hypothesis. If you’re wondering why, consider the fact that drawing a pair of threes is just as unlikely as drawing a pair of kings or aces in your poker hand. But the odds of drawing a particular pre-specified pair are much lower than the odds of drawing any pair.
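The poker comparison can be checked by counting hands. The sketch below assumes a standard 52-card deck, a five-card hand, and "a pair" meaning exactly one pair with no better hand; the function names are invented for this example.

```python
from math import comb

TOTAL_HANDS = comb(52, 5)  # all possible five-card hands


def one_pair_hands_of_rank():
    """Count hands with exactly one pair of one chosen rank (say, threes)."""
    pair_suits = comb(4, 2)      # which two of the four suits form the pair
    other_ranks = comb(12, 3)    # three different ranks for the remaining cards
    other_suits = 4 ** 3         # any suit for each of those three cards
    return pair_suits * other_ranks * other_suits


def prob_specific_pair():
    return one_pair_hands_of_rank() / TOTAL_HANDS


def prob_any_pair():
    # Thirteen ranks could supply the pair, and the cases do not overlap.
    return 13 * one_pair_hands_of_rank() / TOTAL_HANDS


if __name__ == "__main__":
    print("pair named in advance:", round(prob_specific_pair(), 4))
    print("any pair at all:", round(prob_any_pair(), 4))
```

Under these assumptions a pair named in advance turns up in roughly 3 percent of hands, while some pair or other turns up in roughly 42 percent, a thirteenfold difference that comes purely from choosing the pattern after seeing the cards.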
If researchers comb through their data fishing for weird things, that’s fine, but calculating the statistical significance of what they find requires a separate experiment. Otherwise they’ll end up with the same problem that afflicted one of Paul’s successors, a koala named Oobi-Ooobi, who was fired last year after his sports prediction powers suddenly, and not so mysteriously, disappeared.
Faye Flam is a Bloomberg View columnist and science writer. She has written for the Economist, the New York Times, the Washington Post, Psychology Today, Science, New Scientist and other publications. She has a degree in geophysics from the California Institute of Technology, and has been a Knight-Wallace fellow at the University of Michigan.