I just finished writing an article for Kansas Living about those times you can’t trust science. “Scientific Research: Sorting through fact and fiction” reviews a study from 2015 which concluded daily doses of dark chocolate help people lose weight faster.
Sounds awesome, doesn’t it? Spoiler alert: it sounds too good to be true because it’s not true. While the study was real, it was designed specifically for a documentary on junk science. [1]
Unfortunately, the flaws that tainted these results are distressingly commonplace. As Mark Twain paraphrased an unknown author, “There are lies, damned lies, and statistics.”
Familiarity with just a couple of key concepts can do wonders for helping us distinguish between legitimate statistics and the lying kind. No worries, though: I won’t ask you to do any math!
Ice Cream Causes Murder?
Early on in my Intro to Psych class, the professor presented a classic teaching example of correlation, when two factors are shown to be statistically connected.
Murder rates rise and fall in tandem with ice cream sales. More ice cream sales equal more murders. This is a real correlation, by the way.
Of course, eating ice cream doesn’t make people homicidal–at least as long as no one gets between me and my bowl of chocolate goodness. There’s something else at work here.
In hotter weather, there are more murders. In hotter weather, more people buy ice cream. So in researchspeak, the temperature could be considered a “confounding variable,” potentially impacting both the ice cream sales and murders.
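For readers who like to see this in action, here is a minimal Python sketch of a confounding variable at work. Every number in it is made up for illustration, and the names (`temps`, `ice_cream`, `murders`, `pearson`) are my own. Temperature drives both series, and neither series ever looks at the other. The two should still come out strongly correlated until temperature is held roughly constant.

```python
import random

def pearson(xs, ys):
    """Pearson correlation coefficient, computed by hand."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5

rng = random.Random(0)  # fixed seed so the run is repeatable

# Hypothetical data: 5,000 days of temperatures (degrees Celsius).
temps = [rng.uniform(0, 35) for _ in range(5000)]

# Both series depend on temperature plus their own random noise.
# Neither one depends on the other.
ice_cream = [200 + 10 * t + rng.gauss(0, 40) for t in temps]
murders = [5 + 0.2 * t + rng.gauss(0, 2) for t in temps]

# Correlation across all days, ignoring temperature.
r_raw = pearson(ice_cream, murders)

# Hold the confounder roughly constant: keep only days between 20 and 22 degrees.
band = [(i, m) for t, i, m in zip(temps, ice_cream, murders) if 20 <= t < 22]
r_band = pearson([i for i, _ in band], [m for _, m in band])

print(f"all days:        r = {r_raw:.2f}")
print(f"20-22 degrees:   r = {r_band:.2f}")
```

If the strong correlation across all days mostly disappears inside the narrow temperature band, that is the signature of a confounder: the link between ice cream and murder was the weather all along.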
It’s very tempting to infer one thing must cause another once you realize two things are connected. Many times, you’d even be right. That’s why you will sometimes hear, “correlation doesn’t imply causation” when people are discussing research. Resisting that assumption goes against our natural inclination.
Observing a logical chain of events is often good enough for informal evaluation. It’s not good enough to call it “science,” though.
What is “Significant?”
A fundamental confusion in interpreting research often arises from the word “significant.” In ordinary language, significance denotes importance. So findings heralded as being significant sound like a big deal. But in statistics, significance has a very specific definition.
In the 1920s, a statistician named Ronald Fisher suggested calculating how likely it is that results could be attributed to random chance. This suggestion was meant to be a single, informal consideration among many to decide what results merit further examination. [2]
An experiment is constructed around a hypothesis, or essentially an educated guess. Scientists might consider current theory and look to isolate some factor suspected of having impact, the test variable. When expectations about a variable’s impact are stated in a way that can be directly tested, it forms a hypothesis. [3]
The counter to an experimental hypothesis is generally the null hypothesis, which assumes no difference between test conditions. Because no experiment can realistically include every potential subject, no hypotheses can be proven universally true. What’s true of one sample may not be true of another. But researchers can test results against the null hypothesis by considering statistical significance.
The statistical significance calculation asks: if the null hypothesis is true, what’s the probability of getting results at least as extreme through chance alone? This probability is called the P value. The smaller the P value, the harder the results are to explain by chance alone, which is read as evidence of a meaningful difference between test groups. A P value of .05 or lower is conventionally considered “statistically significant,” though the cutoff is arbitrary. [2]
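If a little code helps, here is one way to see that question answered directly. This is a sketch of a permutation test, which is just one of several ways to arrive at a P value and not necessarily what any particular study used. The function name `permutation_p_value` and the weight-loss numbers are invented for illustration. The idea: if the null hypothesis is true, the group labels are meaningless, so shuffle them many times and count how often chance alone produces a gap at least as large as the one observed.

```python
import random

def mean(values):
    return sum(values) / len(values)

def permutation_p_value(group_a, group_b, n_shuffles=10_000, seed=0):
    """Estimate a two-sided P value for the difference in group means."""
    rng = random.Random(seed)  # fixed seed so the run is repeatable
    observed = abs(mean(group_a) - mean(group_b))
    pooled = list(group_a) + list(group_b)
    n_a = len(group_a)
    hits = 0
    for _ in range(n_shuffles):
        # Under the null hypothesis, any split of the pooled data is equally likely.
        rng.shuffle(pooled)
        diff = abs(mean(pooled[:n_a]) - mean(pooled[n_a:]))
        # Tiny tolerance guards against floating-point rounding.
        if diff >= observed - 1e-12:
            hits += 1
    return hits / n_shuffles

# Hypothetical weight loss in kilograms for two groups of eight people.
treatment = [2.1, 1.8, 2.5, 3.0, 1.2, 2.7, 2.2, 1.9]
control = [1.0, 1.5, 0.8, 2.0, 1.1, 1.7, 0.9, 1.4]

p = permutation_p_value(treatment, control)
print(f"estimated P value: {p:.4f}")
```

Notice what the number is a statement about: how often shuffled, meaningless labels match or beat the observed gap. It says nothing about why the groups differ.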
Problem is, the P value is vulnerable to misinterpretation, even among researchers. A P value of .05 does not mean there’s a 95 percent probability the original hypothesis is true, even though it sounds like it might. The P value cannot test the original hypothesis.
The hypothesis may or may not be based on sound reasoning. Test procedures may or may not genuinely measure what they were intended to measure. There may or may not be unknown factors at work besides what was tested. Research subjects may or may not be representative of the population being studied. No amount of massaging the math can make up for shortcomings in experimental design.
Instead, a P value of .05 suggests that, provided all the other assumptions are correct and there is no real difference between test conditions, there’s a 5 percent probability of getting results at least this extreme through random chance. It’s one indication of meaningful difference between test conditions, but it cannot identify the source of any difference. If you begin with a highly improbable hypothesis, statistical significance becomes even less informative. [2] A single calculation cannot begin to tell the whole story.
Additionally, taking a large number of measurements on a small group of subjects dramatically increases the chances that at least one of them will reach statistical significance by chance alone. Think of it like rolling dice repeatedly hoping for snake eyes. Each measurement is another roll. The more rolls, the more opportunities for the desired outcome.
This is primarily how the aforementioned chocolate study was engineered. They took 18 different measurements on 15 people over the course of just three weeks. The small number of subjects and brief time span made it more likely anomalies would not be averaged out of the findings.
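To put a rough number on the dice analogy, here is a small sketch. It assumes every measurement is an independent test with a 5 percent false-positive rate, which is a simplification: real measurements on the same people tend to be related, so treat the result as a ballpark figure. The function name is my own.

```python
def chance_of_false_positive(n_measurements, alpha=0.05):
    """Probability that at least one of n independent tests comes up
    'significant' by chance alone, when no real effect exists."""
    return 1 - (1 - alpha) ** n_measurements

for n in (1, 5, 18):
    print(f"{n:>2} measurements: {chance_of_false_positive(n):.0%}")
```

Under that assumption, 18 measurements give roughly a 60 percent chance of at least one “significant” result even when nothing real is going on.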
The practice of manipulating data to meet the standard of statistical significance, whether intentional or not, is common enough that the term “P-hacking” was coined to describe it, and the American Statistical Association put out an unusual statement to discourage it. [4] So this issue has implications beyond the realm of what might be considered junk science.
Even legitimately derived statistical significance doesn’t indicate findings are meaningful in a practical sense. Statistically significant differences between test groups can be small enough to be imperceptible in real life, or clinically insignificant. When statistical significance is cited, look for other statistical measurements to be reported in conjunction with it, and consider the actual size of any effect reported.
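Here is a rough sketch of how the same tiny effect can land on either side of the .05 line depending only on how many people are studied. All the numbers are hypothetical: a 0.1 kg average difference in weight loss, with individual results varying by about 5 kg. The calculation is a simple two-sided z test that assumes the spread is known in advance, which real studies rarely get to do.

```python
from math import sqrt
from statistics import NormalDist

def two_sided_p(mean_difference, std_dev, n_per_group):
    """P value from a two-sample z test with equal group sizes and a known,
    shared standard deviation (a simplifying assumption)."""
    standard_error = std_dev * sqrt(2 / n_per_group)
    z = abs(mean_difference) / standard_error
    return 2 * (1 - NormalDist().cdf(z))

mean_difference = 0.1  # kg, hypothetical
std_dev = 5.0          # kg, hypothetical

# Effect size (Cohen's d): the difference measured in standard deviations.
effect_size = mean_difference / std_dev

p_small_study = two_sided_p(mean_difference, std_dev, n_per_group=50)
p_large_study = two_sided_p(mean_difference, std_dev, n_per_group=100_000)

print(f"effect size:            {effect_size:.2f}")
print(f"P with 50 per group:      {p_small_study:.3f}")
print(f"P with 100,000 per group: {p_large_study:.6f}")
```

The effect size is identical in both cases, and it is tiny. Only the sample size changed. That is why the size of the effect deserves at least as much attention as the P value.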
Keep Asking Questions
Maybe we’re trying to use science to make sound decisions. Even if not, it’s nice to follow science news with some understanding of what results actually mean. Interpreting research can be intimidating, but doesn’t have to be.
This isn’t everything there is to know, of course. It’s a start. Be willing to learn the basics of the scientific method if you want to be science-literate. Don’t accept presented conclusions as fact. Ask your own questions. Be open to new ideas for sure, but require sound evidence before adopting those new ideas as your own.
You can also congratulate yourself at this point. With just this bit of foundation, you’re better prepared to distinguish good science from junk science than most on the internet passionately arguing about it. Go forth and get your science on!
References

[1] Bohannon, J. (2015, May 27). I Fooled Millions Into Thinking Chocolate Helps Weight Loss. Here’s How. Retrieved from https://io9.gizmodo.com/ifooledmillionsintothinkingchocolatehelpsweight1707251800.

[2] Nuzzo, R. (2014, February 12). Scientific method: Statistical errors. Retrieved from https://www.nature.com/news/scientificmethodstatisticalerrors1.14700.

[3] Bradford, A. (2017, July 26). What Is a Scientific Hypothesis? Definition of Hypothesis. Retrieved from https://www.livescience.com/21490whatisascientifichypothesisdefinitionofhypothesis.html.

[4] Baker, M. (2016, March 7). Statisticians issue warning over misuse of P values. Retrieved from https://www.nature.com/news/statisticiansissuewarningovermisuseofpvalues1.19503.