P-Hacking

[Last Updated August 28th 2024]
P-hacking, or data dredging, refers to the practice of altering statistical data analysis to find or fit a desired pattern. Sometimes this happens unintentionally: a researcher explores and analyzes data in various ways, finds a desired result, and unconsciously invents a story to justify the legitimacy of their findings. Other times it is intentional, with a researcher selectively choosing data in order to elicit a desired outcome. Often, motivations are less clear. For example, data mining software companies are incentivized to create products that can produce statistically significant results. Thus, one might inadvertently practice data dredging through the technology they choose, without acknowledging that the technology may not produce reliable evidence for a hypothesis on its own.

One might expect most researchers to be honest, but a study by John et al. (2012) that surveyed over 2,000 psychologists found that many admitted to various forms of p-hacking (what the authors call “questionable research practices”). For example, 38.2% of psychologists admitted to deciding whether to exclude data after looking at how doing so affected the results of a study, and 0.6% admitted outright to falsifying data. Additionally, over 50% of those surveyed admitted to gray-area practices like not reporting all of a study's dependent measures or checking for statistical significance to decide whether more data should be collected.

In addition to p-hacking, issues like the statistical power of a study and bias can lead to false results. This is best explained by one of the most-read academic essays in history, “Why Most Published Research Findings Are False” by Ioannidis (2005). Further, HARKing (Hypothesizing After the Results are Known), discussed later on this page, can also lead to problematic results and misrepresentations in science.
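
To get a feel for Ioannidis's argument, the short Python sketch below applies his positive predictive value (PPV) formula (without the bias term): the probability that a statistically significant finding is actually true depends heavily on statistical power and on the prior odds that the tested hypothesis is true. The specific power and prior-odds numbers are our own illustrative assumptions, not figures from the paper.

```python
# A back-of-the-envelope sketch of Ioannidis's (2005) argument, using his positive
# predictive value (PPV) formula without the bias term. The power and prior-odds
# values below are illustrative assumptions, not figures from the paper.
def positive_predictive_value(power: float, alpha: float, prior_odds: float) -> float:
    """PPV = (1 - beta) * R / ((1 - beta) * R + alpha)."""
    true_positives = power * prior_odds
    false_positives = alpha
    return true_positives / (true_positives + false_positives)

# A well-powered study testing a hypothesis as likely to be true as not:
print(positive_predictive_value(power=0.80, alpha=0.05, prior_odds=1.0))  # ~0.94

# An underpowered, exploratory study where few tested hypotheses are true:
print(positive_predictive_value(power=0.35, alpha=0.05, prior_odds=0.1))  # ~0.41
```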

With these issues in mind, it is important to take a critical view of any single study. Instead, you should always look for replication, or for multiple studies that demonstrate a hypothesis through different methodologies. It is also important to consider the methodology used, as well as the sample size and statistical power. For the same reasons, be critical of the many blog posts that apply behavioural science to business practices, as they often rely on only one or two studies and may ignore other studies that did not find similar results. This allows for oversimplification and catchy blog titles that attract views and feel useful to readers, but in turn provide you with unrealistic strategies that often do not work, or that require strict conditions for implementation and a nuanced understanding of behavioural science that the authors do not discuss.

Some Examples of P-Hacking

Below we provide some examples of the most common p-hacking methods. For a more detailed summary that includes a greater array of p-hacking strategies and technical descriptions, we suggest reading Stefan and Schönbrodt (2023), which you can find here: https://royalsocietypublishing.org/doi/10.1098/rsos.220346. Alternatively, Pete Judo (2023) has created an amazing video on YouTube that discusses some examples of p-hacking, titled “6 Ways Scientists Fake Their Data”, which you can view here: https://www.youtube.com/watch?v=6uqDhQxhmDg

Selective Reporting of Dependent and/or Independent Variables

Many researchers run various tests or collect various measurements for different dependent variables while conducting a study. They then pick and choose which dependent variables to report in order to produce a statistically significant result in line with their hypothesis. For example, a researcher might manipulate the amount of light in a room and have participants answer a number of different questions regarding their attention. Their intention may be to average out the results of those different questions for an overall “attention” score. But they might find that said score does not produce a statistically significant result. They can then re-analyze the data, combining it in different ways (e.g. only averaging 5 of the questions), or focusing on each individual question, until they find questions or combinations that are statistically significant. Subsequently, when publishing their results, they might exclude the data or questions that were not useful.

Similarly, a researcher might examine multiple independent variables. For example, a researcher looking at the effect of light in a room on attention might have five different experimental conditions with different amounts of light. Four of those conditions might not produce statistically significant results when compared to a control group. But if one of those conditions does, the researcher might publish the findings without discussing the four conditions that did not reach statistical significance.
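
To see why this inflates false positives, the following Python simulation is a minimal sketch of the dependent-variable version (our illustration, not taken from any of the cited papers): even with no true effect, cherry-picking the best of ten measures yields a “significant” result far more often than 5% of the time. The sample size and number of measures are arbitrary assumptions.

```python
# Illustrative simulation (our sketch, not from the cited papers): two groups are
# drawn from the SAME distribution, so there is no true effect, yet reporting only
# the "best" of several dependent variables produces false positives well above 5%.
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)

n_simulations = 5_000   # number of simulated "studies"
n_participants = 30     # participants per group
n_dvs = 10              # dependent variables measured in each study (assumed)

false_positives = 0
for _ in range(n_simulations):
    control = rng.normal(0, 1, size=(n_participants, n_dvs))
    treatment = rng.normal(0, 1, size=(n_participants, n_dvs))

    # Test every dependent variable separately and keep only the smallest p-value,
    # mimicking a researcher who reports whichever measure "worked".
    p_values = [stats.ttest_ind(treatment[:, j], control[:, j]).pvalue
                for j in range(n_dvs)]
    if min(p_values) < 0.05:
        false_positives += 1

print("Nominal false-positive rate for a single pre-specified test: 0.05")
print(f"Rate of finding at least one 'significant' DV out of {n_dvs}: "
      f"{false_positives / n_simulations:.2f}")
```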

Data Trimming / Excluding Outliers

Often when collecting data, there may be some data points that differ considerably from all other data points. Sometimes, removing these outliers is justified. For example, if a participant states that they made a mistake in following the instructions on a cognitive test, it may be justified to remove their results. However, many researchers remove outliers that are skewing their data without explanation or legitimate justification. For example, a researcher studying the relationship between social disgust and political affiliation might measure self-reported feelings of disgust in response to news articles participants are asked to read. However, some individuals do not experience social disgust and would thus rate everything as zero. If this helps a researcher find statistical significance, they may leave it in. However, if it eliminates statistical significance, they might remove it, using the justification that the participant might have just hit zero to get through the study quickly (even if that is not the case). As researchers decide what data they trim, they can find justifications (either consciously or unconsciously) to trim the data that harms their results and leave the data that helps their results.
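
The following Python simulation is a minimal sketch of this practice (our own illustration; the trimming rule and sample sizes are arbitrary assumptions): when “outliers” are removed only after an initial non-significant test, the false-positive rate climbs above the nominal 5% even though the groups never differ.

```python
# Illustrative simulation (our sketch; the trimming rule and sample size are
# arbitrary assumptions): "outliers" are removed only when the first test fails,
# which pushes the false-positive rate above the nominal 5% despite no true effect.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

n_simulations = 5_000
n_per_group = 40

def drop_extremes(x, k=2):
    """Drop the k points farthest from the group mean."""
    distance = np.abs(x - x.mean())
    return x[np.argsort(distance)[:-k]]

significant = 0
for _ in range(n_simulations):
    a = rng.normal(0, 1, n_per_group)   # no true difference between the groups
    b = rng.normal(0, 1, n_per_group)

    p = stats.ttest_ind(a, b).pvalue
    if p >= 0.05:
        # Trim "outliers" only because the untrimmed result was not significant.
        p = stats.ttest_ind(drop_extremes(a), drop_extremes(b)).pvalue
    if p < 0.05:
        significant += 1

print(f"False-positive rate with results-dependent outlier removal: "
      f"{significant / n_simulations:.2f}")
```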

Optional Stopping / Data Peeking / Stopping Rules

Studies in psychology are often run over multiple weeks, with participants coming in at different time slots and being run through the experiment by a research assistant who is blind to the study’s methodology. Optional stopping occurs when a researcher “peeks” at data as it is being collected and then chooses to stop collecting data once a statistically significant result has been achieved. Imagine a researcher intends to perform an experiment on 200 students. Every day, 20 students are brought in, and every night the researcher analyzes the data collected up to that point. For the first three days, the researcher might not find statistical significance for their desired results. But on the fourth day (after 80 students have completed the experiment), the researcher may suddenly have statistical significance and thus choose to end the study. However, this may just be a false positive, and if they continued to run the experiment, they might find that the data does not support their hypothesis.
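
The following Python simulation is a minimal sketch of the scenario above (our illustration, using the same hypothetical batch sizes): a researcher who peeks after every batch of 20 participants and stops at the first p < .05 will find a false positive far more often than 5% of the time.

```python
# Illustrative simulation (our sketch, mirroring the hypothetical numbers above):
# the researcher peeks after every batch of 20 participants and stops at the first
# p < .05. With no true effect, false positives land well above the nominal 5%.
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)

n_simulations = 5_000
batch_size = 20          # participants added per "day" (half per group)
max_participants = 200   # planned total sample

false_positives = 0
for _ in range(n_simulations):
    control, treatment = [], []
    for _ in range(max_participants // batch_size):
        control.extend(rng.normal(0, 1, batch_size // 2))
        treatment.extend(rng.normal(0, 1, batch_size // 2))

        # The researcher "peeks" at the accumulating data every night.
        if stats.ttest_ind(treatment, control).pvalue < 0.05:
            false_positives += 1
            break   # stop collecting as soon as the result looks significant

print(f"False-positive rate with optional stopping: "
      f"{false_positives / n_simulations:.2f}")
```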

HARKing

HARKing stands for Hypothesizing After the Results are Known and occurs when an individual creates a hypothesis to fit results that surprised them, pretending to have had that hypothesis from the start. This often requires storytelling to justify why they had that hypothesis, which can be problematic as it may misrepresent the existing data in a field. Further, if unexpected results occur because of a false positive, this can poison the field by introducing faulty premises that may be carried into future hypothesis construction. To better understand the impact of HARKing, it is best to read Kerr (1998), who popularized the term and described how it harms science, followed by Rubin (2022), who argues that HARKing may not be as problematic as originally thought.

Pre-Registered Studies

One of the most effective ways to combat p-hacking / data dredging and HARKing is to require pre-registration of studies. This involves submitting the study’s methodology, hypotheses, and motivation prior to running it. The submission is dated, peer reviewed, and included with the final study, and it can no longer be edited after the fact. Further, there is an expectation that the study will be published regardless of its results. However, pre-registration does not guarantee ethical behaviour, and researchers can still engage in selective reporting or simply re-run an experiment until they get the results they are seeking. For a more in-depth analysis of the benefits and pitfalls of pre-registered studies, see Yamada (2018).

Can You Trust Psychology Research?

The issues with modern science are not limited to psychology. As a result of a highly competitive field and constant pressure to publish, there are huge financial and reputational incentives to fake data. Academia as a whole is broken and has been for years. Luckily, there is growing momentum to push back against these institutional problems, and science seems to be moving in the right direction. In respect to psychology, a paper titled “Estimating the Reproducibility of Psychological Science” by Attwood et al. (2015) found that one third to one half of 100 studies published in 2008 were replicable, although often with weaker results. Thus, it is important to recognize that hypotheses supported by only one or two studies may not be reliable. However, one can be confident in findings if they base their understanding of behavioural science on studies that have been replicated multiple times, or on hypotheses that are demonstrated through converging data (e.g. multiple studies using different methods). So to put it simply, yes, you can trust psychology research. But it is important to have a nuanced understanding of the methodologies and history behind different hypotheses, theories, and ideas. Importantly, this means doing research that goes further than a blog post citing a single study before attempting to apply behavioural science. Additionally, it may be wise to avoid relying on AI tools like ChatGPT, which have derived many of their answers from problematic blog posts and news articles.

Works Cited

Attwood, A., Bahnik, S., Beyan, L., Bosch, A., Braswell, E., Brohmer, H., Brown, B., Bruning, J., Callahan, S., Chagnon, E., Christopherson, C., Cillessen, L., Clay, R., Cleary, H., Cohoon, J., Costantini, G., Alvarez, L., Cremata, E., DeCoster, J., … Zeelenberg, M. (2015). Estimating the reproducibility of psychological science. Science, 349(6251), aac4716. https://doi.org/10.1126/science.aac4716

Ioannidis, J. P. A. (2005). Why most published research findings are false. PLoS Medicine, 2(8), e124. https://doi.org/10.1371/journal.pmed.0020124

John, L. K., Loewenstein, G., & Prelec, D. (2012). Measuring the prevalence of questionable research practices with incentives for truth telling. Psychological Science, 23(5), 524-532. https://doi.org/10.1177/0956797611430953

Kerr, N. L. (1998). HARKing: Hypothesizing after the results are known. Personality and Social Psychology Review, 2(3), 196-217. https://doi.org/10.1207/s15327957pspr0203_4

Pete Judo. (2023, December 9). 6 ways scientists fake their data. [Video]. YouTube. https://www.youtube.com/watch?v=6uqDhQxhmDg

Rubin, M. (2022). The costs of HARKing. British Journal for the Philosophy of Science, 73(2), 535-560. https://doi.org/10.1093/bjps/axz050

Stefan, A. M., & Schönbrodt, F. D. (2023). Big little lies: A compendium and simulation of p-hacking strategies. Royal Society Open Science, 10, 220346. https://doi.org/10.1098/rsos.220346

Yamada, Y. (2018). How to crack pre-registration: Toward transparent and open science. Frontiers in Psychology, 9, 1831. https://doi.org/10.3389/fpsyg.2018.01831
