How Psychological Study Results Have Changed Since the Replication Crisis Began

Science is often described as self-correcting, given its basis in empirical testing and because falsehoods put forth can, in principle, be dismantled through further experiments. Yet, when the platonic ideal of science confronts the human world—constrained by what is feasible and poisoned by perverse incentives—issues may arise.

Such concerns often serve to justify distrust in scientific research. Denial of vaccine efficacy is often supported by claims that profit-driven pharmaceutical companies overstate the usefulness of their drugs. Climate-change denial goes hand in hand with arguments that environmental researchers benefit financially from ever more alarming findings.

To a degree, academic psychology contributed to this atmosphere of skepticism via its own intense period of self-scrutiny, widely known as the “replication crisis.” Beginning around 2012, widespread reports of failures to replicate previously established findings sparked growing unease among researchers about the overall reliability and reproducibility of published psychological research. Unlike many esoteric academic debates, this conversation struck a chord beyond university walls, reaching many laypeople who learned quips like “publish or perish.” 

The field’s response to the replication crisis illustrates self-correction in practice. Over the past decade, researchers have embarked on a collective effort to strengthen the foundations of psychological science. This involved a multipronged approach: calls for larger sample sizes to increase statistical power; adjustments to statistical practices to avoid pitfalls that promote spurious findings; and changes to how studies are planned, aiming for greater systematicity with approaches like preregistration.

Have these efforts to increase replicability been effective? I investigated this issue in a new study examining 240,000 empirical psychology papers published from 2004 to 2024. For each paper, I calculated the percentage of its results crossing the traditional threshold for significance (p < .05) that only narrowly met this criterion (.01 ≤ p < .05). Historically, results based on p-values in this “barely significant” zone have displayed starkly lower rates of successful replication (e.g., Gordon et al., 2021). At an aggregate level, this simple measure provides a look at the robustness of psychological results, allowing analyses of how robustness has changed over time and how this relates to academic incentives (e.g., citations).
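As a rough illustration of the measure described above, the per-paper calculation can be sketched in a few lines of Python. This is a minimal sketch, not the study's actual pipeline: the function name `fragile_share` and the example p-values are hypothetical, and the sketch assumes the p-values have already been extracted from a paper's text.

```python
from typing import Iterable, Optional


def fragile_share(
    p_values: Iterable[float], alpha: float = 0.05, lower: float = 0.01
) -> Optional[float]:
    """Share of a paper's significant p-values (p < alpha) that are
    only "barely significant" (lower <= p < alpha).

    Hypothetical helper for illustration; returns None when the paper
    reports no significant results, since the share is then undefined.
    """
    significant = [p for p in p_values if p < alpha]
    if not significant:
        return None
    barely = [p for p in significant if p >= lower]
    return len(barely) / len(significant)


# Made-up p-values for one paper: four are significant, two of those
# fall in the .01 <= p < .05 zone.
paper = [0.001, 0.03, 0.049, 0.20, 0.004]
print(fragile_share(paper))  # 0.5
```

Averaging this share across papers by year, subdiscipline, or journal would give the kind of aggregate view of robustness the analysis relies on.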

Widespread progress and a shifting culture 

Every subdiscipline within psychology demonstrates a clear trend toward reporting statistically stronger results today compared to the mid-2000s and early 2010s. That is, fewer p-values now fall just under the traditional significance threshold. This is, in part, linked to increases in sample sizes. In social psychology, where median sample sizes hovered around 80–100 participants for much of the period between 2004 and 2014, the median has since surged to approximately 250 participants. Other fields, such as cognitive, developmental, or clinical psychology, started at different baselines, given the higher recruitment costs in these areas. Nonetheless, these areas also display marked increases in sample sizes, which are 50%–100% larger today than a decade ago.
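To see why larger samples tend to produce stronger results, it can help to compute approximate statistical power at the sample sizes mentioned above. The sketch below uses a normal approximation for a two-sided comparison of two equal-sized groups. The effect size (Cohen's d = 0.4) and the function name `two_group_power` are assumptions chosen for illustration; they do not come from the study.

```python
from math import sqrt
from statistics import NormalDist


def two_group_power(total_n: int, d: float, alpha: float = 0.05) -> float:
    """Approximate power of a two-sided, two-group mean comparison.

    Uses a normal approximation with equal group sizes (total_n / 2 each).
    Hypothetical helper for illustration only.
    """
    z = NormalDist()
    n_per_group = total_n / 2
    noncentrality = d * sqrt(n_per_group / 2)
    z_crit = z.inv_cdf(1 - alpha / 2)
    # Probability of landing in either rejection region.
    return (1 - z.cdf(z_crit - noncentrality)) + z.cdf(-z_crit - noncentrality)


ASSUMED_D = 0.4  # assumed true effect size, not a figure from the article

for total_n in (90, 250):
    print(total_n, round(two_group_power(total_n, ASSUMED_D), 2))
```

Under these assumptions, power is noticeably higher at 250 participants than at 90, and a better-powered study that detects a true effect is less likely to report a p-value that only narrowly clears .05.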

The study further shows that research reporting more robust results now tends to garner more citations and achieve publication in more competitive, higher-impact journals. Interestingly, even before the replication crisis began, papers reporting stronger findings tended to receive more citations, and this association has strengthened over time. Journal placement tells a different story: top journals historically tended to publish less robust results on average, yet this pattern has flipped in recent years. Nowadays, esteemed journals seem to apply strict standards and require high evidential strength. These are critical developments, showing how the incentive structures within the field (what gets rewarded, recognized, and amplified) are increasingly aligning with the principles of replicable science.

Differences between research areas and challenges 

This self-correcting process since the start of the replication crisis is evident in every domain of psychological science. That being said, some areas of study have improved faster than others. I investigated this issue by examining how p-values are related to word usage in papers. For instance, the term “group” was often associated with weaker p-values, likely because experiments comparing groups tend to offer less statistical power than studies focusing on within-subject effects. Altogether, this textual analysis identified hundreds of words linked to strong or weak p-values. To help disseminate these findings, I developed a website that allows users to search words and see their prevalence in psychology papers, their links to p-values, and their potential associations with other variables (e.g., citations).
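The power gap between group comparisons and within-subject designs mentioned above can be illustrated with a short calculation. This is a sketch under stated assumptions: it uses a normal approximation, the effect size (d = 0.4) and the correlation between repeated measures (r = 0.5) are made-up illustrative values, and the function names are hypothetical.

```python
from math import sqrt
from statistics import NormalDist

Z = NormalDist()


def approx_power(noncentrality: float, alpha: float = 0.05) -> float:
    """Two-sided power for a given noncentrality (normal approximation)."""
    z_crit = Z.inv_cdf(1 - alpha / 2)
    return (1 - Z.cdf(z_crit - noncentrality)) + Z.cdf(-z_crit - noncentrality)


def between_power(total_n: int, d: float, alpha: float = 0.05) -> float:
    """Participants are split into two equal groups of total_n / 2."""
    return approx_power(d * sqrt(total_n / 4), alpha)


def within_power(total_n: int, d: float, r: float, alpha: float = 0.05) -> float:
    """Every participant completes both conditions.

    r is the assumed correlation between a participant's two scores
    (must be below 1); a higher r means a larger standardized difference.
    """
    d_z = d / sqrt(2 * (1 - r))
    return approx_power(d_z * sqrt(total_n), alpha)


# Same number of participants and same assumed effect, different designs.
print(round(between_power(100, 0.4), 2))
print(round(within_power(100, 0.4, r=0.5), 2))
```

With the same 100 participants, the within-subject design has higher power in this sketch because each person serves as their own control, which is one plausible reason a word like “group” tracks weaker p-values.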

Examining word usage permits a finer-grained investigation of which scientific topics and methodologies are most associated with strong or weak results. For instance, studies on memory (my primary research area) usually display weaker results than questionnaire-based studies on personality. These differences likely stem from inherent difficulties in conducting robust memory research. The prototypical memory experiment involves showing participants a series of stimuli (e.g., pictures or words) and then later asking them to retrieve those stimuli. As memory retrieval is probabilistic, experimental designs must often contain many trials to achieve adequate statistical reliability. Memory research may also necessitate setting aside time for a retention interval to increase difficulty and avoid ceiling effects (e.g., conducting retrieval tests on a different day from encoding). These requirements add labor and monetary costs to data collection, especially if a memory study cannot leverage online crowdsourcing platforms. Costly data collection may limit the sample sizes collected and, in turn, lead to less robust studies.

Few would argue, however, that these challenges mean psychologists should conduct less research into memory. Rather, this example highlights a broader pattern: Some valuable research faces inherent logistical and methodological hurdles that can make achieving high statistical power more resource-intensive than, say, large-scale survey studies. Further, some experimental directions, despite being limited in power, may be strong in other respects, such as permitting powerful causal claims or allowing clearer applications outside the laboratory. Hence, the question becomes how best to foster and support robust research practices within areas where the work is intrinsically complex and costly. Given that these areas are nonetheless displaying marked improvements in robustness relative to a decade ago, these challenges will likely also be steadily addressed.

A bright future 

Discussions surrounding the replication crisis have sometimes veered into pessimism, occasionally portraying a narrative where systematic pressures will inevitably compromise scientific integrity. Such narratives, while highlighting real phenomena, risk painting an incomplete picture of modern scientific practice, and the meta-analysis described here demonstrates that problematic science is not an immutable destiny. Instead, the analysis showcases a scientific community actively grappling with its challenges and implementing fundamental, effective changes. Some dilemmas remain, particularly in finding optimal ways to support and conduct rigorous research in resource-intensive areas. Further challenges are also arising in other respects, such as preregistrations never being made public (Ensinck & Lakens, 2025; https://journals.sagepub.com/doi/full/10.1177/25152459241296031) or the adoption of many open-science practices remaining slow (Hardwicke et al., 2024; https://journals.sagepub.com/doi/full/10.1177/25152459241283477). Nonetheless, the self-correcting trajectory over the last decade is clear and encouraging.

References 

Bogdan, P. C. (2025). One decade into the replication crisis, how have psychological results changed? Advances in Methods and Practices in Psychological Science, 8(2).

Ensinck, E. N., & Lakens, D. (2025). An inception-cohort study quantifying how many registered studies are publicly shared. Advances in Methods and Practices in Psychological Science, 8(1).

Gordon, M., Viganola, D., Dreber, A., Johannesson, M., & Pfeiffer, T. (2021). Predicting replicability—Analysis of survey and prediction market data from large-scale forecasting projects. PLoS ONE, 16(4), Article e0248780.

Hardwicke, T. E., Thibault, R. T., Clarke, B., Moodie, N., Crüwell, S., Schiavone, S. R., … & Vazire, S. (2024). Prevalence of transparent research practices in psychology: A cross-sectional study of empirical articles published in 2022. Advances in Methods and Practices in Psychological Science, 7(4).

