Powerful Myths: Common Misconceptions About Statistical Power

In the wake of the replication crisis, statistical power has become one of the central issues in debates about the quality of research. The widespread use of tests with low power is seen as a key reason many studies fail to replicate (Ioannidis, 2005; Button et al., 2013). As a result, researchers increasingly assess power not only before conducting studies, but also when critiquing results after the fact.  

Yet, despite its importance, power remains a complex statistical concept that is often misunderstood. Here, we discuss three misconceptions or “myths” that, in our view, stand in the way of an informed discussion about power and hinder the constructive critique of low- and high-power studies alike. Though our discussion highlights complexities in assessing power, we unequivocally emphasize that studies should be adequately powered to detect plausible effects whenever possible.  

Terms to Know

Power: The probability with which a statistical test will produce a significant result, under the assumption that the tested effect exists. It depends on the expected effect size, the sample size, and the chosen significance level. Which levels of power are deemed adequate will differ from situation to situation, but a power of 80% is typically seen as sufficient for most testing situations (for a concrete illustration, see the simulation sketch after these definitions).  

Effect size: A value that quantifies the magnitude of the tested relationship between variables or the difference between groups in a study. Typical examples are the correlation coefficient and the mean difference between groups.  

Frequentist statistics: The dominant school of statistical inference in psychology. Important tools of frequentist statistics are significance testing, p-values, and confidence intervals. Philosophically, it is based on the idea that probabilities may only be assigned to repeatable events. Assigning probabilities to fixed states of the world, such as the absence or presence of an effect, is seen as meaningless. 

Bayesian statistics: A school of statistical inference that has grown in popularity over the last few decades. Important concepts are prior and posterior probabilities, Bayes factors, and credible intervals. In contrast to frequentist statistics, Bayesian statistics interprets probability as a quantification of beliefs and evidence. Thus, probabilities may also be assigned to non-repeatable events, such as the absence or presence of an effect. 

Questionable research practices: A range of analytical practices that researchers may use (intentionally or unintentionally) to increase the chances of publishing their results, at the cost of substantially decreasing the validity and replicability of findings. Common examples include p-hacking (analyzing the same dataset in many different ways, until a significant effect is found) and selective reporting (only reporting those analyses that produced a significant result). 
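
These definitions can be made concrete with a short simulation: the power of a test is the long-run proportion of significant results we would obtain if the tested effect truly existed and we ran the study over and over. Below is a minimal Python sketch for a two-sample t-test; the effect size (Cohen’s d = 0.5), group size (n = 50), and significance level (.05) are illustrative assumptions, not recommendations.

```python
import numpy as np
from scipy import stats

# Illustrative assumptions: a true effect of d = 0.5 standard deviations,
# n = 50 participants per group, and a significance level of .05.
d, n, alpha, n_sims = 0.5, 50, 0.05, 10_000
rng = np.random.default_rng(2025)

n_significant = 0
for _ in range(n_sims):
    control = rng.normal(loc=0.0, scale=1.0, size=n)  # no shift
    treatment = rng.normal(loc=d, scale=1.0, size=n)  # shifted up by d SDs
    _, p = stats.ttest_ind(treatment, control)
    n_significant += p < alpha

# Power is the long-run proportion of significant results given a true effect.
print(f"Estimated power: {n_significant / n_sims:.2f}")  # ~0.70 under these assumptions
```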

Myth 1: Power is an objective feature

Common critical practice can convey the impression that a test or study has “one” power. Results are often criticized on the grounds that the tests conducted were underpowered due to a small sample size, without a clear explanation of the effect sizes for which the test’s power was too low, or of what level of power would be acceptable in the given situation. However, one test in one sample has as many different levels of power as there are effect sizes that could be of interest. A critique of power cannot be complete without a transparent discussion of which effect sizes are plausible and for which of them the test is under- or adequately powered (Morey & Lakens, 2016; Morey, 2019).  
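
To see this concretely, consider computing the power of one fixed design across a range of effect sizes. The sketch below uses the statsmodels library and a two-sample t-test with n = 50 per group and a .05 significance level; all numbers are illustrative, and any power calculator would serve equally well.

```python
from statsmodels.stats.power import TTestIndPower

# One fixed design (n = 50 per group, alpha = .05) has a different
# power for every effect size a researcher or critic might consider.
analysis = TTestIndPower()
for d in (0.1, 0.2, 0.3, 0.5, 0.8):
    pw = analysis.power(effect_size=d, nobs1=50, alpha=0.05)
    print(f"d = {d:.1f}  ->  power = {pw:.2f}")

# The very same test is severely underpowered for d = 0.1 (power ~ .08)
# yet well powered for d = 0.8 (power ~ .98).
```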

Given that the “true” effect is never known beforehand, this demand introduces a substantial level of subjectivity into the assessment. Different researchers may expect different effect sizes, as they may give different weights to existing evidence. Moreover, the acceptable level of power will differ from stakeholder to stakeholder. Consequently, we consider it crucial that these subjective aspects of power assessments be openly acknowledged, and that researchers and the evaluators of their work transparently communicate their assumptions. While this adds complexity to critiques of power, it also ensures a more nuanced and fair evaluation of research results.  

Myth 2: With low power comes low credibility

One of the most widespread notions regarding power is that significant results from tests with low power are very likely to be false positives or flukes. This prominent idea can be traced back to the works of physician and scientist John Ioannidis (Stanford University), who named it as one of the reasons why “most published findings are false” (Ioannidis, 2005). However (and although we expect it to go against the intuition of most readers), this notion has little theoretical support in either of the two prevalent schools of statistical inference. In frequentist statistics, the notion is meaningless, because the inferential framework is strictly agnostic regarding the probability of an effect being absent or present (the effect either “is” or “is not”). To make such statements, we need to adopt a Bayesian point of view, in which such a posterior probability is a well-defined concept. In Bayesian statistics, however, tests with all possible levels of power allow fine-grained statements about an effect’s probability. Perhaps surprisingly, even a significant result from a test with low power can provide a level of evidence that researchers might not want to dismiss.  
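
A back-of-the-envelope calculation shows why. If we treat only the verdict “significant vs. not significant” as our datum, Bayes’ rule yields the posterior probability that the effect exists. The sketch below is a deliberate simplification (a full Bayesian analysis would use the data themselves, not just the verdict), and the 50:50 prior is an illustrative assumption.

```python
def posterior_prob_effect(prior: float, power: float, alpha: float = 0.05) -> float:
    """P(effect exists | significant result), by Bayes' rule."""
    true_positive = power * prior          # significant and effect exists
    false_positive = alpha * (1 - prior)   # significant and no effect
    return true_positive / (true_positive + false_positive)

# With a 50:50 prior, a significant result from a test with only 30% power
# raises the probability of the effect from .50 to about .86; with 80% power,
# to about .94 -- a difference, but hardly grounds to dismiss the
# low-powered result as a likely fluke.
print(posterior_prob_effect(prior=0.5, power=0.30))  # ~0.857
print(posterior_prob_effect(prior=0.5, power=0.80))  # ~0.941
```

The same formula also shows where concerns about false positives do have bite: with a skeptical prior of .10, a significant result from the 30%-power test yields a posterior of only .40. The prior, not the power alone, does much of the work.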

We explain these issues in more detail in our publication (Lengersdorff & Lamm, 2025). Crucially, our arguments do not change the fact that it is bad research practice to conduct studies with sample sizes too small to reliably detect the effects of interest. In all likelihood, such studies will fail to produce significant findings and thus waste time and resources. However, once significant findings have been produced, the role of power in assessing their credibility becomes much more complex than commonly assumed. We suggest that results be assessed based on their actual evidentiary value, rather than taking the shortcut of dismissing results from tests that, by some standards, appear underpowered. In combination with our previous suggestion, this involves assessing the range of effect sizes for which the study would have been adequately powered, and whether these effects are realistic.  
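
In practice, such an assessment can take the form of a sensitivity analysis: given the sample size actually collected, what is the smallest effect the study could have detected with adequate power? A minimal sketch, again with statsmodels and illustrative numbers:

```python
from statsmodels.stats.power import TTestIndPower

# Sensitivity analysis: with n = 50 per group and alpha = .05, which
# standardized effect size could the study detect with 80% power?
detectable_d = TTestIndPower().solve_power(
    effect_size=None, nobs1=50, alpha=0.05, power=0.80
)
print(f"Smallest effect detectable with 80% power: d = {detectable_d:.2f}")  # ~0.57
```

Whether effects of roughly this size are realistic in the given literature is then the substantive question that authors and critics have to debate.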

Myth 3: High power protects against questionable research practices

Our previous points were based on the assumption that results are obtained without the use of questionable research practices (QRPs), such as p-hacking or selective reporting (Bakker et al., 2012). However, it is also a common notion that the results of low-powered studies are very likely to be tainted by QRPs, whereas high-powered studies are protected against their influence.  

The line of reasoning is this: Researchers who conduct studies with low power will detect the effect of interest only with a low probability. Under the pressure to publish interesting results, they are then more likely to use QRPs to turn their nonsignificant findings into significant ones. Researchers who conduct studies with high power have a higher probability of detecting effects and are less likely to be tempted to use QRPs.  

However, this reasoning works only under the assumption that the tested effects do, indeed, exist. In the arguably very common case that the hypothesized effects are null, low-powered and high-powered tests have the same high probability of producing non-significant results. Thus, the general tendency of researchers to embellish non-significant findings by using QRPs vastly dilutes the evidence we can gain from studies with all levels of power (Lengersdorff & Lamm, 2025; John et al., 2012).  

If a study shows clear signs of QRPs, scientists and reviewers should be very skeptical of its claims, even if it has high power. Relatedly, Stefan and Schönbrodt (2023) recently published a simulation study showing that a large sample size is not a protective factor against most kinds of p-hacking. Conversely, low power gives little reason to dismiss the claims of a study that was otherwise well designed and transparently reported.  
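
A toy simulation in the spirit of such work (not a reproduction of Stefan and Schönbrodt’s actual analyses) illustrates the point. Suppose the true effect is null, but a researcher measures five independent outcomes and reports a “finding” whenever any of them reaches p < .05; all numbers here are illustrative assumptions.

```python
import numpy as np
from scipy import stats

def phacked_false_positive_rate(n: int, k_outcomes: int = 5,
                                n_sims: int = 5_000, seed: int = 7) -> float:
    """False-positive rate when any of k outcome tests may be reported."""
    rng = np.random.default_rng(seed)
    hits = 0
    for _ in range(n_sims):
        group_a = rng.normal(size=(k_outcomes, n))  # the true effect is null
        group_b = rng.normal(size=(k_outcomes, n))
        p_values = [stats.ttest_ind(group_a[i], group_b[i]).pvalue
                    for i in range(k_outcomes)]
        hits += min(p_values) < 0.05  # report the "best" outcome
    return hits / n_sims

for n in (20, 200):
    print(f"n = {n:3d} per group: false-positive rate ~ "
          f"{phacked_false_positive_rate(n):.2f}")
# Both sample sizes yield a rate of roughly .23 (about 1 - .95**5),
# more than four times the nominal 5% level.
```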

Lastly, we would like to note that there is, to the best of our knowledge, no empirical evidence for a link between a study’s power and the probability that QRPs will be used in response to non-significant findings. As Stefan and Schönbrodt (2023) pointed out, cases can be made in either direction: On the one hand, null effects from high-powered studies are more convincing, and therefore easier to publish, than null effects from low-powered studies, so researchers running high-powered studies might feel less pressure to engage in QRPs. On the other hand, high power typically comes with high costs in time and funding, due to large sample sizes, so researchers might feel tempted to salvage their expected results by using QRPs, especially in a research culture in which novel findings are seen as more impactful than null findings.  

Ultimately, the credibility of a study hinges on the integrity of the research process, of which power is one of many aspects. High power does not shield against QRPs, just as low power does not automatically imply their use. To address these issues, it is crucial to move toward an open science culture in which the use of QRPs is both discouraged and disincentivized.  

Power is not an objective feature, nor does it ensure the credibility of a research finding or protect against the influence of QRPs. We deem it crucial that assessments of power move away from simplistic myths and toward a principled and open discussion of what power really tells us about the credibility of results. Our goal is not to justify low-powered studies, but to promote fair and nuanced critique of all research. 



Comments

This is a nice piece. I like the notion that a study does not have a single power and try to emphasize this with my students (and colleagues). Every study essentially has “all powers” across the complete range of effect sizes (from alpha to .999+). It’s the researchers’ task to determine what the range of plausible effect sizes might be, and from there develop sensitivity curves across the range of reasonable effect sizes and sample sizes to get a better sense of things.

It is also worth considering the flip side of power: the type 2 error rate (beta). This can be especially helpful in thinking about what negative (not “significant”) results might mean. High beta associated with a plausible (but not significant) effect size suggests you might want to reconsider what a plausible effect size is in the context of your research! But with a small enough a priori hypothesized effect size (i.e., smallest effect size of theoretical or practical importance), low beta comes as close as we can to effectively “accepting the null” (at least for that study). Just remember not to simply calculate power (or beta) on your obtained effect size only, but again, to consider a range of effect sizes.

Finally, I would add another myth, one I see expressed fairly often: high power equals high precision in parameter estimation. A large expected effect size will often lead researchers to use only as large a sample as is necessary to obtain adequate power. If these sample sizes are small, they will not yield very precise confidence intervals. Only larger sample sizes (and more reliable measures) can do that.

