With the Ability to Calculate Power Comes Great Responsibility

Pete VanZandt · Geek Culture · Jun 21, 2021

What is Statistical Power?

Statistical power sounds pretty impressive, and when shrouded in the mystery of statistical calculations it can take on an even higher level of perceived importance. Briefly, power is the probability of rejecting a null hypothesis when it should really be rejected (that is, when the true difference exists — which, of course, can’t be known for anything besides a simulation). For example, imagine we’re comparing the average fuel efficiency (mpg) of a bunch of hybrid cars to a collection of regular gas-powered cars. In this silly example, the null would be that there is no difference between the averages of the two types of cars, and the alternative would be that there is indeed a somewhat consistent and reliable difference between the two. Here, the statistical power would be the likelihood that we could detect that difference statistically. Obviously, we’d want to have a strong ability to detect a difference, because we’d look like idiots if we couldn’t see a distinct difference like this, right?
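To make the definition concrete, here is a minimal simulation sketch in Python (the mpg numbers are made up purely for illustration): it repeatedly draws samples for the two hypothetical groups of cars, runs a t-test on each draw, and reports the fraction of draws that reject the null, which is exactly the power under those assumed conditions.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)

# Illustrative assumptions (not real data): hybrids average 50 mpg,
# gas cars average 30 mpg, both with a standard deviation of 8 mpg.
mu_hybrid, mu_gas, sigma, n_per_group = 50, 30, 8, 20
n_sims, alpha = 5_000, 0.05

rejections = 0
for _ in range(n_sims):
    hybrids = rng.normal(mu_hybrid, sigma, n_per_group)
    gas_cars = rng.normal(mu_gas, sigma, n_per_group)
    _, p_value = stats.ttest_ind(hybrids, gas_cars)
    rejections += p_value < alpha

# Power = proportion of simulated experiments that rejected the null.
# With a difference this large, it comes out at (or extremely close to) 1.
print(f"Estimated power: {rejections / n_sims:.3f}")
```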

The calculation of power for a t-test discussed here is taken from this fine article by Lucile Lu on the use of power in designing A/B tests.
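The image itself isn't reproduced in this text version, so here, written in LaTeX, is a standard textbook form of the power of a two-sided, two-sample test at significance level α; it may not match Lu's notation exactly, but it contains the same terms discussed below:

```latex
\text{Power} \approx \Phi\left( \frac{|\mu_1 - \mu_2|}{\sqrt{\sigma_1^2/n + \sigma_2^2/m}} - z_{1-\alpha/2} \right)
```

where Φ is the standard normal cumulative distribution function and z_{1-α/2} is the critical value of the test.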

I’m no mathematician, but looking at this equation can provide a little guidance on what makes for a powerful test. While power ranges from 0 to 1, it increases with the following (each point is illustrated numerically in the short sketch after this list):

  • increasing sample size (n or m in this equation). In our car example, this would amount to comparing the mpg of dozens to hundreds of cars of each type. This variable is usually the one most under the control of the experimenter.
  • a larger difference between the mean values that you’re trying to detect (the mu1 & mu2 terms in the numerator). In our car example, the average difference in mpg is likely to be really large, whereas if we compared gas-powered SUVs to big trucks, the difference (if any) would be pretty small.
  • smaller variation among the data (signified by sigma squared terms in the denominators). In our car example, we might see low variation among all of the hybrid cars if they were all very similar in mpg. Conversely, we might see large variation in our gas-powered vehicles if we calculated mean mpg of a wide variety of cars: from compacts to big SUVs.
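A quick way to see all three levers in action is to plug numbers into an off-the-shelf power calculator. This sketch uses statsmodels’ TTestIndPower; the specific effect sizes and sample sizes are arbitrary, chosen only to show which direction each lever pushes power:

```python
from statsmodels.stats.power import TTestIndPower

analysis = TTestIndPower()

# effect_size is Cohen's d: (mu1 - mu2) / sigma, so a bigger mean
# difference OR a smaller sigma both show up as a larger effect size.
print("baseline:          ", analysis.power(effect_size=0.5, nobs1=20, alpha=0.05))
print("more samples:      ", analysis.power(effect_size=0.5, nobs1=80, alpha=0.05))
print("bigger difference: ", analysis.power(effect_size=1.0, nobs1=20, alpha=0.05))
print("less variation:    ", analysis.power(effect_size=0.8, nobs1=20, alpha=0.05))
```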

There are a couple of main applications of the process of calculating power, and the difference between the two lies in where the estimates of variation come from.

Application 1: Experimental Design — First, when you’re designing a study you would want to estimate the number of samples needed to achieve a particular power (often targeted at 0.8 for a “powerful study”). You might also change your study design to increase the difference in the study’s outcome. The biggest impact of increasing sample size, though, is that the uncertainty in the estimated means (the standard error) shrinks as more samples are taken, following the Central Limit Theorem (which is nicely described with intuitive figures by Zijing Zhu here). When performing this kind of power analysis, the estimates of variation are typically based on preliminary studies or taken from the literature. Power analyses done in this way, for the purpose of designing a study, are often referred to as a priori, because they are done before the study is conducted. An example of this is estimating how many users a company needs to enroll in an A/B test. In our car example, we might look at the mpg of a handful of cars in each category before we decide to take on a huge sampling protocol. (By the way, a nice discussion of estimating power when designing a study is presented by Levine et al.)
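For this a priori case, the usual question is “how many samples per group do I need to reach a target power of 0.8?” Here is a minimal sketch, assuming the effect size can only be guessed from pilot data or the literature (the value of 0.4 below is purely illustrative):

```python
from statsmodels.stats.power import TTestIndPower

# Assumed (illustrative) standardized effect size, Cohen's d,
# guessed from pilot data or published studies.
effect_size = 0.4

n_per_group = TTestIndPower().solve_power(
    effect_size=effect_size,
    power=0.8,            # the conventional target mentioned above
    alpha=0.05,
    alternative="two-sided",
)
print(f"Samples needed per group: {n_per_group:.0f}")  # roughly 100 for d = 0.4
```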

Application 2: Interpreting the results of a study — The second use of power analysis has been to calculate the power of a study that didn’t show a significant effect. In these cases, researchers fail to find an effect of interest and are interested in better understanding what happened. Because these types of power analyses are done using the actual sample sizes, means, and variances observed from the study they just performed, they are sometimes called “post-hoc power”, “observed power”, “retrospective power”, and a variety of other names (O’Keefe). Andrew Gelman has even compared it to a “shit sandwich”. As the previous metaphor suggests, this kind of power analysis is not well loved by statisticians.

Power calculations are “of vital importance before the experiment, (but) are essentially meaningless once the experiment has been done.” (J. W. Tukey, as cited in Kraemer et al.)

There is a long history of misuse, misapplication, and misinterpretation of post-hoc power across disciplines, including psychology, ecology, veterinary surgery, human medicine, and teaching research (reviewed in Hoenig & Heisey). It is even quite common for reputable scientific journals to request that post-hoc power analyses be performed on manuscripts that contain non-significant results (Levine et al., Zhang et al.). And although the problems with calculating post-hoc power have been known by statisticians for a long time, its misapplication keeps popping up in one field after another, year after year.

Data science is a relatively new discipline, but the post-hoc power problem already seems entrenched in the community. For example, power calculations are suggested as a way to “comment on the confidence one might have in the conclusions drawn from the results of the study” (Brownlee), or “to comment on the confidence that one might have in the conclusions drawn from the results of an experiment or a study” (McCullum) (wait, why are these phrases almost word-for-word the same? Weird!). Post-hoc power is also supposed to be helpful when “evaluating the results” or in order to “report confidence in the conclusions drawn from the results of an experiment” (Lewinson). We are even encouraged, as data scientists, to be prepared to conduct such power analyses because future employers will expect us to do so. Unfortunately, this approach is just as flawed as it was 30 years ago.

What’s Wrong with Post-hoc Power?

One issue with post-hoc power is more of a psychological problem. Frequently, researchers perform a power analysis after failing to reject a null hypothesis, and they are tempted to interpret high power, coupled with a non-significant result, as evidence that the null hypothesis must be true (instead of the more conservative and appropriate interpretation of simply failing to reject the null). Alternatively, when the null hypothesis isn’t rejected, a low post-hoc power estimate might suggest to the authors that there really was a significant difference, but that they just didn’t have enough samples to prove it: a common desire, given all the work that went into the design and execution of the study (and one consistent with lots of cognitive biases we all share). We’ve all seen this in practice when someone says, “I just didn’t have enough samples to find the difference”, to which the reply is, “maybe the difference just wasn’t there to be found!”

Besides the psychological interpretive issues, there are also several statistical problems with the application and interpretation of post-hoc power. First, several authors note that there is a 1:1 relationship between observed power and the p-value obtained from a test, which means a post-hoc power analysis provides no information that isn’t already contained in the p-value (see the discussion, nice graphics, and key Simpsons quote from Enthusiasm Curbed). Fundamentally, a large (non-significant) p-value already says that the sample size wasn’t big enough to detect the effect size you actually observed in your experiment, so conducting a power analysis based on the observed variances and means won’t tell you anything new (O’Keefe).
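That 1:1 relationship is easy to check numerically. For a two-sided z-test, “observed power” can be written purely as a function of the observed p-value and α, as in this sketch (the particular p-values are arbitrary):

```python
from scipy.stats import norm

def observed_power(p_value, alpha=0.05):
    """'Post-hoc' power of a two-sided z-test, computed by plugging the
    observed effect back in -- it depends only on the p-value and alpha."""
    z_obs = norm.isf(p_value / 2)    # |z| implied by the two-sided p-value
    z_crit = norm.isf(alpha / 2)     # critical value, ~1.96 for alpha = 0.05
    return norm.sf(z_crit - z_obs) + norm.cdf(-z_crit - z_obs)

for p in [0.01, 0.049, 0.05, 0.2, 0.5, 0.9]:
    print(f"p = {p:0.3f}  ->  observed power = {observed_power(p):.3f}")
# Note: p-values at or above 0.05 never give observed power above ~0.5,
# which is the point made in the next paragraph.
```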

Even more problematic is that non-significant findings will pretty much always equate to low observed power (Hoenig & Heisey), but statistically significant results will produce high power. In fact, non-significant results pretty much never have observed power greater than 0.5, and when they do the excess is slight (Chow). Moreover, the estimates of post-hoc power are highly variable and their variation is almost never considered (see discussion in Enthusiasm Curbed, Gelman, and especially Chow). Given this, post-hoc power can be characterized as “conceptually flawed” and “analytically misleading” (Zhang et al.). Even 20 years ago, it was considered “an idea whose time has passed” (Levine et al.).

Alternatives to Post-hoc Power

What are the suggested solutions, then? The first one is fairly obvious: only conduct a priori power analyses, since post-hoc power at best adds no useful information and at worst is misleading. Even in this more justified case, it’s important to recognize that these estimates of power are not concrete and precise; they are themselves variable and subject to estimation error. Basically, there’s no guarantee that an extremely powerful test will detect a significant difference. Beware, though: there are pitfalls even in using the estimates of variance from a pilot study to estimate a priori power (Kraemer et al.).
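As a rough illustration of that last point (with invented numbers), the sketch below re-runs many small pilot studies and shows how much an a priori power estimate for one fixed design can bounce around simply because the pilot’s variance estimate is noisy:

```python
import numpy as np
from statsmodels.stats.power import TTestIndPower

rng = np.random.default_rng(1)
true_sigma, true_diff, n_planned = 8.0, 4.0, 50   # all illustrative values

powers = []
for _ in range(1_000):
    pilot = rng.normal(0, true_sigma, size=10)    # a 10-sample pilot study
    d_hat = true_diff / pilot.std(ddof=1)         # estimated Cohen's d
    powers.append(TTestIndPower().power(effect_size=d_hat,
                                        nobs1=n_planned, alpha=0.05))

true_d = true_diff / true_sigma
print(f"Power estimates range from {min(powers):.2f} to {max(powers):.2f}")
print(f"Power if sigma were known exactly: "
      f"{TTestIndPower().power(effect_size=true_d, nobs1=n_planned, alpha=0.05):.2f}")
```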

Second, instead of post-hoc power, statisticians recommend basing the interpretation of a study’s results on confidence intervals (Hoenig & Heisey, Levine et al., Chow), since confidence intervals are easier to interpret and less prone to misinterpretation, especially because they are designed to show variability while post-hoc power calculations are not. Confidence intervals are still prone to misinterpretation (as are p-values), so they aren’t a silver bullet for this problem.
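As a sketch of what that reporting looks like (with made-up data where the true group means really are equal), a confidence interval for the difference in means can be presented directly in place of an observed power figure:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# Made-up data where the true group means are actually identical
group_a = rng.normal(30, 8, 25)
group_b = rng.normal(30, 8, 25)

diff = group_a.mean() - group_b.mean()
var_a = group_a.var(ddof=1) / len(group_a)
var_b = group_b.var(ddof=1) / len(group_b)
se = np.sqrt(var_a + var_b)

# Welch-Satterthwaite degrees of freedom for the two-sample t interval
df = (var_a + var_b) ** 2 / (var_a**2 / (len(group_a) - 1) + var_b**2 / (len(group_b) - 1))
t_crit = stats.t.ppf(0.975, df)

print(f"Difference in means: {diff:.2f}")
print(f"95% CI: ({diff - t_crit * se:.2f}, {diff + t_crit * se:.2f})")
# An interval that straddles zero says directly how large an effect the
# data can and cannot rule out -- no post-hoc power calculation needed.
```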

Finally, the simplest and most practical solution might be to accept that the effect you were looking for didn’t really exist, at least not in the form or relationship you expected. Failure in science (data science included) isn’t a bad thing; it just means you didn’t understand your system as well as you thought. Maybe the answer is to dig into the question a bit more, look for lurking variables, interactions, or hidden relationships, and try the study again.

Pete VanZandt
Geek Culture

Lifelong learner; neither botanist nor statistician; fan of moths