Finding Bootstrap Confidence Intervals for Effect Sizes With BootES

BootES (pronounced “booties”) is a free software package for R that computes both unstandardized and standardized effect-size measures for most experimental research designs and finds bootstrap confidence intervals (CIs) for those effect sizes (Gerlanc & Kirby, 2015). We developed bootES to help fill the gap between the data-analysis methods that existing software offers and current recommendations for best practices. Our work follows in the footsteps of a long line of proponents of what has come to be called the “new statistics” (see Geoff Cumming’s article in the 2014 “March Methodology Madness” issue of the Observer).

The tools of reform. Null-hypothesis significance testing (NHST) is logically invalid as an inferential method, and the p values it yields mislead us about the magnitudes of effects because they are functions of sample size (see, e.g., Loftus, 1996). Consequently, current best practice requires that psychological scientists report:

  • effect-size estimates, which are not affected by sample size; and
  • their CIs, which communicate the precision of the effect-size estimate and contain all the information in a p value and more.

Many research journals, including Psychological Science, are urging researchers to shift from reliance on NHST to effect-size estimation and other preferred techniques (Eich, 2014, p. 5) and are setting the reporting of appropriate effect sizes and confidence intervals as minimum expectations for publication. So why have so few published research articles in those journals met these expectations?

We believe the biggest roadblock has been the lack of appropriate statistical software. The big commercial stats programs were all developed during the heyday of NHST. Most do not even report standardized effect sizes for many common research designs, such as mixed-factorial designs. And when they compute CIs at all, they typically use traditional CI methods that are known to have poor coverage (i.e., their 95% CIs do not actually cover the population value 95% of the time).

As a consequence, psychological scientists until recently have lacked readily available tools for meeting even “minimum expectations,” let alone best practices. Critics of NHST have had difficulty gaining momentum for reform, in part because the lack of viable alternatives has provided an excuse to stick with tradition. One of our goals in creating bootES was to provide a viable alternative to NHST, helping to bring best practices within everyone’s reach.

Why bootstrap CIs? Bootstrap CI methods have an advantage over other methods in that they do not assume that the data are drawn from a normal distribution, or even that the shape of the distribution is known. Instead, with bootstrap methods one approximates the unknown distribution from the data sample itself. A program repeatedly draws random samples with replacement (called resamples) from the original data sample and calculates the desired effect size anew for each resample. For example, the program might draw a resample, calculate an effect size such as Cohen’s d, and repeat this 2,000 times, yielding 2,000 Cohen’s ds. The distribution of these 2,000 effect-size estimates then serves as an empirical approximation of the sampling distribution of the effect size, and we can use this distribution to find a CI for the population effect size.
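
To make the resampling step concrete, here is a minimal illustrative sketch in base R (not bootES’s internal code) that bootstraps Cohen’s d for two simulated groups:

    # Illustrative sketch only -- not bootES's internals. Bootstrap Cohen's d
    # for two hypothetical groups by resampling each group with replacement.
    set.seed(1)                        # for reproducible resampling
    x <- rnorm(30, mean = 11, sd = 4)  # hypothetical group 1
    y <- rnorm(30, mean = 9,  sd = 4)  # hypothetical group 2

    cohensD <- function(a, b) {        # pooled-SD standardized mean difference
      sp <- sqrt(((length(a) - 1) * var(a) + (length(b) - 1) * var(b)) /
                   (length(a) + length(b) - 2))
      (mean(a) - mean(b)) / sp
    }

    ds <- replicate(2000, cohensD(sample(x, replace = TRUE),
                                  sample(y, replace = TRUE)))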

For example, to find a 95% CI using the percentile method, we simply put the 2,000 resampled effect sizes in rank order and locate the values at the 2.5th and 97.5th centiles, which bound the middle 95% of the distribution. Such CIs make no assumptions about the shape of the sampling distribution of the effect size. (Although the percentile method is more intuitive, bootES actually uses by default the bias-corrected-and-accelerated [BCa] method, which tends to have better coverage than the percentile method.)
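
Continuing the sketch above, the percentile CI is simply the pair of quantiles that bound the middle 95% of the 2,000 resampled values:

    # Percentile method: locate the values at the 2.5th and 97.5th centiles
    quantile(ds, probs = c(0.025, 0.975))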

In most real-world research, the shapes of the population distributions of data and effect sizes are unknown. Thus, we believe the best practice in most research is to report bootstrap CIs. In cases when one has good reason to believe that the data are normally distributed, we recommend “exact CI” methods, as implemented in the pioneering software by APS Fellow Geoff Cumming (2016), Ken Kelley (2007), and James H. Steiger and Rachel T. Fouladi (1992). Although the advantages of bootstrap and exact methods have been known for decades, it is only in recent years that personal computers have become fast enough to make them practical for everyday data analyses. BootES makes bootstrap methods readily available, and for typical data sets the computations take at most a few seconds.

Why R? The first author (Kris) began teaching bootstrap CI methods in his undergraduate classes using customized functions written in commercial software. However, he wanted to enable his students to take their skills with them when they graduated and to use those skills anywhere without an expensive site license. The second author (Dan) encouraged him to implement his functions in R, a free, open-source, cross-platform statistical language and environment running on macOS, Windows, and Linux operating systems (R Development Core Team, 2016). Teaching students to perform data analyses in R means they will be able to employ those skills for free wherever they go. Eventually, Dan took over the programming of bootES in R; together we greatly expanded its initial functionality, and Dan assembled it into a package available to all R users.

This brings up another huge advantage of using R over commercial software: The Comprehensive R Archive Network (CRAN, 2017) team allows R developers to submit to their package repository specialized packages that meet CRAN’s quality-control standards. Once these packages are deposited, R users can easily download and install them within R — also for free — for their own use. A large number of wonderful packages for psychological scientists can be found in the repository, such as APS Fellow William R. Revelle’s (2017) psych package and Kelley’s (2016) MBESS package — and, of course, bootES. At this writing, bootES has been downloaded more than 13,000 times just from the one mirror site that tallies such numbers.
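
For instance, installing bootES and loading it into a session takes just two commands:

    install.packages("bootES")   # one-time download from a CRAN mirror
    library(bootES)              # load the package for the current session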

Some may be intimidated by R’s command-line interface and its reputation for a steep learning curve. Indeed, R’s online documentation is esoteric and can be hard to follow. However, R is easy to learn when one is shown how to use it. In Appendix 1 of our article describing bootES, we provided examples of most of the R commands that a data analyst needs, from importing the data to saving the results (Kirby & Gerlanc, 2013). With just a little experience, even first-semester undergraduate students can copy, paste, and edit example commands as easily and efficiently as they can search through drop-down menus and dialog boxes to do analyses in commercial software. Users can then gradually learn to use more of R’s powerful built-in functionality, including its publication-quality graphics.
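
As a flavor of that workflow, the first steps of an analysis might look like the following (the file name here is hypothetical):

    myData <- read.csv("myStudy.csv")   # import the data from a CSV file
    head(myData)                        # inspect the first few rows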

We put a great deal of effort into making bootES as user-friendly as possible. All functions are invoked by a single command, and bootES makes educated guesses based on the data structure about the type of effect size and CI that the user wants, requiring the user to specify a minimum of options. For example, with the command bootES(myVariable), bootES finds the mean of the single variable myVariable and its 95% CI based on 2,000 replications. In contrast, when one submits a data frame containing exactly two columns, as in bootES(data.frame(variable1, variable2)), bootES finds the correlation between variable1 and variable2 and the 95% CI for that correlation.
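
Putting those defaults together, a minimal session might look like this sketch, in which the data are simulated purely for illustration:

    library(bootES)

    set.seed(1)
    myVariable <- rnorm(30, mean = 5, sd = 2)   # hypothetical sample of 30
    bootES(myVariable)                          # mean and its 95% bootstrap CI

    variable1 <- rnorm(30)
    variable2 <- variable1 + rnorm(30)          # correlated by construction
    bootES(data.frame(variable1, variable2))    # correlation and its 95% CI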

Of course, optional arguments allow one to change the defaults, such as by selecting among several standardized effect sizes. For example, to find Hedges’s g for the mean of the effects represented by myVariable, one simply specifies the effect type with an option, as in bootES(myVariable, effect.type="hedges.g"). In Kirby and Gerlanc (2013), we show how to change all of the defaults and explain how to use additional arguments to find more complex effect sizes, such as those for contrasts in between-subjects, within-subjects, and mixed-factorial designs. BootES also finds CIs for slopes and for differences between correlations.
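
For example, a two-group comparison might look like the following sketch; the column names and data are hypothetical, and the arguments follow Kirby and Gerlanc (2013):

    library(bootES)

    set.seed(2)
    myData <- data.frame(
      score     = c(rnorm(20, 10, 3), rnorm(20, 12, 3)),   # simulated scores
      condition = rep(c("control", "treatment"), each = 20)
    )

    # Hedges's g for the treatment-minus-control contrast, with its 95% CI
    bootES(myData, data.col = "score", group.col = "condition",
           contrast = c(control = -1, treatment = 1),
           effect.type = "hedges.g")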

Why contrasts? Consistent with long-standing best-practice recommendations, bootES computes effect sizes only for 1-degree-of-freedom (df) effects, also known as contrasts or focused comparisons (see, e.g., Rosenthal & Rosnow, 1991, pp. 467–469, 486). Multi-df effects, also called omnibus effects or unfocused comparisons, are ambiguous in their interpretation, rarely of scientific interest, and superfluous when accompanied by a priori or post hoc contrasts. Thus, when one has more than two conditions for comparison, bootES encourages best-practice analyses by requiring the user to define within-subjects contrasts in the data set itself (using a spreadsheet or R) and to define between-subjects contrasts with arguments to the bootES command, which allows bootES to resample within conditions properly. This approach effectively reduces all factorial designs to contrasts on one-way designs, and we believe its transparency and interpretational clarity lead to better scientific inferences than does the automatic generation of the full output of default effects in factorial ANOVAs.
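
As a sketch of the within-subjects case, one computes each participant’s 1-df contrast score first and then bootstraps it like any single variable (the data here are simulated):

    library(bootES)

    set.seed(3)
    pre  <- rnorm(25, mean = 50, sd = 10)     # hypothetical repeated measures
    post <- pre + rnorm(25, mean = 5, sd = 8)

    changeScore <- post - pre                 # the 1-df within-subjects contrast
    bootES(changeScore)                       # mean change and its 95% CI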

Most ambitiously, we see bootES not as a supplement to the big commercial programs but as their replacement. With bootES and with knowledge of how to compute contrasts in factorial designs, a psychological scientist can find effect sizes and CIs for just about any type of experimental design that he or she will encounter. When best practices are free, why pay for second best?

References

Comprehensive R Archive Network (CRAN) Packages (2017). Retrieved from https://cran.r-project.org/web/packages/

Cumming, G. (2014, March). There’s life beyond .05. Observer, 27, 19–21.

Cumming, G. (2016). ESCI (Exploratory software for confidence intervals). Computer software. Retrieved from http://thenewstatistics.com/itns/esci/

Eich, E. (2014). Business not as usual. Psychological Science, 25, 3–6.

Efron, B. (1979). Bootstrap methods: Another look at the jackknife. Annals of Statistics, 7, 1–26.

Gerlanc, D., & Kirby, K. N. (2015). bootES: Bootstrap effect sizes (Version 1.2). Computer software. Retrieved from https://cran.r-project.org/package=bootES

Kelley, K. (2007). Methods for the behavioral, educational, and social sciences: An R package. Behavior Research Methods, 39, 979–984.

Kelley, K. (2016). MBESS: The MBESS R Package. (Version 4.1.0). Computer software. Retrieved from https://cran.r-project.org/package=MBESS

Kirby, K. N., & Gerlanc, D. (2013). BootES: An R package for bootstrap confidence intervals on effect sizes. Behavior Research Methods, 45, 905–927. [A preprint of the article, with errata corrections, can be obtained from https://cran.r-project.org/web/packages/bootES/bootES.pdf]

Loftus, G. R. (1996). Psychology will be a much better science when we change the way we analyze data. Current Directions in Psychological Science, 5, 161–171.

R Development Core Team (2016). R (Version 3.3.2). Computer software. The R Project for Statistical Computing. Retrieved from https://cran.r-project.org/index.html

Revelle, W. (2017). psych: Procedures for psychological, psychometric, and personality research (Version 1.6.12). Computer software. Retrieved from https://cran.r-project.org/package=psych

Rosenthal, R., & Rosnow, R. (1991). Essentials of behavioral research: Methods and data analysis (2nd ed.). New York, NY: McGraw-Hill.

Steiger, J. H., & Fouladi, R. T. (1992). R2: A computer program for interval estimation, power calculation, and hypothesis testing for the squared multiple correlation. Behavior Research Methods, Instruments, & Computers, 24, 581–582.