Presidential Column

Preregistration, Replication, and Nonexperimental Studies

In last month’s column, I worried about whether encouraging us to preregister our hypotheses and analysis plan before running studies would stifle discovery. I came to the conclusion that it needn’t, but that we need to guard against letting the practice run away with itself. In this column, I take up a second concern about preregistration: That it seems to apply only to certain types of studies, and thus runs the risk of marginalizing studies for which preregistration is less fitting.

Preregistration is designed to ensure that if the data we collect confirm our hypotheses, those hypotheses were the ones we intended to test before the study began — and not new hypotheses we’re generating based on what we’re observing. If we are seeing patterns for the first time, we need to make it clear to ourselves, and to our readers, that the study is generating new hypotheses rather than testing old ones. In a sense, preregistration entails replication (at least conceptual replication if not exact replication; Crandall & Sherman, 2015), since the preregistered hypothesis-testing study rests on the foundations constructed on the basis of earlier hypothesis-generating studies.

Preregistration and replication lend themselves well to short-term experimental studies conducted on participants who are easy to find. But it’s just too costly or unwieldy to generate hypotheses on one sample and test them on another when, for example, we’re conducting a large field study or testing hard-to-find participants. Do we have to give up on the hope of replication and robustness for this type of study? There are two reasons not to despair.

First, some kinds of studies, by their nature, may be more robust than others. As Jon K. Maner (2015) notes, studies conducted in the field have two advantages over lab studies. The first advantage is obvious: The findings of a field study have clear relevance to the real world. The second advantage is less obvious: It is difficult to control all, or even many, of the variables in a field study. Why is this lack of control a good thing? If a phenomenon is discoverable under these messy conditions, it is likely to be a robust one that is worthy of explanation. Jean Piaget’s discoveries, which were made at home on his three infants, are a good example. Although his sample was small and therefore obviously not representative, the conditions under which Piaget made his observations varied extensively from trial to trial. Having a large number of naturalistic observations on a small number of participants can lead to robustness. In 1973, Roger Brown made his initial discoveries about language learning also by studying only three children at home talking about random topics. Piaget’s observations have stood the test of time, in part because he was a brilliant observer who could zero in on invariances that mattered, and in part because his observations came from a range of situations and thus were less likely to depend on the details of any one of those situations. Happily, this means that in areas where it is difficult to repeat a study, exact replication may not be essential in ensuring a phenomenon’s robustness.

The second reason not to despair is that there can be, and often is, replication built into observational studies — it just doesn’t get reported as such. For example, we can develop a coding system on the basis of a subset of the data, establish the reliability of the coding system, and then apply that system to the rest of the data (e.g., Goldin-Meadow & Mylander, 1991, pp. 322–324). This procedure allows us to discover hypotheses on one part of the data and test them on another part, a type of replication that can be conducted on populations that are rare or exist in difficult-to-recreate conditions.
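The split described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `split_for_discovery`, the session labels, and the 30% discovery fraction are hypothetical and do not come from the cited study. The idea is to fix the split, with a recorded random seed, before any coding begins, so that the held-out portion stays untouched while the coding system is developed.

```python
import random


def split_for_discovery(items, discovery_fraction=0.5, seed=0):
    """Shuffle items reproducibly, then split them into a discovery set
    (used to develop the coding system) and a held-out set (used to test it)."""
    if not 0 < discovery_fraction < 1:
        raise ValueError("discovery_fraction must be strictly between 0 and 1")
    shuffled = list(items)
    # A fixed seed makes the split reproducible and reportable.
    random.Random(seed).shuffle(shuffled)
    cut = round(len(shuffled) * discovery_fraction)
    return shuffled[:cut], shuffled[cut:]


# Hypothetical example: 20 recorded observation sessions.
sessions = [f"session_{i:02d}" for i in range(1, 21)]
discovery, held_out = split_for_discovery(sessions, discovery_fraction=0.3, seed=0)
print(len(discovery), "sessions for developing the coding system")
print(len(held_out), "sessions held out for testing it")
```

Reporting the seed and the fraction alongside the coding system would let readers see exactly which observations generated the hypotheses and which ones tested them.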

Discovering the right coding system (i.e., the coding system that captures what’s interesting about the data) is analogous to piloting an experimental study to find the right parameters to reveal the phenomenon. Neither procedure is cheating — it’s the discovery part of science. But perhaps researchers should be encouraged to report these steps, along with the details of the coding system in an observational study (which is typically the heart of this type of study), in supplementary materials. Doing so could save others a great deal of time and, more importantly, could provide a preliminary sense of the boundary conditions under which the phenomenon does, and does not, hold.

There is currently an effort to raise the status of replication in experimental studies and devote some of our precious journal space to making sure a phenomenon is robust across labs (e.g., Nosek & Lakens, 2014). These efforts seem reasonable to me as long as they do not become exercises in fault-finding but are seen as what they are — ways to test the robustness and generality of a phenomenon. Barring intentional fraud, every finding is an accurate description of the sample on which it was run. The question — an important one — is whether the findings extend beyond the sample and its particular experimental conditions. If we’re going to take replication seriously in experimental studies, then I suggest we do the same for studies that use other methods. For example, when using observational methods, researchers can be encouraged not only to report iterative tests of a coding system on a single sample but also to recognize these tests as the replications that they are.

What we don’t want to do is require that the procedures used to ensure robustness and generalizability in experimental studies (e.g., preregistration, multiple-group replications of a single study) be applied to all types of psychological studies, and then devalue or marginalize the studies for which the preregistration procedures don’t fit. Rather, we need to think creatively about how to achieve robustness for the wide range of methods that comprise the richness of psychological studies.

References

Brown, R. (1973). A first language. Cambridge, MA: Harvard University Press.

Crandall, C. S., & Sherman, J. W. (2015). On the scientific superiority of conceptual replications for scientific progress. Journal of Experimental Social Psychology, 66, 93–99. doi:10.1016/j.jesp.2015.10.002

Goldin-Meadow, S., & Mylander, C. (1991). Levels of structure in a communication system developed without a language model. In K. R. Gibson & A. C. Peterson (Eds.), Brain maturation and cognitive development: Comparative and cross-cultural perspectives (pp. 315–344). New York, NY: Aldine de Gruyter.

Maner, J. K. (2015). Into the wild: Field research can increase both replicability and real-world impact. Journal of Experimental Social Psychology, 66, 100–106. doi:10.1016/j.jesp.2015.09.018

Nosek, B. A., & Lakens, D. (2014). Registered reports: A method to increase the credibility of published results. Social Psychology, 45, 137–141.

Piaget, J. (1952). The origins of intelligence in children (translated by M. Cook). New York, NY: International Universities Press, Inc.

Comments

I’m glad that APS President Goldin-Meadow is bringing the issue of preregistration to the fore. But it appears to me that some of the concerns she expressed pertain to Registered Reports, not to preregistration per se. For a very detailed and thorough treatment of Registered Reports put together by Chris Chambers and others, see https://osf.io/8mpji/wiki/home/ .

Very briefly, in Registered Reports, the authors submit a rationale and a methodology for review before they conduct the proposed study, and editors decide whether or not to publish the work before the results are known (but often with certain criteria, such as passing manipulation checks and avoiding floor/ceiling effects). RRs are an exciting new path in science and I believe that they have great potential, especially for research designed to test well-specified hypotheses. But it would probably not make much sense to submit an RR for an exploratory study because the reviewers and editors would have too little basis for assessing the proposal.

It would, however, make perfect sense to preregister an exploratory study and indeed it would, I think, be a good thing to do. For a brief statement on preregistration that I wrote for Psych Science, see http://www.psychologicalscience.org/index.php/publications/journals/psychological_science/preregistration

Preregistering simply means that you put down in writing, in a repository that cannot be edited, your plans and predictions before you start collecting data (or at least before you see the data). You explicitly state, up-front, what sorts of observations you plan to collect, what analyses you plan to conduct, what (if any) hypotheses you plan to test. That doesn’t prevent you from changing your mind later, it just prevents you from doing so without knowing that you are doing so. And that’s a good thing.

Steve Lindsay

If I’m completely honest, the arguments in this second post left me even more speechless than those in the first.

A few points.

1. The author brings up the old chestnut that preregistration threatens to marginalise areas of research where it isn’t appropriate. This is about as logical as arguing that a treatment for cancer “marginalises” treatments for hepatitis.

2. Cherry picking Piaget as some kind of proof of the brilliance of observational research. How many confirmatory studies followed Piaget to verify his theory? How many comparable observational studies based on N of 3 have gone nowhere? The whole argument is hindsight bias.

3. The usual straw man objection to the proposition advanced by nobody that pre-registration should be mandatory.

4. An interesting argument for cross-validation, which is a useful tool but of course no substitute for independent replication – and there is no reason why any such replications in observational research can’t be pre-registered.

5. The extraordinary statement that “Barring intentional fraud, every finding is an accurate description of the sample on which it was run”. This amounts to a denial of the existence of unconscious p-hacking, HARKing and other forms of bias, which is a truly remarkable statement coming from the President of the APS.

6. Finally, the argument that “Preregistration and replication lend themselves well to short-term experimental studies conducted on participants who are easy to find”. This rather ironically commits the very sin it warns against by marginalising preregistration and replication. There is no scientific reason why long-term studies can’t be independently replicated or preregistered. Just imagine if other sciences worked this way — that as soon as the science got challenging, replication left the room because it was just too hard. Just imagine what kind of physics or chemistry or engineering we would have. Would you dare to catch a plane?

The notion that longitudinal research cannot or should not be preregistered is completely at odds with the practices in massive international large-scale research, such as the Programme for International Student Assessment (PISA).

While it is true that PISA does not preregister its research plans on the platforms often advertised in the current debate (osf.io or AsPredicted.org), there is a strict long-term data collection and analysis plan that all consortium members abide by (plus a number of “optional assessments” that countries can subscribe to). And for good reason: A study as large as PISA is expensive compared to a small-scale laboratory study. Running a tight ship is absolutely necessary to make the research worth our time, and to allow meaningful policy implications.

Governments certainly do not want to spend more on it than necessary, and therefore every scale and every item is thoroughly negotiated and discussed among the experts who plan the research. Generally, it might be “easier” to preregister laboratory research, but that is inherently a function of the scale of the study (i.e., small-scale laboratory research simply requires fewer resources than large-scale longitudinal studies).

If there’s any kind of research that would actually profit from preregistration (in terms of resources spent), it is longitudinal research. Fortunately, quite a few longitudinal research programs discovered that a while ago.

Dear Susan Goldin-Meadow,

Your comments have been discussed in a Facebook group dedicated to the discussion of research practices in the wake of the “replicability crisis” in psychology following Bem’s (2011) incredible evidence for time-reversed mental processes. I would like to share my own comment that focuses on a specific quote.

“These efforts seem reasonable to me as long as they do not become exercises in fault-finding but are seen as what they are — ways to test the robustness and generality of a phenomenon.”

I think it reflects a general Angst among authors of original research that somebody might conduct a replication study of one of their original studies and report a failure to replicate the original results. Of course, we would all like to be right all the time, but that is impossible when we are doing original research, especially when the work is exploring new territory. So, I think we need to acquire a new attitude towards errors. Even world class athletes in tennis make errors, and sometimes they make unforced errors. Errors are a part of life. As psychologists, we should be the first to know that ignoring errors and repressing contradictory information is just a temporary fix to deal with negative feelings that in the long run will have even more negative consequences. The whole point of science is to correct errors and in the process of doing so, more errors will be made that also need to be corrected. We can only do so if we are open to the idea that we are routinely making mistakes. It is our job to uncover errors. Einstein discovered errors in Newton’s theory. This did not mean Newton was a bad guy; noticing and correcting them was just part of the scientific process.

In psychology, we use significance tests. And significance tests already imply that we are likely to make mistakes sometimes. Every claim is made with a fixed type-I ERROR probability.

Unfortunately, psychologists are not very good at taking another error into account. Every study also has the risk of not showing an effect, if the effect is there but too small to be detected in a study with low power. This is a type-II ERROR. Psychologists have long ignored this error because journals publish only significant results, so these errors never appear in the literature. We make type-II errors routinely and fail to replicate our own findings, but we do not worry about it because these results are not reported.

In short, science is an exercise in “fault finding,” and it is not helpful when famous original researchers dismiss results of carefully conducted replication studies (see Baumeister and Strack’s responses to failed RRR in Perspectives).

It is unfortunate that your comment may be misinterpreted as implying that you also believe original results are always trustworthy and that replication studies only test the robustness and generality of a phenomenon. I think replication studies also provide important information about whether an original finding might have been a false positive result, and they can produce a much more accurate estimate of the population effect size than a small original study can.

We need to learn that a single original study with a small sample and a significant result of p < .05 only tells us that it is likely that the population effect size is not zero and in the direction of the effect in the small sample of the original study. If the p-value is close to .05, the lower limit of the 95% CI is close to 0 and it is possible that the population effect size is close to zero. Moreover, when we use this procedure again and again for thousands of studies published each year, some of the published results will be studies with a false positive result (there is no effect or the sign is in the opposite direction). Replication studies, especially those with a much larger sample size, can tell us something the original study could not tell. Was the result a false positive result and is the population effect close to zero or small, moderate, or large?
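To make the point about p-values near .05 concrete, here is a minimal Python sketch. It assumes a normally distributed effect estimate with a known standard error, which is a simplification, and the numbers (an estimate of 0.5 with a standard error of 0.25) are hypothetical and not taken from any study.

```python
from statistics import NormalDist


def two_sided_p_and_ci(estimate, std_error, confidence=0.95):
    """Return the two-sided p-value against zero and the confidence interval
    for an estimate assumed to be normally distributed."""
    standard_normal = NormalDist()
    z = estimate / std_error
    p_value = 2 * (1 - standard_normal.cdf(abs(z)))
    # Critical value for the requested confidence level (about 1.96 for 95%).
    critical = standard_normal.inv_cdf(1 - (1 - confidence) / 2)
    margin = critical * std_error
    return p_value, (estimate - margin, estimate + margin)


# Hypothetical small study: the estimate is twice its standard error.
p_value, (lower, upper) = two_sided_p_and_ci(estimate=0.5, std_error=0.25)
print(f"p = {p_value:.3f}, 95% CI = [{lower:.2f}, {upper:.2f}]")
```

With these made-up numbers the p-value falls just under .05, yet the interval runs from roughly 0.01 to roughly 0.99, so the data are compatible with a population effect that is close to zero as well as with a large one.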

In this regard, I take issue with your statement that "barring intentional fraud, every finding is an accurate description of the sample on which it was run. The question — an important one — is whether the findings extend beyond the sample and its particular experimental conditions. "

Yes, if I just report the means or correlations that were obtained in a sample, the results are an accurate description of that sample. But that is not sufficient to get a publication. The statement “In my sample of 20 participants, the correlation between two variables was r = .5 when I removed 7 participants” is an accurate description of the correlation for the selected sample of N = 13, but results like this are only published when they are accompanied by p < .05 (one-tailed), which implies a claim that the results are NOT LIMITED to the sample of N = 13, but that the sign of the relationship will replicate with other samples and generalize to other populations.

We are not just publishing descriptive statistics of our samples. We publish these results with p-values that are used to reject the null-hypothesis that our sample means and effect sizes are just sampling error, and this means every conclusion in an original study comes with a warning label. X causes Y, p < .05, means the claim is TRUE only for exact replication studies where participants are sampled from the same population and only with an error rate of 5%, where error rate means that no more than 5% of statistical tests where psychologists pressed a button to get a p-value on their computer screen will show a p-value less than .05 without a corresponding population effect size that matches the effect size in the sample.

Now it is not hard to see that psychologists conduct more statistical significance tests than they report in publications. This means we do not know the maximum error rate of significance tests that are published in original articles (see Sterling et al., 1995). One advantage of pre-registration is that it limits the number of significance tests that can be conducted. If a researcher can only report the results of one statistical test and reports the result with p < .05, the maximum probability of a false positive is 5%. If there is no pre-registration, the maximum probability of a false positive is 100%. So, what we gain from pre-registration is better error control.
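A short calculation shows how quickly error control erodes as the number of unreported tests grows. The sketch below uses the standard formula 1 - (1 - alpha)^k for the probability of at least one false positive among k tests. Note the assumptions, which go beyond the argument above: the tests are treated as independent and every null hypothesis is taken to be true. The function name is hypothetical.

```python
def familywise_error_rate(num_tests, alpha=0.05):
    """Probability of at least one false positive among `num_tests`
    independent tests when every null hypothesis is true."""
    if num_tests < 0:
        raise ValueError("num_tests must be non-negative")
    return 1 - (1 - alpha) ** num_tests


# One preregistered test versus increasing numbers of unreported tests.
for k in (1, 5, 20, 100):
    print(f"{k:>3} tests: {familywise_error_rate(k):.3f}")
```

Under these assumptions a single test keeps the rate at 5%, about 20 tests push it past 60%, and the rate approaches 100% as the number of tests grows without limit, which is the sense in which the maximum is 100%.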

This advantage of better error control is not limited to experimental studies or to studies that test a directional hypothesis. If we use two-tailed tests, we already allow for significance in both directions and do not require a directional hypothesis. To use an example from my own research, I could preregister a study where I am going to explore the relationship between extraversion and life-satisfaction and specify the sample size (ideally with a power analysis), the measures that I will use and the statistical approach of testing it. I will collect data from N = 1,000 participants, measure extraversion with the BFI with self-ratings and informant ratings by a friend and use self-ratings of Diener's SWLS as a well-being measure. I will fit a structural equation model and regress a latent life-satisfaction measure on a latent extraversion factor that captures the shared variance between self-ratings and informant ratings of extraversion as a measure of extraversion. I will also allow for an extra relationship between self-ratings of life-satisfaction and extraversion to allow for shared method variance. I can pre-register this even though it is not an experiment and I made no prediction about the direction of the relationship. In return for doing this work, I gain that I can claim that there is only a maximum 5% probability that a significant positive or negative relationship with p < .05 is a false positive result. If I don’t pre-register it, I cannot make the claim with a fixed error probability because the maximum error probability depends on the number of measures that I might also have used, the stopping rule for participants, and other degrees of freedom in the way I analyzed the data. One could even argue that it is pointless to conduct statistical tests and report significance because p < .05 will only be misinterpreted as if there is only a maximum 5% probability of a type-I error, when the true maximum type-I error probability is 100%.
I hope you find these arguments that have been made repeatedly over the past five years interesting and consider them in your reflections about pre-registration and replicability.

Sincerely, Ulrich Schimmack

Related information can be found here

Discussion group
https://www.facebook.com/groups/853552931365745/

Blogs about Replicability and Power
https://replicationindex.wordpress.com/

Replicability Rankings of Psychology Journals
https://replicationindex.wordpress.com/2016/01/26/2015-replicability-ranking-of-100-psychology-journals/

Dear Susan Goldin-Meadow,

Your comments have been discussed in a Facebook group dedicated to the discussion of research practices in the wake of the “replicability crisis” in psychology following Bem’s (2011) incredible evidence for time-reversed mental processes. I would like to share my own comment that focuses on a specific quote.

“These efforts seem reasonable to me as long as they do not become exercises in fault-finding but are seen as what they are — ways to test the robustness and generality of a phenomenon.”

I think it reflects a general Angst among authors of original research that somebody might conduct a replication study of one of their original studies and report a failure to replicate the original results. Of course, we would all like to be right all the time, but that is impossible when we are doing original research, especially when the work is exploring new territory. So, I think we need to acquire a new attitude towards errors. Even world class athletes in tennis make errors, and sometimes they make unforced errors. Errors are a part of life. As psychologists, we should be the first to know that ignoring errors and repressing contradictory information is just a temporary fix to deal with negative feelings that in the long run will have even more negative consequences. The whole point of science is to correct errors and in the process of doing so, more errors will be made that also need to be corrected. We can only do so if we are open to the idea that we are routinely making mistakes. It is our job to uncover errors. Einstein discovered errors in Newton’s theory. This did not mean Newton was a bad guy; noticing and correcting them was just part of the scientific process.

In psychology, we use significance tests. And significance tests already imply that we are likely to make mistakes sometimes. Every claim is made with a fixed type-I ERROR probability.

Unfortunately, psychologists are not very good at taking another error into account. Every study also has the risk of not showing an effect, if the effect is there but too small to be detected in a study with low power. This is a type-II ERROR. Psychologists have ignored this error for a long time because these errors are not reported in the literature because journals only publish significant results. So, we make type-II errors routinely, but we do not worry about it because these results are not reported. Eventually, these errors are corrected when somebody else tests the same hypothesis and gets a significant result. So, we mainly have to worry about type-I errors because some of the original results will not replicate. And that is why we need replication studies. It is the only way to correct errors in a science where original studies have to report a significant result.

In short, science is an exercise in “fault finding,” and it is not helpful when famous original researchers dismiss results of carefully conducted replication studies (see Baumeister and Strack’s responses to failed RRR in Perspectives).

It is therefore unfortunate that your comment may be misinterpreted as implying that you also believe original results are always trustworthy and that replication studies only test the robustness and generality of a phenomenon. I think replication studies also provide important information about whether an original finding might have been a false positive result, and they can produce a much more accurate estimate of the population effect size than a small original study can.

Thus, I politely take issue with your statement that “barring intentional fraud, every finding is an accurate description of the sample on which it was run. The question — an important one — is whether the findings extend beyond the sample and its particular experimental conditions. ”

Yes, this is true, but we are not just publishing means and making claims only about specific samples. Most publications also report inferential statistics and p-values. These p-values allow researchers to make claims that go beyond their sample and to generalize to other samples, at least other samples that are drawn from the same population. Pre-registration matters because the meaning of p-values is different between studies that pre-registered their design and analysis plan and studies that did not. A p-value less than .05 allows researchers to draw conclusions that go beyond the sample with a maximum error rate of 5% if they conducted a single significance test that followed a pre-registered analysis plan. Without this plan the maximum error rate could be 100% (Sterling et al., 1995).

Thus, pre-registration can help to reduce error rates in published journals and I hope you and I share the common belief that the best way to avoid fault finding is to reduce the rate of errors that are being made in the first place. So, the need for costly replications can be reduced by encouraging the much less costly practice of preregistration.

Sincerely, Ulrich Schimmack

Hi All: As the new editor of the APS journal Clinical Psychological Science, I have little to add to the excellent comments of PS editor Steve Lindsay.

Preregistration exemplifies the late physicist Richard Feynman’s astute point that science, at its best, is a recipe for minimizing (of course, not eliminating) the odds that we are fooled. There is nothing at all to fear from preregistration (except perhaps for the slight time commitment at the front end, which more than pays for itself at the back end in terms of better research). The procedure simply makes the frequently implicit distinction between confirmatory and exploratory research explicit, and minimizes the odds that researchers – even those with considerable integrity – will inadvertently fool themselves and fool others by detecting patterns post-hoc in their data, and persuading themselves that they had anticipated these patterns in advance (I have little doubt that I’ve fallen prey to this error myself). Because preregistration fully allows for exploratory research, there are no types of research for which this model doesn’t fit.

Understandably, all changes to our standard ways of doing business make us a bit nervous, as they certainly take some time to adjust to. But preregistering one’s hypotheses and data-analytic plan (if one has one; if not, that’s fine too, just so long as one makes that clear at the outset) is a win-win for researchers and for psychological science at large. In the long run, doing so will reduce inferential errors to which we’re all prone and render our conclusions more robust and ideally more replicable and generalizable.

Scott Lilienfeld, Emory University

I think that Susan Goldin-Meadow needs some defense. When I read her piece, I was nodding my head in agreement, and thinking that I wished she had gone farther. It bothers me that recent discussions of credibility of research have focused on replication and pre-registration, pushing into the background other factors that seem to me to be just as important, such as whether a result makes theoretical sense and whether the data analysis was done correctly at the outset.

I also worry that pre-registration discourages (but of course does not prevent) authors from undertaking the sort of preliminary analysis that is often sensible, such as looking at error distributions to decide whether a non-parametric test or a transformation is appropriate.

As an editor of one journal and an occasional reader of others, I think that the major sources of the credibility problem are often apparent from the outset when a paper is submitted. Submission of complete data (without exclusions) can lead to the discovery of p-hacking. Examination of the paper itself can expose weak inferences (such as findings based on low-power studies of results that are “surprising” because they don’t make theoretical sense) or weak statistics (such as interpreting removable interactions, or committing the “language as fixed-effect fallacy”).

These problems could be found by examining the protocol for a registered study, but also by reading a paper itself once it is written. But the use of pre-registration to get an editor to make a decision on a paper that might not even be submitted (depending on the results) is an extra burden on the editor and possible reviewers.

In my view, pre-registration is most useful when an author wants to undertake a risky study without worrying about whether it will be publishable regardless of the result, and when the author is willing to publish the study regardless of the results. This happens, and it happens in some fields more than others, but it is only one method among many for improving the credibility of published research. Although pre-registration can prevent p-hacking, it is not the only way to do this.

And, yes, I agree that many important studies, such as those of Piaget and Roger Brown, are new observations that illustrate new theories that make sense. These two examples were widely replicated, but that was not surprising. I could name countless others from fields familiar to me: the four-card problem; the Asian disease problem; the extra-cost effect (lost ticket); the Ellsberg paradox; and so on.

Jon Baron (University of Pennsylvania, Editor of Judgment and Decision Making)
