1 What is a replication

Key Readings

Key readings:

Machery (2020): What is a replication?
Nosek & Errington (2020): What is replication?

Optional readings:

Rosenthal (1990): Replication in behavioral research. Especially relevant for the concepts of precision and the idea of weighting replication studies according to qualitative and quantitative factors.
National Academies of Sciences, Engineering, and Medicine et al. (2019) National Academies report on Reproducibility and Replicability in Science

1.1 Replication, Robustness and Reproducibility

This book is concerned with what we call the three “R”’s of trustworthy science: Replication, Robustness and Reproducibility. We begin by sketching a definition of each. This first Chapter then elaborates on the definition of replicability more specifically.

In 2019 the USA’s National Academies published a report on Reproducibility and Replicability in Science (National Academies of Sciences, Engineering, and Medicine et al., 2019), which we strongly recommend as a complementary reading to this document. They propose to define Replicability as “obtaining consistent results across studies aimed at answering the same scientific question, each of which has obtained its own data”. This will be our first “R”. The NAS definition is also endorsed by the American Statistical Association (Broman et al., 2017) and generally adopted in statistics (for example (Heller et al., 2014)).

The second “R” is Reproducibility, defined by NAS as “obtaining consistent results using the same input data, computational steps, methods, and code, and conditions of analysis” (National Academies of Sciences, Engineering, and Medicine et al., 2019). Here the experimental design, data and methodologies of analysis are all fixed. A reproducible research paper should provide enough information to obtain the results originally reported, starting from the same raw data. Raw data are data in a form as close as possible to what was generated by the original experimental source. Defining precisely what raw data is can be complex both conceptually and practically, but a digression on this would take us too far astray. Nowadays it is becoming more common to supplement a paper with analysis code, raw data (if possible) and details or materials (e.g., experimental stimuli, biological reagents, questionnaires) about the experimental setup. Reproducibility can be considered as the most fundamental pre-requisite of replication in science. Chapter 2 will provide a more detailed explanation of the concept with an overview of tools for doing reproducible science.

Our third “R” also involves keeping the raw data fixed, and focuses on changing parts of the analysis method. This is different from the previous two, as it investigates the robustness of the results to the many judgment calls that typically need to be made in the implementation of a statistical analysis. Chapter 6 will present some methods for investigating robustness. For example, multiverse analysis (e.g., Steegen et al., 2016) is a way to systematically sample different plausible analytical strategies and see their impact on the final conclusions.

Usage of the first two “R” terms is often inconsistent; and, it is reversed in computer science. (Barba, 2018; Goodman et al., 2016; Kenett & Shmueli, 2015) review these terminologies and their usage.

1.2 Defining Replication

1.2.1 Challenges

When studying, planning or conducting studies about replication in science, it is essential to start from a clear definition and formalization of replication. First we need to define when an experiment or study constitutes a replication attempt of another.

A gray area often arises where scientists attempt to replicate the results of a study using a similar but not identical scientific methodology. If they obtain a different result, should they conclude that the new study did not replicate the original? To answer this question, the replication needs to be evaluated not only in terms of its result but also with regard to the study design — for example, with regard to whether experimental factors, approaches and conditions are the same or different. In this first chapter we discuss the design of a replication study from both an empirical and a philosophical perspective.

The statistical analysis and the evaluation of whether replication was successful will be covered in Chapter 4, where we will discover that there can be several different definitions of what a successful replication is. Anderson & Kelley (2024) reviewed more than 20 different replication criteria and Heyard et al. (2024) identified more than 50 published statistical approaches to assess a replication experiment.

We begin by reviewing concepts that in our view are general enough to be applied to multiple research areas and translated into an empirically testable statistical framework. The bibliography for this chapter contains a selected list of theoretical papers on replication.

1.2.2 Cronbach’s essential components of an experiment

We start by defining the essential components of an experiment as proposed by Cronbach & Shapiro (1982)¹ Their framework is concisely referred to as UTOS because it consisted of these four components:

UTOS Framework

Units: the elements of the population from which the study sample has been drawn; the term population is used here in its statistical sense: a real or hypothetical universe of units (not necessarily persons) to which the inferential conclusions of the study should apply;
Treatments: the independent variables, for example experimental interventions, or population subgroups of interest;
Outcomes: the dependent variables on which the study is focused
Setting: the remainder of factors that can affect the experimental outcome and its relation to treatments. For example the factors defining the “environment” where a psychological experiment takes place (lab vs online experiment, cultural aspects, etc.)

Some initial comments:

U: this component positions us unequivocally in the territory of statistical inference, where we posit a population of reference consisting of distinct units, and we are interested in generalizing from an observed sample to the entire population. Naturally not all science aligns with this paradigm. Before reading further, try to think about important scientific studies you may have come across recently that do not fit with this paradigm.
T & O: these components implicitly assume that the goal of the study was to elucidate how, in the population of interest, the outcomes relate to the independent variables, perhaps causally, if the experiment is controlled. Again, this is an important subset of science, but is by no means exhaustive. Clustering, discovery of latent dimensions in high dimensional data, segmentation of images, and a number of other exceptions come to mind. Add your own. Also, important science can be simply exploratory and descriptive, but replicability questions are relevant there as well.
S: By changing settings “minimally” you get “essentially” the same experiments. For example you may change the temperature in the lab for an experiment where temperature is a negligible factor. On the other hand, if you change the temperature in a chemistry experiment where a certain reaction is very sensitive to temperature, you may get a completely different experiment and arguably a different study: skating on ice vs water, so to speak. So it takes a little sprinkle of common sense for this to be a useful definition. Be alert.

1.2.3 Machery’s definition of a replication experiment

Machery (2020) proposed a simple and flexible perspective on replication called the resampling account.

Consider an original experiment in the UTOS framework, and imagine we can sharply specify fixed and random factors in the following sense: a random factor can be thought of as sampled from the population of interest, while a fixed factor is not sampled, but rather determined by design. For example, participants (Units) can be assumed to be random when the study samples a group of participants according to a well defined sampling scheme. In contrast, an experimental manipulation is generally fixed. Random sampling affords the option to use statistical inference to learn parameters of the population of interest (see also Yarkoni, 2020 for the idea of generalizability). Stricltly speaking, inference using a fixed factor is limited only to the tested condition. Defining fixed and random factors in real settings can be complex. Challenges and approaches are extensively debated in the literature (e.g., Clark & Linzer, 2015).

Returning to replication, according to Machery, a replication experiment is an experiment created by sampling from the random factors (e.g., new group of participants), and keeping the non-random factors fixed (e.g., same type of drug or treatment). In this definition, population parameters of interest are identical between an original study and its replication. We will return to this point in Chapter 3.

A replication experiment is meant to assess what Machery calls reliability. Machery (2020) uses the term reliability as in measurement theory (Tal, 2020) where a reliable instrument (in this case the whole experiment is seen as the instrument) produces consistent ² results across different replications of the measurement. When replicating a whole experiment, say sampling a new sample of participants, we are similarly interested in evaluating whether the experimental setup is reliable in the sense of producing consistent results.

As an illustration of the measurement metaphor, consider a study which evaluated a treatment for depression. The researchers developed a protocol, defined the concept of depression, administered the treatment and chose some depression self-report measures. In the treatment group, they found a lower level of depression compared to the control group. Another group of researchers decided to replicate this original experiment. They adopted the same protocol, definitions and measures but collected another sample of participants from the same population. The results of the replication experiment, for example the magnitude of the effect of the treatment on the depression level as defined by the protocol, may or may not be close to the original. If they are close, this provides evidence in favor of the reliability of the original experiment, as defined above (Nosek & Errington, 2017). Here, we are not questioning the definition or the measures used in the study.

To recap the progress made so far, if you can frame a study as UTOS, and clearly separate the random and fixed factors, you know how to design a replication study!

1.3 Validity and Extensions

1.3.1 Conceptual Replication

Does it make sense to compare experiments when fixed factors have changed? In many cases it does, and this type of analysis still falls under the purview of replication, broadly construed. Readers familiar with the replication literature may have noticed that we did not use concepts such as direct or conceptual replication yet. Briefly, a direct replication is usually defined as an experiment that tries to recreate the original experiment as closely as possible. While this objective can be achieved in some fields, it is practically challenging, due to uncontrollable factors (Nosek & Errington, 2017, 2020). Examples include replicating effects that are strongly culturally dependent, or experiments that used outdated or unavailable technology for the experimental setup.

In contrast, conceptual replication is a broad term defining a replication experiment with similar aims and methodology but with important differences (e.g., Crandall & Sherman, 2016). In Machery’s terms, a conceptual replication changes some of the fixed factors. In the depression example above, a conceptual replication might be a second experiment using another theoretical definition or measure of depression. Here, the goal is still within the broad umbrella of understanding the effect of the treatment on depression, but the experiment is not a strict replication in the Machery sense, because the outcome measurement methodology has changed. For this reason Machery (2020) uses the term extension. Extensions may include a methodological change but also a more profound philosophical change, for example using a scale motivated by a different theoretical framework for defining depression. An extension of an experiment is not assessing the reliability, but rather the validity of the original experiment and in a broader sense the theory underlying the original experiment. Like reliability, the term validity is used with measurement theory in mind. The validity of a measurement involves correspondence with the true phenomenon being assessed. (for example, the actual weight of an object being weighed).

In the depression example, if the researchers want to see whether the treatment is effective using other measures of depression, they are extending the original results; and, thereby, testing the validity of the result about the treatment, as long as both treatments are based on sufficiently similar theoretical foundations.

1.3.2 Precision of Replication

In practice we can often design a variety of replication / extension experiments, with varying degrees of similarity with the original one. Rosenthal (1990) defines the important concept of precision as the degree of similarity between the replication and the original experiment. The most precise is a Machery replication. Generally, very precise experiments lack external validity (they are not able to extend) but they can directly speak to the reliability of the original experimental setup and results. Less precise experiments will provide weaker evidence regarding the original experiment, shifting the focus to our confidence in the underlying theory. For example, we may say that the treatment also reduces physiological indexes of depression or is also effective in a different population of patients.

1.3.3 Replication vs Extension: different scientific value?

We have seen that a precise replication experiment is an attempt to ask the same inferential question in as similar a way as possible. Variations in setting and fixed factors can generate experiments that probe the trustworthiness of the conclusions further. These slight variations can be successfully analyzed together with replicability in mind, for example in a meta-analysis (see Chapter 4), even though they do not constitute a strict replication by Machery’s exacting standards.

As settings are varied further and the replication becomes less precise, we eventually find ourselves having extended the experimental paradigm beyond its initial goals and populations. When is this line crossed? The answer to this question is dependent on the context; and, reasonable scientists may disagree about it. An important debate in the literature regards the value of a replication, as compared to an extension, of an experiment. Crandall & Sherman (2016) consider extensions to more valuable while Simons (2014) argues in favor of focusing on replications of experiments. Clearly both have a place in science. If the main aim is to test the reliability of the original experiment, one should reduce the heterogeneity (i.e., increase the precision) as much as possible. On the other hand, if the aim is to explore different instances of the same effect or theory, one should extend the experimental setup.

In the remainder of the discussion we will use precise replication to denote a replication experiment in the strict sense of Machery, or a sufficiently close approximation, while we use the word extension to describe replication settings where the precision is lower. The appropriate definition of a successful replication, and the proper choice of statistical methodlogy for analyzing data may vary greatly, depending on where the follow-up study falls on this precision continuum.

1.3.4 Contextual dependency

Contextual dependency (Gollwitzer & Schwabe, 2022; Inbar, 2016; Van Bavel et al., 2016) complicates the Machery (2020) framework by considering specific weights assigned to each UTOS element. Without going into details, the impact of the different UTOS elements on the actual experiment could differ according to the type of research question. For example, consider a social psychology experiment originally conducted on American participants and later extended to European or Asian participants. If the psychological construct of interest is strongly affected by cultural variations, the impact of changing the ethnicity may be considerable. In contrast, a lab-based experiment about low-level perceptual effects may be more portable across culturally diverse populations. Our expectations about the replication/extension outcome need to be calibrated with respect to the specific research question at hand, and the impact of each experimental element.

1.3.5 Investigators as a Factor

As a final note, one important but often underestimated problem concerns the investigators conducting the replication study or studies. They can legitimately be considered part of the setting in some cases; but, they can be seen as random or irrelevant in others. Rosenthal (1990) clearly described the problem of correlated replicators and proposed different weighting approaches, based on the design of the replication study, for replication/extension studies conducted from authors that are orthogonal in terms of background theory and methods. Authors sharing a theory or a strict network of co-authorships are likely to create experiments and data that are more similar, as compared to independent authors. A replication or extension study conducted by an independent, and possibly skeptical, researcher should have a greater impact on increasing or decreasing confidence in the original findings, as compared to a follow-up study conducted by the original author.

1.4 Key Questions

In what ways do replication, robustness, and reproducibility contribute to the trustworthiness of science?
When should methodological differences between studies be viewed as a strength rather than a limitation in replication research?

References

Anderson, S. F., & Kelley, K. (2024). Sample size planning for replication studies: The devil is in the design. Psychological Methods, 29, 844–867. https://doi.org/10.1037/met0000520

Barba, L. A. (2018). Terminologies for Reproducible Research. arXiv [Cs.DL], arXiv:1802.03311. http://arxiv.org/abs/1802.03311

Broman, K., Cetinkaya-Rundel, M., Nussbaum, A., Paciorek, C., Peng, R., Turek, D., & Wickham, H. (2017, January 18). Recommendations to funding agencies for supporting reproducible research. https://www.amstat.org/asa/files/pdfs/POL-ReproducibleResearchRecommendations.pdf

Clark, T. S., & Linzer, D. A. (2015). Should I use fixed or random effects? Political Science Research and Methods, 3, 399–408. https://doi.org/10.1017/psrm.2014.32

Crandall, C. S., & Sherman, J. W. (2016). On the scientific superiority of conceptual replications for scientific progress. Journal of Experimental Social Psychology, 66, 93–99. https://doi.org/10.1016/j.jesp.2015.10.002

Cronbach, L. J., & Shapiro, K. (1982). Designing evaluations of educational and social programs. https://eduq.info/xmlui/handle/11515/9106

Gollwitzer, M., & Schwabe, J. (2022). Context dependency as a predictor of replicability. Review of General Psychology: Journal of Division 1, of the American Psychological Association, 26, 241–249. https://doi.org/10.1177/10892680211015635

Goodman, S. N., Fanelli, D., & Ioannidis, J. P. A. (2016). What does research reproducibility mean? Science Translational Medicine, 8, 341ps12. https://doi.org/10.1126/scitranslmed.aaf5027

Heller, R., Bogomolov, M., & Benjamini, Y. (2014). Deciding whether follow-up studies have replicated findings in a preliminary large-scale omics study. Proceedings of the National Academy of Sciences of the United States of America, 111, 16262–16267. https://doi.org/10.1073/pnas.1314814111

Heyard, R., Pawel, S., Frese, J., Voelkl, B., Würbel, H., McCann, S., Held, L., Wever, K. E., Hartmann, H., Townsin, L., & Zellers, S. (2024). A scoping review on metrics to quantify reproducibility: a multitude of questions leads to a multitude of metrics. https://scholar.google.com/citations?view_op=view_citation&hl=en&citation_for_view=JAb7P1QAAAAJ:ldfaerwXgEUC

Inbar, Y. (2016). Association between contextual dependence and replicability in psychology may be spurious. Proceedings of the National Academy of Sciences of the United States of America, 113, E4933–4. https://doi.org/10.1073/pnas.1608676113

Kenett, R. S., & Shmueli, G. (2015). Clarifying the terminology that describes scientific reproducibility. Nature Methods, 12, 699. https://doi.org/10.1038/nmeth.3489

Machery, E. (2020). What is a replication? Philosophy of Science, 87, 545–567. https://doi.org/10.1086/709701

National Academies of Sciences, Engineering, and Medicine, Policy and Global Affairs, Committee on Science, Engineering, Medicine, and Public Policy, Board on Research Data and Information, Division on Engineering and Physical Sciences, Committee on Applied and Theoretical Statistics, Board on Mathematical Sciences and Analytics, Division on Earth and Life Studies, Nuclear and Radiation Studies Board, & Division of Behavioral and Social Sciences and Education. (2019). Reproducibility and Replicability in Science. National Academies Press. https://doi.org/10.17226/25303

Nosek, B. A., & Errington, T. M. (2017). Making sense of replications. eLife, 6, e23383. https://doi.org/10.7554/eLife.23383

Nosek, B. A., & Errington, T. M. (2020). What is replication? PLoS Biology, 18, e3000691. https://doi.org/10.1371/journal.pbio.3000691

Parmigiani, G. (2023). Defining replicability of prediction rules. Statistical Science: A Review Journal of the Institute of Mathematical Statistics, 38. https://doi.org/10.1214/23-sts891

Rosenthal, R. (1990). Replication in behavioral research. Journal of Social Behavior and Personality. https://search.proquest.com/openview/5ccc3a267139968c18c0d7ce1a1c09f6/1?pq-origsite=gscholar&cbl=1819046

Simons, D. J. (2014). The value of direct replication. Perspectives on Psychological Science: A Journal of the Association for Psychological Science, 9, 76–80. https://doi.org/10.1177/1745691613514755

Steegen, S., Tuerlinckx, F., Gelman, A., & Vanpaemel, W. (2016). Increasing transparency through a multiverse analysis. Perspectives on Psychological Science: A Journal of the Association for Psychological Science, 11, 702–712. https://doi.org/10.1177/1745691616658637

Tal, E. (2020). Measurement in Science. Metaphysics Research Lab, Stanford University. https://plato.stanford.edu/archives/fall2020/entries/measurement-science/

Van Bavel, J. J., Mende-Siedlecki, P., Brady, W. J., & Reinero, D. A. (2016). Contextual sensitivity in scientific reproducibility. Proceedings of the National Academy of Sciences of the United States of America, 113, 6454–6459. https://doi.org/10.1073/pnas.1521897113

Yarkoni, T. (2020). The generalizability crisis. The Behavioral and Brain Sciences, 45, e1. https://doi.org/10.1017/S0140525X20001685

This framework has been proposed for social sciences and psychology but can be easily applied to different fields. For example see Parmigiani (2023) for a more detailed taxonomy applied to biomedical studies.↩︎
e.g. close in some quantiative metric if they are real numbers↩︎