4  Statistical methods for replication assessment

4.1 How to assess the replication outcome?

In the previous chapters we introduced replication from a statistical and theoretical point of view. We purposely postponed defining the outcome of a replication study in order to dedicate an entire chapter to this topic. Similarly to the problem of multiple definitions, there are several statistical measures for assessing replication, and the existing measures differ in several respects.

In addition to this multitude of methods, the field of replication measures is relatively new and keeps growing, with new metrics, simulation studies, and comparisons between methods being proposed. Heyard et al. (2024) published an extensive systematic review, supported by an online and regularly updated database in which all replication measures are organized and classified according to a common scheme. The authors reviewed more than 50 different measures.

The database is available at http://rachelheyard.com/reproducibility_metrics/. We selected methods from the database according to the following criteria:

  • for similar methods we kept the most recent and general version
  • we included both Bayesian and frequentist methods
  • we included methods answering the question “Is the study \(x\) replicated?” as well as methods that combine evidence and estimate the size of the effect without an inferential decision
  • we selected methods that are not linked to a specific research area or topic, excluding methods that are specific to a given type of experiment, such as replicating fMRI studies at the voxel level

Beyond any single method, the large-scale replication projects used different methods on the same dataset. For this reason, while some methods can be considered clearly more sophisticated and complete, there is no single best tool to evaluate replication results.

4.2 Replication studies as meta-analysis

As introduced by Hedges & Schauer (2019c), replication studies can be seen, from a statistical point of view, as meta-analyses. There is a single initial study and one or more attempts to replicate the initial finding. For the simple one-to-one replication design we have a meta-analysis with two studies, while in the one-to-many design we have a meta-analysis with \(k\) studies. When it comes to evaluating the results of replication studies we can choose between several methods (see Heyard et al., 2024), and some of them are specifically meta-analysis based, pooling together evidence from the original and replication studies. We consider meta-analytic thinking crucial to understanding replication studies and methods, but not all methods are strictly meta-analyses in the usual sense.

In fact, while the methodology of the initial or replication studies is important when evaluating a single result, at the aggregated level we evaluate how a certain focal parameter (e.g., the difference between two groups or conditions) varies across replications and whether there is evidence, whatever the criterion, for a successful replication. Whatever model is used in the original studies, we can essentially reason at the aggregated level without loss of generality.

4.3 Simulating data

While real-world examples are important, simulating simplified examples is a good strategy for understanding replication methods from a statistical point of view. Furthermore, simulating data is nowadays considered an important tool to teach and understand statistical methods. In addition, Monte Carlo simulations are necessary to estimate the statistical properties (e.g., statistical power or type-1 error rate) of complex models.

For the simulated examples we can define the following simulation approach:

  • primary studies always compare two independent groups on a certain response variable
  • within a study, the two groups are assumed to come from two normal distributions with unit variance. For one group the mean is centered on zero and for the other group the mean is centered on the value representing the effect size for that specific study.
  • the sample size can vary between the two groups

Using the same notation as in ?sec-prob-replication, we can define:

  • \(y_0\): the reference group
  • \(y_1\): the treated group

For example we can simulate a single study:

es <- 0.3
n0 <- 30
n1 <- 30

y0 <- rnorm(n0, 0, 1)
y1 <- rnorm(n1, es, 1)

mean(y1) - mean(y0) # this is the effect size
## [1] 0.2506808
var(y1)/n1 + var(y0)/n0 # this is the sampling variability
## [1] 0.06698663
dd <- data.frame(y = c(y0, y1),
                 x = factor(rep(0:1, c(n0, n1))))

ggplot(dd, aes(x = x, y = y)) +
    geom_boxplot(aes(fill = x))

Now we can simply iterate the process for \(k\) studies to create a series of replications with a common true parameter \(\theta\). We can also generate a more realistic set of sample sizes by sampling from a probability distribution (e.g., Poisson). Then, if we want to include variability in the true effects, as in the extension framework, we can simply sample \(k\) \(\theta\)’s from a normal distribution with variance \(\tau^2\).

k <- 10
mu <- 0.3 # average true effect
tau2 <- 0.1 # heterogeneity
theta <- rnorm(k, mu, sqrt(tau2)) # true effects for R studies

yi <- vi <- rep(NA, k)
n0 <- n1 <- 10 + rpois(k, 40 - 10)

for(i in 1:k){
    y0 <- rnorm(n0[i], 0, 1)
    y1 <- rnorm(n1[i], theta[i], 1)
    yi[i] <- mean(y1) - mean(y0)
    vi[i] <- var(y1)/n1[i] + var(y0)/n0[i]
}

sim <- data.frame(id = 1:k, yi, vi, n0, n1)
sim
   id         yi         vi n0 n1
1   1 0.03031231 0.06713664 36 36
2   2 0.65485683 0.04471934 45 45
3   3 0.68778103 0.04442382 41 41
4   4 0.80716226 0.06018050 39 39
5   5 1.04594451 0.03653029 47 47
6   6 0.36792056 0.06988056 32 32
7   7 0.62387229 0.07306874 36 36
8   8 0.37325746 0.06051872 45 45
9   9 0.04084495 0.04989911 40 40
10 10 0.35663995 0.03653834 44 44

Then we can put everything in a function that can be used to simulate different scenarios.

filor::print_fun(c(funs$sim_study, funs$sim_studies))
sim_study <- function(theta, n0, n1 = NULL, id = NULL){
    if(is.null(n1)) n1 <- n0
    y0 <- rnorm(n0, 0, 1)
    y1 <- rnorm(n1, theta, 1)
    sim <- data.frame(
        yi = mean(y1) - mean(y0),
        vi = var(y1)/n1 + var(y0)/n0
    )
    sim$sei <- sqrt(sim$vi)
    sim$n0 <- n0
    sim$n1 <- n1
    if(!is.null(id)){
        sim <- cbind(id = id, sim)
    }
    class(sim) <- c("rep3data", class(sim))
    return(sim)
}
sim_studies <- function(R, mu = 0, tau2 = 0, n0, n1 = NULL){
    # check input parameters consistency
    if(length(mu) == 1) mu <- rep(mu, R)
    if(length(n0) == 1) n0 <- rep(n0, R)
    if(length(n1) == 1) n1 <- rep(n1, R)
    if(is.null(n1)) n1 <- n0

    deltai <- rnorm(R, 0, sqrt(tau2))
    mu <- mu + deltai

    sim <- mapply(sim_study, mu, n0, n1, SIMPLIFY = FALSE)
    sim <- do.call(rbind, sim)
    sim <- cbind(id = 1:R, sim)
    class(sim) <- c("rep3data", class(sim))
    return(sim)
}
sim_studies(R = 10, mu = 0.5, tau2 = 0, n0 = 30)
   id           yi         vi       sei n0 n1
1   1 0.6637388572 0.05815519 0.2411539 30 30
2   2 0.5679240260 0.06381014 0.2526067 30 30
3   3 0.2788885059 0.06025047 0.2454597 30 30
4   4 0.6817201261 0.07733465 0.2780911 30 30
5   5 0.2359736753 0.07501622 0.2738909 30 30
6   6 0.0006746966 0.04938659 0.2222309 30 30
7   7 0.4815783638 0.06382364 0.2526334 30 30
8   8 0.7655321055 0.06968865 0.2639861 30 30
9   9 0.4829062160 0.07012715 0.2648153 30 30
10 10 0.3945500090 0.06645782 0.2577941 30 30

4.4 Methods

The database available at http://rachelheyard.com/reproducibility_metrics/ can be consulted for a more complete overview. Among all the available methods we selected the following metrics1, which will be discussed in more detail with examples.

  • Prediction interval: replication effect in original 95% prediction interval
  • Proportion of population effects agreeing in direction with the original, \(\hat{P}_{>0}\)
  • Bayes Factor: Independent Jeffreys-Zellner-Siow BF test (default BF)
  • Bayes Factor: Equality-of-effect-size BF test
  • Bayes Factor: Fixed-effect meta-analysis BF test (Meta-analytic BF)
  • Bayesian Evidence Synthesis (variant: Meta-Analysis Model-based Assessment of replicability (MAMBA))
  • Bayesian mixture model for reproducibility rate
  • Confidence interval: original effect in replication 95% CI (Coverage)
  • Confidence interval: replication effect in original 95% CI (Capture probability)
  • Continuously cumulating meta-analytic approach
  • Correspondence test
  • Credibility analysis (Reverse-Bayes, probability of credibility, probability of replicating an effect)
  • Design analysis (Type M and Type S errors)
  • Equivalence testing (TOST, two one-sided tests)
  • Likelihood-based approach for reproducibility (likelihood ratio)
  • Minimum effect testing
  • P interval
  • Replication Bayes factor
  • Sceptical \(p\)-value (versions: nominal sceptical \(p\)-value, golden sceptical \(p\)-value, controlled sceptical \(p\)-value)
  • Sceptical Bayes Factor (Reverse-Bayes)
  • Small Telescopes
  • Snapshot hybrid (Bayesian meta-analysis)
  • Z-curve (Exact replication rate, p-curves)
  • Consistency of original with replications, \(P_{\mbox{orig}}\)
  • I squared, \(I^2\) (Estimation of effect variance)
  • Bland-Altman Plot (Agreement measures)
  • Correlation between effects
  • Difference in effect size (Q-statistic, (meta-analytic) Q-test, difference test, Tukey’s post-hoc honest significant difference test)
  • Externally standardized residuals
  • Meta-analysis
  • Significance criterion (vote counting, two-trials rule, regulatory agreement)

4.5 Vote Counting based on significance or direction

The simplest method is called vote counting (Hedges & Olkin, 1980; Valentine et al., 2011). A replication attempt \(\theta_{rep}\) is considered successful if the result has the same direction as the original study \(\theta_{orig}\) and is statistically significant, i.e., \(p_{\theta_{rep}} \leq \alpha\). Similarly, we can count the number of replications with the same sign as the original study.

  • Easy to understand, communicate, and compute
  • Does not consider the size of the effect
  • Depends on the statistical power of the replication studies

Let’s simulate a replication:

## original study
n_orig <- 30
theta_orig <- theta_from_z(2, n_orig) # custom helper: effect size corresponding to z = 2 given n_orig

orig <- data.frame(
  yi = theta_orig,
  vi = 4/(n_orig*2)
)

orig$sei <- sqrt(orig$vi)
orig <- summary_es(orig) # custom helper: adds z statistic, p-value and 95% CI

orig
         yi         vi       sei zi       pval      ci.lb    ci.ub
1 0.5163978 0.06666667 0.2581989  2 0.04550026 0.01033725 1.022458
## replications

R <- 10
reps <- sim_studies(R = R, 
                    mu = theta_orig, 
                    tau2 = 0, 
                    n_orig)

reps <- summary_es(reps)

head(reps)
  id         yi         vi       sei n0 n1        zi         pval       ci.lb
1  1 0.30093162 0.04118389 0.2029381 30 30 1.4828737 1.381080e-01 -0.09681981
2  2 0.26431383 0.05738837 0.2395587 30 30 1.1033364 2.698811e-01 -0.20521259
3  3 0.90100732 0.04986523 0.2233052 30 30 4.0348688 5.463281e-05  0.46333711
4  4 0.69905003 0.09251760 0.3041671 30 30 2.2982437 2.154792e-02  0.10289354
5  5 0.36250288 0.08164043 0.2857279 30 30 1.2686997 2.045482e-01 -0.19751350
6  6 0.08917127 0.07648660 0.2765621 30 30 0.3224277 7.471287e-01 -0.45288051
      ci.ub
1 0.6986830
2 0.7338403
3 1.3386775
4 1.2952065
5 0.9225193
6 0.6312231

Let’s compute the proportion of replication studies that are statistically significant:

mean(reps$pval <= 0.05)
[1] 0.4

Let’s compute the proportion of replication studies with the same sign as the original:

mean(sign(orig$yi) == sign(reps$yi))
[1] 0.9

We could also perform some statistical tests. See Bushman & Wang (2009) and Hedges & Olkin (1980) for vote-counting methods in meta-analysis.

An extreme example:

Let’s imagine an original experiment with \(n_{orig} = 30\) and \(\hat \theta_{orig} = 0.5\) that is statistically significant, \(p \approx 0.045\). Now a direct replication study (thus assuming \(\tau^2 = 0\)) with \(n_{rep} = 350\) finds \(\hat \theta_{rep_1} = 0.15\), which is also statistically significant, \(p \approx 0.047\).
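
To see the numbers behind this example, here is a minimal sketch using the \(\sqrt{4/(2n)}\) standard-error formula adopted throughout the chapter (the original effect is taken as the value giving \(z = 2\), roughly 0.52):

se_orig <- sqrt(4 / (2 * 30))   # standard error with n = 30 per group
se_rep  <- sqrt(4 / (2 * 350))  # standard error with n = 350 per group
theta_orig <- 2 * se_orig       # ~ 0.52, the effect size giving z = 2
theta_rep  <- 0.15
2 * pnorm(theta_orig / se_orig, lower.tail = FALSE) # ~ 0.045
2 * pnorm(theta_rep / se_rep, lower.tail = FALSE)   # ~ 0.047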

The most problematic aspect of relying only on the sign of the effect or on the p-value is that all information about the size of the effect and its precision is lost.

4.6 Confidence/prediction interval methods

The main improvement of confidence/prediction interval methods is that they take into account the size of the effect and its precision. The confidence interval represents the sampling uncertainty around the estimated value. It is interpreted as the percentage of confidence intervals that contain the true value under repetition of the same sampling procedure.

The confidence interval around an estimated effect \(\hat\theta_r\) can be written as:

\[ 95\%\;\mbox{CI} = \hat\theta_r \pm \mbox{SE}_{r} t_{\alpha} \]

Where \(t_{\alpha}\) is the critical value of the test statistic for a given \(\alpha\). Clearly, this version of the confidence interval is symmetric around the estimated value, and its width is a function of the estimation precision.

alpha <- 0.05
n <- c(10, 50, 100, 500)
theta <- rep(0.3, length(n))
se <- sqrt(1/n + 1/n)
tc <- abs(qnorm(alpha/2))
lb <- theta - se * tc
ub <- theta + se * tc

plot(n, theta, ylim = c(min(lb), max(ub)), type = "b", pch = 19,
     xlab = "Sample Size",
     ylab = latex2exp::TeX("$\\theta\\;_{r}$"))
lines(n, ub, lty = "dashed")
lines(n, lb, lty = "dashed")

4.6.1 Replication effect within the original CI

\[ \hat\theta_{orig} - z_{1 - \alpha/2} \sqrt{\sigma^2_{orig}} < \hat\theta_{rep} < \hat\theta_{orig} + z_{1 - \alpha/2} \sqrt{\sigma^2_{orig}} \]

  • Takes into account the size of the effect and the precision of \(\theta_{orig}\)
  • The original study is assumed to provide a reliable estimate
  • No extension for many-to-one designs
  • Imprecise original studies lead to higher success rates
z_orig <- 2
n_orig <- 30
theta_orig <- theta_from_z(z_orig, n_orig)
n_rep <- 350
theta_rep <- 0.15
se_orig <- sqrt(4 / (2 * n_orig))
ci_orig <- theta_orig + qnorm(c(0.025, 0.975)) * se_orig
curve(dnorm(x, theta_orig, se_orig), 
      -1, 2, 
      ylab = "Density", 
      xlab = latex2exp::TeX("$\\theta$"))
abline(v = ci_orig, lty = "dashed")
points(theta_orig, 0, pch = 19, cex = 2)
points(theta_rep, 0, pch = 19, cex = 2, col = "firebrick")
legend("topleft", 
       legend = c("Original", "Replication"), 
       fill = c("black", "firebrick"))

One potential problem of this method is that imprecise original studies are “easier” to replicate because of their larger confidence intervals.

4.6.2 Original study within replication CI

The same approach can be applied by checking whether the original effect size is contained within the replication confidence interval. Clearly, these methods depend on the precision of the studies. Formally:

\[ \hat\theta_{rep} - z_{1 - \alpha/2} \sqrt{\sigma^2_{rep}} < \hat\theta_{orig} < \hat\theta_{rep} + z_{1 - \alpha/2} \sqrt{\sigma^2_{rep}} \]

The method has the same pros and cons as the previous approach. One advantage is that replication studies are usually more precise (larger sample sizes), thus the estimated parameter and its confidence interval are more reliable.
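
As a minimal sketch, we can reuse the numbers of the extreme example above to check whether the original effect falls inside the 95% CI of the (much more precise) replication:

theta_orig <- 0.52
se_orig <- sqrt(4 / (2 * 30))
theta_rep <- 0.15
se_rep <- sqrt(4 / (2 * 350))
ci_rep <- theta_rep + qnorm(c(0.025, 0.975)) * se_rep
ci_rep                                          # ~ [0.002, 0.298]
theta_orig > ci_rep[1] & theta_orig < ci_rep[2] # FALSE: no replication by this criterion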

4.7 Prediction interval PI

There is still one piece of information missing from the CI methods: they consider the uncertainty of only one study, either the original or the replication. The prediction interval (PI) expresses the uncertainty about a future observation (in this case, a study) rather than about the estimated parameter.

If we compute the difference between the original and the replication study, the sampling distribution of the difference expresses the range of values expected in the absence of other sources of variability, in other terms assuming that \(\tau^2 = 0\).

If the original and replication studies come from the same population, the sampling distribution of the difference is centered on 0, \(\hat \theta_{orig} - \hat \theta_{rep_0} \sim \mathcal{N}\left(0, \sigma^2_{\hat \theta_{orig}} + \sigma^2_{\hat \theta_{rep}} \right)\) (the subscript \(0\) indicates that the replication is expected to be sampled from the same population as \(\theta_{orig}\)). The prediction interval is then:

\[ \hat \theta_{orig} \pm z_{1 - \alpha/2} \sqrt{\sigma^2_{\hat \theta_{orig}} + \sigma^2_{\hat \theta_{rep}}} \]

The variance of the difference is the sum (assuming the two studies are independent) of the two sampling variances, and its square root is the standard error of the difference. Thus we take into account the estimation uncertainty of both the original and the replication studies.

set.seed(2025)

o1 <- rnorm(50, 0.5, 1) # group 1
o2 <- rnorm(50, 0, 1) # group 2
od <- mean(o1) - mean(o2) # effect size
se_o <- sqrt(var(o1)/50 + var(o2)/50) # standard error of the difference

n_r <- 100 # sample size replication

se_o_r <- sqrt(se_o^2 + (var(o1)/100 + var(o2)/100))

od + qnorm(c(0.025, 0.975)) * se_o_r
[1] 0.325520 1.298378
par(mar = c(4, 4, 0.1, 0.1))  
curve(dnorm(x, od, se_o_r), od - se_o_r*4, od + se_o_r*4, lwd = 2, xlab = latex2exp::TeX("$\\theta$"), ylab = "Density")
abline(v = od + qnorm(c(0.025, 0.975)) * se_o_r, lty = "dashed")

  • Takes into account the uncertainty of both studies
  • We can plan a replication using the standard deviation of the original study and the expected sample size
  • Imprecise original studies lead to wide PIs; it is then difficult for a replication study to fall outside the PI
  • Mainly for one-to-one replication designs

4.8 Mathur & VanderWeele (2020) \(p_{orig}\)

Mathur & VanderWeele (2020) proposed a method based on the prediction interval that computes a p-value, \(p_{orig}\), quantifying how consistent \(\theta_{orig}\) is with the replications. This method is suited for many-to-one replication designs. Formally:

\[ P_{orig} = 2 \left[ 1 - \Phi \left( \frac{|\hat \theta_{orig} - \hat \mu_{\theta_{rep}}|}{\sqrt{\hat \tau^2 + \sigma^2_{orig} + \hat{SE}^2_{\hat \mu_{\theta_{rep}}}}} \right) \right] \]

  • \(\hat \mu_{\theta_{rep}}\) is the pooled (i.e., meta-analytic) estimate of the \(k\) replications
  • \(\hat \tau^2\) is the estimated variance among the replications

It is interpreted as the probability of observing an original estimate as extreme as, or more extreme than, \(\hat \theta_{orig}\) if the original study and the replications were consistent. A very low \(p_{orig}\) suggests that the original study is inconsistent with the replications.

  • Suited for many-to-one designs
  • Takes into account all sources of uncertainty
  • Provides a p-value

The method is implemented in the Replicate and MetaUtility R packages:

tau2 <- 0.05
theta_rep <- 0.2
theta_orig <- 0.7

n_orig <- 30
n_rep <- 100
k <- 20

replications <- sim_studies(k, theta_rep, tau2, n_rep, n_rep)
original <- sim_studies(1, theta_orig, 0, n_orig, n_orig)

fit_rep <- metafor::rma(yi, vi, data = replications) # random-effects meta-analysis

Replicate::p_orig(original$yi, original$vi, fit_rep$b[[1]], fit_rep$tau2, fit_rep$se^2)
[1] 0.06692786

4.8.1 Simulation

# standard errors assuming same n and variance 1
se_orig <- sqrt(4/(n_orig * 2))
se_rep <- sqrt(4/(n_rep * 2))
se_theta_rep <- sqrt(1/((1/(se_rep^2 + tau2)) * k)) # standard error of the random-effects estimate

sep <- sqrt(tau2 + se_orig^2 + se_theta_rep^2) # denominator of the z statistic in p_orig

curve(dnorm(x, theta_rep, sep), theta_rep - 4*sep, theta_rep + 4*sep, ylab = "Density", xlab = latex2exp::TeX("\\theta"))
points(theta_orig, 0.02, pch = 19, cex = 2)
abline(v = qnorm(c(0.025, 0.975), theta_rep, sep), lty = "dashed", col = "firebrick")

4.8.2 Mathur & VanderWeele (2020) \(\hat P_{> 0}\)

Another related metric is \(\hat P_{> 0}\), the proportion of replications with the same direction as the original effect. Before computing the proportion, we need to adjust each estimated \(\hat \theta_{rep_i}\) with a shrinkage factor:

\[ \tilde{\theta}_{rep_i} = \hat \mu_{\theta_{rep}} + (\hat \theta_{rep_i} - \hat \mu_{\theta_{rep}}) \sqrt{\frac{\hat \tau^2}{\hat \tau^2 + v_{rep_i}}} \]

This method is somewhat similar to vote counting, but the effects are calibrated taking \(\tau^2\) into account.

# compute calibrated estimation for the replications
# use restricted maximum likelihood to estimate tau2 under the hood
theta_sh <- MetaUtility::calib_ests(replications$yi, replications$sei, method = "REML")
mean(theta_sh > 0)
[1] 0.85

The authors suggest a bootstrapping approach for making inferences on \(\hat P_{> 0}\):

nboot <- 1e3
theta_boot <- matrix(0, nrow = nboot, ncol = k)

for(i in 1:nboot){
  idx <- sample(1:nrow(replications), nrow(replications), replace = TRUE)
  replications_boot <- replications[idx, ]
  theta_cal <- MetaUtility::calib_ests(replications_boot$yi, 
                                       replications_boot$sei, 
                                       method = "REML")
  theta_boot[i, ] <- theta_cal
}

# proportion of positive calibrated effects in each bootstrap sample
p_greater_boot <- apply(theta_boot, 1, function(x) mean(x > 0))
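
A percentile interval for \(\hat P_{> 0}\) can then be obtained from the bootstrap distribution (a minimal sketch):

# percentile bootstrap 95% CI for the proportion of positive calibrated effects
quantile(p_greater_boot, c(0.025, 0.975))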

4.8.3 Mathur & VanderWeele (2020) \(\hat P_{\gtrless q*}\)

Instead of using 0 as the threshold, we can use a meaningful effect size that is considered small but different from 0. \(\hat P_{\gtrless q*}\) is the proportion of (calibrated) replication effects greater or lower than the value \(q*\). This framework is similar to equivalence and minimum-effect testing (Lakens et al., 2018).

q <- 0.2 # minimum non zero effect

fit <- metafor::rma(yi, vi, data = replications)

# see ?MetaUtility::prop_stronger
MetaUtility::prop_stronger(q = q,
                           M = fit$b[[1]],
                           t2 = fit$tau2,
                           tail = "above",
                           estimate.method = "calibrated",
                           ci.method = "calibrated",
                           dat = replications,
                           yi.name = "yi",
                           vi.name = "vi")
   est        se lo  hi   bt.mn shapiro.pval
1 0.35 0.1756669  0 0.6 0.35375    0.8137044

4.9 Meta-analytic methods

Another approach is to combine the original and replication results (both one-to-one and many-to-one) using a meta-analysis model. Then we can test if the pooled estimate is different from 0 or another meaningful value.

  • Uses all the available information, especially when fitting a random-effects model
  • Takes into account precision via inverse-variance weighting
  • Does not consider publication bias
  • For one-to-one designs only a fixed-effects model can be used

4.9.1 Fixed-effects Model

# fixed-effects
fit_fixed <- rma(yi, vi, method = "FE")
summary(fit_fixed)

Fixed-Effects Model (k = 20)

  logLik  deviance       AIC       BIC      AICc   
 20.7415   -0.0000  -39.4829  -38.4872  -39.2607   

I^2 (total heterogeneity / total variability):   0.00%
H^2 (total variability / sampling variability):  0.00

Test for Heterogeneity:
Q(df = 19) = 0.0000, p-val = 1.0000

Model Results:

estimate      se    zval    pval   ci.lb   ci.ub      
  0.2000  0.0316  6.3246  <.0001  0.1380  0.2620  *** 

---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

4.9.2 Random-Effects model

# random-effects
fit_random <- rma(yi, vi, method = "REML")
summary(fit_random)

Random-Effects Model (k = 20; tau^2 estimator: REML)

  logLik  deviance       AIC       BIC      AICc   
 19.7044  -39.4088  -35.4088  -33.5199  -34.6588   

tau^2 (estimated amount of total heterogeneity): 0 (SE = 0.0065)
tau (square root of estimated tau^2 value):      0
I^2 (total heterogeneity / total variability):   0.00%
H^2 (total variability / sampling variability):  1.00

Test for Heterogeneity:
Q(df = 19) = 0.0000, p-val = 1.0000

Model Results:

estimate      se    zval    pval   ci.lb   ci.ub      
  0.2000  0.0316  6.3246  <.0001  0.1380  0.2620  *** 

---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

4.9.3 Pooling replications

The previous approach can also be implemented by combining the replications into a single pooled effect and then comparing the original study with the pooled replication estimate.

This is similar to the CI or PI approaches, but the pooled replication effect will probably be very precise because multiple studies are combined.
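
As a minimal sketch (assuming the orig and reps objects simulated in the vote-counting section), we can pool the replications with a fixed-effect model and test the difference between the original and the pooled replication estimate:

pooled <- metafor::rma(yi, vi, data = reps, method = "FE") # pooled replication effect
diff_est <- orig$yi - pooled$b[[1]]                        # original minus pooled estimate
diff_se <- sqrt(orig$vi + pooled$se^2)                     # SE of the difference
2 * pnorm(abs(diff_est / diff_se), lower.tail = FALSE)     # p-value of the difference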

4.9.4 Q Statistics

An interesting proposal is to use the Q statistic (Hedges & Schauer, 2019a, 2019b, 2019c, 2021; Schauer, 2022; Schauer & Hedges, 2020, 2021), commonly used in meta-analysis to assess the presence of heterogeneity. Formally:

\[ Q = \sum_{i = 1}^{k} \frac{(\theta_i - \bar \theta_w)^2}{\sigma^2_i} \]

Where \(\bar \theta_w\) is the inverse-variance weighted average (i.e., the fixed-effect estimate). The Q statistic is essentially a weighted sum of squares. Under the null hypothesis that all studies estimate the same effect, \(\theta_1 = \theta_2 = \dots = \theta_k\), the Q statistic follows a \(\chi^2\) distribution with \(k - 1\) degrees of freedom. Under the alternative hypothesis the distribution is a non-central \(\chi^2\) with non-centrality parameter \(\lambda\). The expected value of \(Q\) is \(E(Q) = v + \lambda\), where \(v\) is the degrees of freedom.
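
As a worked example, \(Q\) can be computed directly from the simulated dataset sim created earlier:

wi <- 1 / sim$vi                                  # inverse-variance weights
theta_w <- sum(wi * sim$yi) / sum(wi)             # weighted (fixed-effect) average
Q <- sum((sim$yi - theta_w)^2 / sim$vi)           # weighted sum of squares
pchisq(Q, df = nrow(sim) - 1, lower.tail = FALSE) # p-value under H0: lambda = 0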

Hedges & Schauer proposed using the Q statistic to evaluate the consistency of a series of replications:

  • In the case of an exact replication, \(\lambda = 0\) because \(\theta_1 = \theta_2 = \dots = \theta_k\).
  • In the case of an extension, \(\lambda < \lambda_0\), where \(\lambda_0\) is the maximum amount of heterogeneity considered negligible (i.e., practically null).

This approach tests the consistency (i.e., homogeneity) of the replications. A successful replication should minimize heterogeneity, and a significant Q statistic provides evidence against the replication of the effect2.

The method has been expanded and formalized in several papers with different objectives:

  • to cover different replication setups (burden of proof on replicating vs non-replicating, many-to-one and one-to-one designs, etc.)
  • to interpret and choose the \(\lambda_0\) parameter, which is the core of the approach
  • to evaluate the power and statistical properties of the test under different replication scenarios
  • to provide tests with the opposite formulation: the standard implementation puts the burden of proof on non-replication (i.e., \(H_0\) is that the studies replicate)

To evaluate a replication we can use the custom Qrep() function, which simply calculates the p-value based on the sampling distribution of Q.

Qrep <- function(yi, vi, lambda0 = 0, alpha = 0.05){
  fit <- metafor::rma(yi, vi) # random-effects model, used only to extract Q
  k <- fit$k
  Q <- fit$QE
  df <- k - 1
  # p-value from a chi-squared with non-centrality parameter lambda0
  Qp <- pchisq(Q, df = df, ncp = lambda0, lower.tail = FALSE)
  pval <- ifelse(Qp < 0.001, "p < 0.001", sprintf("p = %.3f", Qp))
  lambda <- ifelse((Q - df) < 0, 0, (Q - df)) # estimated non-centrality, truncated at 0
  res <- list(Q = Q, lambda = lambda, pval = Qp, df = df, k = k, alpha = alpha, lambda0 = lambda0)
  H0 <- ifelse(lambda0 != 0, paste("H0: lambda <", lambda0), "H0: lambda = 0")
  title <- ifelse(lambda0 != 0, "Q test for Approximate Replication", "Q test for Exact Replication")
  cli::cli_rule()
  cat(cli::col_blue(cli::style_bold(title)), "\n\n")
  cat(sprintf("Q = %.3f (df = %s), lambda = %.3f, %s", res$Q, res$df, lambda, pval), "\n")
  cat(H0, "\n")
  cli::cli_rule()
  class(res) <- "Qrep"
  invisible(res)
}
Qres <- Qrep(dat$yi, dat$vi)
Q test for Exact Replication 

Q = 509.539 (df = 99), lambda = 410.539, p < 0.001 
H0: lambda = 0 
plot.Qrep(Qres)

4.9.4.1 Q Statistics for an extension

In the case of an extension we need to set \(\lambda_0\) to a meaningful value, but the overall test is the same. The critical value of \(Q\) is no longer obtained from a central \(\chi^2\) but from a non-central \(\chi^2\) with \(\lambda_0\) as the non-centrality parameter.

Hedges & Schauer (2019c) provide different strategies to choose \(\lambda_0\). They show that, under some assumptions, \(\lambda = (k - 1) \frac{\tau^2}{\tilde{v}}\), where \(\tilde{v}\) is the typical within-study sampling variance.

Given that we introduced the \(I^2\) statistic, we can derive a \(\lambda_0\) based on \(I^2\). Schmidt & Hunter (2014) proposed that when \(\tilde{v}\) is at least 75% of the total variance \(\tilde{v} + \tau^2\), then \(\tau^2\) can be considered negligible. This corresponds to \(I^2 = 25\%\) and a ratio \(\frac{\tau^2}{\tilde{v}} = 1/3\); thus \(\lambda_0 = \frac{k - 1}{3}\) can be considered negligible heterogeneity.

k <- 100
dat <- sim_studies(k, 0.5, 0, 50, 50)
Qrep(dat$yi, dat$vi, lambda0 = (k - 1)/3)
Q test for Approximate Replication 

Q = 93.881 (df = 99), lambda = 0.000, p = 0.989 
H0: lambda < 33 

4.10 Small Telescopes (Simonsohn, 2015)

The idea is simple but quite powerful and insightful. Let’s assume that an original study found an effect of \(y_{orig} = 0.7\) in a two-sample design with \(n = 20\) per group.

  • we define a threshold as the effect size associated with a certain low power level (e.g., \(33\%\)) given the original sample size, i.e., \(\theta_{small} \approx 0.5\)
  • the replication study found an effect of \(y_{rep} = 0.2\) with \(n = 100\) subjects per group

If the upper bound of the replication confidence interval is lower than the small effect (\(\theta_{small} = 0.5\)), we conclude that the effect is probably so tiny that it could not have been detected by the original study. Thus there is no evidence for a successful replication.

We can use the custom small_telescope() function on simulated data:

small_telescope <- function(or_d,
                            or_se,
                            rep_d,
                            rep_se,
                            small,
                            ci = 0.95){
  # quantile for the ci
  qs <- c((1 - ci)/2, 1 - (1 - ci)/2)

  # original confidence interval
  or_ci <- or_d + qnorm(qs) * or_se

  # replication confidence interval
  rep_ci <- rep_d + qnorm(qs) * rep_se

  # the replication is (probably) successful if the upper CI bound exceeds the small effect
  is_replicated <- rep_ci[2] > small

  msg_original <- sprintf("Original Study: d = %.3f %s CI = [%.3f, %.3f]",
                          or_d, ci, or_ci[1], or_ci[2])

  msg_replicated <- sprintf("Replication Study: d = %.3f %s CI = [%.3f, %.3f]",
                            rep_d, ci, rep_ci[1], rep_ci[2])


  if(is_replicated){
    msg_res <- sprintf("The replicated effect is not smaller than the small effect (%.3f), (probably) replication!", small)
    msg_res <- cli::col_green(msg_res)
  }else{
    msg_res <- sprintf("The replicated effect is smaller than the small effect (%.3f), no replication!", small)
    msg_res <- cli::col_red(msg_res)
  }

  out <- data.frame(id = c("original", "replication"),
                    d = c(or_d, rep_d),
                    lower = c(or_ci[1], rep_ci[1]),
                    upper = c(or_ci[2], rep_ci[2]),
                    small = small
  )

  # nice message
  cat(
    msg_original,
    msg_replicated,
    cli::rule(),
    msg_res,
    sep = "\n"
  )

  invisible(out)

}
set.seed(2025)

d <- 0.2 # real effect

# original study
or_n <- 20
or_d <- 0.7
or_se <- sqrt(1/20 + 1/20)
d_small <- pwr::pwr.t.test(or_n, power = 0.33)$d

# replication
rep_n <- 100 # sample size of replication study
g0 <- rnorm(rep_n, 0, 1)
g1 <- rnorm(rep_n, d, 1)

rep_d <- mean(g1) - mean(g0)
rep_se <- sqrt(var(g1)/rep_n + var(g0)/rep_n)

Here we use pwr::pwr.t.test() to compute the effect size \(\theta_{small}\) (d_small in the code) that the original study could detect with 33% power.

small_telescope(or_d, or_se, rep_d, rep_se, d_small, ci = 0.95)
Original Study: d = 0.700 0.95 CI = [0.080, 1.320]
Replication Study: d = 0.214 0.95 CI = [-0.061, 0.490]
────────────────────────────────────────────────────────────────────────────────
The replicated effect is smaller than the small effect (0.493), no replication!

And a (rather elaborate) plot:

4.11 Replication Bayes factor

Verhagen & Wagenmakers (2014) proposed a method to quantify the evidence provided by a replication study. The core concepts needed to understand the method are introduced below.

4.11.1 Bayesian inference

Bayesian inference is the statistical procedure where prior beliefs about a phenomenon are combined, using the Bayes theorem, with evidence from data to obtain the posterior beliefs.

The interesting part is that the researcher expresses the prior beliefs in probabilistic terms. Then, after collecting the data, the evidence from the experiment is combined with the prior, increasing or decreasing the plausibility of the prior beliefs.

Let’s make a (not very innovative) example: we want to evaluate the fairness of a coin. The crucial parameter is \(\theta\), the probability of success (e.g., heads). We have a prior belief about the coin (e.g., fair, but with some uncertainty). We toss the coin \(k\) times and observe \(x\) heads. What are our conclusions?

\[ p(\theta|D) = \frac{p(D|\theta) \; p(\theta)}{p(D)} \] Where \(\theta\) is our parameter and \(D\) the data. \(p(\theta|D)\) is the posterior distribution, which is proportional to the product of the likelihood \(p(D|\theta)\) and the prior \(p(\theta)\). \(p(D)\) is the probability of the data (also known as the marginal likelihood) and is needed only for the posterior to be a proper probability distribution.

We can “read” the formula as: The probability of the parameter given the data is the product between the likelihood of the data given the parameter and the prior probability of the parameter.

Let’s express our prior belief in probabilistic terms:

Now we collect data and observe \(x = 40\) heads out of \(k = 50\) trials, thus \(\hat{\theta} = 0.8\), and compute the likelihood:

Finally we combine, using the Bayes rule, prior and likelihood to obtain the posterior distribution:
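
As a minimal sketch of these three steps, assuming (for illustration only) a Beta(10, 10) prior centered on a fair coin (the prior used for the figures may differ):

x <- 40 # observed heads
k <- 50 # number of tosses
theta <- seq(0, 1, length.out = 500)
prior <- dbeta(theta, 10, 10)                 # prior p(theta), assumed Beta(10, 10)
likelihood <- dbinom(x, k, theta)             # likelihood p(D | theta)
posterior <- dbeta(theta, 10 + x, 10 + k - x) # conjugate posterior p(theta | D)
plot(theta, posterior, type = "l", xlab = latex2exp::TeX("$\\theta$"), ylab = "Density")
lines(theta, prior, lty = "dashed")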

The idea of the Bayes Factor is to compare the evidence for the data under two competing hypotheses, \(H_0\) and \(H_1\) (which play the role of \(\theta\) in the previous example):

\[ \frac{p(H_0|D)}{p(H_1|D)} = \frac{f(D|H_0)}{f(D|H_1)} \times \frac{p(H_0)}{p(H_1)} \]

Where \(f\) is the likelihood function and \(D\) are the data. \(\frac{p(H_0)}{p(H_1)}\) is the prior odds of the two hypotheses. The Bayes Factor is the ratio between the likelihood of the data under the two hypotheses.

4.11.2 Bayes Factor using the SDR

Calculating the BF can be problematic in some conditions. The Savage-Dickey density ratio (SDR) is a convenient shortcut to calculate the Bayes Factor (Wagenmakers et al., 2010). The idea is that the ratio between the posterior and the prior density under \(H_1\), evaluated at the null value, equals the Bayes Factor calculated in the standard way.

\[ BF_{01} = \frac{p(D|H_0)}{p(D|H_1)} = \frac{p(\theta = x|D, H_1)}{p(\theta = x|H_1)} \]

Where \(\theta\) is the parameter of interest, \(x\) is the null value under \(H_0\) (e.g., 0), and \(D\) are the data.

Following the previous example \(H_0: \theta = 0.5\). Under \(H_1\) we use a completely uninformative prior by setting \(\theta \sim Beta(1, 1)\).

We flip the coin again 20 times and find \(\hat \theta = 0.75\).

The ratio between the two black dots (the prior and posterior densities evaluated at \(\theta = 0.5\)) is the Bayes Factor.
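
A minimal numerical sketch of this ratio, assuming 15 heads out of 20 tosses (\(\hat \theta = 0.75\)) and the Beta(1, 1) prior under \(H_1\):

x <- 15 # heads in the new set of tosses
k <- 20
# posterior under H1 is Beta(1 + x, 1 + k - x); evaluate prior and posterior at theta = 0.5
bf01 <- dbeta(0.5, 1 + x, 1 + k - x) / dbeta(0.5, 1, 1)
bf01     # ~ 0.31, evidence against H0
1 / bf01 # ~ 3.2 in favour of H1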

4.11.3 Verhagen & Wagenmakers (2014) model3

The idea is to use the posterior distribution of the original study as the prior for a Bayesian hypothesis test where:

  • \(H_0: \theta_{rep} = 0\) thus there is no effect in the replication study
  • \(H_1: \theta_{rep} \neq 0\), and in particular \(\delta \sim \mathcal{N}(\theta_{orig}, \sigma^2_{orig})\), where \(\theta_{orig}\) and \(\sigma^2_{orig}\) are the estimate and squared standard error of the original study

If \(H_0\) is more likely after seeing the replication data, there is evidence against a successful replication (i.e., \(BF_{0r} > 1\)); otherwise there is evidence for a successful replication (\(BF_{r0} > 1\)).

Warning

Disclaimer: the actual implementation of Verhagen & Wagenmakers (2014) is different (they use the \(t\) statistic). We implemented a similar model using a Bayesian linear model with rstanarm.

Let’s assume that the original study (\(n = 30\)) estimated \(y_{orig} = 0.4\) with a standard error of \(\sqrt{\sigma^2/n}\).

# original study
n <- 30
yorig <- 0.4
se <- sqrt(1/30)
Note

The assumption of Verhagen & Wagenmakers (2014) is that the original study performed a Bayesian analysis with a completely flat prior. Thus the confidence interval is the same as the Bayesian credible interval.

For this reason, the posterior distribution of the original study can be approximated as a normal distribution centered on \(y_{orig}\) with standard deviation equal to the standard error, i.e., \(\mathcal{N}(y_{orig}, \sigma^2_{orig})\).

Let’s imagine that a new study tried to replicate the original one. They collected \(n = 100\) participants with the same protocol and found an effect of \(y_{rep} = 0.1\).

nrep <- 100
yrep <- MASS::mvrnorm(nrep, mu = 0.1, Sigma = 1, empirical = TRUE)[, 1]
dat <- data.frame(y = yrep)
hist(yrep, main = "Replication Study (n1 = 100)", xlab = latex2exp::TeX("$y_{rep}$"))
abline(v = mean(yrep), lwd = 2, col = "firebrick")

We can analyze these data with an intercept-only regression model setting as prior the posterior distribution of the original study:

# setting the prior on the intercept parameter
prior <- rstanarm::normal(location = yorig,
                          scale = se)

# fitting the bayesian linear regression
fit <- stan_glm(y ~ 1, 
                data = dat, 
                prior_intercept = prior,
                refresh = FALSE)

summary(fit)

Model Info:
 function:     stan_glm
 family:       gaussian [identity]
 formula:      y ~ 1
 algorithm:    sampling
 sample:       4000 (posterior sample size)
 priors:       see help('prior_summary')
 observations: 100
 predictors:   1

Estimates:
              mean   sd   10%   50%   90%
(Intercept) 0.2    0.1  0.1   0.2   0.3  
sigma       1.0    0.1  0.9   1.0   1.1  

Fit Diagnostics:
           mean   sd   10%   50%   90%
mean_PPD 0.2    0.1  0.0   0.2   0.3  

The mean_ppd is the sample average posterior predictive distribution of the outcome variable (for details see help('summary.stanreg')).

MCMC diagnostics
              mcse Rhat n_eff
(Intercept)   0.0  1.0  2609 
sigma         0.0  1.0  2608 
mean_PPD      0.0  1.0  3185 
log-posterior 0.0  1.0  1693 

For each parameter, mcse is Monte Carlo standard error, n_eff is a crude measure of effective sample size, and Rhat is the potential scale reduction factor on split chains (at convergence Rhat=1).

We can use bayestestR::bayesfactor_pointnull() to calculate the BF using the Savage-Dickey density ratio.

bf <- bayestestR::bayesfactor_pointnull(fit, null = 0)
print(bf)
Bayes Factor (Savage-Dickey density ratio)

Parameter   |    BF
-------------------
(Intercept) | 0.274

* Evidence Against The Null: 0
plot(bf)

You can also use the bf_replication() function:

bf_replication <- function(mu_original,
                           se_original,
                           replication){

  # prior based on the original study
  prior <- rstanarm::normal(location = mu_original, scale = se_original)

  # to dataframe
  replication <- data.frame(y = replication)

  fit <- rstanarm::stan_glm(y ~ 1,
                            data = replication,
                            prior_intercept = prior,
                            refresh = 0) # avoid printing

  bf <- bayestestR::bayesfactor_pointnull(fit, null = 0, verbose = FALSE)

  title <- "Bayes Factor Replication Rate"
  posterior <- "Posterior Distribution ~ Mean: %.3f, SE: %.3f"
  replication <- "Evidence for replication: %.3f (log %.3f)"
  non_replication <- "Evidence for non replication: %.3f (log %.3f)"

  if(bf$log_BF > 0){
    replication <- cli::col_green(sprintf(replication, exp(bf$log_BF), bf$log_BF))
    non_replication <- sprintf(non_replication, 1/exp(bf$log_BF), -bf$log_BF)
  }else{
    replication <- sprintf(replication, exp(bf$log_BF), bf$log_BF)
    non_replication <- cli::col_red(sprintf(non_replication, 1/exp(bf$log_BF), -bf$log_BF))
  }

  outlist <- list(
    fit = fit,
    bf = bf
  )

  cat(
    cli::col_blue(title),
    cli::rule(),
    sprintf(posterior, fit$coefficients, fit$ses),
    "\n",
    replication,
    non_replication,
    sep = "\n"
  )

  invisible(outlist)

}
bf_replication(mu_original = yorig, se_original = se, replication = yrep)

A better custom plot:

bfplot <- data.frame(
  prior = rnorm(1e5, yorig, se),
  likelihood = rnorm(1e5, mean(yrep), sd(yrep)/sqrt(length(yrep))),
  posterior = rnorm(1e5, fit$coefficients, fit$ses)
)
 
ggplot() +
    stat_function(geom = "line", 
                  aes(color = "Original Study (Prior)"),
                  linewidth = 1,
                  alpha = 0.7,
                  fun = dnorm, args = list(mean = yorig, sd = se)) +
    stat_function(geom = "line",
                  linewidth = 1,
                  aes(color = "Replication Study (Posterior)"),
                  fun = dnorm, args = list(mean = fit$coefficients, sd = fit$ses)) +
    stat_function(geom = "line",
                  linewidth = 0.7,
                  alpha = 0.7,
                  aes(color = "Replication Study (Likelihood)"),
                  #lty = "dashed",
                  fun = dnorm, args = list(mean = mean(yrep), sd = sd(yrep)/sqrt(length(yrep)))) +
    xlim(c(-0.5, 1.2)) +
    geom_point(aes(x = c(0, 0), y = c(dnorm(0, yorig, sd = se),
                                      dnorm(0, fit$coefficients, sd = fit$ses))),
               size = 3) +
    xlab(latex2exp::TeX("\\delta")) +
    ylab("Density") +
    theme(legend.position = "bottom",
          legend.title = element_blank())

Bushman, B. J., & Wang, M. C. (2009). Vote-counting procedures in meta-analysis. In The handbook of research synthesis and meta-analysis (pp. 207–220). Russell Sage Foundation.
Hedges, L. V., & Olkin, I. (1980). Vote-counting methods in research synthesis. Psychological Bulletin, 88, 359–369. https://doi.org/10.1037/0033-2909.88.2.359
Hedges, L. V., & Schauer, J. M. (2019a). Consistency of effects is important in replication: Rejoinder to mathur and VanderWeele (2019). Psychological Methods, 24, 576–577. https://doi.org/10.1037/met0000237
Hedges, L. V., & Schauer, J. M. (2019b). More than one replication study is needed for unambiguous tests of replication. Journal of Educational and Behavioral Statistics: A Quarterly Publication Sponsored by the American Educational Research Association and the American Statistical Association, 44, 543–570. https://doi.org/10.3102/1076998619852953
Hedges, L. V., & Schauer, J. M. (2019c). Statistical analyses for studying replication: Meta-analytic perspectives. Psychological Methods, 24, 557–570. https://doi.org/10.1037/met0000189
Hedges, L. V., & Schauer, J. M. (2021). The design of replication studies. Journal of the Royal Statistical Society. Series A, (Statistics in Society), 184, 868–886. https://doi.org/10.1111/rssa.12688
Heyard, R., Pawel, S., Frese, J., Voelkl, B., Würbel, H., McCann, S., Held, L., Wever, K. E., Hartmann, H., Townsin, L., & Zellers, S. (2024). A scoping review on metrics to quantify reproducibility: A multitude of questions leads to a multitude of metrics. https://scholar.google.com/citations?view_op=view_citation&hl=en&citation_for_view=JAb7P1QAAAAJ:ldfaerwXgEUC
Lakens, D., Scheel, A. M., & Isager, P. M. (2018). Equivalence testing for psychological research: A tutorial. Advances in Methods and Practices in Psychological Science, 1, 259–269. https://doi.org/10.1177/2515245918770963
Ly, A., Etz, A., Marsman, M., & Wagenmakers, E.-J. (2019). Replication bayes factors from evidence updating. Behavior Research Methods, 51, 2498–2508. https://doi.org/10.3758/s13428-018-1092-x
Mathur, M. B., & VanderWeele, T. J. (2019). Challenges and suggestions for defining replication "success" when effects may be heterogeneous: Comment on hedges and schauer (2019). Psychological Methods, 24, 571–575. https://doi.org/10.1037/met0000223
Mathur, M. B., & VanderWeele, T. J. (2020). New statistical metrics for multisite replication projects. Journal of the Royal Statistical Society. Series A, (Statistics in Society), 183, 1145–1166. https://doi.org/10.1111/rssa.12572
Rouder, J. N., Speckman, P. L., Sun, D., Morey, R. D., & Iverson, G. (2009). Bayesian t tests for accepting and rejecting the null hypothesis. Psychonomic Bulletin & Review, 16, 225–237. https://doi.org/10.3758/PBR.16.2.225
Schauer, J. M. (2022). Replicability and meta-analysis. In W. O’Donohue, A. Masuda, & S. Lilienfeld (Eds.), Avoiding questionable research practices in applied psychology (pp. 301–342). Springer International Publishing. https://doi.org/10.1007/978-3-031-04968-2_14
Schauer, J. M., & Hedges, L. V. (2020). Assessing heterogeneity and power in replications of psychological experiments. Psychological Bulletin, 146, 701–719. https://doi.org/10.1037/bul0000232
Schauer, J. M., & Hedges, L. V. (2021). Reconsidering statistical methods for assessing replication. Psychological Methods, 26, 127–139. https://doi.org/10.1037/met0000302
Schmidt, F. L., & Hunter, J. E. (2014). Methods of meta-analysis: Correcting error and bias in research findings (3rd ed.). SAGE Publications. https://doi.org/10.4135/9781483398105
Simonsohn, U. (2015). Small telescopes: Detectability and the evaluation of replication results: Detectability and the evaluation of replication results. Psychological Science, 26, 559–569. https://doi.org/10.1177/0956797614567341
Valentine, J. C., Biglan, A., Boruch, R. F., Castro, F. G., Collins, L. M., Flay, B. R., Kellam, S., Mościcki, E. K., & Schinke, S. P. (2011). Replication in prevention science. Prevention Science: The Official Journal of the Society for Prevention Research, 12, 103–117. https://doi.org/10.1007/s11121-011-0217-6
Verhagen, J., & Wagenmakers, E.-J. (2014). Bayesian tests to quantify the result of a replication attempt. Journal of Experimental Psychology. General, 143, 1457–1475. https://doi.org/10.1037/a0036731
Wagenmakers, E.-J., Lodewyckx, T., Kuriyal, H., & Grasman, R. (2010). Bayesian hypothesis testing for psychologists: A tutorial on the savage–dickey method. Cognitive Psychology, 60, 158–189. https://doi.org/10.1016/j.cogpsych.2009.12.001

  1. The name of the method corresponds to the first column of the online table.

  2. The approach has been debated in a series of opinion papers (see Hedges & Schauer, 2019a; Mathur & VanderWeele, 2019).

  3. See also Ly et al. (2019) for an improvement.