Simulating Simpson's Paradox
By Bernhard Bieri in econometrics aidememoire statistics simulation
September 26, 2022
This is the first post in a series called Bernie’s statistics aidemémoire where I strive to understand various econometric concepts with simulations and other techniques. This first entry is dedicated to Simpson’s paradox, a well known phenomenon occuring when considering data at multiple levels of observation.
Simpson’s Paradox
Simpson’s Case Study
Simpson’s paradox has been defined in multiple ways over the past few decades and the Stanford Encyclopedia provides an excellent theoretical overview of the concept. The aim of this less extensive post is to provide a short introduction of Simpson’s paradox and illustrate it by running Monte Carlo simulations to visually explore its impact on causal inference.
In short, Simpson’s paradox manifests itself when one considers data at
different levels of observation and observes the reversal or the disappearance
of an association between two variables (
Sprenger and Weinberger 2021). Imagine
that you are running a field experiment in Statsland and run a randomized
control trial to assess the causal effect of a medical treatment on a given
health condition. For simplicity, assume that health status is binary e.g.
\(Y\_i \in \{\text{Healthy}; \text{Sick}\}\)
. Assume that you observe a large enough
sample so that the observed proportions correspond to the true proportions of
the population. Finally, imagine that there are two regions in Statsland: North
and South. Given this setup, the following situation may arise.
Region Arm  North Control  North Treated  South Control  South Treated  Total 

Healthy  4  8  2  12  26 
Sick  3  5  3  15  26 
Total  7  13  5  27  52 
The numbers in the table above are inspired by ( Sprenger and Weinberger 2021) who use the example of ( Simpson 1951) to illustrate the intuition behind Simpson’s Paradox.
One can identify two seemingly contradicting insights in the numbers above. When considering the population as a whole, the proportion of cured individuals is equal in the treatment and the control arm e.g.
$$P(HealthyTreated) = \frac{8+12}{40} = P(HealthyControl) = \frac{2+4}{20}$$
However, one sees that when conditioning on the region in addition to treatment status,
we notice that
$$P(HealthyTreated, North) = \frac{8}{13} > P(HealthyControl, North) = \frac{4}{7}$$
and
$$P(HealthyTreated, South) = \frac{12}{27} > P(HealthyControl, South) = \frac{2}{5}$$
In other words, the treatment seems to be ineffective when considering the population as a whole. On the contrary, when considering the conditional probabilities within each level of observation, the treatment seems to have a positive effect on health status^{1}.
Definition
Theoretical Intuition
Where does this seemingly counterintuitive pattern emerge from? As ( Sprenger and Weinberger 2021) point out, one can explain the situation through two observations in the data above. First, people in the northern region are more likely to recover from treatment overall. Second, individuals in the northern region are less likely to be in the treatment group. These two facts explain the disappearance of the correlation at the population level. Formally, we can show it through the following probability decomposition:
$$
\begin{aligned} P(\mathrm{~H} \mid \mathrm{T}) &=P(\mathrm{~H} \mid \mathrm{T}, \mathrm{N}) P(\mathrm{N} \mid \mathrm{T})+P(\mathrm{~H} \mid \mathrm{T}, \neg \mathrm{N}) P(\neg \mathrm{N} \mid \mathrm{T}) \\ P(\mathrm{~H} \mid \neg \mathrm{T}) &=P(\mathrm{~H} \mid \neg \mathrm{T}, \mathrm{N}) P(\mathrm{N} \mid \neg \mathrm{T})+P(\mathrm{~H} \mid \neg \mathrm{T}, \neg \mathrm{N}) P(\neg \mathrm{N} \mid \neg \mathrm{T}) \end{aligned}
$$
The key terms in this decomposition are \(P(\mathrm{N} \mid \mathrm{T})\)
and
\(P(\mathrm{\neg N} \mid \mathrm{T})\)
that weight the probability within the
region to form the population level probability^{2} in a way that leads to the positive association within
regions to vanish in the aggregate population.
Different Forms of the Paradox
Simpson’s paradox has been defined in the following three main ways ( Sprenger and Weinberger 2021):
 Association Reversal
 In this first version, the association wihtin each subpopulation/region is either positive, negative, or null while the association at the population level has a different sign.
 See ( Samuels 1993).
 Yule’s Association Paradox
 In this second version, the association in the whole population is different from zero e.g. there is a positive or negative effect of the treatment when considering Statsland as a whole. But the effect dissappears when considering individual regions.
 See ( Mittal 1991).
 Amalgamation Paradox
 In this third and more general version, the association between the two variables of interest in the population as a whole is stronger than the maximum association between the two variables within each region.
 See ( Good and Mittal 1987).
For a more formal characterisation of the different variants of Simpson’s Paradox see the excellent entry by ( Sprenger and Weinberger 2021).
Simulations with NonCategorical Data
Now that we gained an intuitive understanding of the paradox, let us run some
simulations of an Association Reversal. Unlike in the first section of this
post where we considered a binary treatment variable, we will illustrate
it with a continuous independent variable \(X\)
. We start by writing the data
generation process where we define three villages in which we record 1000
observations each. We then define three village level effects called beta
as well as various moments of the independent variable \(X\)
and the error term
\(U\)
. We then generate one example run of the synthetic dataset, plot
it and examine a table of results of the various specifications. We define the
naive full specification as:
$$ y_i = x_i \beta_i + u_i $$ the village level specification for village 1 as:
$$ y_i = x_i \beta_i + u_i,\ \text{where}\ i\ \text{lives in village 1} $$ and the full nonnaive regression with a village level intercept as:
$$ y_i = \alpha_v + x_i \beta_i + u_i $$
# Function generating data affected by Simpson's paradox with an
# Association Reversal
gensimpsondata < function(N, # Within village beta.
beta = c(1,1,1), # Within village beta.
mu = c(5, 10, 15), # Mean of X within village
sigmax = rep(2, 3), # SD of X within village
sigmau = rep(6, 3), # SD of error term
groupFE = 15, # Fixed effect of Group
name_groups = "Village") {
groupNo < seq(length(mu))
groups < paste0(name_groups, " ", groupNo)
datasetlist < list()
for (i in seq(1, length(groupNo))) {
datasetlist[[i]] < dplyr::tibble(
GroupNo = groupNo[[i]],
Group = rep(groups[[i]], N),
GroupFE = groupNo[[i]] * groupFE,
X = rnorm(N, mu[[i]], sigmax[[i]]) + groupNo[[i]],
# X depends on village level effects
U = rnorm(N, 0, sigmau[[i]])
) >
dplyr::mutate(Y = beta[[i]] * X + GroupFE + U)
# Y depends on X and Village level effects.
}
out < dplyr::bind_rows(datasetlist)
}
# Generate one iteration of the data called "example".
example < gensimpsondata(N = 1000)
In the first graph below, we consider all three villages and seemingly identify
a positive association between \(X\)
and \(Y\)
.
# Plot the example data.
# Plot with full sample.
plotexamplefull < ggplot2::ggplot(example) +
ggplot2::geom_point(ggplot2::aes(X, Y), color = "white", alpha = 0.2) +
ggplot2::geom_smooth(
ggplot2::aes(X, Y),
method = "lm",
color = "white",
formula = y ~ x
) +
ggplot2::labs(title = "Illustrating Simpson's Paradox",
subtitle = "Simple regression of Y on X for the full sample.",
caption = "Simulated data.") +
HohgantR::themebernie()
plotexamplefull
However, when running the regressions within each individual village, we find that the effect seems to be reversed! This illustrates the Association Reversal, described in the previous section ( Samuels 1993).
# Plot with full sample grouped by village.
plotexample < ggplot2::ggplot(example) +
ggplot2::geom_point(ggplot2::aes(X, Y, colour = Group), alpha = 0.2) +
ggplot2::geom_smooth(
ggplot2::aes(X, Y),
method = "lm",
color = "white",
formula = y ~ x
) +
ggplot2::geom_smooth(ggplot2::aes(x = X, y = Y, color = Group),
method = "lm",
formula = y ~ x) +
ggplot2::labs(title = "Illustrating Simpson's Paradox",
subtitle = "Simple regression of Y on X for the three villages individually.",
caption = "Simulated data.") +
HohgantR::themebernie()
# Display plot
plotexample
Let us quickly complement this visual intuition within a table with actual
regression coefficients. Since we generated the data above, we already know
the true village level association between our variables of interest and
know the the village effect. We thus specify the naive regression of
\(Y\)
on \(X\)
before looking at the effect within each village and finally
specifying the “correct” regression with village level intercepts.
# Display the OLS regressions in a simple table to show the different results
# when considering the full sample and considering the villagebyvillage
lmall < lm(data = example, Y ~ X)
lmvillage1 < dplyr::filter(example, Group == "Village 1") >
lm(formula = Y ~ X)
lmvillage2 < dplyr::filter(example, Group == "Village 2") >
lm(formula = Y ~ X)
lmvillage3 < dplyr::filter(example, Group == "Village 3") >
lm(formula = Y ~ X)
# The true regression with village level intercepts
lmcorrect < lm(data = example, Y ~ X + Group)
stargazer::stargazer(
lmall,
lmvillage1,
lmvillage2,
lmvillage3,
lmcorrect,
type = "html",
style = "aer",
title = "Simple Regressions with Full Sample and Village level data",
column.labels = c("Full Sample Naive",
"Village 1",
"Village 2",
"Village 3",
"Full Sample Correct")
)
Y  
Full Sample Naive  Village 1  Village 2  Village 3  Full Sample Correct  
\(1\)  \(2\)  \(3\)  \(4\)  \(5\)  
X  1.173^{\*\*\*}  0.729^{\*\*\*}  1.102^{\*\*\*}  1.112^{\*\*\*}  0.983^{\*\*\*} 
(0.026)  (0.094)  (0.098)  (0.091)  (0.054)  
GroupVillage 2  14.997^{\*\*\*}  
(0.424)  
GroupVillage 3  30.346^{\*\*\*}  
(0.709)  
Constant  3.942^{\*\*\*}  13.239^{\*\*\*}  31.189^{\*\*\*}  47.439^{\*\*\*}  14.762^{\*\*\*} 
(0.344)  (0.592)  (1.192)  (1.653)  (0.378)  
Observations  3,000  1,000  1,000  1,000  3,000 
R^{2}  0.401  0.057  0.113  0.130  0.629 
Adjusted R^{2}  0.401  0.056  0.112  0.129  0.628 
Residual Std. Error  7.622 (df = 2998)  5.915 (df = 998)  6.234 (df = 998)  5.832 (df = 998)  6.005 (df = 2996) 
F Statistic  2,009.359^{\*\*\*} (df = 1; 2998)  60.746^{\*\*\*} (df = 1; 998)  127.264^{\*\*\*} (df = 1; 998)  149.318^{\*\*\*} (df = 1; 998)  1,690.086^{\*\*\*} (df = 3; 2996) 
Notes:  ^{\*\*\*}Significant at the 1 percent level.  
^{\*\*}Significant at the 5 percent level.  
^{\*}Significant at the 10 percent level. 
Let us now convince ourselves that we did not get these results by chance as
this was only one draw of our data generating function. To do that, we
plot a distribution of the coefficient of interest \(\beta\)
in the three regressions
specified at the beginning of this section we computed on 200 draws of the
data generating process specified above. We can gain three main insights from
the graph below. First, we see that the coefficient estimate under the naive
specification is biased and on average has a different sign than the village
level estimates. Second, we see that the our “correct” specification and
village level specifications are unbiased. Finally, we see that the “correct”
specification estimates the coefficient in a more precise way as the curve
is narrower than village level estimates due to a higher sample size.
mcsimpson < function(reps = 10) {
datalist < list()
coefvectlst < list()
for (i in seq(reps)) {
datalist[[i]] < gensimpsondata(N = 1000)
coefvectlst[[i]] < vector()
lmall < lm(data = datalist[[i]], Y ~ X)
coefvectlst[[i]][[1]] < lmall[["coefficients"]][["X"]]
lmvillage1 <
dplyr::filter(datalist[[i]], Group == "Village 1") >
lm(formula = Y ~ X)
coefvectlst[[i]][[2]] < lmvillage1[["coefficients"]][["X"]]
lmvillage2 <
dplyr::filter(datalist[[i]], Group == "Village 2") >
lm(formula = Y ~ X)
coefvectlst[[i]][[3]] < lmvillage2[["coefficients"]][["X"]]
lmvillage3 <
dplyr::filter(datalist[[i]], Group == "Village 3") >
lm(formula = Y ~ X)
coefvectlst[[i]][[4]] < lmvillage3[["coefficients"]][["X"]]
lmcorrect < lm(data = datalist[[i]], Y ~ X + Group)
coefvectlst[[i]][[5]] < lmcorrect[["coefficients"]][["X"]]
}
coefvectdf < as.data.frame(do.call(rbind, coefvectlst))
coefvectdf
}
coefsMC < mcsimpson(reps = 200) >
dplyr::mutate(index = seq(200)) >
tidyr::gather(index, coefficient, dplyr::starts_with("V")) >
dplyr::mutate(
index = replace(index, index == "V1", "Full Sample: Naive"),
index = replace(index, index == "V2", "Village 1"),
index = replace(index, index == "V3", "Village 2"),
index = replace(index, index == "V4", "Village 3"),
index = replace(index, index == "V5", "Full Sample: Correct")
) >
dplyr::rename(Sample = index,
Beta = coefficient)
# Density plot of different coefficients
freqmcsimpson < ggplot2::ggplot(coefsMC) +
ggplot2::geom_density(ggplot2::aes(
Beta,
group = Sample,
color = Sample,
fill = Sample
),
alpha = 0.2) +
ggplot2::labs(title = "Density Plot of Coefficient Estimates Under\nDifferent Specifications",
subtitle = "Coefficients estimated by OLS",
caption = "Monte Carlo Simulation: 200 iterations of synthetic DGP.") +
HohgantR::themebernie()
freqmcsimpson
Simulations such as the one we just performed are a neat way of illustrating Simpson’s paradox. We have, however, not yet fully determined a method that would formally tell us at which level associations need to be considered in an observational study with actual empirical data. One way of seeing this is through what is known as Pearl’s Simpson Machine which will be the topic of the next post ( Armistead 2014; Pearl 1993)!
PS: Please reach out if you spot any errors or have any comments and suggestions via the contact form.
## Warning in utils::citation(..., lib.loc = lib.loc): no date field in DESCRIPTION
## file of package 'ggplot2'
Code Appendix
# Libraries
library(tidyverse)
library(MASS)
library(tidyverse)
library(stargazer)
library(dagitty)
library(ggdag)
# Load nonstandard fonts with {extrafont}
extrafont::loadfonts("win")
extrafont::fonts()
# Function generating data affected by Simpson's paradox with an
# Association Reversal
gensimpsondata < function(N, # Within village beta.
beta = c(1,1,1), # Within village beta.
mu = c(5, 10, 15), # Mean of X within village
sigmax = rep(2, 3), # SD of X within village
sigmau = rep(6, 3), # SD of error term
groupFE = 15, # Fixed effect of Group
name_groups = "Village") {
groupNo < seq(length(mu))
groups < paste0(name_groups, " ", groupNo)
datasetlist < list()
for (i in seq(1, length(groupNo))) {
datasetlist[[i]] < dplyr::tibble(
GroupNo = groupNo[[i]],
Group = rep(groups[[i]], N),
GroupFE = groupNo[[i]] * groupFE,
X = rnorm(N, mu[[i]], sigmax[[i]]) + groupNo[[i]],
# X depends on village level effects
U = rnorm(N, 0, sigmau[[i]])
) >
dplyr::mutate(Y = beta[[i]] * X + GroupFE + U)
# Y depends on X and Village level effects.
}
out < dplyr::bind_rows(datasetlist)
}
# Generate one iteration of the data called "example".
example < gensimpsondata(N = 1000)
# Plot the example data.
# Plot with full sample.
plotexamplefull < ggplot2::ggplot(example) +
ggplot2::geom_point(ggplot2::aes(X, Y), color = "white", alpha = 0.2) +
ggplot2::geom_smooth(
ggplot2::aes(X, Y),
method = "lm",
color = "white",
formula = y ~ x
) +
ggplot2::labs(title = "Illustrating Simpson's Paradox",
subtitle = "Simple regression of Y on X for the full sample.",
caption = "Simulated data.") +
HohgantR::themebernie()
plotexamplefull
# Plot with full sample grouped by village.
plotexample < ggplot2::ggplot(example) +
ggplot2::geom_point(ggplot2::aes(X, Y, colour = Group), alpha = 0.2) +
ggplot2::geom_smooth(
ggplot2::aes(X, Y),
method = "lm",
color = "white",
formula = y ~ x
) +
ggplot2::geom_smooth(ggplot2::aes(x = X, y = Y, color = Group),
method = "lm",
formula = y ~ x) +
ggplot2::labs(title = "Illustrating Simpson's Paradox",
subtitle = "Simple regression of Y on X for the three villages individually.",
caption = "Simulated data.") +
HohgantR::themebernie()
# Display plot
plotexample
# Generate featured gif with the example data. Does not appear in blogpost.
plot1 < ggplot2::ggplot(example) +
ggplot2::geom_point(ggplot2::aes(X, Y), alpha = 0.2) +
ggplot2::geom_smooth(
ggplot2::aes(X, Y),
method = "lm",
color = "white",
formula = y ~ x
) +
ggplot2::theme_void() +
ggplot2::theme(
panel.background = ggplot2::element_rect(fill = "#222831", color = NA),
plot.background = ggplot2::element_rect(fill = "#222831", color = NA),
legend.position = "none"
)
plot2 < ggplot2::ggplot(example) +
ggplot2::geom_point(ggplot2::aes(X, Y, colour = Group), alpha = 0.2) +
ggplot2::geom_smooth(ggplot2::aes(x = X, y = Y, color = Group),
method = "lm",
formula = y ~ x) +
ggplot2::theme_void() +
ggplot2::theme(
panel.background = ggplot2::element_rect(fill = "#222831", color = NA),
plot.background = ggplot2::element_rect(fill = "#222831", color = NA),
legend.position = "none"
)
ggplot2::ggsave(plot1,
filename = "~/GitHub/Website/content/blog/20220930simpsonsparadox/plot1.png",
dpi = 192)
ggplot2::ggsave(plot2,
filename = "~/GitHub/Website/content/blog/20220930simpsonsparadox/plot2.png",
dpi = 192)
# Display the OLS regressions in a simple table to show the different results
# when considering the full sample and considering the villagebyvillage
lmall < lm(data = example, Y ~ X)
lmvillage1 < dplyr::filter(example, Group == "Village 1") >
lm(formula = Y ~ X)
lmvillage2 < dplyr::filter(example, Group == "Village 2") >
lm(formula = Y ~ X)
lmvillage3 < dplyr::filter(example, Group == "Village 3") >
lm(formula = Y ~ X)
# The true regression with village level intercepts
lmcorrect < lm(data = example, Y ~ X + Group)
stargazer::stargazer(
lmall,
lmvillage1,
lmvillage2,
lmvillage3,
lmcorrect,
type = "html",
style = "aer",
title = "Simple Regressions with Full Sample and Village level data",
column.labels = c("Full Sample Naive",
"Village 1",
"Village 2",
"Village 3",
"Full Sample Correct")
)
mcsimpson < function(reps = 10) {
datalist < list()
coefvectlst < list()
for (i in seq(reps)) {
datalist[[i]] < gensimpsondata(N = 1000)
coefvectlst[[i]] < vector()
lmall < lm(data = datalist[[i]], Y ~ X)
coefvectlst[[i]][[1]] < lmall[["coefficients"]][["X"]]
lmvillage1 <
dplyr::filter(datalist[[i]], Group == "Village 1") >
lm(formula = Y ~ X)
coefvectlst[[i]][[2]] < lmvillage1[["coefficients"]][["X"]]
lmvillage2 <
dplyr::filter(datalist[[i]], Group == "Village 2") >
lm(formula = Y ~ X)
coefvectlst[[i]][[3]] < lmvillage2[["coefficients"]][["X"]]
lmvillage3 <
dplyr::filter(datalist[[i]], Group == "Village 3") >
lm(formula = Y ~ X)
coefvectlst[[i]][[4]] < lmvillage3[["coefficients"]][["X"]]
lmcorrect < lm(data = datalist[[i]], Y ~ X + Group)
coefvectlst[[i]][[5]] < lmcorrect[["coefficients"]][["X"]]
}
coefvectdf < as.data.frame(do.call(rbind, coefvectlst))
coefvectdf
}
coefsMC < mcsimpson(reps = 200) >
dplyr::mutate(index = seq(200)) >
tidyr::gather(index, coefficient, dplyr::starts_with("V")) >
dplyr::mutate(
index = replace(index, index == "V1", "Full Sample: Naive"),
index = replace(index, index == "V2", "Village 1"),
index = replace(index, index == "V3", "Village 2"),
index = replace(index, index == "V4", "Village 3"),
index = replace(index, index == "V5", "Full Sample: Correct")
) >
dplyr::rename(Sample = index,
Beta = coefficient)
# Density plot of different coefficients
freqmcsimpson < ggplot2::ggplot(coefsMC) +
ggplot2::geom_density(ggplot2::aes(
Beta,
group = Sample,
color = Sample,
fill = Sample
),
alpha = 0.2) +
ggplot2::labs(title = "Density Plot of Coefficient Estimates Under\nDifferent Specifications",
subtitle = "Coefficients estimated by OLS",
caption = "Monte Carlo Simulation: 200 iterations of synthetic DGP.") +
HohgantR::themebernie()
freqmcsimpson
################################
## Citing all loaded packages ##
################################
knitr::write_bib(c(.packages(), "blogdown"), paste0(here::here(), "/content/blog/20220930simpsonsparadox/packages.bib"))
References
Armistead, Timothy W. 2014. “Resurrecting the Third Variable: A Critique of Pearl’s Causal Analysis of Simpson’s Paradox.” The American Statistician 68 (1): 1–7. https://doi.org/10.1080/00031305.2013.807750.
Good, I. J., and Y. Mittal. 1987. “The Amalgamation and Geometry of TwobyTwo Contingency Tables.” The Annals of Statistics 15 (2): 694–711.
Mittal, Yashaswini. 1991. “Homogeneity of Subpopulations and Simpson’s Paradox.” Journal of the American Statistical Association 86 (413): 167–72. https://doi.org/10.1080/01621459.1991.10475016.
Pearl, Judea. 1993. “[Bayesian Analysis in Expert Systems]: Comment: Graphical Models, Causality and Intervention.” Statistical Science 8 (3): 266–69.
Samuels, Myra L. 1993. “Simpson’s Paradox and Related Phenomena.” Journal of the American Statistical Association 88 (421): 81–88. https://doi.org/10.1080/01621459.1993.10594297.
Simpson, E. H. 1951. “The Interpretation of Interaction in Contingency Tables.” Journal of the Royal Statistical Society: Series B (Methodological) 13 (2): 238–41. https://doi.org/10.1111/j.25176161.1951.tb00088.x.
Sprenger, Jan, and Naftali Weinberger. 2021. “Simpson’s Paradox.” In The Stanford Encyclopedia of Philosophy, edited by Edward N. Zalta, Summer 2021. Metaphysics Research Lab, Stanford University.

In this example, we refer to different levels of observations in a geographic sense. Partitioning the population could also be done across any observable characteristic of individuals such as gender as shown in the examples used by Simpson ( 1951) and Sprenger and Weinberger ( 2021) ↩︎

See Stanford’s Plato entry for a formal definition. ↩︎
 Posted on:
 September 26, 2022
 Length:
 15 minute read, 3101 words
 Categories:
 econometrics aidememoire statistics simulation
 See Also: