Survey designs — weights, strata, and clusters¶
Complex surveys sample with unequal probabilities, stratification, and clustering. Ignore them and you get two distinct failures:
Wrong point estimate — when the sampling weights are informative, the sample no longer looks like the population, so an unweighted average is biased.
Wrong standard error — clustering makes observations within a cluster correlated, so the naïve SE (which assumes independent draws) is far too small.
This tutorial fixes each problem with pymargins, using a dataset where we can
check our answers against the truth.
The data: California schools¶
We use the California Academic Performance Index (API) data — the canonical
teaching example from R’s survey package. Its advantage over a simulation is
that the entire population is available, so every design-based estimate has a
known target:
apipop— all 6 194 California schools (the population; ground truth).apistrat— a stratified sample of 200 schools. Strata are school type (stype: Elementary / Middle / High); the three types are sampled at different rates, so weightspwvary.apiclus1— a single-stage cluster sample: 15 school districts (dnum) were drawn, then every school in them was taken (183 schools).
import jax
jax.config.update("jax_enable_x64", True)
import numpy as np
import pandas as pd
import statsmodels.api as sm
import statsmodels.formula.api as smf
from pymargins import Margins, SurveyDesign
DATA = "../demos/data"
pop = pd.read_csv(f"{DATA}/apipop.csv")
strat = pd.read_csv(f"{DATA}/apistrat.csv")
clus = pd.read_csv(f"{DATA}/apiclus1.csv")
print(f"population: {len(pop):>5} schools")
print(f"stratified: {len(strat):>5} schools (strata = school type)")
print(f"cluster: {len(clus):>5} schools in "
f"{clus['dnum'].nunique()} districts")
population: 6194 schools
stratified: 200 schools (strata = school type)
cluster: 183 schools in 15 districts
The target: population truth¶
Because we have the whole population, we can compute the answers a perfect
sample would reproduce. We model the 2000 API score (api00) as a function of
the share of students on subsidized meals (meals, a poverty proxy) and the
share of English-language learners (ell):
TRUTH_MEAN = pop["api00"].mean()
truth_fit = smf.ols("api00 ~ meals + ell", data=pop).fit()
TRUTH_AME = truth_fit.params["meals"]
print(f"population mean api00: {TRUTH_MEAN:7.2f}")
print(f"population slope on meals: {TRUTH_AME:7.3f}")
population mean api00: 664.71
population slope on meals: -2.963
Keep these two numbers in mind: a good survey estimate should land near them.
1. Weights correct bias¶
The stratified sample deliberately oversamples the rarer school types so that Middle and High schools are well represented. Compare the population mix to the sample mix:
mix = pd.DataFrame({
"population %": 100 * pop["stype"].value_counts(normalize=True),
"sample %": 100 * strat["stype"].value_counts(normalize=True),
"weight pw": strat.groupby("stype")["pw"].first(),
}).round(1)
print(mix)
population % sample % weight pw
stype
E 71.4 50.0 44.2
H 12.2 25.0 15.1
M 16.4 25.0 20.4
Elementary schools are 71 % of the population but only 50 % of the sample, so
each sampled elementary school stands in for many more schools (a larger
pw). An unweighted average over-counts the oversampled Middle/High schools
and is biased; weighting by pw undoes the distortion:
unweighted = strat["api00"].mean()
weighted = np.average(strat["api00"], weights=strat["pw"])
print(f"unweighted sample mean: {unweighted:7.2f}")
print(f"weighted sample mean: {weighted:7.2f}")
print(f"population truth: {TRUTH_MEAN:7.2f}")
unweighted sample mean: 652.82
weighted sample mean: 662.29
population truth: 664.71
The weighted mean is almost exactly the population value; the unweighted one is
biased downward by ~12 points. The same logic carries into a model: fitting
with freq_weights makes the coefficients — and therefore every Margins
estimand built from them — generalize to the population.
fit_w = smf.glm(
"api00 ~ meals + ell",
data=strat,
freq_weights=strat["pw"].values,
).fit()
m_weighted = Margins(fit_w, at="overall", weights=strat["pw"].values)
print(f"weighted population-mean prediction: "
f"{float(m_weighted.predict().estimate):.2f}")
weighted population-mean prediction: 662.29
2. Strata give design-based standard errors¶
A weighted fit gets the point estimate right, but its SEs still assume simple
random sampling. To get design-based standard errors, declare the design
with SurveyDesign. For apistrat the design is: weights pw, strata
stype, and a finite-population correction fpc (each row carries the
stratum’s population size, so the SE shrinks to reflect that we sampled a
non-trivial fraction of each stratum):
# SurveyDesign wants integer-coded strata
stratum_codes = strat["stype"].astype("category").cat.codes.values
design = SurveyDesign(
weights=strat["pw"].values,
strata=stratum_codes,
fpc=strat["fpc"].values,
)
m_survey = Margins(
fit_w, survey_design=design, weights=strat["pw"].values, at="overall"
)
print(m_survey.dydx("meals").summary())
===========================================================
Margins Result (delta, level=0.95)
===========================================================
estimate std err z P>|z| [95% Conf. Int.]
-----------------------------------------------------------
meals -3.1106 0.2758 -11.2800 0.000 -3.6511, -2.5701
===========================================================
n = 200
κ: 0.000
Delta-vs-sim disagreement: 1.018%
pymargins uses Taylor linearization (the survey sandwich). Because the model
was fit with the same weights, the adapter avoids double-counting them. The
AME of meals (~ −1.8 API points per percentage point on meals) brackets the
population truth, and the SE is honest about the stratified design.
3. Clustering inflates standard errors¶
Stratification reduces variance; clustering increases it. The cluster
sample apiclus1 makes this dramatic: its 183 schools come from only 15
districts, and schools in the same district are alike. Here the weights are
constant (every district had the same selection probability), so weighting does
not change the point estimate — clustering is purely a variance phenomenon.
We fit, then compare the naïve SE (pretend the 183 schools are independent) to
the design-based SE (psu = dnum):
fit_c = smf.glm(
"api00 ~ meals + ell",
data=clus,
freq_weights=clus["pw"].values,
).fit()
m_naive = Margins(fit_c, at="overall", weights=clus["pw"].values)
design_c = SurveyDesign(
weights=clus["pw"].values,
psu=clus["dnum"].values, # primary sampling unit = school district
fpc=clus["fpc"].values,
)
m_cluster = Margins(
fit_c, survey_design=design_c, weights=clus["pw"].values, at="overall"
)
se_naive = float(m_naive.dydx("meals").std_error)
se_cluster = float(m_cluster.dydx("meals").std_error)
print(f"naïve SE (ignores clustering): {se_naive:.3f}")
print(f"cluster-robust survey SE: {se_cluster:.3f}")
print(f"design effect (variance ratio): "
f"{(se_cluster / se_naive) ** 2:.1f}x")
naïve SE (ignores clustering): 0.035
cluster-robust survey SE: 0.302
design effect (variance ratio): 73.5x
The naïve SE is roughly nine times too small in standard-deviation terms — an enormous design effect. Reporting the naïve interval here would massively overstate precision; the survey SE is the one to trust.
4. What if you can’t fit with weights?¶
Some estimators don’t accept sampling weights. pymargins supports the
post-hoc route: fit unweighted, then attach the design only via
survey_design. You still get a design-based SE; the point estimate just comes
from the unweighted coefficients.
fit_u = smf.glm("api00 ~ meals + ell", data=strat).fit()
m_posthoc = Margins(
fit_u, survey_design=design, weights=strat["pw"].values, at="overall"
)
print(m_posthoc.dydx("meals").summary())
==========================================================
Margins Result (delta, level=0.95)
==========================================================
estimate std err z P>|z| [95% Conf. Int.]
----------------------------------------------------------
meals -2.8639 0.3145 -9.1056 0.000 -3.4803, -2.2475
==========================================================
n = 200
κ: 0.000
Delta-vs-sim disagreement: 1.698%
The estimate differs from the weighted-fit version (different coefficients) but the SE is still survey-adjusted.
5. Bootstrap cross-check¶
A design-based bootstrap resamples PSUs within strata, preserving the survey structure. With enough replicates it should track the linearization SE — a useful independent check, and your only option for estimators that expose no score (linearization needs one; the bootstrap does not).
m_boot = Margins(
fit_c,
survey_design=design_c,
weights=clus["pw"].values,
at="overall",
method="bootstrap",
n_boot=400,
rng_seed=0,
)
print(f"linearization SE: {se_cluster:.3f}")
print(f"bootstrap SE: {float(m_boot.dydx('meals').std_error):.3f}")
linearization SE: 0.302
bootstrap SE: 0.383
6. Everything side by side¶
def row(label, result):
r = result.dydx("meals")
return {"approach": label,
"estimate": float(r.estimate),
"std_error": float(r.std_error)}
summary = pd.DataFrame([
row("apistrat: weighted (SRS SE)", m_weighted),
row("apistrat: survey linearization", m_survey),
row("apistrat: post-hoc unweighted", m_posthoc),
row("apiclus1: naïve (ignores clusters)", m_naive),
row("apiclus1: survey linearization", m_cluster),
row("apiclus1: survey bootstrap", m_boot),
])
print(summary.round(3).to_string(index=False))
print(f"\npopulation truth slope on meals: {TRUTH_AME:.3f}")
approach estimate std_error
apistrat: weighted (SRS SE) -3.111 0.047
apistrat: survey linearization -3.111 0.276
apistrat: post-hoc unweighted -2.864 0.315
apiclus1: naïve (ignores clusters) -3.146 0.035
apiclus1: survey linearization -3.146 0.302
apiclus1: survey bootstrap -3.146 0.383
population truth slope on meals: -2.963
What to read off the table:
Estimates that use weights (or, for
apiclus1, any method, since the weights are constant) cluster around the population truth.SEs within
apistratare similar — stratification only modestly changes precision here. Withinapiclus1the naïve SE is far too small; the survey linearization and bootstrap agree and are the honest ones.
Where to next¶
California API — stratified survey design — a full end-to-end analysis on the stratified sample, including a policy-relevant contrast.
Robust and clustered standard errors — robust and clustered standard errors for non-survey settings.