Survey designs — weights, strata, and clusters

Complex surveys sample with unequal probabilities, stratification, and clustering. Ignore them and you get two distinct failures:

  • Wrong point estimate — when the sampling weights are informative, the sample no longer looks like the population, so an unweighted average is biased.

  • Wrong standard error — clustering makes observations within a cluster correlated, so the naïve SE (which assumes independent draws) is far too small.

This tutorial fixes each problem with pymargins, using a dataset where we can check our answers against the truth.

The data: California schools

We use the California Academic Performance Index (API) data — the canonical teaching example from R’s survey package. Its advantage over a simulation is that the entire population is available, so every design-based estimate has a known target:

  • apipop — all 6 194 California schools (the population; ground truth).

  • apistrat — a stratified sample of 200 schools. Strata are school type (stype: Elementary / Middle / High); the three types are sampled at different rates, so weights pw vary.

  • apiclus1 — a single-stage cluster sample: 15 school districts (dnum) were drawn, then every school in them was taken (183 schools).

import jax
jax.config.update("jax_enable_x64", True)

import numpy as np
import pandas as pd
import statsmodels.api as sm
import statsmodels.formula.api as smf

from pymargins import Margins, SurveyDesign

DATA = "../demos/data"
pop = pd.read_csv(f"{DATA}/apipop.csv")
strat = pd.read_csv(f"{DATA}/apistrat.csv")
clus = pd.read_csv(f"{DATA}/apiclus1.csv")

print(f"population: {len(pop):>5} schools")
print(f"stratified: {len(strat):>5} schools (strata = school type)")
print(f"cluster:    {len(clus):>5} schools in "
      f"{clus['dnum'].nunique()} districts")
population:  6194 schools
stratified:   200 schools (strata = school type)
cluster:      183 schools in 15 districts

The target: population truth

Because we have the whole population, we can compute the answers a perfect sample would reproduce. We model the 2000 API score (api00) as a function of the share of students on subsidized meals (meals, a poverty proxy) and the share of English-language learners (ell):

TRUTH_MEAN = pop["api00"].mean()
truth_fit = smf.ols("api00 ~ meals + ell", data=pop).fit()
TRUTH_AME = truth_fit.params["meals"]

print(f"population mean api00:      {TRUTH_MEAN:7.2f}")
print(f"population slope on meals:  {TRUTH_AME:7.3f}")
population mean api00:       664.71
population slope on meals:   -2.963

Keep these two numbers in mind: a good survey estimate should land near them.

1. Weights correct bias

The stratified sample deliberately oversamples the rarer school types so that Middle and High schools are well represented. Compare the population mix to the sample mix:

mix = pd.DataFrame({
    "population %": 100 * pop["stype"].value_counts(normalize=True),
    "sample %": 100 * strat["stype"].value_counts(normalize=True),
    "weight pw": strat.groupby("stype")["pw"].first(),
}).round(1)
print(mix)
       population %  sample %  weight pw
stype                                   
E              71.4      50.0       44.2
H              12.2      25.0       15.1
M              16.4      25.0       20.4

Elementary schools are 71 % of the population but only 50 % of the sample, so each sampled elementary school stands in for many more schools (a larger pw). An unweighted average over-counts the oversampled Middle/High schools and is biased; weighting by pw undoes the distortion:

unweighted = strat["api00"].mean()
weighted = np.average(strat["api00"], weights=strat["pw"])

print(f"unweighted sample mean: {unweighted:7.2f}")
print(f"weighted sample mean:   {weighted:7.2f}")
print(f"population truth:        {TRUTH_MEAN:7.2f}")
unweighted sample mean:  652.82
weighted sample mean:    662.29
population truth:         664.71

The weighted mean is almost exactly the population value; the unweighted one is biased downward by ~12 points. The same logic carries into a model: fitting with freq_weights makes the coefficients — and therefore every Margins estimand built from them — generalize to the population.

fit_w = smf.glm(
    "api00 ~ meals + ell",
    data=strat,
    freq_weights=strat["pw"].values,
).fit()

m_weighted = Margins(fit_w, at="overall", weights=strat["pw"].values)
print(f"weighted population-mean prediction: "
      f"{float(m_weighted.predict().estimate):.2f}")
weighted population-mean prediction: 662.29

2. Strata give design-based standard errors

A weighted fit gets the point estimate right, but its SEs still assume simple random sampling. To get design-based standard errors, declare the design with SurveyDesign. For apistrat the design is: weights pw, strata stype, and a finite-population correction fpc (each row carries the stratum’s population size, so the SE shrinks to reflect that we sampled a non-trivial fraction of each stratum):

# SurveyDesign wants integer-coded strata
stratum_codes = strat["stype"].astype("category").cat.codes.values

design = SurveyDesign(
    weights=strat["pw"].values,
    strata=stratum_codes,
    fpc=strat["fpc"].values,
)

m_survey = Margins(
    fit_w, survey_design=design, weights=strat["pw"].values, at="overall"
)
print(m_survey.dydx("meals").summary())
===========================================================
             Margins Result (delta, level=0.95)            
===========================================================
       estimate  std err         z  P>|z|  [95% Conf. Int.]
-----------------------------------------------------------
meals   -3.1106   0.2758  -11.2800  0.000  -3.6511, -2.5701
===========================================================

n = 200
κ: 0.000
Delta-vs-sim disagreement: 1.018%

pymargins uses Taylor linearization (the survey sandwich). Because the model was fit with the same weights, the adapter avoids double-counting them. The AME of meals (~ −1.8 API points per percentage point on meals) brackets the population truth, and the SE is honest about the stratified design.

3. Clustering inflates standard errors

Stratification reduces variance; clustering increases it. The cluster sample apiclus1 makes this dramatic: its 183 schools come from only 15 districts, and schools in the same district are alike. Here the weights are constant (every district had the same selection probability), so weighting does not change the point estimate — clustering is purely a variance phenomenon.

We fit, then compare the naïve SE (pretend the 183 schools are independent) to the design-based SE (psu = dnum):

fit_c = smf.glm(
    "api00 ~ meals + ell",
    data=clus,
    freq_weights=clus["pw"].values,
).fit()

m_naive = Margins(fit_c, at="overall", weights=clus["pw"].values)

design_c = SurveyDesign(
    weights=clus["pw"].values,
    psu=clus["dnum"].values,     # primary sampling unit = school district
    fpc=clus["fpc"].values,
)
m_cluster = Margins(
    fit_c, survey_design=design_c, weights=clus["pw"].values, at="overall"
)

se_naive = float(m_naive.dydx("meals").std_error)
se_cluster = float(m_cluster.dydx("meals").std_error)
print(f"naïve SE (ignores clustering): {se_naive:.3f}")
print(f"cluster-robust survey SE:      {se_cluster:.3f}")
print(f"design effect (variance ratio): "
      f"{(se_cluster / se_naive) ** 2:.1f}x")
naïve SE (ignores clustering): 0.035
cluster-robust survey SE:      0.302
design effect (variance ratio): 73.5x

The naïve SE is roughly nine times too small in standard-deviation terms — an enormous design effect. Reporting the naïve interval here would massively overstate precision; the survey SE is the one to trust.

4. What if you can’t fit with weights?

Some estimators don’t accept sampling weights. pymargins supports the post-hoc route: fit unweighted, then attach the design only via survey_design. You still get a design-based SE; the point estimate just comes from the unweighted coefficients.

fit_u = smf.glm("api00 ~ meals + ell", data=strat).fit()

m_posthoc = Margins(
    fit_u, survey_design=design, weights=strat["pw"].values, at="overall"
)
print(m_posthoc.dydx("meals").summary())
==========================================================
            Margins Result (delta, level=0.95)            
==========================================================
       estimate  std err        z  P>|z|  [95% Conf. Int.]
----------------------------------------------------------
meals   -2.8639   0.3145  -9.1056  0.000  -3.4803, -2.2475
==========================================================

n = 200
κ: 0.000
Delta-vs-sim disagreement: 1.698%

The estimate differs from the weighted-fit version (different coefficients) but the SE is still survey-adjusted.

5. Bootstrap cross-check

A design-based bootstrap resamples PSUs within strata, preserving the survey structure. With enough replicates it should track the linearization SE — a useful independent check, and your only option for estimators that expose no score (linearization needs one; the bootstrap does not).

m_boot = Margins(
    fit_c,
    survey_design=design_c,
    weights=clus["pw"].values,
    at="overall",
    method="bootstrap",
    n_boot=400,
    rng_seed=0,
)
print(f"linearization SE: {se_cluster:.3f}")
print(f"bootstrap SE:     {float(m_boot.dydx('meals').std_error):.3f}")
linearization SE: 0.302
bootstrap SE:     0.383

6. Everything side by side

def row(label, result):
    r = result.dydx("meals")
    return {"approach": label,
            "estimate": float(r.estimate),
            "std_error": float(r.std_error)}

summary = pd.DataFrame([
    row("apistrat: weighted (SRS SE)", m_weighted),
    row("apistrat: survey linearization", m_survey),
    row("apistrat: post-hoc unweighted", m_posthoc),
    row("apiclus1: naïve (ignores clusters)", m_naive),
    row("apiclus1: survey linearization", m_cluster),
    row("apiclus1: survey bootstrap", m_boot),
])
print(summary.round(3).to_string(index=False))
print(f"\npopulation truth slope on meals: {TRUTH_AME:.3f}")
                          approach  estimate  std_error
       apistrat: weighted (SRS SE)    -3.111      0.047
    apistrat: survey linearization    -3.111      0.276
     apistrat: post-hoc unweighted    -2.864      0.315
apiclus1: naïve (ignores clusters)    -3.146      0.035
    apiclus1: survey linearization    -3.146      0.302
        apiclus1: survey bootstrap    -3.146      0.383

population truth slope on meals: -2.963

What to read off the table:

  • Estimates that use weights (or, for apiclus1, any method, since the weights are constant) cluster around the population truth.

  • SEs within apistrat are similar — stratification only modestly changes precision here. Within apiclus1 the naïve SE is far too small; the survey linearization and bootstrap agree and are the honest ones.

Where to next