--- jupytext: text_representation: extension: .md format_name: myst format_version: 0.13 kernelspec: display_name: Python 3 language: python name: python3 --- # Survey designs — weights, strata, and clusters Complex surveys sample with **unequal probabilities**, **stratification**, and **clustering**. Ignore them and you get two distinct failures: - *Wrong point estimate* — when the sampling weights are informative, the sample no longer looks like the population, so an unweighted average is biased. - *Wrong standard error* — clustering makes observations within a cluster correlated, so the naïve SE (which assumes independent draws) is far too small. This tutorial fixes each problem with `pymargins`, using a dataset where we can **check our answers against the truth**. ## The data: California schools We use the California Academic Performance Index (API) data — the canonical teaching example from R's `survey` package. Its advantage over a simulation is that the *entire population* is available, so every design-based estimate has a known target: - **`apipop`** — all 6 194 California schools (the population; ground truth). - **`apistrat`** — a *stratified* sample of 200 schools. Strata are school type (`stype`: Elementary / Middle / High); the three types are sampled at different rates, so weights `pw` vary. - **`apiclus1`** — a single-stage *cluster* sample: 15 school **districts** (`dnum`) were drawn, then every school in them was taken (183 schools). ```{code-cell} python import jax jax.config.update("jax_enable_x64", True) import numpy as np import pandas as pd import statsmodels.api as sm import statsmodels.formula.api as smf from pymargins import Margins, SurveyDesign DATA = "../demos/data" pop = pd.read_csv(f"{DATA}/apipop.csv") strat = pd.read_csv(f"{DATA}/apistrat.csv") clus = pd.read_csv(f"{DATA}/apiclus1.csv") print(f"population: {len(pop):>5} schools") print(f"stratified: {len(strat):>5} schools (strata = school type)") print(f"cluster: {len(clus):>5} schools in " f"{clus['dnum'].nunique()} districts") ``` ## The target: population truth Because we have the whole population, we can compute the answers a perfect sample would reproduce. We model the 2000 API score (`api00`) as a function of the share of students on subsidized meals (`meals`, a poverty proxy) and the share of English-language learners (`ell`): ```{code-cell} python TRUTH_MEAN = pop["api00"].mean() truth_fit = smf.ols("api00 ~ meals + ell", data=pop).fit() TRUTH_AME = truth_fit.params["meals"] print(f"population mean api00: {TRUTH_MEAN:7.2f}") print(f"population slope on meals: {TRUTH_AME:7.3f}") ``` Keep these two numbers in mind: a good survey estimate should land near them. ## 1. Weights correct bias The stratified sample deliberately **oversamples** the rarer school types so that Middle and High schools are well represented. Compare the population mix to the sample mix: ```{code-cell} python mix = pd.DataFrame({ "population %": 100 * pop["stype"].value_counts(normalize=True), "sample %": 100 * strat["stype"].value_counts(normalize=True), "weight pw": strat.groupby("stype")["pw"].first(), }).round(1) print(mix) ``` Elementary schools are 71 % of the population but only 50 % of the sample, so each sampled elementary school stands in for many more schools (a larger `pw`). An unweighted average over-counts the oversampled Middle/High schools and is biased; weighting by `pw` undoes the distortion: ```{code-cell} python unweighted = strat["api00"].mean() weighted = np.average(strat["api00"], weights=strat["pw"]) print(f"unweighted sample mean: {unweighted:7.2f}") print(f"weighted sample mean: {weighted:7.2f}") print(f"population truth: {TRUTH_MEAN:7.2f}") ``` The weighted mean is almost exactly the population value; the unweighted one is biased downward by ~12 points. The same logic carries into a model: fitting with `freq_weights` makes the coefficients — and therefore every `Margins` estimand built from them — generalize to the population. ```{code-cell} python fit_w = smf.glm( "api00 ~ meals + ell", data=strat, freq_weights=strat["pw"].values, ).fit() m_weighted = Margins(fit_w, at="overall", weights=strat["pw"].values) print(f"weighted population-mean prediction: " f"{float(m_weighted.predict().estimate):.2f}") ``` ## 2. Strata give design-based standard errors A weighted fit gets the *point estimate* right, but its SEs still assume simple random sampling. To get **design-based** standard errors, declare the design with `SurveyDesign`. For `apistrat` the design is: weights `pw`, strata `stype`, and a finite-population correction `fpc` (each row carries the stratum's population size, so the SE shrinks to reflect that we sampled a non-trivial fraction of each stratum): ```{code-cell} python # SurveyDesign wants integer-coded strata stratum_codes = strat["stype"].astype("category").cat.codes.values design = SurveyDesign( weights=strat["pw"].values, strata=stratum_codes, fpc=strat["fpc"].values, ) m_survey = Margins( fit_w, survey_design=design, weights=strat["pw"].values, at="overall" ) print(m_survey.dydx("meals").summary()) ``` `pymargins` uses Taylor linearization (the survey sandwich). Because the model was fit *with* the same weights, the adapter avoids double-counting them. The AME of `meals` (~ −1.8 API points per percentage point on meals) brackets the population truth, and the SE is honest about the stratified design. ## 3. Clustering inflates standard errors Stratification *reduces* variance; clustering *increases* it. The cluster sample `apiclus1` makes this dramatic: its 183 schools come from only **15 districts**, and schools in the same district are alike. Here the weights are constant (every district had the same selection probability), so weighting does *not* change the point estimate — clustering is purely a variance phenomenon. We fit, then compare the naïve SE (pretend the 183 schools are independent) to the design-based SE (`psu = dnum`): ```{code-cell} python fit_c = smf.glm( "api00 ~ meals + ell", data=clus, freq_weights=clus["pw"].values, ).fit() m_naive = Margins(fit_c, at="overall", weights=clus["pw"].values) design_c = SurveyDesign( weights=clus["pw"].values, psu=clus["dnum"].values, # primary sampling unit = school district fpc=clus["fpc"].values, ) m_cluster = Margins( fit_c, survey_design=design_c, weights=clus["pw"].values, at="overall" ) se_naive = float(m_naive.dydx("meals").std_error) se_cluster = float(m_cluster.dydx("meals").std_error) print(f"naïve SE (ignores clustering): {se_naive:.3f}") print(f"cluster-robust survey SE: {se_cluster:.3f}") print(f"design effect (variance ratio): " f"{(se_cluster / se_naive) ** 2:.1f}x") ``` The naïve SE is roughly **nine times too small** in standard-deviation terms — an enormous design effect. Reporting the naïve interval here would massively overstate precision; the survey SE is the one to trust. ## 4. What if you can't fit with weights? Some estimators don't accept sampling weights. `pymargins` supports the post-hoc route: fit **unweighted**, then attach the design only via `survey_design`. You still get a design-based SE; the point estimate just comes from the unweighted coefficients. ```{code-cell} python fit_u = smf.glm("api00 ~ meals + ell", data=strat).fit() m_posthoc = Margins( fit_u, survey_design=design, weights=strat["pw"].values, at="overall" ) print(m_posthoc.dydx("meals").summary()) ``` The estimate differs from the weighted-fit version (different coefficients) but the SE is still survey-adjusted. ## 5. Bootstrap cross-check A design-based bootstrap resamples PSUs **within** strata, preserving the survey structure. With enough replicates it should track the linearization SE — a useful independent check, and your only option for estimators that expose no score (linearization needs one; the bootstrap does not). ```{code-cell} python m_boot = Margins( fit_c, survey_design=design_c, weights=clus["pw"].values, at="overall", method="bootstrap", n_boot=400, rng_seed=0, ) print(f"linearization SE: {se_cluster:.3f}") print(f"bootstrap SE: {float(m_boot.dydx('meals').std_error):.3f}") ``` ## 6. Everything side by side ```{code-cell} python def row(label, result): r = result.dydx("meals") return {"approach": label, "estimate": float(r.estimate), "std_error": float(r.std_error)} summary = pd.DataFrame([ row("apistrat: weighted (SRS SE)", m_weighted), row("apistrat: survey linearization", m_survey), row("apistrat: post-hoc unweighted", m_posthoc), row("apiclus1: naïve (ignores clusters)", m_naive), row("apiclus1: survey linearization", m_cluster), row("apiclus1: survey bootstrap", m_boot), ]) print(summary.round(3).to_string(index=False)) print(f"\npopulation truth slope on meals: {TRUTH_AME:.3f}") ``` What to read off the table: - **Estimates** that use weights (or, for `apiclus1`, any method, since the weights are constant) cluster around the population truth. - **SEs** within `apistrat` are similar — stratification only modestly changes precision here. Within `apiclus1` the naïve SE is far too small; the survey linearization and bootstrap agree and are the honest ones. ## Where to next - [](../demos/california_api_survey.md) — a full end-to-end analysis on the stratified sample, including a policy-relevant contrast. - [](../howto/robust_clustered_ses.md) — robust and clustered standard errors for non-survey settings.