California API — stratified survey design

The California Academic Performance Index (API) measures school achievement. The California Department of Education publishes the full population (apipop, 6 194 schools) and several survey samples drawn from it. Here we analyze apistrat, a stratified sample of 200 schools where strata are school type (stype: Elementary / Middle / High) and sampling weights (pw) correct for unequal selection probabilities within each stratum.

This demo shows:

  1. Loading the stratified sample and its design variables.

  2. Fitting a weighted model for 2000 API scores (Stata svy: reg style).

  3. Computing survey-adjusted average marginal effects and standard errors via Taylor linearization.

  4. A design-based bootstrap cross-check.

  5. A policy-relevant contrast: the expected API gap between low- and high-poverty schools.

  6. Validating the survey estimate against the known population truth.

import jax
jax.config.update("jax_enable_x64", True)

import numpy as np
import pandas as pd
import statsmodels.api as sm
import statsmodels.formula.api as smf

from pymargins import Margins, SurveyDesign

# Load the stratified sample generated from R's survey package
apistrat = pd.read_csv("data/apistrat.csv")
print(apistrat.shape)
print(apistrat[["stype", "api00", "api99", "meals", "ell", "avg.ed",
                "mobility", "pw", "fpc"]].head())
(200, 39)
  stype  api00  api99  meals  ell  avg.ed  mobility         pw   fpc
0     E    840    816     33   25    3.32        11  44.209999  4421
1     E    516    476     98   77    1.67        26  44.209999  4421
2     E    531    544     64   23    2.34        17  44.209999  4421
3     E    501    457     83   63    1.86        13  44.209999  4421
4     E    720    659     26   17    3.17        31  44.209999  4421

1. Explore the survey design

print("Schools per stratum:")
print(apistrat["stype"].value_counts().sort_index())
print("\nWeight range: {:.1f}{:.1f}".format(apistrat["pw"].min(),
                                                apistrat["pw"].max()))
print("FPC range:    {:.1f}{:.1f}".format(apistrat["fpc"].min(),
                                                apistrat["fpc"].max()))
Schools per stratum:
stype
E    100
H     50
M     50
Name: count, dtype: int64

Weight range: 15.1 – 44.2
FPC range:    755.0 – 4421.0

The weights differ by roughly a factor of three because elementary, middle, and high schools were sampled at different rates within their strata.

2. Fit a weighted model

We model the 2000 API score as a function of:

  • meals — percent of students eligible for subsidized meals (poverty proxy)

  • ell — percent of English-language learners

  • avg.ed — average parent education (in years)

  • mobility — percent of students who move in/out during the year

Because the sample is stratified with unequal probabilities, we fit with sampling weights so the coefficients generalize to the population:

fit = smf.glm(
    "api00 ~ meals + ell + Q(\"avg.ed\") + mobility",
    data=apistrat,
    freq_weights=apistrat["pw"].values,
).fit()
print(fit.summary().tables[1])
===============================================================================
                  coef    std err          z      P>|z|      [0.025      0.975]
-------------------------------------------------------------------------------
Intercept     532.8152      8.719     61.108      0.000     515.726     549.905
meals          -1.8421      0.058    -31.534      0.000      -1.957      -1.728
ell            -0.0022      0.062     -0.036      0.972      -0.123       0.119
Q("avg.ed")    77.3537      2.285     33.851      0.000      72.875      81.832
mobility        0.1594      0.076      2.110      0.035       0.011       0.307
===============================================================================

Note that, conditional on parent education and poverty, the English-learner share (ell) carries almost no independent signal — its coefficient is small and statistically indistinguishable from zero. Poverty (meals) is the dominant predictor, so that is the marginal effect we focus on below.

3. Declare the survey design and compute AMEs

SurveyDesign receives the weights, the stratum labels, and the finite-population correction. Here the schools are drawn directly from strata without an intermediate cluster stage, so we omit psu. The fpc column holds each stratum’s population size, which tightens the SE to reflect that we sampled a non-trivial fraction of every stratum:

# stype is 'E', 'H', 'M' — encode as integers for SurveyDesign
stratum_codes = apistrat["stype"].astype("category").cat.codes

survey = SurveyDesign(
    weights=apistrat["pw"].values,
    strata=stratum_codes.values,
    fpc=apistrat["fpc"].values,
)

m = Margins(fit, survey_design=survey,
            weights=apistrat["pw"].values, at="overall")
print(m.dydx("meals").summary())
==========================================================
            Margins Result (delta, level=0.95)            
==========================================================
       estimate  std err        z  P>|z|  [95% Conf. Int.]
----------------------------------------------------------
meals   -1.8421   0.4139  -4.4503  0.000  -2.6534, -1.0308
==========================================================

n = 200
κ: 0.000
Delta-vs-sim disagreement: 1.657%

The point estimate is a population-weighted AME and the standard error is survey-adjusted — it accounts for the stratified sampling design via Taylor linearization.

4. Design-based bootstrap cross-check

A stratified bootstrap resamples schools within each stratum. With enough replicates it should give a standard error close to the linearization one:

m_boot = Margins(
    fit,
    survey_design=survey,
    weights=apistrat["pw"].values,
    at="overall",
    method="bootstrap",
    n_boot=500,
    rng_seed=42,
)
print(m_boot.dydx("meals").summary())
============================================================
           Margins Result (bootstrap, level=0.95)           
============================================================
       estimate  std err  statistic  P>|z|  [95% Conf. Int.]
------------------------------------------------------------
meals   -1.8421   0.4409    -1.8421  0.008  -2.1456, -0.4205
============================================================

n = 200
κ: 0.000

5. Side-by-side comparison

results = pd.DataFrame({
    "approach": ["survey linearization", "survey bootstrap"],
    "estimate": [
        float(m.dydx("meals").estimate),
        float(m_boot.dydx("meals").estimate),
    ],
    "std_error": [
        float(m.dydx("meals").std_error),
        float(m_boot.dydx("meals").std_error),
    ],
})
print(results.round(4))
               approach  estimate  std_error
0  survey linearization   -1.8421     0.4139
1      survey bootstrap   -1.8421     0.4409

Both approaches agree: each additional percentage point of students on subsidized meals is associated with roughly 1.8 fewer API points, with a standard error around 0.4.

6. Policy contrast: low- vs high-poverty schools

What is the expected API gap between a low-poverty school (20 % of students on subsidized meals) and a high-poverty school (80 %), holding the other covariates at their (population-weighted) observed values?

from pymargins import pairwise

scen, w = pairwise("meals", [20, 80])
res = m.contrasts(scenarios=scen, contrasts=w)
print(res.summary())
=============================================================
              Margins Result (delta, level=0.95)             
=============================================================
          estimate  std err       z  P>|z|   [95% Conf. Int.]
-------------------------------------------------------------
meals=20  110.5286  24.8362  4.4503  0.000  61.8505, 159.2067
=============================================================

n = 200
κ: 0.000
Delta-vs-sim disagreement: 3.534%

The contrast is a difference in expected API scores — the low-poverty prediction minus the high-poverty one — not a slope. It is averaged over the weighted distribution of the other covariates, and its standard error is propagated through both the regression and the survey design. A 60-point swing in the poverty share maps to roughly a 110-point API gap.

7. Validate against the population truth

Because the full population apipop is published, we can check whether the survey estimate actually recovers the population marginal effect. We fit the same model on all 6 194 schools (no weights needed — this is the population) and compare:

apipop = pd.read_csv("data/apipop.csv")

truth_fit = smf.glm(
    "api00 ~ meals + ell + Q(\"avg.ed\") + mobility",
    data=apipop.dropna(subset=["api00", "meals", "ell", "avg.ed", "mobility"]),
).fit()
m_truth = Margins(truth_fit, at="overall")

ame_survey = m.dydx("meals")
truth_value = float(m_truth.dydx("meals").estimate)
lo, hi = ame_survey.conf_int()

print(f"population truth AME(meals): {truth_value:7.3f}")
print(f"survey estimate:            {float(ame_survey.estimate):7.3f}")
print(f"95% survey CI:              [{float(lo):.3f}, {float(hi):.3f}]")
print(f"truth inside CI:            {float(lo) <= truth_value <= float(hi)}")
population truth AME(meals):  -1.641
survey estimate:             -1.842
95% survey CI:              [-2.653, -1.031]
truth inside CI:            True

The 200-school survey estimate lands close to the population value, and the design-based confidence interval covers the truth — exactly what a correctly specified survey analysis should deliver.

Where to next