---
jupytext:
  text_representation:
    extension: .md
    format_name: myst
    format_version: 0.13
kernelspec:
  display_name: Python 3
  language: python
  name: python3
---

# Spector & Mazzeo (1980) — Did PSI work?

Thirty-two economics students. Three covariates: incoming GPA, a
diagnostic test score (`TUCE`), and a binary indicator `PSI` for
whether the student was taught via Personalized System of
Instruction. Outcome `GRADE` is 1 if the student's grade improved.

The dataset is tiny — small enough that the *uncertainty* dominates
the analysis. That makes it perfect for demonstrating how
`pymargins` lets you cross-check delta-method intervals against a
simulation or bootstrap with one keyword change, and how the κ
diagnostic flags when those methods disagree.

```{code-cell} python
import numpy as np
import pandas as pd
import statsmodels.api as sm
import statsmodels.formula.api as smf

from pymargins import Margins, pairwise

data = sm.datasets.spector.load_pandas()
df = data.exog.copy()
df["GRADE"] = data.endog.astype(int)
print(df.describe().round(2))
```

## 1. Fit the logit

```{code-cell} python
fit = smf.glm(
    "GRADE ~ GPA + TUCE + PSI",
    data=df,
    family=sm.families.Binomial(),
).fit()
print(fit.summary().tables[1])
```

## 2. The treatment effect three ways

The substantive question is "what does PSI do to the probability of
grade improvement, holding GPA and TUCE fixed?" That's a contrast
between PSI=1 and PSI=0:

```{code-cell} python
scen, w = pairwise("PSI", [1, 0])

m_delta = Margins.linear_scale(fit, vcov="HC1", at="overall")
print("Delta method:")
print(m_delta.contrasts(scenarios=scen, contrasts=w).summary())
```

With only 32 observations the delta-method CI is wide. A Krinsky–Robb
simulation is the cleanest cross-check that does not require refits:

```{code-cell} python
m_sim = Margins.linear_scale(
    fit, vcov="HC1", at="overall",
    method="simulation", n_sim=2000, rng_seed=0,
)
print("Krinsky–Robb simulation:")
print(m_sim.contrasts(scenarios=scen, contrasts=w).summary())
```

A paired bootstrap is the more general non-parametric check, but
with n=32 and three covariates the resampled datasets are frequently
perfectly separable — the logit refit fails to converge and the
bootstrap is not the right tool for this size of problem. The κ
diagnostic gives the analytical reason the delta and simulation
intervals agree:

```{code-cell} python
print(m_delta.diagnose().summary())
```

A small κ confirms the response surface is essentially linear in the
neighborhood being asked about, so the delta method is the cheapest
correct answer. A large κ would point you straight at the simulation
column.

## 3. Heterogeneous treatment effect — does PSI help everyone equally?

Compute the PSI vs no-PSI contrast at three representative GPA
levels. This is `at` doing real work — each row of the result is the
treatment effect at a fixed GPA, with the other covariate (`TUCE`)
held at its sample average:

```{code-cell} python
gpas = [2.5, 3.0, 3.5]
k = len(gpas)
scenarios = [
    {"atexog": {"PSI": 1, "GPA": g}, "label": f"PSI=1, GPA={g}"} for g in gpas
] + [
    {"atexog": {"PSI": 0, "GPA": g}, "label": f"PSI=0, GPA={g}"} for g in gpas
]
W = np.zeros((k, 2 * k))
for i in range(k):
    W[i, i] = +1.0
    W[i, k + i] = -1.0
het = m_delta.contrasts(scenarios=scenarios, contrasts=W)
print(het.summary())
```

If the joint effect is statistically distinguishable from zero but
the individual GPA-stratified effects are not, that's a signal that
the dataset has just enough power for an *average* claim and not
enough for a *subgroup* claim. With n=32 this happens a lot.

## 4. Predicted probability curves

```{code-cell} python
import matplotlib.pyplot as plt

gpa_grid = np.linspace(df["GPA"].min(), df["GPA"].max(), 40)
res = m_delta.predict(atexog={"GPA": gpa_grid, "PSI": [0, 1]})
df_plot = res.to_frame()

fig, ax = plt.subplots(figsize=(6, 4))
for psi, sub in df_plot.groupby("PSI"):
    label = "PSI" if psi == 1 else "Control"
    ax.plot(sub["GPA"], sub["estimate"], label=label)
    ax.fill_between(
        sub["GPA"], sub["ci_lower"], sub["ci_upper"], alpha=0.2
    )
ax.set(xlabel="Entering GPA", ylabel="P(grade improvement)")
ax.legend()
```

The plot makes the small-sample story honest: the PSI curve sits
above the control curve at every GPA, but the CIs overlap heavily,
which the contrast table above already told you in scalar form.

## Where to next

- [](../tutorials/inference_methods.md) — delta vs simulation vs
  bootstrap, side by side, on a single estimand.
- [](../howto/bootstrap.md) — bootstrap recipes.
- [](../explanations/kappa_diagnostic.md) — what κ measures and when
  it triggers a fallback.