---
jupytext:
  text_representation:
    extension: .md
    format_name: myst
    format_version: 0.13
kernelspec:
  display_name: Python 3
  language: python
  name: python3
---

# Generalised Estimating Equations (GEE)

`statsmodels.GEE` handles correlated outcomes through a working
correlation structure and sandwich covariance.  `pymargins` consumes the
GEE fit directly — the robust (sandwich) covariance is the default, and
the session API is identical to GLM.

```{code-cell} python
import numpy as np
import pandas as pd
import statsmodels.api as sm
import statsmodels.formula.api as smf

from pymargins import Margins

rng = np.random.default_rng(42)
n = 600
df = pd.DataFrame({
    "age": rng.integers(20, 75, n),
    "female": rng.binomial(1, 0.52, n),
    "treated": rng.binomial(1, 0.40, n),
    "site": rng.choice(["A", "B", "C", "D"], n),
    "subject": np.repeat(np.arange(60), 10),
})
lp = -1.5 + 0.04 * df["age"] - 0.3 * df["female"] + 0.8 * df["treated"]
df["y"] = rng.binomial(1, 1 / (1 + np.exp(-lp)))

fit = smf.gee(
    "y ~ age + female + treated + C(site)",
    groups="subject",
    data=df,
    family=sm.families.Binomial(),
    cov_struct=sm.cov_struct.Independence(),
).fit()
```

## Average marginal effect of `age`

The default covariance is the GEE sandwich (robust) estimator:

```{code-cell} python
m = Margins.linear_scale(fit, at="overall")
print(m.dydx("age").summary())
```

## Predicted probabilities by treatment arm

```{code-cell} python
print(m.predict(atexog={"treated": [0, 1]}).summary())
```

## Risk difference for treatment

```{code-cell} python
from pymargins import pairwise

scen, w = pairwise("treated", [1, 0])
print(m.contrasts(scenarios=scen, contrasts=w).summary())
```

## Plot: predicted probability curve over age

```{code-cell} python
import matplotlib.pyplot as plt

ages = list(range(20, 76, 2))
res = m.predict(atexog={"age": ages, "treated": [0, 1]})
df_plot = res.to_frame()

fig, ax = plt.subplots(figsize=(6, 4))
for level, sub in df_plot.groupby("treated"):
    label = "Treated" if level == 1 else "Control"
    ax.plot(sub["age"], sub["estimate"], label=label)
    ax.fill_between(
        sub["age"], sub["ci_lower"], sub["ci_upper"], alpha=0.15
    )
ax.set(xlabel="Age", ylabel="P(y=1)")
ax.legend(title="Treatment")
```

## Plot: forest plot of site contrasts

```{code-cell} python
from pymargins import reference

scen, W = reference("site", ["A", "B", "C", "D"], ref_level="A")
res = m.contrasts(scenarios=scen, contrasts=W)
df_forest = res.to_frame()

fig, ax = plt.subplots(figsize=(4, 3))
y = range(len(df_forest))
ax.errorbar(
    df_forest["estimate"], y,
    xerr=[df_forest["estimate"] - df_forest["ci_lower"],
          df_forest["ci_upper"] - df_forest["estimate"]],
    fmt="o", capsize=3,
)
ax.axvline(0, color="grey", lw=0.5)
ax.set_yticks(list(y))
ax.set_yticklabels(df_forest["label"])
ax.set_xlabel("Risk difference vs Site A")
ax.invert_yaxis()
```

## Alternative vcov flavours

GEE fits carry three built-in covariances.  You can switch at session
construction:

```{code-cell} python
# Naive (model-based) covariance
m_naive = Margins.linear_scale(fit, vcov="naive", at="overall")
print(m_naive.dydx("age").summary())

# Bias-corrected robust sandwich (if available)
try:
    m_bc = Margins.linear_scale(fit, vcov="robust_bc", at="overall")
    print(m_bc.dydx("age").summary())
except ValueError as exc:
    print("robust_bc not available:", exc)
```

## Formula vs array interface

If you fit with `sm.GEE(y, X, groups=...)` instead of `smf.gee(...)`,
provide the training data explicitly so the adapter can reconstruct the
design matrix:

```python
from pymargins.adapters import StatsmodelsGEEAdapter

adapter = StatsmodelsGEEAdapter(fit_array, training_data=df)
m = Margins.linear_scale(fit_array, adapter=adapter, at="overall")
```