Generalised Estimating Equations (GEE)

statsmodels.GEE handles correlated outcomes through a working correlation structure and sandwich covariance. pymargins consumes the GEE fit directly — the robust (sandwich) covariance is the default, and the session API is identical to GLM.

import numpy as np
import pandas as pd
import statsmodels.api as sm
import statsmodels.formula.api as smf

from pymargins import Margins

rng = np.random.default_rng(42)
n = 600
df = pd.DataFrame({
    "age": rng.integers(20, 75, n),
    "female": rng.binomial(1, 0.52, n),
    "treated": rng.binomial(1, 0.40, n),
    "site": rng.choice(["A", "B", "C", "D"], n),
    "subject": np.repeat(np.arange(60), 10),
})
lp = -1.5 + 0.04 * df["age"] - 0.3 * df["female"] + 0.8 * df["treated"]
df["y"] = rng.binomial(1, 1 / (1 + np.exp(-lp)))

fit = smf.gee(
    "y ~ age + female + treated + C(site)",
    groups="subject",
    data=df,
    family=sm.families.Binomial(),
    cov_struct=sm.cov_struct.Independence(),
).fit()

Average marginal effect of age

The default covariance is the GEE sandwich (robust) estimator:

m = Margins.linear_scale(fit, at="overall")
print(m.dydx("age").summary())
=======================================================
           Margins Result (delta, level=0.95)          
=======================================================
     estimate  std err       z  P>|z|  [95% Conf. Int.]
-------------------------------------------------------
age    0.0095   0.0016  6.0065  0.000    0.0064, 0.0126
=======================================================

n = 600
κ: 0.073
Delta-vs-sim disagreement: 12.145%

Predicted probabilities by treatment arm

print(m.predict(atexog={"treated": [0, 1]}).summary())
==============================================================
              Margins Result (delta, level=0.95)              
==============================================================
           estimate  std err        z  P>|z|  [95% Conf. Int.]
--------------------------------------------------------------
treated=0    0.5450   0.0201  27.1783  0.000    0.5057, 0.5843
treated=1    0.7111   0.0295  24.1027  0.000    0.6533, 0.7689
==============================================================

n = 600
κ: max=0.062
Delta-vs-sim disagreement: 0.844%

Risk difference for treatment

from pymargins import pairwise

scen, w = pairwise("treated", [1, 0])
print(m.contrasts(scenarios=scen, contrasts=w).summary())
=============================================================
              Margins Result (delta, level=0.95)             
=============================================================
           estimate  std err       z  P>|z|  [95% Conf. Int.]
-------------------------------------------------------------
treated=1    0.1661   0.0374  4.4473  0.000    0.0929, 0.2393
=============================================================

n = 600
κ: 0.053
Delta-vs-sim disagreement: 2.091%

Plot: predicted probability curve over age

import matplotlib.pyplot as plt

ages = list(range(20, 76, 2))
res = m.predict(atexog={"age": ages, "treated": [0, 1]})
df_plot = res.to_frame()

fig, ax = plt.subplots(figsize=(6, 4))
for level, sub in df_plot.groupby("treated"):
    label = "Treated" if level == 1 else "Control"
    ax.plot(sub["age"], sub["estimate"], label=label)
    ax.fill_between(
        sub["age"], sub["ci_lower"], sub["ci_upper"], alpha=0.15
    )
ax.set(xlabel="Age", ylabel="P(y=1)")
ax.legend(title="Treatment")
<matplotlib.legend.Legend at 0x7f678c8dc500>
../_images/1d6be01ebf06f66bbb99da5730c3aa7382867726f20e8c10a63cdd5eea2efd29.png

Plot: forest plot of site contrasts

from pymargins import reference

scen, W = reference("site", ["A", "B", "C", "D"], ref_level="A")
res = m.contrasts(scenarios=scen, contrasts=W)
df_forest = res.to_frame()

fig, ax = plt.subplots(figsize=(4, 3))
y = range(len(df_forest))
ax.errorbar(
    df_forest["estimate"], y,
    xerr=[df_forest["estimate"] - df_forest["ci_lower"],
          df_forest["ci_upper"] - df_forest["estimate"]],
    fmt="o", capsize=3,
)
ax.axvline(0, color="grey", lw=0.5)
ax.set_yticks(list(y))
ax.set_yticklabels(df_forest["label"])
ax.set_xlabel("Risk difference vs Site A")
ax.invert_yaxis()
../_images/23f557c57b728e2d9e901f66c1ddeffbebc60a1af6022d4546637fea4b77c49a.png

Alternative vcov flavours

GEE fits carry three built-in covariances. You can switch at session construction:

# Naive (model-based) covariance
m_naive = Margins.linear_scale(fit, vcov="naive", at="overall")
print(m_naive.dydx("age").summary())

# Bias-corrected robust sandwich (if available)
try:
    m_bc = Margins.linear_scale(fit, vcov="robust_bc", at="overall")
    print(m_bc.dydx("age").summary())
except ValueError as exc:
    print("robust_bc not available:", exc)
=======================================================
           Margins Result (delta, level=0.95)          
=======================================================
     estimate  std err       z  P>|z|  [95% Conf. Int.]
-------------------------------------------------------
age    0.0095   0.0013  7.1816  0.000    0.0069, 0.0121
=======================================================

n = 600
κ: 0.063
Delta-vs-sim disagreement: 9.631%
robust_bc not available: cov_robust_bc is not available on this fit.

Formula vs array interface

If you fit with sm.GEE(y, X, groups=...) instead of smf.gee(...), provide the training data explicitly so the adapter can reconstruct the design matrix:

from pymargins.adapters import StatsmodelsGEEAdapter

adapter = StatsmodelsGEEAdapter(fit_array, training_data=df)
m = Margins.linear_scale(fit_array, adapter=adapter, at="overall")