---
jupytext:
  text_representation:
    extension: .md
    format_name: myst
    format_version: 0.13
kernelspec:
  display_name: Python 3
  language: python
  name: python3
---

# Effects by subgroup with `over=`

`over=` partitions the sample and computes the estimand *within* each
group, returning one row per group; the rows share a joint covariance.
It works on `predict` and `dydx` (the aggregating estimands). For
discrete contrasts by subgroup, pin the group in the scenario instead
(see *Discrete contrasts by subgroup* below).

```{code-cell} python
import numpy as np
import pandas as pd
import statsmodels.api as sm
import statsmodels.formula.api as smf

from pymargins import Margins

rng = np.random.default_rng(42)
n = 4000
df = pd.DataFrame({
    "age": rng.integers(20, 75, n),
    "treated": rng.binomial(1, 0.5, n),
    "region": rng.choice(["N", "S", "E", "W"], n),
})
lp = -3 + 0.04 * df["age"] + 0.9 * df["treated"] + 0.05 * df["age"] * df["treated"]
df["y"] = rng.binomial(1, 1 / (1 + np.exp(-lp)))

fit = smf.glm("y ~ age * treated + C(region)", data=df,
              family=sm.families.Binomial()).fit()
m = Margins.linear_scale(fit, at="overall")
```

## AME within each subgroup

Pass the name of the grouping column. Each output row is the average
marginal effect computed over the observations in that group only:

```{code-cell} python
print(m.dydx("age", over="treated").summary())
```

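The slope of `age` differs between the treated and untreated groups.
Here the model contains an `age × treated` interaction, which lets the
slope differ by group; the nonlinear link contributes as well (see
*Nonlinear models give heterogeneity for free* below).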
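To make the within-group averaging concrete, here is a minimal pure-Python sketch that does not use `pymargins`. The coefficients and the six-row sample are made up for illustration (they are not estimates from the fit above), and `row_effect` is a hypothetical helper; the library's internals may well differ.

```python
import math
from collections import defaultdict

# Hypothetical logit coefficients (illustrative, not fitted values).
B0, B_AGE, B_TREATED, B_INTERACT = -3.0, 0.04, 0.9, 0.05

rows = [
    {"age": 25, "treated": 0}, {"age": 50, "treated": 0}, {"age": 70, "treated": 0},
    {"age": 25, "treated": 1}, {"age": 50, "treated": 1}, {"age": 70, "treated": 1},
]

def row_effect(row):
    """Analytic dP(y=1)/d(age) for one row of a logit model with an age x treated term."""
    eta = (B0 + B_AGE * row["age"] + B_TREATED * row["treated"]
           + B_INTERACT * row["age"] * row["treated"])
    p = 1.0 / (1.0 + math.exp(-eta))
    # Chain rule: d(eta)/d(age) times the logistic density p * (1 - p).
    return (B_AGE + B_INTERACT * row["treated"]) * p * (1.0 - p)

# Partition the rows by group, then average the row-level effects within each group.
by_group = defaultdict(list)
for row in rows:
    by_group[row["treated"]].append(row_effect(row))

ame = {group: sum(effects) / len(effects) for group, effects in by_group.items()}
print(ame)
```

Each group's value uses only that group's rows, which is the sense in which the estimand is computed "within" the group.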

## Predicted values within each subgroup

`predict` takes `over=` the same way. Combine it with `atexog` to sweep
a variable *inside* each group:

```{code-cell} python
res = m.predict(atexog={"treated": [0, 1]}, over="region")
print(res.to_frame()[["region", "treated", "estimate", "std_error"]])
```

Other covariates are averaged within each region, so the rows are
region-specific adjusted predictions rather than a single pooled
profile.
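The mechanics can be sketched without the library, assuming predictions are averaged over the observed rows of each group in the same way the AME section describes for `dydx`. If a session instead evaluates at covariate means, the arithmetic differs. All coefficients, the region offsets, and the four-row sample below are hypothetical.

```python
import math

# Hypothetical additive-logit coefficients and region offsets (illustrative only).
B0, B_AGE, B_TREATED = -3.0, 0.04, 0.9
REGION_OFFSET = {"N": 0.0, "S": 0.3}

rows = [
    {"age": 30, "treated": 0, "region": "N"},
    {"age": 60, "treated": 1, "region": "N"},
    {"age": 45, "treated": 1, "region": "S"},
    {"age": 70, "treated": 0, "region": "S"},
]

def predict(row):
    """Predicted probability for one row."""
    eta = (B0 + B_AGE * row["age"] + B_TREATED * row["treated"]
           + REGION_OFFSET[row["region"]])
    return 1.0 / (1.0 + math.exp(-eta))

adjusted = {}
for region in sorted({r["region"] for r in rows}):
    members = [r for r in rows if r["region"] == region]
    for value in (0, 1):
        # Override `treated` on every row of the group; age keeps its observed value.
        preds = [predict({**r, "treated": value}) for r in members]
        adjusted[(region, value)] = sum(preds) / len(preds)

for key, estimate in adjusted.items():
    print(key, round(estimate, 3))
```

The result has one entry per (region, swept value) pair, mirroring the row layout of the frame printed above.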

## Crossing several grouping variables

`over=` accepts a list. The sample is partitioned by the full cross of
the listed columns, one row per non-empty cell:

```{code-cell} python
print(m.dydx("age", over=["treated", "region"]).summary())
```
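As a rough illustration of "one row per non-empty cell", the partition can be sketched by keying each row on the tuple of its grouping values. This is a generic grouping pattern on a made-up four-row sample, not a description of how `pymargins` implements it.

```python
from collections import defaultdict

rows = [
    {"treated": 0, "region": "N"},
    {"treated": 1, "region": "N"},
    {"treated": 1, "region": "N"},
    {"treated": 0, "region": "S"},
]
over = ["treated", "region"]

# Key each row by the tuple of its values on the grouping columns.
cells = defaultdict(list)
for i, row in enumerate(rows):
    cells[tuple(row[col] for col in over)].append(i)

# Only combinations that actually occur get a key, so (1, "S") is absent.
for key in sorted(cells):
    print(key, cells[key])
```

On the real data, `df.groupby(["treated", "region"]).size()` gives the cell counts, which is a quick way to spot thin cells before reading their estimates.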

## Nonlinear models give heterogeneity *for free*

In a linear model with no interaction, every subgroup AME is identical:
grouping changes each subgroup's predicted *level*, not the *effect*.
In a nonlinear model the marginal effect depends on where each subgroup
sits on the response curve, so subgroup AMEs differ **even without an
interaction term**:

```{code-cell} python
# Drop the interaction: age enters additively, treated only shifts the level.
fit_add = smf.glm("y ~ age + treated + C(region)", data=df,
                  family=sm.families.Binomial()).fit()
m_add = Margins.linear_scale(fit_add, at="overall")
print(m_add.dydx("age", over="treated").summary())
```

The two slopes still differ: the treated group sits higher on the
logistic curve, where the same change in the linear predictor maps to a
different change in probability. This is genuine link-driven
heterogeneity, not a modelling artefact, but it is a property of the
*probability* scale. On a session whose scale matches the model's link
(`logit_scale` for this logit fit, `log_scale` for a log-link model),
the subgroup effects of an additive model would instead coincide.
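The scale dependence follows from the chain rule. Writing $\eta$ for the linear predictor of the additive logit model, $\beta_{\text{age}}$ for the coefficient on `age`, and $p = 1 / (1 + e^{-\eta})$ for the predicted probability (notation introduced here for illustration):

```{math}
\frac{\partial p}{\partial\, \text{age}} = \beta_{\text{age}}\, p\,(1 - p),
\qquad
\frac{\partial\, \operatorname{logit}(p)}{\partial\, \text{age}}
  = \frac{\partial \eta}{\partial\, \text{age}} = \beta_{\text{age}}
```

The first expression depends on $p$, so it varies with each row's position on the curve and its subgroup averages differ. The second is the same constant for every row, and therefore for every subgroup average.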

## Discrete contrasts by subgroup

`contrasts` does not take `over=`: a contrast is a single linear
combination of scenario estimates, not an aggregation over rows. To get
a contrast *within* each subgroup, pin the group in the scenarios (the
grouping variable must be a model regressor) and request all groups'
contrasts at once via a named-contrast `dict`, which gives them a joint
covariance:

```{code-cell} python
regions = ["N", "S", "E", "W"]

# Two scenarios per region, in a fixed order: treated first, then control.
scenarios = []
for r in regions:
    scenarios.append({"atexog": {"treated": 1, "region": r}, "label": f"{r}:treated"})
    scenarios.append({"atexog": {"treated": 0, "region": r}, "label": f"{r}:control"})

# One weight vector per region: +1 on its treated scenario, -1 on its control.
weights = {}
for i, r in enumerate(regions):
    w = [0] * len(scenarios)
    w[2 * i], w[2 * i + 1] = 1, -1
    weights[f"effect[{r}]"] = w

print(m.contrasts(scenarios=scenarios, contrasts=weights).summary())
```

Because the contrasts share a covariance, you can follow up with a
joint test that the subgroup effects are equal, or add a
difference-of-contrasts row to read a specific between-group gap
directly — see [](contrasts.md) and the
[union-premium demo](../demos/wage_heterogeneity.md).
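To see why the joint covariance matters for a difference-of-contrasts row, the arithmetic can be worked by hand. The sketch below uses made-up estimates and a made-up covariance matrix for two regions only; `apply_contrast` is a hypothetical helper, not part of `pymargins`.

```python
# Hypothetical scenario estimates, in the order
# N:treated, N:control, S:treated, S:control.
estimates = [0.62, 0.41, 0.58, 0.44]

# Hypothetical joint covariance of those four estimates (symmetric).
cov = [
    [0.00040, 0.00010, 0.00005, 0.00002],
    [0.00010, 0.00040, 0.00002, 0.00005],
    [0.00005, 0.00002, 0.00050, 0.00010],
    [0.00002, 0.00005, 0.00010, 0.00050],
]

weights = {
    "effect[N]": [1, -1, 0, 0],
    "effect[S]": [0, 0, 1, -1],
    # Difference of contrasts: effect[N] - effect[S].
    "gap[N-S]": [1, -1, -1, 1],
}

def apply_contrast(w):
    """Return (estimate, variance) of the linear combination w . estimates."""
    est = sum(wi * ei for wi, ei in zip(w, estimates))
    k = len(w)
    var = sum(w[i] * cov[i][j] * w[j] for i in range(k) for j in range(k))
    return est, var

results = {name: apply_contrast(w) for name, w in weights.items()}
for name, (est, var) in results.items():
    print(f"{name}: estimate={est:.3f}, std_error={var ** 0.5:.4f}")
```

With these numbers the variance of the gap is smaller than the sum of the two effect variances, because the scenario estimates are correlated across regions. Treating the per-region effects as independent would misstate the standard error of the gap.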

## Where to next

- [](../demos/wage_heterogeneity.md) — a full worked subgroup analysis
  (union premium by education) on real panel data.
- [](contrasts.md) — the contrast primitive used for per-subgroup
  effects and their differences.
- [](grid_predictions.md) — sweeping a covariate grid, which composes
  with `over=` for subgroup-by-grid surfaces.