Union premium heterogeneity — effects by subgroup

An average marginal effect answers “what is the effect on average?”. Often the research question is sharper: for whom is the effect large, and for whom is it small? The union wage premium is the textbook case — unions are widely held to compress wages, raising pay most at the bottom of the skill distribution and least at the top. That is a statement about heterogeneity, and it is invisible in a single AME.

This demo uses the linearmodels wage panel (545 young men, 1980–1987) to estimate the union premium by education band, using the over= subgroup argument and a contrast that lets the premium differ across groups. It also makes the central modelling point explicit: a subgroup effect can only differ if the model lets it — through an interaction, or through a nonlinear link.

import numpy as np
import pandas as pd
import statsmodels.formula.api as smf
from linearmodels.datasets import wage_panel

from pymargins import Margins

df = wage_panel.load().reset_index()
df["educ_band"] = pd.cut(
    df["educ"], [0, 11, 12, 20], labels=["<HS", "HS", "college"]
).astype(str)
print(df[["nr", "year", "lwage", "union", "married", "exper", "educ",
          "educ_band"]].head())
print("\nObservations per education band:")
print(df["educ_band"].value_counts())
   nr  year     lwage  union  married  exper  educ educ_band
0  13  1980  1.197540      0        0      1    14   college
1  13  1981  1.853060      1        0      2    14   college
2  13  1982  1.344462      0        0      3    14   college
3  13  1983  1.433213      0        0      4    14   college
4  13  1984  1.568125      0        0      5    14   college

Observations per education band:
educ_band
HS         1848
<HS        1472
college    1040
Name: count, dtype: int64

1. A model that allows heterogeneity

If we regress log wage on union plus controls without an interaction, the union coefficient is a single number and the implied premium is identical in every subgroup — over= would just return the same value three times. To let the premium vary by education we interact union with educ_band (and, so that the return to experience can vary too, exper with educ_band):

fit = smf.ols(
    "lwage ~ union * educ_band + exper * educ_band + expersq + married + C(year)",
    data=df,
).fit()
m = Margins.linear_scale(
    fit,
    vcov={"type": "cluster", "groups": df["nr"]},  # cluster by worker
    at="overall",
)

We cluster the standard errors by worker (nr) because the same man is observed in up to eight years — the panel analogue of the survey-design adjustment in the California API demo.

2. Predicted wages by union status, within each band

over="educ_band" averages the prediction within each education band rather than over the whole sample. Toggling union across [0, 1] gives the union and non-union wage in each band:

profiles = m.predict(atexog={"union": [0, 1]}, over="educ_band")
print(profiles.summary())
===============================================================================
                       Margins Result (delta, level=0.95)                      
===============================================================================
                            estimate  std err        z  P>|z|  [95% Conf. Int.]
-------------------------------------------------------------------------------
educ_band=<HS, union=0        1.4286   0.0299  47.8568  0.000    1.3701, 1.4871
educ_band=<HS, union=1        1.6163   0.0484  33.3711  0.000    1.5214, 1.7113
educ_band=HS, union=0         1.6158   0.0240  67.4288  0.000    1.5688, 1.6627
educ_band=HS, union=1         1.8381   0.0328  55.9693  0.000    1.7737, 1.9025
educ_band=college, union=0    1.8210   0.0355  51.3477  0.000    1.7515, 1.8905
educ_band=college, union=1    1.9201   0.0683  28.1324  0.000    1.7863, 2.0538
===============================================================================

n = 4360
κ: max=0.000
Delta-vs-sim disagreement: 0.413%

Each row is a band × union cell, with the other covariates averaged within that band — so the comparison holds composition fixed at each education level rather than at the pooled mean.

3. The premium, allowed to differ across bands

The premium within a band is the union-minus-non-union contrast, taken inside that band by pinning educ_band in the scenario. A dict of named contrasts returns all three premiums in one result with a joint covariance — which is what lets us then test whether they differ.

bands = ["<HS", "HS", "college"]
scenarios = []
for b in bands:
    scenarios.append({"atexog": {"union": 1, "educ_band": b}, "label": f"{b}:union"})
    scenarios.append({"atexog": {"union": 0, "educ_band": b}, "label": f"{b}:nonunion"})

# premium[b] = (union cell) − (non-union cell) for band b
weights = {}
for i, b in enumerate(bands):
    w = [0] * len(scenarios)
    w[2 * i] = 1
    w[2 * i + 1] = -1
    weights[f"premium[{b}]"] = w

# add a difference-of-premiums so we can read its CI directly
w = [0] * len(scenarios)
w[0], w[1] = -1, 1          # subtract <HS premium
w[4], w[5] = 1, -1          # add college premium
weights["college − <HS"] = w

premiums = m.contrasts(scenarios=scenarios, contrasts=weights)
print(premiums.summary())
=====================================================================
                  Margins Result (delta, level=0.95)                 
=====================================================================
                  estimate  std err        z  P>|z|  [95% Conf. Int.]
---------------------------------------------------------------------
premium[<HS]        0.1877   0.0529   3.5508  0.000    0.0841, 0.2913
premium[HS]         0.2223   0.0373   5.9548  0.000    0.1492, 0.2955
premium[college]    0.0991   0.0712   1.3922  0.164   -0.0404, 0.2386
college − <HS      -0.0886   0.0885  -1.0014  0.317   -0.2620, 0.0848
=====================================================================

n = 4360
κ: max=0.000
Delta-vs-sim disagreement: 6.768%

The premium is largest for high-school and below-high-school workers and roughly half the size for college graduates — the wage-compression story, made quantitative. The final row, college <HS, is the heterogeneity itself: a single contrast with its own interval, so you can report whether the gap in premiums is distinguishable from noise rather than eyeballing three overlapping confidence intervals.

4. Continuous-variable heterogeneity with over=

For a continuous regressor the subgroup slope comes straight from dydx(..., over=...) — no scenario bookkeeping. Here is the return to labour-market experience by education band, which the exper × educ_band interaction lets vary:

ret = m.dydx("exper", over="educ_band")
print(ret.summary())
=============================================================================
                      Margins Result (delta, level=0.95)                     
=============================================================================
                          estimate  std err        z  P>|z|  [95% Conf. Int.]
-----------------------------------------------------------------------------
educ_band=<HS, exper        0.0543   0.0082   6.6563  0.000    0.0383, 0.0703
educ_band=HS, exper         0.0363   0.0012  29.9430  0.000    0.0339, 0.0387
educ_band=college, exper    0.0670   0.0210   3.1986  0.001    0.0260, 0.1081
=============================================================================

n = 4360
κ: max=0.000
Delta-vs-sim disagreement: 105.034%

over= accepts a list to cross several grouping variables at once (over=["union", "educ_band"]), giving one slope per cell — the same machinery scales from one subgroup dimension to several.

The modelling caveat, stated plainly

Subgroup effects are only as rich as the model that produces them:

  • Linear model, no interaction → every subgroup AME is identical. over= is descriptive (different baselines) but the effect is constant by construction.

  • Linear model, with interaction (this demo) → effects differ by exactly what the interaction encodes.

  • Nonlinear link (logit, Poisson, Cox) → effects differ across subgroups even with no interaction, because the marginal effect depends on where each subgroup sits on the response curve. The subgroup how-to demonstrates this case.

over= reports the heterogeneity your specification implies; it does not manufacture heterogeneity the model has no room for. Deciding which interactions belong in the model is a modelling choice that precedes the margins call.

Where to next