Transform pipelines — honest CIs for data-dependent preprocessing¶
Most applied work runs some preprocessing before the model: imputing a missing covariate, trimming implausible values, dropping outliers by a data-driven rule. The estimate then carries the fingerprints of that preprocessing — but the standard inference pretends the cleaned data were handed down from on high. If the imputation model, the trim bounds, or the outlier flags would have come out differently on a different sample, your confidence interval is too narrow.
A pymargins transform pipeline closes that gap. You pass
transforms=[...] to a bootstrap session; each stage is a frame → frame
transform that the bootstrap re-derives on every replicate, exactly the
way matching re-matches. The imputer re-fits, the
trim bounds recompute, the outlier rule re-runs — so the resample
distribution absorbs the variability of the preprocessing itself.
This demo runs three stages end-to-end on the Wage panel:
reimpute— bootstrap-then-impute multiple imputation, and how its CI widens relative to a naive single-imputation bootstrap.Composition — chaining
reimputewith atrimstage in one pipeline.drop_outliers— a data-driven outlier rule re-derived per replicate.
It closes with the structural guards: the combinations the session rejects at construction, and why.
import warnings
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf
from linearmodels.datasets import wage_panel
from sklearn.experimental import enable_iterative_imputer # noqa: F401
from sklearn.impute import IterativeImputer
from pymargins import Margins, drop_outliers, reimpute, trim
cols = ["lwage", "exper", "educ", "married", "union"]
df = wage_panel.load().reset_index(drop=True)[cols].copy()
print(df.describe().round(2))
lwage exper educ married union
count 4360.00 4360.00 4360.00 4360.00 4360.00
mean 1.65 6.51 11.77 0.44 0.24
std 0.53 2.83 1.75 0.50 0.43
min -3.58 0.00 3.00 0.00 0.00
25% 1.35 4.00 11.00 0.00 0.00
50% 1.67 6.00 12.00 0.00 0.00
75% 1.99 9.00 12.00 1.00 0.00
max 4.05 18.00 16.00 1.00 1.00
1. reimpute — multiple imputation under the bootstrap¶
We inject MAR missingness into educ: workers with more experience are
more likely to have an unrecorded education, so the missingness depends on
an observed covariate (missing-at-random, not completely-at-random).
rng = np.random.default_rng(7)
p_miss = 1 / (1 + np.exp(-(df["exper"] - df["exper"].mean()) / 2))
miss = rng.uniform(size=len(df)) < 0.30 * p_miss
df_nan = df.copy()
df_nan.loc[miss, "educ"] = np.nan
print(f"missing educ: {int(miss.sum())} rows ({miss.mean():.1%})")
missing educ: 654 rows (15.0%)
The point estimate comes from a single, cheap mean-fill imputation — this matches how the package already treats bootstrap (the estimate is from the original fit, the CI from the resample distribution).
df_init = df_nan.fillna(df_nan.mean(numeric_only=True))
fit = smf.ols("lwage ~ exper + educ + married + union", data=df_init).fit()
print(fit.params.round(4))
Intercept 0.1010
exper 0.0429
educ 0.0986
married 0.1414
union 0.1730
dtype: float64
For the bootstrap we need a stochastic, runnable imputer: a callable
that takes a DataFrame and returns a fit-and-imputed DataFrame. We use
sklearn’s IterativeImputer with sample_posterior=True so each draw adds
residual noise — a deterministic imputer (mean-fill) would leave the CIs too
narrow, and reimpute warns you if it detects one.
imp = IterativeImputer(max_iter=10, random_state=0, sample_posterior=True)
def imputer(frame):
return pd.DataFrame(imp.fit_transform(frame), columns=frame.columns)
Naive single-imputation bootstrap¶
First the wrong-but-common approach: bootstrap the already mean-filled frame. Every replicate resamples the same frozen imputed values, so the CI sees only sampling variability — not the uncertainty about what the missing education values actually were.
m_naive = Margins.linear_scale(fit, method="bootstrap", n_boot=500, rng_seed=3)
print(m_naive.dydx("educ").summary())
===========================================================
Margins Result (bootstrap, level=0.95)
===========================================================
estimate std err statistic P>|z| [95% Conf. Int.]
-----------------------------------------------------------
educ 0.1009 0.0050 0.1009 0.000 0.0912, 0.1100
===========================================================
n = 4360
κ: 0.000
Bootstrap-then-impute¶
Now the pipeline. reimpute(imputer, incomplete=df_nan) tells the bootstrap
to resample the incomplete frame and re-impute it fresh on every
replicate. The session per-replicate seed flows into the imputer’s
random_state, so the run is reproducible.
with warnings.catch_warnings():
warnings.simplefilter("ignore") # silence the imputer's convergence chatter
m_mi = Margins.linear_scale(
fit,
transforms=[reimpute(imputer, incomplete=df_nan)],
method="bootstrap",
n_boot=500,
rng_seed=3,
)
print(m_mi.dydx("educ").summary())
===========================================================
Margins Result (bootstrap, level=0.95)
===========================================================
estimate std err statistic P>|z| [95% Conf. Int.]
-----------------------------------------------------------
educ 0.1009 0.0054 0.1009 0.000 0.0943, 0.1152
===========================================================
n = 4360
κ: 0.000
The point estimate is identical — both sessions report the slope from the
same fit. But the standard error is larger and the interval is wider:
that extra width is the imputation-model uncertainty the naive bootstrap
threw away. Because the missingness lands on educ, the educ slope is
exactly the coefficient that should pay for it; a covariate with no
missingness in its predictors would be essentially unchanged.
For the full contract — why the imputer must be stochastic, how seeding
works, the BCa restriction — see the
reimpute tutorial.
2. Composing stages¶
Stages compose in the order you list them; each one sees the frame the
previous stage produced. Here we re-impute, then trim away any imputed
educ below 2 (years of schooling that low are almost certainly a bad draw,
not a real observation). Both steps re-run on every replicate.
with warnings.catch_warnings():
warnings.simplefilter("ignore")
m_compose = Margins.linear_scale(
fit,
transforms=[
reimpute(imputer, incomplete=df_nan),
trim(lower=2.0, columns=["educ"]),
],
method="bootstrap",
n_boot=400,
rng_seed=3,
)
print(m_compose.dydx("educ").summary())
===========================================================
Margins Result (bootstrap, level=0.95)
===========================================================
estimate std err statistic P>|z| [95% Conf. Int.]
-----------------------------------------------------------
educ 0.1009 0.0055 0.1009 0.000 0.0942, 0.1139
===========================================================
n = 4360
κ: 0.000
trim sets alters_rows=True: it changes the row set, so the bootstrap
refits with index=None rather than carrying the resample index through. The
engine reads that flag off the stage’s declared contract — you do not have to
manage index bookkeeping yourself.
3. drop_outliers — a re-derived detection rule¶
drop_outliers(rule) takes a callable returning a boolean mask of rows to
drop. The rule is data-dependent here — it flags log-wages more than five
median-absolute-deviations below the median — so it genuinely should be
recomputed on each resample. We drop back to the complete data for this part.
def far_below(frame):
med = frame["lwage"].median()
mad = (frame["lwage"] - med).abs().median()
return frame["lwage"] < med - 5 * mad
print(f"flagged on the full sample: {int(far_below(df).sum())} rows")
df_clean = df[~far_below(df)].reset_index(drop=True)
fit_clean = smf.ols("lwage ~ exper + educ + married + union", data=df_clean).fit()
m_out = Margins.linear_scale(
fit_clean,
transforms=[drop_outliers(far_below)],
method="bootstrap",
n_boot=500,
rng_seed=0,
)
print(m_out.dydx("educ").summary())
flagged on the full sample: 51 rows
===========================================================
Margins Result (bootstrap, level=0.95)
===========================================================
estimate std err statistic P>|z| [95% Conf. Int.]
-----------------------------------------------------------
educ 0.1050 0.0037 0.1050 0.000 0.0970, 0.1111
===========================================================
n = 4309
κ: 0.000
Each replicate recomputes the median and MAD on its own resample and
re-applies the threshold, so the interval reflects the fact that the set of
flagged rows is itself a random quantity. When the flagged rows barely move
the coefficient — as for the educ slope here — the pipeline CI tracks the
plain bootstrap closely; when they do move it, the pipeline is the honest
interval.
4. Structural guards¶
The pipeline rejects combinations that would silently produce wrong numbers, and it does so at session construction — not three minutes into a bootstrap.
A stage that declares requires_resampling=True (like reimpute) is
bootstrap-only; there is no resample distribution under the delta method to
re-derive it over:
try:
Margins.linear_scale(
fit,
transforms=[reimpute(imputer, incomplete=df_nan, warn_on_deterministic=False)],
method="delta",
)
except ValueError as exc:
print(exc)
method='delta' is not compatible with transform stage _ReimputeStage because requires_resampling=True. Use method='bootstrap'.
A row-altering stage (drop_outliers, trim) cannot be combined with
session weights=: the stage thins the rows but the weight vector is not
thinned with it, so the weighted aggregation would misalign.
try:
Margins.linear_scale(
fit,
transforms=[drop_outliers(far_below)],
weights=np.ones(len(df)),
method="bootstrap",
)
except ValueError as exc:
print(exc)
weights= is not compatible with row-altering transform stages (_DropOutliersStage). Row-altering stages thin the data but session weights are not thinned, so weighted aggregation would misalign.
The same spirit covers the rest of the contract: survey_design rejects
row-altering and source-overriding stages (a fixed survey design cannot have
its rows or source frame changed underneath it), matching= and
transforms= are mutually exclusive, and ci_method="bca" is rejected
because the BCa jackknife would operate on the raw frame without running the
pipeline.
Where to next¶
Multiple imputation via reimpute — the full
reimputecontract: stochastic-imputer requirement, seeding, and limitations.Matching support — the re-derive-per-replicate pattern that the pipeline generalizes.
The session pre-commitment — why the session freezes inference parameters once the bootstrap bank is built.