---
jupytext:
  text_representation:
    extension: .md
    format_name: myst
    format_version: 0.13
kernelspec:
  display_name: Python 3
  language: python
  name: python3
---

# Transform pipelines — honest CIs for data-dependent preprocessing

Most applied work runs some preprocessing *before* the model: imputing a
missing covariate, trimming implausible values, dropping outliers by a
data-driven rule. The estimate then carries the fingerprints of that
preprocessing — but the standard inference pretends the cleaned data were
handed down from on high. If the imputation model, the trim bounds, or the
outlier flags would have come out differently on a different sample, your
confidence interval is too narrow.

A `pymargins` **transform pipeline** closes that gap. You pass
`transforms=[...]` to a bootstrap session; each stage is a `frame → frame`
transform that the bootstrap **re-derives on every replicate**, exactly the
way [`matching`](../howto/matching.md) re-matches. The imputer re-fits, the
trim bounds recompute, the outlier rule re-runs — so the resample
distribution absorbs the variability of the preprocessing itself.

This demo runs three stages end-to-end on the Wage panel:

1. `reimpute` — bootstrap-then-impute multiple imputation, and how its CI
   widens relative to a naive single-imputation bootstrap.
2. Composition — chaining `reimpute` with a `trim` stage in one pipeline.
3. `drop_outliers` — a data-driven outlier rule re-derived per replicate.

It closes with the **structural guards**: the combinations the session
rejects at construction, and why.

```{code-cell} python
import warnings

import numpy as np
import pandas as pd
import statsmodels.formula.api as smf
from linearmodels.datasets import wage_panel
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer

from pymargins import Margins, drop_outliers, reimpute, trim

cols = ["lwage", "exper", "educ", "married", "union"]
df = wage_panel.load().reset_index(drop=True)[cols].copy()
print(df.describe().round(2))
```

## 1. `reimpute` — multiple imputation under the bootstrap

We inject MAR missingness into `educ`: workers with more experience are
more likely to have an unrecorded education, so the missingness depends on
an observed covariate (missing-at-random, not completely-at-random).

```{code-cell} python
rng = np.random.default_rng(7)
p_miss = 1 / (1 + np.exp(-(df["exper"] - df["exper"].mean()) / 2))
miss = rng.uniform(size=len(df)) < 0.30 * p_miss

df_nan = df.copy()
df_nan.loc[miss, "educ"] = np.nan
print(f"missing educ: {int(miss.sum())} rows ({miss.mean():.1%})")
```

The **point estimate** comes from a single, cheap mean-fill imputation —
this matches how the package already treats bootstrap (the estimate is from
the original fit, the CI from the resample distribution).

```{code-cell} python
df_init = df_nan.fillna(df_nan.mean(numeric_only=True))
fit = smf.ols("lwage ~ exper + educ + married + union", data=df_init).fit()
print(fit.params.round(4))
```

For the bootstrap we need a **stochastic, runnable** imputer: a callable
that takes a DataFrame and returns a fit-and-imputed DataFrame. We use
sklearn's `IterativeImputer` with `sample_posterior=True` so each draw adds
residual noise — a deterministic imputer (mean-fill) would leave the CIs too
narrow, and `reimpute` warns you if it detects one.

```{code-cell} python
imp = IterativeImputer(max_iter=10, random_state=0, sample_posterior=True)


def imputer(frame):
    return pd.DataFrame(imp.fit_transform(frame), columns=frame.columns)
```

### Naive single-imputation bootstrap

First the wrong-but-common approach: bootstrap the *already* mean-filled
frame. Every replicate resamples the same frozen imputed values, so the CI
sees only sampling variability — not the uncertainty about what the missing
education values actually were.

```{code-cell} python
m_naive = Margins.linear_scale(fit, method="bootstrap", n_boot=500, rng_seed=3)
print(m_naive.dydx("educ").summary())
```

### Bootstrap-then-impute

Now the pipeline. `reimpute(imputer, incomplete=df_nan)` tells the bootstrap
to resample the *incomplete* frame and re-impute it fresh on every
replicate. The session per-replicate seed flows into the imputer's
`random_state`, so the run is reproducible.

```{code-cell} python
with warnings.catch_warnings():
    warnings.simplefilter("ignore")  # silence the imputer's convergence chatter
    m_mi = Margins.linear_scale(
        fit,
        transforms=[reimpute(imputer, incomplete=df_nan)],
        method="bootstrap",
        n_boot=500,
        rng_seed=3,
    )
print(m_mi.dydx("educ").summary())
```

The point estimate is identical — both sessions report the slope from the
same `fit`. But the **standard error is larger** and the interval is wider:
that extra width is the imputation-model uncertainty the naive bootstrap
threw away. Because the missingness lands on `educ`, the `educ` slope is
exactly the coefficient that should pay for it; a covariate with no
missingness in its predictors would be essentially unchanged.

For the full contract — why the imputer must be stochastic, how seeding
works, the BCa restriction — see the
[`reimpute` tutorial](../tutorials/mi_via_reimpute.md).

## 2. Composing stages

Stages compose in the order you list them; each one sees the frame the
previous stage produced. Here we re-impute, then `trim` away any imputed
`educ` below 2 (years of schooling that low are almost certainly a bad draw,
not a real observation). Both steps re-run on every replicate.

```{code-cell} python
with warnings.catch_warnings():
    warnings.simplefilter("ignore")
    m_compose = Margins.linear_scale(
        fit,
        transforms=[
            reimpute(imputer, incomplete=df_nan),
            trim(lower=2.0, columns=["educ"]),
        ],
        method="bootstrap",
        n_boot=400,
        rng_seed=3,
    )
print(m_compose.dydx("educ").summary())
```

`trim` sets `alters_rows=True`: it changes the row set, so the bootstrap
refits with `index=None` rather than carrying the resample index through. The
engine reads that flag off the stage's declared contract — you do not have to
manage index bookkeeping yourself.

## 3. `drop_outliers` — a re-derived detection rule

`drop_outliers(rule)` takes a callable returning a boolean mask of rows to
drop. The rule is data-dependent here — it flags log-wages more than five
median-absolute-deviations below the median — so it genuinely should be
recomputed on each resample. We drop back to the complete data for this part.

```{code-cell} python
def far_below(frame):
    med = frame["lwage"].median()
    mad = (frame["lwage"] - med).abs().median()
    return frame["lwage"] < med - 5 * mad


print(f"flagged on the full sample: {int(far_below(df).sum())} rows")

df_clean = df[~far_below(df)].reset_index(drop=True)
fit_clean = smf.ols("lwage ~ exper + educ + married + union", data=df_clean).fit()

m_out = Margins.linear_scale(
    fit_clean,
    transforms=[drop_outliers(far_below)],
    method="bootstrap",
    n_boot=500,
    rng_seed=0,
)
print(m_out.dydx("educ").summary())
```

Each replicate recomputes the median and MAD on its *own* resample and
re-applies the threshold, so the interval reflects the fact that the set of
flagged rows is itself a random quantity. When the flagged rows barely move
the coefficient — as for the `educ` slope here — the pipeline CI tracks the
plain bootstrap closely; when they do move it, the pipeline is the honest
interval.

## 4. Structural guards

The pipeline rejects combinations that would silently produce wrong numbers,
and it does so at session construction — not three minutes into a bootstrap.

A stage that declares `requires_resampling=True` (like `reimpute`) is
bootstrap-only; there is no resample distribution under the delta method to
re-derive it over:

```{code-cell} python
try:
    Margins.linear_scale(
        fit,
        transforms=[reimpute(imputer, incomplete=df_nan, warn_on_deterministic=False)],
        method="delta",
    )
except ValueError as exc:
    print(exc)
```

A row-altering stage (`drop_outliers`, `trim`) cannot be combined with
session `weights=`: the stage thins the rows but the weight vector is not
thinned with it, so the weighted aggregation would misalign.

```{code-cell} python
try:
    Margins.linear_scale(
        fit,
        transforms=[drop_outliers(far_below)],
        weights=np.ones(len(df)),
        method="bootstrap",
    )
except ValueError as exc:
    print(exc)
```

The same spirit covers the rest of the contract: `survey_design` rejects
row-altering and source-overriding stages (a fixed survey design cannot have
its rows or source frame changed underneath it), `matching=` and
`transforms=` are mutually exclusive, and `ci_method="bca"` is rejected
because the BCa jackknife would operate on the raw frame without running the
pipeline.

## Where to next

- [](../tutorials/mi_via_reimpute.md) — the full `reimpute` contract:
  stochastic-imputer requirement, seeding, and limitations.
- [](../howto/matching.md) — the re-derive-per-replicate pattern that the
  pipeline generalizes.
- [](../explanations/session_precommitment.md) — why the session freezes
  inference parameters once the bootstrap bank is built.