Multiple imputation via `reimpute`

The reimpute stage implements bootstrap-then-impute: each bootstrap replicate gets exactly one imputation, and the imputer is re-fit from scratch on that replicate. Because the imputation model is re-estimated every time, its parameter uncertainty flows into the bootstrap distribution, which yields valid confidence intervals without pooling by Rubin’s rules.

When to use this

Your data has missing values that you do not want to listwise-delete.
You have a runnable imputer (a Python callable that takes a DataFrame and returns an imputed DataFrame). Frozen, pre-computed imputed frames are not supported — the stage must be able to re-impute a resampled draw.
You are willing to use bootstrap inference (method="bootstrap"). reimpute is invalid under delta-method or simulation inference.

Quick example

import numpy as np
import pandas as pd
import statsmodels.formula.api as smf
from sklearn.experimental import enable_iterative_imputer  # noqa: F401 (side-effect import that enables IterativeImputer)
from sklearn.impute import IterativeImputer

from pymargins import Margins, reimpute

# 1. Build data with missingness in x1 (completely at random here, a special case of MAR)
rng = np.random.default_rng(42)
n = 400
df = pd.DataFrame({
    "x1": rng.normal(size=n),
    "x2": rng.normal(size=n),
})
df["y"] = 1.0 + 0.6 * df["x1"] - 0.4 * df["x2"] + rng.normal(scale=0.5, size=n)
missing = rng.uniform(size=n) < 0.25
df_nan = df.copy()
df_nan.loc[missing, "x1"] = np.nan

# 2. Fit the model on a single initial imputation
#    (the point estimate comes from this fit)
df_init = df_nan.fillna(df_nan.mean())
fit = smf.ols("y ~ x1 + x2", data=df_init).fit()

# 3. Wrap the imputer so it returns a DataFrame
imp = IterativeImputer(max_iter=10, random_state=0, sample_posterior=True)

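def imputer(frame):
    # Re-fit on whatever frame is passed in (each bootstrap resample).
    arr = imp.fit_transform(frame)
    # Keep the incoming index so rows stay aligned with the resample.
    return pd.DataFrame(arr, columns=frame.columns, index=frame.index)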

# 4. Run bootstrap-then-impute
m = Margins(
    fit,
    transforms=[reimpute(imputer, incomplete=df_nan)],
    method="bootstrap",
    n_boot=1000,
    rng_seed=7,
)
r = m.predict(atexog={"x1": 0, "x2": 0})

Key rules

1. The imputer must be stochastic

A deterministic imputer (e.g. SimpleImputer(strategy="mean") or IterativeImputer(sample_posterior=False)) fills each missing cell with a conditional mean and adds no residual draw. The bootstrap still re-fits the imputer on each replicate, so the filled values track that resample’s mean, but the imputed cells carry no residual variance and your CIs will be too narrow.
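The variance loss can be seen on a single column. The following is a minimal sketch using only NumPy and pandas, not pymargins or scikit-learn; the "stochastic fill" here is a toy rule (observed mean plus a normal draw) chosen for illustration, not a recommended imputation model.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 2000
x = pd.Series(rng.normal(loc=0.0, scale=1.0, size=n), name="x1")

# Knock out roughly 25% of the values completely at random.
x_nan = x.mask(rng.uniform(size=n) < 0.25)

# Deterministic fill: every missing cell gets the observed mean.
mean_fill = x_nan.fillna(x_nan.mean())

# Toy stochastic fill: observed mean plus a draw from the observed spread.
gaps = x_nan.isna()
draws = rng.normal(loc=x_nan.mean(), scale=x_nan.std(), size=int(gaps.sum()))
stochastic_fill = x_nan.copy()
stochastic_fill.loc[gaps] = draws

print(f"observed std:        {x_nan.std():.3f}")
print(f"mean-filled std:     {mean_fill.std():.3f}")
print(f"stochastic-fill std: {stochastic_fill.std():.3f}")
```

The mean-filled column has a smaller standard deviation than the observed values because the filled cells add rows without adding any spread. Any standard error computed from that column inherits the shrinkage.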

reimpute runs a cheap construction-time guard: it calls the imputer twice on the same small sample, and if the output is byte-identical it warns you. You can suppress this with warn_on_deterministic=False.
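The idea behind such a guard can be sketched in a few lines. `looks_deterministic`, `mean_imputer`, and `noisy_imputer` are hypothetical names for this illustration; this is an assumption about how a check of this kind could work, not the pymargins implementation.

```python
import numpy as np
import pandas as pd


def looks_deterministic(imputer, frame, n_rows=50):
    """Return True if two calls on the same small sample give identical output.

    Hypothetical helper; a sketch of the idea only.
    """
    sample = frame.head(n_rows)
    first = imputer(sample.copy())
    second = imputer(sample.copy())
    return first.equals(second)


def mean_imputer(frame):
    # Deterministic: the same input always yields the same filled frame.
    return frame.fillna(frame.mean())


_rng = np.random.default_rng(0)


def noisy_imputer(frame):
    # Stochastic: fills each missing cell with the column mean plus a normal draw.
    out = frame.copy()
    for col in out.columns:
        gaps = out[col].isna()
        if gaps.any():
            out.loc[gaps, col] = out[col].mean() + _rng.normal(
                scale=out[col].std(), size=int(gaps.sum())
            )
    return out


demo = pd.DataFrame({
    "x1": [1.0, np.nan, 3.0, np.nan, 5.0],
    "x2": [2.0, 1.0, 0.5, 4.0, 3.0],
})
print(looks_deterministic(mean_imputer, demo))   # True
print(looks_deterministic(noisy_imputer, demo))  # False
```

A check like this only catches imputers that are deterministic on the sample it sees, so passing it does not prove the residual draw is adequate.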

2. Seed the imputer for reproducibility

The session rng_seed controls the bootstrap resample indices, but it does not automatically seed your imputer. For reproducible draws you must set random_state on the imputer object itself:

imp = IterativeImputer(random_state=42, sample_posterior=True)

If the imputer exposes random_state=None, reimpute warns at construction.

Important: create a fresh imputer object for each Margins session. A single imputer instance shared across sessions can carry internal state over from the first session, which breaks reproducibility even with a fixed random_state.
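One way to follow this rule is to build the imputer through a small factory, so every session starts from the same seeded state. The sketch below uses a hypothetical `make_imputer` factory and a toy fill rule in place of a real imputation model, so that it runs with NumPy and pandas alone; the same pattern would apply to constructing a new IterativeImputer(random_state=...) per session.

```python
import numpy as np
import pandas as pd


def make_imputer(seed):
    """Build a fresh stochastic imputer with its own seeded generator.

    Hypothetical factory for illustration; the toy fill rule (column mean
    plus a normal draw) stands in for a real imputation model.
    """
    rng = np.random.default_rng(seed)

    def imputer(frame):
        out = frame.copy()
        for col in out.columns:
            gaps = out[col].isna()
            if gaps.any():
                out.loc[gaps, col] = rng.normal(
                    loc=out[col].mean(),
                    scale=out[col].std(),
                    size=int(gaps.sum()),
                )
        return out

    return imputer


data = pd.DataFrame({"x1": [0.2, np.nan, 1.4, np.nan, -0.7, 0.9]})

# Two sessions, each with a freshly built imputer: identical draws.
session_a = make_imputer(seed=42)(data)
session_b = make_imputer(seed=42)(data)
print(session_a.equals(session_b))

# One imputer reused across sessions: its generator has advanced,
# so the second use no longer reproduces the first.
shared = make_imputer(seed=42)
first_use = shared(data)
second_use = shared(data)
print(first_use.equals(second_use))
```

The reused imputer produces different fills on its second call because the random stream it owns has already been consumed by the first.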

3. The point estimate is single-imputation

The reported estimate comes from the model you passed to Margins (fitted on the initial single imputation). The bootstrap supplies imputation-aware CIs, not a pooled point estimate. This matches how the package already treats bootstrap: the estimate is from the original fit, the CI from the resample distribution.
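The split between the two sources can be written out by hand. The sketch below uses plain NumPy rather than pymargins; `ols_slope` and `stochastic_impute` are hypothetical helpers, and the toy regression imputer is an assumption made for illustration, not the package's internals.

```python
import numpy as np


def ols_slope(x, y):
    # Slope from a least-squares fit of y on [1, x].
    design = np.column_stack([np.ones_like(x), x])
    coef, *_ = np.linalg.lstsq(design, y, rcond=None)
    return coef[1]


def stochastic_impute(x, y, rng):
    # Toy regression imputer, re-fit on the data it is given:
    # predict missing x from y, then add a residual draw.
    obs = ~np.isnan(x)
    design = np.column_stack([np.ones(int(obs.sum())), y[obs]])
    coef, *_ = np.linalg.lstsq(design, x[obs], rcond=None)
    resid_sd = np.std(x[obs] - design @ coef, ddof=2)
    filled = x.copy()
    n_gap = int((~obs).sum())
    filled[~obs] = coef[0] + coef[1] * y[~obs] + rng.normal(scale=resid_sd, size=n_gap)
    return filled


rng = np.random.default_rng(7)
n = 300
x_full = rng.normal(size=n)
y = 1.0 + 0.6 * x_full + rng.normal(scale=0.5, size=n)
x = np.where(rng.uniform(size=n) < 0.25, np.nan, x_full)

# Point estimate: one fit on one initial imputation.
estimate = ols_slope(stochastic_impute(x, y, rng), y)

# Interval: resample rows, re-impute, re-fit, then take percentiles.
slopes = []
for _ in range(500):
    idx = rng.integers(0, n, size=n)
    xb, yb = x[idx], y[idx]
    slopes.append(ols_slope(stochastic_impute(xb, yb, rng), yb))
lower, upper = np.percentile(slopes, [2.5, 97.5])

print(f"estimate = {estimate:.3f}, 95% CI = ({lower:.3f}, {upper:.3f})")
```

The estimate depends on the one imputation behind the original fit, while the interval reflects both sampling variability and the imputer being re-fit on every resample.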

4. Structural columns must be complete

Columns that define the inference design — cluster=, survey_design PSU/strata, and weights= — must not contain missing values. Only substantive columns may be imputed.
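A pre-flight check along these lines can catch the problem before a long bootstrap run. `check_structural_complete` is a hypothetical helper written for this example, and the column names are made up; it is not part of pymargins.

```python
import numpy as np
import pandas as pd


def check_structural_complete(frame, structural_cols):
    """Raise if any design column has missing values.

    Hypothetical pre-flight helper for illustration.
    """
    counts = frame[list(structural_cols)].isna().sum()
    bad = counts[counts > 0]
    if not bad.empty:
        detail = ", ".join(f"{col} ({n} missing)" for col, n in bad.items())
        raise ValueError(f"structural columns must be complete: {detail}")


survey = pd.DataFrame({
    "y": [1.2, 0.7, np.nan, 2.1],          # substantive: may be imputed
    "x1": [0.5, np.nan, 1.1, 0.3],         # substantive: may be imputed
    "cluster_id": [1, 1, 2, np.nan],       # structural: must be complete
    "w": [1.0, 2.0, 1.5, 1.0],             # structural: must be complete
})

try:
    check_structural_complete(survey, ["cluster_id", "w"])
except ValueError as err:
    message = str(err)
    print(message)
```

Rows with a missing cluster or weight have to be resolved upstream, for example by dropping them or by recovering the value from the sampling records, since imputing them would change the design itself.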

5. Bootstrap only

reimpute sets requires_resampling=True, so method="delta" or method="simulation" raises at construction.

Comparison to plain bootstrap

Bootstrap-then-impute widens the CI for coefficients affected by missingness because each replicate draws a different imputation. Coefficients with no missingness in their predictors are essentially unchanged.

Limitations

Frozen frames are not supported on this path. reimpute needs a re-runnable imputer because it re-imputes every bootstrap replicate. If you already hold M precomputed completed frames (e.g. from R mice exported to CSV), pool them with Rubin’s rules via pool_imputations instead — no Python imputer required, just the M completed frames.
BCa CIs are rejected. The BCa jackknife operates on the raw incomplete frame without re-imputing, producing an inconsistent acceleration parameter. Use ci_method="percentile" (default) or "studentized".
Survey designs are incompatible. survey_design + reimpute raises because the fixed-design assumption breaks when the source frame is incomplete.
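For the frozen-frames case mentioned in the first limitation, the standard Rubin’s rules for a scalar estimate can be sketched as below. `rubin_pool` is a hypothetical helper showing the textbook formula, not the implementation or signature of pool_imputations; the degrees-of-freedom adjustment for t-based intervals is omitted, and the input numbers are made up for illustration.

```python
import numpy as np


def rubin_pool(estimates, variances):
    """Pool M per-imputation estimates and their squared standard errors."""
    q = np.asarray(estimates, dtype=float)
    u = np.asarray(variances, dtype=float)
    m = q.size
    q_bar = q.mean()                 # pooled point estimate
    within = u.mean()                # average within-imputation variance
    between = q.var(ddof=1)          # between-imputation variance
    total = within + (1.0 + 1.0 / m) * between
    return q_bar, np.sqrt(total)


# Made-up numbers for M = 5 completed frames, for illustration only.
est = [0.58, 0.62, 0.60, 0.64, 0.56]
var = [0.0016, 0.0016, 0.0016, 0.0016, 0.0016]
pooled, se = rubin_pool(est, var)
print(f"pooled estimate = {pooled:.3f}, pooled SE = {se:.4f}")
```

The between-imputation term is what carries imputation uncertainty on this path, which is the role the re-fit imputer plays inside each bootstrap replicate on the reimpute path.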
