# Multiple imputation via ``reimpute``

The ``reimpute`` stage implements bootstrap-then-impute: one imputation per
bootstrap replicate, with the imputer re-fit from scratch each time.  This
propagates imputation-model parameter uncertainty into the bootstrap
distribution, so the bootstrap confidence intervals are valid without a
separate pooling step under Rubin's rules.
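
To make the loop concrete, here is a minimal from-scratch sketch of bootstrap-then-impute using only NumPy and pandas. It is not the package's implementation: ``stochastic_regression_imputer``, ``ols_slope``, and ``bootstrap_then_impute`` are hypothetical names introduced for illustration, and the toy imputer assumes missingness only in ``x1``.

```python
import numpy as np
import pandas as pd


def stochastic_regression_imputer(frame, rng):
    """Hypothetical imputer: regress x1 on (1, x2, y), then add a residual draw."""
    out = frame.copy()
    miss = out["x1"].isna().to_numpy()
    if not miss.any():
        return out
    Z = np.column_stack(
        [np.ones(len(out)), out["x2"].to_numpy(), out["y"].to_numpy()]
    )
    observed = out.loc[~miss, "x1"].to_numpy()
    # Re-fit the imputation model on the rows where x1 is observed.
    beta, *_ = np.linalg.lstsq(Z[~miss], observed, rcond=None)
    sigma = (observed - Z[~miss] @ beta).std(ddof=Z.shape[1])
    # Conditional mean plus a residual draw: this is what makes it stochastic.
    out.loc[miss, "x1"] = Z[miss] @ beta + rng.normal(scale=sigma, size=miss.sum())
    return out


def ols_slope(frame):
    """Coefficient on x1 from an OLS fit of y on (1, x1, x2)."""
    X = np.column_stack(
        [np.ones(len(frame)), frame["x1"].to_numpy(), frame["x2"].to_numpy()]
    )
    coef, *_ = np.linalg.lstsq(X, frame["y"].to_numpy(), rcond=None)
    return coef[1]


def bootstrap_then_impute(incomplete, n_boot, seed):
    rng = np.random.default_rng(seed)
    n = len(incomplete)
    draws = np.empty(n_boot)
    for b in range(n_boot):
        # 1. Resample rows of the *incomplete* frame with replacement.
        idx = rng.integers(0, n, size=n)
        resampled = incomplete.iloc[idx].reset_index(drop=True)
        # 2. Re-fit the imputer on this replicate and draw one imputation.
        completed = stochastic_regression_imputer(resampled, rng)
        # 3. Re-estimate the quantity of interest on the completed replicate.
        draws[b] = ols_slope(completed)
    return draws


# Toy data with gaps in x1 only.
data_rng = np.random.default_rng(0)
n = 300
toy = pd.DataFrame({"x1": data_rng.normal(size=n), "x2": data_rng.normal(size=n)})
toy["y"] = 1.0 + 0.6 * toy["x1"] - 0.4 * toy["x2"] + data_rng.normal(scale=0.5, size=n)
toy.loc[data_rng.uniform(size=n) < 0.25, "x1"] = np.nan

draws = bootstrap_then_impute(toy, n_boot=200, seed=7)
ci_low, ci_high = np.percentile(draws, [2.5, 97.5])  # percentile CI for the x1 slope
```

Because the imputation model is re-estimated inside the loop, its parameter uncertainty shows up in the spread of ``draws`` alongside the ordinary sampling variability.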

## When to use this

- Your data has missing values that you do not want to listwise-delete.
- You have a **runnable imputer** (a Python callable that takes a DataFrame
  and returns an imputed DataFrame).  Frozen, pre-computed imputed frames are
  not supported — the stage must be able to re-impute a resampled draw.
- You are willing to use **bootstrap inference** (``method="bootstrap"``).
  ``reimpute`` is invalid under delta-method or simulation inference.

## Quick example

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf
# Side-effect import: required before IterativeImputer can be imported
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer

from pymargins import Margins, reimpute

# 1. Build data with missing values in x1
#    (the mask below is independent of the data, i.e. completely at random)
rng = np.random.default_rng(42)
n = 400
df = pd.DataFrame({
    "x1": rng.normal(size=n),
    "x2": rng.normal(size=n),
})
df["y"] = 1.0 + 0.6 * df["x1"] - 0.4 * df["x2"] + rng.normal(scale=0.5, size=n)
missing = rng.uniform(size=n) < 0.25
df_nan = df.copy()
df_nan.loc[missing, "x1"] = np.nan

# 2. Fit the model on a single initial imputation
#    (the point estimate comes from this fit)
df_init = df_nan.fillna(df_nan.mean())
fit = smf.ols("y ~ x1 + x2", data=df_init).fit()

# 3. Wrap the imputer so it returns a DataFrame
imp = IterativeImputer(max_iter=10, random_state=0, sample_posterior=True)

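def imputer(frame):
    # fit_transform returns a bare array, so restore the column labels and
    # the incoming row index (resampled frames may not have a default index)
    arr = imp.fit_transform(frame)
    return pd.DataFrame(arr, columns=frame.columns, index=frame.index)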

# 4. Run bootstrap-then-impute
m = Margins(
    fit,
    transforms=[reimpute(imputer, incomplete=df_nan)],
    method="bootstrap",
    n_boot=1000,
    rng_seed=7,
)
r = m.predict(atexog={"x1": 0, "x2": 0})
```

## Key rules

### 1. The imputer must be stochastic

A deterministic imputer (e.g. ``SimpleImputer(strategy="mean")`` or
``IterativeImputer(sample_posterior=False)``) fills each gap with a marginal
or conditional mean and adds no residual draw.  The bootstrap still re-fits
the imputer on each replicate, so the filled values track that resample's
mean, but they carry no *residual variance*, and your CIs will be too narrow.

``reimpute`` runs a cheap construction-time guard: it calls the imputer twice
on the same small sample, and if the output is byte-identical it warns you.
You can suppress this with ``warn_on_deterministic=False``.
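
A guard of this kind could look roughly like the sketch below. The helper ``looks_deterministic`` and the two toy imputers are hypothetical and written only for illustration; the package's actual check may differ (for example, this sketch compares with ``DataFrame.equals`` rather than on raw bytes).

```python
import numpy as np
import pandas as pd


def looks_deterministic(imputer, frame, n_probe=50):
    """Hypothetical check: impute the same small sample twice and compare."""
    probe = frame.head(n_probe)
    first = imputer(probe.copy())
    second = imputer(probe.copy())
    return first.equals(second)


def mean_imputer(frame):
    # Deterministic: every gap gets the column mean.
    return frame.fillna(frame.mean())


_rng = np.random.default_rng(0)


def noisy_imputer(frame):
    # Stochastic: column mean plus a normal draw scaled by the observed std.
    out = frame.copy()
    for col in out.columns:
        miss = out[col].isna().to_numpy()
        if miss.any():
            center, spread = out[col].mean(), out[col].std()
            out.loc[miss, col] = center + _rng.normal(scale=spread, size=miss.sum())
    return out


probe_df = pd.DataFrame({
    "x1": [0.1, np.nan, 0.7, np.nan, -0.3],
    "x2": [1.0, 2.0, 3.0, 4.0, 5.0],
})
det_flag = looks_deterministic(mean_imputer, probe_df)   # identical output twice
sto_flag = looks_deterministic(noisy_imputer, probe_df)  # draws differ between calls
```

Note that a check like this consumes two draws from the imputer's random stream, which is one more reason to seed the imputer explicitly.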

### 2. Seed the imputer for reproducibility

The session ``rng_seed`` controls the bootstrap resample indices, but it does
**not** automatically seed your imputer.  For reproducible draws you must set
``random_state`` on the imputer object itself:

```python
imp = IterativeImputer(random_state=42, sample_posterior=True)
```

If the imputer exposes ``random_state=None``, ``reimpute`` warns at
construction.

**Important:** create a fresh imputer object for each ``Margins`` session.
A single imputer instance shared across sessions can carry internal state
over from the first session into the next, which breaks reproducibility even
with a fixed ``random_state``.
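
One way to make "fresh object per session" hard to forget is to build the imputer through a small factory. The sketch below uses a hand-rolled NumPy imputer rather than scikit-learn so that the state carry-over is easy to see; ``make_imputer`` is a hypothetical helper, not part of the package.

```python
import numpy as np
import pandas as pd


def make_imputer(seed):
    """Hypothetical factory: each call returns an imputer with its own fresh RNG."""
    rng = np.random.default_rng(seed)

    def imputer(frame):
        out = frame.copy()
        for col in out.columns:
            miss = out[col].isna().to_numpy()
            if miss.any():
                center, spread = out[col].mean(), out[col].std()
                out.loc[miss, col] = rng.normal(center, spread, size=miss.sum())
        return out

    return imputer


frame = pd.DataFrame({"x1": [0.2, np.nan, 1.1, np.nan, -0.5, 0.9]})

# Two sessions, each with a fresh imputer built from the same seed:
# the draws match.
session_a = make_imputer(42)(frame)
session_b = make_imputer(42)(frame)

# One imputer reused for two sessions: its RNG has already advanced,
# so the second session sees different draws despite the fixed seed.
shared = make_imputer(42)
reuse_first = shared(frame)
reuse_second = shared(frame)
```

The same pattern applies to a scikit-learn imputer: construct it inside the factory with a fixed ``random_state`` and call the factory once per session.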

### 3. The point estimate is single-imputation

The reported ``estimate`` comes from the model you passed to ``Margins``
(fitted on the initial single imputation).  The bootstrap supplies
imputation-aware **CIs**, not a pooled point estimate.  This matches how the
package already treats bootstrap: the estimate is from the original fit, the
CI from the resample distribution.

### 4. Structural columns must be complete

Columns that define the inference design — ``cluster=``, ``survey_design``
PSU/strata, and ``weights=`` — must not contain missing values.  Only
*substantive* columns may be imputed.
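
It can be convenient to verify this before building the session. The following is a small pre-flight check, assuming you know which columns play a structural role; ``check_structural_complete`` and the column names are hypothetical and not part of the package.

```python
import numpy as np
import pandas as pd


def check_structural_complete(frame, structural_cols):
    """Hypothetical helper: raise if any design-defining column has gaps."""
    n_missing = frame[list(structural_cols)].isna().sum()
    bad = n_missing[n_missing > 0]
    if not bad.empty:
        raise ValueError(
            f"structural columns contain missing values: {list(bad.index)}"
        )


design_df = pd.DataFrame({
    "x1": [0.3, np.nan, 1.2, np.nan],   # substantive: may be imputed
    "w": [1.0, 2.0, 1.5, 1.0],          # weights: must be complete
    "school": ["a", "a", "b", None],    # cluster id: has a gap
})

check_structural_complete(design_df, ["w"])  # passes silently

try:
    check_structural_complete(design_df, ["w", "school"])
    error_message = None
except ValueError as exc:
    error_message = str(exc)  # names the offending column
```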

### 5. Bootstrap only

``reimpute`` sets ``requires_resampling=True``, so ``method="delta"`` or
``method="simulation"`` raises at construction.

## Comparison to plain bootstrap

Bootstrap-then-impute widens the CI for coefficients affected by missingness
because each replicate draws a different imputation.  Coefficients with no
missingness in their predictors are essentially unchanged.

## Limitations

- **Frozen frames are not supported on this path.** ``reimpute`` needs a
  re-runnable imputer because it re-imputes every bootstrap replicate. If you
  already hold M *precomputed* completed frames (e.g. from R ``mice`` exported
  to CSV), pool them with Rubin's rules via
  [``pool_imputations``](pooling_imputations.md) instead — no Python imputer
  required, just the M completed frames.
- **BCa CIs are rejected.** The BCa jackknife operates on the raw incomplete
  frame without re-imputing, producing an inconsistent acceleration parameter.
  Use ``ci_method="percentile"`` (default) or ``"studentized"``.
- **Survey designs are incompatible.** ``survey_design`` + ``reimpute`` raises
  because the fixed-design assumption breaks when the source frame is
  incomplete.
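
For the frozen-frames case in the first limitation, the pooling step follows Rubin's rules: the pooled estimate is the mean of the M per-frame estimates, and the total variance is the mean within-imputation variance plus (1 + 1/M) times the between-imputation variance. The sketch below shows that arithmetic only; ``rubin_pool`` is a hypothetical function, the numbers are made up, and it does not reflect the signature of ``pool_imputations``.

```python
import numpy as np


def rubin_pool(estimates, variances):
    """Pool M per-imputation estimates and their variances with Rubin's rules."""
    q = np.asarray(estimates, dtype=float)
    u = np.asarray(variances, dtype=float)
    m = len(q)
    q_bar = q.mean()            # pooled point estimate
    within = u.mean()           # average within-imputation variance
    between = q.var(ddof=1)     # between-imputation variance
    total = within + (1 + 1 / m) * between
    return q_bar, total


# Illustrative inputs: three completed frames, one coefficient each.
pooled_est, pooled_var = rubin_pool(
    estimates=[0.58, 0.62, 0.60],
    variances=[0.0010, 0.0012, 0.0011],
)
```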