--- jupytext: text_representation: extension: .md format_name: myst format_version: 0.13 kernelspec: display_name: Python 3 language: python name: python3 --- # scikit-learn models ```{code-cell} python import numpy as np import pandas as pd from sklearn.linear_model import LinearRegression from sklearn.ensemble import GradientBoostingRegressor from pymargins import Margins from pymargins.adapters import SklearnBootstrapAdapter rng = np.random.default_rng(42) n = 1000 df = pd.DataFrame({ "age": rng.normal(50, 10, size=n), "female": rng.binomial(1, 0.52, n), "treated": rng.binomial(1, 0.40, n), }) df["age_sq"] = df["age"] ** 2 df["y"] = ( 10.0 + 0.2 * df["age"] + 0.001 * df["age_sq"] - 0.5 * df["female"] + 0.8 * df["treated"] + rng.normal(size=n) ) ``` `pymargins` supports scikit-learn estimators through a bootstrap-only adapter. Because sklearn models do not expose a parametric coefficient vector or covariance matrix, **only bootstrap inference is available**. Delta-method and Krinsky–Robb simulation are not supported. :::{important} For linear models (OLS, GLM, etc.) prefer statsmodels, linearmodels, or another parametric backend. scikit-learn support is intended for non-parametric estimators — random forests, gradient boosting, support vector machines — where bootstrap is already the natural inference strategy. ::: ## Basic workflow Fit the sklearn estimator as usual, wrap it in `SklearnBootstrapAdapter`, and open a `Margins` session with `method="bootstrap"`: ```{code-cell} python X = df[["age", "female", "treated", "age_sq"]] y = df["y"] model = LinearRegression() model.fit(X, y) adapter = SklearnBootstrapAdapter(model, X_train=X, y_train=y) m = Margins(model, adapter=adapter, method="bootstrap", n_boot=200, rng_seed=42) print(m.predict(atexog={"treated": [0, 1]}).summary()) ``` ## Slopes with derived terms If your model includes interactions, polynomials, splines, or other transformations, pass a `formula=` and `data=` so that `dydx()` can re-evaluate derived terms when it perturbs a column. The formula-built design matrix must match the columns the model was trained on. Here we fit without an intercept and suppress the intercept in the formula so the column counts align: ```{code-cell} python model_poly = LinearRegression(fit_intercept=False) X_poly = pd.DataFrame({ "age": df["age"], "female": df["female"], "treated": df["treated"], "age_sq": df["age_sq"], }) model_poly.fit(X_poly, df["y"]) adapter_poly = SklearnBootstrapAdapter( model_poly, formula="y ~ 0 + age + female + treated + I(age**2)", data=df, target_name="y", ) m_poly = Margins( model_poly, adapter=adapter_poly, method="bootstrap", n_boot=200, rng_seed=42, ) slope = m_poly.dydx("age") print(slope.summary()) ``` Without `formula=`, the adapter falls back to simple column selection. If derived terms are present this raises a clear error rather than silently returning an incorrect slope. ## Non-parametric estimators The same pattern works for tree-based models. Point estimates come from the fitted model; uncertainty comes from resampling: ```{code-cell} python gbr = GradientBoostingRegressor(n_estimators=100, max_depth=3, random_state=42) gbr.fit(X, y) adapter_gbr = SklearnBootstrapAdapter(gbr, X_train=X, y_train=y) m_gbr = Margins(gbr, adapter=adapter_gbr, method="bootstrap", n_boot=100, rng_seed=42) print(m_gbr.predict(atexog={"treated": [0, 1]}).summary()) ``` ## Limitations - **No delta / simulation** — the adapter returns dummy coefficients and covariance; all uncertainty is bootstrap-based. - **No custom `vcov=`** — bootstrap does not use a parametric covariance matrix, so passing `vcov=` raises an error. - **Formula verification** — when `formula=` is provided, the adapter verifies that the formula-built design matrix reproduces the model's predictions. Mismatches (e.g. an unexpected intercept) raise `ValueError` with a diagnostic message. - **Parallel bootstrap** — `n_jobs > 1` is supported via thread pools. Set `n_jobs=1` if the underlying sklearn estimator is not thread-safe.