scikit-learn models¶
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.ensemble import GradientBoostingRegressor
from pymargins import Margins
from pymargins.adapters import SklearnBootstrapAdapter
rng = np.random.default_rng(42)
n = 1000
df = pd.DataFrame({
"age": rng.normal(50, 10, size=n),
"female": rng.binomial(1, 0.52, n),
"treated": rng.binomial(1, 0.40, n),
})
df["age_sq"] = df["age"] ** 2
df["y"] = (
10.0
+ 0.2 * df["age"]
+ 0.001 * df["age_sq"]
- 0.5 * df["female"]
+ 0.8 * df["treated"]
+ rng.normal(size=n)
)
pymargins supports scikit-learn estimators through a bootstrap-only
adapter. Because sklearn models do not expose a parametric coefficient
vector or covariance matrix, only bootstrap inference is available.
Delta-method and Krinsky–Robb simulation are not supported.
Important
For linear models (OLS, GLM, etc.) prefer statsmodels, linearmodels, or another parametric backend. scikit-learn support is intended for non-parametric estimators — random forests, gradient boosting, support vector machines — where bootstrap is already the natural inference strategy.
Basic workflow¶
Fit the sklearn estimator as usual, wrap it in
SklearnBootstrapAdapter, and open a Margins session with
method="bootstrap":
X = df[["age", "female", "treated", "age_sq"]]
y = df["y"]
model = LinearRegression()
model.fit(X, y)
adapter = SklearnBootstrapAdapter(model, X_train=X, y_train=y)
m = Margins(model, adapter=adapter, method="bootstrap", n_boot=200, rng_seed=42)
print(m.predict(atexog={"treated": [0, 1]}).summary())
================================================================
Margins Result (bootstrap, level=0.95)
================================================================
estimate std err statistic P>|z| [95% Conf. Int.]
----------------------------------------------------------------
treated=0 22.2613 0.0388 22.2613 0.000 22.1899, 22.3444
treated=1 23.0697 0.0574 23.0697 0.000 22.9531, 23.1762
================================================================
n = 1000
κ: max=inf
Slopes with derived terms¶
If your model includes interactions, polynomials, splines, or other
transformations, pass a formula= and data= so that dydx() can
re-evaluate derived terms when it perturbs a column.
The formula-built design matrix must match the columns the model was trained on. Here we fit without an intercept and suppress the intercept in the formula so the column counts align:
model_poly = LinearRegression(fit_intercept=False)
X_poly = pd.DataFrame({
"age": df["age"],
"female": df["female"],
"treated": df["treated"],
"age_sq": df["age_sq"],
})
model_poly.fit(X_poly, df["y"])
adapter_poly = SklearnBootstrapAdapter(
model_poly,
formula="y ~ 0 + age + female + treated + I(age**2)",
data=df,
target_name="y",
)
m_poly = Margins(
model_poly, adapter=adapter_poly,
method="bootstrap", n_boot=200, rng_seed=42,
)
slope = m_poly.dydx("age")
print(slope.summary())
==========================================================
Margins Result (bootstrap, level=0.95)
==========================================================
estimate std err statistic P>|z| [95% Conf. Int.]
----------------------------------------------------------
age 0.3140 0.0042 0.3140 0.000 0.3060, 0.3224
==========================================================
n = 1000
κ: inf
Without formula=, the adapter falls back to simple column selection.
If derived terms are present this raises a clear error rather than
silently returning an incorrect slope.
Non-parametric estimators¶
The same pattern works for tree-based models. Point estimates come from the fitted model; uncertainty comes from resampling:
gbr = GradientBoostingRegressor(n_estimators=100, max_depth=3, random_state=42)
gbr.fit(X, y)
adapter_gbr = SklearnBootstrapAdapter(gbr, X_train=X, y_train=y)
m_gbr = Margins(gbr, adapter=adapter_gbr, method="bootstrap", n_boot=100, rng_seed=42)
print(m_gbr.predict(atexog={"treated": [0, 1]}).summary())
================================================================
Margins Result (bootstrap, level=0.95)
================================================================
estimate std err statistic P>|z| [95% Conf. Int.]
----------------------------------------------------------------
treated=0 22.2911 0.0401 22.2911 0.000 22.2261, 22.3792
treated=1 23.0330 0.0558 23.0330 0.000 22.9096, 23.1099
================================================================
n = 1000
κ: max=inf
Limitations¶
No delta / simulation — the adapter returns dummy coefficients and covariance; all uncertainty is bootstrap-based.
No custom
vcov=— bootstrap does not use a parametric covariance matrix, so passingvcov=raises an error.Formula verification — when
formula=is provided, the adapter verifies that the formula-built design matrix reproduces the model’s predictions. Mismatches (e.g. an unexpected intercept) raiseValueErrorwith a diagnostic message.Parallel bootstrap —
n_jobs > 1is supported via thread pools. Setn_jobs=1if the underlying sklearn estimator is not thread-safe.