Formula interface for array-fit models¶
statsmodels formula API (smf.glm(...)) handles interactions, polynomials,
and splines automatically — the fitted result carries a design_info
object that pymargins uses to rebuild the design matrix for
perturbations and counterfactuals.
Array-fit models (sm.GLM(y, X)) and linearmodels estimators do not
carry this metadata. Without it, dydx() on a variable involved in an
interaction or polynomial silently produces the wrong slope because the
adapter cannot re-evaluate the derived term.
pymargins solves this with an explicit formula= (and data=)
argument that builds a frozen FormulaSpec at session construction.
When do you need it?¶
Model construction |
Needs |
|---|---|
|
No — native formula support |
|
Yes |
|
Yes |
|
No — native formula support |
|
Yes |
Any sklearn estimator with derived terms |
Yes |
Passing formula= to Margins¶
import statsmodels.api as sm
from pymargins import Margins
# Array-fit OLS with a quadratic
X = pd.DataFrame({
"Intercept": 1.0,
"age": df["age"],
"treat": df["treat"],
"I(age ** 2)": df["age"] ** 2,
})
fit = sm.OLS(df["y"], X).fit()
# Pass the formula so dydx() knows age appears inside I(age**2)
m = Margins.linear_scale(
fit,
formula="y ~ age + treat + I(age**2)",
data=df,
)
print(m.dydx("age").summary())
The formula= string must exactly match the model specification,
including the intercept. If it does not, pymargins raises a
verification error showing the max difference in linear predictor.
Margins.from_formula shorthand¶
For convenience, there is a classmethod that forwards formula= and
data= directly:
m = Margins.from_formula(
fit,
formula="y ~ age + treat + I(age**2)",
data=df,
)
Stateful transforms¶
FormulaSpec freezes transform parameters from the training data, so
counterfactuals and perturbations use the training-time state rather
than recomputing it on the new data:
# Centering: the new data is centered by the training mean, not its own mean
m = Margins.linear_scale(
fit,
formula="y ~ center(age) + treat",
data=df,
)
Categorical levels¶
The full training level set is preserved, even when the evaluation data lacks some levels:
m = Margins.linear_scale(
fit,
formula="y ~ C(region) + age",
data=df, # df has regions North, South, East, West
)
# Prediction on a subset that only contains North and South still
# produces the same design matrix columns (West dummy is all zeros).
print(m.predict(atexog={"region": ["North", "South"]}).summary())
What happens without formula=?¶
If you pass data= but not formula=, the adapter falls back to column
selection (df.reindex(columns=exog_names)). When derived terms are
detected in the column names, pymargins emits a warning:
UserWarning: column-selection fallback cannot reproduce derived terms ...
If no derived terms are present, the fallback is exact and no warning is raised.
For sklearn models without formula=, the adapter raises an error
when derived terms are present, because silently incorrect slopes are
especially dangerous for black-box models where the user cannot easily
audit the design matrix by hand.