Formula interface for array-fit models

statsmodels formula API (smf.glm(...)) handles interactions, polynomials, and splines automatically — the fitted result carries a design_info object that pymargins uses to rebuild the design matrix for perturbations and counterfactuals.

Array-fit models (sm.GLM(y, X)) and linearmodels estimators do not carry this metadata. Without it, dydx() on a variable involved in an interaction or polynomial silently produces the wrong slope because the adapter cannot re-evaluate the derived term.

pymargins solves this with an explicit formula= (and data=) argument that builds a frozen FormulaSpec at session construction.

When do you need it?

Model construction

Needs formula=?

smf.glm("y ~ x1 * x2", data=df)

No — native formula support

sm.GLM(y, X) where X contains x1, x2, x1:x2

Yes

sm.OLS(y, X) with I(age**2) in X

Yes

linearmodels.PanelOLS.from_formula(...)

No — native formula support

linearmodels.PanelOLS(..., exog=X) with derived terms

Yes

Any sklearn estimator with derived terms

Yes

Passing formula= to Margins

import statsmodels.api as sm
from pymargins import Margins

# Array-fit OLS with a quadratic
X = pd.DataFrame({
    "Intercept": 1.0,
    "age": df["age"],
    "treat": df["treat"],
    "I(age ** 2)": df["age"] ** 2,
})
fit = sm.OLS(df["y"], X).fit()

# Pass the formula so dydx() knows age appears inside I(age**2)
m = Margins.linear_scale(
    fit,
    formula="y ~ age + treat + I(age**2)",
    data=df,
)
print(m.dydx("age").summary())

The formula= string must exactly match the model specification, including the intercept. If it does not, pymargins raises a verification error showing the max difference in linear predictor.

Margins.from_formula shorthand

For convenience, there is a classmethod that forwards formula= and data= directly:

m = Margins.from_formula(
    fit,
    formula="y ~ age + treat + I(age**2)",
    data=df,
)

Stateful transforms

FormulaSpec freezes transform parameters from the training data, so counterfactuals and perturbations use the training-time state rather than recomputing it on the new data:

# Centering: the new data is centered by the training mean, not its own mean
m = Margins.linear_scale(
    fit,
    formula="y ~ center(age) + treat",
    data=df,
)

Categorical levels

The full training level set is preserved, even when the evaluation data lacks some levels:

m = Margins.linear_scale(
    fit,
    formula="y ~ C(region) + age",
    data=df,   # df has regions North, South, East, West
)
# Prediction on a subset that only contains North and South still
# produces the same design matrix columns (West dummy is all zeros).
print(m.predict(atexog={"region": ["North", "South"]}).summary())

What happens without formula=?

If you pass data= but not formula=, the adapter falls back to column selection (df.reindex(columns=exog_names)). When derived terms are detected in the column names, pymargins emits a warning:

UserWarning: column-selection fallback cannot reproduce derived terms ...

If no derived terms are present, the fallback is exact and no warning is raised.

For sklearn models without formula=, the adapter raises an error when derived terms are present, because silently incorrect slopes are especially dangerous for black-box models where the user cannot easily audit the design matrix by hand.