Formula interface for array-fit models¶

statsmodels formula API (smf.glm(...)) handles interactions, polynomials, and splines automatically — the fitted result carries a design_info object that pymargins uses to rebuild the design matrix for perturbations and counterfactuals.

Array-fit models (sm.GLM(y, X)) and linearmodels estimators do not carry this metadata. Without it, dydx() on a variable involved in an interaction or polynomial silently produces the wrong slope because the adapter cannot re-evaluate the derived term.

pymargins solves this with an explicit formula= (and data=) argument that builds a frozen FormulaSpec at session construction.

When do you need it?¶

Model construction	Needs `formula=`?
`smf.glm("y ~ x1 * x2", data=df)`	No — native formula support
`sm.GLM(y, X)` where `X` contains `x1`, `x2`, `x1:x2`	Yes
`sm.OLS(y, X)` with `I(age**2)` in `X`	Yes
`linearmodels.PanelOLS.from_formula(...)`	No — native formula support
`linearmodels.PanelOLS(..., exog=X)` with derived terms	Yes
Any sklearn estimator with derived terms	Yes

Passing `formula=` to `Margins`¶

import statsmodels.api as sm
from pymargins import Margins

# Array-fit OLS with a quadratic
X = pd.DataFrame({
    "Intercept": 1.0,
    "age": df["age"],
    "treat": df["treat"],
    "I(age ** 2)": df["age"] ** 2,
})
fit = sm.OLS(df["y"], X).fit()

# Pass the formula so dydx() knows age appears inside I(age**2)
m = Margins.linear_scale(
    fit,
    formula="y ~ age + treat + I(age**2)",
    data=df,
)
print(m.dydx("age").summary())

The formula= string must exactly match the model specification, including the intercept. If it does not, pymargins raises a verification error showing the max difference in linear predictor.

`Margins.from_formula` shorthand¶

For convenience, there is a classmethod that forwards formula= and data= directly:

m = Margins.from_formula(
    fit,
    formula="y ~ age + treat + I(age**2)",
    data=df,
)

Stateful transforms¶

FormulaSpec freezes transform parameters from the training data, so counterfactuals and perturbations use the training-time state rather than recomputing it on the new data:

# Centering: the new data is centered by the training mean, not its own mean
m = Margins.linear_scale(
    fit,
    formula="y ~ center(age) + treat",
    data=df,
)

Categorical levels¶

The full training level set is preserved, even when the evaluation data lacks some levels:

m = Margins.linear_scale(
    fit,
    formula="y ~ C(region) + age",
    data=df,   # df has regions North, South, East, West
)
# Prediction on a subset that only contains North and South still
# produces the same design matrix columns (West dummy is all zeros).
print(m.predict(atexog={"region": ["North", "South"]}).summary())

What happens without `formula=`?¶

If you pass data= but not formula=, the adapter falls back to column selection (df.reindex(columns=exog_names)). When derived terms are detected in the column names, pymargins emits a warning:

UserWarning: column-selection fallback cannot reproduce derived terms ...

If no derived terms are present, the fallback is exact and no warning is raised.

For sklearn models without formula=, the adapter raises an error when derived terms are present, because silently incorrect slopes are especially dangerous for black-box models where the user cannot easily audit the design matrix by hand.

Formula interface for array-fit models¶

When do you need it?¶

Passing `formula=` to `Margins`¶

`Margins.from_formula` shorthand¶

Stateful transforms¶

Categorical levels¶

What happens without `formula=`?¶

pymargins

Navigation

Related Topics

Formula interface for array-fit models¶

When do you need it?¶

Passing formula= to Margins¶

Margins.from_formula shorthand¶

Stateful transforms¶

Categorical levels¶

What happens without formula=?¶

Passing `formula=` to `Margins`¶

`Margins.from_formula` shorthand¶

What happens without `formula=`?¶