# Formula interface for array-fit models

statsmodels formula API (`smf.glm(...)`) handles interactions, polynomials,
and splines automatically — the fitted result carries a `design_info`
object that `pymargins` uses to rebuild the design matrix for
perturbations and counterfactuals.

Array-fit models (`sm.GLM(y, X)`) and linearmodels estimators do **not**
carry this metadata.  Without it, `dydx()` on a variable involved in an
interaction or polynomial silently produces the wrong slope because the
adapter cannot re-evaluate the derived term.

`pymargins` solves this with an explicit `formula=` (and `data=`)
argument that builds a frozen `FormulaSpec` at session construction.

## When do you need it?

| Model construction | Needs `formula=`? |
|--------------------|-------------------|
| `smf.glm("y ~ x1 * x2", data=df)` | No — native formula support |
| `sm.GLM(y, X)` where `X` contains `x1`, `x2`, `x1:x2` | **Yes** |
| `sm.OLS(y, X)` with `I(age**2)` in `X` | **Yes** |
| `linearmodels.PanelOLS.from_formula(...)` | No — native formula support |
| `linearmodels.PanelOLS(..., exog=X)` with derived terms | **Yes** |
| Any sklearn estimator with derived terms | **Yes** |

## Passing `formula=` to `Margins`

```python
import statsmodels.api as sm
from pymargins import Margins

# Array-fit OLS with a quadratic
X = pd.DataFrame({
    "Intercept": 1.0,
    "age": df["age"],
    "treat": df["treat"],
    "I(age ** 2)": df["age"] ** 2,
})
fit = sm.OLS(df["y"], X).fit()

# Pass the formula so dydx() knows age appears inside I(age**2)
m = Margins.linear_scale(
    fit,
    formula="y ~ age + treat + I(age**2)",
    data=df,
)
print(m.dydx("age").summary())
```

The `formula=` string must exactly match the model specification,
including the intercept.  If it does not, `pymargins` raises a
verification error showing the max difference in linear predictor.

## `Margins.from_formula` shorthand

For convenience, there is a classmethod that forwards `formula=` and
`data=` directly:

```python
m = Margins.from_formula(
    fit,
    formula="y ~ age + treat + I(age**2)",
    data=df,
)
```

## Stateful transforms

`FormulaSpec` freezes transform parameters from the training data, so
counterfactuals and perturbations use the *training-time* state rather
than recomputing it on the new data:

```python
# Centering: the new data is centered by the training mean, not its own mean
m = Margins.linear_scale(
    fit,
    formula="y ~ center(age) + treat",
    data=df,
)
```

## Categorical levels

The full training level set is preserved, even when the evaluation data
lacks some levels:

```python
m = Margins.linear_scale(
    fit,
    formula="y ~ C(region) + age",
    data=df,   # df has regions North, South, East, West
)
# Prediction on a subset that only contains North and South still
# produces the same design matrix columns (West dummy is all zeros).
print(m.predict(atexog={"region": ["North", "South"]}).summary())
```

## What happens without `formula=`?

If you pass `data=` but not `formula=`, the adapter falls back to column
selection (`df.reindex(columns=exog_names)`).  When derived terms are
detected in the column names, `pymargins` emits a warning:

```
UserWarning: column-selection fallback cannot reproduce derived terms ...
```

If no derived terms are present, the fallback is exact and no warning is
raised.

For sklearn models without `formula=`, the adapter **raises** an error
when derived terms are present, because silently incorrect slopes are
especially dangerous for black-box models where the user cannot easily
audit the design matrix by hand.