# Gradient backend: autodiff vs wrapped-FD vs FD

`pymargins` computes the delta-method Jacobian with JAX automatic
differentiation when possible, with autodiff over a custom-JVP
finite-difference (FD) primitive when the model's predict is a black
box, and with full finite differences as a last resort.

| Backend       | When picked                                     | Pros                                | Cons                                          |
|---------------|-------------------------------------------------|-------------------------------------|-----------------------------------------------|
| `autodiff`    | predict can be expressed in JAX (GLMs, OLS)     | exact gradient and Hessian          | model must be JAX-implementable               |
| `wrapped_fd`  | black-box predict, but `η = X β` is accessible  | exact gradient outside the boundary | one FD call per parameter at the boundary     |
| `fd`          | full black box                                  | works on anything                   | FD error compounds in the Hessian (noisy κ)   |

The session argument `gradient_backend="auto"` picks the best
available path per adapter; once picked, the choice stays fixed for
the rest of the session.
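
As a point of reference for the `autodiff` row, the sketch below shows what "exact gradient and Hessian" means for a delta-method calculation when predict is written in JAX. It does not use the `pymargins` API: the `estimand` function, the design matrix, and the covariance matrix `V` are made-up illustrations.

```python
import jax
import jax.numpy as jnp

# Double precision keeps the comparison with analytic values tight.
jax.config.update("jax_enable_x64", True)


def estimand(beta, X):
    """Illustrative estimand: average predicted probability of a logistic model."""
    return jnp.mean(jax.nn.sigmoid(X @ beta))


# Made-up data, coefficients, and coefficient covariance (assumptions).
X = jnp.array([[1.0, 0.2], [1.0, -0.5], [1.0, 1.3]])
beta_hat = jnp.array([0.1, 0.8])
V = jnp.array([[0.04, 0.01], [0.01, 0.09]])

# Exact gradient of the scalar estimand with respect to beta, shape (2,).
J = jax.grad(estimand)(beta_hat, X)

# Delta-method standard error: sqrt(J V J^T).
se = jnp.sqrt(J @ V @ J)

# Exact Hessian, shape (2, 2), available because everything is JAX.
H = jax.hessian(estimand)(beta_hat, X)
```

Both derivatives come from the same traced function, so no step size has to be chosen anywhere.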

## The custom-JVP bridge

For a black-box predict `f(β, X)`, the wrapped-FD path wraps the
*predict boundary itself* in a JAX primitive whose custom JVP is
computed by central differences. Once the primitive is registered, all
downstream estimand math (averaging over rows, applying `phi`,
forming contrasts) is differentiated by autodiff. FD error is
therefore confined to that single primitive call instead of
compounding through the estimand.

This is the recommended path for adapter implementers when JAX
reimplementation would be error-prone but the linear predictor
`η = X β` is exposed by the fitted result. The helpers
`make_predict_with_fd_jvp` and `make_glm_jvp_wrapper` in
`pymargins._gradients` factor out the boilerplate.
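
The idea can be sketched with `jax.custom_jvp` and `jax.pure_callback`. This is a minimal illustration under stated assumptions, not the implementation behind `make_predict_with_fd_jvp`: the names `blackbox_predict` and `make_fd_predict`, the step size `h`, and the data are all hypothetical.

```python
import jax
import jax.numpy as jnp
import numpy as np

# Central differences need double precision to be useful.
jax.config.update("jax_enable_x64", True)


def blackbox_predict(beta, X):
    """Stand-in for a non-JAX predict: logistic mean in plain NumPy."""
    eta = np.asarray(X) @ np.asarray(beta)
    return 1.0 / (1.0 + np.exp(-eta))


def make_fd_predict(predict_np, X, h=1e-6):
    """Wrap predict_np(beta, X) so that JAX sees a function with a FD-based JVP."""
    X = np.asarray(X, dtype=np.float64)
    n, p = X.shape

    def predict_host(beta):
        return np.asarray(predict_np(beta, X), dtype=np.float64)

    def fd_jacobian_host(beta):
        # One central difference per parameter, evaluated outside JAX.
        beta = np.asarray(beta, dtype=np.float64)
        jac = np.empty((n, p))
        for j in range(p):
            step = np.zeros(p)
            step[j] = h
            jac[:, j] = (predict_np(beta + step, X) - predict_np(beta - step, X)) / (2 * h)
        return jac

    @jax.custom_jvp
    def predict(beta):
        out = jax.ShapeDtypeStruct((n,), jnp.float64)
        return jax.pure_callback(predict_host, out, beta)

    @predict.defjvp
    def predict_jvp(primals, tangents):
        (beta,), (t,) = primals, tangents
        jac_shape = jax.ShapeDtypeStruct((n, p), jnp.float64)
        jac = jax.pure_callback(fd_jacobian_host, jac_shape, beta)
        # The tangent enters only through a matrix product, so the rule is linear in t.
        return predict(beta), jac @ t

    return predict


# Made-up design matrix and coefficients (assumptions).
X = np.array([[1.0, 0.2, -1.0], [1.0, -0.5, 0.3], [1.0, 1.3, 0.7], [1.0, 0.0, 2.0]])
beta = jnp.array([0.1, 0.8, -0.4])

predict = make_fd_predict(blackbox_predict, X)

# Downstream estimand math is ordinary JAX: here, the average prediction.
grad_ame = jax.grad(lambda b: jnp.mean(predict(b)))(beta)
```

Building the Jacobian from the primal values alone and then multiplying by the tangent keeps the JVP rule linear in the tangent, which is what lets `jax.grad` transpose it for reverse mode. As written, the sketch is meant for first derivatives only; the callback itself has no differentiation rule of its own.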

## When full FD is unavoidable

For models with no exposed `η = X β` structure (rare, but it
happens with some bespoke fitters and certain mixture models), the
full-FD backend differentiates the entire estimand through the model's
predict function. Gradient quality is acceptable, but *Hessian*
quality is poor, so κ is noisier and the fallback decision becomes
less reliable. In these cases, prefer a bootstrap or simulation
session, and consider exposing the linear predictor so that the
adapter can be upgraded to `wrapped_fd`.
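
The gap between gradient and Hessian quality follows from how rounding error scales: a first central difference divides function noise by `h`, while a second difference divides it by `h²`. The sketch below makes this concrete for a toy estimand with known analytic derivatives. The function names, step sizes, and data are assumptions for illustration and are not taken from `pymargins`.

```python
import numpy as np


def estimand(beta, X):
    """Black-box stand-in: average predicted probability of a logistic model."""
    return float(np.mean(1.0 / (1.0 + np.exp(-(X @ beta)))))


def fd_gradient(f, beta, h=1e-6):
    """Central-difference gradient: two evaluations of f per parameter."""
    p = beta.size
    grad = np.empty(p)
    for j in range(p):
        e = np.zeros(p)
        e[j] = h
        grad[j] = (f(beta + e) - f(beta - e)) / (2 * h)
    return grad


def fd_hessian(f, beta, h=1e-4):
    """Central-difference Hessian: four evaluations of f per entry."""
    p = beta.size
    hess = np.empty((p, p))
    for i in range(p):
        ei = np.zeros(p)
        ei[i] = h
        for j in range(p):
            ej = np.zeros(p)
            ej[j] = h
            hess[i, j] = (
                f(beta + ei + ej) - f(beta + ei - ej)
                - f(beta - ei + ej) + f(beta - ei - ej)
            ) / (4 * h * h)
    return hess


# Made-up design matrix and coefficients (assumptions).
X = np.array([[1.0, 0.2], [1.0, -0.5], [1.0, 1.3], [1.0, 0.0]])
beta = np.array([0.1, 0.8])
f = lambda b: estimand(b, X)

# Analytic derivatives of the toy estimand, used only to measure FD error.
s = 1.0 / (1.0 + np.exp(-(X @ beta)))
grad_exact = (s * (1 - s)) @ X / X.shape[0]
hess_exact = (X * (s * (1 - s) * (1 - 2 * s))[:, None]).T @ X / X.shape[0]

grad_err = np.max(np.abs(fd_gradient(f, beta) - grad_exact))
hess_err = np.max(np.abs(fd_hessian(f, beta) - hess_exact))
```

On a smooth, cheap function like this one both errors are small. With a real fitter whose predict is noisy or only accurate to a few digits, the `h²` in the Hessian denominator amplifies that noise far more than the `h` in the gradient does.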

See [](adapter_pattern.md) for the adapter contract that decides the
backend.