Mathematical motivation

This page derives, in one place, the statistics Margins computes and the uncertainty attached to each. The motivation matches Stata’s margins-delta-method FAQ and Richard Williams’ Margins01 notes; the goal here is to expose the single Jacobian primitive the implementation reduces to, and to write down the curvature diagnostic that decides when the delta method is unsafe.

The delta method

For a fitted parameter vector \(\hat\beta\) with estimated covariance \(\widehat V\), and a (possibly vector-valued) statistic \(g(\beta)\), a first-order Taylor expansion gives

\[g(\hat\beta) \;\approx\; g(\beta_0) + G\,(\hat\beta - \beta_0), \qquad G = \left.\tfrac{\partial g}{\partial \beta}\right|_{\beta_0}.\]

Taking variances yields

\[\widehat{\operatorname{Var}}\bigl[g(\hat\beta)\bigr] \;\approx\; G\,\widehat V\,G^\top .\]

Three things to notice:

The approximation depends only on first derivatives of \(g\).
Once \(G\widehat V G^\top\) is in hand, any linear combination \(C\,g(\hat\beta)\) has covariance \(C(G\widehat V G^\top)C^\top\), with no further differentiation needed. That is why contrasts() adds no approximation error beyond the delta method itself.
pymargins computes \(G\) by JAX autodiff when an autodiff path exists for the predict function, by autodiff over a custom-JVP FD primitive when the model is black-box but has a clean \(\eta = X\beta\) linear predictor, and by full finite differences otherwise. See Gradient backend: autodiff vs wrapped-FD vs FD.
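The two formulas above can be sketched in a few lines of NumPy. This is an illustration only, not pymargins code: the logistic link, the simulated design matrices, `beta_hat`, and `V_hat` are all hypothetical, and the Jacobian is written by hand rather than obtained by autodiff.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def g(beta, designs):
    """Vector statistic: mean predicted probability under each design matrix."""
    return np.array([sigmoid(X @ beta).mean() for X in designs])

def jacobian(beta, designs):
    """Analytic G: row k is mean_i f'(x_i' beta) x_i for design k."""
    rows = []
    for X in designs:
        p = sigmoid(X @ beta)
        rows.append(((p * (1.0 - p))[:, None] * X).mean(axis=0))
    return np.vstack(rows)

# Hypothetical data: intercept, one covariate, and a binary treatment
# that is set to 0 for everyone (X0) and then to 1 for everyone (X1).
rng = np.random.default_rng(0)
n = 200
x1 = rng.normal(size=n)
X0 = np.column_stack([np.ones(n), x1, np.zeros(n)])
X1 = np.column_stack([np.ones(n), x1, np.ones(n)])

# Hypothetical fitted parameters and covariance.
beta_hat = np.array([-0.5, 0.8, 1.2])
V_hat = np.diag([0.04, 0.02, 0.09])

# Delta method: Var[g(beta_hat)] ~= G V G'.
G = jacobian(beta_hat, [X0, X1])
cov_g = G @ V_hat @ G.T

# A contrast is a linear map C of g, so its covariance reuses cov_g.
C = np.array([[-1.0, 1.0]])
contrast = C @ g(beta_hat, [X0, X1])
cov_contrast = C @ cov_g @ C.T
se_contrast = float(np.sqrt(cov_contrast[0, 0]))
```

Note that `cov_contrast` is obtained from `cov_g` by matrix multiplication alone; no derivative of the contrast is ever taken.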

A single estimand schema

Internal to the library every estimand is a triple \((h, \phi, \phi^{-1})\):

\(h(\beta)\) — the estimand on the inference scale; this is what gets differentiated for delta and what gets evaluated for simulation.
\(\phi\) — back-transform from inference scale to reporting scale, applied to CI endpoints.
\(\phi^{-1}\) — forward transform; converts user-supplied null values onto the inference scale for hypothesis tests.

The pair \((\phi, \phi^{-1})\) is session-level, not per-estimand: every call within one Margins instance is on the same inference scale. That is what makes the session-level κ diagnostic and inter-call composability work.

The Williams (2012) statistic table reduces to a single primitive:

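Statistic	\(g(\beta)\)
AAP (average adjusted prediction)	\(\tfrac{1}{n}\sum_i f(x_i^\top\beta)\)
APM (adjusted prediction at the means)	\(f(\bar x^\top\beta)\)
APR (adjusted prediction at representative values)	\(\tfrac{1}{n}\sum_i f(x_i^\top\beta)\) with some \(x\) fixed at representative values
AME (average marginal effect)	\(\tfrac{1}{n}\sum_i \partial f(x_i^\top\beta)/\partial x_{ij}\)
MEM (marginal effect at the means)	\(\partial f(\bar x^\top\beta)/\partial x_j\)
MER (marginal effect at representative values)	AME with some \(x\) held at representative values
Contrast	\(E[f(\cdot)\mid x_j=\ell] - E[f(\cdot)\mid x_j=\text{ref}]\)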

Every entry is a mean prediction, a linear combination of mean predictions, or (for the marginal effects) a covariate derivative of one. The single Jacobian primitive is

\[\frac{\partial}{\partial\beta}\, \frac{1}{n}\sum_i f(x_i^\top\beta) \;=\; \frac{1}{n}\sum_i f'(x_i^\top\beta)\,x_i .\]

Substituting back into the delta method gives a closed form for the covariance of every statistic above.
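As a sanity check on the primitive, the sketch below compares the analytic Jacobian with central finite differences for the AAP and the APM rows of the table. It assumes a logistic \(f\) and simulated data; the function names are illustrative and are not part of pymargins.

```python
import numpy as np

def f(eta):
    return 1.0 / (1.0 + np.exp(-eta))

def f_prime(eta):
    p = f(eta)
    return p * (1.0 - p)

def mean_prediction(beta, X):
    """(1/n) sum_i f(x_i' beta)."""
    return f(X @ beta).mean()

def mean_prediction_jacobian(beta, X):
    """The primitive: (1/n) sum_i f'(x_i' beta) x_i."""
    return (f_prime(X @ beta)[:, None] * X).mean(axis=0)

def central_difference(fun, beta, h=1e-6):
    """Numerical gradient, used here only to check the analytic form."""
    grad = np.zeros_like(beta)
    for j in range(beta.size):
        step = np.zeros_like(beta)
        step[j] = h
        grad[j] = (fun(beta + step) - fun(beta - step)) / (2.0 * h)
    return grad

# Hypothetical design matrix and parameter vector.
rng = np.random.default_rng(1)
X = np.column_stack([np.ones(50), rng.normal(size=(50, 2))])
beta = np.array([0.3, -1.0, 0.5])

# AAP: the primitive applied to the observed rows.
aap_grad = mean_prediction_jacobian(beta, X)
aap_grad_fd = central_difference(lambda b: mean_prediction(b, X), beta)

# APM: the same primitive applied to the single row x_bar.
x_bar = X.mean(axis=0, keepdims=True)
apm_grad = mean_prediction_jacobian(beta, x_bar)
apm_grad_fd = central_difference(lambda b: mean_prediction(b, x_bar), beta)
```

The APM needs no separate derivation: it is the primitive evaluated on a one-row design matrix.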

Beyond delta: simulation and bootstrap

The delta method is a first-order approximation: \(G\widehat V G^\top\) is exact only to the extent that \(g\) is locally linear in \(\beta\) near \(\hat\beta\). For statistics that are sharply nonlinear (a probability near 0 or 1, an elasticity at a near-zero prediction, a ratio with a small denominator) Wald intervals can extend beyond the natural support or under-cover.

pymargins exposes two alternatives behind the same method= keyword on the session.

Krinsky–Robb simulation (method="simulation"). Draw \(S\) parameter vectors \(\beta_s\sim N(\hat\beta,\widehat V)\), evaluate \(g(\beta_s)\), and read inference off the empirical distribution:

\[\widehat{\operatorname{Var}}_{\mathrm{KR}}\bigl[g(\hat\beta)\bigr] = \frac{1}{S-1}\sum_{s=1}^S \bigl(g(\beta_s) - \bar g\bigr)\bigl(g(\beta_s) - \bar g\bigr)^{\!\top}.\]

The reported point estimate stays the analytic \(g(\hat\beta)\) (not the Monte Carlo mean), matching Stata’s vce(simulation). Pointwise CIs default to empirical quantiles of the draws, so the interval for a probability near 0 or 1 comes out asymmetric and stays inside the unit interval.
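A minimal sketch of the Krinsky–Robb recipe, assuming a logistic prediction at one extreme covariate profile. The function name `krinsky_robb`, its arguments, and the numbers are hypothetical; this is not the pymargins implementation.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def krinsky_robb(g, beta_hat, V_hat, n_draws=2000, level=0.95, seed=0):
    """Simulate beta ~ N(beta_hat, V_hat) and summarize g over the draws."""
    rng = np.random.default_rng(seed)
    draws = rng.multivariate_normal(beta_hat, V_hat, size=n_draws)
    g_draws = np.array([np.atleast_1d(g(b)) for b in draws])  # shape (S, k)
    # Sample covariance with the 1/(S-1) divisor from the formula above.
    cov = np.atleast_2d(np.cov(g_draws, rowvar=False, ddof=1))
    alpha = 1.0 - level
    lower, upper = np.quantile(g_draws, [alpha / 2.0, 1.0 - alpha / 2.0], axis=0)
    # Point estimate is the analytic g(beta_hat), not the mean of the draws.
    estimate = np.atleast_1d(g(beta_hat))
    return estimate, cov, lower, upper

# Hypothetical fit: a prediction close to 1 at profile x = (1, 3).
beta_hat = np.array([2.0, 1.0])
V_hat = np.diag([0.25, 0.09])
x = np.array([1.0, 3.0])

estimate, cov, lower, upper = krinsky_robb(
    lambda b: sigmoid(x @ b), beta_hat, V_hat
)
```

Because the quantiles are taken on values of \(g\) itself, the interval cannot leave \((0, 1)\), and for this profile the lower arm is much longer than the upper one.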

Bootstrap (method="bootstrap"). Refit the model on \(B\) resamples and read inference off the bootstrap distribution. Three resampling schemes:

pairs — \((y_i, x_i)\) resampled IID (default);
cluster — whole clusters resampled, required for within-cluster correlation;
block — moving-block resampling for time series.

Failed refits are caught and counted; a RuntimeWarning fires when the failure rate exceeds 5%.
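The pairs scheme and the failure accounting can be sketched as follows. Everything here is an assumption made for illustration: the OLS refit, the names `fit_ols` and `pairs_bootstrap`, and the choice of `LinAlgError` as the failure signal. Only the 5% warning threshold comes from the text above.

```python
import warnings
import numpy as np

def fit_ols(X, y):
    """Stand-in refit; raises LinAlgError when X'X is singular."""
    return np.linalg.solve(X.T @ X, X.T @ y)

def pairs_bootstrap(fit, g, X, y, n_boot=500, max_fail_rate=0.05, seed=0):
    """Resample (y_i, x_i) rows with replacement, refit, and evaluate g."""
    rng = np.random.default_rng(seed)
    n = len(y)
    stats, failures = [], 0
    for _ in range(n_boot):
        idx = rng.integers(0, n, size=n)  # IID row indices
        try:
            stats.append(g(fit(X[idx], y[idx])))
        except np.linalg.LinAlgError:
            failures += 1  # count the failed refit and move on
    if failures / n_boot > max_fail_rate:
        warnings.warn(
            f"{failures} of {n_boot} bootstrap refits failed", RuntimeWarning
        )
    return np.asarray(stats), failures

# Hypothetical data: y = 1 + 2 x + noise.
rng = np.random.default_rng(2)
x = rng.normal(size=100)
X = np.column_stack([np.ones(100), x])
y = 1.0 + 2.0 * x + rng.normal(size=100)

def predict_at_one(beta):
    """Statistic of interest: the prediction at x = 1."""
    return beta[0] + beta[1] * 1.0

boot_stats, n_failed = pairs_bootstrap(fit_ols, predict_at_one, X, y)
boot_se = boot_stats.std(ddof=1)
```

In this sketch, cluster and block resampling would differ only in how `idx` is drawn (whole clusters or contiguous blocks instead of single rows); the refit and the failure counting stay the same.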

The κ curvature diagnostic

pymargins computes Skovgaard’s relative curvature κ for every estimand (when diagnostics are enabled) and automatically falls back to simulation when κ exceeds the session threshold (default 0.3). This is a meaningful divergence from Stata’s margins and marginaleffects, both of which default to the delta method and do not flag when it is suspect.

Heuristically, κ is the ratio of the second-derivative contribution of \(g\) (Hessian, whitened by \(\widehat V\)) to its first-derivative contribution (gradient, same whitening). When κ is small the linearization underlying the delta method is accurate; when κ is large the symmetric Wald interval can miss the true sampling distribution badly. The thresholds (0.1 / 0.3) are calibrated from the nonlinear-regression literature; both are configurable, with these values as the defaults.

See The κ curvature diagnostic for the whitening transform and the auto-fallback policy.
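To make the heuristic concrete, here is one way such a ratio could be computed. This is an assumption-laden illustration, not Skovgaard’s exact definition and not the pymargins implementation: it whitens with the Cholesky factor of \(\widehat V\), takes derivatives by finite differences, and uses the spectral norm of the whitened Hessian over the Euclidean norm of the whitened gradient. The name `curvature_ratio` is hypothetical.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def curvature_ratio(g, beta_hat, V_hat, h=1e-4):
    """Illustrative ratio ||whitened Hessian|| / ||whitened gradient||."""
    L = np.linalg.cholesky(V_hat)  # V = L L', so beta = beta_hat + L z
    k = beta_hat.size
    eye = np.eye(k)

    def g_white(z):
        # g as a function of the whitened coordinate z (identity covariance).
        return g(beta_hat + L @ z)

    # Central-difference gradient in z; equals L' grad_beta g.
    grad = np.array(
        [(g_white(h * eye[j]) - g_white(-h * eye[j])) / (2.0 * h) for j in range(k)]
    )
    # Central-difference Hessian in z; equals L' H L.
    hess = np.empty((k, k))
    for a in range(k):
        for b in range(k):
            hess[a, b] = (
                g_white(h * (eye[a] + eye[b]))
                - g_white(h * (eye[a] - eye[b]))
                - g_white(h * (eye[b] - eye[a]))
                + g_white(-h * (eye[a] + eye[b]))
            ) / (4.0 * h * h)
    return np.linalg.norm(hess, 2) / np.linalg.norm(grad)

# Hypothetical profile and covariance.
x = np.array([1.0, 2.0])
V_hat = np.diag([0.25, 0.09])

def prediction(beta):
    return sigmoid(x @ beta)

kappa_mid = curvature_ratio(prediction, np.array([0.0, 0.0]), V_hat)      # p = 0.5
kappa_extreme = curvature_ratio(prediction, np.array([3.0, 1.0]), V_hat)  # p near 1
```

For a logistic prediction this particular ratio has the closed form \(|1 - 2p|\,\mathrm{sd}(\hat\eta)\): it vanishes at \(p = 0.5\) and grows as the prediction approaches 0 or 1 or as the linear predictor becomes less precisely estimated, which is the behaviour the text attributes to κ.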

Inference scales (`phi`) and the chain rule

The Margins constructor commits to a \((\phi, \phi^{-1})\) pair via either the classmethod helpers (Margins.log_scale, Margins.logit_scale, Margins.correlation_scale, Margins.linear_scale) or by passing phi= and phi_inv= directly.

The chain rule fixes how \(\phi\) propagates through the estimand:

For a single prediction \(g(\beta) = \phi(f(\eta))\) with \(\eta = x^\top\beta\), \(\partial g/\partial\beta = \phi'(f(\eta))\,f'(\eta)\,x\).
For an AME on the \(\phi\)-scale, \(\mathrm{AME}_k^{(\phi)} = \tfrac{1}{n}\sum_i \phi'(f(\eta_i))\,f'(\eta_i)\,\beta_k\) with \(\eta_i = x_i^\top\beta\), whose Jacobian w.r.t. \(\beta\) carries both a curvature term (from differentiating \(\phi'f'\) through \(\eta\)) and a level term (from the explicit \(\beta_k\)).

In practice you do not write the chain rule yourself: JAX does it for you. The session contract is just that you commit to one \((\phi, \phi^{-1})\) for the whole analysis.
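For intuition, the single-prediction case can be written out by hand. The sketch below uses NumPy rather than JAX, and it assumes a logistic \(f\) with a logit inference scale; the variable names and numbers are hypothetical.

```python
import numpy as np

def f(eta):
    return 1.0 / (1.0 + np.exp(-eta))

def f_prime(eta):
    p = f(eta)
    return p * (1.0 - p)

def phi(p):
    """Logit inference scale."""
    return np.log(p / (1.0 - p))

def phi_prime(p):
    return 1.0 / (p * (1.0 - p))

def phi_inv(t):
    """Maps the logit scale back to probabilities."""
    return 1.0 / (1.0 + np.exp(-t))

# Hypothetical fit and an extreme covariate profile.
beta_hat = np.array([2.0, 1.0])
V_hat = np.diag([0.25, 0.09])
x = np.array([1.0, 3.0])

eta = x @ beta_hat
p = f(eta)

# Response scale: dg/dbeta = f'(eta) x.
grad_resp = f_prime(eta) * x
se_resp = np.sqrt(grad_resp @ V_hat @ grad_resp)

# phi scale: dg/dbeta = phi'(f(eta)) f'(eta) x (the chain rule above).
grad_phi = phi_prime(p) * f_prime(eta) * x
se_phi = np.sqrt(grad_phi @ V_hat @ grad_phi)

z = 1.959963984540054  # 97.5% standard normal quantile
ci_resp = (p - z * se_resp, p + z * se_resp)
# Build the interval on the phi scale, then map the endpoints back.
ci_phi = (phi_inv(phi(p) - z * se_phi), phi_inv(phi(p) + z * se_phi))
```

With these numbers the response-scale Wald interval has an upper endpoint above 1, while the interval built on the logit scale and mapped back stays inside \((0, 1)\). For this particular pairing \(\phi' f' = 1\), so the \(\phi\)-scale gradient reduces to \(x\).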

Why the response scale matters for DiD (Ai & Norton 2003)

In a logit, the coefficient on a group:condition interaction is on the log-odds scale. On the probability scale, the difference-in-differences

\[\mathrm{DiD}(x_*) = \bigl[f(\eta_{1,1}) - f(\eta_{1,0})\bigr] - \bigl[f(\eta_{0,1}) - f(\eta_{0,0})\bigr]\]

is a nonlinear function of every parameter and depends on the covariate profile \(x_*\); here \(\eta_{g,c}\) denotes the linear predictor for group \(g\) and condition \(c\) evaluated at \(x_*\). You cannot read it off the interaction coefficient. The right tool is contrasts() (or the did() scenario helper), which evaluates the four cells on the response scale with their joint delta-method covariance and forms the DiD as a contrast.
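The four-cell computation can be sketched directly. This is not the contrasts() or did() implementation: the column layout, `beta_hat`, and `V_hat` are hypothetical, and the interaction coefficient is deliberately set to zero to show that the probability-scale DiD does not vanish with it.

```python
import numpy as np

def f(eta):
    return 1.0 / (1.0 + np.exp(-eta))

def f_prime(eta):
    p = f(eta)
    return p * (1.0 - p)

def design_row(group, cond, x_star):
    """Columns: intercept, group, condition, group:condition, covariate."""
    return np.array([1.0, group, cond, group * cond, x_star])

# Hypothetical logit fit with a zero interaction coefficient.
beta_hat = np.array([-2.0, 1.5, 1.0, 0.0, 0.5])
V_hat = np.diag([0.04, 0.06, 0.06, 0.10, 0.01])

# The four cells (group, condition), ordered to match the DiD formula above.
x_star = 0.0
cells = [(1, 1), (1, 0), (0, 1), (0, 0)]
X_cells = np.vstack([design_row(g, c, x_star) for g, c in cells])

eta = X_cells @ beta_hat
p = f(eta)                          # cell probabilities on the response scale
G = f_prime(eta)[:, None] * X_cells  # Jacobian of the four cells
cov_cells = G @ V_hat @ G.T          # joint delta-method covariance

# DiD = [p11 - p10] - [p01 - p00], a linear contrast of the four cells.
c = np.array([1.0, -1.0, -1.0, 1.0])
did = float(c @ p)
se_did = float(np.sqrt(c @ cov_cells @ c))
```

Here the log-odds interaction is exactly zero, yet the probability-scale DiD is about 0.095, because the same log-odds shift moves probabilities by different amounts at different baselines.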

References

Stata FAQ, How are the standard errors computed with margins? https://www.stata.com/support/faqs/statistics/compute-standard-errors-with-margins/
Williams, R. (2012). Using the margins command to estimate and interpret adjusted predictions and marginal effects. Stata Journal, 12(2), 308–331.
Ai, C., & Norton, E. C. (2003). Interaction terms in logit and probit models. Economics Letters, 80(1), 123–129.
Skovgaard, I. M. (1985). A second-order investigation of asymptotic ancillarity. Annals of Statistics, 13(2), 534–551.
Krinsky, I., & Robb, A. L. (1986). On approximating the statistical properties of elasticities. Review of Economics and Statistics, 68(4), 715–719.
