Mathematical motivation
=======================
This page derives, in one place, the statistics
:class:`~pymargins.Margins` computes and the uncertainty attached to
each. The motivation matches Stata's `margins-delta-method FAQ
`_
and Richard Williams' `Margins01 notes
`_; the goal
here is to expose the *single Jacobian primitive* the implementation
reduces to, and to write down the curvature diagnostic that decides
when the delta method is unsafe.
The delta method
----------------
For a fitted parameter vector :math:`\hat\beta` with estimated
covariance :math:`\widehat V`, and a (possibly vector-valued) statistic
:math:`g(\beta)`, a first-order Taylor expansion gives
.. math::
g(\hat\beta) \;\approx\; g(\beta_0)
+ G\,(\hat\beta - \beta_0),
\qquad
G = \left.\tfrac{\partial g}{\partial \beta}\right|_{\beta_0}.
Taking variances yields
.. math::
\widehat{\operatorname{Var}}\bigl[g(\hat\beta)\bigr]
\;\approx\; G\,\widehat V\,G^\top .
Three things to notice:
1. The approximation depends only on **first** derivatives of :math:`g`.
2. Once :math:`G\widehat V G^\top` is in hand, *any* linear combination
:math:`C\,g(\hat\beta)` has covariance :math:`C(G\widehat V G^\top)C^\top`
— no further differentiation needed. That is why
:meth:`~pymargins.Margins.contrasts` is exact under the same
approximation.
3. ``pymargins`` computes :math:`G` by JAX autodiff when an autodiff
path exists for the predict function, by autodiff over a custom-JVP
FD primitive when the model is black-box but has a clean
:math:`\eta = X\beta` linear predictor, and by full finite
differences otherwise. See
:doc:`explanations/gradient_backend`.
A single estimand schema
------------------------
Internal to the library every estimand is a triple
:math:`(h, \phi, \phi^{-1})`:
- :math:`h(\beta)` — the estimand on the **inference scale**; this is
what gets differentiated for delta and what gets evaluated for
simulation.
- :math:`\phi` — back-transform from inference scale to reporting
scale, applied to CI endpoints.
- :math:`\phi^{-1}` — forward transform; converts user-supplied null
values onto the inference scale for hypothesis tests.
The pair :math:`(\phi, \phi^{-1})` is **session-level**, not
per-estimand: every call within one :class:`~pymargins.Margins`
instance is on the same inference scale. That is what makes the
session-level κ diagnostic and inter-call composability work.
The Williams (2012) statistic table reduces to a single primitive:
.. list-table::
:header-rows: 1
:widths: 15 60
* - Statistic
- :math:`g(\beta)`
* - AAP
- :math:`\tfrac{1}{n}\sum_i f(x_i^\top\beta)`
* - APM
- :math:`f(\bar x^\top\beta)`
* - APR
- :math:`\tfrac{1}{n}\sum_i f(x_i^\top\beta)` with some :math:`x` fixed
* - AME
- :math:`\tfrac{1}{n}\sum_i \partial f(x_i^\top\beta)/\partial x_{ij}`
* - MEM
- :math:`\partial f(\bar x^\top\beta)/\partial x_j`
* - MER
- AME with some :math:`x` held at representative values
* - Contrast
- :math:`E[f(\cdot)\mid x_j=\ell] - E[f(\cdot)\mid x_j=\text{ref}]`
Every entry is a linear combination of mean-predictions. The single
Jacobian primitive is
.. math::
\frac{\partial}{\partial\beta}\,
\frac{1}{n}\sum_i f(x_i^\top\beta)
\;=\; \frac{1}{n}\sum_i f'(x_i^\top\beta)\,x_i .
Substituting back into the delta method gives a closed form for the
covariance of every statistic above.
Beyond delta: simulation and bootstrap
--------------------------------------
The delta method is a first-order approximation: :math:`G\widehat V
G^\top` is exact only to the extent that :math:`g` is locally linear
in :math:`\beta` near :math:`\hat\beta`. For statistics that are
sharply nonlinear (a probability near 0 or 1, an elasticity at a
near-zero prediction, a ratio with a small denominator) Wald
intervals can extend beyond the natural support or under-cover.
``pymargins`` exposes two alternatives behind the same ``method=``
keyword on the session.
**Krinsky–Robb simulation** (``method="simulation"``). Draw
:math:`S` parameter vectors :math:`\beta_s\sim N(\hat\beta,\widehat
V)`, evaluate :math:`g(\beta_s)`, and read inference off the
empirical distribution:
.. math::
\widehat{\operatorname{Var}}_{\mathrm{KR}}\bigl[g(\hat\beta)\bigr]
= \frac{1}{S-1}\sum_{s=1}^S
\bigl(g(\beta_s) - \bar g\bigr)\bigl(g(\beta_s) - \bar g\bigr)^{\!\top}.
The reported point estimate stays the analytic :math:`g(\hat\beta)`
(not the Monte Carlo mean), matching Stata's ``vce(simulation)``.
Pointwise CIs default to empirical quantiles of the draws — so an
asymmetric CI for an extreme probability is natural.
**Bootstrap** (``method="bootstrap"``). Refit the model on :math:`B`
resamples and read inference off the bootstrap distribution. Three
resampling schemes:
- pairs — :math:`(y_i, x_i)` resampled IID (default);
- cluster — whole clusters resampled, required for within-cluster
correlation;
- block — moving-block resampling for time series.
Failed refits are caught and counted; a ``RuntimeWarning`` fires when
the failure rate exceeds 5%.
The κ curvature diagnostic
--------------------------
``pymargins`` computes Skovgaard's relative curvature κ for every
estimand (when diagnostics are enabled) and auto-falls-back to
simulation when κ exceeds the session threshold (default 0.3). This
is a meaningful divergence from Stata's ``margins`` and
``marginaleffects``, both of which always do delta and never tell you
when delta is suspect.
Heuristically, κ is the ratio of the second-derivative contribution
of :math:`g` (Hessian, whitened by :math:`\widehat V`) to its
first-derivative contribution (gradient, same whitening). When κ is
small the linearization underlying the delta method is accurate;
when κ is large the symmetric Wald interval can miss the true
sampling distribution badly. The thresholds (0.1 / 0.3) are
calibrated from the nonlinear-regression literature; expose them as
configurable but keep these as defaults.
See :doc:`explanations/kappa_diagnostic` for the whitening transform
and the auto-fallback policy.
Inference scales (``phi``) and the chain rule
---------------------------------------------
The :class:`~pymargins.Margins` constructor commits to a
:math:`(\phi, \phi^{-1})` pair via either the classmethod helpers
(``Margins.log_scale``, ``Margins.logit_scale``,
``Margins.correlation_scale``,
``Margins.linear_scale``) or by passing ``phi=`` and ``phi_inv=``
directly.
The chain rule fixes how :math:`\phi` propagates through the
estimand:
- For a single prediction :math:`g(\eta) = \phi(f(X\beta))`,
:math:`\partial g/\partial\beta = \phi'(f(\eta))\,f'(\eta)\,X`.
- For an AME on the :math:`\phi`-scale,
:math:`\mathrm{AME}_k^{(\phi)} = \tfrac{1}{n}\sum_i \phi'(f(\eta_i))\,f'(\eta_i)\,\beta_k`,
whose Jacobian w.r.t. :math:`\beta` carries both a curvature term
(from differentiating :math:`\phi'f'` through :math:`\eta`) and a
level term (from the explicit :math:`\beta_k`).
In practice you do not write the chain rule yourself: JAX does it for
you. The session contract is just that you commit to one
:math:`(\phi, \phi^{-1})` for the whole analysis.
Why the response scale matters for DiD (Ai & Norton 2003)
---------------------------------------------------------
In a logit, the coefficient on a ``group:condition`` interaction is on
the *log-odds* scale. On the *probability* scale, the
difference-in-differences
.. math::
\mathrm{DiD}(x_*)
= \bigl[f(\eta_{1,1}) - f(\eta_{1,0})\bigr]
- \bigl[f(\eta_{0,1}) - f(\eta_{0,0})\bigr]
is a nonlinear function of every parameter and every covariate
profile :math:`x_*`. You cannot read it off the interaction
coefficient. The right tool is
:meth:`~pymargins.Margins.contrasts` (or the
:func:`~pymargins.did` scenario helper), which evaluates the four
cells on the response scale with their joint delta-method covariance
and forms the DiD as a contrast.
References
----------
- Stata FAQ, *How are the standard errors computed with margins?*
https://www.stata.com/support/faqs/statistics/compute-standard-errors-with-margins/
- Williams, R. (2012). *Using the margins command to estimate and
interpret adjusted predictions and marginal effects.* Stata Journal,
12(2), 308–331.
- Ai, C., & Norton, E. C. (2003). *Interaction terms in logit and
probit models.* Economics Letters, 80(1), 123–129.
- Skovgaard, I. M. (1985). *A second-order investigation of asymptotic
ancillarity.* Annals of Statistics, 13(2), 534–551.
- Krinsky, I., & Robb, A. L. (1986). *On approximating the
statistical properties of elasticities.* Review of Economics and
Statistics, 68(4), 715–719.