Geometry of the Nonlinear Jacobian Chain (The Mathematical Model of ResNet and Transformers)
This project builds a geometric lens for modern deep learning by studying the Riemannian structures generated by Jacobian chains in nonlinear compositional models. The central object is the depth-wise product of layer Jacobians, which induces a pullback metric on data/feature space and encodes how local volumes, directions, and curvature are transported through the network. We ask when depth preserves near-isometry, when it creates anisotropy or singularities, and how training reshapes these geometric invariants. The goal is a principled bridge between depth, curvature/spectra, stability, and practical trainability—yielding measurable diagnostics and theory-guided design principles.
Project Overview
This project studies deep nonlinear neural networks through the geometry of their Jacobian chains.
Given a network as a composition of layers, the end-to-end differential is a product of Jacobians (a “Jacobian chain”). This chain induces natural Riemannian metrics—on input space, feature space, and (via Fisher-type constructions) parameter space—which provide a principled language to describe stability, trainability, expressivity, and generalization in deep models.
At a high level, we treat depth not only as “more layers,” but as a geometric mechanism: repeated composition shapes a metric field whose spectral profile and curvature encode how signals, gradients, and local volumes evolve through the network.
Core Questions
We focus on a few concrete, geometry-first questions:
- Metric induced by the Jacobian chain. For a network map $f:\mathcal{X}\to\mathcal{Y}$, the pullback metric
\[g_x = (Df(x))^\top Df(x)\]
defines a Riemannian geometry on $\mathcal{X}$ wherever $Df(x)$ has full column rank (and analogously on feature manifolds across depth); where the rank drops, $g_x$ is only positive semidefinite.
How does $g_x$ evolve with depth and training? Which invariants are stable?
- Dynamical isometry and beyond. Classical trainability heuristics aim for “well-conditioned” Jacobians at initialization.
Can we formulate trainability as a geometric condition (e.g., bounded distortion, controlled curvature, or near-isometric transport across depth), and track when and why it fails?
- Curvature as a descriptor of representation. If the metric field becomes highly anisotropic, curved, or singular in regions of data support, the network can exhibit brittle behavior.
Which curvature-like quantities correlate with robustness and generalization? Can they be turned into measurable diagnostics or regularizers?
- Nonlinearity and compositional geometry. In nonlinear networks (e.g., residual blocks), $Df(x)$ is data-dependent and spatially varying.
How does nonlinearity generate curvature, even if each layer is “locally simple”?
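As a small worked illustration of the last question (an assumed toy layer, not a model taken from the project), consider a single layer $f(x) = \phi(Wx)$ with a pointwise activation $\phi$ and weight matrix $W$. The chain rule gives
\[Df(x) = \operatorname{diag}\big(\phi'(Wx)\big)\,W, \qquad g_x = W^\top \operatorname{diag}\big(\phi'(Wx)^2\big)\,W.\]
If $\phi$ is affine, $\phi'$ is constant and $g_x$ is the same at every point, so the induced geometry is flat. For a nonlinear $\phi$ such as $\tanh$, the diagonal weights $\phi'(Wx)^2$ change with $x$, so even this one-layer metric field varies across the data. A non-constant metric is necessary, though not sufficient, for nonzero curvature. Stacking layers composes these fields: writing $F_\ell = f_\ell \circ \cdots \circ f_1$ for the first $\ell$ layers (notation introduced here for illustration) and $J_\ell = Df_\ell$ evaluated at $F_{\ell-1}(x)$,
\[g^{(\ell)}_x = \big(DF_{\ell-1}(x)\big)^\top J_\ell^\top J_\ell \, DF_{\ell-1}(x).\]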
Mathematical Objects We Use
Jacobian chain (end-to-end differential). For a depth-$L$ composition $f = f_L \circ \cdots \circ f_1$ with hidden states $h_0 = x$ and $h_\ell = f_\ell(h_{\ell-1})$:
\[Df(x) = J_L(h_{L-1}) \cdots J_1(x), \quad J_\ell(\cdot) = D f_\ell(\cdot)\]
Pullback metric / distortion.
\[g_x = (Df(x))^\top Df(x), \quad \text{distortion}(x) \sim \kappa(Df(x)) = \sigma_{\max}(Df(x))/\sigma_{\min}(Df(x))\]
Geometric statistics (local, data-dependent).
- Singular value spectrum of $Df(x)$ along data.
- Log-volume change $\tfrac{1}{2}\log\det(Df(x)^\top Df(x))$, i.e. the sum of the log singular values of $Df(x)$.
- Curvature proxies (e.g., variation of the metric field along geodesics / data trajectories).
- Fisher-type metrics on parameters: $G(\theta) \approx \mathbb{E}\big[(\nabla_\theta \log p_\theta)(\nabla_\theta \log p_\theta)^\top\big]$.
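The input-space objects above can be computed directly on a small model. The following is a minimal NumPy sketch under stated assumptions: the residual block $h \mapsto h + V\tanh(Wh)$, the helper names (`init_blocks`, `jacobian_chain`, `geometric_stats`), and the random weights are all hypothetical choices for illustration, not the project's architecture or code. The layer Jacobian is written out by hand as $I + V\,\operatorname{diag}(1-\tanh^2(Wh))\,W$.

```python
import numpy as np

def init_blocks(dim, depth, scale=0.3, seed=0):
    """Hypothetical toy residual blocks h -> h + V tanh(W h) with small random weights."""
    rng = np.random.default_rng(seed)
    std = scale / np.sqrt(dim)
    return [(std * rng.standard_normal((dim, dim)),
             std * rng.standard_normal((dim, dim))) for _ in range(depth)]

def block_forward(h, W, V):
    return h + V @ np.tanh(W @ h)

def block_jacobian(h, W, V):
    # d/dh [h + V tanh(W h)] = I + V diag(1 - tanh(W h)^2) W
    slope = 1.0 - np.tanh(W @ h) ** 2
    return np.eye(h.size) + V @ (slope[:, None] * W)

def jacobian_chain(x, blocks):
    """Return f(x) and Df(x) = J_L(h_{L-1}) ... J_1(x)."""
    h, J = x, np.eye(x.size)
    for W, V in blocks:
        J = block_jacobian(h, W, V) @ J   # each new layer multiplies on the left
        h = block_forward(h, W, V)
    return h, J

def geometric_stats(J):
    """Pullback metric, singular values, condition number, and log-volume change."""
    sv = np.linalg.svd(J, compute_uv=False)   # sorted in descending order
    return {
        "metric": J.T @ J,
        "singular_values": sv,
        "condition_number": sv[0] / sv[-1],
        "log_volume": float(np.sum(np.log(sv))),   # equals 0.5 * logdet(J^T J)
    }

blocks = init_blocks(dim=4, depth=6)
x = np.array([0.5, -1.0, 0.25, 2.0])
y, J = jacobian_chain(x, blocks)
stats = geometric_stats(J)
print(stats["condition_number"], stats["log_volume"])
```

Evaluating `geometric_stats` at many data points gives the local, data-dependent statistics listed above. For larger models an automatic-differentiation library would normally replace the hand-written Jacobian; that choice is left open here.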
Current Stage Results
Stage A — Formalization (done / ongoing).
- A clean definition of the Jacobian-chain–induced geometry on inputs and intermediate feature spaces.
- A dictionary connecting “signal propagation” $\leftrightarrow$ metric distortion and “exploding/vanishing gradients” $\leftrightarrow$ blow-up or degeneracy of $g_x$ (its largest eigenvalue diverging or its smallest eigenvalue collapsing with depth).
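One way to make this dictionary concrete is the following pair of standard identities, stated here as a sketch. The scalar cost $\mathcal{C}$ on the output $y = f(x)$ is an assumed symbol introduced only for this illustration. For a perturbation direction $v$ at the input,
\[\|Df(x)\,v\|^2 = v^\top g_x\, v, \qquad \lambda_{\min}(g_x)\,\|v\|^2 \le v^\top g_x\, v \le \lambda_{\max}(g_x)\,\|v\|^2,\]
so forward signal propagation is controlled by the eigenvalues of $g_x$, which are the squared singular values of $Df(x)$. For backpropagated gradients,
\[\nabla_x \mathcal{C} = Df(x)^\top \nabla_y \mathcal{C}, \qquad \|\nabla_x \mathcal{C}\|^2 = \nabla_y \mathcal{C}^\top\, Df(x)\,Df(x)^\top\, \nabla_y \mathcal{C} \le \sigma_{\max}(Df(x))^2\, \|\nabla_y \mathcal{C}\|^2.\]
The matrices $Df\,Df^\top$ and $g_x = Df^\top Df$ share the same nonzero eigenvalues, so growth of $\lambda_{\max}(g_x)$ with depth permits exploding gradients, while collapse of the spectrum toward zero forces vanishing ones. When $Df(x)$ is square and invertible, the matching lower bound $\|\nabla_x \mathcal{C}\|^2 \ge \sigma_{\min}(Df(x))^2\,\|\nabla_y \mathcal{C}\|^2$ also holds.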
Stage B — Theory-guided diagnostics (ongoing).
- Practical diagnostics computed on real networks and data: spectra of $Df(x)$ across depth and training time.
- Stability signatures under perturbations (input noise / weight noise / data augmentation).
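A hedged sketch of what two such diagnostics might look like on a toy model: a depth-wise profile of the partial Jacobian chain, and the spread of the end-to-end log condition number under Gaussian input noise. The residual block $h \mapsto h + V\tanh(Wh)$, the function names, and the noise level are illustrative assumptions. Training-time tracking, weight noise, and data augmentation are not covered here.

```python
import numpy as np

def init_blocks(dim, depth, scale=0.3, seed=0):
    """Hypothetical toy residual blocks h -> h + V tanh(W h) with small random weights."""
    rng = np.random.default_rng(seed)
    std = scale / np.sqrt(dim)
    return [(std * rng.standard_normal((dim, dim)),
             std * rng.standard_normal((dim, dim))) for _ in range(depth)]

def depth_profile(x, blocks):
    """Array of shape (depth, 2): condition number and log-volume change
    of the partial chain J_l ... J_1 after each block."""
    h, J, rows = x, np.eye(x.size), []
    for W, V in blocks:
        slope = 1.0 - np.tanh(W @ h) ** 2                     # tanh'(W h)
        J = (np.eye(x.size) + V @ (slope[:, None] * W)) @ J   # extend the chain
        h = h + V @ np.tanh(W @ h)                            # advance the state
        sv = np.linalg.svd(J, compute_uv=False)
        rows.append((sv[0] / sv[-1], np.sum(np.log(sv))))
    return np.array(rows)

def input_noise_spread(x, blocks, sigma=0.05, n_samples=64, seed=1):
    """Standard deviation of the end-to-end log condition number over
    Gaussian input perturbations x + sigma * noise."""
    rng = np.random.default_rng(seed)
    log_kappa = []
    for _ in range(n_samples):
        x_noisy = x + sigma * rng.standard_normal(x.size)
        log_kappa.append(np.log(depth_profile(x_noisy, blocks)[-1, 0]))
    return float(np.std(log_kappa))

blocks = init_blocks(dim=8, depth=12)
x = np.linspace(-1.0, 1.0, 8)
profile = depth_profile(x, blocks)       # one row per layer
spread = input_noise_spread(x, blocks)   # scalar stability signature
print(profile[:, 0])
print(spread)
```

Plotting the first column of `profile` against layer index shows how distortion accumulates with depth at a single input. A large `spread` would indicate that the local geometry changes quickly around that input.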
Stage C — Geometry-driven principles (planned).
- Geometric conditions stronger than dynamical isometry (e.g., bounded curvature growth).
- Potential regularizers or architectural constraints motivated by metric/curvature control.
Selected References
- Poole et al. Exponential expressivity in deep neural networks through transient chaos. NeurIPS (2016).
- Schoenholz et al. Deep Information Propagation. ICLR (2017).
- Amari. Information Geometry and Its Applications. Springer (2016).
- Bronstein et al. Geometric Deep Learning. (2021).