Axiomatic Foundations for Deep Learning Guarantees
The field of deep learning has built its theoretical house on sand, and we are only now beginning to notice the cracks.
We possess empirical validation at scale—networks that work, that generalize, that solve problems we thought intractable. Yet we lack the foundational axioms that would let us prove anything about them. This is not a minor gap. It is the difference between engineering and science. A bridge engineer can calculate load-bearing capacity from first principles. A deep learning researcher cannot tell you with certainty why a particular architecture will or will not converge, why it generalizes to unseen data, or what guarantees hold when you scale it by an order of magnitude.
The usual response is to shrug and call this "empirical science." But that framing obscures what is actually happening: we are operating within an implicit axiomatic system that nobody has made explicit. We assume gradient descent finds useful minima. We assume overparameterization helps generalization. We assume certain architectural choices preserve expressivity. None of these is proven in general. At best, each has been established in restricted settings; beyond those settings they are working hypotheses, dressed up in mathematical language.
The problem deepens when you try to build formal guarantees on top of this unstable ground. Existing theoretical frameworks (PAC learning, VC dimension, Rademacher complexity) were developed for hypothesis classes whose capacity is small relative to the sample size. Applied to the deep networks we actually train, their bounds are typically vacuous. For a network with more parameters than training examples, classical capacity-based theory offers no assurance against catastrophic overfitting. Instead, such networks routinely generalize. The theory leads us to expect one thing. Reality does another. This is not a sign that theory needs refinement. It is a sign that we are missing the axioms that would explain what is actually happening.
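To make the mismatch concrete, here is a short worked calculation using one standard textbook form of the growth-function generalization bound for binary classification. The symbols are assumptions introduced for illustration and do not appear elsewhere in this essay: R(h) is the true error, \widehat{R}_n(h) the training error on n i.i.d. examples, \Pi_H(n) the growth function of the hypothesis class H, d its VC dimension, and \delta the failure probability.

```latex
% With probability at least 1 - \delta over an i.i.d. sample of size n,
% simultaneously for every h in H:
\[
  R(h) \;\le\; \widehat{R}_n(h)
  + \sqrt{\frac{2 \ln \Pi_H(n)}{n}}
  + \sqrt{\frac{\ln(1/\delta)}{2n}} .
\]
% If d \ge n, then H shatters some set of n points, so \Pi_H(n) = 2^n,
% and the capacity term alone becomes
\[
  \sqrt{\frac{2 \ln 2^{n}}{n}} \;=\; \sqrt{2 \ln 2} \;\approx\; 1.18 \;>\; 1 .
\]
```

Since an error rate never exceeds 1, the bound says nothing in this regime. It does not predict failure so much as decline to predict anything, which is the sense in which the classical tools stop being informative once capacity outruns the sample size.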
What would a rigorous axiomatic foundation look like? It would begin by formalizing the space of functions that neural networks can represent, not asymptotically or in the limit, but for finite networks of finite depth and width. It would state its claims about the geometry of loss landscapes as structural theorems, derived from explicit axioms and holding for any network that satisfies stated conditions, rather than as empirical observations about specific problems. It would define what generalization means in the presence of implicit regularization, and prove that certain training procedures necessarily induce it.
This is not theoretical luxury. It is practical necessity. As we move toward systems that must provide formal guarantees—in safety-critical domains, in formal verification, in systems that interact with other automated reasoners—we cannot rely on "it works in practice." We need theorems. And theorems require axioms.
The challenge is that an axiomatic system for deep learning cannot simply borrow from classical analysis or probability theory. Those frameworks assume properties that neural networks violate. Standard concentration inequalities rest on independence or bounded-difference conditions, and trained weights, which depend on the entire sample, need not satisfy them. You cannot use VC dimension to bound generalization when the effective complexity of the hypothesis class changes during training. You need axioms that capture what is actually true about deep learning: that it is a form of adaptive, nonconvex optimization with implicit bias; that it operates in regimes where classical statistical assumptions break down; that its behavior is determined by properties of the data distribution, the architecture, and the optimization trajectory in ways we have not yet formalized.
Some progress exists. Work on neural tangent kernels, implicit bias in gradient descent, and the loss landscape structure of overparameterized networks has moved the needle. But these remain isolated results, not yet unified into a coherent axiomatic framework.
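The flavor of these isolated results can be seen in the simplest overparameterized setting, which is a linear model and not a deep network. For least squares with more parameters than examples, gradient descent started from zero is known to converge to the minimum-norm interpolating solution, even though infinitely many other solutions fit the data equally well. The sketch below checks this numerically with NumPy. The function name, problem sizes, step count, and random data are all assumptions chosen for illustration.

```python
import numpy as np


def gradient_descent_least_squares(X, y, steps=5000):
    """Plain gradient descent on 0.5 * ||Xw - y||^2, started from w = 0."""
    n, d = X.shape
    w = np.zeros(d)
    # Step size 1/L, where L = sigma_max(X)^2 is the gradient's Lipschitz constant.
    lr = 1.0 / np.linalg.norm(X, 2) ** 2
    for _ in range(steps):
        w -= lr * X.T @ (X @ w - y)
    return w


# Overparameterized toy problem: 20 examples, 100 parameters (illustrative sizes).
rng = np.random.default_rng(0)
n, d = 20, 100
X = rng.standard_normal((n, d))
y = rng.standard_normal(n)

w_gd = gradient_descent_least_squares(X, y)

# Minimum-Euclidean-norm solution of Xw = y, via the pseudoinverse.
w_min_norm = np.linalg.pinv(X) @ y

train_residual = np.linalg.norm(X @ w_gd - y)   # how well GD fits the training data
gap = np.linalg.norm(w_gd - w_min_norm)         # distance from the min-norm solution

print("training residual:", train_residual)
print("distance to min-norm solution:", gap)
```

Nothing in the loss function asks for a small norm; the preference comes from the optimizer and its starting point. That is a provable statement here only because the model is linear, which illustrates both what such results look like and how far they are from a general framework.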
The path forward requires treating deep learning theory as a foundational mathematical enterprise, not as an application of existing theory. We need researchers willing to ask: what minimal set of axioms, if true, would explain everything we observe about deep learning? What theorems follow necessarily from those axioms? Where do our empirical observations contradict them, and what does that tell us about which axioms are wrong?
Until we do this work, deep learning will remain what it is now: powerful, useful, and fundamentally mysterious.