Hilbert Spaces and Representation Learning: A Formal Treatment
The assumption that representation learning operates in Euclidean space has quietly shaped how we think about neural networks, and it is fundamentally incomplete.
When we train a deep network, we are not simply mapping inputs to outputs through a series of linear transformations and nonlinearities. We are constructing a sequence of operators that act on elements of increasingly abstract function spaces. The standard narrative—that hidden layers produce "features" in some high-dimensional vector space—obscures what is actually happening: we are building a representation within a Hilbert space where the geometry encodes semantic relationships through inner products, not through Euclidean distance alone.
This distinction matters because Hilbert space structure reveals constraints and possibilities that Euclidean thinking conceals.
What Everyone Gets Wrong
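The prevailing view treats learned representations as points in ℝⁿ. Practitioners speak of "embedding spaces" and "latent dimensions" as though they were simply coordinate systems. This framing is convenient for implementation but mathematically misleading. A Hilbert space is not just a vector space with a dot product; it is a complete inner product space, with the metric topology given by the norm that the inner product induces. ℝⁿ with the dot product is the finite-dimensional special case, so the problem is not the choice of ℝⁿ itself but the habit of treating it as a bare coordinate system and ignoring the inner product and operator structure it carries. The completeness property, that every Cauchy sequence converges, is not decorative. It guarantees that the space is closed under limits: if the representations produced along a training trajectory form a Cauchy sequence, their limit is again an element of the space.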
When a neural network learns a representation, it is implicitly defining a map T: X → H, where X is the input domain and H is a Hilbert space. The network does not simply embed inputs as vectors; it composes linear operators with nonlinearities, and the range of the composition lies in H. The composite T is generally nonlinear, but each linear layer, and the linearization of T at any given input, is a linear operator with a well-defined spectrum, adjoint, and kernel. These properties determine what the network can and cannot represent.
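One way to make "spectrum, adjoint, kernel" concrete for a nonlinear network is to look at its linearization. The sketch below assumes a plain feed-forward network with weight matrices W_1, ..., W_L, a pointwise differentiable nonlinearity σ, and no biases; these symbols are illustrative and are not taken from any particular architecture.

```latex
\begin{aligned}
f(x) &= W_L\,\sigma\bigl(W_{L-1}\,\sigma(\cdots\,\sigma(W_1 x))\bigr), \\
J_f(x) &= W_L\,D_{L-1}(x)\,W_{L-1}\cdots D_1(x)\,W_1,
\qquad D_k(x) = \operatorname{diag}\bigl(\sigma'(z_k(x))\bigr), \\
J_f(x)^{*} &= W_1^{\top} D_1(x)\,W_2^{\top}\cdots D_{L-1}(x)\,W_L^{\top},
\qquad \ker J_f(x) = \{\,v : J_f(x)\,v = 0\,\}.
\end{aligned}
```

Here z_k(x) denotes the pre-activation of layer k at input x. The Jacobian J_f(x) is a linear operator, so its singular values, adjoint, and kernel are well defined at each input, even though f itself is not linear.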
The error is treating the learned space as though it were a static container. It is not. It is the image of an operator, and the operator's structure constrains everything.
Why This Matters More Than People Realize
Operator algebra provides a language for understanding generalization. When we regularize a network through weight decay, dropout, or architectural constraints, we are implicitly controlling the operator norm or the spectral properties of the learned transformation. A linear map T with operator norm ‖T‖ satisfies ‖Tx‖ ≤ ‖T‖ ‖x‖, so a network whose layers have bounded operator norm cannot produce arbitrarily large outputs from bounded inputs. A map with a small condition number, the ratio of its largest to its smallest singular value, is stable under perturbation of inputs.
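The quantities named above can be computed directly for a single linear layer. The following is a minimal sketch using NumPy, with a random matrix W standing in for a learned weight matrix (an assumption for illustration, not a trained model). The uniform shrinkage step is a simplified stand-in for the effect of weight decay, not a full optimizer update.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical weight matrix standing in for one linear layer of a network.
W = rng.normal(size=(64, 32))

# Singular values, returned in descending order.
singular_values = np.linalg.svd(W, compute_uv=False)
operator_norm = singular_values[0]                       # spectral norm
condition_number = singular_values[0] / singular_values[-1]

# The operator norm bounds how far any input can be stretched:
# ||W x|| <= ||W|| * ||x||.
x = rng.normal(size=32)
output_norm = np.linalg.norm(W @ x)
bound = operator_norm * np.linalg.norm(x)

# Illustrative shrinkage step: scaling W scales every singular value,
# and therefore the operator norm, by the same factor.
decay = 0.9
shrunk_norm = np.linalg.svd(decay * W, compute_uv=False)[0]

print(f"operator norm:    {operator_norm:.3f}")
print(f"condition number: {condition_number:.3f}")
print(f"||Wx|| = {output_norm:.3f} <= bound = {bound:.3f}")
```

Note that uniform shrinkage lowers the operator norm but leaves the condition number unchanged, since it rescales all singular values together. Controlling the condition number requires changing the shape of the spectrum, not just its scale.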
This has immediate consequences for how we should think about overparameterization. The conventional wisdom is that wide networks generalize well because they have an implicit bias toward simple solutions. The Hilbert space perspective suggests something more precise: wide networks can represent a richer set of operators, but the optimization landscape biases learning toward operators with favorable spectral properties: those whose singular values decay rapidly, whose adjoints are well-behaved, whose kernels are small.
Representation learning in this framework is not about finding a good coordinate system. It is about constructing an operator with the right spectral signature for the downstream task.
What Changes When You See It Clearly
Once you accept that learned representations live in Hilbert spaces governed by operator algebra, several things become visible that were previously obscured.
First, the relationship between network depth and representational capacity is not about "stacking features." It is about composing operators and how their spectral properties compound. The linear part of a deep network is a product of operators, and the singular values of the product are constrained by those of the factors. In particular, the operator norm is submultiplicative: ‖AB‖ ≤ ‖A‖ ‖B‖.
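The way operator norms compound under composition can be checked numerically. The sketch below assumes three random matrices as hypothetical layer weights and ignores nonlinearities. It verifies that the spectral norm of the product never exceeds the product of the spectral norms.

```python
import numpy as np

rng = np.random.default_rng(1)

def spectral_norm(matrix):
    """Largest singular value of a 2-D array."""
    return np.linalg.norm(matrix, ord=2)

# Three hypothetical layer matrices with compatible shapes (16 -> 32 -> 32 -> 8).
layers = [
    rng.normal(size=(32, 16)),
    rng.normal(size=(32, 32)),
    rng.normal(size=(8, 32)),
]

# Compose in application order: product = W3 @ W2 @ W1.
product = layers[0]
for W in layers[1:]:
    product = W @ product

product_norm = spectral_norm(product)
bound = np.prod([spectral_norm(W) for W in layers])

print(f"norm of product:  {product_norm:.3f}")
print(f"product of norms: {bound:.3f}")
```

The bound is typically loose, because the top singular directions of successive layers rarely align. How loose it is depends on that alignment, which is one reason per-layer norms alone give only a coarse picture of a deep network.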
Second, the geometry of representation space is not the Euclidean geometry of the ambient space. The relevant metric comes from the inner product induced by the learned operator, ⟨x, y⟩_T = ⟨Tx, Ty⟩, not from ℓ2 distance in the ambient space. This changes how we should think about nearest neighbors, clustering, and interpolation in learned spaces.
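A small example shows how the induced inner product can change which point counts as the nearest neighbor. This is a sketch under a strong simplifying assumption: the learned operator is replaced by a fixed 2×2 linear map T, chosen by hand. For a nonlinear network the same construction would apply only locally, through the linearization at a point.

```python
import numpy as np

# Hypothetical linear map standing in for a learned representation:
# it stretches the second input coordinate by a factor of 10.
T = np.array([[1.0, 0.0],
              [0.0, 10.0]])

# Gram matrix of the induced inner product <x, y>_T = x^T (T^T T) y.
G = T.T @ T

def induced_distance(x, y):
    """Distance between x and y measured after applying T."""
    d = x - y
    return float(np.sqrt(d @ G @ d))

query = np.array([0.0, 0.0])
candidates = np.array([[1.0, 0.0],    # candidate 0
                       [0.0, 0.5]])   # candidate 1

# Ambient Euclidean distances versus distances in the induced geometry.
ambient = np.linalg.norm(candidates - query, axis=1)
induced = np.array([induced_distance(c, query) for c in candidates])

nearest_ambient = int(np.argmin(ambient))
nearest_induced = int(np.argmin(induced))

print("ambient distances:", ambient, "-> nearest:", nearest_ambient)
print("induced distances:", induced, "-> nearest:", nearest_induced)
```

In the ambient space candidate 1 is closer to the query, while under the induced metric candidate 0 is closer, because T amplifies differences along the second coordinate.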
Third, the question of what makes a representation "good" becomes formally tractable. A representation is good when the operator from input space to representation space has a spectrum that separates signal from noise—when the singular values corresponding to task-relevant directions are large and those corresponding to noise are small.
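The spectral criterion above can be illustrated with a singular value decomposition. The sketch below builds a hypothetical operator with a hand-chosen spectrum, two large singular values assumed to be task-relevant and two small ones assumed to be noise, then recovers the spectrum and measures how strongly the operator amplifies each kind of direction. The labels "signal" and "noise" are assumptions of the example; in practice they depend on the downstream task.

```python
import numpy as np

rng = np.random.default_rng(2)

# Chosen spectrum: two large singular values (assumed signal),
# two small ones (assumed noise).
chosen = np.array([10.0, 8.0, 0.1, 0.05])

# Random orthonormal factors, so that T = U diag(chosen) V^T has this spectrum.
U, _ = np.linalg.qr(rng.normal(size=(6, 4)))   # orthonormal columns, shape (6, 4)
V, _ = np.linalg.qr(rng.normal(size=(4, 4)))   # orthogonal matrix, shape (4, 4)
T = U @ np.diag(chosen) @ V.T

# Recover the spectrum and the input directions from T alone.
_, recovered, Vt = np.linalg.svd(T, full_matrices=False)

signal_direction = Vt[0]    # right singular vector with the largest gain
noise_direction = Vt[-1]    # right singular vector with the smallest gain

signal_gain = np.linalg.norm(T @ signal_direction)
noise_gain = np.linalg.norm(T @ noise_direction)

# Fraction of squared singular-value mass carried by the top two directions.
energy_top2 = np.sum(recovered[:2] ** 2) / np.sum(recovered ** 2)

print("recovered singular values:", np.round(recovered, 3))
print(f"signal gain: {signal_gain:.3f}, noise gain: {noise_gain:.3f}")
print(f"energy in top two directions: {energy_top2:.5f}")
```

An input component along the first right singular vector is amplified by the largest singular value, while a component along the last is nearly suppressed. That gap is what "separating signal from noise" means in spectral terms.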
The mathematics was always there. We simply chose to ignore it in favor of geometric intuition borrowed from Euclidean space. That choice was pragmatic once. It is no longer.