Symbolic Regression as a Formal System: Guarantees Beyond Curve Fitting

The field treats symbolic regression as an optimization problem when it should be treated as a formal system with provable properties.

Most work in symbolic regression—from genetic programming to neural-guided search—frames the task as finding equations that minimize error on observed data. This is fundamentally a curve-fitting perspective. You generate candidate expressions, evaluate them against a dataset, and keep the ones that fit best. The implicit assumption is that a good fit on training data constitutes success. But this conflates two entirely different problems: approximation and characterization.

When you're building a formal system, you care about something different. You care whether the discovered expression has structural properties that guarantee certain behaviors. You care whether it satisfies constraints that go beyond "fits the data well." You care whether it can be proven to hold under specified conditions. The gap between these two framings is not semantic—it determines what kinds of guarantees you can actually make about your results.

Consider a concrete example. Suppose you discover through symbolic regression that some physical process follows the equation f(x) = x³ - 2x + 1. A curve-fitting approach asks: does this minimize squared error? A formal systems approach asks: does this expression satisfy the differential constraints implied by the underlying physics? Does it preserve invariants? Can you prove it generalizes beyond the training domain? These are not questions you can answer by looking at residuals.
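To make the contrast tangible, here is a minimal sketch of what a structural check on that candidate might look like. The two constraints (a differential constraint f''(x) = 6x and a boundary condition f(1) = 0) are hypothetical stand-ins for "the underlying physics", and the rival polynomial is invented for illustration. None of this comes from a real system; it only shows that such questions are answered by manipulating the expression, not by inspecting residuals.

```python
# Polynomials are coefficient lists, lowest degree first:
# [1, -2, 0, 1] represents 1 - 2x + x^3.

def derivative(coeffs):
    """Return the coefficient list of the derivative polynomial."""
    return [k * c for k, c in enumerate(coeffs)][1:] or [0]

def evaluate(coeffs, x):
    """Evaluate the polynomial at x using Horner's rule."""
    result = 0
    for c in reversed(coeffs):
        result = result * x + c
    return result

def trim(coeffs):
    """Drop trailing zeros so equal polynomials compare equal as lists."""
    coeffs = list(coeffs)
    while len(coeffs) > 1 and coeffs[-1] == 0:
        coeffs.pop()
    return coeffs

def satisfies_ode(coeffs):
    # HYPOTHETICAL constraint: the physics demands f''(x) = 6x for all x.
    # Comparing coefficients checks the identity everywhere, not at samples.
    return trim(derivative(derivative(coeffs))) == [0, 6]

def satisfies_boundary(coeffs):
    # HYPOTHETICAL constraint: boundary condition f(1) = 0.
    return evaluate(coeffs, 1) == 0

f = [1, -2, 0, 1]         # f(x) = x^3 - 2x + 1, the candidate from the text
rival = [1, -2, 0.01, 1]  # invented rival: same curve plus a small x^2 term

print(satisfies_ode(f), satisfies_boundary(f))
print(satisfies_ode(rival), satisfies_boundary(rival))
```

On a narrow training interval the rival would be nearly indistinguishable from f by squared error, yet it fails both structural checks.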

The distinction matters because it determines the entire validation pipeline. In curve-fitting, you split your data, train on one part, test on another, and declare victory if test error is low. This works reasonably well when you have abundant data and the true function is smooth. But symbolic regression is often applied precisely when these conditions fail: when you have sparse observations, high noise, or you're trying to reverse-engineer a mechanism from limited measurements. In these regimes, a held-out test set is too small and too noisy to say much about generalization, and it says nothing about behavior outside the sampled domain. You need structural guarantees instead.

A formal systems approach would ask: what properties must the expression satisfy to be considered valid? These might include dimensional consistency, boundary condition satisfaction, monotonicity constraints, or conservation laws. Some of these can be checked analytically. Others can be verified through formal methods—using interval arithmetic, abstract interpretation, or SMT solvers to prove that an expression satisfies a specification across an entire domain, not just at sampled points.
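As a rough illustration of proving a property across a whole domain, the sketch below uses a toy interval arithmetic to show that f(x) = x^3 - 2x + 1 is strictly increasing on [1, 2] by bounding its derivative 3x^2 - 2 over the entire interval. The `Interval` class and `proves_increasing` helper are hypothetical names written for this example. The example sticks to integer endpoints so the arithmetic is exact; a rigorous floating-point version would need outward rounding, which is omitted here.

```python
class Interval:
    """Closed interval [lo, hi] with just enough arithmetic for this sketch."""

    def __init__(self, lo, hi):
        self.lo, self.hi = lo, hi

    @staticmethod
    def lift(value):
        """Treat a plain number as the degenerate interval [value, value]."""
        return value if isinstance(value, Interval) else Interval(value, value)

    def __add__(self, other):
        other = Interval.lift(other)
        return Interval(self.lo + other.lo, self.hi + other.hi)

    __radd__ = __add__

    def __sub__(self, other):
        other = Interval.lift(other)
        return Interval(self.lo - other.hi, self.hi - other.lo)

    def __mul__(self, other):
        other = Interval.lift(other)
        products = [self.lo * other.lo, self.lo * other.hi,
                    self.hi * other.lo, self.hi * other.hi]
        return Interval(min(products), max(products))

    __rmul__ = __mul__


def f_prime(x):
    """Derivative of x^3 - 2x + 1; works on numbers and on Intervals."""
    return 3 * x * x - 2


def proves_increasing(lo, hi):
    """True if interval bounds prove f'(x) > 0 for every x in [lo, hi]."""
    return f_prime(Interval(lo, hi)).lo > 0


bound = f_prime(Interval(1, 2))
print(bound.lo, bound.hi)
print(proves_increasing(1, 2))
print(proves_increasing(0, 2))
```

Interval bounds are conservative: a `True` result is a proof over the whole interval, while `False` only means the proof did not go through. On [0, 2] the property happens to be genuinely false, since f'(0) = -2.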

This reframes the search problem entirely. Instead of "find the expression that fits best," you ask "find the simplest expression consistent with these formal constraints." Simplicity becomes measurable, typically through expression size or a related complexity metric. Consistency becomes verifiable: for constraints a solver can decide, you can prove it or disprove it. The search space becomes structured by the constraints themselves, which can substantially reduce the computational burden compared to unconstrained optimization, because candidates that violate a constraint are discarded without ever being fitted.
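One simple way to picture "simplest expression consistent with the constraints" is brute-force enumeration in order of size, returning the first candidate that passes every check. The grammar (leaves `x` and `1`, operators `+`, `-`, `*`), the node-count size measure, and the three boundary conditions below are all assumptions chosen to keep the sketch tiny; a real system would use a richer grammar and a proper verifier in place of these exact point checks.

```python
from itertools import product

LEAVES = ("x", 1)
OPS = {
    "+": lambda a, b: a + b,
    "-": lambda a, b: a - b,
    "*": lambda a, b: a * b,
}

def expressions(size):
    """Yield every expression tree with exactly `size` nodes (odd sizes only)."""
    if size == 1:
        yield from LEAVES
        return
    for left_size in range(1, size - 1, 2):
        right_size = size - 1 - left_size
        for op in OPS:
            for left, right in product(expressions(left_size),
                                       expressions(right_size)):
                yield (op, left, right)

def evaluate(expr, x):
    """Evaluate a tree at x with exact integer arithmetic."""
    if isinstance(expr, tuple):
        op, left, right = expr
        return OPS[op](evaluate(left, x), evaluate(right, x))
    return x if expr == "x" else expr

def simplest_consistent(constraints, max_size=7):
    """Return the first expression, smallest size first, passing all constraints."""
    for size in range(1, max_size + 1, 2):
        for expr in expressions(size):
            if all(check(expr) for check in constraints):
                return expr
    return None

# HYPOTHETICAL constraints: three exact boundary conditions.
constraints = [
    lambda e: evaluate(e, 0) == 1,
    lambda e: evaluate(e, 1) == 0,
    lambda e: evaluate(e, -1) == 0,
]

found = simplest_consistent(constraints)
print(found)
```

Under these assumptions no expression of size 1 or 3 qualifies, and the search stops at the five-node tree for 1 - x*x. Because enumeration is ordered by size, the first hit is also a smallest one.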

The practical implication is that symbolic regression systems should be built around constraint solvers and formal verification tools, not just error metrics. You specify your domain knowledge as formal constraints. The system searches for expressions satisfying those constraints while optimizing for simplicity and fit. When it returns a result, you have not just an equation that happened to work on your data—you have an equation with proven properties.

This is not to say empirical validation becomes irrelevant. It remains essential. But it becomes secondary to structural validation. You first verify that a candidate expression satisfies your formal constraints. Only then do you check how well it fits observed data. This ordering matters because it limits overfitting to noise: an expression that fits the noise at the expense of a constraint is rejected before its error is ever measured, and whatever survives has the properties you actually care about.
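The ordering can be sketched in a few lines: filter candidates by the structural checks, and only then rank the survivors by error. The candidates, the boundary condition f(1) = 0, and the three noisy observations are invented for illustration, and `select` is a hypothetical helper, not an existing API.

```python
def squared_error(f, data):
    """Sum of squared residuals of f over (x, y) pairs."""
    return sum((y - f(x)) ** 2 for x, y in data)

def select(candidates, constraints, data):
    """Structural validation first, empirical fit second."""
    valid = {name: f for name, f in candidates.items()
             if all(check(f) for check in constraints)}
    if not valid:
        return None
    return min(valid, key=lambda name: squared_error(valid[name], data))

candidates = {
    "cubic": lambda x: x**3 - 2*x + 1,
    # Invented rival that absorbs a constant measurement offset.
    "shifted cubic": lambda x: x**3 - 2*x + 1.1,
}

# HYPOTHETICAL constraint: boundary condition f(1) = 0, within a tolerance.
constraints = [lambda f: abs(f(1)) < 1e-9]

# HYPOTHETICAL observations: the cubic plus a constant offset of 0.1.
data = [(0, 1.1), (1, 0.1), (2, 5.1)]

print(select(candidates, constraints, data))  # constraints enforced
print(select(candidates, [], data))           # fit only
```

With the constraint in place the plain cubic is chosen. With an empty constraint list the shifted rival wins on error alone, which is exactly the failure the ordering is meant to rule out.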

The field's current practice—treating symbolic regression as sophisticated curve-fitting—works adequately when you have clean data and weak priors about the true form. But for the problems where symbolic regression is most valuable—reverse-engineering mechanisms, discovering laws from sparse observations, ensuring safety-critical properties—this approach is insufficient. You need guarantees that transcend the training set. You need a formal system, not a fitting algorithm.