Streaming Algorithms: Computing on Data You Cannot Store

The fundamental assumption of classical computer science—that you can store your input—is becoming obsolete.

For decades, algorithm design operated within a comfortable fiction: data arrives, you load it into memory, you process it, you output results. This model shaped everything from sorting networks to dynamic programming. But the moment your dataset exceeds available memory, or arrives faster than you can persist it, or costs money every second it sits in storage, the entire theoretical framework collapses. Streaming algorithms represent not a minor optimization but a categorical shift in how we think about computation itself.

Most researchers encounter streaming algorithms as a specialized topic, something you study after mastering the classics. This is backwards. Streaming is the natural computational model for the world we actually inhabit. A network packet arrives once. A sensor reading is generated and transmitted. A user interaction happens and is gone. The luxury of random access to your entire dataset is the exception, not the rule. Yet we continue teaching algorithms as though storage were free and time were expensive, when the inverse is increasingly true.

The conceptual difficulty lies in what streaming forces you to abandon. You cannot sort data you see only once. You cannot compute exact frequencies without storing every distinct element. You cannot find the median without keeping half your dataset in memory. These impossibilities are not bugs in the streaming model—they are features that expose the true cost of classical algorithms. When you cannot store, you must choose: approximate, or sample, or maintain only summary statistics. This choice is not a compromise. It is clarity.

Consider the problem of counting distinct elements in a stream. Classically, you use a hash set: O(n) space, O(1) per-element time, exact answer. Streaming? You cannot afford O(n) space when n is a billion. The HyperLogLog algorithm instead maintains a fixed-size sketch—a few kilobytes—and estimates cardinality with 2% error. For most applications, this error is irrelevant. The insight is that you never needed the exact count. You needed a number good enough to make decisions. Streaming algorithms force you to ask what "good enough" means, and that question often reveals that classical exactness was solving the wrong problem.

The mathematical machinery is elegant precisely because it is constrained. Streaming algorithms rely on randomization, sketching, and sampling in ways that classical algorithms treat as last resorts. Bloom filters, count-min sketches, reservoir sampling—these are not approximations bolted onto exact algorithms. They are fundamental techniques that achieve their power through probabilistic guarantees. A Bloom filter trades false positives for space efficiency. A count-min sketch trades accuracy for speed. These are not weaknesses. They are design choices that make sense when your constraints are real.

What makes streaming algorithms difficult to teach is that they require comfort with uncertainty. A classical algorithm either solves the problem or it does not. A streaming algorithm solves it with high probability, or within a factor of (1+ε), or with bounded error. This probabilistic framing unsettles people trained on deterministic proofs. But this discomfort is productive. It forces you to think about what you actually need from your computation, rather than what is theoretically possible.

The practical impact is already enormous. Every major cloud platform uses streaming algorithms for network telemetry, anomaly detection, and cardinality estimation. Machine learning systems process data in mini-batches because full-batch training is streaming in disguise—you cannot fit ImageNet in GPU memory, so you stream it. Recommendation systems maintain sketches of user behavior because storing exact interaction histories is economically infeasible.

Yet streaming remains peripheral in most computer science curricula. Students learn Dijkstra's algorithm in detail but never encounter the streaming version. They study exact pattern matching but not the probabilistic variants used in practice. This gap between theory and reality is not accidental. It reflects a discipline still organized around the assumption that computation is the bottleneck, when in fact data movement and storage are.

The question is not whether streaming algorithms matter. They already dominate applied computing. The question is whether we will acknowledge this in how we teach, research, and think about algorithms. The data you cannot store is not an edge case. It is the future, already here.