The Complexity Hierarchy of Transformer Operations
Most researchers treat transformer computational cost as a monolithic problem—attention is O(n²), so we optimize it. This framing obscures what actually matters: the operations inside a transformer have fundamentally different complexity profiles, and conflating them leads to wasted effort on the wrong bottlenecks.
The standard analysis groups all transformer work into a single narrative. Attention scales quadratically with sequence length. Feed-forward layers scale linearly with hidden dimension. Embedding operations are negligible. This taxonomy is not wrong, but it is incomplete. It treats complexity as a property of the operation itself, divorced from the computational substrate where it executes and the data flow constraints that govern real systems.
Consider what happens when you actually run a transformer. The attention mechanism computes Q, K, V projections—three linear transformations, each O(n·d) where n is sequence length and d is hidden dimension. Then comes the quadratic softmax over attention weights. Then the weighted sum of values. The quadratic term dominates asymptotically, but on modern hardware with bounded sequence lengths (say, n ≤ 4096), the linear projections often consume comparable wall-clock time. Why? Because matrix multiplication is highly parallelizable and benefits from dense tensor operations, while the softmax requires sequential reduction and normalization. The theoretical complexity class tells you nothing about which one your GPU actually spends time on.
This matters because optimization strategies differ radically depending on which operation you target. If you want to reduce the quadratic attention cost, you use approximations: sparse attention, low-rank decomposition, kernel methods. If you want to reduce the linear projection cost, you use quantization, pruning, or architectural changes like bottleneck dimensions. These are not interchangeable. Spending engineering effort on sparse attention when your actual bottleneck is the Q, K, V projections is like optimizing a function's asymptotic complexity when the constant factors dominate your runtime.
The hierarchy becomes clearer when you separate operations by their sensitivity to different hardware constraints. Some operations are compute-bound: they benefit from higher arithmetic intensity and can saturate modern accelerators. Matrix multiplication in feed-forward layers typically falls here. Other operations are memory-bound: they are limited by bandwidth to and from memory, not by raw compute capacity. Attention softmax and layer normalization are often memory-bound. A third category is communication-bound: in distributed settings, the cost of moving data between devices dominates. Embedding lookups and certain reduction operations can become communication-bound at scale.
This taxonomy cuts across the traditional complexity classes. Two operations with identical O(n) complexity can have opposite optimization strategies if one is compute-bound and the other is memory-bound. An O(n²) operation that is memory-bound may benefit more from algorithmic restructuring than from approximation. An O(n) operation that is compute-bound may benefit from quantization or mixed precision.
The practical implication is that complexity analysis alone cannot guide optimization. You need to know: what is the actual hardware constraint? What is the data layout? What is the batch size? What is the sequence length? These details determine whether a theoretically superior algorithm actually runs faster.
For transformer researchers, this means the standard complexity hierarchy—attention quadratic, feed-forward linear, embeddings negligible—is a starting point, not a conclusion. It tells you the asymptotic behavior under unlimited resources. It does not tell you where your system is actually slow. The operations that dominate your wall-clock time depend on the regime you are operating in: long sequences with small batch sizes have different bottlenecks than short sequences with large batches.
The field has begun recognizing this implicitly. Flash Attention succeeded not because it improved the O(n²) complexity of attention, but because it restructured memory access patterns to move the operation from memory-bound to compute-bound. This is a different kind of optimization than the complexity class suggests. It is not about reducing the number of operations; it is about changing which resource constrains execution.
Until you map your specific transformer configuration onto your specific hardware and identify which operation is actually limiting throughput, complexity analysis is incomplete. The hierarchy exists, but it is not the one in the textbooks.