
Once upon a time, back when applications were simple, networks were slow, and everyone swore that Simple Object Access Protocol (SOAP) was the future, compression meant survival. Compression meant encoding, restructuring, or modifying data to reduce its size, allowing it to occupy less storage space or be transmitted more efficiently. You compressed because bandwidth was expensive, storage was limited, and users valued rapid response. Smaller payloads meant faster delivery, happier customers, and fewer surprise invoices from your cloud provider.
Back then, compression was noble. It made everything better.
But the AI era has a remarkable talent for taking our comfortable assumptions, holding them up to the light, and revealing that the world has rotated while we weren’t looking. And nowhere is that more evident than in how we think about compression today.
Traditionally, compression was about performance. Then it was about bandwidth. Today? Compression is about not bankrupting yourself on inference.
Let’s start with the obvious: yes, bandwidth still costs money. Cloud egress is infamous, and data transfer bills can still produce heart palpitations. But be honest and compare the cost of moving a megabyte across the wire with the cost of generating 10,000 tokens on a top-shelf large language model (LLM). One is a forgotten rounding error on the monthly bill. The other is a sternly worded message from finance asking why you’ve suddenly consumed the budget for Q3.
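To make that comparison concrete, here is a back-of-the-envelope sketch. Both prices are illustrative placeholders chosen for the arithmetic, not quotes from any provider, so substitute the numbers from your own bill.

```python
# Illustrative placeholder prices (assumptions, not quotes from any provider).
EGRESS_USD_PER_GB = 0.09              # assumed data-transfer price
OUTPUT_USD_PER_MILLION_TOKENS = 15.0  # assumed price for a premium model


def transfer_cost_usd(megabytes: float) -> float:
    """Cost of moving `megabytes` across the wire (taking 1 GB = 1000 MB)."""
    return megabytes / 1000 * EGRESS_USD_PER_GB


def generation_cost_usd(tokens: int) -> float:
    """Cost of generating `tokens` output tokens."""
    return tokens / 1_000_000 * OUTPUT_USD_PER_MILLION_TOKENS


if __name__ == "__main__":
    wire = transfer_cost_usd(1)
    brain = generation_cost_usd(10_000)
    print(f"1 MB egress:       ${wire:.6f}")
    print(f"10,000 tokens out: ${brain:.6f}")
    print(f"ratio:             {brain / wire:,.0f}x")
```

Under these assumed prices, the token bill comes out well over a thousand times larger than the transfer bill. The exact ratio will move with your provider and model, but the gap is the point.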
In the AI world, we don’t compress to make things smaller. We compress to make them cheaper to “think about.” Every token generated is an act of cognition, and cognition, for machines, is expensive.
The new economics of compression
LLMs have redefined bottlenecks in ways that feel almost disrespectful to the past three decades of systems engineering. It used to be that you optimised network paths, minimised payloads, and pre-compressed assets so your application wouldn’t take six days to load on a 3G connection.
Now the slowest, most expensive component in the system isn’t the network at all. It’s the brain.
Every token an LLM emits demands GPU cycles, VRAM, energy, latency, and money. Lots of money, depending on which model you’ve fallen in love with this quarter. The cost of generating text now dwarfs the cost of transporting it, which means we’ve inverted the compression value chain.
We compress not to shrink the data, but to reduce the number of “thoughts” an AI has to “think”.
And honestly? That’s a very funny sentence, but it’s also the operational truth.
Where compression lives now
In the olden times, when we traversed networks uphill, both for uploading and downloading data, compression lived at the edge of the network in specialised devices. Eventually, it consolidated on application delivery controllers and took on names like “minification” and “HTTP compression.” For a time, it was specialised functionality. Today? It’s just part and parcel of application delivery.
But, thanks to AI, we’re seeing the emergence of new compression techniques. Probably because we have to. We’re no longer just compressing text using well-known algorithms. We’re striking out words like a Chicago- or AP-style editor with a pen full of red ink and something to prove.
Prompt compression
This is the new heavyweight champion. You shrink the prompt to shrink the invoice. Irrelevant details? Gone. Redundant context? Deleted. Overly chatty instructions? Trimmed like an overgrown hedge. The shorter the prompt, the fewer tokens consumed, and the happier your procurement department.
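As a rough illustration of the idea, the sketch below trims a prompt with three blunt rules: drop filler phrases, drop repeated sentences, and collapse whitespace. The filler list and the word-count "tokenizer" are hypothetical stand-ins; production prompt compression is usually more careful, and a real tokenizer will count differently.

```python
import re

# Hypothetical filler phrases; a real list would be tuned to your own prompts.
FILLER_PATTERNS = [
    r"\bplease\b",
    r"\bkindly\b",
    r"\bif you don't mind\b",
    r"\bI would like you to\b",
]


def rough_token_count(text: str) -> int:
    """Crude proxy: whitespace-separated words. Real tokenizers differ."""
    return len(text.split())


def compress_prompt(prompt: str) -> str:
    """Drop filler phrases, repeated sentences, and extra whitespace."""
    for pattern in FILLER_PATTERNS:
        prompt = re.sub(pattern, "", prompt, flags=re.IGNORECASE)

    seen = set()
    kept = []
    # Split on sentence-ending punctuation followed by whitespace.
    for sentence in re.split(r"(?<=[.!?])\s+", prompt):
        normalised = " ".join(sentence.lower().split())
        if normalised and normalised not in seen:
            seen.add(normalised)
            kept.append(" ".join(sentence.split()))
    return " ".join(kept)


if __name__ == "__main__":
    original = (
        "Please summarise the report. The report covers Q3 sales. "
        "The report covers Q3 sales. Kindly keep it short."
    )
    shorter = compress_prompt(original)
    print(shorter)
    print(rough_token_count(original), "->", rough_token_count(shorter))
```

Rules this blunt can also delete something the model needed, so it is worth checking answer quality before and after, not just the token count.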
Output compression
“Be concise” has quietly graduated from a writing preference to a cost-control strategy. Short answer = cheap answer. Long answer = someone’s paying for that verbosity.
Embedding compression
You’re not squeezing bytes with a codec here; you’re reducing dimensionality, which in turn reduces memory footprint, retrieval cost, and everything else your vector store is quietly billing you for every minute.
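A small sketch shows why dimensionality matters for the bill. The first function estimates the raw size of a dense index. The other two implement a Gaussian random projection, which is one generic way to shrink vectors (PCA and learned projections are others) and is used here purely as an illustration. The vector counts and dimensions in the usage example are hypothetical.

```python
import math
import random


def index_size_bytes(num_vectors: int, dims: int, bytes_per_value: int = 4) -> int:
    """Raw size of a dense float index, ignoring metadata and index overhead."""
    return num_vectors * dims * bytes_per_value


def make_projection(in_dims: int, out_dims: int, seed: int = 0) -> list:
    """Gaussian random projection matrix, scaled so lengths are roughly preserved."""
    rng = random.Random(seed)
    scale = 1.0 / math.sqrt(out_dims)
    return [
        [rng.gauss(0.0, 1.0) * scale for _ in range(in_dims)]
        for _ in range(out_dims)
    ]


def project(vector: list, matrix: list) -> list:
    """Map a vector into the lower-dimensional space (matrix-vector product)."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


if __name__ == "__main__":
    # Hypothetical index: one million vectors, before and after reduction.
    print(index_size_bytes(1_000_000, 1536) / 1e9, "GB at 1536 dims")
    print(index_size_bytes(1_000_000, 256) / 1e9, "GB at 256 dims")

    matrix = make_projection(in_dims=8, out_dims=3)
    print(project([1.0] * 8, matrix))
```

Fewer dimensions generally cost some retrieval accuracy, so the target size is a trade-off to measure rather than a number to pick off a shelf.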
Model compression
Pruning, quantisation, distillation. In another era, these were academic curiosities. Today, they serve one purpose: to run the model cheaper.
If it also runs faster? Wonderful. If it fits on a smaller GPU? Miraculous. But the point is, and always has been, to lower the compute burn.
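Of the three, quantisation is the easiest to show in a few lines. The sketch below is a minimal symmetric, per-tensor int8 scheme over a plain Python list. It illustrates the principle only, and the function names are made up for this example; real toolchains work on tensors and often use finer-grained scales.

```python
def quantise_int8(weights):
    """Symmetric per-tensor quantisation: floats -> ints in [-127, 127] plus a scale."""
    max_abs = max((abs(w) for w in weights), default=0.0)
    scale = max_abs / 127 if max_abs > 0 else 1.0
    quantised = [max(-127, min(127, round(w / scale))) for w in weights]
    return quantised, scale


def dequantise(quantised, scale):
    """Recover approximate floats from the int8 values."""
    return [q * scale for q in quantised]


if __name__ == "__main__":
    weights = [0.5, -1.27, 0.0, 1.0]
    q, scale = quantise_int8(weights)
    print(q, scale)
    print(dequantise(q, scale))
```

Each weight now fits in one byte instead of four, at the price of a rounding error of at most half the scale per weight.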
Compression as control
When your system’s most expensive operation is thinking, you start treating thoughts like a limited resource. This is the inverse of every performance model we’ve been taught. Network is cheap. Storage is cheap. CPU is cheap. Memory is cheap enough that we barely pretend to manage it anymore.
But GPU inference? That’s the new oil. And yes, I hate that cliché, but if the shoe fits, wear it.
And like oil, we now have a global economy dedicated to extracting every last drop efficiently. Compression is no longer a nicety; it’s a pillar of operational AI.
It’s how you stay inside budget, scale responsibly, avoid accidental million-dollar token overruns, and keep agents from rewriting War and Peace because you forgot to set a max-token limit.
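One way to picture that kind of guard rail is a small budget object that caps every call and refuses to hand out tokens once the allowance is gone. The class and method names below are hypothetical and not tied to any particular SDK. The value it returns is what you would pass as the model's maximum-output-tokens setting.

```python
class TokenBudget:
    """Tracks output tokens spent and caps what each call may request."""

    def __init__(self, total_tokens: int, per_call_cap: int):
        self.remaining = total_tokens
        self.per_call_cap = per_call_cap

    def max_tokens_for_next_call(self) -> int:
        """Limit to pass as the model's max-output-tokens setting."""
        return max(0, min(self.per_call_cap, self.remaining))

    def record(self, tokens_used: int) -> None:
        """Deduct what a finished call actually consumed."""
        self.remaining = max(0, self.remaining - tokens_used)

    @property
    def exhausted(self) -> bool:
        """True once the whole allowance has been spent."""
        return self.remaining == 0


if __name__ == "__main__":
    budget = TokenBudget(total_tokens=1000, per_call_cap=400)
    while not budget.exhausted:
        limit = budget.max_tokens_for_next_call()
        # A real agent would call the model here with `limit` as its cap;
        # this sketch simply assumes the model used the full allowance.
        budget.record(limit)
        print("spent", limit, "remaining", budget.remaining)
```

An agent loop that checks `exhausted` before each step stops when the money runs out, rather than when the novel is finished.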
We compress now not because our networks can’t handle the load, but because our AIs can’t handle the invoice. The future isn’t about making data smaller; it’s about making thinking cheaper.
Compression no longer serves the network. It serves the ledger. And if that gives rise to operational accounting, then don’t be surprised. I won’t.


