When a technical team says “everything’s stable,” management relaxes. It shouldn’t: the phrase can be neither verified nor challenged. It sounds like a guarantee, but it is only a feeling, and everyone reads their own meaning into it.
Oleksandr Zhartovskyi, a distributed-systems engineer who spent a long time building high-load platforms, has seen how costly this gap is for IT companies. In most of them, he is convinced, everything needed to measure reliability in precise numbers already exists. Let’s examine the beliefs that give product leaders a false sense of control — and what to replace them with.
“Stable” proves nothing
The word “stable” hides different meanings, and they rarely coincide between people. An engineer means a technical state: no alerts, services aren’t going down, errors are within the norm. A manager hears a business promise: there will be no failures, customers won’t leave. The first speaks of the system’s ability to survive failures; the second, of their absence.
The illusion is fed by a misconception: if a system runs for months without incidents, it is considered reliable. In reality, the silence means only that the system hasn’t yet been loaded hard enough to find its limit. Suppose a platform handles 1,000 requests per second for a year without a single complaint. Marketing launches a campaign, traffic jumps to 5,000 requests per second, and everything collapses within minutes. The system wasn’t reliable all that year. It simply never ran up against its ceiling, and no one knew where that ceiling was.
The same goes for tests, which people love to cite as proof of quality. Unit tests check what the developer anticipated. Cascading failures, memory exhaustion, and network breakdowns stay out of frame, because no one thought about them when writing the tests.
“High test coverage shows how much work went into testing,” Zhartovskyi says. “It says nothing about whether the system will hold up under a load no one anticipated.”
Mature teams look for the limit themselves, instead of waiting for customers to find it for them at peak hour. Load tests push two to three times more traffic than expected; deliberate failures shut down part of the services to see whether the rest holds.
“The trap appears where reliability isn’t expressed as a number. As long as it lives in words, it can’t be measured, compared, or promised to anyone. And even once a number finally appears, you have to know how to read it. ‘99.9% availability’ sounds solid but abstract until someone translates the figure into time: that’s almost 9 hours of downtime per year, or about 43 minutes a month. And that’s where the conversation becomes concrete: 43 minutes of downtime a month, is that acceptable or is it lost contracts? That question now has to be answered by the business,” Zhartovskyi explains.
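The translation from a percentage into time is simple arithmetic, and a minimal sketch of it follows. The function name and the 30-day month are assumptions made for illustration; they are not taken from any particular monitoring tool.

```python
def allowed_downtime_minutes(availability_pct: float, period_hours: float) -> float:
    """Minutes of downtime an availability target permits over a period."""
    unavailable_fraction = 1.0 - availability_pct / 100.0
    return unavailable_fraction * period_hours * 60.0


HOURS_PER_YEAR = 365 * 24   # ignores leap years
HOURS_PER_MONTH = 30 * 24   # assumes a 30-day month

for target in (99.0, 99.9, 99.99):
    hours_per_year = allowed_downtime_minutes(target, HOURS_PER_YEAR) / 60.0
    minutes_per_month = allowed_downtime_minutes(target, HOURS_PER_MONTH)
    print(f"{target}% -> {hours_per_year:.2f} h/year, {minutes_per_month:.1f} min/month")
```

For a 99.9% target this works out to about 8.76 hours per year and 43.2 minutes per 30-day month, which matches the figures quoted above.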
Three words that turn reliability into a number
The lack of a shared language hits hardest when the business sets a seemingly clear task. Zhartovskyi gives an example: management asks to “increase the system’s throughput.” Everyone nods. But the word hides three different things — requests passed to an external service, events saved to the database, operations processed by a key module. Until the team has agreed on what exactly it is measuring, one engineer optimizes writes to the database, a second speeds up the transfer outward, a third widens the queue.
To avoid the confusion, engineers use three things: a quality indicator (SLI), a target level of reliability (SLO), and a budget of allowable failures (Error Budget). Two of them are management decisions.
SLI is the indicator you measure: a specific value tied to what the customer feels. Instead of an abstract “speed,” it is something measurable: “the share of requests processed faster than 200 milliseconds” or “the time from payment to order confirmation.” Choosing the SLI is choosing what exactly matters to the user.
SLO is the target you consider acceptable for that indicator. Say, “99.5% of requests faster than 200 ms over a month.” The figure seems technical, though in fact it’s about money: each additional “nine” costs several times more. Going from 99.9% to 99.99% means cutting allowable downtime from 43 minutes a month to about 4, and that can mean months of the team’s work and doubled infrastructure. Whether that difference is worth the money is for the business to decide.
The error budget is the most useful of the three tools, and the one managers usually haven’t heard of. If your target is 99.9%, then the remaining 0.1% is the permitted amount of failure, those same 43 minutes of downtime a month. It’s an allowance, a resource built into the target. As long as the budget isn’t spent, the team can take risks: ship new features, experiment, speed up releases. The moment the budget is used up, all effort goes to stabilization, and new features wait.
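The three terms can be tied together in a few lines. The sketch below is only an illustration: the function names, the request-based way of counting the budget, and the sample latencies are assumptions, while the 200 ms threshold and the 99.5% target are the examples used above.

```python
from typing import Sequence


def latency_sli(latencies_ms: Sequence[float], threshold_ms: float = 200.0) -> float:
    """SLI: share of requests that finished faster than threshold_ms (0.0 to 1.0)."""
    if not latencies_ms:
        raise ValueError("no requests in the measurement window")
    fast = sum(1 for value in latencies_ms if value < threshold_ms)
    return fast / len(latencies_ms)


def error_budget_remaining(sli: float, slo_target: float) -> float:
    """Fraction of the error budget still unspent.

    1.0 means untouched, 0.0 means fully spent, negative means overspent.
    """
    budget = 1.0 - slo_target
    if budget <= 0.0:
        raise ValueError("a 100% target leaves no error budget")
    return 1.0 - (1.0 - sli) / budget


# Hypothetical month: 100,000 requests, 300 of them slower than 200 ms.
latencies = [100.0] * 99_700 + [450.0] * 300
SLO_TARGET = 0.995  # "99.5% of requests faster than 200 ms over a month"

sli = latency_sli(latencies)
remaining = error_budget_remaining(sli, SLO_TARGET)
can_ship_features = remaining > 0.0

print(f"SLI {sli:.3%}, SLO met: {sli >= SLO_TARGET}, budget left: {remaining:.0%}")
print("ship features" if can_ship_features else "stabilize first")
```

In this made-up month the team has spent 300 of the 500 slow requests the target allows, so roughly 40% of the budget is left and releases can continue.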
You already have this — it just needs to be written down
The best news for a manager is that almost nothing has to be implemented from scratch. In a mature team all the components of an SLO already exist, just scattered in different places. Somewhere there’s monitoring with load and error metrics. Somewhere alerts are set on critical thresholds. Somewhere there lives an unwritten agreement to keep a certain level of availability, and there’s a long-standing understanding with the business about what load the system is obliged to withstand. Just one step is missing — bringing this together into a single recorded document that everyone can see.
“An unwritten SLO lives in the heads of five to ten people who have worked together for a long time,” Zhartovskyi explains. “While the team is small, it works. The moment new people and remote teams arrive, every conversation starts with figuring out what we even mean. Writing down the SLO is a way to stop renegotiating from scratch every time.”
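What “writing it down” might look like in its smallest form is sketched below. The record fields, the service name, and the owner are hypothetical and chosen only for illustration; a real team would record whatever it has actually agreed on, in whatever format it already uses.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SloRecord:
    """One recorded objective: what is measured, the target, and who owns it."""
    service: str
    sli_description: str
    target_pct: float
    window_days: int
    owner: str

    def describe(self) -> str:
        return (
            f"{self.service}: {self.target_pct}% of {self.sli_description} "
            f"over {self.window_days} days (owner: {self.owner})"
        )


# Hypothetical entry; the names are placeholders, not a real system.
checkout_slo = SloRecord(
    service="checkout-api",
    sli_description="requests served faster than 200 ms",
    target_pct=99.5,
    window_days=30,
    owner="platform-team",
)
print(checkout_slo.describe())
```

The point of such a record is less the format than the fact that a new engineer or a remote team can read the same sentence everyone else reads.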
For a manager, a recorded figure turns into several entirely practical levers. It becomes the language of conversation with a client and an investor. The words “we’re reliable” can’t be sold or written into a contract. “99.5% of requests faster than 200 milliseconds, measured and confirmed” — can.
It provides an honest limit where the business is inclined to overpromise. If the system depends on external services with their own constraints, the SLO fixes the real ceiling of what’s possible. A manager armed with this figure won’t promise a major client something the platform physically can’t withstand — and won’t find that out from a loud failure.
Most important, metrics stop serving only as a reaction to a breakdown. A recorded target turns them into an early-warning system: you can see the trend, see the approach to the limit, and make a decision before the customer notices the problem.
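One common way to turn a recorded target into an early warning is a “burn rate”: how fast the error budget is being spent compared with the pace that would use it up exactly at the end of the window. The article does not name this technique; the sketch below brings it in as an illustration, and the function names and sample figures are assumptions.

```python
def burn_rate(observed_bad_fraction: float, slo_target: float) -> float:
    """How many times faster than sustainable the error budget is being spent.

    1.0 means the budget would run out exactly at the end of the window.
    """
    budget_fraction = 1.0 - slo_target
    if budget_fraction <= 0.0:
        raise ValueError("a 100% target leaves no error budget")
    return observed_bad_fraction / budget_fraction


def days_until_exhausted(
    remaining_budget_fraction: float, rate: float, window_days: float = 30.0
) -> float:
    """Days until the remaining budget is gone if the current rate holds."""
    if rate <= 0.0:
        return float("inf")
    return remaining_budget_fraction * window_days / rate


# Hypothetical reading: 0.4% of recent requests failed against a 99.9% target,
# and half of the monthly budget is still unspent.
rate = burn_rate(observed_bad_fraction=0.004, slo_target=0.999)
days_left = days_until_exhausted(remaining_budget_fraction=0.5, rate=rate)
print(f"burn rate {rate:.1f}x, budget gone in about {days_left:.1f} days")
```

A reading like this gives the team a date rather than a feeling: at the current pace the budget is gone in a few days, well before a customer would describe the service as down.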
Today Zhartovskyi holds the position of Head of System Applications Development and works in the United States, where he consults engineering teams. The approach he advocates comes down to a single idea: an SLO doesn’t make a system more reliable on its own. It makes reliability visible — translating it from the language of feelings into the language of numbers, in which a manager and an engineer finally talk about the same thing. So it’s worth starting simple: gather what the team already measures and write it down in one document. Most likely, it will turn out that the company already has an SLO. It just needs to be said out loud.


