
The Reliability Gap: Why Your AI System Will Fail in Production (And How to Prevent It)

By Karan Luniya

When people talk about AI failures in production, the conversation usually focuses on model performance. The model was wrong. The predictions were inaccurate. The output was biased. These are real problems and they matter. 

They are not the problem I spend most of my time thinking about.  

I am a backend infrastructure engineer. I build the systems that keep real-time applications running at scale. And in my experience, the AI failures that cause the most disruption in production environments are not model failures. They are operational failures: systems that were not designed for the demands of continuous availability, that cannot handle the traffic patterns that AI workloads create, or that break in ways that have nothing to do with the model’s accuracy and everything to do with the infrastructure it runs on. 

At DoorDash, I build backend infrastructure for mission-critical logistics systems. Every order, every driver assignment, every delivery estimate flows through systems I help design. At Conviva, I built multi-cluster architecture that supported over 20 million concurrent viewers and 100 million daily sessions without disruption for streaming services including Hulu, Disney+, and Sky Network. What I have learned from these environments is that deploying AI into a production system changes the operational requirements in ways that traditional software engineering practices were not designed to handle. 

In my experience, AI deployments fail at the infrastructure layer more often than at the model layer. And most organizations do not realize it until something breaks.

What AI Changes About Operational Demands  

Before getting into specific failure modes, it is worth being precise about what is different about running AI in production compared to traditional software.  

In a conventional backend system, the code is deterministic. A request comes in, the system processes it, and the output is a direct function of the input and the programmed logic. Testing is tractable because you can enumerate inputs and verify outputs. Deployment is relatively safe because the behavior of the system is predictable. 

AI inference is different. The model’s behavior is statistical, not deterministic. The same input can produce different outputs across model versions. The system’s resource consumption varies based on input complexity, not just request volume. And the latency profile of an inference request can be significantly higher and more variable than a traditional API call. 

These differences create operational demands that do not exist for conventional backend services. 

When I led the design and rollout of multi-cluster architecture at Conviva for major streaming services, one of the key challenges was managing traffic spikes that did not follow traditional patterns. Even so, streaming traffic has predictable peaks during live events: the patterns are learnable and the infrastructure can be provisioned accordingly. AI inference traffic does not always behave that way. An unexpected surge in complex queries, a batch of unusually long input sequences, or a shift in the distribution of requests can cause resource consumption to spike in ways that are difficult to predict from monitoring traditional backend metrics.

This is why the infrastructure that supports AI inference needs to be designed differently from the infrastructure that supports conventional services. 

The Five Infrastructure Failure Modes I Keep Seeing 

In distributed systems engineering, there are categories of failure that repeat across environments and architectures. After building systems at Conviva and now at DoorDash, I have identified five failure modes that appear consistently in AI production deployments.  

Latency spikes cascade into timeouts. AI inference latency is more variable than traditional API latency. A model that averages 50 milliseconds per inference can occasionally take 500 milliseconds when processing complex inputs. In a synchronous system where downstream services wait for inference results, these latency spikes create backpressure that cascades upward through the call chain. Services that depend on the AI system start timing out, triggering retries, which increase load on the inference service, which worsens the latency spike. I have seen this pattern play out in systems that were not designed with async processing and circuit breakers in place. 

The solution is architectural. Inference requests should be handled asynchronously where possible, with proper timeout and retry logic that accounts for the higher variability in AI response times. Service meshes and load balancing strategies that work fine for traditional services may not be sufficient for AI inference workloads without modification. 
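To make the circuit-breaker idea concrete, here is a minimal sketch in Python. It is an illustration under stated assumptions, not the implementation used in any system described in this article: the class name, the `failure_threshold` and `reset_after_s` parameters, and the injectable `clock` are all hypothetical, and the sketch is not thread-safe.

```python
import time


class CircuitOpenError(Exception):
    """Raised when the breaker is open and the call is rejected immediately."""


class CircuitBreaker:
    """Stops calling a failing dependency for a cool-down period.

    Hypothetical sketch: names and defaults are illustrative assumptions.
    """

    def __init__(self, failure_threshold=3, reset_after_s=30.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.reset_after_s = reset_after_s
        self.clock = clock
        self.failures = 0
        self.opened_at = None  # None means the circuit is closed

    def call(self, fn, *args, **kwargs):
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_after_s:
                # Fail fast instead of adding load to a struggling service.
                raise CircuitOpenError("inference backend marked unhealthy")
            # Cool-down elapsed: let one trial call through (half-open).
        try:
            result = fn(*args, **kwargs)
        except Exception:
            self.failures += 1
            # A failed trial call, or too many consecutive failures, (re)opens the circuit.
            if self.opened_at is not None or self.failures >= self.failure_threshold:
                self.opened_at = self.clock()
            raise
        # Any success closes the circuit and resets the failure count.
        self.failures = 0
        self.opened_at = None
        return result
```

The point of the design is that callers fail fast while the inference service recovers, so timeouts stop turning into retries that deepen the latency spike. A caller would typically catch `CircuitOpenError` and fall back to a cached or default result.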

Resource exhaustion from input variability. Traditional backend services consume resources in proportion to request volume. AI inference consumes resources in proportion to the complexity of each request. A request with a long input sequence, a complex document, or an unusually detailed image will consume significantly more compute than a simple request. Without visibility into this variation, teams provision capacity based on average request complexity and discover the problem only when a distribution shift causes resource exhaustion. 

Building resilient systems requires instrumentation that captures input complexity alongside request volume, and capacity planning models that account for the variance, not just the mean.  
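A small sketch shows why the mean is misleading. It assumes load is measured as tokens processed per second (one possible proxy for input complexity) and that each replica can sustain a fixed token budget; the numbers, the `replicas_needed` helper, and the nearest-rank `percentile` helper are hypothetical and chosen only for illustration.

```python
import math
import statistics


def percentile(samples, pct):
    """Nearest-rank percentile: the smallest sample with at least pct% of samples at or below it."""
    if not samples:
        raise ValueError("samples must be non-empty")
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def replicas_needed(load_per_s, replica_capacity_per_s):
    """Replicas required to absorb a given complexity-weighted load."""
    return math.ceil(load_per_s / replica_capacity_per_s)


# Hypothetical per-second token load: mostly light, with occasional heavy windows.
window_loads = [50_000] * 95 + [400_000] * 5
capacity = 100_000  # assumed tokens per second one replica can sustain

mean_load = statistics.mean(window_loads)
p99_load = percentile(window_loads, 99)

print("replicas sized for the mean:", replicas_needed(mean_load, capacity))
print("replicas sized for the p99:", replicas_needed(p99_load, capacity))
```

With these made-up numbers, sizing for the mean yields one replica while sizing for the 99th percentile yields four. Request counts could be identical in the light and heavy windows, which is why request volume alone would not reveal the gap.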

Model versioning breaks backward compatibility silently. Software deployments typically follow semantic versioning. When a service changes its behavior, the change is intentional and the new behavior is the intended output. With AI models, a new version can produce different outputs for the same inputs without any explicit change to the system's code or interface. If downstream services depend on specific output formats or value ranges, a model update can break integrations without triggering any traditional deployment failure signal.

At DoorDash, where real-time logistics decisions depend on ML-driven systems, we treat model deployments like we treat any infrastructure change: with staged rollouts, canary deployments, and rollback capability. This is not how most organizations deploy models today.  
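One common way to implement the traffic split behind a canary rollout is deterministic hashing. The sketch below is a generic illustration, not a description of DoorDash's tooling: the function name, the `v1` and `v2` version labels, and the choice of a request key are assumptions.

```python
import hashlib


def model_version_for(request_key, canary_percent, stable="v1", canary="v2"):
    """Route a stable fraction of keys to the canary model version.

    Hashing the key (rather than picking at random) keeps each key on the
    same version across requests, and raising canary_percent only ever
    adds keys to the canary group.
    """
    digest = hashlib.sha256(request_key.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") % 100  # 0..99
    return canary if bucket < canary_percent else stable


# Staged rollout: widen the canary gradually; set it back to 0 to roll back.
for percent in (1, 5, 25, 100):
    print(percent, model_version_for("order-12345", percent))
```

Because routing is a pure function of the key and the percentage, a rollback is a configuration change rather than a redeploy, and output comparisons between the two groups stay consistent over time.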

Data pipeline failures are invisible until they are not. Many AI systems depend on data pipelines that feed training data, feature stores, and real-time inference inputs. When these pipelines fail, the AI system continues to operate, but on stale or degraded data. Unlike an API failure, which is immediately visible, a data pipeline failure can degrade model quality gradually, making it difficult to detect until the output divergence is significant enough to trigger user complaints or business metric impacts. 

Building reliable AI systems requires monitoring the health of the data supply chain, not just the inference endpoint. 
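A freshness check is one of the simplest signals to add. The sketch below assumes the pipeline records a last-updated timestamp per feature; the function name, the feature names, and the 15-minute threshold are hypothetical.

```python
from datetime import datetime, timedelta, timezone


def find_stale_features(last_updated, max_age, now=None):
    """Return the names of features whose last update is older than max_age."""
    if now is None:
        now = datetime.now(timezone.utc)
    return sorted(name for name, ts in last_updated.items() if now - ts > max_age)


# Hypothetical snapshot of when each feature was last refreshed.
now = datetime(2024, 1, 1, 12, 0, tzinfo=timezone.utc)
last_updated = {
    "driver_location": now - timedelta(minutes=2),
    "restaurant_prep_time": now - timedelta(hours=3),
}

stale = find_stale_features(last_updated, max_age=timedelta(minutes=15), now=now)
print("stale features:", stale)
```

Alerting on the result turns a silent degradation into a visible event: the inference endpoint may still be returning successful responses while one of its inputs is hours old.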

Scaling is not just horizontal. Traditional backend services scale reasonably well with horizontal replication. Adding more instances increases capacity roughly linearly. AI inference scaling is more complex because the inference runtime has resource characteristics that do not always benefit from simple horizontal scaling. GPU memory constraints, batch sizing trade-offs, and model parallelism requirements mean that scaling an AI inference system requires understanding the specific hardware and software characteristics of the inference environment, not just the request distribution.

This is why the team that can scale a microservice architecture reliably may struggle with AI inference scaling if they do not have the right expertise. 

What Resilient AI Infrastructure Actually Looks Like 

Building systems that handle 20 million concurrent viewers, as I did at Conviva, or millions of real-time logistics decisions, as I do at DoorDash, requires a consistent approach to reliability that applies to AI infrastructure as much as to any other critical system.  

Observability must go deeper than request metrics. Traditional monitoring tracks request rates, error rates, and latency percentiles. AI infrastructure also needs to track model performance metrics, input distribution statistics, data pipeline health, and resource utilization patterns that vary with inference complexity. Without this deeper visibility, the on-call engineer responding to an incident does not have the context to understand what is actually happening. 

Chaos engineering applies to AI systems. At Conviva, we ran chaos experiments to understand how our multi-cluster architecture behaved under component failures. This practice is even more important for AI systems, where the failure modes are less well understood and the blast radius of a failure is larger because of the business dependencies on model outputs. Teams should be deliberately introducing failures, network partitions, and degradation scenarios to understand how their AI systems behave under stress. 

The deployment pipeline matters as much as the model. Organizations spend significant effort on model development and comparatively little on model deployment infrastructure. The teams that run AI reliably in production have invested as much in the deployment pipeline, testing framework, and rollback mechanisms as they have in the model itself. Model deployment is not a software deployment with a different artifact. It requires its own engineering discipline. 

Cross-functional ownership reduces handoff risk. In many organizations, AI systems are owned by separate data science and engineering teams, with a handoff point where the data science team delivers a model and the engineering team deploys it. This handoff is where things break. The engineering team does not fully understand the model’s behavior, and the data science team does not fully understand the operational requirements. Organizations that build reliable AI systems have engineers who understand AI and data scientists who understand operations. The silos do not survive contact with production. 

The DoorDash Example  

Real-time logistics is one of the more demanding environments for AI infrastructure. Every order involves real-time decisions about driver assignment, routing, and delivery timing. Those decisions are powered by ML models that must produce results in milliseconds, handle traffic spikes during peak meal times without degradation, and maintain reliability across thousands of concurrent operations. 

The infrastructure that supports this has to be designed for the operational profile of AI inference, not retrofitted from a traditional backend architecture. That means async processing for non-time-critical inference. It means caching and feature pre-computation to reduce inference latency. It means staged rollouts and shadow mode testing for model changes. It means monitoring that goes beyond endpoint health to track data pipeline integrity and model performance drift. 
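Shadow mode testing can be sketched in a few lines. This is a simplified illustration and not the production design: it assumes both models return a single number, it runs the shadow call inline for clarity (a real system would normally move it off the request path), and all names are hypothetical.

```python
def serve_with_shadow(features, primary, shadow, tolerance, on_divergence):
    """Serve the primary model's result while evaluating a candidate in shadow.

    The shadow model's output is never returned to the caller. Its failures
    and disagreements are reported through on_divergence instead.
    """
    result = primary(features)
    try:
        candidate = shadow(features)
    except Exception as exc:
        # A crashing candidate must not affect live traffic.
        on_divergence(features, result, exc)
        return result
    if abs(candidate - result) > tolerance:
        on_divergence(features, result, candidate)
    return result


# Hypothetical usage: two stand-in models that estimate delivery minutes.
divergences = []
eta = serve_with_shadow(
    {"distance_km": 3.2},
    primary=lambda f: 22.0,
    shadow=lambda f: 31.0,
    tolerance=5.0,
    on_divergence=lambda f, live, cand: divergences.append((live, cand)),
)
```

Here the caller still receives the primary estimate, and the disagreement is recorded for review before the candidate model is allowed to serve any traffic.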

What it does not mean is treating AI as a black box that the infrastructure team has to keep running without visibility into how it works. The engineers who build the infrastructure need to understand the ML systems they support, and the data scientists who build the models need to understand the operational constraints of the systems they run on. 

Building for Reliability Is a Choice 

The gap between a working AI model and a reliable AI system is significant. Closing it requires investment in infrastructure that does not directly improve model quality but that determines whether the model can be trusted in production. It requires engineering practices like chaos testing, staged rollouts, and comprehensive observability that are standard in infrastructure engineering but not yet standard in AI development. And it requires organizational structures that break down the silos between data science and engineering. 

This is a choice. Organizations can continue to treat AI as a research project that gets handed off to production when the model is ready. Or they can invest in the infrastructure discipline that makes AI a reliable part of their operational stack. 

The organizations that are making that investment now are the ones that will have a structural advantage when AI becomes a critical dependency rather than an experimental capability.

The reliability gap is real. But it is closeable. 
