Interview

Designing for Failure in the Age of AI with Sunil Thamatam

As AI reshapes how modern software is built and operated, the question of resilience has taken on new urgency. Systems are scaling faster, traffic patterns are less predictable, and model-driven workloads introduce variability that traditional architectures were never designed to handle. Few engineers have navigated this evolution as deeply as Sunil Thamatam.

Sunil is a principal software engineer with more than twenty years of experience building large scale distributed services for global SaaS companies. His work spans scalable architecture, identity and access management, and high volume platforms that power complex business processing. Over the course of his career at companies including Oracle, Anaplan, and Twilio, he has helped design systems that remained stable through outages, traffic spikes, and shifting business demands.

In this interview with The AI Journal, Sunil reflects on how early experiences building middleware systems shaped his thinking about predictability and resource discipline, why foundational architectural decisions determine long-term resilience, and how engineering teams must rethink reliability in the age of AI. From fault tolerance tradeoffs and IAM failure modes to AI-driven incident response and hybrid system design, he offers a practitioner’s perspective on building platforms that assume failure, scale responsibly, and remain adaptable in a rapidly changing technological landscape.

To start, how did you get started in software engineering, and what early experiences shaped the way you think about building resilient systems in the age of AI?

I started my journey as a computer science graduate with a keen interest in programming languages and the evolution of computer architecture. My first work was on platforms and application servers, where every line of code has to be carefully crafted; even a verbose log statement (an issue I ran into firsthand) can crash customer systems by consuming more resources than necessary. These were robust, resilient, general-purpose systems known as “middleware”, and they helped scale e-commerce and enterprise software for 15 to 20 years.

Over time, cloud providers and services caught up, and infrastructure can now be spun up and disposed of instantly. Frugal resource use and long, stable uptimes have slowly disappeared as measures of quality, yet non-resilient systems still carry higher operating costs and mounting technical debt.

AI definitely helps teams iterate faster on every aspect of building resilient systems; an engineer who understands the technology and has prior experience can move 10x faster. Resiliency is a paradigm every engineer has to keep in mind. Some of the patterns, like a simple circuit breaker or retry, are easy to digest and implement. But it’s not up to the code generator; it’s up to the system designer to decide which course to take while building a software system.
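
The circuit-breaker pattern mentioned above can be captured in a few lines. The sketch below is a minimal, illustrative Python version (class and parameter names are my own, not from any specific library): it fails fast once a threshold of consecutive errors is reached, then allows a trial call after a cooldown.

```python
import time

class CircuitBreaker:
    """Minimal illustrative circuit breaker: opens after `max_failures`
    consecutive failures, then rejects calls until `reset_after`
    seconds have passed, at which point one trial call is allowed."""

    def __init__(self, max_failures=3, reset_after=30.0):
        self.max_failures = max_failures
        self.reset_after = reset_after
        self.failures = 0
        self.opened_at = None  # None means the circuit is closed

    def call(self, fn, *args, **kwargs):
        if self.opened_at is not None:
            if time.monotonic() - self.opened_at < self.reset_after:
                raise RuntimeError("circuit open: failing fast")
            self.opened_at = None  # half-open: allow one trial call
        try:
            result = fn(*args, **kwargs)
        except Exception:
            self.failures += 1
            if self.failures >= self.max_failures:
                self.opened_at = time.monotonic()
            raise
        self.failures = 0  # success resets the failure count
        return result
```

A production implementation would add per-endpoint state, metrics, and thread safety; the point here is only how little machinery the core idea requires.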

When you are designing architectural foundations for fast-growing SaaS companies, what early decisions most strongly influence long-term resilience and reliability?

Most SaaS companies begin with familiar open source stacks, often some variation of LAMP: Linux, Apache, MySQL, and PHP (or Python, or Perl). Sometimes components are swapped, such as Nginx replacing Apache or Postgres replacing MySQL, but the overall structure usually remains a simple three-tier model: client, middleware, and database.

This setup makes it easy to move from idea to MVP quickly. It is accessible, well-documented, and cost-effective. The challenge appears when the product gains traction. What worked for early validation does not always scale under real production load.

Very few companies choose their long-term architecture correctly on day one. Early pivots can be relatively inexpensive. Google famously began with Python and, within the first six months, rewrote much of its core in C++. At that stage, the cost of change was manageable.

Over time, however, architectural shifts become far more expensive. Twitter’s migration from Ruby on Rails to a JVM-based stack required significant engineering investment. Meta, rather than abandoning PHP, built its own virtual machine, HHVM, to optimize performance at scale. In both cases, the decision to evolve core technology required substantial time, capital, and organizational effort.

The earlier architectural tradeoffs are addressed, the easier and less costly the evolution tends to be.

You have built platforms that remained stable during outages and traffic spikes. How do you design systems that assume failure while still supporting rapid growth?

Every system starts with a clear problem. While the code may be new, the fundamentals usually map back to core computer science and distributed systems principles.

  1. Make the Right Architectural Choices
    From design to deployment, every component matters: design patterns, read versus write load, data storage, caching, microservices boundaries, async processing, deployment platforms, and resiliency. Small infrastructure decisions can have major cost and performance impacts. Architecture is both technical and business critical. 
  2. Define the Right Scope
    Architects should balance speed with scale. Early designs should support 5 to 10 times growth through horizontal scaling, stateless services, sharding, queue-based load leveling, and regional isolation where appropriate. 
  3. Execute with Discipline
    Architecture must be enforced in execution. Code reviews, transaction boundaries, and distributed systems integrity all matter. Even with AI-generated code, validation is essential. Small, focused teams aligned to clear problem domains drive cleaner systems and stronger accountability.
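
One way to picture the sharding mentioned in the second point is a consistent-hash ring that routes each tenant to a shard, so adding capacity moves only a fraction of tenants. The Python sketch below is purely illustrative (the `ShardRing` name and virtual-node count are my assumptions; production systems typically rely on a library or the datastore's own partitioning):

```python
import hashlib
from bisect import bisect

class ShardRing:
    """Illustrative consistent-hash ring mapping tenant keys to shards.
    Each shard is placed on the ring at `vnodes` points to smooth out
    the distribution of tenants across shards."""

    def __init__(self, shards, vnodes=64):
        self.ring = sorted(
            (self._hash(f"{shard}#{v}"), shard)
            for shard in shards for v in range(vnodes)
        )
        self.keys = [h for h, _ in self.ring]  # sorted hash positions

    @staticmethod
    def _hash(key):
        # MD5 is fine here: we need an even spread, not security.
        return int(hashlib.md5(key.encode()).hexdigest(), 16)

    def shard_for(self, tenant_id):
        # First ring position at or after the tenant's hash, wrapping.
        idx = bisect(self.keys, self._hash(tenant_id)) % len(self.keys)
        return self.ring[idx][1]
```

Because routing is deterministic, stateless service instances can all compute the same tenant-to-shard mapping without coordination, which is what makes horizontal scaling straightforward.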

Based on your experience at Oracle, Anaplan, and Twilio, how do resiliency strategies differ between large-scale enterprise platforms and developer-first SaaS products?

Companies often approach the same technical problem with very different strategies. For example, AWS outlines three primary models for building resilient multi-tenant SaaS systems.

  1. Infrastructure-level isolation, where each customer has dedicated infrastructure. 
  2. Cell architecture, where groups of customers share segmented infrastructure. 
  3. Pure SaaS, where all customers share a common infrastructure layer.

Each model can deliver resiliency and redundancy, but the tradeoffs differ. Some approaches rely more heavily on hardware isolation, others on software-level controls and logical separation.

Customer expectations ultimately drive these architectural decisions. Enterprise clients often require dedicated support, strict audit trails, compliance assurances, and predictable performance, even if that means slower change cycles. Developer-first customers prioritize transparent status updates, rapid incident response, stable APIs, and self-service recovery.

These differing priorities shape architecture choices all the way down to individual services.

As SaaS platforms begin to incorporate AI-driven features and model-based workloads, what new reliability risks do engineering teams need to plan for at the architectural level?

Traditional software was largely predictable. Well-architected, well-tested systems produced consistent outputs, with limited unintended behavior. AI-driven systems change that dynamic. The same prompt can generate different answers, and without clearly defined boundaries around data access and system integrations, AI agents can create risk quickly.

Key reliability risks include latency variability, where responses may take milliseconds or several seconds, requiring strict timeouts. Cost becomes a reliability factor as well, since token usage can spike unexpectedly, demanding rate limits, per tenant budgets, and spend-based circuit breakers. Model availability is another concern, as third-party APIs can fail, making fallback strategies essential. Output validation is critical to detect hallucinations or inappropriate responses, which requires semantic scoring rather than traditional testing. Security boundaries must also be enforced to prevent prompt injection, data leakage, or unintended system access.
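
Several of these guardrails, per-tenant budgets and spend-based circuit breakers in particular, reduce to simple accounting. Below is a hedged Python sketch; the class name, budget, and window values are illustrative assumptions, not recommendations:

```python
import time

class TenantSpendGuard:
    """Illustrative per-tenant spend guard for model calls: once a
    tenant's token budget for the current window is exhausted, further
    requests are rejected instead of silently running up costs."""

    def __init__(self, budget_tokens=100_000, window_s=3600):
        self.budget = budget_tokens
        self.window_s = window_s
        self.usage = {}  # tenant_id -> (window_start, tokens_used)

    def reserve(self, tenant_id, tokens):
        now = time.monotonic()
        start, used = self.usage.get(tenant_id, (now, 0))
        if now - start >= self.window_s:
            start, used = now, 0  # new accounting window
        if used + tokens > self.budget:
            return False  # trip: fail fast instead of overspending
        self.usage[tenant_id] = (start, used + tokens)
        return True
```

In a real platform the same check would live behind the model-gateway layer, alongside the strict timeouts and fallback models described above, so a runaway prompt degrades one tenant's experience rather than the whole platform's bill.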

Strong governance and guardrails, using tools such as Nvidia NeMo or LLMGuard, help contain these risks. Like roads and traffic rules enabled automobiles to scale safely, AI systems require structured controls and regulatory boundaries to ensure sustainable and reliable growth.

Fault tolerance is often discussed in theory. What practical tradeoffs have you seen teams struggle with most when implementing it in real production environments?

Tradeoffs for this strategy include the following:

  • Complexity vs. reliability: Circuit breakers and bulkheads make systems more reliable but also harder to debug, test, and reason about. Junior engineers struggle to understand the failure modes.
  • Latency vs. resilience: Retries with exponential backoff increase availability but can turn a 50ms API call into a multi-second timeout cascade under load. Product teams hate the latency hit.
  • Cost vs. redundancy: Running active-active datacenters or keeping warm standby capacity costs real money every month, even when you’re not failing over. Finance pushes back.
  • Development velocity vs. operational safety: Every new resilience pattern (timeouts, retries, fallbacks) adds code, testing surface area, and deployment risk. Teams move more slowly.
  • Consistency vs. availability: CAP theorem isn’t theoretical. Every architect has watched a team struggle with whether to sacrifice strong consistency for partition tolerance or vice versa.
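
The latency-versus-resilience tradeoff above can be made concrete: a retry helper can bound its own damage by pairing exponential backoff and jitter with an overall deadline, so retries never turn a fast call into an unbounded stall. This is a sketch under assumed parameter values, not a drop-in implementation:

```python
import random
import time

def call_with_retries(fn, max_attempts=4, base_delay=0.05, deadline_s=2.0):
    """Retry `fn` with exponential backoff and full jitter, capped by
    an overall deadline. Parameter values here are illustrative only."""
    start = time.monotonic()
    for attempt in range(max_attempts):
        try:
            return fn()
        except Exception:
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the failure
            # Full jitter: sleep a random amount up to the backoff cap,
            # which avoids synchronized retry storms across callers.
            delay = random.uniform(0, base_delay * (2 ** attempt))
            if time.monotonic() + delay - start > deadline_s:
                raise  # latency budget exhausted: fail now, not later
            time.sleep(delay)
```

The deadline is the piece teams most often forget; without it, well-intentioned retries are exactly what produces the multi-second timeout cascades described above.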

Identity and access management play a critical role in system stability. How do IAM decisions affect incident response and recovery during high-impact failures?

Circular dependency failures occur when your IAM system depends on infrastructure that is itself managed by IAM. A locked-out automation account can prevent the very remediation scripts meant to restore it.

Federated identity collapse occurs when an IdP outage propagates across all systems that rely on it. Organizations with no fallback authentication strategy discover their “distributed” systems are actually centrally dependent on a single auth path.

Privilege escalation during incident chaos is a security risk that incident conditions create, as legitimate responders are granted elevated permissions quickly and informally, often without proper audit trails, creating exposure that outlasts the incident itself.

Separating infrastructure resources from user management can mitigate some of these issues, and domain isolation helps eliminate a single point of failure.

AI is increasingly used in monitoring and operational tooling. Where have you seen it genuinely improve system resilience, and where do you think human judgment still matters most?

AI has a clear advantage in incident response. It can analyze metrics, logs, and failure patterns in real time, even at midnight when human teams are offline. Used correctly, it improves both speed and accuracy in root cause analysis.

  1. AI driven auto remediation
    AI can trigger predefined remediation workflows based on patterns it detects, reducing mean time to recovery. 
  2. Intelligent root cause analysis
    By correlating logs, metrics, and error signals, AI can generate incident summaries, timelines, and documentation automatically. 
  3. Human validation remains essential
    AI should assist, not replace, engineering judgment. Human review is critical to validate findings and make the final call. 

In the SRE world, AI becomes especially powerful when models are trained on standardized metrics, log formats, and error taxonomies. Consistency in data inputs dramatically improves reliability and signal quality.

Looking five to ten years ahead, architectural longevity depends on flexibility. Well designed systems must allow adaptable access patterns, communication layers, and storage models. Traditional software favors predictability, but that often introduces rigidity. Modern platforms rely on API contracts, and if those contracts are too narrow, systems become tightly bound to specific use cases, requiring frequent redesign.

Architectures that accommodate evolving data formats and unknown future requirements remain more reusable and resilient. Flexible data models, such as key value stores like DynamoDB, Cassandra, or Redis, enable systems to adapt without constant rearchitecture. Designing for change, rather than locking into rigid structures, is what allows platforms to scale reliably over time.
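
The "design for change" idea can be made concrete with a schema-flexible document store. In the sketch below a plain Python dict stands in for DynamoDB, Cassandra, or Redis (all class and field names are hypothetical); the point is that records are self-describing JSON, so new fields can appear without a migration and old records keep working:

```python
import json

class TenantDocStore:
    """Illustrative schema-flexible key-value store. A dict stands in
    for a real KV backend; records are JSON documents, so readers
    tolerate fields that older records never had."""

    def __init__(self):
        self._rows = {}

    def put(self, key, doc):
        self._rows[key] = json.dumps(doc)

    def get(self, key, default=None):
        raw = self._rows.get(key)
        return json.loads(raw) if raw is not None else default
```

Readers then handle evolution defensively, e.g. `doc.get("plan", "free")`, so records written before the `plan` field existed still resolve to a sensible default. That defensive-read convention, rather than the storage engine itself, is what keeps the system adaptable.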

 

Author

  • Tom Allen

    Founder and Director at The AI Journal. Created this platform with the vision to lead conversations about AI. I am an AI enthusiast.
