
As AI systems move from controlled research environments into complex, high-stakes production settings, the engineering challenges behind them extend far beyond model design. In this interview with AI Journal, Simerus Mahesh, a distributed systems and infrastructure engineer, shares how experience across both large-scale production environments and early-stage AI builds has shaped his approach to scaling modern AI workloads. Having worked on production infrastructure at companies like PlayStation and Meta, as well as building core systems for a Greylock-backed AI research lab, Mahesh focuses on the reliability, security, and performance of distributed AI systems in real-world conditions.
At the center of his work is a focus on running AI workloads safely and efficiently at scale, from designing sandboxed execution environments using technologies like gVisor to building distributed platforms such as Arceus, a system for training machine learning models across decentralized compute networks. Beyond industry, he has contributed to engineering and research initiatives including the MIT-PITT-RW autonomous racing team, competing in the Indy Autonomous Challenge with a fully autonomous race car, and remains active in the broader developer ecosystem as a hackathon judge and collaborator. In this conversation, Mahesh breaks down the realities of production AI infrastructure, the gap between research and deployment, and what it takes to build systems that remain reliable under constant load and change.
You got your start in distributed systems and infrastructure engineering. What initially drew you to the challenge of running AI workloads in production environments at scale?
I was drawn to this field because I was curious about the gap between elegant research results and the entropy of production environments. Scaling in traditional AI research is often treated as a modelling problem involving enhancements to architecture and datasets. In production environments, however, I’ve noticed that it shifts to a systems problem. Anything can happen in production: failures arise from deployment realities that research rarely accounts for, such as noisy neighbours, resource contention, and cold starts. This shift made it clear to me that a whole new set of unique challenges arises when running AI workloads at scale, and that solving them is where much of the really interesting engineering work lies.
Your work sits at the intersection of AI systems, infrastructure, and distributed computing. How would you describe your approach to solving complex engineering problems in this space?
I take a first-principles approach to solving complex infrastructure engineering problems, always starting with the constraints before moving to the actual functionality. My approach has been refined through designing many systems, encountering real failure modes, and justifying each part of the architecture. I ensure designs account for real-world constraints such as variability and failures, guaranteeing stability within defined thresholds. An important part of this process is ensuring systems are observable, which serves to ground future changes to the architecture. With this approach, I’m able to rapidly iterate without sacrificing stability, aligning system modifications with business outcomes.
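Observability as a grounding for change can be as lightweight as recording latency samples per operation and watching the tail. The sketch below is illustrative only (the recorder class and operation names are made up for this example); a production system would export histograms to a metrics backend rather than keep samples in memory:

```python
import time
from collections import defaultdict


class LatencyRecorder:
    """Record per-operation latency samples in memory.

    A toy sketch of the observability idea: real systems would export
    histograms to a metrics backend instead of keeping raw lists.
    """

    def __init__(self):
        self.samples = defaultdict(list)

    def timed(self, name):
        """Decorator that records how long each call takes, even on error."""
        def wrap(fn):
            def inner(*args, **kwargs):
                start = time.perf_counter()
                try:
                    return fn(*args, **kwargs)
                finally:
                    self.samples[name].append(time.perf_counter() - start)
            return inner
        return wrap

    def p99(self, name):
        """Approximate 99th-percentile latency for one operation."""
        xs = sorted(self.samples[name])
        return xs[min(len(xs) - 1, int(0.99 * len(xs)))]


metrics = LatencyRecorder()

@metrics.timed("tokenize")
def tokenize(text):
    return text.split()

tokenize("hello world")
```

Tracking a tail percentile rather than an average is what lets a regression from an architectural change surface before it becomes an outage.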
You have worked across both large-scale environments and early-stage builds. How have those experiences shaped the way you think about designing and scaling systems today?
I’ve had different experiences designing systems in both environments, each uniquely shaping how I think about scale. With large-scale environments, I’ve dealt with non-trivial failure modes and learned to account for how infrastructure behaves under sustained load. With early-stage builds, I’ve gone much deeper into fundamental infrastructure design for optimization and deployment, e.g., rebuilding battle-tested AWS services for a client’s specific on-prem deployment. Both of these experiences taught me how to design systems that are realistic, scalable, and tailored to immediate needs for maximal business impact.
When you think about production engineering for AI systems, what are the biggest gaps between how models are developed in research settings and how they are actually deployed and maintained in real-world environments?
Model capability is usually the primary focus in research settings, typically measured in ideal, unrealistic ‘production’ environments. To deploy these models, production engineers focus more on reliability and cost, creating the necessary guardrails and automations to deal with unpredictable workloads. The gap comes from this mismatch: operational constraints are rarely taken into consideration during research. Most issues arise from unsatisfactory infrastructure rather than the model itself, which is why it’s extremely important to keep scalability in mind while developing SOTA models.
You have built systems that support large-scale, distributed AI workloads. Can you walk through what it really takes to keep these systems reliable under constant load and experimentation?
Reliability comes from designing infrastructure that can absorb continuous change without breaking. As with most regular SaaS workloads, there are a few key architectural pillars to keep in mind with AI workloads: isolation, observability, resource control, and guardrails for iteration. Proper isolation ensures that we can support multi-tenancy, which is critical in shared environments. Observability lets us understand what is happening under load, so failure modes are detected early. Resource control ensures predictable performance through proper scheduling, quotas, and limits. And to constantly improve our system, we need to set up the necessary pipelines that allow rapid iteration without destabilizing it.
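The resource-control pillar in a multi-tenant setting can be sketched as per-tenant concurrency quotas, so one noisy tenant cannot starve the others. This is a deliberately minimal illustration (the class, tenant names, and limits are invented for this example), not a production scheduler:

```python
import threading
from contextlib import contextmanager


class TenantQuota:
    """Minimal per-tenant concurrency quota.

    Each tenant gets a fixed number of slots; acquiring beyond the
    quota fails fast instead of silently degrading everyone else.
    """

    def __init__(self, limits):
        # limits: mapping of tenant name -> max concurrent units of work
        self._slots = {t: threading.BoundedSemaphore(n) for t, n in limits.items()}

    @contextmanager
    def acquire(self, tenant, timeout=5.0):
        sem = self._slots[tenant]
        if not sem.acquire(timeout=timeout):
            raise RuntimeError(f"quota exhausted for tenant {tenant!r}")
        try:
            yield
        finally:
            sem.release()


quota = TenantQuota({"team-a": 2, "team-b": 4})
with quota.acquire("team-a"):
    pass  # one unit of team-a's workload runs inside its slot
```

Failing fast on quota exhaustion, rather than queueing unboundedly, keeps performance predictable: back-pressure is surfaced to the caller instead of accumulating as hidden latency.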
Sandboxing and containerized execution are becoming increasingly important for AI workloads. How do these techniques shape the way modern AI systems are built and operated?
AI workloads are dynamic in that they can execute arbitrary user code and interact with external systems. This poses significant security risks that can affect system stability. Sandboxed execution provides a secure isolation boundary, enabling multi-tenancy and safe execution. I’ve worked with sandbox environments like gVisor, and I’ve seen the effectiveness of techniques like syscall interception; however, they do come with tradeoffs. Software-based approaches, like ptrace for syscall interception, typically introduce higher latency due to context switching. One solution is to shift the isolation boundary closer to the hardware and leverage hardware virtualization, such as the KVM platform, which reduces latency but incurs higher infrastructure costs because it requires access to bare-metal compute. The real challenge is finding a balance where infrastructure is secure enough to be safe, but efficient enough to be profitable.
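The isolation idea can be illustrated in miniature with plain POSIX resource limits: run untrusted code in a child process under kernel-enforced CPU and memory caps. This is a simplified, Linux-oriented sketch (the function name and limits are made up here), not how gVisor works; gVisor adds a full syscall-interception layer between the workload and the host kernel:

```python
import resource
import subprocess
import sys


def run_untrusted(code, cpu_seconds=2, mem_bytes=512 * 2**20):
    """Run untrusted Python code in a child process under hard kernel limits.

    A deliberately simplified, POSIX-only illustration of an isolation
    boundary; it bounds resource abuse but does not filter syscalls.
    """
    def apply_limits():
        # Kernel-enforced CPU cap: the child is signalled once it exceeds it.
        resource.setrlimit(resource.RLIMIT_CPU, (cpu_seconds, cpu_seconds))
        # Cap the address space to bound memory usage.
        resource.setrlimit(resource.RLIMIT_AS, (mem_bytes, mem_bytes))

    return subprocess.run(
        [sys.executable, "-c", code],
        preexec_fn=apply_limits,  # applied in the child before exec
        capture_output=True,
        text=True,
        timeout=cpu_seconds + 10,  # wall-clock guardrail for sleeps and I/O waits
    )


result = run_untrusted("print(sum(range(10)))")  # a well-behaved job completes normally
```

A runaway job (say, an infinite loop) is killed by the kernel at the CPU cap rather than degrading its neighbours, which is the multi-tenancy property the answer above describes, paid for here with only process-level, not syscall-level, isolation.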
From your experience, what are the key differences between building production infrastructure within large, established organizations versus designing systems from scratch in early-stage environments?
The key difference is operating under constraints versus defining them. In established organizations, working on infrastructure typically involves optimizing within established constraints, where changes need to be incremental and managed carefully to avoid system-wide disruption. In early-stage environments, you’re typically defining those constraints yourself and making foundational decisions that balance immediate business impact and long-term scalability. These earlier-stage environments demand high agency and a bias towards action, along with sound judgment about when to move fast to hit a business milestone and when to slow down so that core abstractions are solid for future growth.
Outside of your core engineering work, you have been involved in research, competitions, and the broader developer ecosystem. How do these experiences influence the way you approach innovation in AI infrastructure?
I’ve been able to gain a more holistic view of how AI systems are built and how they can fail. Through involvement in research, competitions, and the developer ecosystem, I’ve been exposed to a wide range of failure modes, system designs, and iteration techniques. This has allowed me to think beyond isolated components and instead design for complex multi-system interactions. Overall, these experiences shaped my approach to innovation in this domain, drawing on a grounded understanding of the fundamentals I’ve accumulated over the years.



