Sandeep Pal on the Infrastructure Layer Beneath Enterprise AI

Sandeep Pal is a Principal Engineer with more than a decade of experience building distributed systems and cloud-native infrastructure at companies including Snapchat, Microsoft, and Intel. He is the creator of MultiCloudJ, an open-source Java SDK that gives developers a unified interface for writing applications across AWS, GCP, and Alibaba Cloud, and a contributor to Apache HBase, Apache Phoenix, Apache Spark, and GoCloud.dev.

Much of the current conversation around enterprise AI focuses on models, GPUs, and tooling. Sandeep’s vantage point is the layer underneath: the data pipelines, multi-cloud abstractions, and distributed systems that determine whether AI workloads actually hold up at scale. He sat down with us to talk about the infrastructure complexity hiding behind multi-cloud AI deployments, what experimentation pipelines at Snapchat taught him about data quality bounding model quality, and where he sees the next set of open-source gaps forming in the enterprise AI stack.

As enterprises scale AI workloads across AWS, GCP, and Alibaba, what does the underlying infrastructure complexity look like from an engineering standpoint?

From the outside, multi-cloud AI sounds like a strategic decision. From an engineering standpoint, it is considerably harder than that, and the picture is actually messier than the question suggests. Most enterprises are not simply working across hyperscalers. They are stitching together three categories of providers at the same time: hyperscalers such as AWS, GCP, and Alibaba; specialized GPU clouds such as CoreWeave and Lambda; and inference-as-a-service platforms such as Together AI and Fireworks. Each tier solves a different problem and brings its own kind of complexity.

The binding constraint underneath all of this is capacity. GPU and accelerator supply is uneven, and the chips are not interchangeable. You may have H100s reserved on AWS in one region, TPUs on GCP in another, Trainium with its own compiler stack, Alibaba GPUs in Asia-Pacific for serving Chinese users, and a new wave of specialised chips like Groq and Etched entering the mix. Scaling out almost always means scaling into a specific provider’s specific region on specific silicon, because that is where the chips actually exist. Workload placement becomes a constrained optimization problem, not a routing problem.

Data residency adds another layer of complexity on top of capacity. Operating across these providers is often driven by data sovereignty, and rules like GDPR and China’s PIPL are not policy questions but architectural ones. What is legal to train on in one region may not be legal to move to another, and model weights themselves are now subject to export controls in certain cases. That partitioning has to be enforced in code, not in documents.

What you end up building is not a unified multi-cloud platform. What you actually build is a thin portability layer over provider-specific implementations, with a placement engine on top that knows where the capacity is, what silicon it is running on, where data is allowed to live, and what the regulatory boundary forbids.
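To make the idea of placement as a constrained optimization concrete, here is a minimal sketch in Python. It is not any real placement engine: the `Pool` and `Workload` types, the region names, capacity counts, and hourly prices are all hypothetical values chosen for illustration. Silicon compatibility, jurisdiction, and free capacity are treated as hard constraints, and cost is minimized only among the pools that survive them.

```python
from dataclasses import dataclass
from typing import FrozenSet, List, Optional


@dataclass(frozen=True)
class Pool:
    """One block of accelerator capacity (all values here are made up)."""
    provider: str
    region: str
    jurisdiction: str        # regulatory boundary the region sits in
    chip: str
    free_accelerators: int
    cost_per_hour: float     # hypothetical price per accelerator-hour


@dataclass(frozen=True)
class Workload:
    name: str
    accelerators_needed: int
    supported_chips: FrozenSet[str]          # silicon the job is compiled for
    allowed_jurisdictions: FrozenSet[str]    # where its data may legally live


def place(workload: Workload, pools: List[Pool]) -> Optional[Pool]:
    """Return the cheapest pool that satisfies every hard constraint, or None."""
    feasible = [
        p for p in pools
        if p.chip in workload.supported_chips
        and p.jurisdiction in workload.allowed_jurisdictions
        and p.free_accelerators >= workload.accelerators_needed
    ]
    return min(feasible, key=lambda p: p.cost_per_hour, default=None)


POOLS = [
    Pool("aws", "us-east-1", "US", "H100", 64, 4.0),
    Pool("gcp", "europe-west4", "EU", "TPU", 128, 3.0),
    Pool("aws", "eu-central-1", "EU", "H100", 8, 4.5),
]

# A job that can run on either chip but whose data must stay in the EU.
eu_job = Workload("eu-finetune", 16, frozenset({"H100", "TPU"}), frozenset({"EU"}))
# The same job restricted to H100s: no EU pool has enough of them free.
cuda_only_eu_job = Workload("eu-cuda", 16, frozenset({"H100"}), frozenset({"EU"}))

print(place(eu_job, POOLS))
print(place(cuda_only_eu_job, POOLS))
```

The second call returns `None` even though cheaper and larger pools exist elsewhere, which is the point of the passage above: the regulatory boundary and the silicon decide feasibility before price is ever considered.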

You built data pipelines at Snapchat supporting experimentation at scale. What did that work reveal about the relationship between data infrastructure reliability and the quality of what runs on top of it?

The lesson I took away from Snap is that experimentation infrastructure is one of the few places in software where data reliability and product correctness are not just correlated, they are the same thing. If your pipelines drop even half a percent of events, your dashboards still look fine and your A/B test still ships a “winning” variant, but the variant may not actually be winning. You have quietly converted an infrastructure bug into a product decision, and nobody catches it for months, sometimes never.
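A small worked example shows how a half-percent loss can flip a result. The numbers below are invented, and the sketch assumes the loss falls on only one arm of the test (for instance, because the two variants log through different code paths). That assumption is mine, not something stated in the interview; a loss spread evenly across both arms would distort the comparison far less.

```python
def observed_conversions(conversions: int, drop_basis_points: int) -> int:
    """Conversions that survive a pipeline losing drop_basis_points / 10,000 of events."""
    return conversions - (conversions * drop_basis_points) // 10_000


def winner(control_conv: int, treatment_conv: int, users_per_arm: int) -> str:
    """Name the arm with the higher conversion rate (equal arm sizes assumed)."""
    control_rate = control_conv / users_per_arm
    treatment_rate = treatment_conv / users_per_arm
    return "treatment" if treatment_rate > control_rate else "control"


USERS_PER_ARM = 100_000
TRUE_CONTROL = 5_000      # 5.00% true conversion rate (hypothetical)
TRUE_TREATMENT = 4_990    # 4.99% true conversion rate (hypothetical)
DROP_BP = 50              # 0.5% of events lost, control arm only (assumption)

true_winner = winner(TRUE_CONTROL, TRUE_TREATMENT, USERS_PER_ARM)
seen_control = observed_conversions(TRUE_CONTROL, DROP_BP)
observed_winner = winner(seen_control, TRUE_TREATMENT, USERS_PER_ARM)

print(true_winner, seen_control, observed_winner)
```

With these figures the control arm truly converts better, but the dashboard sees 4,975 control conversions against 4,990 for the treatment and reports the treatment as the winner.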

Two specific experiences shaped how I think about this. While running experimentation pipelines on Spark, we kept seeing jobs that would either fail intermittently or get stuck indefinitely, forcing restarts of very large pipelines and adding significant cost. I traced it down to a deadlock between Spark’s TaskMemoryManager and UnsafeExternalSorter and contributed the fix back upstream as SPARK-39283. The bug itself was a small synchronisation issue, but fixing it materially improved the stability of every downstream pipeline. That is what software quality at the platform layer actually buys you. It is not just uptime, it is the ability of every layer above to assume the layer below is correct.

The second was a cost optimization effort I led, where I systematically ran the same workloads across different compute offerings on Google Cloud Dataproc. The cost of the same job could vary widely depending on the hardware it ran on, and matching workload to hardware ended up saving millions of dollars in annual infrastructure cost while also making runs more predictable.

These two experiences point to the same broader lesson, which carries directly into AI infrastructure today. The quality of what runs on top of a data system is bounded by the reliability of the system underneath, and the cost of running it is determined by how well the workload is matched to the hardware. In AI, the first shows up as model quality, since a model trained or grounded on a pipeline that silently drops events or serves stale ones will produce confidently wrong outputs. The second shows up directly in GPU spend, where the cost of the same training or inference workload can vary widely depending on the silicon and configuration.
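The hardware-matching point comes down to simple arithmetic: what matters is price multiplied by runtime, not the hourly price alone. The sketch below uses hypothetical offering names, prices, and runtimes (none are real Dataproc or GPU figures) to show how the cheapest machine per hour can be the wrong choice per job.

```python
def job_cost(price_per_node_hour: float, nodes: int, runtime_hours: float) -> float:
    """Total cost of one run: hourly price x cluster size x wall-clock runtime."""
    return price_per_node_hour * nodes * runtime_hours


# Hypothetical offerings: (USD per node-hour, node count, measured runtime in hours).
OFFERINGS = {
    "general-purpose": (1.00, 10, 6.0),
    "compute-optimized": (1.25, 10, 5.0),
    "memory-optimized": (1.50, 10, 3.0),
}

costs = {name: job_cost(*spec) for name, spec in OFFERINGS.items()}
cheapest_per_hour = min(OFFERINGS, key=lambda name: OFFERINGS[name][0])
cheapest_per_job = min(costs, key=costs.get)

print(costs)
print(cheapest_per_hour, cheapest_per_job)
```

In this made-up case the most expensive machine per hour finishes the job in half the time and costs the least overall, which is only discoverable by actually running the workload on each option.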

MultiCloudJ abstracts SDK differences across cloud providers. How does that problem change in character when the workloads involved are AI-specific?

The job of a multi-cloud SDK is to give callers a consistent contract over providers whose underlying semantics often differ in subtle ways. The same holds for inference: AWS Bedrock, GCP Vertex, and Azure OpenAI all let you “call a model,” but the details diverge. Models are addressed differently, request and response schemas are different, streaming protocols are different, and tool-use formats, stop reasons, and throttling errors all look different across the three. These are exactly the kinds of provider-specific behaviours an SDK like MultiCloudJ exists to smooth over, so the caller can write to one consistent contract.
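As an illustration of what one consistent contract can mean for a single field, the sketch below normalizes provider-specific stop reasons into one enum. It is written in Python for brevity and is not MultiCloudJ's API. The `StopReason` enum and `normalize_stop_reason` function are hypothetical, and the raw strings in the mapping are illustrative values that should be checked against each provider's current documentation before being relied on.

```python
from enum import Enum


class StopReason(Enum):
    """Provider-neutral reasons a generation ended (hypothetical contract)."""
    COMPLETE = "complete"
    MAX_TOKENS = "max_tokens"
    TOOL_CALL = "tool_call"
    FILTERED = "filtered"
    UNKNOWN = "unknown"


# Illustrative raw values per provider; verify against current provider docs.
_STOP_REASON_MAP = {
    "bedrock": {
        "end_turn": StopReason.COMPLETE,
        "stop_sequence": StopReason.COMPLETE,
        "max_tokens": StopReason.MAX_TOKENS,
        "tool_use": StopReason.TOOL_CALL,
        "content_filtered": StopReason.FILTERED,
    },
    "vertex": {
        "STOP": StopReason.COMPLETE,
        "MAX_TOKENS": StopReason.MAX_TOKENS,
        "SAFETY": StopReason.FILTERED,
    },
    "azure_openai": {
        "stop": StopReason.COMPLETE,
        "length": StopReason.MAX_TOKENS,
        "tool_calls": StopReason.TOOL_CALL,
        "content_filter": StopReason.FILTERED,
    },
}


def normalize_stop_reason(provider: str, raw: str) -> StopReason:
    """Map a provider-specific stop reason onto the shared enum.

    Unrecognized providers or values fall back to UNKNOWN instead of raising,
    so a new upstream value degrades gracefully rather than breaking callers.
    """
    return _STOP_REASON_MAP.get(provider, {}).get(raw, StopReason.UNKNOWN)


print(normalize_stop_reason("bedrock", "max_tokens"))
print(normalize_stop_reason("azure_openai", "length"))
print(normalize_stop_reason("vertex", "SOMETHING_NEW"))
```

Callers branch on `StopReason` only, so adding a provider means adding a mapping table rather than touching application logic.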

That said, the industry is not really focused on multi-cloud abstraction for AI yet. Most enterprises are still in the phase of getting AI into production at all. The pressing priorities are securing GPU capacity, choosing the right model, and proving business value, not portability. Most teams will happily commit to a single provider’s stack if it ships faster. But as the space matures and cost and capability differences become real per-request decisions, the same pressures that drove multi-cloud abstraction in the storage and messaging era will show up here too. There is a real opportunity to get ahead of that curve.

What does not belong inside the SDK, even then, is the harder set of questions AI brings with it. Cost-per-token economics, KV cache locality during a multi-turn conversation, and which provider can actually serve a given request are orchestration problems, not abstraction problems. The SDK’s job is to make every provider callable through a consistent contract. The orchestration layer above it decides which provider to call. That orchestration layer is where most of the interesting AI infrastructure work is happening today, even if multi-cloud AI itself is still a problem the industry has not fully turned to yet.

Vendor lock-in carries different stakes depending on the workload. Where does it bite hardest when model training, inference, and data pipelines are split across providers?

Lock-in is expensive, usually for everyone, but the rate varies a lot by workload. The mistake most enterprises make is treating it as a single decision. In practice, training, inference, and data pipelines each lock you in through a different mechanism, and the escape costs are not comparable.

Training is where lock-in bites hardest. Once you have committed to H100 capacity on one cloud, built your stack around TPUs and XLA, or invested in Trainium and the Neuron SDK, the switching cost is not just rewriting code. It is losing your place in line for accelerators that are still capacity-constrained. The lock-in is contractual and physical, not just technical, and most teams accept it because the alternative is not training at all.

Data pipelines are next, and this is where the bill is most visible. Data feeding training and RAG accumulates in one provider’s object store, with permissions, lineage, and lakehouse formats wired into that provider’s IAM and catalog. Moving the pipeline is straightforward, but moving the data triggers egress fees that can dwarf the compute savings of the move. Open table formats like Iceberg and Delta have helped, but the catalog and network topology around the data are still per-cloud. Gravity wins more often than architecture diagrams suggest.
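A back-of-the-envelope calculation shows how egress can outweigh compute savings. Every figure below is a placeholder: the corpus size, the per-gigabyte egress rate, and the monthly savings are assumptions for illustration, not any provider's published pricing.

```python
def egress_cost_usd(dataset_tb: int, egress_cents_per_gb: int) -> int:
    """One-time cost in USD of moving a dataset out of a cloud (decimal TB)."""
    total_cents = dataset_tb * 1_000 * egress_cents_per_gb
    return total_cents // 100


def months_to_break_even(one_time_cost_usd: int, monthly_savings_usd: int) -> int:
    """Whole months of savings needed to repay the move (ceiling division)."""
    return -(-one_time_cost_usd // monthly_savings_usd)


DATASET_TB = 2_000             # a 2 PB training corpus (assumption)
EGRESS_CENTS_PER_GB = 5        # placeholder rate of $0.05 per GB (assumption)
MONTHLY_SAVINGS_USD = 5_000    # compute savings on the new provider (assumption)

move_cost = egress_cost_usd(DATASET_TB, EGRESS_CENTS_PER_GB)
payback_months = months_to_break_even(move_cost, MONTHLY_SAVINGS_USD)

print(move_cost, payback_months)
```

Under these assumptions the move costs $100,000 up front and takes 20 months of compute savings to repay, before counting the transfer time itself.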

Inference is, perhaps surprisingly, the softest lock-in, provided the application is built correctly. With a proper abstraction, whether a multi-cloud SDK, a gateway, or an internal routing layer, swapping a model or provider should be a configuration change, not a code change. Most teams have not built that abstraction yet, which is what creates the appearance of lock-in. What remains is commercial, in the form of reserved capacity contracts, and operational, in the sense that guardrails tuned to one model have to be re-tuned for another.
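The difference between a configuration change and a code change can be sketched in a few lines. The backends below are stand-in functions rather than real provider clients, and the `inference_backend` config key is a hypothetical name; the point is only that the application calls `complete` and never names a provider directly.

```python
from typing import Callable, Dict


def call_provider_a(prompt: str) -> str:
    """Stand-in for a real client; a production version would call provider A."""
    return f"[provider-a] {prompt}"


def call_provider_b(prompt: str) -> str:
    """Stand-in for a real client; a production version would call provider B."""
    return f"[provider-b] {prompt}"


# Registry of callable backends. Adding a provider means adding an entry here.
BACKENDS: Dict[str, Callable[[str], str]] = {
    "provider-a": call_provider_a,
    "provider-b": call_provider_b,
}


def complete(prompt: str, config: Dict[str, str]) -> str:
    """Route the request to whichever backend the configuration names."""
    backend = BACKENDS[config["inference_backend"]]
    return backend(prompt)


# Application code stays identical; only the configuration value differs.
before = complete("summarize this ticket", {"inference_backend": "provider-a"})
after = complete("summarize this ticket", {"inference_backend": "provider-b"})

print(before)
print(after)
```

What this sketch leaves out is exactly the residue the passage describes: reserved-capacity contracts and guardrails tuned to one model still have to be handled outside the routing layer.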

The pragmatic stance most enterprises will land on is asymmetric. Accept lock-in where it is rational, which is training. Design data pipelines around open formats from day one, because once the data has gravity, you are not moving it back. And invest in a proper inference abstraction early, so that switching models or providers stays a configuration decision rather than a migration project.

Open-source projects like Apache HBase, Apache Spark, and MultiCloudJ each solved specific gaps in the distributed systems ecosystem. Where do you see the equivalent gaps forming now for enterprise AI infrastructure?

What is striking about HBase, Spark, and MultiCloudJ is that each one named a gap the industry was already feeling but had not yet articulated. HBase gave us random reads and writes on a batch file system. Spark gave us in-memory compute when disk I/O became the bottleneck. MultiCloudJ, which we open-sourced at Salesforce, gave the Java ecosystem portable abstractions for storage, pub/sub, and secrets across AWS, GCP, and Alibaba Cloud.

I see three equivalent gaps forming in the enterprise AI harness today.

The first is the serving substrate, an “HBase for inference.” We have engines like vLLM and TensorRT-LLM, but no unified orchestration layer above them. We need an enterprise layer that manages KV-cache residency, provides multi-tenant isolation, and handles graceful failover across heterogeneous accelerators without breaking application logic.

The second is the retrieval plane, an “AI query optimiser.” Vector stores have made similarity search production-ready, but they are still just indexes. The gap is a retrieval engine that intelligently routes across vector, keyword, and structured lookups while enforcing freshness and permission-respecting access controls as first-class citizens.

The third, and in my view the most underserved, is observability, the “OpenTelemetry of AI.” Traditional APM stops at the API call. AI workloads need prompt-to-token lifecycle telemetry that traces model versions, retrieval results, and tool calls alongside cost and latency. Platforms like Bedrock offer model-agnostic APIs within their own walls, but the industry needs a vendor-neutral open standard so enterprises can swap providers without losing their historical signals.
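One way to picture prompt-to-token lifecycle telemetry is as a single record per request that carries model, retrieval, tool, cost, and latency signals together. The sketch below is a hypothetical schema, not an existing standard and not OpenTelemetry's semantic conventions; every field name is an assumption chosen for illustration.

```python
import json
from dataclasses import asdict, dataclass, field
from typing import List


@dataclass
class ToolCall:
    name: str
    latency_ms: int


@dataclass
class InferenceTrace:
    """One request's lifecycle in a provider-neutral shape (hypothetical schema)."""
    request_id: str
    provider: str
    model_version: str
    retrieved_doc_ids: List[str] = field(default_factory=list)
    tool_calls: List[ToolCall] = field(default_factory=list)
    prompt_tokens: int = 0
    completion_tokens: int = 0
    latency_ms: int = 0
    cost_usd: float = 0.0

    def to_json(self) -> str:
        """Serialize to plain JSON so the record outlives any one vendor's tooling."""
        return json.dumps(asdict(self), sort_keys=True)


trace = InferenceTrace(
    request_id="req-001",
    provider="provider-a",
    model_version="model-x-2024-01",
    retrieved_doc_ids=["doc-17", "doc-42"],
    tool_calls=[ToolCall(name="search", latency_ms=120)],
    prompt_tokens=850,
    completion_tokens=210,
    latency_ms=1430,
    cost_usd=0.0123,
)

print(trace.to_json())
```

Because the record names the provider and model version as data rather than encoding them in the storage location, historical traces remain comparable after a provider swap.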

The pattern points to a familiar shift. As models become commoditised, the moat moves to the infrastructure. The real value, and the next great open-source evolution, will be in the harness that makes these models reliable, observable, and portable at enterprise scale.

Storage, identity, and messaging abstractions are relatively mature in multi-cloud architecture. Which layers of the stack are still the hardest to standardize when AI workloads are involved?

Storage, identity, and messaging became commoditised because they are functional abstractions. They define what a system does, which is inherently portable. AI workloads, however, expose physical abstractions, the where and how of execution, which are much more resistant to standardisation. Inference is becoming easier to route through model-agnostic gateways, but the upstream layers remain significantly fragmented.

Training and silicon lock-in is the hardest layer to abstract, because the moat has moved from the API down to the compiler and kernel. Frameworks like PyTorch provide a common interface, but peak performance is tightly bound to hardware-specific stacks such as NVIDIA’s CUDA, Google’s XLA, and AWS’s Neuron. Even subtle differences in how hardware families handle numerical precision, like BFloat16 rounding, mean that moving a training run across providers is not a configuration change. It is a re-optimisation exercise that often incurs a real performance tax.
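To show why rounding behaviour matters, the sketch below converts a float32 value to bfloat16 precision in two ways: truncating the low 16 bits, and rounding to nearest with ties to even. It makes no claim about which hardware or compiler stack uses which mode; it only demonstrates that the two policies disagree on the same input. NaN and infinity handling is deliberately omitted.

```python
import struct


def f32_bits(x: float) -> int:
    """Bit pattern of x after rounding it to IEEE 754 float32."""
    return struct.unpack("<I", struct.pack("<f", x))[0]


def bits_to_f32(bits: int) -> float:
    """Float value of a 32-bit IEEE 754 pattern."""
    return struct.unpack("<f", struct.pack("<I", bits))[0]


def to_bf16_truncate(x: float) -> float:
    """bfloat16 by dropping the low 16 mantissa bits (round toward zero)."""
    return bits_to_f32(f32_bits(x) & 0xFFFF0000)


def to_bf16_round_nearest_even(x: float) -> float:
    """bfloat16 by round-to-nearest, ties to even (finite inputs only)."""
    bits = f32_bits(x)
    # Adding 0x7FFF plus the lowest kept bit rounds exact ties toward an even mantissa.
    bias = 0x7FFF + ((bits >> 16) & 1)
    return bits_to_f32(((bits + bias) & 0xFFFFFFFF) & 0xFFFF0000)


x = 1.006  # lies between the neighbouring bfloat16 values 1.0 and 1.0078125
print(to_bf16_truncate(x))
print(to_bf16_round_nearest_even(x))
```

Here truncation yields 1.0 while round-to-nearest yields 1.0078125, a gap of one full bfloat16 step near 1.0. Accumulated over many operations in a long training run, differences of this kind are one reason a run does not reproduce bit-for-bit across stacks.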

Data gravity is the second. S3 and Azure Blob look interchangeable on a spreadsheet, but the reality changes when you try to move a multi-petabyte training corpus or a massive replicated vector index. The industry lacks a standardised zero-copy protocol between clouds, so multi-cloud elasticity gets strangled by egress fees and the speed of light. In practice, the data picks the cloud, and the training cluster follows.

The third is the fragmentation of MLOps tooling. In traditional cloud architecture, Kubernetes acts as a universal operating system that abstracts the underlying infrastructure. In AI, we lack that common substrate. Every provider has its own walled garden in SageMaker, Vertex AI, or Azure ML, and these platforms do not yet share a unified metadata standard for model versions, audit trails, or guardrails. For an enterprise, moving a workload is not just about moving code, it is about rebuilding an entire governance and compliance framework from scratch, because the paper trail itself is not portable.

You’ve worked across Salesforce, Snapchat, Microsoft, and Intel at different points in the cloud’s maturity. How has the infrastructure conversation around large-scale data systems changed, and where does AI fit into that arc?

Each stop captured a different era of the same question: where does data live, and what does it cost to move it? At Intel in 2014, the industry had gone horizontal in principle, but it was all on-prem. DevOps engineers stood up distributed VM clusters on bare metal for us. I was working on Spark, Hadoop, and the Trusted Analytics Platform, making distributed analytics actually fast and trustworthy on commodity clusters, and treating trust and security as first-class concerns rather than afterthoughts.

By Microsoft in 2017, the conversation had moved from “can distributed systems work?” to “how do we run mission-critical commerce at global scale?” The hard problems became consistency, replication, and managing state across a global fabric where customers expected cloud economics without giving up enterprise trust posture. At Snap in 2021, infrastructure became real-time and elastic by default. We were not planning quarters out, we were absorbing traffic spikes in minutes, and building experimentation pipelines at that scale taught me that reliability is the foundation of truth.

My Salesforce journey mirrors the industry’s biggest shift. In 2019 we were still first-party, running our own data centres, much like the Intel days. When I boomeranged back in 2022, public cloud had unlocked effectively unlimited capacity, and the conversation moved from “where do we find capacity?” to “how do we orchestrate it?”

AI is what finally justifies all that unlimited capacity. It has not replaced the infrastructure conversation; it has raised the stakes on every part of it. Inference is in the hot path, data errors surface as hallucinations instead of bad dashboards, and cost discipline has shifted from racks and stacks to tokens and GPUs. The lessons that carry forward are specific.

Reliability of the layer underneath determines the quality of everything above it, data quality is bounded by pipeline correctness rather than by what dashboards show, and cost is unlocked by matching the workload to the right hardware. Those were true for Spark and Hadoop in the 2010s, and they are equally true for inference servers and retrieval pipelines today.

Author

  • Tom Allen

    Founder and Director at The AI Journal. Created this platform with the vision to lead conversations about AI. I am an AI enthusiast.
