
Nitish Mane on Architecting Cloud Infrastructure for AI-Driven Connected Vehicles

Nitish Mane is a backend infrastructure and cloud platform engineer with more than a decade of experience designing, scaling, and operating large distributed systems across AWS, GCP, Azure, and OCI. His career has been defined by building cloud-native platforms and real-time data pipelines that operate reliably in globally distributed, high-stakes environments. As a Staff Infrastructure Engineer at Lucid Motors, he architected the company’s cloud footprint in the KSA government region, enabling connected vehicle services that actively support customers across the Middle East and the EU. Working within strict data residency requirements and without access to the typical hyperscaler options, he designed a multi-region, Kubernetes-native platform that balanced compliance, scalability, and operational resilience.

Beyond foundational infrastructure, Mane also led the deployment of large-scale vehicle telemetry pipelines and built an internal LLM-powered smart diagnostics system designed to reduce service center visits while preserving strict privacy controls. His work required integrating AI workloads directly into regulated automotive cloud environments, ensuring that performance, latency, and governance standards were never compromised. In this interview with AI Journal, Mane shares how he approached multi-cloud trade-offs, capacity planning for systems that vehicles depend on in real time, and observability strategies for globally distributed intelligent platforms.

You architected Lucid Motors’ cloud footprint in the KSA Gov region, supporting connected vehicles across the Middle East and the EU. What were the biggest architectural challenges in building infrastructure that vehicles in multiple regions actively depend on?

Designing a multi-cloud, multi-region platform always starts with trade-offs: speed of delivery, available cloud services, and most critically, local compliance.

In the KSA government region, data residency was non-negotiable. Vehicle data could not leave the country, and there was no AWS or GCP presence. After evaluating local providers, OCI was the strongest option in terms of maturity, compliance alignment, and long-term scalability.

That choice came with constraints. Some managed services we were used to, such as AWS ACM for certificates or managed PostgreSQL, weren’t available, so we built internal alternatives. For example, we implemented our own certificate management using Let’s Encrypt and standardized on MySQL. Since all workloads were Kubernetes-native, we used Argo CD with a central control plane to manage cluster and application deployments consistently across regions.

Capacity planning was another major concern. Vehicles actively depend on these systems, so we reserved OCI capacity upfront to guarantee resource availability and predictable performance.
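The sizing logic behind upfront reservation can be sketched roughly as follows. All numbers and the headroom factor here are illustrative assumptions, not Lucid’s actual figures:

```python
import math

def reserved_node_count(fleet_size: int,
                        msgs_per_vehicle_per_sec: float,
                        msgs_per_node_per_sec: float,
                        headroom: float = 0.4) -> int:
    """Estimate how many nodes to reserve up front.

    Illustrative sizing only: peak ingest rate is padded by a
    headroom factor so a failover or traffic spike does not
    exhaust the reserved capacity.
    """
    peak_rate = fleet_size * msgs_per_vehicle_per_sec
    required = peak_rate * (1 + headroom) / msgs_per_node_per_sec
    return math.ceil(required)

# e.g. 50,000 vehicles at 2 msg/s, 10,000 msg/s per node, 40% headroom
print(reserved_node_count(50_000, 2.0, 10_000))  # 14
```

The point of reserving against a padded peak rather than an average is that vehicle platforms cannot simply queue up and wait when on-demand capacity is unavailable in a single-provider region.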

Connected vehicles generate massive streams of telemetry. How did you design the data ingestion and processing pipelines to ensure low latency, high reliability, and scalability across regions?

Telemetry volume can explode very quickly, so the first optimization was at the edge. We pushed aggregation as close to the vehicle as possible by processing and filtering signals directly on the TCU. Only high-value signals were transmitted, and others could be dynamically enabled by the vehicle team if needed.
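A minimal sketch of that edge-filtering idea, assuming the TCU can apply an allowlist plus a dynamically pushed set of extra signals (the signal names here are invented for illustration):

```python
# Illustrative TCU-side filter: only high-value signals pass by
# default; others can be enabled dynamically (e.g. by the vehicle
# team pushing an updated config). Signal names are hypothetical.

HIGH_VALUE = {"battery_soc", "motor_temp", "dtc_event"}

def filter_signals(sample: dict, dynamically_enabled=frozenset()) -> dict:
    """Keep only the signals worth transmitting over the cellular link."""
    allowed = HIGH_VALUE | set(dynamically_enabled)
    return {k: v for k, v in sample.items() if k in allowed}

sample = {"battery_soc": 81, "motor_temp": 64, "cabin_fan_rpm": 1200}
print(filter_signals(sample))                     # drops cabin_fan_rpm
print(filter_signals(sample, {"cabin_fan_rpm"}))  # keeps it when enabled
```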

Data was ingested through EMQX, a low-latency MQTT broker running in a highly available Kubernetes setup. From there, telemetry flowed into Kafka, where we increased replication and redundancy to ensure reliability. Downstream consumers handled aggregation and long-term storage.

Because the entire pipeline was Kubernetes-based and fully automated, we were able to replicate this architecture across regions with minimal friction, scaling horizontally as vehicle volume increased.

You also worked on an LLM-powered smart diagnostics system at Lucid. How did you integrate AI workloads into a production automotive cloud environment without compromising performance or compliance?

The core goal was to reduce service costs and unnecessary service center visits. We saw many vehicles brought in for issues that owners could have resolved themselves with better diagnostic guidance.

We built an internal AI agent trained on vehicle diagnostic data that could answer user questions and guide troubleshooting. From a compliance perspective, it was critical that no vehicle data ever leave Lucid’s infrastructure. The model was hosted entirely in-house, and we did not rely on external LLM APIs.

This allowed us to tightly control latency, performance, and data access while meeting strict privacy and regulatory requirements. It also gave us flexibility to continuously fine-tune the model as vehicle behavior evolved.

When deploying infrastructure in regulated environments such as the KSA government region, how do you balance data sovereignty requirements with the need for global system consistency?

Data sovereignty always comes first. Our baseline rule was simple: customer and vehicle data must never leave the country.

We audited all applications and databases to eliminate unnecessary storage of PII and enforced strict data-access boundaries. For globally used applications like Lucid Garage, we used a multi-region MongoDB architecture that allowed a single application instance to read from region-specific databases.

This approach preserved a consistent global user experience while ensuring that each region’s data stayed local. Users could seamlessly switch regions to view vehicle reports without violating residency requirements.
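The region-routing pattern described above can be sketched simply: one application instance, one database endpoint per region, and no fallback path that could silently cross a residency boundary. The endpoints below are placeholders, not real infrastructure:

```python
# Illustrative region router for a single global application
# instance reading from region-specific databases. Hostnames
# are hypothetical.

REGION_DB = {
    "ksa": "mongodb://mongo.ksa.internal:27017/garage",
    "eu":  "mongodb://mongo.eu.internal:27017/garage",
    "us":  "mongodb://mongo.us.internal:27017/garage",
}

def db_uri_for(user_region: str) -> str:
    """Resolve the region-local database. Fail loudly rather than
    fall through to another region, so residency cannot be
    violated silently."""
    try:
        return REGION_DB[user_region]
    except KeyError:
        raise ValueError(f"no database configured for region {user_region!r}")

print(db_uri_for("ksa"))
```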

In connected-vehicle ecosystems, infrastructure decisions directly impact the customer experience. How do you design backend systems that remain resilient when millions of real-world edge devices are involved?

Resilience starts long before production. We built a robust CI/CD pipeline that first deploys every change to an internal fleet of vehicles used by engineers and test drivers.

We closely monitored metrics, logs, and alerts during this phase to catch regressions early. Only after systems proved stable did changes roll out to production vehicles. This entire workflow was automated using GitLab CI and Argo CD.

This approach allowed us to detect breaking changes early, reduce incident frequency, and protect the end-customer experience—especially important when updates affect real vehicles in the field.
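A promotion gate for a staged rollout like the one described above might look something like this. The metric names and thresholds are invented for the sketch; in practice they would come from the monitoring stack during the internal-fleet bake period:

```python
# Illustrative gate: a change bakes on the internal fleet and is
# promoted to production vehicles only if error and alert metrics
# stay under thresholds. Names and thresholds are hypothetical.

def should_promote(metrics: dict,
                   max_error_rate: float = 0.01,
                   max_new_alerts: int = 0) -> bool:
    """Decide whether a change graduates from the internal fleet.
    Missing metrics default to failing values: no data, no promotion."""
    return (metrics.get("error_rate", 1.0) <= max_error_rate
            and metrics.get("new_alerts", 1) <= max_new_alerts)

print(should_promote({"error_rate": 0.002, "new_alerts": 0}))  # True
print(should_promote({"error_rate": 0.05,  "new_alerts": 0}))  # False
```

Defaulting missing metrics to failing values encodes the same bias as the workflow itself: absence of evidence of stability blocks the rollout.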

Your background spans AWS, GCP, Azure, and OCI. From an AI and distributed systems perspective, how do you evaluate which cloud platform is best suited for telemetry-heavy, AI-enabled workloads?

At scale, the “best” cloud is usually the one that is available in the region you need and offers the strongest commercial terms.

All major cloud providers offer comparable building blocks for AI and distributed systems. The real differentiators are regional presence, regulatory alignment, and cost. For example, OCI was the only viable option in KSA, while AWS was the only realistic choice in Hong Kong.

If an organization already has significant workloads on a provider, continuing with that ecosystem usually delivers faster results and better pricing. For new initiatives, I strongly recommend running early proofs-of-concept with multiple vendors. That upfront effort often saves significant time and money later.

Observability becomes critical when AI systems and real-time pipelines operate at scale. How do you approach monitoring, tracing, and debugging across globally distributed intelligent systems?

I strongly favor open-source observability stacks because the ecosystem is evolving rapidly. Today, combining Prometheus, Grafana, and Loki provides deep visibility into metrics, logs, and alerts, and newer MCP-style integrations allow engineers to query these systems using natural language.

For tracing, OpenTelemetry is foundational. It decouples instrumentation from vendors, making it much easier to switch tools as requirements change.
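The decoupling idea can be illustrated with a toy shim: code is instrumented once against a neutral interface, and the exporter backend is swapped without touching the instrumented code. This is a deliberately simplified sketch, not the real OpenTelemetry API:

```python
# Toy illustration of vendor-neutral tracing: instrument once,
# swap the exporter freely. Not the actual OpenTelemetry API.
import time
from contextlib import contextmanager

class ListExporter:
    """Stand-in backend; a console, Jaeger, or vendor exporter
    could be dropped in here without changing instrumented code."""
    def __init__(self):
        self.spans = []
    def export(self, name, duration_s):
        self.spans.append((name, duration_s))

EXPORTER = ListExporter()  # the only line that changes per backend

@contextmanager
def span(name):
    start = time.perf_counter()
    try:
        yield
    finally:
        EXPORTER.export(name, time.perf_counter() - start)

with span("ingest_batch"):
    time.sleep(0.01)

print(EXPORTER.spans[0][0])  # → ingest_batch
```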

Looking ahead, I believe AI-driven SRE agents will play a major role—automatically identifying gaps, detecting anomalies, and even recommending or executing remediation steps. Observability is moving from passive dashboards to active intelligence.

As AI becomes increasingly embedded in vehicles and industrial platforms, what infrastructure shifts do you believe engineers must prioritize to support the next generation of intelligent, connected systems?

Engineers need to shift from designing static infrastructure to building adaptive, policy-driven platforms.

This means prioritizing edge intelligence to reduce latency and bandwidth costs, designing data pipelines that treat data residency as a first-class constraint, and embracing automation everywhere—from capacity planning to incident response.

Equally important is decoupling systems through Kubernetes, open standards, and vendor-neutral tooling. This flexibility allows teams to adopt new AI models, observability tools, and hardware accelerators without re-architecting everything.

Ultimately, the next generation of connected systems will be defined less by individual technologies and more by how quickly infrastructure can evolve alongside AI capabilities.

As AI becomes more deeply embedded in vehicles and industrial systems, Mane sees infrastructure evolving from static, region-bound deployments into adaptive, policy-driven platforms. Engineers must design with data sovereignty as a first principle, automate aggressively, and build on open standards that allow rapid shifts in tooling, models, and scale. In connected ecosystems where backend decisions directly affect the customer experience, resilience and accountability are not optional features but architectural requirements. Mane’s perspective underscores a broader industry shift, one in which the success of AI-driven systems depends as much on disciplined infrastructure design as on the intelligence of the models themselves.


Author

  • Tom Allen

    Founder and Director at The AI Journal. Created this platform with the vision to lead conversations about AI. I am an AI enthusiast.
