Most conversations about AI in operations start with the model, but Rohit Wadhwa starts with the plumbing. A Senior Software Engineer with roughly fifteen years in enterprise systems, Wadhwa architects the checkout and high-concurrency data platforms behind one of the largest retailers in the world, where a few seconds of latency at the register translates into real customers stuck in real lines across thousands of stores.

His answer to that pressure was “Checkout Doctor,” an LLM-based assistant that reads the chaos of a checkout incident (scattered logs, firing alerts, mismatched telemetry) and hands an on-call engineer a plain-English root-cause analysis in a fraction of the time a human would need. What makes his account worth reading is that he is candid about where the technology let him down. The model was too slow and too prone to inventing an error code to be trusted with a live fix, so he narrowed its job to gathering and summarizing, and left the operational decisions to predictable code.

That instinct, to treat AI as an assistant rather than an authority, runs through everything he discusses here: why the migration to a NoSQL architecture on Azure Cosmos DB had to come before any AI could earn its keep, how event-driven design turns retail shrink from an after-the-fact forensic exercise into something catchable while the customer is still at the register, and where the line sits between a use case ready for autonomy and one that still needs a human holding the wheel. His closing read on younger engineers, that a clever prompt is worthless without disciplined systems engineering beneath it, is the throughline stated plainly.

You built an LLM-based tool that automates incident triage for checkout systems at a major retailer. What convinced you that large language models were the right approach for SRE work, and where did they fall short of what you expected?

When a major checkout system slows down or fails, on-call engineers are immediately flooded with a massive mess of unstructured data. You have raw logs scattered across Google BigQuery and error alerts firing off at the same time. Traditional scripts just can’t make sense of that chaos because the text isn’t uniform. That’s what convinced me that LLMs were the right tool for the job. We built an AI-driven assistant called “Checkout Doctor” using large language models and the Model Context Protocol (MCP). Because it understands natural language, an engineer can just type what’s happening in plain English. The agent then dynamically figures out which BigQuery datasets to search, scans the patterns, and pulls all that messy context together into a clear root-cause analysis. It does the heavy lifting of reading the chaos much faster than a human can.

But where the models completely fell short of expectations was reliability and speed.

In theory, you want the AI to not just find the problem, but also fix it. In reality, LLMs are predictive text models: they guess the next word. When you are running registers at a massive scale, you cannot afford a tool that confidently “hallucinates” or makes up an error code. They are also just too slow. Waiting 5 or 10 seconds for an LLM to think through a problem feels like an eternity when customers are literally stuck in line at a store. We quickly realized the AI couldn’t be trusted to make final operational decisions on its own. We had to adjust our expectations and limit its role strictly to gathering data and summarizing the incident, leaving the actual decision-making to safe, predictable code. Also, unless the model has been trained on the relevant GCP tables and the data in them, it does not reliably arrive at the correct result.

A lot of companies are bolting AI onto existing operations without rethinking the underlying architecture. When you migrated a national-scale checkout platform from relational monoliths to Azure Cosmos DB, how much of that work was a prerequisite for AI to be useful at all?

The honest truth is, if we hadn’t done the hard work of moving to Azure Cosmos DB, putting AI on the platform would have been a total waste of time.

A lot of companies treat AI like magic fairy dust that you can just sprinkle on top of old systems to make them smart. But back when we were running our old relational SQL monolith, our data was completely bottlenecked. If a checkout register crashed, trying to find out why meant running heavy, slow database searches across a tangled SQL system. At our massive scale, those searches took too long and would clog up the database right when we were already in a crisis. An AI tool can’t sit around waiting for a slow database to send it information while lines are growing at the registers.

To fix this, we migrated to a multi-region, active-active NoSQL architecture using Cosmos DB. This completely eliminated single points of failure and pushed our uptime to “five nines” (99.999%). But we couldn’t just lift and shift the data; we had to rethink how it was structured. We designed a smart partitioning strategy based on storeId to evenly spread the workload across the country, and we optimized our Request Unit (RU) throughput. Because Cosmos DB throughput costs are driven by how many RUs you provision and consume, this optimization gave us incredible cost-performance efficiency.
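As a rough illustration of why a high-cardinality key such as storeId spreads load evenly (not code from the interview), the sketch below hashes store IDs into a fixed number of buckets. It does not use the Cosmos DB SDK or reproduce Cosmos DB's internal hashing; the hash function, the bucket count of 10, and the `store-0001` ID format are all assumptions made for the example.

```python
import hashlib
from collections import Counter


def partition_for(store_id: str, partition_count: int) -> int:
    """Map a partition-key value to a bucket using a stable hash (illustrative only)."""
    digest = hashlib.sha256(store_id.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % partition_count


def load_per_partition(store_ids, partition_count: int) -> Counter:
    """Count how many stores land in each bucket."""
    return Counter(partition_for(s, partition_count) for s in store_ids)


if __name__ == "__main__":
    # Hypothetical store IDs; 5,000 matches the store count mentioned in the interview.
    stores = [f"store-{n:04d}" for n in range(5000)]
    load = load_per_partition(stores, partition_count=10)
    for bucket in sorted(load):
        print(bucket, load[bucket])
```

With many distinct key values, each bucket ends up with a similar share of stores, which is the property that keeps any single partition from becoming a hot spot.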

All of that architectural groundwork was the actual prerequisite for AI. It turned our scattered, slow-moving data into a highly efficient, highly available real-time stream that can be read in milliseconds. Without fixing the underlying plumbing, partitioning your data properly, and making it highly available across regions, your AI is essentially a Ferrari with no gas.

Retail shrink is a $112 billion problem. How does an event-driven architecture change what’s actually detectable in real time, and what can engineering catch that traditional loss-prevention methods miss?

Traditional loss prevention is completely reactive. It’s what we call “post-transaction forensics”—basically, looking at database logs or security camera footage days after a theft has already happened and the merchandise is long gone. At that point, the $112 billion shrink crisis has already taken its toll.

An event-driven architecture completely flips this script by moving us into real-time prevention. Instead of waiting for a receipt to print, every single scan, item void, or scale mismatch at a register or self-checkout triggers an instant message. We built a unified pipeline that handles these events across 5,000 stores in the US and Canada. Even though we have multiple generations of hardware (some old registers emitting raw text and newer ones using JSON), we use an adapter pattern to translate all those messy signals into a single standard format in milliseconds.
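The adapter pattern Wadhwa mentions can be sketched as follows. This is a minimal illustration rather than the retailer's code: the `CheckoutEvent` fields, the pipe-delimited legacy layout, and the JSON key names are all hypothetical.

```python
import json
from dataclasses import dataclass


@dataclass(frozen=True)
class CheckoutEvent:
    """Single standard format that every register generation is translated into."""
    store_id: str
    register_id: str
    event_type: str  # e.g. "SCAN", "VOID", "SCALE_MISMATCH"
    sku: str


def from_legacy_text(line: str) -> CheckoutEvent:
    """Adapter for an assumed legacy layout: STORE|REGISTER|EVENT|SKU."""
    parts = [p.strip() for p in line.split("|")]
    if len(parts) != 4:
        raise ValueError(f"unexpected legacy record: {line!r}")
    store_id, register_id, event_type, sku = parts
    return CheckoutEvent(store_id, register_id, event_type.upper(), sku)


def from_json(payload: str) -> CheckoutEvent:
    """Adapter for an assumed JSON layout used by newer registers."""
    data = json.loads(payload)
    return CheckoutEvent(
        store_id=str(data["storeId"]),
        register_id=str(data["registerId"]),
        event_type=str(data["type"]).upper(),
        sku=str(data["sku"]),
    )


def normalize(raw: str) -> CheckoutEvent:
    """Choose an adapter from the payload shape and return the standard event."""
    adapter = from_json if raw.lstrip().startswith("{") else from_legacy_text
    return adapter(raw)
```

Everything downstream of `normalize` only ever sees `CheckoutEvent`, so adding another hardware generation means writing one more adapter rather than touching the detection logic.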

Because the data flows in sub-second timeframes, engineering can catch complex, high-risk patterns while the customer is still standing at the register.

The biggest breakthrough here is our “Pause/Resume” architecture, which we designed under our platform’s Work Intelligence guidelines. We isolated quick register actions from heavy background workflows. If our cloud orchestrator catches a major anomaly—like an item bypassing a scanner—it instantly sends an atomic “Station Action” command to pause that specific register before the customer can pay. At the exact same second, it pushes a notification to an associate’s mobile device (Checkout Mobile/Upfront).

We even built a tiered model to keep things smooth: for a clear violation, the lane pauses instantly, but for a minor pattern mismatch, it sends a silent notification so the associate can just do a polite visual check. This lets engineering catch in-flight fraud that traditional methods miss entirely, all while keeping the store running smoothly.
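The tiered response described above could be expressed along these lines. It is a sketch under stated assumptions: the two severity levels come from the interview, but the `Command` type and the command names are invented for illustration.

```python
from dataclasses import dataclass
from enum import Enum


class Severity(Enum):
    CLEAR_VIOLATION = "clear_violation"  # e.g. an item bypassing the scanner
    MINOR_MISMATCH = "minor_mismatch"    # e.g. a small pattern mismatch


@dataclass(frozen=True)
class Command:
    kind: str         # hypothetical names: PAUSE_REGISTER, NOTIFY_ASSOCIATE, SILENT_NOTIFY
    register_id: str


def plan_response(severity: Severity, register_id: str) -> list[Command]:
    """Return the commands to send for one detected anomaly."""
    if severity is Severity.CLEAR_VIOLATION:
        # Pause the lane and alert an associate together.
        return [
            Command("PAUSE_REGISTER", register_id),
            Command("NOTIFY_ASSOCIATE", register_id),
        ]
    # Minor mismatch: the lane keeps running; the associate gets a quiet prompt.
    return [Command("SILENT_NOTIFY", register_id)]
```

Keeping the decision in a small pure function like this makes the policy easy to test and audit separately from the messaging layer that delivers the commands.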

When you’re operating at national scale, a triage tool that’s wrong even occasionally can cascade. How do you build trust in an AI system that’s making or recommending operational decisions, and what guardrails matter most?

To prevent that, you can’t just treat the AI like a black box that engineers are expected to blindly trust.

Whenever our triage tool, “Checkout Doctor,” flags an incident or suggests a fix, it cannot just give a generic recommendation. It has to clearly display the exact evidence it used: the specific BigQuery logs, Firebase telemetry events, the exact error patterns, and the telemetry timestamps that led to that conclusion. When an on-call engineer can see the AI’s “thought process” in plain English, they instantly trust the tool much more.

As for guardrails, two things matter most:

First, we built an automated circuit breaker based on a tiered confidence model. If the AI’s confidence score falls below a certain threshold, or if the suggested fix involves a high-risk action—like restarting a service—the AI is blocked from doing anything on its own. It gracefully degrades and immediately hands off control to a human engineer.
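A confidence-gated handoff of this kind might look like the sketch below. The threshold value of 0.9, the `Suggestion` type, and the action names are assumptions for illustration; only the rule itself (low confidence or a high-risk action forces a human handoff) comes from the interview.

```python
from dataclasses import dataclass

# Assumed values: the interview names restarting a service as a high-risk action,
# but does not give the threshold or the full list.
CONFIDENCE_THRESHOLD = 0.9
HIGH_RISK_ACTIONS = frozenset({"restart_service"})


@dataclass(frozen=True)
class Suggestion:
    action: str
    confidence: float  # model-reported score in [0.0, 1.0]


def route(suggestion: Suggestion, threshold: float = CONFIDENCE_THRESHOLD) -> str:
    """Decide whether a suggestion may proceed automatically or must go to a human."""
    if suggestion.confidence < threshold:
        return "handoff_to_human"
    if suggestion.action in HIGH_RISK_ACTIONS:
        # High-risk actions are blocked regardless of how confident the model is.
        return "handoff_to_human"
    return "eligible_for_automation"
```

Note that the high-risk check is independent of the score, so a very confident model still cannot restart a service on its own.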

Second, we strictly isolated our telemetry pipeline from the core Point-of-Sale (POS) loop. This means even if the triage tool experiences network lag or goes down completely, the physical checkout registers in the stores are entirely unaffected and keep running autonomously offline. The AI is there to assist and look for patterns, but it can never become a single point of failure for the retail business.
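One common way to get this kind of isolation is a non-blocking, bounded buffer between the checkout path and the telemetry consumer. The sketch below illustrates that general idea and is not a description of the retailer's pipeline; `TelemetrySink`, `complete_sale`, and the buffer capacity are hypothetical.

```python
import queue


class TelemetrySink:
    """Bounded buffer that never blocks or raises into the checkout path."""

    def __init__(self, capacity: int = 1000):
        self._buffer = queue.Queue(maxsize=capacity)
        self.dropped = 0

    def emit(self, event: dict) -> bool:
        """Try to enqueue an event; drop it if the buffer is full."""
        try:
            self._buffer.put_nowait(event)
            return True
        except queue.Full:
            # The consumer is slow or down: lose the event, not the sale.
            self.dropped += 1
            return False

    def pending(self) -> int:
        return self._buffer.qsize()


def complete_sale(items: list[str], telemetry: TelemetrySink) -> int:
    """Core POS step: finishes regardless of telemetry health."""
    for sku in items:
        telemetry.emit({"type": "SCAN", "sku": sku})
    return len(items)
```

If nothing is draining the buffer, `complete_sale` still returns normally; the only visible effect is a rising `dropped` counter, which is itself a useful health signal.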

You’ve migrated legacy systems and introduced AI tooling on the same platforms. Which of those two is genuinely harder, and why do you think most coverage gets that balance wrong?

Migrating the legacy systems is infinitely harder.

Most tech coverage gets this completely wrong because AI is flashy. It dominates the headlines because it feels like magic. People love to write about dropping an LLM into a business and watching it automatically solve problems. But AI is just the top layer of the house. If the foundation is rotted, the house collapses.

When you migrate a national-scale checkout platform from a legacy, relational monolith to a modern database like Azure Cosmos DB, you are essentially swapping out a plane’s engine while it’s flying at 30,000 feet. You have to move massive amounts of data, completely rewrite how that data is structured using smart partitioning strategies, and ensure “five nines” (99.999%) uptime so that millions of real customers can keep buying groceries without a hitch. If you mess up a legacy migration, the business grinds to an immediate, costly halt.

Building the AI tooling on top of that—like our “Checkout Doctor” assistant—is actually the straightforward part once the data is clean. The AI relies entirely on having a fast, highly available, and standardized stream of real-time data. If you don’t do the grueling architectural groundwork first, your AI has nothing of value to read. Tech media loves to focus on the smart passenger in the car, but as engineers, we know that building the highway is what actually gets you there.

For engineering leaders watching the agentic AI wave, what’s the difference between a use case that’s ready for autonomous handling today versus one that still needs a human in the loop?

For engineering leaders right now, the biggest mistake is buying into the hype that agentic AI can handle everything. The difference between an autonomous use case and one that strictly needs a human comes down to two things: the type of data and the blast radius.

A use case is ready for full autonomy today if it operates in a closed, highly predictable environment where the data is structured and the risk of a mistake is low. Examples include automatically scaling up cloud infrastructure when traffic spikes, routing an internal IT support ticket to the right team, or isolating a single malfunctioning self-checkout register. If the AI gets it wrong, the system can automatically roll it back via deterministic code, and nobody notices. The blast radius is tiny.

On the other hand, you absolutely need a human in the loop the second you hit open-ended, non-deterministic environments with a high blast radius.

Take incident triage during a major retail outage. If an AI agent reads a bunch of logs and decides on its own to deploy an unverified code patch or modify core production database routing schemas during peak holiday shopping hours, a single mistake could shut down checkout lanes across thousands of stores. That is why tools like our “Checkout Doctor” are designed to assist, not control. The AI does the heavy lifting of gathering the data, analyzing the BigQuery logs, and suggesting the root cause in plain English, but a human engineer must always make the final operational decision.

If a mistake can cost the business millions of dollars in a matter of minutes, then AI belongs in the passenger seat, not behind the wheel.

You serve as a hackathon judge and peer reviewer alongside your day job. What patterns do you see in younger engineers’ assumptions about AI that you think will age badly?

The biggest pattern I see—both when judging hackathons and doing peer reviews—is that younger engineers often treat the AI model or the prompt as the entire architecture. They build these incredibly clever prompts, but the underlying data structures, error handling, caching, and network limits are completely ignored.

There are two specific assumptions they make that I think will age badly:

First, they assume that API endpoints and LLMs are always fast, cheap, and perfectly reliable. When you build a prototype in a hackathon, it’s easy to chain three different model calls together because you’re only processing a few requests. But when you try to scale that concept to a production environment handling millions of events per second during peak holiday shopping, those systems collapse. The latency spikes, token costs skyrocket, and the non-deterministic nature of AI means the system breaks in ways they can’t predict.

Second, they treat AI like a replacement for fundamental software engineering. I see a lot of younger developers using AI to generate code blocks without actually understanding the underlying structure. They focus so much on the “smart” feature that they skip learning core systems design—like how to structure a partitioning strategy or build a fault-tolerant data pipeline.

AI is an incredible accelerator, but it isn’t magic. The engineers who thrive long-term will be the ones who realize that a smart prompt is useless without robust, disciplined systems engineering underneath it.

Author

Tom Allen

Founder and Director at The AI Journal. Created this platform with the vision to lead conversations about AI. I am an AI enthusiast.

