Interview

What Breaks When AI Agents Move From Demo to Production

Enterprises are moving AI agents from demos to production, and much of the risk lies in that transition. Shruti Anand has spent more than a decade working on the systems that make it manageable, building developer platforms and enterprise data infrastructure at Snowflake, Splunk, and HPE before turning her focus to agentic AI and developer productivity at a leading technology company. Her background spans security, governance, and data engineering for some of the world’s largest enterprise customers, and it shapes how she thinks about a question many teams are confronting for the first time: when can an agent act on its own, and when does a human need to stay in the picture? In this conversation, she shares her framework for deciding how much autonomy an agent should get, where human oversight genuinely reduces risk, and what tends to break when agents leave the demo environment. 

You spent years building governance and security into enterprise data platforms before moving into agentic AI. What did that work teach you about trust that you think the current agent conversation is missing? 

Agents introduce several new nuances. While much of the conversation focuses on the model or agent itself (accuracy, reliability), enterprises care about whether the entire system (humans plus agents) behaves predictably. From a user perspective, we are not at a point where you can blindly trust an agent’s output; you need to assess:

  1. Reliability: can the agent accurately report what it is doing, without hallucination?
  2. Bias and harmfulness: is the agent producing any output that could be biased or harmful?
  3. Transparency: can it tell you what it did and why?
  4. Accountability: when a mistake happens, who takes ownership?
  5. Combinatorial risk: risks that only become apparent when agents are combined with other agents, skills, and so on.

Organizations today are still working out how to determine when agents can self-heal and work through a use case autonomously, and when they cannot. Among new startups, there is a huge surge in agent security and governance companies focused on whether enough policies are in place to know what agents are doing, keep them acting within boundaries, and make them auditable in case of a threat.

When you’re designing an agent, how do you actually decide how much autonomy it gets? Is there a principle or a test you apply, or is it case by case?  

When it shouldn’t be autonomous: one-shot, high-risk use cases like migration. Think of migrating production data from one database to another. You can trust an agent to generate a plan and write a script, but can you execute it blindly and delete data from the source? Absolutely not. This needs a human in the loop, meaning the steps need to be executed by the human.

When there is a bug, an agent can autonomously find the bug, fix it, run test cases, and raise a PR, but it needs a human to review the diffs and another human to review the PR. This is more of a human “on” the loop condition.

 

Where I have seen autonomous agents work well is in tasks where the potential impact on third-party or first-party user experiences is very, very low. Imagine an agent seeing that a PR was checked in and that all the use cases in a JIRA ticket were addressed; it notifies the JIRA owner and closes the ticket.

So, to summarize, the biggest factor in the decision is the risk threshold. Whenever you have to make a decision, think about whether the agent’s action is irreversible or not. The more it leans towards irreversible, the tighter the risk boundaries you need.
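The risk-threshold idea above can be sketched in a few lines of Python. This is only an illustration: the `AgentAction` fields, the two-level impact scale, and the exact mapping to oversight modes are assumptions made for the example, and a real team would set its own thresholds.

```python
from dataclasses import dataclass
from enum import Enum


class Oversight(Enum):
    HUMAN_IN_THE_LOOP = "human executes the steps"
    HUMAN_ON_THE_LOOP = "agent acts, human reviews"
    AUTONOMOUS = "agent acts on its own"


@dataclass(frozen=True)
class AgentAction:
    name: str
    irreversible: bool  # can the effect be undone once it has happened?
    user_impact: str    # "low" or "high" (hypothetical two-level scale)


def required_oversight(action: AgentAction) -> Oversight:
    """Map an action's risk profile to an oversight mode (assumed policy)."""
    if action.irreversible and action.user_impact == "high":
        return Oversight.HUMAN_IN_THE_LOOP
    if action.irreversible or action.user_impact == "high":
        return Oversight.HUMAN_ON_THE_LOOP
    return Oversight.AUTONOMOUS


if __name__ == "__main__":
    for action in (
        AgentAction("migrate production data", irreversible=True, user_impact="high"),
        AgentAction("fix bug and raise PR", irreversible=False, user_impact="high"),
        AgentAction("close resolved ticket", irreversible=False, user_impact="low"),
    ):
        print(f"{action.name}: {required_oversight(action).value}")
```

Under these assumed rules, the three examples from the answer (a data migration, a bug-fix PR, and closing a resolved ticket) land in the three different oversight modes.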

Where does human-in-the-loop genuinely matter? And where have you seen it added as reassurance without reducing any real risk? 

Human-in-the-loop is an architectural discussion, not just a check box. I would take the following pointers into account:

  • Irreversible, mutating actions: an error in a read is far less damaging than an error in a write action, such as sending an email with links that expose sensitive information to external stakeholders.
  • Agents lack context: agents can work through a problem statement, but they don’t know the organizational context, politics, sensitivities of different groups, and so on. Is the PRD aligned with the VP’s top-level OKRs, or with something the GM mentioned as a key priority in another meeting? An agent probably can’t tell.
  • Ground truth calibration: to make agents, skills, and tools better, you have to calibrate them against human judgments.
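The read-versus-write distinction in the first pointer can be treated as an architectural gate rather than a check box. The sketch below is a minimal illustration under stated assumptions: the tool names, the `READ_ONLY_TOOLS` set, and the `approve` callback are all hypothetical and do not come from any particular agent framework.

```python
from typing import Callable

# Hypothetical tool names; anything not listed here is treated as a write.
READ_ONLY_TOOLS = frozenset({"search_docs", "read_ticket"})


def run_tool(
    tool: str,
    execute: Callable[[], str],
    approve: Callable[[str], bool],
) -> str:
    """Run read-only tools directly; ask a human before anything that writes."""
    if tool in READ_ONLY_TOOLS:
        return execute()
    if approve(tool):  # e.g. a review step that returns True or False
        return execute()
    return f"blocked: {tool} was not approved"
```

Defaulting unknown tools to the write path is a deliberate choice in this sketch: a tool that nobody has classified is gated until someone decides it is safe.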

A lot of teams are pushing agents from demo into production right now. What breaks in that transition that people don’t anticipate? 

Moving from a single-developer workflow to a production-level workflow means things fail at multiple levels:

  • Identity and auth: a developer runs the workflow using sudo with high-level privileges and sees things work locally, but things start to break in production, where the agent does not run with those privileges.
  • Scale and token efficiency: an agent may perform well with a handful of tools and skills. But as adoption grows, the number of tool calls, context size, and orchestration complexity increase dramatically. Latency, token costs, and cascading failures become real production concerns that simply don’t show up in small-scale testing.
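One way to surface both failure modes before production is to run the agent in development with the same scoped permissions and token budget it will have later. The following is a rough sketch, not a reference implementation: `AgentSession`, its fields, and the per-call token estimate are all assumptions made for illustration.

```python
from dataclasses import dataclass


class PermissionDenied(Exception):
    """Raised when the agent calls a tool outside its allowlist."""


class BudgetExceeded(Exception):
    """Raised when a call would push the session over its token budget."""


@dataclass
class AgentSession:
    allowed_tools: frozenset  # least-privilege allowlist for this agent
    token_budget: int         # hard cap for the whole session
    tokens_used: int = 0

    def call_tool(self, tool: str, estimated_tokens: int) -> None:
        """Check permissions and budget before recording the call."""
        if tool not in self.allowed_tools:
            raise PermissionDenied(f"{tool} is not in this agent's allowlist")
        if self.tokens_used + estimated_tokens > self.token_budget:
            raise BudgetExceeded(f"{tool} would exceed the session token budget")
        self.tokens_used += estimated_tokens
```

Because the checks raise exceptions instead of silently proceeding, a workflow that only worked thanks to broad local privileges fails during development rather than after rollout.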

Evaluating an agent is harder than evaluating a traditional feature, since the output isn’t deterministic. How do you approach evaluation in a way that holds up? 

There are multiple layers of evaluation that have to be set up, and no single metric can cover them. The thing to keep in mind is that we are not evaluating a single function call; we are evaluating a whole system that can change its behavior:

  • Evaluate trajectories across different runs, not just a single output
  • Use multiple metrics such as task success rate, trajectory quality, and guardrail adherence (which tools did it call, and were they within trust boundaries?)
  • Judge the output with multiple judges (LLM judges across multiple models, plus humans in the loop)
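To make the layering concrete, here is a small sketch that aggregates several runs into the kinds of metrics listed above. The `Run` structure, the metric names, and the assumption that every judge (LLM or human) returns a score between 0 and 1 are illustrative choices; trajectory quality is left out because it needs a task-specific definition.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass(frozen=True)
class Run:
    task_succeeded: bool
    tools_called: tuple   # names of the tools used during this run
    judge_scores: tuple   # one score in [0, 1] per judge; must be non-empty


def evaluate(runs: list, trusted_tools: frozenset) -> dict:
    """Aggregate metrics over many runs instead of judging one output."""
    if not runs:
        raise ValueError("need at least one run to evaluate")
    return {
        "task_success_rate": mean(1.0 if r.task_succeeded else 0.0 for r in runs),
        # A run adheres only if every tool it called is inside the trust boundary.
        "guardrail_adherence": mean(
            1.0 if set(r.tools_called) <= trusted_tools else 0.0 for r in runs
        ),
        "mean_judge_score": mean(mean(r.judge_scores) for r in runs),
    }
```

Reporting the metrics separately keeps a high judge score from hiding a guardrail violation, which a single blended number could do.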

You’ve built for some of the largest enterprise customers in the world. How do their expectations around governance and accountability shape what an agent is allowed to do?  

The way highly regulated enterprises approach this (think fintech and health care, where each piece of information can result in disastrous events) is multi-step:

  1. Use case intake and risk classification: is this high risk, what kind of data would it work against, how customer-facing is it, and in case of a breach, what are the potential impacts (and are they contained)?
  2. Then there are teams with their own security processes; for example, IT versus core use cases can face varying levels of security scrutiny, lasting months.

All these steps have to happen before AI is even onboarded in a dev system. During dev, the teams go through (1) model validation (bias/fairness/cost, etc.), (2) testing against synthetic data or de-identified data, and (3) risk management processes.  

Finally, there are multiple checks and balances in pre-production and production rollouts, with administrative privileges to ensure that in case of any potential incident, the change can be stopped from propagating further or reversed (logging, incident management, periodic revalidation, drift management).

For product and platform teams trying to adopt agents responsibly without grinding to a halt, what’s your advice on where to start? 

Start by tiering your use cases, beginning with the low-stakes ones, and evolve your processes to know when to keep humans in the loop, when to keep humans on the loop, and when to go completely autonomous. Reserve anything customer-critical, data-related, or irreversible for the final steps, and evolve your bar based on your use case. Invest early in observability, logging, permissions, and incident management.

Author

  • Tom Allen

    Founder and Director at The AI Journal. Created this platform with the vision to lead conversations about AI. I am an AI enthusiast.
