As we enter the final months of this year, agentic AI is building a head of steam. It’s a topic that’s being hotly debated in business and technology circles – nowhere more so than in QA and software development, where teams have been tasked with training and testing the AI models that will underpin new agentic AI systems. Agentic AI is changing the way we build and test applications. Unlike more traditional AI models or rule-based automation, agentic AI systems are designed to pursue goals autonomously and make decisions in real time. This distinction means that QA professionals and developers are being asked to rethink assumptions about how software behaves, how it evolves and how it should be tested.
Defining agentic AI
Before diving in, we should clarify exactly what agentic AI is, because not everyone has a clear understanding of what it means. Simply put, agentic AI refers to systems that are autonomous, proactive and focused on the pursuit of specific goals. They often include components like planning, memory, tool use and feedback loops that enable them to operate independently and adaptively.
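To make those components concrete, here is a minimal, illustrative agent loop in Python. Every name in it is hypothetical rather than drawn from a specific framework, and a production agent would have an LLM do the planning; the sketch only shows how planning, memory, tool use and a feedback loop fit together.

```python
from dataclasses import dataclass

@dataclass
class Action:
    tool: str    # which tool to invoke
    args: dict   # arguments for the tool call

class Agent:
    """Illustrative agent: plan from goal and memory, act via tools, feed results back."""

    def __init__(self, goal, tools):
        self.goal = goal      # high-level objective
        self.tools = tools    # name -> callable
        self.memory = []      # running record of (action, result) pairs

    def plan(self):
        # A real system would have an LLM choose the next action from the
        # goal, the available tools and memory; stubbed here for illustration.
        if not self.memory:
            return Action(tool="search", args={"query": self.goal})
        return None  # agent judges the goal achieved

    def run(self, max_steps=10):
        for _ in range(max_steps):           # feedback loop with a safety bound
            action = self.plan()             # planning
            if action is None:
                break
            result = self.tools[action.tool](**action.args)   # tool use
            self.memory.append((action, result))              # memory update
        return self.memory

# Usage: one fake tool, one goal.
agent = Agent("cheapest flight to Paris",
              {"search": lambda query: f"results for {query!r}"})
print(agent.run())
```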
It’s very different from traditional AI and generative AI (Gen AI) models, both of which are typically reactive, relying on human prompts or input. For instance, traditional AI is usually trained to do one thing really well, such as filtering spam, recommending related products or recognizing images, and it executes based on pattern recognition. Gen AI takes pattern recognition a step further, drawing on deep learning models, extensive training data and natural language understanding to interpret user prompts (e.g., text, audio or video) and produce the requested output. However, it can’t take any further action without additional prompts. Agentic AI, on the other hand, can work toward a goal and complete multi-step processes autonomously.
Because agentic AI systems can operate independently, there’s a risk they could make unauthorized decisions or operate in ways that weren’t intended, especially if goals are poorly defined or misunderstood. That’s why human oversight and comprehensive testing are integral to the development of these applications.
Setting new rules
The arrival of agentic AI has turned traditional coding, development and testing models upside down, and developers and QA professionals will need to apply a new set of principles as a result. In traditional software, code execution follows a predefined, deterministic path: the system executes logic exactly as written by the developers. It doesn’t deviate, improvise or make decisions beyond simple conditionals. Agentic AI, on the other hand, operates at a higher level of abstraction.
For instance, rather than following hard-coded rules, AI agents interpret high-level goals (e.g., book the cheapest flight to Paris at noon tomorrow) and decide how to accomplish them. Agentic AI involves a process of decision-making, planning and adaptation.
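To make the contrast concrete, here is an illustrative sketch. The first function is the traditional, deterministic path; the commented call shows the agentic alternative, where run_agent is a hypothetical entry point rather than a real API.

```python
# Traditional code: a fixed, deterministic path. Same input, same output.
def cheapest(flights):
    return min(flights, key=lambda f: f["price"])

print(cheapest([{"id": "AF123", "price": 120}, {"id": "BA456", "price": 95}]))

# Agentic style: the system is handed a goal, not a procedure. Which tools it
# uses, and in what order, is decided at run time, so two runs can differ.
# `run_agent` is a hypothetical entry point, not a real API:
# result = run_agent(goal="Book the cheapest flight to Paris at noon tomorrow")
```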
While this autonomy introduces flexibility, it also introduces non-determinism. The actions agents take depend on current context, prior memory, available tools and how well they understand the task. As a result, the same input may lead to different behavior depending on subtle environmental changes. Consequently, developers and QA teams will need to shift their thinking to align with agentic AI – to test and debug intent, strategy and behavior, and not just code correctness.
Behavioral debugging
When debugging, developers have traditionally focused on stack traces and variable state at the time of failure. But when it comes to agentic systems, failures often occur because of poor reasoning, ambiguous planning or misused tools. Developers will therefore have to shift from debugging code execution to debugging agent behavior, reviewing decision traces and asking questions such as: Why did the agent choose this path? Was the plan coherent? Did the agent correctly interpret its goal and tools?
For behavioral debugging, developers will need to capture the sequence of steps the agent planned, which tools it selected and the inputs/outputs of each invocation. Besides understanding how a high-level instruction (e.g., arrange travel) was broken into subtasks (e.g., find flights, compare hotels, reserve transportation), developers need insight into memory integrity to validate that information is stored, retrieved and updated accurately.
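A minimal sketch of what such a decision-trace record might look like (all field names here are illustrative, not taken from any particular framework):

```python
from dataclasses import dataclass, field
from typing import Any

@dataclass
class StepTrace:
    subtask: str          # e.g. "find flights" from the goal "arrange travel"
    rationale: str        # the agent's stated reason for choosing this step
    tool: str             # which tool it selected
    tool_input: dict      # the inputs of the invocation
    tool_output: Any      # the output returned
    memory_before: dict = field(default_factory=dict)
    memory_after: dict = field(default_factory=dict)  # diffing these two checks memory integrity

# A full run is then just a list of StepTrace records that can be replayed
# to answer: was the plan coherent, was the goal interpreted correctly,
# was memory stored, retrieved and updated accurately?
```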
Testing agentic AI systems
QA teams will also have to change their approach to testing. Traditional QA is based on a known set of inputs and expected outputs, but with agentic AI, behavior is less predictable. That means QA teams will have to test whether the agent can achieve its goal under different conditions and whether it behaves in an acceptable way.
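As an illustration, a goal-based test might run the agent under several predefined conditions and assert acceptable behavior rather than an exact output. The sketch below uses pytest; run_agent and its trace fields are hypothetical stand-ins for a real harness.

```python
import pytest
from types import SimpleNamespace

# Hypothetical harness stub; in practice this would invoke the real agent.
def run_agent(goal, env):
    met = env["inventory"] != "empty"
    return SimpleNamespace(goal_met=met, failed_gracefully=not met,
                           steps=3, unauthorized_actions=[])

CONDITIONS = [
    {"network": "fast", "inventory": "full"},
    {"network": "slow", "inventory": "sparse"},  # degraded environment
    {"network": "fast", "inventory": "empty"},   # goal may be impossible
]

@pytest.mark.parametrize("env", CONDITIONS)
def test_agent_reaches_goal_acceptably(env):
    trace = run_agent("book the cheapest flight to Paris", env)
    # Pass if the goal was met, or the agent failed gracefully when it couldn't be.
    assert trace.goal_met or trace.failed_gracefully
    assert trace.steps <= 20               # bound runaway planning
    assert not trace.unauthorized_actions  # behavior stays within policy
```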
With agentic AI, human oversight becomes necessary. QA teams can augment their resources with a community of qualified testers who will introduce unexpected variables and uncover edge cases. Adding a diverse pool of human testers into this process is especially important for assessing agentic AI, helping to avoid the widespread inaccuracy, bias, toxicity and other risks that can arise with poorly trained systems.
In addition to traditional KPIs, QA teams must assess new evaluation metrics like accuracy, relevance and hallucination rates. They can also consider other metrics, such as “Task Success Rate,” which evaluates whether the agent completed its goal, or “Coherence and Consistency,” which determines whether the agent’s reasoning was logical and its memory accurate.
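For example, two of these metrics could be computed from a batch of recorded runs along these lines (the runs structure here is a hypothetical harness output, not any real API):

```python
def task_success_rate(runs):
    """Share of runs in which the agent completed its goal."""
    return sum(r["goal_met"] for r in runs) / len(runs)

def hallucination_rate(runs):
    """Share of runs flagged (by human reviewers or automated checks)
    as containing at least one unsupported claim."""
    return sum(bool(r["hallucinations"]) for r in runs) / len(runs)

runs = [
    {"goal_met": True,  "hallucinations": []},
    {"goal_met": False, "hallucinations": ["invented a flight number"]},
    {"goal_met": True,  "hallucinations": []},
    {"goal_met": True,  "hallucinations": []},
]
print(f"Task Success Rate: {task_success_rate(runs):.0%}")    # 75%
print(f"Hallucination rate: {hallucination_rate(runs):.0%}")  # 25%
```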
It’s possible to automate some of these metrics, but in most cases human-in-the-loop validation is still required. That’s because much of what an agent produces must be judged subjectively, and humans are far better than automated checks at evaluating subjective results. Combining human testers with new infrastructure, including sandbox environments that isolate agents, can be very effective: behavior can be observed in controlled simulations under predefined conditions. When it comes to agentic AI, QA and testing must focus on understanding the rationale behind decisions, spotting poor reasoning and coaching the system to improve.
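As a rough sketch of that isolation idea (SandboxedTool is a hypothetical wrapper, not a specific framework’s API): every side-effecting tool call is replaced with a simulation under a predefined condition, and every attempted action is recorded for review.

```python
class SandboxedTool:
    """Wraps a side-effecting tool with a simulation so the agent never
    touches real systems, while every attempted action is recorded."""

    def __init__(self, name, simulate):
        self.name = name
        self.simulate = simulate   # deterministic fake for a predefined condition
        self.calls = []            # observed behavior, kept for human review

    def __call__(self, **kwargs):
        self.calls.append(kwargs)        # record what the agent tried to do
        return self.simulate(**kwargs)   # respond from the simulation only

# Hypothetical usage: hand this to the agent in place of the real booking tool.
book_flight = SandboxedTool("book_flight",
                            simulate=lambda **kw: {"status": "simulated", **kw})
book_flight(flight_id="AF123")
print(book_flight.calls)   # reviewers inspect attempted actions here
```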
Changing behavior
Agentic AI represents a major evolution in how software systems operate, and it has already set new rules for software development, QA and testing, rules that account for autonomous, non-deterministic behaviors. It’s no longer just about catching errors; it’s about fostering responsible and intelligent behavior. This is why it’s essential to keep humans in the loop to help ensure that AI agents pursue their goals correctly while making ethical and accountable decisions. Thorough testing and monitoring of agent behavior are crucial to development, and key to reducing risk and achieving positive outcomes.