
Digital Transformation with AI: The Problematic Realities of Agentic AI in Enterprise Environments

AI adoption is an irreversible trend in corporate America, driven by the promise of increased productivity. We decided to try it at our organization. Our specific focus was Agentic AI: autonomous systems capable of planning, reasoning and executing a sequence of actions.

The goal at Kids Included Together, a nonprofit with twenty-eight years of experience working with more than 600 organizations, was ambitious: to improve customer satisfaction and increase staff capacity by offloading routine inquiries from internal staff, all while ensuring factual accuracy from source data. 

The reality was far more challenging than anticipated. Achieving “success” required redefining it and making three significant compromises: retraining our staff, redoing our source documents and accepting a mere 25% success rate for our agents. Our initial expectations of “transformational capabilities” gave way to a sobering realization of the technology’s current, significant limitations.  

Challenges in Developing AI Search Agents 

Our project aimed to create AI agents to serve as digital search assistants for internal staff, making information retrieval more efficient and automating report creation within our organization’s digital boundaries.  

The Complexity of “Agentic Confusion” 

We quickly learned that the simplified ‘no-code’ promise was too good to be true. We initially aimed for a single agent, but the complexity and “agentic confusion” of terms across our extensive documentation forced us to create multiple specialized agents. 

Technical Obstacles to Accuracy 

Several technical hurdles emerged in ensuring that agents return results that actually exist in the source data and match the search criteria: 

  • Repeatability and Consistency: Ensuring the same search criteria reliably yields the same output remains a significant hurdle. 
  • Context Length Limitations: Strategies had to be developed for processing large documents to manage the inherent context length limitations of the models. 
  • Search Criteria Balancing: A one-size-fits-all approach was ineffective, requiring a highly configurable balance between simple keyword matching and sophisticated semantic matching (a simple blend of the two, along with chunking, is sketched after this list). 
  • Information Loss: A major problem is the “missed in translation” effect, where agents handing information off to one another in a workflow can lose or ‘forget’ details, leading to inaccuracies in the final answer. 
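
To make the chunking and criteria-balancing hurdles concrete, the sketch below shows one simple way they can be handled. It is a minimal illustration, not our production setup: chunk_text, embed, keyword_score and search are names invented for this example, and the bag-of-words embed stands in for a real embedding model.

```python
# Minimal sketch only: chunking to fit a context window, plus a configurable
# blend of keyword and "semantic" matching. The bag-of-words embed() is a
# placeholder for a real embedding model; all names here are illustrative.
from collections import Counter
import math


def chunk_text(text: str, max_words: int = 300, overlap: int = 50) -> list[str]:
    """Split a long document into overlapping chunks that fit a context window."""
    words = text.split()
    chunks, start = [], 0
    while start < len(words):
        chunks.append(" ".join(words[start:start + max_words]))
        start += max_words - overlap
    return chunks


def embed(text: str) -> Counter:
    """Placeholder 'embedding': token counts standing in for a real model."""
    return Counter(text.lower().split())


def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def keyword_score(query: str, chunk: str) -> float:
    """Fraction of query terms that literally appear in the chunk."""
    terms = set(query.lower().split())
    return sum(t in chunk.lower() for t in terms) / len(terms) if terms else 0.0


def search(query: str, document: str, semantic_weight: float = 0.5) -> list[tuple[float, str]]:
    """Rank chunks by a configurable mix of keyword and 'semantic' similarity."""
    q_vec = embed(query)
    scored = [
        (semantic_weight * cosine(q_vec, embed(chunk))
         + (1 - semantic_weight) * keyword_score(query, chunk), chunk)
        for chunk in chunk_text(document)
    ]
    return sorted(scored, reverse=True)
```

The semantic_weight parameter stands for the “highly configurable balance” described above, the kind of setting for which no one-size-fits-all value exists.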

 

Overpromises and Underperformance 

The optimism surrounding Agentic AI is often fueled by curated demonstrations that imply a rapid, nearly flawless transition to a new operational model. 

In practice, the flawless execution in demos often relies on cherry-picking both the source data and the specific tasks to guarantee a favorable result. Unlike deterministic traditional software, agentic systems are frequently non-deterministic: the objective is eventually met, but the process is unpredictable, and a correct answer in one scenario does not guarantee the same result the next time. In our case, users had to be taught specific keywords before the AI would return the correct answer, a flaw that forces the user to help the AI find the right information. 

One solution is to provide suggested prompts, but this raises an obvious question: if you have to tell the users what to ask, why not just create a set of searchable FAQs? This necessary training to narrow the scope of questions acts as a hidden limitation, letting agentic projects quietly redefine their own scope regardless of the initial business requirements. 

Implementation & Support Headaches 

Traditional software support is straightforward; a bug is a faulty gear that can be found and fixed with a specific patch. With AI agents, the path from error to solution is opaque or non-existent. 

Generative AI doesn’t follow strict rules; its creations are unpredictable. If an AI tool gives a bad answer, the IT support team can’t find a single “bug.” They are left with only a limited set of fixes: 

  • Adjusting the ‘Temperature’: Nudging the AI to be more “creative” (risky) or more “regimented” (safe). 
  • Prompt Engineering: Trying to find the right way, amongst vast possible permutations, to ask the question correctly (both knobs are sketched below). 
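
To show how narrow those two levers really are, here is a minimal sketch, assuming the OpenAI Python SDK (other vendors expose the same knobs under different names); the model name, prompts and question are illustrative placeholders rather than our actual configuration.

```python
# Minimal sketch, assuming the OpenAI Python SDK; other vendors expose the same
# knobs under different names. The model name, prompts and question below are
# illustrative placeholders, not our actual configuration.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment


def ask_agent(question: str, temperature: float, system_prompt: str) -> str:
    """The two support levers: nudge the temperature, or reword the prompt."""
    response = client.chat.completions.create(
        model="gpt-4o-mini",      # illustrative model choice
        temperature=temperature,  # lower = more regimented, higher = more creative
        messages=[
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": question},
        ],
    )
    return response.choices[0].message.content


# Lowering the temperature or rewording the system prompt may fix one user's
# bad answer while degrading answers to other users' questions.
answer = ask_agent(
    "Which training modules cover inclusion policies?",  # illustrative question
    temperature=0.2,
    system_prompt="Answer only from the retrieved documents; say 'not found' otherwise.",
)
```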

The critical issue is that fixing a bad answer for one person often breaks the accuracy of the AI for other people asking different questions. 

The “Ask a Human” Failure 

In our testing, the AI frequently got its facts wrong. The counter-intuitive “fail-safe” step experts proposed was to embed this instruction in every AI answer: “Please have a human check this, and here’s the document I used.” 

This mandatory human check fundamentally upended the entire point of building the AI tool. 

Instead of a clever assistant that provides a validated, final answer, the AI becomes more of a fancy search engine. The user must still do all the hard work: reading the source, putting the facts together and checking if the answer is correct. The burden of work is simply transferred back to the user – the exact task the AI was supposed to eliminate. When the user still has to do the heavy lifting of fact-checking and validation, the value of the AI tool disappears. 

Accuracy and Testing Issues 

Maximizing the performance of agentic AI hinges on selecting the optimal prompt, data sources, LLM and LLM settings for a tightly constrained set of tasks. This necessity places a substantial burden on IT teams to develop and manage many specialized agents, diminishing the perceived productivity gains. 

A core problem arose when documents contained overlapping categories: the AI struggled to separate these correctly, often failing to route a query to the appropriate agent. 

Traditional software testing is limited to known use cases. However, a system that allows open-ended questions introduces an infinite number of use cases, making comprehensive testing impossible. It becomes an anecdotal “it works until it doesn’t” scenario. This fundamental lack of complete validation shifts the burden: users must validate the answers provided. 
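
One way to at least catch regressions is a small suite of “golden questions” with known facts. The sketch below is illustrative rather than our actual harness: ask stands for whatever function queries the agent, and the example facts are drawn from this article. Anything a user asks outside such a list remains untested, which is exactly the gap described above.

```python
# Minimal sketch of a "golden question" regression suite; illustrative only,
# not our project's test harness. `ask` is whatever callable queries the agent,
# and the questions and expected facts are examples drawn from this article.
GOLDEN_CASES = [
    # (question, fact that must appear in a correct answer)
    ("How many organizations has KIT worked with?", "600"),
    ("How many years of experience does KIT have?", "twenty-eight"),
]


def run_regression(ask, cases=GOLDEN_CASES) -> float:
    """Return the fraction of known questions whose answers contain the expected fact."""
    passed = 0
    for question, expected in cases:
        answer = ask(question)
        if expected.lower() in answer.lower():
            passed += 1
        else:
            print(f"FAIL: {question!r} -> {answer[:80]!r}")
    return passed / len(cases)
```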

Third-Party Agent Concerns 

Turning to third-party agents does not solve the validation challenge. Unlike traditional software, which comes pre-tested, AI agents behave dynamically. Their performance can vary drastically based on the specific data sources and user requests they encounter. The IT project manager cannot be certain of a third-party agent’s performance without the same rigorous, internal testing and validation, data source by data source, as if they had built the agent themselves. 

It’s All in the Training 

The pursuit of reliable AI output currently requires a costly and fundamental shift: we are not only retraining users to craft better queries, but we are also systematically rewriting source documentation. This process did uncover pre-existing organizational weaknesses that we addressed. Ultimately, much of the perceived success of AI implementation is currently achieved through extensive human and organizational adaptation, effectively engineering the agents’ desirable outputs. 

Redefining Success 

Only two of the eight agents could be classified as successful, but only after significantly lowering the bar. We ultimately redefined our success metric, shifting from the initial goals of “significantly increasing productivity” or “freeing up humans for human work” to the more achievable “making our employees’ jobs a little easier.” 

Confusion Surrounding Costs 

The transactional pricing for AI implementations is notoriously difficult to follow. Pricing can be by user, by interaction, by trigger or any combination. Compounding this, there is typically no caching of answers with agentic AI: when the same question is asked repeatedly, you pay for an AI transaction every single time, whether the answer is new, a repeat or incorrect. 
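
To illustrate what such caching would look like if a platform offered it, here is a minimal, hypothetical sketch of an answer cache keyed on a naively normalized question; none of this is a capability we had out of the box.

```python
# Minimal sketch of an answer cache keyed on a naively normalized question.
# This is a hypothetical mitigation, not a feature of the platforms we used,
# and identical facts phrased differently would still miss the cache.
import re

_cache: dict[str, str] = {}


def normalize(question: str) -> str:
    """Lowercase and collapse whitespace so trivially repeated questions match."""
    return re.sub(r"\s+", " ", question.strip().lower())


def cached_ask(question: str, ask) -> str:
    """Answer from the cache when possible; only a miss costs an AI transaction."""
    key = normalize(question)
    if key not in _cache:
        _cache[key] = ask(question)  # the metered call happens only here
    return _cache[key]
```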

Worth the Limitations? 

After extensive configuration and testing, the critical question remains: is the investment worth it when subject matter experts (SMEs) still deem the answers incorrect or only somewhat correct? Organizations need a robust process to evaluate the true value of agentic AI versus simpler alternatives, such as improving keywords, tags and search-enabled FAQs.  

Summary 

Reflecting on the path, our journey toward creating a truly productive AI agent was like the classic arcade game Frogger. You push forward, only to be struck by an unforeseen complication that forces you back to the start line to rethink your strategy, whether that means adjusting ‘temperature’ settings, wrestling with documentation, or finding the right topic boundaries. 

 

The final reward is not the high score we originally aimed for, but the immense relief of having made it to the other side with a meager success rate. As the credits inevitably roll, the essential disclaimer appears: “Everything shown needs to be human-checked before it is acted on as fact.” 

 

 

Dr. J.A. Halick has over 25 years of experience in education, digital transformation, eLearning, and project management (PMP Certified).
