A terrible power we possess as humans is our ability to justify decisions after the fact. If you are quick enough and don’t stray too far outside the bounds of the acceptable, you can always come up with a reason why you did something.
That’s not a good thing. In the past 20 years, business gurus and therapists alike have increasingly campaigned for “being wrong” as a skill one can improve and excel at. It’s not that it’s good to be wrong. Rather, it’s good to be able to self-reflect and acknowledge when you made a decision without the correct reasoning in the first place, regardless of whether you can justify it after the fact. You can be wrong and still get away with it. You can be wrong and still convince others you were right.
Guess what? This is decidedly NOT a skill that any LLM I’ve encountered to date has mastered. The opposite is true: much like humanity, ChatGPT is a master bullshitter, able to justify and provide “reasoning” for any decision it makes after the fact. That reasoning can often pass the smell test and, in small samples, seem like it is keeping your agentic system accountable.
This is a lie. It’s akin to a math test where you need to show your work, but you put the answer down first and work backward from there. Or, in the world of machine learning, overfitting your model and not realizing it.
Keeping your AI transparent and accountable requires specificity, forethought, granularity, and trackability.
Specificity
If you want your AI to remain transparent, you need to keep your needs specific. In the context of an agentic system, that means ensuring it solves a narrow scope of problems. If you have other problems, build other agentic systems. Hospitals aren’t run on a single spreadsheet (ok, bad example) but on numerous systems tied together. Big Four auditors don’t audit a single spreadsheet; they audit the processes involved in making that spreadsheet.
Forethought
Whether you are using a tree of thought, coming up with a decision tree yourself, or something else entirely, it is foolish to think state-of-the-art agentic systems can accomplish complex tasks while remaining accountable unless the decision-making process is designed up front. Some entity needs to agree upon and monitor how decisions are supposed to be made in your system.
This shouldn’t be a surprise to anyone: you don’t train an employee on 10 different ways to do the same task. You train them on the right way to perform that task. AI systems are not smarter than humans (yet), so why would you trust an agentic system with the creative task of deciding, every single time, how a task should be accomplished?
This level of control is onerous and belies many of the promises made by deep research and agentic systems. But if you want to be able to trust your system to consistently perform tasks for your business without human intervention, you need to narrow the scope of what you want to accomplish with a single agentic system.
Granularity
Going hand in hand with forethought: when you are designing your decision tree, you need to ensure it is sufficiently granular. This involves a fair amount of systems thinking on your part, because LLMs are terrible at consistently making broad value judgments and, therefore, at choosing an optimal outcome on their own.
In my work, I identify the most granular decision points possible in my system. This can dramatically increase the cost of my agentic systems, but it allows me to fine-tune and to trust that my system is capable of making decisions consistently.
I was horrified back in the days of GPT-3.5, realizing that despite its creative writing prowess, asking it to rate something from 1 to 10 against a given criterion produced only slightly better-than-random results. I wouldn’t have been able to identify that issue if I hadn’t broken my requests down into extremely granular pieces.
Don’t ask the AI to “evaluate this document.”
Ask the AI to:
“evaluate the first paragraph based on [criterion 1]”
“evaluate the first paragraph based on [criterion 2]”
“evaluate the second paragraph based on [criterion 1]”
… and so on.
And then ask the AI to combine those decisions.
“Based on these decisions about the first paragraph, decide this _____.”
And then finally put it all together:
“Based on these decisions about the paragraphs in the document, decide this _____.”
You will see higher costs but also far more consistent results, especially if you fine-tune models for those particular needs.
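In code, the pattern looks something like the sketch below. This is a minimal illustration, not a prescription: call_llm is a hypothetical stand-in for whatever model client you use, and the prompts and criteria are placeholders.

```python
# Minimal sketch of the granular-evaluation pattern above (illustrative only).
# `call_llm` is a hypothetical stand-in for whatever model client you use;
# the prompt wording and criteria are placeholders, not prescriptions.

def call_llm(prompt: str) -> str:
    """Send one prompt to your LLM of choice and return its text response."""
    raise NotImplementedError  # wire up your own client here


def evaluate_document(paragraphs: list[str], criteria: list[str]) -> str:
    paragraph_decisions = []
    for i, paragraph in enumerate(paragraphs, start=1):
        # Most granular decision point: one request per (paragraph, criterion) pair.
        judgments = [
            call_llm(f"Evaluate this paragraph based on {criterion}:\n\n{paragraph}")
            for criterion in criteria
        ]
        # Combine the per-criterion judgments into one decision about the paragraph.
        paragraph_decisions.append(
            call_llm(
                f"Based on these decisions about paragraph {i}, decide whether it is acceptable:\n"
                + "\n".join(judgments)
            )
        )
    # Finally, combine the per-paragraph decisions into a document-level decision.
    return call_llm(
        "Based on these decisions about the paragraphs in the document, "
        "decide whether the document as a whole is acceptable:\n"
        + "\n".join(paragraph_decisions)
    )
```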
Trackability / Audit Logs
Unless you can track and assess these granular decisions, you have added to your costs without gaining any provable reliability. Not to get into implementation details, but with every granular request, I ask my AI systems to also include two extra fields (a rough sketch follows below):
flag_for_review: a flag that, if true, will save this granular decision for later review by a human.
flag_reason: why did the AI flag this decision?
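As a rough sketch, and assuming a model that can return structured JSON, each granular call might produce a record like the one below. The decision field, the prompt-level JSON instruction, and the choice to log every record (not just flagged ones) are my own illustrative additions; call_llm is the same hypothetical helper as above.

```python
import json

def granular_decision(prompt: str, audit_log: list[dict]) -> dict:
    """Make one granular decision and record it for later auditing."""
    response = call_llm(
        prompt
        + '\n\nRespond as JSON with keys "decision", '
          '"flag_for_review" (true/false), and "flag_reason".'
    )
    record = {"prompt": prompt, **json.loads(response)}
    # Log every record, flagged or not, so that both targeted review of flagged
    # decisions and random sampling of the rest are possible later.
    audit_log.append(record)
    return record
```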
Combined with random sampling and review of those decision points, I can demonstrate that my system is reliably making the correct decision at every step of the chain of reasoning required to accomplish the task at hand.
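The review loop itself can then stay simple: everything flagged gets a human look, plus a random sample of unflagged decisions to estimate how often they are actually correct. A sketch, with an arbitrary 5% sample rate:

```python
import random

def select_for_review(audit_log: list[dict], sample_rate: float = 0.05) -> list[dict]:
    """Pick decisions for human review: all flagged, plus a random sample of the rest."""
    flagged = [r for r in audit_log if r.get("flag_for_review")]
    unflagged = [r for r in audit_log if not r.get("flag_for_review")]
    sample_size = min(len(unflagged), max(1, round(len(unflagged) * sample_rate)))
    return flagged + random.sample(unflagged, sample_size)
```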
Does it matter? Is such fine-grained auditability worth the cost and time?
Or rather, can you afford for your AI system to be wrong without you knowing it? For many, many use cases, that risk is completely acceptable. If there is a human in the middle, if there are other forms of accountability, or if your AI is simply a tool to increase efficiency (rather than replace humans), then maybe this level of auditability isn’t worth it.
But if people are depending on your service, if you are in a position of power and people are taking your advice on faith, or, worse, if your AI decision-making is an unchecked, foundational part of your business model, then you have a professional and ethical duty to maintain accountability.
As a data scientist, I genuinely find it insane how much appetite companies have for risk when LLMs are involved, especially compared to 5-10 years ago. Back then, it was an uphill battle to get even rigorous, highly predictable black-box machine learning algorithms accepted into business models. Business leaders wanted proof and accountability.
Where did that go?
Bring it back.