
Enterprise AI failures are often blamed on the model. Varun Kumar Nomula sees the problem differently. In his view, the real test of AI begins after the prototype, when systems meet complex data, regulated workflows, human decision-making and the need for verifiable trust.
Nomula is an AI and machine-learning professional whose work spans data science, cloud architecture, healthcare AI, enterprise analytics and responsible AI governance.
Across research and applied enterprise environments, his work has focused on building AI systems that are technically capable, reliable, explainable, auditable and aligned with real-world business and regulatory needs.
His broader contributions include work in AI-driven clinical decision support, LLM evaluation, healthcare analytics, natural language interfaces for complex data environments and governance frameworks for responsible AI adoption.
In this interview, Nomula discusses why trustworthy AI is becoming an enterprise requirement, what organisations often miss when moving from experimentation to production, and how evaluation, governance and oversight must evolve as AI systems begin supporting higher-stakes decisions.
Q1. Having worked across enterprise AI deployments at scale, what systemic failure patterns have you observed that organizations often underestimate, and how have you addressed them in your own work?
The field has largely misdiagnosed why enterprise AI fails. Most post-mortems focus on model quality: hallucinations, accuracy, benchmarks. In my experience, the failure almost never starts with the model. It starts with the accountability architecture that was never designed.
In practice, off-the-shelf frameworks consistently fall short of what regulated environments require: coded data structures, domain-specific schema constraints, queries needing interpretation before execution. The model is capable. The surrounding system needs to be engineered for trust.
Published work I have contributed to on AI-driven clinical decision support reinforces this pattern. These systems fail not because the AI is weak, but because implementation ignores training data bias, gradual displacement of clinical judgement, and absence of proportional human oversight. The organisations that scale AI successfully stop asking, “Does the model work?” and start asking, “Can we govern this system over time?” That shift changes everything downstream.
Q2. At the intersection of AI and regulated industries such as healthcare, what evaluation, compliance, and operational challenges make trustworthy AI especially difficult to implement?
AI in healthcare exposed a gap I had not seen as clearly elsewhere: the distance between a system that performs well on a benchmark and one that is fit for purpose in a clinical workflow. That gap is where trust breaks down.
My research has approached this from multiple directions. In peer-reviewed work evaluating top AI models on vaccine sentiment and hesitancy classification across tens of millions of social media posts, even the highest-performing model struggled with sarcasm, implicit language and multi-construct hesitancy. In a public health context, those failures could misdirect communication strategies at population scale.
In parallel, work I have contributed to on clinical decision support documented how systems encode training data bias invisibly into recommendations. And in the natural-language-to-database query systems I have helped design, the lesson was that a general-purpose model cannot reliably operate on healthcare data without domain-specific grounding built into the architecture. In regulated environments, a wrong output does not just produce a bad answer; it produces a misleading one that influences real decisions. Governance has to start at the design stage, not the review stage.
Q3. In high-stakes AI and LLM use cases, what techniques or evaluation approaches do you consider most important for improving reliability, transparency, and accountability?
Most reliability frameworks solve at the wrong layer. Fine-tuning, better prompting, guardrails: those matter, but they address the least controllable part of the system. In my experience, reliability in high-stakes environments lives in the architecture surrounding the model, not inside it.
In published research testing four leading LLMs across zero-shot, one-shot and few-shot paradigms on vaccine discourse, one of the key findings was that increasing the number of examples given to the model produced no statistically significant difference in performance. Smarter evaluation design matters more than more data.
In medical imaging research I have contributed to, deep learning improved diagnostic accuracy, but only when evaluation accounted for demographic diversity in training data, not just aggregate scores. Bias at the evaluation stage becomes bias at the clinical stage.
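As an editorial illustration of the point about aggregate scores (not code or data from the research described), the sketch below reports accuracy per demographic group alongside the overall figure. The function name `accuracy_by_group`, the group labels and the records are all hypothetical and synthetic.

```python
from collections import defaultdict

def accuracy_by_group(records):
    """Return (overall_accuracy, {group: accuracy}) for (group, y_true, y_pred) records."""
    totals = defaultdict(int)
    correct = defaultdict(int)
    for group, y_true, y_pred in records:
        totals[group] += 1
        correct[group] += int(y_true == y_pred)
    overall = sum(correct.values()) / sum(totals.values())
    per_group = {group: correct[group] / totals[group] for group in totals}
    return overall, per_group

# Synthetic records for illustration only: a large group with 90 of 100
# correct predictions and a small group with 6 of 10 correct.
records = (
    [("group_a", 1, 1)] * 90 + [("group_a", 1, 0)] * 10
    + [("group_b", 1, 1)] * 6 + [("group_b", 1, 0)] * 4
)

overall, per_group = accuracy_by_group(records)
print(f"overall: {overall:.2f}")
for group, acc in sorted(per_group.items()):
    print(f"{group}: {acc:.2f}")
```

In this made-up example the overall figure is dominated by the larger group, so the weaker performance on the smaller group only becomes visible in the per-group breakdown.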
In practice, the techniques that proved most critical in natural language to structured query systems were schema-aware retrieval, coded-attribute resolution, interpretation generation before execution and full audit logging. Together, these transform a capable model into a governable one: auditable, testable, correctable. That combination is what makes AI trustworthy enough for real enterprise adoption.
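The four techniques named above can be sketched, under heavy simplification, as a small pipeline. This is an editorial illustration and not the system described in the interview: the table, the code lookup, the function names and the approval callback are all hypothetical, and a hand-written `plan_query` stands in for the language model.

```python
import sqlite3
from datetime import datetime, timezone

# Hypothetical schema metadata and code table, invented for this sketch.
SCHEMA = {"encounters": ["patient_id", "dx_code", "visit_year"]}
CODE_LOOKUP = {"type 2 diabetes": "E11"}  # illustrative diagnosis-code prefix
AUDIT_LOG = []

def audit(event, **details):
    """Full audit logging: record every step with a UTC timestamp."""
    AUDIT_LOG.append({"ts": datetime.now(timezone.utc).isoformat(), "event": event, **details})

def validate_against_schema(table, columns):
    """Schema awareness: refuse plans that touch unknown tables or columns."""
    unknown = [c for c in columns if c not in SCHEMA.get(table, [])]
    if table not in SCHEMA or unknown:
        raise ValueError(f"plan references unknown schema elements: {table} {unknown}")

def plan_query(question):
    """Stand-in for the model: resolve coded attributes and build a parameterised query."""
    matches = {t: c for t, c in CODE_LOOKUP.items() if t in question.lower()}
    if not matches:
        raise ValueError("no known coded attribute found in question")
    term, code = next(iter(matches.items()))
    validate_against_schema("encounters", ["patient_id", "dx_code"])
    return {
        "sql": "SELECT COUNT(DISTINCT patient_id) FROM encounters WHERE dx_code LIKE ?",
        "params": [code + "%"],
        "interpretation": f"Count distinct patients with a diagnosis code starting with {code} ({term}).",
    }

def answer(conn, question, approve):
    """Show the interpretation first; execute only if the reviewer approves."""
    audit("question", text=question)
    plan = plan_query(question)
    audit("interpretation", **plan)
    if not approve(plan["interpretation"]):
        audit("rejected")
        return None
    (count,) = conn.execute(plan["sql"], plan["params"]).fetchone()
    audit("executed", result=count)
    return count

# Synthetic in-memory data for illustration only.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE encounters (patient_id INTEGER, dx_code TEXT, visit_year INTEGER)")
conn.executemany(
    "INSERT INTO encounters VALUES (?, ?, ?)",
    [(1, "E11.9", 2022), (1, "E11.9", 2023), (2, "I10", 2023), (3, "E11.65", 2023)],
)
result = answer(conn, "How many patients have type 2 diabetes?", approve=lambda text: True)
```

The design choice worth noting is that the plain-language interpretation is logged and shown before anything runs, so a reviewer can reject a misread question and the audit trail records what was asked, how it was understood and what was executed.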
Q4. You work across AI product, strategy, governance, and ethics — what emerging shifts do you believe practitioners and executives are still underestimating?
The most underestimated shift is from AI as a capability to AI as infrastructure. Most teams are still evaluating individual use cases, whether a specific tool performs well enough for a specific task. That framing is already obsolete. AI is being embedded into how data is accessed, how research questions get answered, how clinical decisions get supported. The governance question is no longer whether to adopt AI; it is whether the organisation can manage it as a continuous operational responsibility.
The second shift is evaluation debt. In published research analysing tens of millions of social media posts across nearly a decade of vaccine discourse, one of the key findings was that online discussion volume is an unreliable predictor of behavioural change. That maps directly to AI deployment: surface metrics are not sufficient. Models drift. Populations shift. Without continuous evaluation infrastructure, organisations do not know whether the system they trusted six months ago still performs the way they believe.
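One simple form that continuous evaluation could take is a rolling comparison of recent accuracy against the figure measured at deployment. The sketch below is an editorial illustration only: the class name `RollingEvaluator`, the window size and the tolerance are assumptions, and a real deployment would need labelled outcomes and metrics suited to the task.

```python
from collections import deque

class RollingEvaluator:
    """Track accuracy over the most recent labelled outcomes and flag degradation."""

    def __init__(self, baseline_accuracy, window=100, tolerance=0.05):
        self.baseline = baseline_accuracy
        self.tolerance = tolerance
        self.outcomes = deque(maxlen=window)  # oldest outcomes drop off automatically

    def record(self, y_true, y_pred):
        self.outcomes.append(y_true == y_pred)

    def current_accuracy(self):
        if not self.outcomes:
            return None
        return sum(self.outcomes) / len(self.outcomes)

    def has_degraded(self):
        """True only once the window is full and accuracy falls below baseline minus tolerance."""
        if len(self.outcomes) < self.outcomes.maxlen:
            return False
        return self.current_accuracy() < self.baseline - self.tolerance

# Synthetic illustration: ten correct predictions, then three errors.
monitor = RollingEvaluator(baseline_accuracy=0.90, window=10, tolerance=0.05)
for _ in range(10):
    monitor.record(1, 1)
degraded_before = monitor.has_degraded()
for _ in range(3):
    monitor.record(1, 0)
degraded_after = monitor.has_degraded()
```

With these made-up numbers the window holds seven correct and three incorrect outcomes after the errors arrive, so the check moves from not degraded to degraded.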
The third is clinical judgement displacement. Work I have contributed to on AI-driven decision support documented how over-reliance gradually erodes the independent judgement that makes human oversight meaningful. The answer is not to limit AI capability; it is to design oversight that remains substantive rather than ceremonial.
Q5. How do you see your own role in shaping where the field goes next, and what conversations are you most eager to be part of?
The conversation I am most focused on changing is one the field keeps avoiding: that trustworthy AI requires rethinking the entire stack, not just the model.
Most governance conversations happen at the wrong altitude. Executives debate policy. Engineers debate architecture. Rarely does someone work across both simultaneously, with published research at the data layer, the evaluation layer, the clinical systems layer, the imaging layer and the production deployment layer. That cross-stack perspective is what I have spent years building, and it consistently reveals the same thing: the failures are not random. They are structural. Organisations are deploying AI systems on top of data infrastructure, evaluation practices and oversight models that were never designed for what AI is now being asked to do.
My published work makes that case layer by layer. Interoperability and data standards as AI governance prerequisites. Evaluation design mattering more than model selection. Clinical judgement displacement as a systemic risk hiding inside efficiency gains. Demographic bias entering at the training stage and surfacing invisibly at the clinical stage.
The conversation I want to drive is the one that connects those layers into a coherent discipline: not AI ethics as philosophy, not AI engineering as isolated practice, but responsible AI as an operational framework that practitioners can actually implement. That framework does not fully exist yet. Building it, in peer-reviewed research and in production systems, is the work I am most committed to.



