Interview

Building Trustworthy AI for Public Health: Insights From Early Federal Deployments

As artificial intelligence moves from experimentation to real-world deployment, few environments demand more rigor than public health. In this interview, Anindita Nath, an interdisciplinary data scientist with over a decade of experience across AI, machine learning, and biomedical informatics, shares how their work is helping shape the next phase of AI adoption in high-stakes systems. From co-leading one of the earliest agentic AI evaluation efforts in the federal public health space to building large-scale LLM-enabled tools for surveillance and metadata analysis, Nath’s work sits at the intersection of innovation, accountability, and real-world impact.

Drawing on experience across generative AI, natural language processing, and scalable data platforms, Nath offers a grounded perspective on what it actually takes to deploy AI systems that are not just advanced, but reliable, measurable, and safe. In this conversation, they break down how agentic AI is being evaluated in practice, how public health teams are beginning to integrate these systems into daily workflows, and why governance and human oversight remain central to long-term success.

To start, how did you first get involved in applying AI and machine learning within public health systems, and what led you to co-lead one of the earliest agentic AI evaluation efforts in the federal space?

My background in interdisciplinary data science and health AI led me into this work. Over the past decade, I have worked across biomedical informatics, clinical NLP, predictive analytics, scalable AI/ML systems, and LLM applications in real-life settings. I have long focused on using machine learning and large language models to turn complex data into reliable, actionable insights.

At the CDC, that work became more operational because public health teams rely on large, siloed datasets, surveillance streams, metadata repositories, scientific evidence, policy documents, and reporting workflows. The challenge lies in building systems that help analysts and epidemiologists work effectively under time pressure. That is where my experience in predictive analytics, clinical NLP, LLM-enabled applications, and production data workflows became especially useful.

The CDC was already building a stronger AI-enabled culture through enterprise GenAI tools, an active AI Community of Practice, and expanding public health use cases. I supported modernization efforts, including LLM-based metadata search, surveillance inventory automation, entity extraction and classification, Power BI reporting pipelines, and natural-language query tools across Azure OpenAI, Databricks, Palantir Foundry, Streamlit, and Power Platform. Across these projects, the goal was the same: reduce manual work, speed analysis, and make public health data easier to use for decisions.

The Agentic AI Deep Research evaluation project mattered because it tested a new class of models capable of multi-step reasoning, web research, synthesis, and report generation. As a compute team lead, I helped evaluate OpenAI’s Deep Research model on real-world public health tasks, focusing on performance rather than hype. In public health, a fluent model is not enough; it must also be useful, factual, trustworthy, transparent, and safe for high-stakes work. To my knowledge, this was one of the earliest evaluations of agentic AI in both the federal and public health spaces, focused on whether the model could support real workflows in a measurable, reliable, and well-governed way.

When you talk about “agentic AI” in the context of public health, what does that actually look like in practice across real workflows, surveillance, and decision-making environments?

In public health, agentic AI is more than a chatbot. It can take a complex question, break it into steps, search across sources, synthesize evidence, generate structured outputs, and support decision-making workflows. For example, it can help summarize scientific evidence, compare intervention strategies, draft public health communication materials, support policy review, or identify relevant surveillance metadata. In my other CDC work, this same idea appeared in tools like LLM-powered metadata retrieval and visualization systems, where analysts could ask natural-language questions and receive faster, more organized access to program information. The practical value is helping experts navigate complexity at scale.
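To make that plan-search-synthesize loop concrete, here is a minimal sketch of the pattern described above. It is illustrative only: the helper functions, sources, and outputs are hypothetical placeholders, not CDC tooling or OpenAI's Deep Research implementation.

```python
# Illustrative sketch of an agentic research loop: plan, retrieve, synthesize.
# All helper functions below are hypothetical placeholders, not real CDC tooling.
from dataclasses import dataclass

@dataclass
class Finding:
    source: str   # where the evidence came from (kept for traceability)
    summary: str  # short synthesized statement of the evidence

def plan_steps(question: str) -> list[str]:
    """Break a complex public health question into smaller research steps."""
    # In practice an LLM would generate these sub-questions.
    return [f"{question} -- step {i}" for i in range(1, 4)]

def search_sources(step: str) -> list[Finding]:
    """Retrieve candidate evidence for one step (stubbed here)."""
    return [Finding(source="example.gov/report", summary=f"Evidence for: {step}")]

def synthesize(question: str, findings: list[Finding]) -> str:
    """Assemble a structured, source-cited draft for expert review."""
    lines = [f"Question: {question}", "Findings:"]
    lines += [f"- {f.summary} [{f.source}]" for f in findings]
    lines.append("Draft only: requires subject-matter expert review.")
    return "\n".join(lines)

def deep_research(question: str) -> str:
    findings: list[Finding] = []
    for step in plan_steps(question):
        findings.extend(search_sources(step))
    return synthesize(question, findings)

print(deep_research("Compare intervention strategies for seasonal respiratory viruses"))
```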

As part of this early evaluation initiative, in collaboration with organizations such as the Centers for Disease Control and Prevention, how did you approach building frameworks and metrics to reliably assess agentic AI performance in high-stakes settings?

We evaluated agentic AI from two perspectives: technical performance and public health accountability. We examined how well the model retrieved information, synthesized evidence, reasoned through complex tasks, and produced useful outputs for public health professionals. We also assessed trustworthiness, transparency for expert review, and the risks of operating without proper oversight.

My role focused on the computation and evaluation workflow, including prompt design, consistent testing, documentation of limitations, and support for review. We combined quantitative metrics with subject-matter expert feedback, since polished answers can still miss context or rely on weak evidence. The goal was safe, reliable support for real public health workflows.
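As an illustration of how quantitative signals and expert feedback can sit side by side, the sketch below records a single evaluation task. The field names, thresholds, and rubric are assumptions made for the example, not the actual evaluation framework.

```python
# Illustrative sketch: recording one evaluation task that pairs automated
# checks with subject-matter expert (SME) review. Field names, thresholds,
# and the 1-5 rubric are assumptions, not the actual CDC evaluation framework.
from dataclasses import dataclass, field

@dataclass
class TaskEvaluation:
    task_id: str
    prompt: str
    # Automated / quantitative signals
    sources_cited: int = 0
    retrieval_hit_rate: float = 0.0   # fraction of required sources found
    # Expert review (1-5 ratings)
    sme_accuracy: int = 0
    sme_usefulness: int = 0
    sme_trust: int = 0
    limitations: list[str] = field(default_factory=list)

    def passes_review(self) -> bool:
        """A polished answer still fails if experts flag accuracy or trust."""
        return (self.retrieval_hit_rate >= 0.8
                and min(self.sme_accuracy, self.sme_trust) >= 4)

ev = TaskEvaluation(
    task_id="resp-virus-brief-01",
    prompt="Summarize current evidence on ventilation interventions",
    sources_cited=6, retrieval_hit_rate=0.83,
    sme_accuracy=4, sme_usefulness=5, sme_trust=3,
    limitations=["two sources behind paywalls", "one claim lacked citation"],
)
print(ev.passes_review())  # False: expert trust rating is below threshold
```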

What were some of the biggest challenges you encountered when validating these systems, especially given the complexity and sensitivity of public health data and decision processes?

One major challenge was task heterogeneity. A public-facing communication prompt, a legal analysis, a scientific literature review, and an epidemiological investigation do not have the same success criteria. Another challenge was data access: paywalled sources, API restrictions, missing datasets, and unbounded requests sometimes caused failures or shallow outputs. We also observed prompt sensitivity. Complex prompts often need to be decomposed into smaller, computable tasks with clear source criteria, geographic or temporal boundaries, and expected output formats. From a technical standpoint, this reinforced that agentic AI evaluation is not only about model accuracy. It also depends on prompt design, retrieval quality, source traceability, compute constraints, and human-in-the-loop validation.
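A hypothetical example of that decomposition is sketched below: a broad request is split into bounded sub-tasks with explicit source criteria, geographic and temporal scope, and an expected output format. The structure itself is an assumption made for illustration, not a published template.

```python
# Illustrative sketch: decomposing a broad request into bounded, computable
# sub-tasks with explicit source criteria, geographic/temporal scope, and an
# expected output format. The structure is an assumption for illustration.
from dataclasses import dataclass

@dataclass
class SubTask:
    question: str
    allowed_sources: list[str]   # e.g., peer-reviewed or official surveillance
    geography: str               # geographic boundary
    time_window: str             # temporal boundary
    output_format: str           # what the model must return

broad_request = "How effective were RSV interventions recently?"

subtasks = [
    SubTask(
        question="Summarize peer-reviewed evidence on RSV immunization effectiveness",
        allowed_sources=["peer-reviewed journals", "official public health agencies"],
        geography="United States",
        time_window="2022-2024",
        output_format="table: intervention, population, effect estimate, citation",
    ),
    SubTask(
        question="List surveillance datasets that track RSV hospitalizations",
        allowed_sources=["official surveillance systems"],
        geography="United States",
        time_window="current",
        output_format="bulleted list with dataset name and update frequency",
    ),
]

for t in subtasks:
    print(f"{t.question} | {t.geography}, {t.time_window} -> {t.output_format}")
```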

Your work emphasizes building systems that are measurable and reliable. How do you define success when evaluating agentic AI in environments where errors can have real-world consequences?

I define success as improving efficiency, evidence synthesis, and decision support without compromising public health integrity. Overall, experts have seen strong potential for Deep Research to boost productivity and support public health workflows, but they also flagged the need for caution around trustworthiness, source quality, and critical review.

To me, success is not full automation or independent model decision-making. It is a system that produces a strong first draft or structured analysis that experts can review and refine. In high-stakes public health settings, AI must be measurable, auditable, source-transparent, reproducible, and human-supervised, with clear escalation when uncertainty is high.

You have led efforts to modernize public health surveillance using LLM-enabled tools. How are these systems changing the way analysts interact with large, siloed datasets and generate insights at scale?

They are changing both the data access layer and the analytic workflow. In surveillance, analysts often spend significant time locating datasets, understanding metadata, cleaning inputs, and preparing reports before they can even begin interpretation. LLM-enabled tools can automate entity extraction, classification, metadata search, report generation, and visualization. 
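A minimal sketch of what one such extraction step can look like follows; the metadata schema and prompt are assumptions for the example, and the model call is stubbed rather than a real Azure OpenAI request.

```python
# Illustrative sketch: using an LLM to pull structured metadata fields out of
# a free-text dataset description so it becomes searchable and report-ready.
# The schema and prompt are assumptions; the model call is stubbed out.
import json

METADATA_PROMPT = """Extract the following fields from the dataset description
and return JSON only: {"program": str, "pathogen": str, "geography": str,
"update_frequency": str}"""

def call_llm(prompt: str, text: str) -> str:
    """Placeholder for a hosted LLM call (e.g., an Azure OpenAI deployment)."""
    return json.dumps({
        "program": "Respiratory Virus Surveillance",
        "pathogen": "RSV",
        "geography": "national",
        "update_frequency": "weekly",
    })

def extract_metadata(description: str) -> dict:
    raw = call_llm(METADATA_PROMPT, description)
    record = json.loads(raw)
    # Validate before anything lands in a downstream reporting pipeline.
    required = {"program", "pathogen", "geography", "update_frequency"}
    missing = required - record.keys()
    if missing:
        raise ValueError(f"LLM output missing fields: {missing}")
    return record

print(extract_metadata("Weekly national counts of RSV-associated hospitalizations..."))
```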

I built a Microsoft Power Platform AI Builder application that reduced manual monitoring workload by more than 50% while maintaining the accuracy and reliability of its outputs. I also engineered Power BI pipelines for respiratory virus data reporting and developed AI-driven metadata tools that enabled near-real-time insights. These systems strengthen analytics by making data more discoverable, standardized, and ready for downstream modeling, forecasting, and operational decision-making.

Responsible adoption is a major theme in your work. What guardrails or principles do you believe are essential when introducing agentic AI into federal public health infrastructure?

Key guardrails include expert human review, data protection, approved-use limits, source validation, bias monitoring, documentation, and ongoing evaluation. In federal public health settings, governance must be built into the workflow from the start. That means evaluating models before deployment, setting clear rules for sensitive data, reporting limitations transparently, and training staff to understand both AI capabilities and failure modes.

Looking ahead, how do you see agentic AI evolving within public health systems over the next few years, and what role will evaluation and governance play in shaping its long-term impact?

I expect agentic AI to become part of routine public health work, especially in research synthesis, surveillance, predictive analytics, and decision support. Its long-term value, however, will depend on investing as much in evaluation and governance as in innovation. Evaluation must be ongoing, with testing across task types, expert review, clear metrics, and regular validation as models and data environments change. In public health, systems must prove they are accurate, source-grounded, transparent, and useful in real workflows.

Governance makes that possible. It requires clear policies for approved use cases, limits on high-risk tasks, human review, source traceability, privacy and security safeguards, and escalation when evidence is weak. Staff also need training to use these tools critically. The future of agentic AI in public health will depend less on autonomy and more on how responsibly it is integrated into scientific and operational workflows.

Author

  • Tom Allen

Founder and Director at The AI Journal. Created this platform with the vision of leading conversations about AI. I am an AI enthusiast.
