
Varun Debbeti has spent over a decade preparing regulatory submissions for the FDA and EMA, work in which just a single metadata error can block a drug approval entirely. When AI tools started entering clinical workflows, he noticed a problem nobody was writing about: the tools being marketed to pharma were either too risky to use with real patient data, too generic to understand CDISC standards, or too expensive for the global programmers who needed them most. So, he built his own. What he learned in the process says a lot about where AI in drug development is actually headed and where the industry is still getting it wrong.
You’ve built open-source AI tools for clinical data workflows. What problem were you actually trying to solve when you started?
Honestly, I just wanted to use AI to speed up my day-to-day work, but the rules around clinical data made it impossible. We deal with sensitive, patient-level data governed by GDPR, 21 CFR Part 11, and a mountain of internal SOPs. You can’t just copy-paste a dataset into ChatGPT and ask it for SDTM mapping code. That’s a massive compliance breach right out of the gate. Pharma companies need secure, validated environments, and feeding proprietary trial data into a public AI model where it might be used for training is simply not acceptable. I realized if I wanted to use AI safely, I had to build tools that actually respected those boundaries.
So, I took a step back and asked: where in my workflow can AI actually help without ever touching the data? And the answer was, a lot of places. Code generation, code conversion between SAS and R, code explanation for submission readiness, spec generation from program logic, regulatory guidance lookup. These are all process steps that eat up enormous time but don’t require the AI to see a single patient record.
That’s what clinstandards.org became. I built tools around that principle: accelerate the workflow, not the data access. The code generator writes SAS or R code based on your description. The code converter moves logic between languages. The chatbot uses a RAG model trained on CDISC standards and FDA guidance documents to answer regulatory questions. The UAT data generator creates realistic test datasets so you can validate programs without using production data. The spec generator takes your code as input and produces variable-level specifications for NDA or BLA submissions. All of these operate in a space where AI is useful, and data never leaves your control.
I’ve been doing this work for over twelve years across oncology trials. I was on the submission teams for Ogsiveo® NDA and Yescarta® BLA. When you’ve gone through that process, assembling define.xml, writing SDRG and ADRG documents, mapping raw data to SDTM, deriving ADaM, generating TLFs, you realize how much of the labor is repetitive and rules-based, yet still requires deep domain knowledge. AI fits well there. It doesn’t replace the programmer. It takes the friction out of the parts of the job that slow you down.
Most coverage of AI in clinical trials focuses on patient recruitment or diagnostics. What’s happening at the data standards and regulatory submission layer that isn’t getting enough attention?
Patient recruitment and diagnostics are about generating data. Standards and submission are about certifying data. The risk profile is completely different, and the AI opportunity is completely different. But the industry press barely touches it.
Let me walk through what’s actually happening underneath the surface.
AI-assisted SDTM mapping at scale is the first thing to mention. Raw-to-SDTM mapping is one of the most labor-intensive, error-prone activities in statistical programming. Teams are now experimenting with LLMs to automate variable-level mapping suggestions, ingesting a CRF, a CDASH IG, and a target SDTM IG and producing first-draft mapping specs. The underreported problem is confidence calibration. LLMs produce plausible-sounding but wrong mappings with no uncertainty signal, and the downstream consequence is a Technical Rejection Criteria failure at FDA. The tooling hasn’t caught up to the risk model. I’ve seen this play out in the conference papers coming out of PHUSE. There’s exciting work on using SHAP values and XAI to make these AI-driven mapping decisions (trained via Random Forest and XGBoost) transparent and traceable, which is the right direction, but we’re not there yet for production use.
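The confidence-calibration point can be made concrete with a small sketch. This is a hypothetical illustration, not any real tool's API: the `MappingSuggestion` class, the field names, and the 0.90 threshold are all assumptions. The idea is simply that a mapping pipeline should gate AI suggestions on a reported confidence score and route uncertain ones to human review rather than accepting every plausible-looking output.

```python
from dataclasses import dataclass

@dataclass
class MappingSuggestion:
    source_field: str      # raw CRF field name
    target_variable: str   # proposed SDTM target
    confidence: float      # model-reported probability, 0.0-1.0

REVIEW_THRESHOLD = 0.90  # below this, a human must confirm the mapping

def triage(suggestions):
    """Split suggestions into auto-accept and mandatory-review buckets."""
    auto, review = [], []
    for s in suggestions:
        (auto if s.confidence >= REVIEW_THRESHOLD else review).append(s)
    return auto, review

suggestions = [
    MappingSuggestion("SUBJID", "DM.USUBJID", 0.98),
    MappingSuggestion("VSDTC_RAW", "VS.VSDTC", 0.97),
    MappingSuggestion("TUMORSITE", "TU.TULOC", 0.62),  # plausible but uncertain
]
auto, review = triage(suggestions)
print(len(auto), "auto-accepted;", len(review), "flagged for review")
```

Without a calibrated score to triage on, every suggestion in this list looks equally authoritative, which is exactly the failure mode described above.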
Then there’s define.xml and metadata authoring. Define.xml is one of the most brittle artifacts in a regulatory package: hand-authored, enormous, and almost never fully correct on first attempt. Using AI to automatically generate define.xml from ADaM specs, annotated CRFs, and controlled terminology is gaining a lot of traction. But the stakes are incredibly high. A single XML error will block a regulatory submission entirely. And from a GxP compliance perspective, relying on AI creates a major traceability problem, because proving exactly how the model derived specific metadata is almost impossible.
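The brittleness claim is easy to demonstrate. The snippet below is a minimal illustration using Python's standard-library XML parser: a define.xml-style document with one unclosed element fails to parse at all. Well-formedness is only the weakest of the checks a real submission pipeline would run; schema validation and CDISC conformance rules sit on top of it.

```python
import xml.etree.ElementTree as ET

# Two tiny ODM-flavored fragments: identical except the second never
# closes its <Study> element. A single error like this invalidates the file.
good = '<ODM xmlns="http://www.cdisc.org/ns/odm/v1.3"><Study OID="ST1"/></ODM>'
bad = '<ODM xmlns="http://www.cdisc.org/ns/odm/v1.3"><Study OID="ST1"></ODM>'

def well_formed(xml_text):
    try:
        ET.fromstring(xml_text)
        return True
    except ET.ParseError:
        return False

print(well_formed(good), well_formed(bad))  # True False
```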
The USDM and Digital Data Flow pipeline is another one, and it’s massively underappreciated. The TransCelerate/CDISC initiative created USDM as a machine-readable protocol specifically to unlock downstream automation. The logical progression is clear: a structured protocol feeds CDASH, which flows into SDTM, which drives ADaM specs. AI is now being used to execute those handoffs. Too much of the current conversation dismisses USDM as a mere data exchange standard, overlooking its real purpose: stripping the manual, human-in-the-loop steps out of study builds. Given the millions sponsors spend annually on manual data standardization and spec generation, fusing AI with USDM is going to become an absolute necessity.
Something that gets almost zero coverage is FDA’s internal AI use and the asymmetry it creates. FDA has been actively exploring AI tools for reviewing submission packages: pattern matching across historical submissions, flagging common deficiency signatures. If reviewers are using AI to evaluate submissions, but sponsors are still authoring packages with 2005-era tooling, there’s a growing mismatch. Sponsors who understand what FDA’s AI tools are likely sensitive to (define.xml structure, reviewer’s guide completeness, ADRG traceability) will have a real edge.
And then there’s the question of AI-generated SDRG and ADRG documents. How do you qualify an LLM-generated document as part of a regulated submission artifact? That question is still open.
CDISC standards like SDTM and Dataset-JSON are evolving quickly. Where are most statistical programmers struggling to keep up, and where is AI actually helping?
The core problem is that the data-standards ecosystem updates asynchronously. CDISC TAUGs, SDTM model revisions, implementation guides, FDA’s Technical Conformance Guide: they all move on different timelines. A programmer working on a Phase 3 submission needs to know which version of SDTM IG they committed to with the agency, what controlled terminology version applies, and what the latest conformance rules say. Keeping all of that straight manually is where people fall behind.
AI can help here. I used a RAG model to build a chatbot on clinstandards.org that ingests all the recent guidance documents and can answer questions about standards and their linkages. Tools like Google NotebookLM are also useful for knowledge retrieval across the full database of guidances. The point is to give the programmer a way to query the standards instead of digging through PDFs. That’s a real productivity gain.
Now, Dataset-JSON. This is the sharpest current pain point, and I wrote about this extensively in my recent article on CDISC Dataset-JSON v1.1. We’re looking at the most significant modernization of clinical data exchange formats since electronic submissions began. After more than 35 years of relying on SAS V5 XPORT files, the industry is finally moving forward. The FDA published a Federal Register notice in April 2025 requesting public comments on Dataset-JSON v1.1 as a new exchange standard, and 42 comments were submitted to the docket. Before that, from September 2023 through April 2024, the FDA collaborated with CDISC and PHUSE on a comprehensive pilot that confirmed Dataset-JSON can work as a drop-in replacement for XPT with no disruption to existing business processes. Data integrity was maintained throughout all conversions, and no showstopper issues were found.
The technical advantages are big. Dataset-JSON eliminates the 8-character variable name restriction, the 200-character string limit, and the 40-character label constraint that have been forcing programmers into cryptic abbreviations for decades. It supports UTF-8 encoding natively, which matters enormously for global trials with data in Japanese, Chinese, or European languages. The file sizes are dramatically smaller too. I showed in my article that an SDTM LB dataset that’s 2,699 KB in XPT compresses to just 640 KB in Dataset-JSON. For large ADaM datasets approaching FDA’s 5GB limit, that difference can eliminate the need for dataset splitting entirely. And the NDJSON variant enables row-by-row streaming for datasets too large to fit in memory. XPT simply can’t do that.
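The row-streaming point is worth sketching. Below is a simplified illustration of the NDJSON idea: dataset-level metadata on the first line, then one row per line, so a reader processes records one at a time without holding the file in memory. The field names are deliberately pared down and are not a verbatim reproduction of the Dataset-JSON v1.1 schema.

```python
import io
import json

# A tiny NDJSON-style stream standing in for a file on disk:
# line 1 = dataset metadata, each following line = one data row.
ndjson = io.StringIO(
    '{"name": "LB", "columns": [{"name": "USUBJID"}, '
    '{"name": "LBTESTCD"}, {"name": "LBSTRESN"}]}\n'
    '["CDISC01-001", "ALT", 34.0]\n'
    '["CDISC01-001", "AST", 28.0]\n'
    '["CDISC01-002", "ALT", 51.0]\n'
)

meta = json.loads(ndjson.readline())   # read metadata only
cols = [c["name"] for c in meta["columns"]]

n_rows = 0
for line in ndjson:                    # stream rows one at a time
    row = dict(zip(cols, json.loads(line)))
    n_rows += 1

print(meta["name"], n_rows)  # LB 3
```

A flat binary format like XPT has no equivalent of this line-by-line discipline, which is why memory-bound processing of very large datasets is one of the format's selling points.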
But here’s where programmers are struggling. Most biotech and pharma organizations built their entire define.xml-plus-XPT pipeline over decades. The mental model is XPT-first. Dataset-JSON changes the entire validation surface: JSON Schema validation instead of eDT, different null handling, a completely different file structure with columns and rows arrays instead of flat binary records. Open-source conversion tools exist (Lex Jansen’s SAS macros, Python CLI utilities, R packages from Atorus Research) but a lot of programmers don’t yet have fluency with JSON as a data format, let alone JSON Schema. The gap isn’t technical complexity. It’s that there’s been no forcing function to learn it until submission deadlines create one. And when it does become required, there will be a lag period of at least two years where both XPT and Dataset-JSON are accepted. Organizations that start preparing now, evaluating tooling, updating validation procedures, planning dual-format capability, will be in better shape.
Where is AI actually making a difference right now? A few concrete areas. Controlled terminology lookup and flag derivation is one. LLMs embedded in IDE environments like GitHub Copilot or Cursor with CDISC context are useful for generating first-draft derivation logic against CT values. Not because the logic is always right, but because it surfaces the right CDISC codes for review faster than manual lookup. The programmer still validates; the lookup friction drops a lot.
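The lookup friction is the kind of thing a few lines make obvious. This toy sketch maps collected unit strings to controlled-terminology submission values; the four-entry codelist is a stand-in for illustration, where a real pipeline would load the full NCI/CDISC CT package for the terminology version the study committed to, and unmatched terms would route to human review.

```python
# Tiny excerpt of a unit codelist (assumption, for illustration only):
# lowercase synonym -> CDISC CT submission value.
CT_UNIT = {
    "mg/dl": "mg/dL",
    "milligrams per deciliter": "mg/dL",
    "g/l": "g/L",
    "grams per liter": "g/L",
}

def to_submission_value(collected):
    """Normalize a collected unit to its CT submission value, or escalate."""
    key = collected.strip().lower()
    if key not in CT_UNIT:
        raise ValueError(f"No CT match for {collected!r}: route to human review")
    return CT_UNIT[key]

print(to_submission_value("MG/DL"))  # mg/dL
```

The AI-assisted version of this workflow just gets the candidate submission value in front of the programmer faster; the validation step stays.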
Spec document drafting is another one. ADaM variable-level specifications, particularly the boilerplate-heavy sections like variable label, codelist assignment, and derivation narrative for standard variables, are a real LLM use case. The generated text is often seventy to eighty percent correct for standard domains and requires focused review rather than authoring from scratch. For NDA applications with integrated summaries that have fifteen-plus ADaM datasets, that’s real time compression. Enterprise tools like P21E perform a similar function, helping populate NCI codes for codelists and terms and generating value-level metadata for QVAL, AVAL, PARAMCD, PARAM, and so on, but they come with a high fee structure. In my opinion, AI could do a better job if it were trained on FDA business rules and FDA validator rules and structured well.
And cross-domain consistency checking. Teams are using LLMs as an early-pass consistency checker, pasting domain specs or code and asking for logical inconsistency identification before formal QC. It’s not a validated process, but it’s catching things that would otherwise only surface in peer review.
The conference papers I’ve been reviewing back this up. There’s a paper from PHUSE on SASBuddy that shows how you can decouple AI intelligence from SAS execution, keeping the data local, sending only metadata to the LLM, and getting contextually relevant code back. That architecture pattern, metadata out, code back, data never leaves, is what makes AI practical in our regulated environment. Another paper demonstrated a full AI workflow where the LLM identifies the statistical method, parameterizes the analysis from ADaM datasets, and executes through SAS via SASPy. The main takeaway from that work was keeping the LLM requests deliberately simple and supplementing prompts with structured data from a database. That’s domain expertise driving AI adoption, not the other way around.
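The "metadata out, code back, data never leaves" pattern can be sketched in a few lines. Everything below is an assumption for illustration: the function names, the prompt shape, and the placeholder `call_llm` endpoint are not from any real tool. The load-bearing idea is that only structural metadata (variable names and types) ever enters the prompt; patient-level values stay inside the secure environment.

```python
def extract_metadata(dataset_name, columns):
    """Build a prompt-safe description: names and types only, no values."""
    return {
        "dataset": dataset_name,
        "variables": [{"name": n, "type": t} for n, t in columns],
    }

def build_prompt(meta, task):
    """Assemble an LLM prompt from metadata alone."""
    lines = [f"Dataset {meta['dataset']} has variables:"]
    lines += [f"- {v['name']} ({v['type']})" for v in meta["variables"]]
    lines.append(f"Task: {task}")
    return "\n".join(lines)

meta = extract_metadata(
    "ADSL", [("USUBJID", "char"), ("TRT01P", "char"), ("AGE", "num")]
)
prompt = build_prompt(meta, "Write SAS code to summarize AGE by TRT01P.")

# call_llm(prompt)  # placeholder: the returned code is then QC'd locally
print(prompt.splitlines()[0])
```

The generated code comes back, runs locally against the real data, and goes through the normal QC process; at no point does the model see a record.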
You’ve worked on FDA and EMA submissions across oncology and other therapeutic areas. How has AI changed what that process looks like on the ground?
Honestly, on the ground, the change is still incremental. It’s not the revolution people write about on LinkedIn. But the incremental gains are real and they’re compounding.
In my own workflow, the biggest shift has been in spec generation and first-draft programming. When I’m building out an ADaM dataset for an oncology trial, something like ADRS for tumor response or ADTTE for time-to-event endpoints, the derivation logic for best overall response, confirmed response, progression-free survival, and overall survival calculations is complex. There are RECIST criteria, investigator versus central reads, windowing rules. I used to write every spec line from scratch. Now I use AI-assisted tools to generate the first draft of the spec from existing code patterns, and I spend my time reviewing and refining rather than authoring. That’s a different kind of work, and it’s higher-value work.
For submission assembly specifically, the areas where AI has moved the needle are documentation-adjacent tasks. Reviewer’s guide drafting, define.xml metadata population, traceability narratives. These things used to take a team days of tedious manual effort. With LLM assistance, you can get a solid first draft in hours. But every output still needs human review by someone who understands what the FDA reviewers actually look for. That part hasn’t changed.
What also hasn’t changed is the accountability structure. When I submit a package to the FDA, my name is on it. AI didn’t write that code. I did, with AI assistance. The QC was done by humans. The sign-off was done by humans. That chain of responsibility hasn’t been disrupted, and it shouldn’t be.
There’s a lot of optimism about AI accelerating drug development. From where you sit, what’s the most overstated claim — and what’s genuinely underappreciated?
The most overstated claim is end-to-end SDTM mapping automation. The tools claiming to go from CRF to annotated SDTM with minimal human intervention are not production-ready for complex studies. They work tolerably on standard domains with clean source data, your DM, your VS, your basic LB. They fail on non-standard domains, complex SUPPQUAL structures, and studies with unusual visit designs, which are exactly the cases where mapping errors carry the most regulatory risk. A Technical Rejection Criteria failure because an AI confidently mapped something wrong is not a theoretical problem. It’s a real risk that the marketing materials for these tools don’t acknowledge.
There’s also an overstated belief in automated conformance beyond what CDISC CORE rules can check. The gap between syntactic conformance and submission-ready quality is real, and that’s where most Phase 3 rework lives, but LLMs filling that gap reliably is still aspirational. Hallucinated conformance passes are a live risk. The AI doesn’t know that a misinterpretation of data structure can trigger FDA rejection criteria. It doesn’t know about the specific rejection codes (1734, 1735, 1736) that come from metadata and mapping issues.
And here’s a subtle failure mode that doesn’t get discussed: version-aware standards intelligence. Most LLMs have CDISC standards knowledge frozen at some training cutoff, and it’s not version specific. A programmer asking about SDTM IG 3.4 behavior may get an answer that blends 3.2 and 3.3 conventions with no signal that the version boundary matters. In a standards-sensitive domain, that’s dangerous. My tool in the ai-center section of clinstandards.org addresses this by letting the user specify the version in the prompt context.
Now, what’s underappreciated? A few things.
Statistical programmers are uniquely positioned to pick up AI skills and run with them. We already understand data structures, statistical methods, and programming logic. We have the foundation to adopt machine learning techniques faster than most other professional groups. The SASBuddy paper from PHUSE and the work on XAI-driven SDTM mapping both show that when domain experts build AI solutions, the results are materially better than when AI is bolted onto the domain from outside. The paper on using SHAP values for transparent SDTM mapping and ADaM derivation showed that you could make AI decisions interpretable and audit-ready, and it took clinical programming expertise to get there.
The cost story around USDM and Digital Data Flow also doesn’t get enough attention. If AI becomes the connective tissue between machine-readable protocols and automated study build, the savings aren’t marginal, they’re structural. Every sponsor spends millions per year on data standardization. Integrating AI with the USDM architecture could change that math entirely. But the coverage treats it as an interoperability initiative, not a cost story.
Then there’s AI as a learning tool. I’ll give you a specific example. I once used ChatGPT to help me understand why sort order affects p-values for PROC ICLIFETEST when my statistician couldn’t explain it. The answer was that the test assumes data to be independent and identically distributed, so altering the distribution by changing sort order violates that assumption. That’s a case where AI helped me understand a statistical concept that directly impacted my programming decisions. That kind of use, as a learning and reasoning aid, is underrated.
You run clinstandards.org as a community resource alongside your day job. What gap were you filling that existing resources weren’t?
There’s a large global community of statistical programmers, many in countries where access to paid AI tools is limited or where company policies restrict what they can use. I saw three gaps that existing resources weren’t filling.
Free access was the obvious one. Most AI tools that are useful for clinical programming sit behind paywalls or enterprise licenses. A programmer in India or Eastern Europe who wants to learn how AI applies to their SDTM work shouldn’t need a corporate subscription to experiment. Every tool on clinstandards.org is free to use, with no caps on usage.
Consistency and reliability were the second gap. When you use ChatGPT directly, your results depend entirely on how well you write your prompt. One day you get great SAS code, the next day you get hallucinated macro names. I built tools with stable, tested prompts underneath them, prompts that are tuned for clinical programming use cases and produce consistent results. The code generator doesn’t just generate code; it generates code that’s aware of CDISC conventions. The regulatory chatbot doesn’t just search the web; it searches against a curated knowledge base of CDISC standards and FDA guidance through a RAG model.
And then there’s accessibility for non-AI experts. Most statistical programmers are experts in SAS or R, not in prompt engineering. They shouldn’t need to learn how to structure a prompt to get value from AI. The platform gives them a simple UI: type your question, get your output. No fiddling with system prompts or temperature settings.
There’s a learning dimension too. These tools work as learning aids, not just productivity aids. A junior programmer can use the code explainer to understand what a senior programmer’s macro does. They can use the code converter to see how the same logic looks in R versus SAS. That kind of cross-language, cross-concept exposure speeds up professional growth in a way that textbooks don’t.
The architecture is also built to be easily upgradable. When a newer, cheaper model comes out, I can swap the backend, and the tools get better without any change for the user. That matters in a field where model capabilities are improving every few months.
When a clinical programmer starts leaning on AI-generated code in a regulated environment, what do they need to get right that most people aren’t talking about?
The biggest thing nobody talks about is that AI-generated code needs the same QC rigor as any other code. Just because a machine wrote it doesn’t mean it gets a pass. In fact, it might need more scrutiny, because the failure modes are different from human-written code.
When a programmer writes code manually, the mistakes tend to be predictable: typos, wrong variable names, off-by-one errors. When an LLM writes code, the mistakes can be structurally plausible but logically wrong. It might generate a derivation for a baseline flag that looks syntactically perfect but applies the windowing rule incorrectly because it’s blending conventions from different SDTM/ADaM IG versions. For example, the SDTM IG v3.2 uses XXBLFL variables to populate baseline flags; in the SDTM IG v3.4, that role shifts to XXLOBXFL, and in upcoming IGs XXBLFL will be removed entirely. That kind of error passes a cursory review. It fails in QC only if the QC programmer independently derives the same variable and catches the discrepancy.
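The version boundary just described can be sketched as a lookup. This is a simplified, hypothetical illustration that follows the versions as discussed above (real IG history has more nuance, and both flags can coexist during transition periods); the function name and version logic are assumptions, not a real library's API.

```python
def baseline_flag_variable(domain, sdtmig_version):
    """Return the baseline-flag variable name for a findings domain,
    following the simplified version boundary described in the text."""
    prefix = domain.upper()  # e.g. "LB" -> LBBLFL / LBLOBXFL
    major, minor = (int(p) for p in sdtmig_version.split(".")[:2])
    if (major, minor) >= (3, 4):
        return f"{prefix}LOBXFL"   # newer IG convention
    return f"{prefix}BLFL"         # older IG convention

print(baseline_flag_variable("LB", "3.2"))  # LBBLFL
print(baseline_flag_variable("LB", "3.4"))  # LBLOBXFL
```

A code generator that is not version-aware will happily blend the two conventions, which is exactly the structurally-plausible-but-wrong failure mode.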
So, the first thing to get right: treat AI-generated code exactly like you’d treat a macro from an open-source library. Validate it. Run it through your double programming process. Don’t skip QC because the code looks clean.
The second thing is the gap between what the AI knows and what the study actually requires. In a recent Phase 1 trial I was involved with, the statistician decided mid-study to update the SAP rules for handling outlier data. AI has no way to catch up to that kind of real-time decision-making. The AI doesn’t know about protocol deviations, CRF completion guidance given to sites, or data collection issues that emerged during database lock. People have to make those judgment calls.
The third thing, and this is what I’d like to see the industry work toward, is a tertiary QC layer. When AI is eventually allowed to read production data, we should implement it on top of the existing double programming. Imagine an AI system that can independently derive the best overall response, confirmed response, PFS, and OS from the raw tumor data and compare against both the primary and QC programmer output. That’s not replacing humans. That’s adding a safety net. In oncology especially, where efficacy endpoints from investigator reads versus central reads can diverge, having an AI validation layer could catch issues earlier.
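The tertiary-QC idea reduces to a three-way reconciliation, sketched below. The values are illustrative stand-ins for a per-subject derived endpoint (best overall response here), and the structure is an assumption about how such a safety net might be wired, not a description of an existing system: only unanimous agreement across primary, QC, and AI derivations passes without a flag.

```python
def reconcile(primary, qc, ai):
    """Return subjects whose derived value differs across any of the three."""
    flags = {}
    for subj in primary:
        values = {"primary": primary[subj], "qc": qc.get(subj), "ai": ai.get(subj)}
        if len(set(values.values())) > 1:
            flags[subj] = values
    return flags

# Per-subject best overall response from three independent derivations.
primary = {"001": "PR", "002": "SD", "003": "CR"}
qc      = {"001": "PR", "002": "SD", "003": "CR"}
ai      = {"001": "PR", "002": "PD", "003": "CR"}  # disagreement on 002

flags = reconcile(primary, qc, ai)
print(sorted(flags))  # ['002']
```

Double programming would pass subject 002 here, since primary and QC agree; the third derivation is what surfaces the discrepancy for investigation.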
And the industry needs SDKs and frameworks for regulated AI use, similar to what’s happening with R validation in the pharma space. R went from being viewed as non-compliant to having validated packages and risk-based frameworks for regulatory submissions. We need the same infrastructure for AI tools: standardized validation protocols, audit trail requirements, model versioning controls. Companies need a regulated environment where they can confidently deploy AI tools for data workflows, not just code-adjacent tasks. Until that infrastructure exists, every company is making it up as they go, and that’s not sustainable for a regulated industry.



