Artificial intelligence now plays a central role in drug discovery. In antibody research especially, models support sequence design, paratope prediction, affinity maturation, and developability screening. Yet as these systems move from benchmark tasks to real discovery programs, a familiar pattern is emerging: models often perform well within known training space, then weaken when asked to generalize to novel targets, novel binders, or unusual interaction geometries. One reason may be that the field has focused heavily on algorithms while underestimating the importance of the underlying biological data.

This is becoming harder to ignore. Across biotech, conversations about AI still tend to center on architectures, compute, and model optimization. Those factors matter, but they do not remove the ceiling imposed by limited training data. In that sense, AI drug discovery is beginning to look less like a pure modeling contest and more like a data infrastructure race.

The bottleneck is not compute, it’s relevant data

Antibody discovery makes the issue especially clear. Antibody sequence space is enormous and highly diverse, yet the number of publicly available antibody-antigen co-structures remains relatively small, as reflected in resources such as the RCSB Protein Data Bank and the AlphaFold Protein Structure Database. That means many AI systems are trained on a narrow, highly canonical slice of interaction space and then expected to generalize across a much broader biological reality. When they encounter the cases that matter most in drug discovery, such as new CDR loop geometries, unfamiliar epitopes, rare scaffold architectures, or unusual binding modes, performance can degrade. The issue is not necessarily that the models are flawed; it is that even the best models have not seen enough relevant structural diversity.

This reframes a basic question for the field. If multiple groups are training on the same public datasets with the same limitations, there may be a limit to how different their results can be. At that point, the central competitive question shifts: it is no longer which organization has the best model, but which organization can generate the richest and most useful training environment.

Why public structural datasets are no longer enough

Public structural repositories remain indispensable. They made today’s generation of protein and antibody AI possible. But they were not built with large-scale machine learning as the primary use case.

Most legacy structures were generated to answer specific biological questions, not to create balanced, high-volume training corpora. As a result, coverage is uneven. Certain targets and protein classes are well represented, while others remain sparse. For example, antibody-antigen examples are relatively limited, dynamic interactions are underrepresented, and, in many cases, the structures available reflect stabilized or engineered states rather than the kinds of solution-state interactions a discovery team may want a model to learn from. This gap is especially relevant given the continued progress, but also the remaining limitations, in structure prediction and antibody modeling, as discussed in work on AlphaFold2 structures guiding ligand discovery, IgFold, antibody-antigen prediction in real-life applications, and SAAINT-DB.

That distinction matters because AI systems do not learn only from the quantity of examples; they also learn from the character of the examples they see. If the training universe is narrow, the resulting model may be elegant but brittle. For real-world drug discovery, robustness on unfamiliar targets matters more than benchmark performance on familiar data.
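One way to check whether a model is robust rather than merely familiar with its benchmark is to hold out entire targets, so that nothing in the test set shares an antigen with the training set. The sketch below is a minimal illustration under stated assumptions: the record layout and the `antigen_id` field are hypothetical, and hashing the group label is just one simple way to keep each antigen's assignment stable as new data arrives.

```python
import hashlib


def split_by_group(records, group_key, test_fraction=0.2):
    """Split records so that every group lands wholly in train or wholly in test.

    records: list of dicts (hypothetical layout, one dict per antibody-antigen pair).
    group_key: field naming the group to hold out together, e.g. an antigen ID.
    """
    train, test = [], []
    for rec in records:
        # Hash the group label to a stable number in [0, 1). The same label
        # always maps to the same side of the split, across runs and machines.
        digest = hashlib.sha256(str(rec[group_key]).encode("utf-8")).digest()
        bucket = int.from_bytes(digest[:8], "big") / 2**64
        (test if bucket < test_fraction else train).append(rec)
    return train, test
```

Because assignment happens per group rather than per record, the realized test share only approximates `test_fraction` when there are few groups. Comparing a model's score on this split with its score on a random per-record split gives a rough view of how much performance depends on having seen the target before.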

Structural biology was not built for industrial throughput

Traditional structural methods remain foundational. X-ray crystallography and cryo-electron microscopy deliver extraordinary biological insight and will continue to be essential for the tasks to which they are suited. But they largely capture static snapshots rather than the dynamic nature of protein-protein interactions, and they are not optimized to generate tens of thousands of training examples on demand. Even where cryo-EM has become powerful for biologic discovery, as shown in work on functional and epitope-specific monoclonal antibody discovery via cryo-EM, that is different from being naturally configured for sequence inference or industrial-scale dataset generation.

Conventional structural biology methods are inherently resource-intensive and iterative. They often require protein engineering, purification, stabilization, and repeated optimization. They are excellent for going deep on individual biological questions, but are much less suited to the industrial task of generating broad, diverse structural datasets fast enough to support modern AI development cycles.

That distinction, depth versus breadth, is becoming more important. In an AI context, the limiting factor is often not whether a team can solve one structure at high resolution, but whether it can generate enough varied, usable interaction data to improve generalization across many discovery settings. Recent thinking in the field increasingly reflects this shift toward breadth in sequence and interaction space, rather than relying on a relatively small number of carefully solved examples.

Data infrastructure is becoming part of the product

In software, data infrastructure usually refers to the systems that ingest, standardize, store, and activate information at scale. In drug discovery, the meaning is expanding to include the experimental and computational systems required to generate biological data in formats that machine learning workflows can readily use. That means standardized workflows, reproducible outputs, high-throughput automation, strong quality control, and a reliable path from empirical measurement to model-ready structural representations. It also means designing data generation around what AI models actually need, rather than treating machine learning as a downstream consumer of whatever data biology happens to produce.

This is where the infrastructure question becomes concrete. Some experimental methods can generate rich structural interaction data at high throughput, but do not naturally output the exact object type many existing models expect. A workflow might measure solvent accessibility changes or epitope-level interaction information, while the model is built to ingest PDB-format structures. Bridging that gap is a core infrastructure problem. The value lies not only in generating new data, but in converting those measurements into representations that can plug directly into existing modeling pipelines.
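As a rough illustration of that bridging step, the sketch below writes per-residue measurements into the B-factor column of PDB `ATOM` and `HETATM` records, a convention that structure tools already use to carry per-residue values (AlphaFold models, for example, store their confidence score there). It is a minimal sketch under stated assumptions, not a description of any particular pipeline: the function name, the `scores` mapping keyed by chain and residue number, and the idea of a single score per residue are all hypothetical, and the input is assumed to be well-formed fixed-column PDB text.

```python
def annotate_bfactors(pdb_lines, scores, default=0.0):
    """Return PDB lines with per-residue scores written into the B-factor field.

    pdb_lines: iterable of PDB text lines (assumed well-formed, fixed-column).
    scores: hypothetical mapping of (chain_id, residue_number) -> float.
    default: value written for residues with no measurement.
    """
    out = []
    for raw in pdb_lines:
        line = raw.rstrip("\n")
        if line[:6] in ("ATOM  ", "HETATM"):
            chain = line[21]              # column 22: chain identifier
            resseq = int(line[22:26])     # columns 23-26: residue number
            value = scores.get((chain, resseq), default)
            if not -99.99 <= value <= 999.99:
                raise ValueError(f"score {value} does not fit the 6-character field")
            padded = line.ljust(66)       # make sure columns 61-66 exist
            line = padded[:60] + f"{value:6.2f}" + padded[66:]
        out.append(line)
    return out
```

Calling `annotate_bfactors(lines, {("H", 52): 0.85})` would leave coordinates untouched and overwrite only columns 61 to 66 for residue 52 of chain H. Reusing an existing field keeps the output readable by tools that expect PDB input, at the cost of discarding the original B-factors; a production workflow might prefer mmCIF or a sidecar file instead.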

Some organizations are already building around this shift. Rather than treating structural biology as a downstream validation step, they are investing in systems that generate interaction data at AI-relevant scale, under standardized conditions, and with enough consistency to support model training, fine-tuning, and validation. In practice, that means pairing high-throughput experimental workflows with computational layers that can translate empirical interaction measurements into model-ready structural representations.

Why integrated systems may matter more than standalone models

This is why the next durable edge in AI drug discovery may come from organizations that integrate biology, automation, data engineering, and modeling into a single system.

A model trained on static public data may eventually converge with competitors facing the same inputs. A group that can continuously generate proprietary, experimentally anchored interaction datasets has a different kind of advantage. It can train, fine-tune, and validate on a growing base of new information. Over time, that may prove more meaningful than another incremental improvement in architecture alone.

This does not mean model innovation stops mattering. It means model innovation becomes only one layer in a broader stack. The training data, the workflows that produce it, and the systems that make it AI-ready all start to function as part of the competitive moat.

Biology-first AI needs biology-first datasets

There is also a broader lesson here. Drug discovery data should not be treated as interchangeable. Biological measurements differ in context, relevance, and usefulness for modeling. Data generated under native or solution-state conditions may teach different lessons than data derived from highly stabilized or artificial systems. Data generated to answer one narrow scientific question may be less useful for generalizable machine learning than data generated intentionally to capture breadth and diversity.

That suggests a shift in mindset. Instead of asking only how AI can accelerate existing discovery workflows, the field may need to ask how experimental systems themselves should evolve when the end goal is model performance. Some of the most important AI investments may happen before training begins, in assay design, automation, data standardization, and the generation of better biological starting material.

The race is already underway

Drug discovery will not become a pure data infrastructure business. Biology, chemistry, translational judgment, and clinical execution will still determine who succeeds. But within AI-enabled discovery, infrastructure is moving closer to the center of competition.

Organizations that can generate large, diverse, experimentally grounded structural datasets, and make them usable across modern modeling workflows, are likely to have an increasing advantage. The next phase of progress may come less from asking which model performs best on a fixed benchmark, and more from asking which teams can build the systems that continuously expand the training universe itself.

If that is right, then AI drug discovery is no longer only a modeling race. It is becoming a race to build the biological data infrastructure that modeling depends on.

Author

AIJ Thought Leader

AIJ Thought Leader 23 May 2026

Is AI Drug Discovery Becoming a Data Infrastructure Race?

By Dan Benjamin, PhD, Co-Founder and Chief Technology Officer of Immuto Scientific
