
Scroll any life sciences feed and you'll see the same headline wearing a different hat: "AI will revolutionize drug discovery."
Sometimes it will. But most teams don't lose years because they lack a moonshot. They lose years because brilliant scientists spend thousands of micro-moments stuck in the seams between tools, formats, and workflows, doing work that's necessary, repetitive, and oddly hard to automate well.
At Deep Origin, we've taken two approaches to AI development, one of which is deliberately down-to-earth: use AI where it earns its keep by removing specific, high-friction bottlenecks, especially the ones that feel mundane. Two examples:
- DO Patent extracts chemical structures from PDFs into usable digital representations (e.g., SMILES), so researchers don't spend their week re-drawing molecules before they can do any real science.
- Why does this matter for those not in the life sciences?
- Because before most drug discovery projects can even begin, chemists often have to curate their own data: pulling molecular structures out of papers, patents, and slide decks where they exist only as images. That curation step routinely means re-drawing molecules by hand just to make them usable for modeling or analysis.
- Balto, our conversational interface for molecular modeling and simulation, removes the "you must code / you must learn five interfaces" tax that blocks many scientists from running serious computation, even when the underlying models already exist.
- Why does this matter for those not in drug discovery?
- Modern drug discovery relies heavily on molecular simulation (docking, scoring, and property prediction), but these tools often require months to learn, comfort with scripting, and fluency across multiple disconnected interfaces.
We also build AI in places where quality is existential, like docking and virtual screening, because if you're going to automate a decision that drives lab spend, you'd better be right often enough to trust. We're working on a huge moonshot through an ARPA-H award to help replace animal testing with simulations. But what's surprised many teams we talk to is this: the fastest ROI frequently comes from the "small" problems.
This post is about why.
A little background: drug discovery is a hard place for AI (and why that matters)
From the outside, drug discovery can look like an obvious AI win.
It's data-rich.
It's high-cost.
It's full of optimization problems.
So why hasn't AI already "solved" it?
The short answer is that drug discovery is not one problem; it's a chain of fragile decisions, each made under uncertainty, where errors compound slowly and expensively.
A single drug program typically spans:
- target identification and biological validation
- hit discovery and optimization across chemistry, potency, selectivity, and safety
- years of preclinical testing
- multi-phase clinical trials
Timelines routinely stretch 10–15 years. Costs run into the billions. And even with all that effort, the vast majority of programs fail, often late, and often for reasons that were invisible early on.
For AI practitioners, three structural realities make this domain uniquely difficult:
- Ground truth arrives late (or never)
In many AI applications, feedback loops are fast. You can A/B test. You can retrain weekly. You know quickly whether a model worked.
In drug discovery, the "label" might arrive five years later, when a molecule fails in the clinic due to toxicity, poor exposure, or lack of efficacy in humans. By then, millions of dollars and thousands of decisions have already been made.
This makes naïve end-to-end optimization not just hard, but dangerous.
- Data is fragmented, biased, and context-heavy
Unlike consumer or enterprise domains, drug discovery data is:
- sparse in the regimes that matter most (novel targets, new chemistry)
- biased toward what worked well enough to publish
- scattered across papers, patents, internal reports, and PDFs
Much of the most valuable information isn't in clean tables; it's embedded in figures, captions, supplementary material, or institutional memory.
- Wrong answers are far more costly than slow ones
In many AI systems, a slightly wrong output is tolerable.
In drug discovery:
- a false positive can send teams down a dead-end for months
- a false negative can kill a molecule that might have helped patients
- a poorly understood model can erode trust fast
As a result, adoption hinges not just on accuracy, but on interpretability, traceability, and workflow fit.
This context matters because it explains why progress in AI-driven drug discovery rarely comes from a single sweeping model, and why so much value is locked up in what look like "small" problems.
Why small bottlenecks dominate outcomes
Because discovery pipelines are long and interdependent, friction anywhere propagates everywhere.
If:
- chemical structures are locked in PDFs,
- only a few specialists can run simulations,
- or models are powerful but inaccessible,
then even strong AI struggles to make an impact.
That's why many of the highest-ROI applications of AI in drug discovery aren't flashy breakthroughs; they're workflow accelerators:
- removing translation layers between humans and machines
- compressing setup time
- reducing manual rework
- making advanced tools usable by more people, more often
This is the context in which tools like DO Patent and Balto exist, and why they matter far more than their surface simplicity might suggest.
Drug discovery doesn't just have hard problems. It has hidden ones.
If you ask a medicinal chemist what slows a project down, you'll hear the big themes: target biology uncertainty, ADMET surprises, synthesis complexity, translational risk.
But sit next to a team for a week and you'll see a different class of bottleneck:
- chasing structures buried in PDF figures (patents, papers, slide decks)
- re-drawing molecules into ChemDraw
- copying structures between tools that don't quite agree
- manually constructing datasets from documents before modeling can even begin
- waiting on the one person who knows how to run a particular workflow
- "I could do that simulation... if I remembered the flags / environment / file formats"
These aren't glamorous problems, but they're measurable, common, and expensive.
Even the broader data world has a well-known pattern: a large fraction of time goes to preparation and wrangling rather than "the interesting part." An EMA data-analytics report cites a survey finding data scientists spend around 80% of their time preparing and managing data for analysis. While "data science" isn't identical to "drug discovery R&D," the shape of the problem is familiar to any computational group embedded in a lab: the pipeline is only as fast as its least-automated seam.
DO Patent: turning "hours of redrawing" into "minutes of extraction"
[Embedded interactive demo: DO Patent – From Document to SMILES in Minutes]
Patents and papers are full of chemistry, but much of it is effectively locked in image form. Extracting it is a known pain point in cheminformatics: the literature repeatedly describes chemical structure extraction and patent mining as difficult, time-consuming, and error-prone when done manually.
And this isn't a new pain. Chemists have spent enormous time simply preparing structures for communication. C&EN has described chemical structure preparation as a task someone would do "for up to four hours at a time" at a drafting table in the pre-digital era. The medium changed; the tax didn't disappear, it just moved into new workflows: screenshots, redraws, copy/paste, reformatting, error-checking. Then repeat.
Why it matters:
- Time cost: If a chemist spends even 2–5 hours/week re-drawing structures from documents, that's 100–250 hours/year per person.
- Error cost: One transposed bond or missed stereocenter can quietly poison downstream analysis.
- Opportunity cost: Those hours are taken from hypothesis generation, design, and interpretation, the work only humans can do.
The broader ecosystem is now large enough to measure the scale of the problem. For example, the PatCID paper describes 81M chemical-structure images and 14M unique structures extracted from patents, an illustration of how much chemically relevant information exists in image form. And benchmark work comparing optical chemical structure recognition (OCSR) tools emphasizes that patents are a high-stakes domain where precision and recall matter.
Our bet with DO Patent is simple: if you remove the redraw step, you collapse a recurring multi-hour workflow into minutes, while improving traceability back to the original figure (so users can validate, not just trust).
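To make "validate, not just trust" concrete, here is a minimal sketch of the kind of sanity check an extraction workflow implies. It is a plain RDKit illustration, not DO Patent's implementation, and the compound IDs and SMILES strings are hypothetical: every extracted string gets parsed and canonicalized, and anything that fails is flagged for review against the source figure.

```python
from rdkit import Chem

# Hypothetical output of an extraction step: compound IDs -> SMILES strings.
extracted = {
    "US1234567_fig3_cmpd12": "CC(=O)Oc1ccccc1C(=O)O",  # parses cleanly
    "US1234567_fig3_cmpd13": "C1=CC=CC=C1N(",          # deliberately malformed
}

for compound_id, smiles in extracted.items():
    mol = Chem.MolFromSmiles(smiles)  # returns None if the SMILES is invalid
    if mol is None:
        print(f"{compound_id}: failed to parse; review against the original figure")
    else:
        print(f"{compound_id}: canonical SMILES = {Chem.MolToSmiles(mol)}")
```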
A practical ROI lens (a template you can steal)
To keep ROI honest, we like "per scientist, per week" math, because that's where adoption lives.
Use a conservative baseline:
- Median chemist salary in the U.S. is reported around $115,000 (2024) in the ACS salary survey.
- Fully-loaded cost is commonly estimated as a multiplier on salary (benefits + overhead). Many HR/finance references cite ~1.25–1.4× as a typical range.
If a chemist costs, say, ~$115k × 1.3 ≈ $150k/year fully loaded, that's about $75/hour assuming ~2,000 working hours/year.
Now assume DO Patent saves:
- 2 hours/week per chemist (conservative for many teams doing competitive intel or heavy literature work)
That's:
- 2 × 52 = 104 hours/year
- 104 × $75 ≈ $7,800/year per chemist
At 10 chemists, you're at ~$78k/year in time value recovered, before you count reduced errors, faster cycles, or the compounding benefit of better-structured internal knowledge.
Is every recovered hour "cash saved"? Not directly. But it is throughput regained, and in drug discovery throughput is often the only resource you can't buy fast enough.
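If you want the template in a form you can actually steal, the same arithmetic fits in a few lines of Python; the inputs are just the assumptions above and should be swapped for your own numbers.

```python
# Back-of-the-envelope ROI from the figures above; replace with your own inputs.
salary = 115_000           # median chemist salary (USD/year)
loaded_multiplier = 1.3    # fully-loaded cost multiplier (typical ~1.25-1.4x)
working_hours = 2_000      # working hours per year
hours_saved_per_week = 2   # conservative redraw time recovered per chemist
team_size = 10

hourly_cost = salary * loaded_multiplier / working_hours   # ~ $75/hour
hours_saved_per_year = hours_saved_per_week * 52           # 104 hours
value_per_chemist = hours_saved_per_year * hourly_cost     # ~ $7,800/year

print(f"per chemist: ${value_per_chemist:,.0f}/year")
print(f"team of {team_size}: ${value_per_chemist * team_size:,.0f}/year")
```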
Balto: a conversational interface as an acceleration layer for modeling
[Embedded interactive demo: Balto AI]
A second unsexy bottleneck: access.
Most large organizations have world-class modeling tools, plus a steep gradient between "people who can operate them" and "people who need the results." In many groups, simulation becomes a service desk: request → queue → translation → run → interpretation → back-and-forth.
Balto is designed around a pragmatic idea: make advanced workflows accessible without forcing every user to become a toolchain expert.
This isn't "AI that discovers drugs by itself." It's AI that reduces:
- interface switching
- scripting friction
- hidden tribal knowledge ("the way we run this here")
- setup overhead (formats, parameters, environment)
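For a sense of what that friction looks like in practice, here is a generic sketch (plain RDKit, not Balto's internal workflow) of just the ligand-preparation step that typically precedes a docking run: even this "easy" part assumes the user knows the right toolkit calls, parameters, and output format.

```python
from rdkit import Chem
from rdkit.Chem import AllChem

# Turn a SMILES string into a 3D structure a docking tool can consume.
smiles = "CC(=O)Oc1ccccc1C(=O)O"              # aspirin, as a stand-in ligand
mol = Chem.AddHs(Chem.MolFromSmiles(smiles))  # add explicit hydrogens
AllChem.EmbedMolecule(mol, randomSeed=42)     # generate a 3D conformer
AllChem.MMFFOptimizeMolecule(mol)             # quick force-field cleanup

writer = Chem.SDWriter("ligand.sdf")          # write in a format downstream tools accept
writer.write(mol)
writer.close()
```

Multiply that by protein preparation, parameter selection, job submission, and result parsing, and the service-desk pattern described above is easy to understand.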
Why this approach works:
- Speed: the fastest way to run more modeling isn't always a faster GPU; it's fewer blocked humans.
- Consistency: translating intent into repeatable workflows reduces variance.
- Learning curve compression: junior scientists ramp faster; senior scientists spend less time teaching the same operational steps.
And it fits a broader industry truth: many of the biggest AI value pools in pharma are productivity-driven rather than "one model to rule them all." McKinsey has estimated generative AI could create $60B–$110B/year in economic value for pharma and medical products, largely via productivity gains across the value chain.
Balto is our "local" version of that thesis: win back time by lowering the activation energy of computation.
Where we don't stay down-to-earth: simulation quality
There's a reason many AI-for-drug-discovery announcements feel like moonshots: the scientific frontier is legitimately hard. Molecular simulation is a great example: if your scoring or pose prediction is wrong, the downstream cost is paid in synthesis, assays, and months of effort.
So yes: we invest in modeling accuracy where the cost of being wrong is enormous.
But we've learned something important: quality AI and unsexy AI aren't competitors. They reinforce each other.
- If docking is strong but workflows are inaccessible, you underutilize it.
- If workflows are accessible but results are unreliable, you waste lab cycles faster.
The product strategy is to build both:
- trustworthy core engines, and
- interfaces + extraction layers that remove friction around them.
Why this down-to-earth AI strategy wins (and scales)
1) It's easier to validate
A "redraw this molecule" workflow has obvious before/after metrics:
- time-to-SMILES
- error rate
- traceability
- throughput
The tighter the feedback loop, the faster you improve the product.
2) It compounds
Saving 90 minutes once is nice. Saving 30 minutes every week across a team becomes budget, headcount flexibility, and faster iteration.
3) It aligns with how science actually ships
Drug discovery is an ensemble sport. The bottleneck moves. Teams that win are the ones that continuously remove friction across the pipeline, not the ones that bet the company on a single "breakthrough model."
4) It's how AI earns trust in regulated, high-stakes environments
In pharma, credibility is built by tools that work reliably on Tuesday afternoon, not just in a benchmark plot.
A closing thought: the future is not just smarter models; it's less wasted expertise
Drug discovery will absolutely benefit from frontier AI. But in practice, the near-term advantage often goes to organizations that treat AI like an engineer treats latency: identify the hotspots, measure them, remove them, repeat.
At Deep Origin, that's the philosophy behind DO Patent and Balto:
- make the invisible bottlenecks visible
- automate the steps nobody brags about
- reserve human attention for the work only humans can do
If the last decade taught us anything, it's that "revolution" is usually the accumulation of many unglamorous wins, stacked until the whole system feels different.
And that's exactly the kind of AI we like building.
About Merrill Cook
Merrill Cook is director of marketing at Deep Origin. He draws on experience across a range of marketing disciplines, from paid, brand, and product marketing to content, as well as operations and web development. In the past he's worked for or consulted with 7 seed through Series B startups, often as the first marketing hire. He's also worked as a front-end developer and a data journalist, and likes to dabble in woodworking.



