If you lead drug development at a pharmaceutical company, you are being told—from every direction—that AI agents are about to change how your science gets done. The agents can already reason about a biological problem, plan a modeling strategy, invoke the right scientific tools, interpret the results, and propose the next experiment. What seemed experimental a few years ago is becoming the way the work gets done.
So the question on your mind is not whether agentic AI is coming. It is this: can you stake a regulatory submission on it?
That is the question this piece is about, because it is the one that matters most in our industry. A drug program is a chain of decisions that stretches over years and absorbs hundreds of millions of dollars. Every link in that chain has to be defensible—to a regulator, to an internal quality team, to the next scientist who picks up the model two years from now and needs to understand exactly how a conclusion was reached. An AI agent that produces a brilliant insight you cannot reproduce, cannot audit, and cannot trace back to a validated computation becomes a liability in that world, however impressive the insight.
The questions that separate trustworthy agentic AI from the rest
If you are evaluating agentic AI for regulated science, four questions tell you most of what you need to know:
- What actually produces the result—a language model, or a validated scientific engine?
- Can the result be reproduced—exactly, every time, from the same inputs?
- Can every decision be traced and replayed—months or years later, by someone who wasn’t there?
- Does human expertise stay accountable—with the scientist directing the work, not deferring to it?
These are the questions we built Composer, our AI-native platform for model-informed drug development, to answer. At its heart is an AI co-scientist: an agent that reasons about the scientific problem and orchestrates our validated engines through natural language, while the scientist stays in control. Here is how.
Our thesis: the insight is only as trustworthy as the engine beneath it
The conviction at the center of how we build, and the thing our customers respond to most strongly, is this: an AI-generated insight is only as trustworthy as the validated engine that produced it.
A large language model can summarize a mechanism, propose a hypothesis, even sketch a model structure. What it cannot do—and should never be asked to do—is be the science. The numerical simulation, the parameter estimation, the mechanistic ODE solver that a drug decision rests on must come from an engine that has been validated against reference models and benchmark datasets, that produces the same output every time it is run, and that can be audited independently of any AI in the loop. A deterministic, validated engine is a world away from a chatbot dressed up as science.
So in the Composer ecosystem, the AI orchestrates the engine rather than replacing it. The co-scientist is designed to reason about the problem and decide what to do; the validated engine does the computing. We keep a clean line between the reasoning layer, which is probabilistic and fast-moving, and the computation layer, which is deterministic and qualifiable. The AI is intended to be an enhancement—a powerful one—but it is deliberately kept off the regulatory critical path. Every engine in Composer remains fully usable, and fully validatable, without any AI at all.
That architecture is what lets us answer the second and third questions directly.
Reproducibility. Run the same analysis with the same inputs and you get the same result—because the work is done by deterministic engines, not by a model that improvises. When a scientist is ready to move from open-ended exploration to a standardized process, the co-scientist can capture the entire interaction as a deterministic Workflow: an inspectable, versionable artifact that runs the same way every time, with no AI in the execution path. Exploration becomes production without losing rigor.
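To make the idea of a captured Workflow concrete, here is a minimal sketch in Python. It is an illustration under stated assumptions, not Composer's actual format or API: the step names, the JSON layout, and the `run_workflow` function are all hypothetical. The point it shows is that once the steps are written down as data, a plain interpreter can execute them with no model in the loop.

```python
import json


# Hypothetical deterministic steps; real engines would sit behind these names.
def scale_dose(state: dict, mg_per_kg: float) -> dict:
    """Compute a weight-based dose and add it to the state."""
    return {**state, "dose_mg": state["weight_kg"] * mg_per_kg}


def initial_concentration(state: dict, volume_l: float) -> dict:
    """Compute the initial concentration from dose and distribution volume."""
    return {**state, "c0_mg_per_l": state["dose_mg"] / volume_l}


# Only steps registered here can be executed by a workflow.
STEP_REGISTRY = {
    "scale_dose": scale_dose,
    "initial_concentration": initial_concentration,
}

# Hypothetical workflow artifact: plain data that can be inspected and versioned.
WORKFLOW_JSON = """
{
  "name": "example_dose_workflow",
  "version": "1",
  "steps": [
    {"step": "scale_dose", "args": {"mg_per_kg": 2.0}},
    {"step": "initial_concentration", "args": {"volume_l": 35.0}}
  ]
}
"""


def run_workflow(workflow_json: str, inputs: dict) -> dict:
    """Execute each recorded step in order; no AI is involved at run time."""
    workflow = json.loads(workflow_json)
    state = dict(inputs)
    for entry in workflow["steps"]:
        step = STEP_REGISTRY[entry["step"]]
        state = step(state, **entry["args"])
    return state


if __name__ == "__main__":
    print(run_workflow(WORKFLOW_JSON, {"weight_kg": 70.0}))
```

Because the artifact is just text, it can be diffed, reviewed, and pinned to a version like any other controlled document.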
Replay-ability. Every analysis leaves a complete, traceable record—which engine version ran, with which parameters, producing which result, leading to which decision. A colleague, a regulator, or the same scientist two years later can replay the reasoning end to end. The trail is the explanation. In a field where “right for the wrong reasons” is a genuine scientific failure mode, the ability to show your work is part of the science itself, well beyond any compliance checkbox.
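One way to picture such a record is the sketch below. Everything in it is an assumption made for illustration: the `toy_pk_engine` function stands in for a validated engine, and the `RunRecord` fields are a guess at the minimum a replay would need (engine, version, parameters, result, and a digest of the inputs). It is not how any of our products store their audit trail.

```python
import hashlib
import json
import math
from dataclasses import dataclass

ENGINE_NAME = "toy_pk_engine"  # hypothetical engine name
ENGINE_VERSION = "0.1.0"       # hypothetical version string


def toy_pk_engine(dose_mg: float, volume_l: float, k_el_per_h: float, t_h: float) -> float:
    """Deterministic stand-in engine: one-compartment IV bolus concentration."""
    return (dose_mg / volume_l) * math.exp(-k_el_per_h * t_h)


def input_digest(parameters: dict) -> str:
    """SHA-256 over canonical JSON of engine, version, and parameters."""
    payload = {"engine": ENGINE_NAME, "version": ENGINE_VERSION, "parameters": parameters}
    canonical = json.dumps(payload, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


@dataclass(frozen=True)
class RunRecord:
    engine: str
    engine_version: str
    parameters: dict
    result: float
    digest: str


def run_and_record(parameters: dict) -> RunRecord:
    """Run the engine and capture what is needed to replay the run later."""
    result = toy_pk_engine(**parameters)
    return RunRecord(ENGINE_NAME, ENGINE_VERSION, dict(parameters), result,
                     input_digest(parameters))


def replay(record: RunRecord) -> bool:
    """Re-run from the record alone; True only if inputs and result both match."""
    if record.engine != ENGINE_NAME or record.engine_version != ENGINE_VERSION:
        raise ValueError("recorded engine or version is not the one installed")
    if input_digest(record.parameters) != record.digest:
        return False
    return toy_pk_engine(**record.parameters) == record.result


if __name__ == "__main__":
    rec = run_and_record({"dose_mg": 100.0, "volume_l": 50.0, "k_el_per_h": 0.1, "t_h": 4.0})
    print(rec.digest[:12], rec.result, replay(rec))
```

The exact-equality check in `replay` is deliberate in this sketch: a deterministic engine at a pinned version should reproduce the recorded number, and any mismatch is worth investigating.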
For three decades, our engines—GastroPlus®, MonolixSuite™, ADMET Predictor®, and our mechanistic QSP platforms—have carried exactly this burden of proof inside regulatory filings around the world. They are the foundation the agents stand on.
Why the BioNeMo Agent Toolkit is foundational to this
A validated foundation answers the trust question. But the agents that orchestrate that foundation also need to be genuinely fluent in biology, not merely fluent in language. This is exactly where the NVIDIA BioNeMo Agent Toolkit becomes foundational to our strategy.
General-purpose reasoning gets an agent surprisingly far. But drug development runs on domain-specific understanding: the structure of a protein, the pharmacology of a target, the patterns buried in genomics and clinical data, the mechanistic logic of a disease. The BioNeMo Agent Toolkit—NVIDIA’s life-sciences agent stack, including Nemotron models, NemoClaw blueprints, and OpenShell secure runtime—provides domain-specific models, NVIDIA NIM microservices, libraries, and frameworks built for life sciences. It is what lets an agent act on biological data rather than merely describe it, and it is what bridges general-purpose AI reasoning with real-world biological computation.
This is the combination that matters: our validated engines seek to bring scientific truth that holds up under scrutiny; the BioNeMo Agent Toolkit brings biological breadth and acceleration. Neither is sufficient alone, and the scientist directs both.
How we are using the BioNeMo Agent Toolkit today—and where we are going
The collaboration is already underway across our ecosystem on three fronts.
Grounded in evidence. Our agents are incorporating NVIDIA Nemotron Parse for scientific literature parsing and extraction, pulling parameters, mechanistic relationships, and quantitative data out of the dense PDFs, tables, and figures that hold the field’s accumulated knowledge. Extraction quality matters because errors compound: a misread table becomes a bad parameter, which becomes a flawed model. By grounding our agents’ knowledge bases in high-fidelity extraction, with provenance back to the source, we let the co-scientist reason over what the literature actually says and let the scientist verify the evidence behind any recommendation. This is how an agent earns the right to propose a parameter rather than hallucinate one.
Accelerated by computation. Together with NVIDIA, we are collaborating on nvQSP to bring CUDA-optimized ODE solvers to our mechanistic modeling engines: GPU-accelerated QSP simulation that reduces the time to evaluate virtual populations and explore competing hypotheses. The science of a QSP or PK/PD model lives in systems of differential equations, and solving them is the computational bottleneck. GPU-accelerated solvers are meant to return the same answer the CPU would, to within solver tolerance, while changing what becomes possible: larger virtual populations, broader parameter-space exploration, and agentic search across many candidate models in the time it used to take to run one. Faster computation does not change the science; it expands the number of scientific questions a team can afford to ask.
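For readers who want to see the shape of the workload, the sketch below simulates a small virtual population in plain Python. It is a toy under loud assumptions, not nvQSP and not one of our engines: the model is a single equation, dC/dt = -k·C, the solver is a textbook fixed-step Runge-Kutta (RK4), and the population parameters are invented. What it does show is that each virtual subject is an independent ODE solve, which is the property that makes this kind of work a natural fit for parallel hardware.

```python
import math
import random


def rk4_decay(c0: float, k: float, t_end: float, steps: int) -> float:
    """Integrate dC/dt = -k*C from t=0 to t_end with classical fixed-step RK4."""
    h = t_end / steps
    c = c0
    for _ in range(steps):
        k1 = -k * c
        k2 = -k * (c + 0.5 * h * k1)
        k3 = -k * (c + 0.5 * h * k2)
        k4 = -k * (c + h * k3)
        c += (h / 6.0) * (k1 + 2.0 * k2 + 2.0 * k3 + k4)
    return c


def simulate_population(n_subjects: int, seed: int, t_end: float = 12.0) -> list[float]:
    """Solve one ODE per virtual subject; a fixed seed makes the run repeatable."""
    rng = random.Random(seed)
    results = []
    for _ in range(n_subjects):
        # Hypothetical between-subject variability on the elimination rate.
        k = rng.lognormvariate(math.log(0.1), 0.3)
        results.append(rk4_decay(c0=2.0, k=k, t_end=t_end, steps=1000))
    return results


if __name__ == "__main__":
    population = simulate_population(n_subjects=100, seed=42)
    print(min(population), max(population))
```

Since no subject depends on another, the loop in `simulate_population` could in principle be distributed across many cores without changing the equations being solved.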
Expanded by specialized biological models. Our vision goes further, and we are building toward it deliberately. We intend to host NVIDIA’s frontier biological models directly within the Composer ecosystem and to power a new generation of Composer Agents on BioNeMo Agent Toolkit infrastructure—new classes of biological reasoning available to the co-scientist as first-class, orchestratable tools, sitting right alongside our validated simulation engines. Every one of those new capabilities inherits the same discipline as the rest of the platform: scoped, version-traceable, and validated for what it is claimed to do, with the AI layer kept off the critical path. Biological frontier models and decades-validated engines, in one ecosystem, under one standard of trust.
What this means for scientists and for discovery
Strip away the architecture and the partnership mechanics, and here is what changes for the person doing the work, which brings us to the fourth question, the one about human accountability.
Agentic AI removes friction from the process, not the scientist from the science. Hypotheses get generated and tested faster, because the distance between “I have an idea” and “I have a result” collapses from weeks to hours. Experimental scientists who were never going to become expert modelers can now direct sophisticated modeling through natural language: the expertise is in the platform, available to them. And the collaboration between human scientist and AI agent becomes a real partnership rather than a novelty, because every recommendation remains inspectable, reproducible, and subject to human review. Scientific accountability stays exactly where it belongs: with the experts directing the work.
That is the future we are building toward: agentic science that moves at the speed of AI and holds up to the scrutiny of a regulator, in the same breath.
Agentic drug development needs a foundation scientists can trust. We have spent thirty years building the validated engines that scientific truth rests on; the NVIDIA BioNeMo Agent Toolkit extends that foundation’s reach into new domains of biology; and the scientist stays firmly at the controls of both. That is the partnership, and the standard, we are bringing to BIO 2026.
By Erik Guffrey, Co-Chief Product & Technology Officer