LLMs Fail Biomedical Code Test: New Agent Hits 74% Accuracy

by Roman Grant

A Nature Biomedical Engineering benchmark scores top LLMs below 40% accuracy on 293 biomedical coding tasks, but a new iterative AI agent reaches 74% by drafting and refining an analysis plan first. A collaborative platform lets researchers complete over 80% of the analysis code for real studies.


Biomedical researchers hoping to lean on large language models for data analysis face a stark reality: these tools falter badly on real-world coding tasks. A new benchmark from University of Illinois researchers reveals that even top proprietary and open-source LLMs score below 40% accuracy when generating code for biomedical data science, raising alarms about blindly trusting AI outputs in high-stakes research.

The study, published January 22, 2026, in Nature Biomedical Engineering, introduces BioDSBench, a rigorous test set of 293 coding tasks pulled from 39 peer-reviewed studies spanning seven areas: biomarkers, integrative analysis, genomic profiling, molecular characterization, therapeutic response, translational research, and pan-cancer analysis. Tasks demand everything from plotting survival curves to integrative multi-omics visualizations, using real anonymized patient data from cBioPortal and UCSC Xena.
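To make the task style concrete, here is a minimal sketch of one kind of analysis the benchmark targets: computing a Kaplan-Meier survival curve from follow-up times and event flags. The data and function below are invented for illustration; real BioDSBench tasks draw anonymized patient data from cBioPortal or UCSC Xena.

```python
def kaplan_meier(times, events):
    """Return (time, survival_probability) points for the KM estimator.

    times  -- follow-up time for each patient
    events -- 1 if the event (e.g. death) was observed, 0 if censored
    """
    # Walk patients in order of follow-up time.
    order = sorted(range(len(times)), key=lambda i: times[i])
    at_risk = len(times)
    survival = 1.0
    curve = [(0.0, 1.0)]
    i = 0
    while i < len(order):
        t = times[order[i]]
        deaths = 0
        n = 0
        # Group all patients sharing this event time.
        while i < len(order) and times[order[i]] == t:
            deaths += events[order[i]]
            n += 1
            i += 1
        if deaths:
            # Step the survival estimate down at each observed event.
            survival *= 1 - deaths / at_risk
            curve.append((t, survival))
        at_risk -= n
    return curve

# Toy cohort: three observed deaths, two censored observations.
points = kaplan_meier([5, 8, 12, 12, 20], [1, 1, 0, 1, 0])
# → [(0.0, 1.0), (5, 0.8), (8, 0.6), (12, 0.4)]
```

A benchmark-grade solution would also render the curve, but the estimator itself is the part an LLM most often gets wrong, since censored patients must leave the risk set without stepping the curve down.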

“Large language models (LLMs) can generate impressive data visualizations from simple requests, yet their accuracy remains underexplored,” the authors write, led by Zifeng Wang and Benjamin Danek of Keiji AI and the University of Illinois Urbana-Champaign, with corresponding author Jimeng Sun.


Benchmark Exposes Cracks in AI Foundations

Eight proprietary models, including GPT-4o, Claude 3.5 Sonnet, Gemini 1.5, and OpenAI o3-mini, and eight open-source ones such as Llama 3, Code Llama, Qwen2.5-coder, and Deepseek-R1 were pitted against BioDSBench under chain-of-thought prompting and retrieval-augmented generation. None broke 40% overall accuracy. “This low accuracy raises serious concerns about the risk of propagating incorrect scientific findings when blindly relying on AI-generated analyses,” the paper warns.

Proprietary models edged out open-source counterparts slightly, but both struggled with biomedical nuances like handling genomic datasets or therapeutic response metrics. The benchmark, hosted on Hugging Face at https://huggingface.co/datasets/zifeng-ai/BioDSBench, includes reference solutions and test cases for reproducibility.
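A pass/fail harness for reference solutions and test cases can be sketched in a few lines. This is an illustrative reconstruction, not the benchmark's actual API; the function names and toy tasks below are invented, and a production harness would also sandbox and time-limit the untrusted code.

```python
def score_task(generated_code: str, test_code: str) -> bool:
    """Run model-generated code, then the reference assertions.

    The task counts as passed only if both execute without error.
    """
    namespace = {}
    try:
        exec(generated_code, namespace)   # model-written analysis code
        exec(test_code, namespace)        # reference test case
        return True
    except Exception:
        return False

def accuracy(results):
    """Fraction of tasks passed, as reported per model."""
    return sum(results) / len(results)

# Toy example: one correct solution, one subtly buggy one.
good = score_task("median = sorted([3, 1, 2])[1]", "assert median == 2")
bad = score_task("median = [3, 1, 2][1]", "assert median == 2")
# accuracy([good, bad]) → 0.5
```

Scoring by executing test cases, rather than string-matching code, is what lets the benchmark credit any correct implementation, and it is also why reference solutions ship alongside the tasks.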

This is not an isolated finding: an npj Digital Medicine paper on medical LLMs notes persistent gaps in specialized knowledge, while a Scientific Reports study highlights scalability issues in oncology tasks.

Iterative Agents Rescue Reliability

To fix this, the team built an AI agent that drafts and iteratively refines an analysis plan before coding, drawing on ReAct reasoning-acting synergy and self-refine feedback loops. This boosted accuracy to 74%, nearly doubling baseline performance. The agent breaks complex tasks into steps: plan, code, test, refine.
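The plan-code-test-refine loop can be sketched as follows. This is a hedged reconstruction of the control flow the paper describes, not the authors' implementation: `llm` stands in for any chat-completion call, and the prompts, parameter names, and round limit are illustrative.

```python
def agent(task: str, llm, run_tests, max_rounds: int = 3):
    """Plan first, then iteratively code, test, and refine.

    llm       -- callable: prompt string -> response string (assumed)
    run_tests -- callable: code string -> (passed: bool, feedback: str)
    """
    # 1. Plan: draft an analysis plan before writing any code.
    plan = llm(f"Draft a step-by-step analysis plan for: {task}")
    # 2. Code: implement the plan.
    code = llm(f"Write Python code implementing this plan:\n{plan}")
    for _ in range(max_rounds):
        # 3. Test: execute the code against the task's checks.
        passed, feedback = run_tests(code)
        if passed:
            return code
        # 4. Refine: feed the failure back and retry (self-refine).
        code = llm(f"The code failed with:\n{feedback}\nFix it:\n{code}")
    return code  # best effort after max_rounds
```

Separating planning from coding is the key move: the model commits to an analysis strategy before touching syntax, and the test feedback gives the refinement step something concrete to act on, in the spirit of ReAct and Self-Refine.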

Code for the agent lives on GitHub at https://github.com/RyanWangZf/BioDSBench. In a user study, five medical researchers used a new human-AI platform to co-develop plans and execute them, finishing over 80% of the analysis code for three real studies. The platform, accessible via https://keiji.ai/contact.html, integrates planning, coding, and execution in one environment; see the demo at https://www.youtube.com/watch?v=c5ZJsFXQ_B0.

“Benchmarking eight proprietary and eight open-source LLMs under various prompting strategies reveals an overall accuracy below 40%,” per the Nature paper. Figures in the study detail model comparisons (Fig. 2) and adaptation strategies (Fig. 3).

Real-World Ripples and Broader Warnings

X posts from researchers such as Stephen Turner echo the findings, sharing the paper as essential reading for data scientists. Broader critiques abound: npj Digital Medicine calls out shaky foundations for electronic health records, and npj Precision Oncology scrutinizes oncology applications.

The low scores underscore why biomedical work can’t afford hallucinations—errors in genomic profiling or pan-cancer analysis could mislead drug development. Yet the agentic fix points forward: structured planning tames LLM chaos. Ziwei Yang of Kyoto University and Zheng Chen of Osaka University aided dataset curation.

Funding came from NSF grants and JSPS, with no competing interests declared. Peer-reviewed by Chao Yan and others.

Path Forward for Trustworthy AI Tools

The platform shifts the paradigm from solo LLM reliance to collaborative copilots. Researchers can iteratively refine plans with AI and execute them in an integrated environment, slashing manual coding time. User study results (Fig. 5) show practical gains, with over 80% task completion.

While limited to 293 tasks and five users, the work generalizes: datasets drawn from cBioPortal keep the tasks grounded in real studies. As npj Artificial Intelligence notes of LLMs in science, deep integration with human goals, backed by clear metrics, is key.

Corresponding author Jimeng Sun is credited with supervision in the acknowledgments. For practitioners, BioDSBench sets a new standard: test your copilot before trusting it.

Roman Grant

Roman Grant is a journalist who focuses on AI deployment. They work through comparative reviews and hands‑on testing to make complex topics approachable. They often cover how organizations respond to change, from process redesign to technology adoption. They are known for dissecting tools and strategies that improve execution without adding complexity. They maintain a balanced tone, separating speculation from evidence. They value transparent sourcing and prefer primary data when it is available. They look for overlooked details that differentiate sustainable success from short‑term wins. They also highlight cultural factors that determine whether change sticks. They explore how policies, markets, and infrastructure intersect to create second‑order effects. Their coverage includes guidance for teams under resource or time constraints. They frequently compare approaches across industries to surface patterns that travel well. A recurring theme in their writing is how teams build repeatable systems and measure impact over time. They watch the policy landscape closely when it affects product strategy. Their work aims to be useful first, timely second.
