Benchmarking LLM Agents on Multi-Modal Biomedical Machine Learning Tasks

Loka Li¹, Duzhen Zhang¹, Xingbo Du¹, Leonard Song¹, Zixiao Wang¹, Assanali Aukenov¹, Noel Thomas¹, Shakhnazar Sailaukan¹, Yonghan Yang¹, Feilong Chen², Jiahua Dong¹, Kun Zhang¹,³, Bin Zhang¹, Le Song¹

¹Mohamed bin Zayed University of Artificial Intelligence ²University of Chinese Academy of Sciences ³Carnegie Mellon University


Abstract

Large language model (LLM) agents can now automate parts of machine-learning model building, but biomedical benchmarks still either emphasize question answering, reasoning, and tool use, or cover only narrow slices of biomedical ML coding. We introduce BioXArena, a biomedical machine learning (BioML) coding benchmark that evaluates whether agents can create task-specific model-building code for heterogeneous, often multi-modal biomedical datasets. It contains 76 end-to-end tasks across 9 domains: sequence, single-cell, structure, network biology, chemical biology, perturbation dynamics, phenotype-disease, imaging, and text-integrated tasks. Each task is curated from primary sources into a unified public capsule with hidden labels, held-out graders, and biology-aware metrics on a common 0-to-1 scale; agents must write runnable code, train models, and submit predictions for private test samples. BioXArena emphasizes realistic data interfaces: most tasks combine multiple input sources, and more than half are multi-modal, spanning tables, images, text, molecular sequences, omics matrices, and protein structures. We evaluate 11 agent configurations, including general coding LLMs, biomedical agents, and ML coding agents, in a shared 2-hour, single-GPU sandbox. MLEvolve with Gemini-3.1-Pro obtains the highest average score of 0.666, followed by GPT-5.4 with an average score of 0.636; no agent dominates across all domains. Beyond the main leaderboard, we conduct extensive ablation studies, robustness checks, scaling analyses, cost analyses, and failure-mode analyses to characterize how backbones, scaffolds, budgets, and domains affect BioML coding performance. We will release all tasks, graders, runner scripts, leaderboard results, and agent traces.

Figures

**Figure 1: Overview of BioXArena.** (a) Tasks are curated from journals, conferences, and public databases by ML and biology experts, then packaged as unified public task capsules with hidden private labels and graders. (b) The resulting benchmark contains 76 tasks across 9 biomedical ML domains. (c) The evaluation covers 11 agents, grouped into closed-source general LLMs, open-source general LLMs, biomedical agents, and ML coding agents. (d) All agents run under the same 2-hour, single-GPU sandbox and submit a `submission.csv` to held-out task-specific graders. (e) Nine evaluation metrics feed the leaderboard, domain heatmaps, failure taxonomy, and cost analysis.

**Figure 2: Main-experiment scores and failure profile.** Panel (a) averages the normalized score over successfully evaluated tasks only, by domain and overall. Panel (b) averages over all 76 tasks, assigning each failed run a score of 0 as a penalty. Panel (c) splits each agent's 76 runs into successful **OK** runs, meaning submissions that pass the task-specific evaluator and receive a valid score, and failure categories.

**Figure 3: Fixed LLM backbone ablation study over different agent scaffolds.** The layout follows Figure 2, but every agent uses DeepSeek-V3.2 as its backbone. Panel (a) averages over successfully evaluated tasks only, panel (b) averages over all 76 tasks with each failed task scored 0 as a penalty, and panel (c) shows success/failure categories.
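
The two averaging schemes used in panels (a) and (b) of Figures 2 and 3 can be illustrated with a minimal Python sketch. It is not the benchmark's released evaluation code: the function name `aggregate_scores`, the task identifiers, and the convention of marking a failed run with `None` are all assumptions made for illustration.

```python
from statistics import mean
from typing import Optional


def aggregate_scores(scores: dict[str, Optional[float]]) -> dict[str, float]:
    """Summarize one agent's runs (hypothetical helper, not the official grader).

    `scores` maps a task id to its normalized score in [0, 1],
    or to None when the run failed and produced no valid score.
    """
    ok_scores = [s for s in scores.values() if s is not None]
    n_total = len(scores)

    # Panel (a) style: average only over successfully evaluated tasks.
    success_only = mean(ok_scores) if ok_scores else 0.0

    # Panel (b) style: average over all tasks, counting each failure as 0.
    penalized = sum(ok_scores) / n_total if n_total else 0.0

    # Fraction of runs that received a valid score.
    ok_rate = len(ok_scores) / n_total if n_total else 0.0

    return {
        "success_only": success_only,
        "penalized": penalized,
        "ok_rate": ok_rate,
    }


# Toy example with four made-up tasks, two of which failed.
runs = {"task_a": 0.8, "task_b": None, "task_c": 0.6, "task_d": None}
result = aggregate_scores(runs)
print(result)
```

In this toy example the success-only mean is about 0.70 while the penalized mean is about 0.35, which shows why an agent that fails often can look strong under the first scheme and weak under the second.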

Citation

@article{li2026bioxarena,
  title={{BioXArena: Benchmarking LLM Agents on Multi-Modal Biomedical Machine Learning Tasks}},
  author={Li, Loka and Zhang, Duzhen and Du, Xingbo and Song, Leonard and Wang, Zixiao and Aukenov, Assanali and Thomas, Noel and Sailaukan, Shakhnazar and Yang, Yonghan and Chen, Feilong and Dong, Jiahua and Zhang, Kun and Zhang, Bin and Song, Le},
  journal={arXiv preprint arXiv:2605.15766},
  year={2026},
  url={https://arxiv.org/abs/2605.15766}
}