Benchmarking LLM Agents on Multi-Modal Biomedical Machine Learning Tasks
¹Mohamed bin Zayed University of Artificial Intelligence, ²University of Chinese Academy of Sciences, ³Carnegie Mellon University
Abstract
Large language model (LLM) agents can now automate parts of machine-learning model building, but biomedical benchmarks still either emphasize question answering, reasoning, and tool use, or cover only narrow slices of biomedical ML coding. We introduce BioXArena, a biomedical machine learning (BioML) coding benchmark that evaluates whether agents can create task-specific model-building code for heterogeneous, often multi-modal biomedical datasets. It contains 76 end-to-end tasks across 9 domains: sequence, single-cell, structure, network biology, chemical biology, perturbation dynamics, phenotype-disease, imaging, and text-integrated tasks. Each task is curated from primary sources into a unified public capsule with hidden labels, held-out graders, and biology-aware metrics on a common 0-to-1 scale; agents must write runnable code, train models, and submit predictions for private test samples. BioXArena emphasizes realistic data interfaces: most tasks combine multiple input sources, and more than half are multi-modal, spanning tables, images, text, molecular sequences, omics matrices, and protein structures. We evaluate 11 agent configurations, including general coding LLMs, biomedical agents, and ML coding agents, in a shared 2-hour, single-GPU sandbox. MLEvolve with Gemini-3.1-Pro obtains the highest average score (0.666), followed by GPT-5.4 (0.636); no agent dominates across all domains. Beyond the main leaderboard, we conduct ablation studies, robustness checks, scaling analyses, cost analyses, and failure-mode analyses to characterize how backbones, scaffolds, budgets, and domains affect BioML coding performance. We will release all tasks, graders, runner scripts, leaderboard results, and agent traces.
Citation
@article{li2026bioxarena,
title={{BioXArena: Benchmarking LLM Agents on Multi-Modal Biomedical Machine Learning Tasks}},
author={Li, Loka and Zhang, Duzhen and Du, Xingbo and Song, Leonard and Wang, Zixiao and Aukenov, Assanali and Thomas, Noel and Sailaukan, Shakhnazar and Yang, Yonghan and Chen, Feilong and Dong, Jiahua and Zhang, Kun and Zhang, Bin and Song, Le},
journal={arXiv preprint arXiv:2605.15766},
year={2026},
url={https://arxiv.org/abs/2605.15766}
}