A new study shows how ERA combines large language models with tree search to rapidly build expert-level research software, outperforming leading benchmarks in tasks from single-cell genomics to COVID-19 hospitalization forecasting.
Study: An AI system to help scientists write expert-level empirical software.
A recent study published in the journal Nature introduces Empirical Research Assistance (ERA), an artificial intelligence (AI) system that combines a large language model (LLM) with a tree search (TS) algorithm to automatically design and improve scientific software, potentially easing the slow, expertise-dependent work of manual software development. The optimized system can generate expert-level solutions across various fields. On some scorable scientific tasks, it outperformed human-developed and benchmark models, including the official CovidHub Ensemble used for coronavirus disease 2019 (COVID-19) hospitalization forecasting.
AI Scientific Software Background
Empirical software is crucial across many areas of scientific research because it allows scientists to model complex systems and diseases, from fluid and atmospheric dynamics to social and biological processes. Developing these software systems, however, is a slow, labor-intensive process that depends heavily on expert knowledge. Automating it could accelerate innovation and improve research efficiency.
ERA Tree Search Study Design
In the present study, researchers developed ERA to automatically generate and refine scientific software by optimizing quality scores. They framed the creation of scientific software as a “scorable task”: each candidate program was evaluated by how well its outputs performed on predefined metrics.
The system generates multiple software candidates, then rewrites and improves them in a feedback loop guided by performance signals from the scoring function. As an advancement over template-based genetic programming (GP), ERA uses an LLM as a flexible engine to generate code by integrating domain knowledge from multiple possible solutions. Unlike systems that generate code from scratch, ERA can modify existing software candidates. ERA is also more versatile than AutoML, as it can rewrite almost any software, from preparing and organizing data to running complex simulations and solving advanced mathematical problems.
The TS algorithm prioritizes promising candidates while still exploring alternative implementations systematically. Researchers can inject insights from research papers, textbooks, and search engine results into the LLM prompts, enabling knowledge-guided code evolution. To combine ideas from different methods, the researchers also generated ‘recombinations’ of method pairs based on code summaries, then ran ERA with prompts for these recombinations to improve model solutions.
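The score-guided loop described above can be illustrated with a minimal sketch. It is not the study's implementation: the "program" here is just a list of numbers, the hypothetical `toy_rewrite` function stands in for an LLM rewriting code, and the epsilon-greedy selection rule is an assumption chosen for brevity rather than the search policy used by ERA.

```python
import random
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class Node:
    program: List[float]             # stand-in for a candidate program's source
    score: float                     # higher is better
    parent: Optional["Node"] = None  # link back to the candidate it was rewritten from


def tree_search(
    seed: List[float],
    score_fn: Callable[[List[float]], float],
    rewrite_fn: Callable[[List[float], random.Random], List[float]],
    iterations: int = 300,
    explore: float = 0.2,
    rng_seed: int = 0,
) -> Node:
    """Grow a tree of candidates, expanding the best node most of the time."""
    rng = random.Random(rng_seed)
    nodes = [Node(seed, score_fn(seed))]
    for _ in range(iterations):
        if rng.random() < explore:
            parent = rng.choice(nodes)                     # explore: any node
        else:
            parent = max(nodes, key=lambda n: n.score)     # exploit: best so far
        child_program = rewrite_fn(parent.program, rng)    # an LLM call in a real system
        nodes.append(Node(child_program, score_fn(child_program), parent))
    return max(nodes, key=lambda n: n.score)


# Toy scorable task (hypothetical): recover hidden coefficients.
TARGET = [1.0, -2.0, 0.5]


def toy_score(program: List[float]) -> float:
    """Negative squared error, so that larger scores are better."""
    return -sum((p - t) ** 2 for p, t in zip(program, TARGET))


def toy_rewrite(program: List[float], rng: random.Random) -> List[float]:
    """Perturb one coefficient; a placeholder for an LLM-proposed code edit."""
    child = list(program)
    i = rng.randrange(len(child))
    child[i] += rng.gauss(0.0, 0.3)
    return child


if __name__ == "__main__":
    best = tree_search([0.0, 0.0, 0.0], toy_score, toy_rewrite)
    print(best.program, best.score)
```

Because every candidate stays in the tree, a weaker branch can still be revisited later, which is the practical difference between a tree search and simple hill climbing.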
The team evaluated ERA across various Kaggle playground competitions and six scientific benchmarks. These spanned bioinformatics, epidemiology, geospatial analysis, neuroscience, and numerical computation. They included tasks such as single-cell RNA sequencing (scRNA-seq) batch integration, COVID-19 hospitalization forecasting, time-series prediction, geospatial segmentation, neural activity modeling in zebrafish, and numerical integration problems.
Researchers assessed ERA’s performance using competition rankings and task-specific scoring systems. To predict COVID-19-related hospitalizations in the United States (US), they tested ERA using a rolling validation approach, in which models were optimized and selected using the preceding 6 weeks of data, while training used historical hospitalization records. They also evaluated ERA when it was given only short summaries of CovidHub methods rather than the original code, and on the General Time Series Forecasting Model Evaluation (GIFTEval) benchmark.
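As a rough illustration of rolling validation, the sketch below selects, for each forecast week, whichever candidate model had the lowest error over the preceding 6 weeks, with every forecast fitted only on earlier history. The three toy forecasters, the function names, and the use of mean absolute error are assumptions made for this example; the study's own models and selection metric are not reproduced here.

```python
from statistics import mean
from typing import Callable, Dict, List, Sequence, Tuple

Forecaster = Callable[[Sequence[float]], float]


def persistence(history: Sequence[float]) -> float:
    """Next week equals the last observed week."""
    return history[-1]


def moving_average(history: Sequence[float]) -> float:
    """Next week equals the mean of the last four observed weeks."""
    return mean(history[-4:])


def linear_trend(history: Sequence[float]) -> float:
    """Extend the most recent week-over-week change by one step."""
    return history[-1] + (history[-1] - history[-2])


def validation_error(model: Forecaster, series: Sequence[float], t: int, window: int) -> float:
    """Mean absolute error of one-step forecasts for weeks t-window .. t-1."""
    errors = []
    for week in range(t - window, t):
        forecast = model(series[:week])   # only data before `week` is visible
        errors.append(abs(forecast - series[week]))
    return mean(errors)


def rolling_forecast(
    series: Sequence[float],
    models: Dict[str, Forecaster],
    start: int,
    window: int = 6,
) -> List[Tuple[int, str, float]]:
    """For each week t >= start, pick the best recent model and forecast week t."""
    results = []
    for t in range(start, len(series)):
        best_name = min(
            models, key=lambda name: validation_error(models[name], series, t, window)
        )
        results.append((t, best_name, models[best_name](series[:t])))
    return results


if __name__ == "__main__":
    weekly = [10 + 2 * i for i in range(14)]   # synthetic, steadily rising counts
    candidates = {
        "persistence": persistence,
        "moving_average": moving_average,
        "linear_trend": linear_trend,
    }
    for week, name, forecast in rolling_forecast(weekly, candidates, start=8):
        print(week, name, forecast)
```

The key property is that the selection window slides forward with the forecast date, so no model is ever chosen using data from the week it is asked to predict.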
Scientific Benchmark Performance Results
ERA consistently demonstrated expert-level performance across multiple scientific disciplines, outperforming human-developed methods and benchmark systems on several tasks. In bioinformatics, the system generated 40 new approaches for scRNA-seq analysis, surpassing leading methods on the OpenProblems leaderboard. An ERA-developed variant of the Batch Balanced K-Nearest Neighbors (BBKNN) method improved overall performance by 14% compared with previously published approaches. Importantly, ERA preserved key biological signals during batch correction.
In epidemiology, the system produced 14 forecasting strategies that outperformed the official CovidHub Ensemble in predicting COVID-19-related hospitalizations in the US. ERA achieved a mean Weighted Interval Score (WIS) of 26, compared with 29 for the official CovidHub Ensemble benchmark; lower scores indicate better performance. The system achieved this by recombining strengths from different modeling approaches, such as pairing statistical trend analysis with epidemiological disease-spread models. Many hybrid strategies developed using ERA’s TS algorithm also performed better than their parent models, highlighting the value of recombining methods.
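For readers unfamiliar with the metric, the sketch below computes the WIS for a single forecast using the definition commonly applied in forecast hub evaluations: a weighted sum of the absolute error of the median and the interval scores of several central prediction intervals. The function names, the two interval levels, and the example numbers are illustrative assumptions; the quantile levels used in the study are not given in this article.

```python
from typing import Dict, Tuple


def interval_score(lower: float, upper: float, y: float, alpha: float) -> float:
    """Score for a central (1 - alpha) prediction interval; lower is better.

    Interval width plus a penalty, scaled by 2 / alpha, whenever the
    observation y falls outside [lower, upper].
    """
    width = upper - lower
    below = (2.0 / alpha) * max(lower - y, 0.0)
    above = (2.0 / alpha) * max(y - upper, 0.0)
    return width + below + above


def weighted_interval_score(
    median: float,
    intervals: Dict[float, Tuple[float, float]],
    y: float,
) -> float:
    """WIS for one forecast.

    `intervals` maps alpha to the (lower, upper) bounds of the central
    (1 - alpha) interval. The median gets weight 1/2 and each interval
    gets weight alpha / 2; the sum is normalised by K + 1/2, where K is
    the number of intervals.
    """
    k = len(intervals)
    total = 0.5 * abs(y - median)
    for alpha, (lower, upper) in intervals.items():
        total += (alpha / 2.0) * interval_score(lower, upper, y, alpha)
    return total / (k + 0.5)


if __name__ == "__main__":
    # Hypothetical forecast: median 100 with 50% and 90% intervals; 120 observed.
    forecast_intervals = {0.5: (90.0, 110.0), 0.1: (70.0, 130.0)}
    print(weighted_interval_score(100.0, forecast_intervals, y=120.0))
```

A sharper forecast (narrower intervals) lowers the score only if the observation still falls inside the intervals, which is why the metric rewards forecasts that are both confident and well calibrated.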
The system also demonstrated robust performance in time-series forecasting, geospatial image segmentation, brain activity estimation in zebrafish, and numerical integration tasks. In several cases, ERA exceeded leaderboard results from foundation models, deep learning systems, and traditional forecasting approaches. The system’s advantage stemmed from its ability to continuously explore and refine thousands of software variations while integrating external scientific knowledge from research papers, textbooks, and search engines.
Adding problem-specific guidance to the prompts considerably improved performance. As an example, researchers instructed ERA to create its own boosted decision tree (BDT) library without using existing software packages. They manually verified the results, confirming that ERA followed these instructions. The system also performed consistently well without publicly available code.
AI Research Automation Implications
The findings suggest that AI-driven systems such as ERA could dramatically speed up some forms of computational scientific work by reducing the time, expertise, and computational effort needed to develop advanced research software. By rapidly generating and refining high-performing solutions across various fields using a score-based optimization process, ERA may help researchers tackle complex scientific challenges more efficiently. The system can generate expert-level software in hours or days instead of weeks or months, potentially accelerating progress across multiple areas of science.
However, the authors stress that optimizing empirical predictive models is not the same as full scientific discovery, which also requires reasoning about mechanisms, causal relationships, theories, and mathematical frameworks. They also note broader safety risks if such systems lower the expertise barrier for deploying advanced computational models in sensitive domains.