RomeroLab/BioDesignBench

BioDesignBench

Evaluating LLM-Driven Protein Design: Agents Lack Iterative Evaluation Depth

Jeonghyeon Kim & Philip Romero, Romero Lab, Duke University

📄 Paper: coming soon  ·  🏆 Leaderboard: RomeroLab-Duke/BioDesignBench-Leaderboard  ·  🧬 Reference MCP server: jasonkim8652/protein-design-mcp · pip install protein-design-mcp

BioDesignBench is a benchmark for testing whether tool-augmented LLM agents can orchestrate the stochastic, multi-step pipelines of computational protein design. Existing chemistry-agent and code-agent benchmarks evaluate deterministic tool chains. We focus on a qualitatively different setting: generative tools (RFdiffusion, ProteinMPNN, Boltz-2) sample from distributions over structures and sequences, so a competent practitioner must generate multiple candidates and screen them across complementary biophysical metrics before a design is viable.
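The generate-then-screen pattern described above can be illustrated with a toy sketch. Nothing here calls RFdiffusion, ProteinMPNN, or Boltz-2: the generator is a seeded random stand-in, and the metric names and thresholds are assumptions chosen only to show why a candidate must clear several complementary filters at once.

```python
import random


def sample_candidates(n, seed=0):
    """Stand-in for a stochastic generator (hypothetical, not a real design tool)."""
    rng = random.Random(seed)
    # Each candidate carries toy values for three complementary metrics.
    return [
        {
            "id": i,
            "fold_confidence": rng.uniform(0.0, 1.0),  # higher is better
            "interface_score": rng.uniform(0.0, 1.0),  # higher is better
            "clash_count": rng.randint(0, 5),          # lower is better
        }
        for i in range(n)
    ]


def passes_screen(c, min_conf=0.7, min_iface=0.5, max_clashes=1):
    """A candidate is viable only if it clears every metric, not just one."""
    return (
        c["fold_confidence"] >= min_conf
        and c["interface_score"] >= min_iface
        and c["clash_count"] <= max_clashes
    )


candidates = sample_candidates(200)
viable = [c for c in candidates if passes_screen(c)]
print(f"{len(viable)} of {len(candidates)} candidates pass all filters")
```

Because each filter removes a different subset, only a small fraction of samples typically survives all of them, which is why a single sample per task is rarely enough.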

We evaluate four frontier LLMs (DeepSeek V3, GPT-5, Claude Sonnet 4.5, Gemini 2.5 Pro) under guided and unguided MCP-tool presentation modes against deterministic and human baselines on 76 expert-curated tasks drawn from 2024–2026 literature. The headline finding: top-tier agents now beat a hardcoded pipeline, but invoke evaluation tools at only 14% of expert depth, and workflow guidance rescues coverage without rescuing depth.

                                  Hybrid score (100 pts)
    Human Oracle                  █████████████████████  74.9
    Human Expert                  █████████████████      61.3
    DeepSeek V3 (unguided)        █████████████████      60.4
    DeepSeek V3 (guided)          ████████████████       58.5
    GPT-5 (unguided)              ███████████████        55.6
    GPT-5 (guided)                ███████████████        55.3
    Hardcoded Pipeline            ███████████████        54.2
    Claude Sonnet 4.5 (guided)    ██████████████         50.2
    Claude Sonnet 4.5 (unguided)  ████████████           41.2
    Gemini 2.5 Pro                ██                      8.4

Three principal findings

  1. Top-tier LLM agents now beat a deterministic pipeline. DeepSeek V3 and GPT-5 surpass a hand-engineered hardcoded pipeline (54.2) under both modes. Autonomous protein-design orchestration is no longer infeasible.
  2. Coverage–depth dissociation. Workflow guidance closes the coverage gap (Rescue Index up to +3.01) but leaves utilisation depth unchanged (Rescue Index ≈ 0). Better tool docs cannot teach iterative depth.
  3. Evaluation depth, not tool knowledge, is the bottleneck. Across 836 task–condition observations, evaluation depth per candidate correlates with total score at ρ = 0.685 (p < 10⁻¹¹⁷). LLM agents generate backbone candidates at expert-level rates but evaluate each one at 14% of expert depth. Forced-depth interventions confirm this is causal.
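For readers unfamiliar with the rank correlation quoted in finding 3, here is a minimal standard-library sketch of Spearman's ρ (Pearson correlation of average ranks). The data are synthetic and the generating model is an assumption made only for illustration. It does not reproduce the reported ρ = 0.685, which depends on the privately held score tables.

```python
import math
import random
from statistics import mean


def average_ranks(values):
    """1-based ranks, with tied values sharing the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # mean of positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation; assumes neither input is constant."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / math.sqrt(vx * vy)


def spearman_rho(x, y):
    return pearson(average_ranks(x), average_ranks(y))


# Synthetic stand-in for 836 task-condition observations (not real data).
rng = random.Random(0)
depth = [rng.gammavariate(2.0, 1.5) for _ in range(836)]
score = [20 + 8 * math.log1p(d) + rng.gauss(0, 5) for d in depth]
rho = spearman_rho(depth, score)
print(f"Spearman rho on synthetic data: {rho:.3f}")
```

Rank correlation is a natural choice here because it only assumes a monotonic relationship between depth and score, not a linear one.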

Why the task data is not in this repo

To prevent contamination of future language models, the 76 task specifications, their input PDBs, ground truth, and oracle outputs are deliberately not released here. The benchmark is hosted as a private HuggingFace dataset and agents are evaluated through the public submission flow at the leaderboard URL above. The repo contains:

  • the scoring & evaluation pipeline (biodesignbench/eval/)
  • the agent harness, baselines, and bio-specific agent wrappers (biodesignbench/agents/)
  • the MCP tool provider that maps the 17 reference tools to Anthropic / OpenAI / Gemini function-calling schemas (biodesignbench/tools/)
  • the 2 × 5 taxonomy module (biodesignbench/taxonomy.py)
  • the LLM judge for the 28-point rubric portion (biodesignbench/eval/llm_judge/)
  • all paper figure-generating analysis scripts (scripts/analysis/)
  • the HuggingFace Space leaderboard backend (biodesignbench-leaderboard/)
  • a public demo task for reviewer reproducibility (examples/demo_task/)
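To make the schema-mapping item in the list above concrete, the sketch below converts one MCP-style tool definition (name, description, inputSchema) into the tool formats accepted by the Anthropic, OpenAI Chat Completions, and Gemini APIs. The tool name `predict_structure` and its parameter are hypothetical, and the helper functions are illustrative; the actual code in biodesignbench/tools/ may be organised differently.

```python
import json

# Hypothetical tool definition in MCP style; not one of the 17 reference tools.
mcp_tool = {
    "name": "predict_structure",
    "description": "Predict a 3D structure for an amino-acid sequence.",
    "inputSchema": {
        "type": "object",
        "properties": {
            "sequence": {"type": "string", "description": "One-letter amino-acid sequence"},
        },
        "required": ["sequence"],
    },
}


def to_anthropic(tool):
    """Anthropic Messages API: the JSON Schema goes under 'input_schema'."""
    return {
        "name": tool["name"],
        "description": tool["description"],
        "input_schema": tool["inputSchema"],
    }


def to_openai(tool):
    """OpenAI Chat Completions: wrapped in a 'function' object with 'parameters'."""
    return {
        "type": "function",
        "function": {
            "name": tool["name"],
            "description": tool["description"],
            "parameters": tool["inputSchema"],
        },
    }


def to_gemini(tools):
    """Gemini: a single tool entry holding a list of function declarations."""
    return {
        "function_declarations": [
            {
                "name": t["name"],
                "description": t["description"],
                "parameters": t["inputSchema"],
            }
            for t in tools
        ]
    }


print(json.dumps(to_openai(mcp_tool), indent=2))
```

Gemini accepts only a subset of JSON Schema keywords, so a production converter would likely need to strip or translate unsupported fields; that step is omitted here.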

Anything that would let you reconstruct a task (input files, prompts, ground truth, baseline outputs, results CSVs) is held privately by Romero Lab and served at evaluation time only. Researchers requiring per-task data for replication studies may contact the corresponding author under a data use agreement.

Repository layout

BioDesignBench/
├── biodesignbench/                # Python package
│   ├── taxonomy.py                # 2 × 5 design matrix (DesignApproach × MolecularSubject)
│   ├── eval/                      # 100-point scoring pipeline
│   │   ├── tier1/                 #   Bio-coding tasks (unit-test style)
│   │   ├── tier2/                 #   Design tasks (4D metrics + Boltz-2 verification)
│   │   ├── metrics/               #   approach / orchestration / quality / etc.
│   │   ├── llm_judge/             #   28-pt LLM judge panel (PoLL with self-exclusion)
│   │   └── pipeline.py            #   Top-level orchestration
│   ├── agents/                    # Agent harness
│   │   ├── general_purpose/       #   GPT-5, Claude Sonnet, Gemini, DeepSeek wrappers
│   │   ├── bio_specific/          #   Biomni / STELLA / BioML wrappers
│   │   └── baselines/             #   Hardcoded pipeline + human-expert agent
│   ├── tools/                     # 17-tool MCP provider with mode toggle
│   ├── interventions.py           # Forced-depth & low-diversity intervention specs
│   └── tool_audit.py              # Tool-call trace analysis
├── biodesignbench-leaderboard/    # Gradio HuggingFace Space (backend + UI)
├── examples/demo_task/            # Public demo task for reviewer reproducibility
├── scripts/analysis/              # All paper figure / SI analysis scripts (60 files)
├── docker/sandbox/                # Sandbox image for executing agent-generated code
├── docs/PRD.md                    # Project requirements document
├── pyproject.toml
└── environment.yml
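The layout mentions a judge panel "with self-exclusion". One plausible reading, sketched below, is that the judge belonging to the same model family as the agent under evaluation is dropped before the remaining judges' scores are aggregated. The family labels, the use of a plain mean, and the function name are all assumptions; the code in biodesignbench/eval/llm_judge/ may aggregate differently.

```python
from statistics import mean


def panel_score(judge_scores, agent_family):
    """Average the rubric scores of every judge except the agent's own family.

    judge_scores maps a judge's model family to the score it assigned
    (out of the 28 rubric points). Both the keys and the averaging rule
    are illustrative assumptions.
    """
    eligible = {j: s for j, s in judge_scores.items() if j != agent_family}
    if not eligible:
        raise ValueError("no eligible judges left after self-exclusion")
    return mean(eligible.values())


# Made-up scores: an agent from family "c" is never graded by judge "c".
scores = {"a": 20, "b": 24, "c": 10}
print(panel_score(scores, agent_family="c"))
```

Excluding the agent's own family guards against a model systematically favouring outputs that resemble its own.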

System requirements

  • Operating systems tested: Ubuntu 22.04 LTS, macOS 14 (Sonoma).
  • Python: 3.11 (pinned in environment.yml and pyproject.toml).
  • Required non-standard hardware: NVIDIA GPU (A10G or comparable) for RFdiffusion, Boltz-2, ESMFold, and ProteinMPNN. The scoring pipeline, analysis scripts, and figure-generation code run on CPU.
  • Typical install time on a normal desktop: 10 to 15 minutes for the conda environment; approximately 30 minutes total when including pip extras and the protein-design-mcp Docker image pull.
  • Key dependency versions (full pin list in pyproject.toml and environment.yml): NumPy ≥ 1.24, pandas ≥ 2.0, SciPy ≥ 1.10, scikit-learn ≥ 1.3, biopython ≥ 1.81, PyTorch ≥ 2.0, matplotlib ≥ 3.7, seaborn ≥ 0.12, anthropic SDK ≥ 0.75, openai ≥ 1.12, google-generativeai ≥ 0.8.

Quickstart (developers)

1. Install

git clone https://github.com/RomeroLab/BioDesignBench.git
cd BioDesignBench

# Conda environment (CPU only; no protein-design GPU tools)
conda env create -f environment.yml
conda activate biodesignbench

# Editable install with optional extras
pip install -e ".[dev,agents]"

For the GPU-side protein-design tools (RFdiffusion, ProteinMPNN, Boltz-2, PyRosetta, AF2), install the reference MCP server:

pip install protein-design-mcp
# Source, Dockerfiles, and Modal deploy template:
#   https://github.com/jasonkim8652/protein-design-mcp

2. Configure API keys

cp .env.example .env
# Edit .env and set ANTHROPIC_API_KEY / OPENAI_API_KEY / GOOGLE_API_KEY / DEEPSEEK_API_KEY
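A quick way to confirm the keys are visible to Python before launching an agent run is sketched below. It assumes the variables have already been exported into the process environment (for example by your shell or a dotenv loader); whether the package loads .env automatically is not stated here. It also treats all four keys as expected, although only the providers you actually use may be needed.

```python
import os

# Variable names taken from the .env instructions above.
PROVIDER_KEYS = [
    "ANTHROPIC_API_KEY",
    "OPENAI_API_KEY",
    "GOOGLE_API_KEY",
    "DEEPSEEK_API_KEY",
]


def missing_keys(env=None):
    """Return the provider keys that are unset or empty in the given mapping."""
    env = os.environ if env is None else env
    return [k for k in PROVIDER_KEYS if not env.get(k)]


missing = missing_keys()
if missing:
    print("Missing API keys:", ", ".join(missing))
else:
    print("All provider keys are set.")
```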

3. Inspect the scoring pipeline

from biodesignbench.eval.pipeline import score_design
from biodesignbench.taxonomy import get_category, DesignApproach, MolecularSubject

# 2 × 5 taxonomy
cat = get_category("dn_bnd_001")
print(cat.approach, cat.subject)
# DesignApproach.DE_NOVO, MolecularSubject.BINDER

# Score a hypothetical design (without task data, only the rubric pipeline)
help(score_design)

4. Run an analysis script

All paper figures and SI analyses are reproducible from the canonical score CSVs (held privately). Each script in scripts/analysis/ is named after the figure it produces:

scripts/analysis/bdb_022_fig2_leaderboard.py        # Figure 2: leaderboard
scripts/analysis/bdb_023_fig3_mode_comparison.py    # Figure 3: coverage–depth dissociation
scripts/analysis/bdb_050_variance_decomposition.py  # Figure 5: variance partition
scripts/analysis/bdb_060_contamination.py           # SI Figure 9: contamination

Demo

A worked example using a public trypsin-binder design task is shipped in examples/demo_task/ so reviewers and new users can run the scoring pipeline end to end without access to the private benchmark tasks. To run it:

biodesignbench score \
  --task examples/demo_task/trypsin_binder.json \
  --output examples/demo_output/

Expected output: a JSON file in examples/demo_output/ containing the six rubric component scores (Approach, Orchestration, Quality, Feasibility, Novelty, Diversity) summing to a total out of 100, a per-task scoring log, and the predicted complex structure as a PDB file.
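A small sanity check on the output JSON can catch a truncated or malformed run. The sketch below assumes a report shaped like `{"components": {...}, "total": ...}` with lowercase component keys. That layout, the key names, and the example numbers are hypothetical, so adapt them to the file the demo actually writes.

```python
# The six rubric components named in the expected-output description.
COMPONENTS = [
    "approach", "orchestration", "quality",
    "feasibility", "novelty", "diversity",
]


def check_score_report(report, tol=1e-6):
    """Verify all six components are present and sum to the reported total."""
    comps = report["components"]
    missing = [c for c in COMPONENTS if c not in comps]
    if missing:
        raise KeyError(f"missing components: {missing}")
    total = sum(comps[c] for c in COMPONENTS)
    if not 0 <= total <= 100:
        raise ValueError(f"total {total} is outside the 0-100 range")
    if abs(total - report["total"]) > tol:
        raise ValueError("component sum does not match reported total")
    return total


# Made-up numbers purely to exercise the check; not real demo output.
example = {
    "components": {
        "approach": 15, "orchestration": 12, "quality": 20,
        "feasibility": 10, "novelty": 8, "diversity": 5,
    },
    "total": 70,
}
print(check_score_report(example))
```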

Expected run time on a normal desktop: approximately 2 minutes for the scoring pipeline alone (using pre-computed structures shipped with the demo); approximately 10 minutes when also running Boltz-2 structure verification on a single A10G GPU.

The demo task is fully public and does not overlap with any of the 76 private benchmark tasks, so running it does not compromise the contamination defense described above.

Submitting an agent for evaluation

Submissions are accepted through the HuggingFace Space:

👉 https://huggingface.co/spaces/RomeroLab-Duke/BioDesignBench-Leaderboard

Unlike most agent benchmarks, submitters do not host an HTTP endpoint. The 76 task descriptions never leave Romero Lab infrastructure. You provide:

  • an LLM provider + API key: we run the BioDesignBench agent loop against your chosen model (Anthropic / OpenAI / Google / DeepSeek) inside the leaderboard backend. Your key is scrubbed from our records immediately after the dispatch phase.
  • (optional) a custom MCP URL if you want to evaluate your own tool implementations. Otherwise, the agent calls our reference protein-design-mcp endpoint.

Each submission carries a unique canary token embedded as an HTML comment in every task prompt, so we can retrospectively detect leakage if any future model regurgitates it.
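The canary mechanism can be sketched in a few lines: generate a unique token per submission, append it to each prompt as an HTML comment, and later scan model outputs for any token in that format. The `canary:` prefix, the 32-hex-character token, and the function names are assumptions for illustration; the leaderboard's real token format is not documented here.

```python
import re
import uuid

# Hypothetical marker format: <!-- canary:<32 hex chars> -->
CANARY_RE = re.compile(r"<!--\s*canary:([0-9a-f]{32})\s*-->")


def make_canary():
    """Create a random token that is vanishingly unlikely to occur by chance."""
    return uuid.uuid4().hex


def embed_canary(prompt, token):
    """Append the token as an HTML comment, invisible in rendered output."""
    return f"{prompt}\n<!-- canary:{token} -->"


def find_canaries(text):
    """Return every canary token found in a piece of text."""
    return CANARY_RE.findall(text)


token = make_canary()
tagged = embed_canary("Design a binder for the target chain.", token)
print(find_canaries(tagged) == [token])
```

Because each submission gets its own token, a match in some future model's output would identify not only that leakage occurred but also which submission it came from.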

Bring your own tools (Custom MCP)

If you want to benchmark a new tool implementation (a faster structure predictor, a different diffusion backbone, your own stability model) against the same 76 tasks and the same scoring rubric used by the paper, stand up an HTTPS endpoint satisfying the MCP contract and paste the URL into the "Advanced: Custom MCP" section of the submission form.

The MCP server, whether ours or yours, only ever sees operational tool arguments (sequences, PDB paths, hotspot residues). It never sees the raw task prompt or evaluation criteria.

Rate limit: 1 submission per calendar month per organization. LLM-judge API costs are paid by Romero Lab; please be considerate.

Backend pipeline status

Phase  Step                                           Status
A      Dispatch tasks → CPU scoring (5/6 components)  live
B      Boltz-2 structure verification                 live (Modal-hosted A10G sidecar)
C      LLM-judge panel (28-pt hybrid)                 live
D      Finalize + publish                             live

See biodesignbench-leaderboard/README.md for the Modal companion-app deployment notes.

Citation

@article{biodesignbench2026,
  title  = {Evaluating LLM-Driven Protein Design:
            Agents Lack Iterative Evaluation Depth},
  author = {Kim, Jeonghyeon and Romero, Philip},
  year   = {2026},
}

License

Code: MIT. Task content (held privately): not licensed for redistribution.

Contact

  • Jeonghyeon Kim: jeonghyeon.kim@duke.edu
  • Philip Romero: philip.romero@duke.edu
