Leaderboard
Table 1: Subtask-level average success rate (%) across 8 scientific instrument simulators
Foundation
Multimodal foundation models without an agent framework, averaged over subtasks
| # | Model | Affiliation | Organization | Avg | SEM | SPM | TEM | XRD | LFM | FIB | APT | EDS |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
Agentic
Agent frameworks with planning, tool use, and multi-step reasoning, averaged over subtasks
| # | System | Affiliation | Organization | Avg | SEM | SPM | TEM | XRD | LFM | FIB | APT | EDS |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
Success by Task Category
Table 2: Best model performance on five instrument-operation categories
Instrument Comparison
Human experts vs. best agents (subtask-level)
Eight Instruments
96 subtasks across 8 web-based scientific instrument simulators
End-to-End Workflow Evaluation
GPT-5.5 full-workflow evaluation (10 runs per workflow)
| Instrument | E2E Success Rate | Status |
|---|---|---|
Task Categories
Key Findings & Challenges
Subtask vs. End-to-End
Best subtask performance vs. GPT-5.5 end-to-end vs. OSWorld human baseline
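
The page reports two kinds of success rate: one averaged over subtasks and one for full workflows (10 runs per workflow). The sketch below shows one plausible way to compute each. It is an illustration, not the benchmark's actual scoring code: the function names, the subtask names, and the outcome data are all hypothetical, and it assumes every subtask is weighted equally in the average.

```python
from statistics import mean


def success_rate(outcomes):
    """Return the percentage of successful runs in a list of booleans."""
    if not outcomes:
        raise ValueError("no runs recorded")
    return 100.0 * sum(outcomes) / len(outcomes)


def subtask_average(subtask_outcomes):
    """Average the per-subtask success rates, weighting each subtask equally.

    Equal weighting is an assumption of this sketch, not a documented rule.
    """
    return mean(success_rate(runs) for runs in subtask_outcomes.values())


# Hypothetical outcomes, for illustration only (not benchmark results).
subtask_runs = {
    "focus": [True, True, False, True],    # 3 of 4 runs succeed
    "align": [True, False, False, False],  # 1 of 4 runs succeed
}
workflow_runs = [True] + [False] * 9       # 1 of 10 full-workflow runs succeeds

subtask_avg = subtask_average(subtask_runs)
e2e_rate = success_rate(workflow_runs)
print(f"subtask-level average: {subtask_avg:.1f}%")
print(f"end-to-end success:    {e2e_rate:.1f}%")
```

In this toy data the end-to-end rate is lower than the subtask average, which is the usual pattern when a workflow succeeds only if every step in the chain does.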