Benchmark Results

LabOSBench

Leaderboard

Table 1: Subtask-level average success rate (%) across 8 scientific instrument simulators

Multimodal foundation models without an agent framework, averaged over subtasks
# Model Affiliation Organization Avg SEM SPM TEM XRD LFM FIB APT EDS

Success by Task Category

Table 2: Best model performance on five instrument-operation categories

Instrument Comparison

Human experts vs. best agents (subtask-level)

Eight Instruments

96 subtasks across 8 web-based scientific instrument simulators

End-to-End Workflow Evaluation

GPT-5.5 full-workflow evaluation (10 runs per workflow)

Instrument | E2E Success Rate | Status

Task Categories

Key Findings & Challenges

Subtask vs. End-to-End

Best subtask performance vs. GPT-5.5 end-to-end vs. OSWorld human baseline