Dev Set Leaderboard - Updates continuously as teams submit. Scores below include organizer baselines (non-competing, shown in grey) for reference. Competition submissions will appear once the dev leaderboard goes live in May 2026.
| Rank | Team / Model | Affiliation | Avg Pass^3 | Avg Pass@1 | Base Pass^3 | Hall. Pass^3 | Disamb. Pass^3 |
|------|--------------|-------------|------------|------------|-------------|--------------|----------------|
| Baseline | Claude Opus 4.6 | Anthropic | .58 | .67 | .80 | .48 | .46 |
| Baseline | GPT-5 | OpenAI | .54 | .65 | .66 | .60 | .36 |
| Baseline | GPT-5.2 | OpenAI | .53 | .68 | .61 | .57 | .42 |
| Baseline | Claude Opus 4.5 | Anthropic | .52 | .65 | .67 | .52 | .38 |
| Baseline | Claude Sonnet 4 | Anthropic | .47 | .59 | .63 | .46 | .32 |
| Baseline | Gemini 2.5 Flash | Google | .41 | .54 | .59 | .41 | .22 |
| Baseline | Gemini 2.5 Pro | Google | .38 | .51 | .53 | .34 | .28 |
| Baseline | GPT-4.1 | OpenAI | .37 | .46 | .57 | .31 | .22 |
| Baseline | Qwen3-32B | Alibaba | .31 | .47 | .45 | .27 | .22 |
| Baseline | xLAM-2-32B | Salesforce | .16 | .25 | .26 | .11 | .12 |

Competition submissions will appear above the baselines once the dev leaderboard opens.

Cerebras Fast-Reasoning Track - This leaderboard will be populated once the Cerebras Track opens for submissions. Teams will be ranked by a combination of innovative compute use and Pass^3 performance.

Coming Soon

The Cerebras Track leaderboard will open alongside the dev leaderboard in May 2026.

Hidden Test Set Results - Updated after each evaluation round (first: Jul 10, final: Jul 19). Final rankings and awards are determined by the Jul 19 round.

Not Yet Available

Hidden test set results will be published after the first evaluation round (Jul 10) and updated after the final round (Jul 19).

How Scoring Works

Pass^3 measures deployment-ready consistency, not average performance.

Pass^3: A task scores 1 only if the agent solves it correctly in all 3 independent trials. The aggregate Pass^3 is the fraction of tasks scored 1.

Pass@1: Shown for comparison. A task scores 1 if the agent solves it in at least one of the 3 trials. The gap between Pass@1 and Pass^3 reveals consistency issues.
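
To make the aggregation concrete, the snippet below is a minimal sketch (in Python, not the official scoring harness) of how Pass^3 and Pass@1 could be computed from per-trial outcomes; the task IDs, data, and function name are illustrative.

```python
from collections import defaultdict

def aggregate_scores(trial_results):
    """Compute Pass^3 and Pass@1 from (task_id, passed) records.

    Assumes exactly 3 independent trials are recorded per task,
    as described above.
    """
    outcomes = defaultdict(list)
    for task_id, passed in trial_results:
        outcomes[task_id].append(passed)

    n_tasks = len(outcomes)
    # Pass^3: a task counts only if all 3 trials succeeded.
    pass_cubed = sum(all(trials) for trials in outcomes.values()) / n_tasks
    # Pass@1 (as defined above): a task counts if at least 1 trial succeeded.
    pass_at_1 = sum(any(trials) for trials in outcomes.values()) / n_tasks
    return pass_cubed, pass_at_1

# Illustrative data: 3 trials per task.
results = [
    ("task_a", True), ("task_a", True), ("task_a", True),    # consistent pass
    ("task_b", True), ("task_b", False), ("task_b", True),   # flaky
    ("task_c", False), ("task_c", False), ("task_c", False), # consistent fail
]
p3, p1 = aggregate_scores(results)
print(f"Pass^3 = {p3:.2f}, Pass@1 = {p1:.2f}")  # Pass^3 = 0.33, Pass@1 = 0.67
```

On this toy data, the flaky task lifts Pass@1 (0.67) well above Pass^3 (0.33), which is exactly the consistency gap the metric is designed to expose.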

Learn more about the metric →