Dev Set Leaderboard - Updates continuously as teams submit. Scores below include organizer baselines (non-competing, shown in grey) for reference. Competition submissions will appear once the dev leaderboard goes live in May 2026.
| Rank | Team / Model | Affiliation | Avg Pass^3 | Avg Pass@1 | Base Pass^3 | Hall. Pass^3 | Disamb. Pass^3 |
|---|---|---|---|---|---|---|---|
| Baseline | Claude Opus 4.6 | Anthropic | .58 | .67 | .80 | .48 | .46 |
| Baseline | GPT-5 | OpenAI | .54 | .65 | .66 | .60 | .36 |
| Baseline | GPT-5.2 | OpenAI | .53 | .68 | .61 | .57 | .42 |
| Baseline | Claude Opus 4.5 | Anthropic | .52 | .65 | .67 | .52 | .38 |
| Baseline | Claude Sonnet 4 | Anthropic | .47 | .59 | .63 | .46 | .32 |
| Baseline | Gemini 2.5 Flash | Google | .41 | .54 | .59 | .41 | .22 |
| Baseline | Gemini 2.5 Pro | Google | .38 | .51 | .53 | .34 | .28 |
| Baseline | GPT-4.1 | OpenAI | .37 | .46 | .57 | .31 | .22 |
| Baseline | Qwen3-32B | Alibaba | .31 | .47 | .45 | .27 | .22 |
| Baseline | xLAM-2-32B | Salesforce | .16 | .25 | .26 | .11 | .12 |
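
The page does not say how Avg Pass^3 is derived from the three split columns. The baseline numbers are consistent with an unweighted mean of Base, Hall., and Disamb. Pass^3, rounded to two decimals. The Python sketch below checks that reading against the published rows. It is an observation about the numbers shown, not a documented scoring rule, and the variable names are illustrative.

```python
# Baseline rows copied from the leaderboard:
# (Avg Pass^3, Base Pass^3, Hall. Pass^3, Disamb. Pass^3)
rows = {
    "Claude Opus 4.6":  (0.58, 0.80, 0.48, 0.46),
    "GPT-5":            (0.54, 0.66, 0.60, 0.36),
    "GPT-5.2":          (0.53, 0.61, 0.57, 0.42),
    "Claude Opus 4.5":  (0.52, 0.67, 0.52, 0.38),
    "Claude Sonnet 4":  (0.47, 0.63, 0.46, 0.32),
    "Gemini 2.5 Flash": (0.41, 0.59, 0.41, 0.22),
    "Gemini 2.5 Pro":   (0.38, 0.53, 0.34, 0.28),
    "GPT-4.1":          (0.37, 0.57, 0.31, 0.22),
    "Qwen3-32B":        (0.31, 0.45, 0.27, 0.22),
    "xLAM-2-32B":       (0.16, 0.26, 0.11, 0.12),
}

# Assumption being tested: Avg = unweighted mean of the three splits.
# Half a unit in the second decimal, plus a little floating-point slack.
TOLERANCE = 0.005 + 1e-9

mismatches = []
for model, (avg, base, hall, disamb) in rows.items():
    mean = (base + hall + disamb) / 3
    if abs(mean - avg) > TOLERANCE:
        mismatches.append((model, avg, round(mean, 4)))

print("rows inconsistent with an unweighted mean:", mismatches)
```

If the splits contain different numbers of tasks, a task-weighted average would generally give different values, so the agreement here suggests (but does not prove) equal weighting of the three splits.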

Competition submissions will appear above the baselines once the dev leaderboard opens.

Track 2: Cerebras Fast-Reasoning - This leaderboard will be populated once Track 2 opens for submissions. Teams will be ranked by a combination of innovative Cerebras compute use and Pass^3 performance.

Coming Soon

Track 2 development uses direct Cerebras-hosted gpt-oss inference. Cerebras will provide higher rate limits than a free personal account offers; access details will follow soon.

Hidden Test Set Results - Updated after each evaluation round (1st: Jul 10, final: Jul 19). Final rankings and awards are determined from the Jul 19 round.

Not Yet Available

Hidden test set results will be published after the 1st evaluation (Jul 10) and updated after the final evaluation (Jul 19).

How Scoring Works

Pass^3 measures deployment-ready consistency, not average performance.

Pass^3: A task scores 1 only if the agent solves it correctly in all 3 independent trials. The aggregate Pass^3 is the fraction of tasks scored 1.

Pass^1 (listed as Pass@1 in the leaderboard table): Shown for comparison. A task scores 1 if solved in at least 1 trial. The gap between Pass^1 and Pass^3 reveals consistency issues.
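
The two definitions above can be sketched in Python as follows. This is a minimal illustration that follows the page's wording, not the organizers' evaluation code. The function names, the `example` data, and the task IDs are all hypothetical.

```python
from typing import Callable, Dict, List

TRIALS_PER_TASK = 3  # the page specifies 3 independent trials per task


def task_pass_all(trials: List[bool]) -> int:
    """Pass^3 rule: the task scores 1 only if every trial succeeded."""
    return int(all(trials))


def task_pass_any(trials: List[bool]) -> int:
    """Comparison rule as worded on the page: 1 if at least one trial succeeded."""
    return int(any(trials))


def aggregate(results: Dict[str, List[bool]],
              rule: Callable[[List[bool]], int]) -> float:
    """Fraction of tasks that score 1 under the given per-task rule."""
    if not results:
        raise ValueError("no tasks to score")
    for task_id, trials in results.items():
        if len(trials) != TRIALS_PER_TASK:
            raise ValueError(
                f"{task_id}: expected {TRIALS_PER_TASK} trials, got {len(trials)}"
            )
    return sum(rule(trials) for trials in results.values()) / len(results)


# Hypothetical outcomes for four made-up tasks (True = solved in that trial).
example = {
    "task_a": [True, True, True],     # consistent success
    "task_b": [True, False, True],    # solved sometimes, so fails Pass^3
    "task_c": [False, False, False],  # never solved
    "task_d": [True, True, True],     # consistent success
}

pass_3 = aggregate(example, task_pass_all)  # 2 of 4 tasks -> 0.5
pass_1 = aggregate(example, task_pass_any)  # 3 of 4 tasks -> 0.75
print(f"Pass^3 = {pass_3:.2f}, comparison score = {pass_1:.2f}")
```

In this made-up data, `task_b` counts toward the comparison score but not toward Pass^3, which is the kind of inconsistency the gap between the two numbers is meant to surface.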
