Track performance across all participants and baseline models.
| Rank | Team / Model | Affiliation | Avg Pass^3 | Avg Pass@1 | Base Pass^3 | Hall. Pass^3 | Disamb. Pass^3 |
|---|---|---|---|---|---|---|---|
| Baseline | Claude Opus 4.6 | Anthropic | .58 | .67 | .80 | .48 | .46 |
| Baseline | GPT-5 | OpenAI | .54 | .65 | .66 | .60 | .36 |
| Baseline | GPT-5.2 | OpenAI | .53 | .68 | .61 | .57 | .42 |
| Baseline | Claude Opus 4.5 | Anthropic | .52 | .65 | .67 | .52 | .38 |
| Baseline | Claude Sonnet 4 | Anthropic | .47 | .59 | .63 | .46 | .32 |
| Baseline | Gemini 2.5 Flash | Google | .41 | .54 | .59 | .41 | .22 |
| Baseline | Gemini 2.5 Pro | Google | .38 | .51 | .53 | .34 | .28 |
| Baseline | GPT-4.1 | OpenAI | .37 | .46 | .57 | .31 | .22 |
| Baseline | Qwen3-32B | Alibaba | .31 | .47 | .45 | .27 | .22 |
| Baseline | xLAM-2-32B | Salesforce | .16 | .25 | .26 | .11 | .12 |
Competition submissions will appear above the baselines once the dev leaderboard opens.
Coming soon: the Cerebras Track leaderboard will open alongside the dev leaderboard in May 2026.
Not yet available: hidden test set results will be published after the first evaluation (Jul 10) and updated after the final evaluation (Jul 19).
Pass^3 measures deployment-ready consistency rather than average performance: a task counts as solved only if all three independent trials succeed, so a model that is merely right on average scores lower than one that is reliably right.
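The gap between the two columns can be sketched with a few lines of code. This assumes the common definitions: Pass@1 is the mean success rate over individual trials, while Pass^3 counts a task only when all three of its trials pass. The trial data below is illustrative, not taken from the leaderboard.

```python
def pass_at_1(results: list[list[bool]]) -> float:
    """Average success rate across every individual trial."""
    trials = [t for task in results for t in task]
    return sum(trials) / len(trials)

def pass_pow_k(results: list[list[bool]], k: int = 3) -> float:
    """Fraction of tasks whose k trials ALL succeed (consistency)."""
    return sum(all(task[:k]) for task in results) / len(results)

# Three trials per task; the flaky task helps Pass@1 but not Pass^3.
example = [
    [True, True, True],     # consistently solved
    [True, False, True],    # flaky: two of three trials pass
    [False, False, False],  # consistently failed
]
print(f"Pass@1 = {pass_at_1(example):.2f}")   # 5 of 9 trials pass
print(f"Pass^3 = {pass_pow_k(example):.2f}")  # 1 of 3 tasks passes all trials
```

A model with many flaky tasks will show a large Pass@1 minus Pass^3 gap, which is exactly the pattern visible in the table above.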