Track performance across all participants and baseline models.
| Rank | Team / Model | Affiliation | Avg Pass^3 | Avg Pass@1 | Base Pass^3 | Hall. Pass^3 | Disamb. Pass^3 |
|---|---|---|---|---|---|---|---|
| Baseline | Claude Opus 4.6 | Anthropic | .58 | .67 | .80 | .48 | .46 |
| Baseline | GPT-5 | OpenAI | .54 | .65 | .66 | .60 | .36 |
| Baseline | GPT-5.2 | OpenAI | .53 | .68 | .61 | .57 | .42 |
| Baseline | Claude Opus 4.5 | Anthropic | .52 | .65 | .67 | .52 | .38 |
| Baseline | Claude Sonnet 4 | Anthropic | .47 | .59 | .63 | .46 | .32 |
| Baseline | Gemini 2.5 Flash | Google | .41 | .54 | .59 | .41 | .22 |
| Baseline | Gemini 2.5 Pro | Google | .38 | .51 | .53 | .34 | .28 |
| Baseline | GPT-4.1 | OpenAI | .37 | .46 | .57 | .31 | .22 |
| Baseline | Qwen3-32B | Alibaba | .31 | .47 | .45 | .27 | .22 |
| Baseline | xLAM-2-32B | Salesforce | .16 | .25 | .26 | .11 | .12 |
Competition submissions will appear above the baselines once the dev leaderboard opens.
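As a reading aid, the Avg Pass^3 column in the table above appears to be the unweighted mean of the three per-category Pass^3 columns (Base, Hall., Disamb.). This is an observation from the published numbers, not a documented scoring rule. The sketch below checks that assumption against the baseline rows; `category_mean` and `mismatches` are illustrative names, and the tolerance allows for rounding to two decimals.

```python
# Baseline rows copied from the table: (Avg Pass^3, Base, Hall., Disamb.)
rows = {
    "Claude Opus 4.6": (0.58, 0.80, 0.48, 0.46),
    "GPT-5": (0.54, 0.66, 0.60, 0.36),
    "GPT-5.2": (0.53, 0.61, 0.57, 0.42),
    "Claude Opus 4.5": (0.52, 0.67, 0.52, 0.38),
    "Claude Sonnet 4": (0.47, 0.63, 0.46, 0.32),
    "Gemini 2.5 Flash": (0.41, 0.59, 0.41, 0.22),
    "Gemini 2.5 Pro": (0.38, 0.53, 0.34, 0.28),
    "GPT-4.1": (0.37, 0.57, 0.31, 0.22),
    "Qwen3-32B": (0.31, 0.45, 0.27, 0.22),
    "xLAM-2-32B": (0.16, 0.26, 0.11, 0.12),
}


def category_mean(base: float, hall: float, disamb: float) -> float:
    """Unweighted mean of the three per-category Pass^3 scores (assumed rule)."""
    return (base + hall + disamb) / 3


# Collect any model whose reported average differs from the category mean
# by more than two-decimal rounding can explain.
mismatches = [
    name
    for name, (avg, base, hall, disamb) in rows.items()
    if abs(category_mean(base, hall, disamb) - avg) > 0.0051
]

print("rows inconsistent with the unweighted-mean assumption:", mismatches)
```

If the list is empty, every reported average is consistent with a plain mean of the three categories, so a weak category lowers the headline score as much as a strong one raises it.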
Coming Soon
Track 2 development uses direct Cerebras-hosted gpt-oss inference. Cerebras will provide increased rate limits compared with a free personal account; access details will follow soon.
Not Yet Available
Hidden test set results will be published after the first evaluation (Jul 10) and updated after the final evaluation (Jul 19).
Pass^3 measures deployment-ready consistency, not average performance.
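The difference between consistency and average performance can be made concrete with a minimal sketch. It assumes that Pass^3 counts a task as solved only when all three independent trials succeed, and that Pass@1 is the mean single-trial success rate; the page does not state the exact definitions, so both are assumptions. The function names and the toy data are hypothetical.

```python
def pass_hat_k(trials: list[list[bool]], k: int = 3) -> float:
    """Fraction of tasks whose first k trials ALL succeed (assumed Pass^k)."""
    if any(len(t) < k for t in trials):
        raise ValueError(f"every task needs at least {k} trials")
    return sum(all(t[:k]) for t in trials) / len(trials)


def pass_at_1(trials: list[list[bool]]) -> float:
    """Mean per-trial success rate across all tasks (assumed Pass@1)."""
    total_trials = sum(len(t) for t in trials)
    return sum(sum(t) for t in trials) / total_trials


# Toy data: four tasks, three trials each (True = trial passed).
results = [
    [True, True, True],     # consistent success
    [True, True, False],    # usually succeeds, but not reliably
    [True, False, True],    # usually succeeds, but not reliably
    [False, False, False],  # consistent failure
]

print(f"Pass@1 = {pass_at_1(results):.2f}")   # 7 of 12 trials pass
print(f"Pass^3 = {pass_hat_k(results):.2f}")  # 1 of 4 tasks passes every trial
```

Under these assumptions, the two tasks that succeed in two of three trials raise Pass@1 but contribute nothing to Pass^3, which is why a model's Pass^3 score sits below its Pass@1 score.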