Dev Set Leaderboard - Updates continuously as teams submit. Scores below include organizer baselines (non-competing, shown in grey) for reference. Competition submissions will appear once the dev leaderboard goes live in May 2026.
| Rank | Team / Model | Affiliation | Avg Pass^3 | Avg Pass@1 | Base Pass^3 | Hall. Pass^3 | Disamb. Pass^3 |
|------|--------------|-------------|------------|------------|-------------|--------------|----------------|
| Baseline | Claude Opus 4.6 | Anthropic | .58 | .67 | .80 | .48 | .46 |
| Baseline | GPT-5 | OpenAI | .54 | .65 | .66 | .60 | .36 |
| Baseline | GPT-5.2 | OpenAI | .53 | .68 | .61 | .57 | .42 |
| Baseline | Claude Opus 4.5 | Anthropic | .52 | .65 | .67 | .52 | .38 |
| Baseline | Claude Sonnet 4 | Anthropic | .47 | .59 | .63 | .46 | .32 |
| Baseline | Gemini 2.5 Flash | Google | .41 | .54 | .59 | .41 | .22 |
| Baseline | Gemini 2.5 Pro | Google | .38 | .51 | .53 | .34 | .28 |
| Baseline | GPT-4.1 | OpenAI | .37 | .46 | .57 | .31 | .22 |
| Baseline | Qwen3-32B | Alibaba | .31 | .47 | .45 | .27 | .22 |
| Baseline | xLAM-2-32B | Salesforce | .16 | .25 | .26 | .11 | .12 |

Competition submissions will appear above the baselines once the dev leaderboard opens.

Cerebras Fast-Reasoning Track - This leaderboard will be populated once the Cerebras Track opens for submissions. Teams will be ranked by a combination of innovative compute use and Pass^3 performance.

Coming Soon

The Cerebras Track leaderboard will open alongside the dev leaderboard in May 2026.

Hidden Test Set Results - Updated after each evaluation round (first: Jul 10, final: Jul 19). Final rankings and awards are determined by the Jul 19 round.

Not Yet Available

Hidden test set results will be published after the first evaluation round (Jul 10) and updated after the final round (Jul 19).

How Scoring Works

Pass^3 measures deployment-ready consistency, not average performance.

Pass^3: A task scores 1 only if the agent solves it correctly in all 3 independent trials. The aggregate Pass^3 is the fraction of tasks scored 1.

Pass@1: Shown for comparison. A task scores 1 if the agent solves it in at least one of the 3 trials. The gap between Pass@1 and Pass^3 reveals consistency issues.
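
To make the aggregation concrete, the snippet below is a minimal sketch (in Python, not the official scoring harness) of how Pass^3 and Pass@1 could be computed from per-trial outcomes; the task IDs, data, and function name are illustrative.

```python
from collections import defaultdict

def aggregate_scores(trial_results):
    """Compute Pass^3 and Pass@1 from (task_id, passed) records.

    Assumes exactly 3 independent trials are recorded per task,
    as described above.
    """
    outcomes = defaultdict(list)
    for task_id, passed in trial_results:
        outcomes[task_id].append(passed)

    n_tasks = len(outcomes)
    # Pass^3: a task counts only if all 3 trials succeeded.
    pass_cubed = sum(all(trials) for trials in outcomes.values()) / n_tasks
    # Pass@1 (as defined above): a task counts if at least 1 trial succeeded.
    pass_at_1 = sum(any(trials) for trials in outcomes.values()) / n_tasks
    return pass_cubed, pass_at_1

# Illustrative data: 3 trials per task.
results = [
    ("task_a", True), ("task_a", True), ("task_a", True),    # consistent pass
    ("task_b", True), ("task_b", False), ("task_b", True),   # flaky
    ("task_c", False), ("task_c", False), ("task_c", False), # consistent fail
]
p3, p1 = aggregate_scores(results)
print(f"Pass^3 = {p3:.2f}, Pass@1 = {p1:.2f}")  # Pass^3 = 0.33, Pass@1 = 0.67
```

On this toy data, the flaky task lifts Pass@1 (0.67) well above Pass^3 (0.33), which is exactly the consistency gap the metric is designed to expose.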

Learn more about the metric →