Tracks - CAR-bench Challenge @ IJCAI-ECAI 2026

At a Glance

Which Track Should You Enter?

The comparison below answers the most common question participants have.

Dimension	Track 1: Open Track	Track 2: Cerebras Fast-Reasoning
Goal	Maximize Pass^3 on the hidden test set using any approach	Convert ~10x faster Cerebras inference into higher Pass^3 within the same time budget per turn
Allowed Models	Any LLM, any provider (proprietary or open-source)	Models running on Cerebras inference infrastructure
Evaluation Metric	Pass^3 (task solved in all 3 trials)	Innovative compute use + Pass^3 improvement over the chosen model's baseline
Prize Structure	Rank Award + Best Innovation Award	Cerebras Innovation Award
Suitable For	Teams exploring architecture, prompting, fine-tuning, multi-agent approaches	Teams interested in leveraging fast inference for reasoning-heavy strategies
Compute	Bring your own API keys for development	Cerebras free tier for development

Track 1: Open Track

Motivation

The Open Track is designed for maximum flexibility: use any model, any provider, and any architecture to build the most reliable agent possible. The goal is to close the gap between frontier model capability and deployment-ready consistency on CAR-bench.

The current state of the art achieves only 58% consistent success (Pass^3), leaving significant room for improvement through better scaffolding, prompting, planning, and agent design.

What You’ll Build

A dockerized Agent-under-Test that communicates via the A2A (Agent-to-Agent) protocol with the CAR-bench evaluation environment. Your agent receives user requests, interacts with tools, and produces responses that are evaluated for correctness, policy compliance, limit-awareness, and disambiguation behavior.

Allowed Approaches

Prompting: System prompts, few-shot, chain-of-thought, self-reflection
Scaffolding: Planning-execution separation, tool-use strategies, retry logic
Multi-agent: Subagents, agent skills, orchestration patterns
Fine-tuning: SFT, RLHF, DPO on the training data
Context engineering: RAG over environment data, dynamic context management
Any other idea, or any combination of the above, that does not exploit the benchmark evaluation process

Evaluation

Teams are ranked by Pass^3 on a hidden test set. A task scores 1 only if solved correctly in all 3 independent trials. Per-category breakdown (Base / Hallucination / Disambiguation) is shown alongside the aggregate score.
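
As an illustration of the scoring rule above, the following is a minimal sketch of how Pass^3 and a per-category breakdown could be computed from recorded trial outcomes. The function names, task IDs, and data layout are assumptions made for this example; they are not the official evaluation code.

```python
from collections import defaultdict
from typing import Dict, List


def pass_hat_k(trials: Dict[str, List[bool]], k: int = 3) -> float:
    """Fraction of tasks solved in all k independent trials."""
    if not trials:
        raise ValueError("no tasks to score")
    solved = 0
    for task_id, outcomes in trials.items():
        if len(outcomes) != k:
            raise ValueError(f"task {task_id!r} has {len(outcomes)} trials, expected {k}")
        # A task scores 1 only if every trial succeeded.
        solved += all(outcomes)
    return solved / len(trials)


def pass_hat_k_by_category(
    trials: Dict[str, List[bool]], categories: Dict[str, str], k: int = 3
) -> Dict[str, float]:
    """Pass^k computed separately for each task category."""
    grouped: Dict[str, Dict[str, List[bool]]] = defaultdict(dict)
    for task_id, outcomes in trials.items():
        grouped[categories[task_id]][task_id] = outcomes
    return {cat: pass_hat_k(tasks, k) for cat, tasks in grouped.items()}


# Hypothetical trial outcomes (task IDs are made up for illustration).
example_trials = {
    "base_001": [True, True, True],
    "base_002": [True, False, True],      # one failed trial -> task scores 0
    "halluc_001": [True, True, True],
    "disamb_001": [False, False, False],
}
example_categories = {
    "base_001": "Base",
    "base_002": "Base",
    "halluc_001": "Hallucination",
    "disamb_001": "Disambiguation",
}

aggregate = pass_hat_k(example_trials)                                   # 0.5
breakdown = pass_hat_k_by_category(example_trials, example_categories)
```

Note that `base_002` succeeds in two of three trials yet contributes nothing to the score, which is why consistency matters as much as peak capability under this metric.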

Prizes

Rank Award: By Pass^3 on hidden test set
Best Innovation Award: For novel approaches, e.g. efficient use of open models or creative architectures. Evaluated based on the required technical report.
Certificates for award winners
(Optional) Oral presentation at IJCAI-ECAI 2026 competition slot
(Optional) Co-authorship on IJCAI 2027 demo paper


Track 2: Cerebras Fast-Reasoning Track

Motivation

Cerebras builds the world’s fastest AI inference infrastructure. Their output speed is ~10x faster than conventional GPU-based inference (e.g., GLM-4.7 on Cerebras: ~1,000 tokens/s vs. Claude Sonnet 4.5 on GPUs: ~97 tokens/s). This means that within the same time budget, a Cerebras-hosted model can generate roughly 10x more tokens per assistant turn.
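
The arithmetic behind the "roughly 10x" figure can be checked with a few lines. The throughput numbers are the approximate ones quoted above; the 10-second per-turn budget is a hypothetical value chosen only for illustration, since the actual budget is set by the organizers.

```python
# Approximate output speeds quoted in the text (tokens per second).
CEREBRAS_TOKENS_PER_S = 1000
GPU_TOKENS_PER_S = 97

# Hypothetical per-turn time budget in seconds (an assumption for this example).
BUDGET_S = 10.0


def tokens_in_budget(tokens_per_s: float, budget_s: float) -> float:
    """Upper bound on output tokens generated within the time budget."""
    return tokens_per_s * budget_s


speedup = CEREBRAS_TOKENS_PER_S / GPU_TOKENS_PER_S            # about 10.3
fast_tokens = tokens_in_budget(CEREBRAS_TOKENS_PER_S, BUDGET_S)  # 10000.0
slow_tokens = tokens_in_budget(GPU_TOKENS_PER_S, BUDGET_S)       # 970.0

print(f"speedup: {speedup:.1f}x, tokens per turn: {fast_tokens:.0f} vs {slow_tokens:.0f}")
```

This is an upper bound on output tokens only; it ignores network latency, prompt processing, and tool-call round trips, all of which also consume the per-turn budget.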

This track explores how to turn that extra token capacity within the same time budget into higher Pass^3 scores. For example, where a conventional agent gets one reasoning pass per turn, a Cerebras-hosted agent could run multiple passes, verify its own output, or explore alternative tool-call sequences, all before the time budget runs out. The question is: which strategies best convert raw output speed into measurable reliability gains?
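
One way to spend the extra passes is majority voting under a deadline. The sketch below is a generic illustration, not a prescribed design: `generate` is a hypothetical stand-in for a single model call that returns a candidate answer, and the budget and pass count are arbitrary example values.

```python
import time
from collections import Counter
from typing import Callable, Optional


def vote_within_budget(
    generate: Callable[[], str],
    budget_s: float,
    max_passes: int = 8,
    clock: Callable[[], float] = time.monotonic,
) -> Optional[str]:
    """Run up to max_passes generations before the deadline and return the
    most frequent answer, or None if no pass could be started in time."""
    deadline = clock() + budget_s
    votes: Counter = Counter()
    for _ in range(max_passes):
        # Only start a new pass while budget remains. A pass that is already
        # running is not interrupted here; a real agent would also need a
        # per-call timeout.
        if clock() >= deadline:
            break
        votes[generate()] += 1
    if not votes:
        return None
    return votes.most_common(1)[0][0]


# Stub generator standing in for a model call (hypothetical answers).
_canned = iter(["A", "B", "A", "A"])
winner = vote_within_budget(lambda: next(_canned), budget_s=60.0, max_passes=4)
print(winner)  # "A" wins with 3 of 4 votes
```

Voting on exact string equality only works when answers are canonical, such as a chosen tool call or a discrete decision; free-form responses would need a comparison or verification step instead.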

The focus is on the scaffolding architecture, not the underlying model. The time budget will be scaled to the models currently available on Cerebras’s platform.

What You’ll Build

The agent format is the same A2A-compatible one as in Track 1, but your agent runs on Cerebras inference infrastructure. During development, participants use the Cerebras free tier.

Allowed Approaches

All approaches from Track 1 are permitted
Additionally encouraged: Multi-pass reasoning, self-reflection loops, best-of-N sampling, search-based planning, verification chains, ensemble approaches
Agents must use Cerebras infrastructure

Evaluation

Teams are ranked primarily by innovative use of compute headroom, i.e. how meaningfully the scaffolding leverages fast inference in ways that differ from standard single-pass agent design. Pass^3 is taken into account only as the relative improvement over the chosen model's baseline: what counts is the delta the scaffolding produces, not the absolute score. Evaluation is based on the required technical report alongside the measured improvement.
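
To make the "delta, not absolute score" idea concrete, here is a small sketch. It assumes the improvement is measured as the difference in Pass^3 between the scaffolded agent and the same model run without scaffolding; the text does not state whether the organizers use an absolute difference or a ratio, so treat the formula and the numbers as illustrative assumptions.

```python
def scaffold_delta(baseline_pass3: float, scaffold_pass3: float) -> float:
    """Pass^3 gain attributable to the scaffolding (absolute difference).

    Assumption: both scores come from the same underlying model on the same
    task set, and each is a fraction in [0, 1].
    """
    for name, value in (("baseline", baseline_pass3), ("scaffold", scaffold_pass3)):
        if not 0.0 <= value <= 1.0:
            raise ValueError(f"{name} Pass^3 must be in [0, 1], got {value}")
    return scaffold_pass3 - baseline_pass3


# Hypothetical scores: a weaker model with a large gain versus a stronger
# model with a small gain.
team_a = scaffold_delta(baseline_pass3=0.30, scaffold_pass3=0.45)
team_b = scaffold_delta(baseline_pass3=0.50, scaffold_pass3=0.52)
print(f"team A delta: {team_a:+.2f}, team B delta: {team_b:+.2f}")
```

Under this reading, team A's scaffolding contributes more than team B's even though team B reaches the higher absolute Pass^3.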

Prizes

Cerebras Innovation Award: For the best demonstration of compute-aware agent design
Certificates for award winners
(Optional) Oral presentation at IJCAI-ECAI 2026 competition slot
(Optional) Co-authorship on IJCAI 2027 demo paper

