At a Glance

Which Track Should You Enter?

The comparison below answers the most common question participants have.

| Dimension | Track 1: Open Track | Track 2: Cerebras Fast-Reasoning |
| --- | --- | --- |
| Goal | Maximize Pass^3 on the hidden test set using any approach | Convert ~10x faster Cerebras inference into higher Pass^3 within the same per-turn time budget |
| Allowed Models | Any LLM, any provider (proprietary or open-source) | Models running on Cerebras inference infrastructure |
| Evaluation Metric | Pass^3 (task solved in all 3 trials) | Combination of innovative compute use and Pass^3 performance |
| Prize Structure | Rank Award + Best Innovation Award | Cerebras Innovation Award |
| Suitable For | Teams exploring architecture, prompting, fine-tuning, or multi-agent approaches | Teams interested in leveraging fast inference for reasoning-heavy strategies |
| Compute | Bring your own API keys for development | Cerebras free tier for development |

Track 1: Open Track

Motivation

The Open Track is designed for maximum flexibility: use any model, any provider, and any architecture to build the most reliable agent possible. The goal is to close the gap between frontier model capability and deployment-ready consistency on CAR-bench.

Current state-of-the-art achieves only 58% consistent success (Pass^3). There is significant room for improvement through better scaffolding, prompting, planning, and agent design.

What You’ll Build

A dockerized Agent-under-Test that communicates via the A2A (Agent-to-Agent) protocol with the CAR-bench evaluation environment. Your agent receives user requests, interacts with tools, and produces responses that are evaluated for correctness, policy compliance, limit-awareness, and disambiguation behavior.
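The actual A2A message schema is defined by the evaluation environment; as a rough structural sketch only (the `message`/`reply` payload fields and handler names below are placeholders of ours, not the real protocol), a minimal agent endpoint could look like:

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer

def handle_user_request(text: str) -> str:
    """Placeholder policy: a real agent would call an LLM and tools here."""
    return f"Received request: {text}"

class AgentHandler(BaseHTTPRequestHandler):
    def do_POST(self):
        length = int(self.headers.get("Content-Length", 0))
        payload = json.loads(self.rfile.read(length) or b"{}")
        # Hypothetical shape: {"message": "..."} in, {"reply": "..."} out.
        reply = handle_user_request(payload.get("message", ""))
        body = json.dumps({"reply": reply}).encode()
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

# To serve inside the container:
# HTTPServer(("0.0.0.0", 8080), AgentHandler).serve_forever()
```

The real Agent-under-Test must follow the A2A schema and port conventions given in the starter kit; this sketch only shows the request-in, response-out shape of the loop.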

Allowed Approaches

  • Prompting: System prompts, few-shot, chain-of-thought, self-reflection
  • Scaffolding: Planning-execution separation, tool-use strategies, retry logic
  • Multi-agent: Subagents, agent skills, orchestration patterns
  • Fine-tuning: SFT, RLHF, DPO on the training data
  • Context engineering: RAG over environment data, dynamic context management
  • Any combination of the above, provided it does not exploit the benchmark evaluation process
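As one illustration of the scaffolding category above, a retry loop with a self-check might look like the following sketch (the model call and verifier are toy stand-ins, not part of the benchmark):

```python
from typing import Callable, Optional

def solve_with_retries(
    generate: Callable[[str], str],
    verify: Callable[[str], bool],
    task: str,
    max_attempts: int = 3,
) -> Optional[str]:
    """Retry scaffolding: regenerate until a self-check passes, or give up."""
    for _ in range(max_attempts):
        candidate = generate(task)
        if verify(candidate):
            return candidate
    return None

# Toy stand-ins for an LLM call and a verifier.
attempts = iter(["bad answer", "bad answer", "42"])
answer = solve_with_retries(lambda t: next(attempts), lambda a: a == "42", "task")
```

In a real agent, `verify` could itself be a model call (self-reflection) or a deterministic check against tool outputs.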

Evaluation

Teams are ranked by Pass^3 on a hidden test set. A task scores 1 only if solved correctly in all 3 independent trials. Per-category breakdown (Base / Hallucination / Disambiguation) is shown alongside the aggregate score.
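For concreteness, Pass^3 over a set of tasks can be computed as:

```python
def pass_cubed(trial_results: dict) -> float:
    """Pass^3: a task counts only if all three independent trials succeed."""
    solved = sum(
        1 for trials in trial_results.values()
        if len(trials) == 3 and all(trials)
    )
    return solved / len(trial_results)

results = {
    "task_a": [True, True, True],    # counts toward Pass^3
    "task_b": [True, False, True],   # one failed trial: does not count
}
score = pass_cubed(results)  # 0.5
```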

Prizes

  • Rank Award: By Pass^3 on hidden test set
  • Best Innovation Award: For novel approaches such as efficient use of open models or creative architectures; evaluated based on the required technical report
  • Certificates for award winners
  • (Optional) Oral presentation at IJCAI-ECAI 2026 competition slot
  • (Optional) Co-authorship on IJCAI 2027 demo paper

Get started with this track →

Track 2: Cerebras Fast-Reasoning Track

Motivation

Cerebras builds the world’s fastest AI inference infrastructure. Their output speed is ~10x faster than conventional GPU-based inference (e.g., GLM-4.7 on Cerebras: ~1,000 tokens/s vs. Claude Sonnet 4.5 on GPUs: ~97 tokens/s). This means that within the same time budget, a Cerebras-hosted model can generate roughly 10x more tokens per assistant turn.

This track explores how to turn that extra token capacity within the same time budget into higher Pass^3 scores. For example, where a conventional agent gets one reasoning pass per turn, a Cerebras-hosted agent could run multiple passes, verify its own output, or explore alternative tool-call sequences — all before the time budget runs out. The question is: which strategies best convert raw output speed into measurable reliability gains?
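A quick back-of-envelope calculation makes the headroom concrete, assuming an illustrative 30-second per-turn budget (the actual budget is set by the organizers):

```python
# Token budgets per assistant turn, using the speeds quoted above
# (~1,000 tok/s on Cerebras vs. ~97 tok/s on GPUs) and an assumed,
# purely illustrative 30-second per-turn time budget.
TURN_BUDGET_S = 30
cerebras_tokens = 1000 * TURN_BUDGET_S   # 30,000 tokens per turn
gpu_tokens = 97 * TURN_BUDGET_S          # 2,910 tokens per turn
speedup = cerebras_tokens / gpu_tokens   # ~10.3x more tokens in the same budget
```

That extra ~27,000 tokens per turn is what multi-pass reasoning, verification, and search strategies have to spend.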

The focus is on the scaffolding architecture, not the underlying model. The time budget will be scaled to the models currently available on Cerebras’s platform.

What You’ll Build

Same A2A-compatible agent format as Track 1, but your agent runs on Cerebras inference infrastructure. During development, participants use the Cerebras free tier.

Allowed Approaches

  • All approaches from Track 1 are permitted
  • Additionally encouraged: Multi-pass reasoning, self-reflection loops, best-of-N sampling, search-based planning, verification chains, ensemble approaches
  • Agents must use Cerebras infrastructure
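As a sketch of one encouraged pattern, best-of-N sampling under a per-turn time budget could look like this (the generator and scorer are toy stand-ins; a real agent would call the model and a verifier):

```python
import random
import time
from typing import Callable, Optional

def best_of_n_within_budget(
    generate: Callable[[], str],
    score: Callable[[str], float],
    budget_s: float,
) -> Optional[str]:
    """Best-of-N sampling: keep generating candidates until the per-turn
    time budget is spent, then return the highest-scoring one."""
    deadline = time.monotonic() + budget_s
    best, best_score = None, float("-inf")
    while time.monotonic() < deadline:
        candidate = generate()
        s = score(candidate)
        if s > best_score:
            best, best_score = candidate, s
    return best

# Toy stand-ins: fast inference means many candidates fit in the budget.
best = best_of_n_within_budget(
    generate=lambda: str(random.randint(0, 9)),
    score=lambda c: int(c),          # prefer larger digits
    budget_s=0.05,
)
```

The faster the inference, the larger the effective N, which is exactly the compute headroom this track asks you to exploit.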

Evaluation

Teams are ranked primarily by innovative use of compute headroom: how meaningfully the scaffolding leverages fast inference in ways that differ from standard single-pass agent design. Pass^3 counts only as the relative improvement over the chosen model's baseline, i.e., the delta the scaffolding produces, not the absolute score. Evaluation is based on the required technical report alongside the measured improvement.

Prizes

  • Cerebras Innovation Award: For the best demonstration of compute-aware agent design
  • Certificates for award winners
  • (Optional) Oral presentation at IJCAI-ECAI 2026 competition slot
  • (Optional) Co-authorship on IJCAI 2027 demo paper

Get started with this track →

Ready to Start?

Set up the baseline agent, run it on the dev set, and submit your first score in under 20 minutes.