At a Glance

Which Track Should You Enter?

The comparison below answers the most common question participants have.

| Dimension | Track 1: Open Track | Track 2: Cerebras Fast-Reasoning |
| --- | --- | --- |
| Goal | Maximize Pass^3 on the hidden test set using any approach | Use direct Cerebras-hosted gpt-oss inference to convert fast inference into higher Pass^3 under Track 2 inference-compute constraints |
| Allowed Models | Any LLM, any provider (proprietary or open-source) | Direct Cerebras-hosted models through the starter kit, with optional self-hosted open-source development runs |
| Ranking / Awards | Rank Award by Pass^3, plus Best Innovation Award for agent harnessing and reliability design | Combination of innovative compute use + Pass^3 performance |
| Suitable For | Teams exploring architecture, prompting, fine-tuning, multi-agent approaches | Teams interested in leveraging fast inference for reasoning-heavy strategies |
| Compute | Bring your own API keys for development. | Cerebras will provide increased rate limits compared with a free personal account; access details will follow soon |
| Starter Package | src/track_1_agent_under_test/ | src/track_2_agent_under_test_cerebras*/ |

Submission Shape (both tracks): A public digest-pinned GHCR agent image, scenario.toml using the official evaluator and hidden-set config, required env var names, and a 4-page technical report using the IJCAI author kit. Organizers run hidden tasks through the same A2A boundary.

Participant Workflow

Same Interface, Different Harnesses

Both tracks submit the same kind of agent. The difference is what your internal reasoning harness is allowed and expected to use.

| Step | What You Do |
| --- | --- |
| 1. Clone starter kit | Use the competition starter repository, run the setup script, and choose a Track 1 or Track 2 scenario folder. |
| 2. Implement the agent | Your agent receives A2A messages with user text, tool definitions, and tool results. It returns user-facing text and/or tool-call data. |
| 3. Validate locally | Run local_smoke.toml first, then public validation scenarios. Scenario TOML files are the run configs. |
| 4. Validate Docker/GHCR | Generate Docker Compose from a Docker scenario, then test the image/config style organizers will run. |
| 5. Submit | Submit the public digest-pinned GHCR image, scenario.toml using the official evaluator and hidden-set config, required env var names, and technical report. Never submit secret values. See the submission checklist. |
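
Step 5 asks for a digest-pinned image, meaning the reference names the image by content hash rather than by a mutable tag. The following is a minimal sketch of a pre-submission check, assuming references of the form ghcr.io/<owner>/<image>@sha256:<64 hex digits>. The helper name `is_digest_pinned` and the example references are hypothetical and not part of the starter kit.

```python
import re

# Hypothetical helper, not part of the starter kit.
# Accepts ghcr.io/<path>[:<tag>]@sha256:<64 lowercase hex digits>.
_PINNED = re.compile(
    r"ghcr\.io/[a-z0-9._/-]+(?::[A-Za-z0-9_.-]+)?@sha256:[0-9a-f]{64}"
)

def is_digest_pinned(image_ref: str) -> bool:
    """Return True if the reference is a GHCR image pinned by sha256 digest."""
    return _PINNED.fullmatch(image_ref) is not None

# Example references are made up for illustration.
print(is_digest_pinned("ghcr.io/example-team/agent@sha256:" + "a" * 64))  # True
print(is_digest_pinned("ghcr.io/example-team/agent:latest"))              # False
```

A tag such as `:latest` can be moved to a different image after submission, which is why a tag-only reference fails the check.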

Track 1: Open Track

Motivation

The Open Track is designed for maximum flexibility: use any model, provider, framework, and architecture to build the most reliable agent possible. You can use Claude Agents SDK, LangGraph, AutoGen, custom orchestration, fine-tuning pipelines, retrieval systems, or any other harness that respects the benchmark boundary. The goal is to close the gap between frontier model capability and deployment-ready consistency on CAR-bench.

The current state of the art achieves only 58% consistent success (Pass^3), which leaves significant room for improvement through better scaffolding, prompting, planning, and agent design.

What You’ll Build

A dockerized agent under test that communicates via the A2A (Agent-to-Agent) protocol with the CAR-bench evaluator. Your agent receives user requests, tool definitions, and tool results; it returns user-facing responses and/or tool calls. The evaluator executes tools and scores correctness, policy compliance, limit-awareness, and disambiguation behavior.
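
To make the input/output shape concrete, here is a rough sketch of one agent turn. It does not reproduce the actual A2A message schema or the starter kit's code: the `Turn` and `Reply` types, their field names, the `respond` function, and the placeholder policy are all assumptions made for illustration. In a real agent the body of `respond` would call your LLM harness.

```python
from dataclasses import dataclass, field
from typing import Any, Optional

@dataclass
class Turn:
    """What the agent receives (hypothetical field names)."""
    user_text: Optional[str] = None
    tool_definitions: list[dict[str, Any]] = field(default_factory=list)
    tool_results: list[dict[str, Any]] = field(default_factory=list)

@dataclass
class Reply:
    """What the agent returns: user-facing text and/or tool calls."""
    text: Optional[str] = None
    tool_calls: list[dict[str, Any]] = field(default_factory=list)

def respond(turn: Turn) -> Reply:
    # Placeholder policy standing in for the LLM harness.
    if turn.tool_results:
        # A tool has run: report its result back to the user.
        return Reply(text=f"Done: {turn.tool_results[-1]['content']}")
    if turn.tool_definitions:
        # No result yet: request the first available tool.
        name = turn.tool_definitions[0]["name"]
        return Reply(tool_calls=[{"name": name, "arguments": {}}])
    # No tool can satisfy the request: say so rather than inventing one.
    return Reply(text="I cannot do that with the tools available.")

# The tool name below is invented for the example.
reply = respond(Turn(user_text="Open the sunroof",
                     tool_definitions=[{"name": "open_sunroof"}]))
print(reply.tool_calls)  # [{'name': 'open_sunroof', 'arguments': {}}]
```

The agent only proposes tool calls. Execution and scoring stay on the evaluator's side of the boundary.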

Allowed Approaches

  • Prompting: System prompts, few-shot, chain-of-thought, self-reflection
  • Frameworks: Claude Agents SDK, LangGraph, AutoGen, custom A2A services, or any equivalent setup
  • Scaffolding: Planning-execution separation, tool-use strategies, retry logic, verifiers, critics, rerankers
  • Multi-agent: Subagents, agent skills, orchestration patterns
  • Fine-tuning: SFT, RLHF, DPO on the training data
  • Context engineering: RAG over environment data, dynamic context management
  • Any other idea, or any combination of the above, provided it does not exploit the benchmark evaluation process

Evaluation

Teams are ranked by Pass^3 on a hidden test set. A task scores 1 only if it is solved correctly in all 3 independent trials, and 0 otherwise. A per-category breakdown (Base / Hallucination / Disambiguation) is shown alongside the aggregate score.
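
The scoring rule can be sketched in a few lines. This is an illustration of the definition given here, not the official evaluator. The function name `pass_hat_k` and the task IDs are made up.

```python
def pass_hat_k(results: dict[str, list[bool]], k: int = 3) -> float:
    """Fraction of tasks solved correctly in all k independent trials."""
    if not results:
        raise ValueError("no tasks to score")
    for task_id, trials in results.items():
        if len(trials) != k:
            raise ValueError(f"{task_id}: expected {k} trials, got {len(trials)}")
    # A task counts only if every one of its trials succeeded.
    solved = sum(all(trials) for trials in results.values())
    return solved / len(results)

# Made-up trial outcomes for four tasks.
trials = {
    "task_a": [True, True, True],
    "task_b": [True, False, True],   # 2 of 3 successes still scores 0
    "task_c": [True, True, True],
    "task_d": [False, False, False],
}
print(pass_hat_k(trials))  # 0.5
```

In this example 8 of the 12 individual trials succeed, yet Pass^3 is 0.5 because only two tasks are solved every time. The metric rewards consistency rather than average success.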

Prizes

  • Rank Award: For top Pass^3 performance on the hidden test set.
  • Best Innovation Award: For novel agent harnessing, reliability design, efficient model use, and thoughtful engineering. Evaluated based on the required technical report.
  • $5,000 pool of Anthropic API Credits: Split across Rank Award and Best Innovation Award winners (details to be announced), sponsored by Anthropic.
  • More prizes to be announced
  • Certificates for award winners
  • (Optional) Oral presentation at IJCAI-ECAI 2026 competition slot
  • (Optional) Co-authorship on IJCAI 2027 demo paper

Track 2: Cerebras Fast-Reasoning

Motivation

Track 2 focuses on direct Cerebras gpt-oss inference for fast-reasoning agent harnesses. Fast inference makes it practical to explore more reasoning, verification, retry, and search work while keeping inference compute bounded.

This track explores how to turn that inference speed into higher Pass^3 scores. For each baseline LLM step, where the baseline directly decides on a tool call or final answer, participants may use up to 5 sequential LLM calls. Parallel calls within each step are allowed and do not count toward this sequential limit.
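
One way to keep a harness inside this limit is to count sequential stages explicitly. The sketch below assumes that a batch of calls issued in parallel counts as a single sequential call, which is one reading of the rule and should be confirmed against the official rules. `StepBudget`, `run_stage`, and `fake_llm` are hypothetical names, and `fake_llm` stands in for a real model call.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Sequence

MAX_SEQUENTIAL_CALLS = 5  # limit stated in the Track 2 rules

class StepBudget:
    """Counts sequential LLM stages used within one baseline step."""

    def __init__(self, limit: int = MAX_SEQUENTIAL_CALLS) -> None:
        self.limit = limit
        self.used = 0

    def run_stage(self, llm: Callable[[str], str],
                  prompts: Sequence[str]) -> list[str]:
        """Run prompts in parallel; the whole batch costs one sequential call."""
        if self.used >= self.limit:
            raise RuntimeError("sequential LLM call limit reached for this step")
        self.used += 1
        with ThreadPoolExecutor(max_workers=max(1, len(prompts))) as pool:
            return list(pool.map(llm, prompts))  # map preserves prompt order

def fake_llm(prompt: str) -> str:
    # Stand-in for a real model call.
    return prompt.upper()

budget = StepBudget()
drafts = budget.run_stage(fake_llm, ["plan a", "plan b", "plan c"])  # stage 1
final = budget.run_stage(fake_llm, [" | ".join(drafts)])             # stage 2
print(budget.used, final)  # 2 ['PLAN A | PLAN B | PLAN C']
```

Create a fresh `StepBudget` for each baseline step so the count does not carry over between steps.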

Average token usage is limited to 500k tokens per task, counting input, reasoning, and output tokens. For reference, the baseline agent uses approximately 54k tokens per task on average. Token usage must be tracked through the A2A turn_metrics token fields and may be cross-checked by the organizers.
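
Because the limit applies to the average, a simple local check can sum tokens per task and then average across tasks. The sketch below is an assumption-laden illustration. The field names `input_tokens`, `reasoning_tokens`, and `output_tokens` are placeholders rather than the actual turn_metrics field names, and it assumes reasoning tokens are reported separately from output tokens.

```python
TOKEN_LIMIT_PER_TASK = 500_000  # average limit stated in the rules

def task_total(turns: list[dict[str, int]]) -> int:
    """Sum input, reasoning, and output tokens over all turns of one task."""
    return sum(
        t["input_tokens"] + t["reasoning_tokens"] + t["output_tokens"]
        for t in turns
    )

def average_tokens_per_task(usage: dict[str, list[dict[str, int]]]) -> float:
    """Mean of per-task token totals."""
    totals = [task_total(turns) for turns in usage.values()]
    return sum(totals) / len(totals)

# Made-up usage records; field names are hypothetical.
usage = {
    "task_a": [
        {"input_tokens": 40_000, "reasoning_tokens": 8_000, "output_tokens": 2_000},
        {"input_tokens": 45_000, "reasoning_tokens": 4_000, "output_tokens": 1_000},
    ],
    "task_b": [
        {"input_tokens": 600_000, "reasoning_tokens": 50_000, "output_tokens": 10_000},
    ],
}
avg = average_tokens_per_task(usage)
print(avg, avg <= TOKEN_LIMIT_PER_TASK)  # 380000.0 True
```

In this example task_b alone uses 660k tokens, but the average of 380k stays under the limit. Relative to the baseline's roughly 54k tokens per task, the 500k limit leaves about 9x headroom.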

Registration is closed for Track 2. The track remains limited to 15 participating teams.

What You’ll Build

The same dockerized A2A-compatible agent format as Track 1. The reference starter kit includes Cerebras SDK/direct API gpt-oss agents, including a direct executor and a planner/executor template. Participants may replace the internal planner, verifier, retry, or ensemble logic while preserving the same A2A boundary.

During normal development, participants may see strict limits on free personal Cerebras accounts. Cerebras will provide increased rate limits for the competition compared with a free personal account; access details will follow soon. Participants may also self-host the open-source models used by the Cerebras gpt-oss executor during development, then validate the speed gain with Cerebras access.

Allowed Approaches

  • All approaches from Track 1 are permitted
  • Additionally encouraged: Multi-pass reasoning, self-reflection loops, best-of-N sampling, search-based planning, verification chains, ensemble approaches
  • Submitted Track 2 agents should use the direct Cerebras gpt-oss inference setup described in the starter kit
  • Technical reports should include an architecture diagram so the sequential-call constraint can be audited
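
As an illustration of the sampling and ensemble approaches listed above, the following sketch draws N candidates in parallel and keeps the most frequent one. This is plain majority voting, a simple variant of best-of-N, and is not taken from the starter kit. `majority_vote` and `fake_sample` are hypothetical names, and `fake_sample` stands in for a sampled model call.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor
from typing import Callable

def majority_vote(sample: Callable[[int], str], n: int = 5) -> tuple[str, float]:
    """Draw n candidates in parallel; return the most common one and its vote share."""
    with ThreadPoolExecutor(max_workers=n) as pool:
        candidates = list(pool.map(sample, range(n)))
    winner, votes = Counter(candidates).most_common(1)[0]
    return winner, votes / n

def fake_sample(seed: int) -> str:
    # Stand-in for a sampled model call; the tool-call strings are invented.
    return "set_temperature(21)" if seed % 3 else "set_temperature(22)"

print(majority_vote(fake_sample, n=5))  # ('set_temperature(21)', 0.6)
```

The vote share can also serve as a confidence signal. For example, a harness might spend another sequential call on verification when agreement is low.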

Evaluation

Teams are ranked primarily by innovative use of inference-time compute headroom: how meaningfully the scaffolding leverages fast inference in ways that differ from standard single-pass agent design. Pass^3 is taken into account only as the relative improvement over the chosen model's baseline, i.e. the delta the scaffolding produces, not the absolute score. Evaluation is based on the required technical report alongside the measured improvement.

Prizes

  • Cerebras Innovation Award: For the best demonstration of compute-aware agent design
  • Codex Pro: 12-month subscriptions for the top 2 winning teams (one per team), sponsored by OpenAI
  • More prizes to be announced
  • Certificates for award winners
  • (Optional) Oral presentation at IJCAI-ECAI 2026 competition slot
  • (Optional) Co-authorship on IJCAI 2027 demo paper

Ready to Start?

Set up a starter agent, run a local smoke scenario, then validate the Docker image you will submit.