Two tracks, two ways to push the frontier of LLM agent reliability. Track 1 remains open for new registrations; Track 2 registration is closed.
At a Glance
The table below compares the two tracks side by side, which answers the question participants ask most often.
| Dimension | Track 1: Open Track | Track 2: Cerebras Fast-Reasoning |
|---|---|---|
| Goal | Maximize Pass^3 on the hidden test set using any approach | Use direct Cerebras-hosted gpt-oss inference to convert fast inference into higher Pass^3 under Track 2 inference-compute constraints |
| Allowed Models | Any LLM, any provider (proprietary or open-source) | Direct Cerebras-hosted models through the starter kit, with optional self-hosted open-source development runs |
| Ranking / Awards | Rank Award by Pass^3, plus Best Innovation Award for agent harnessing and reliability design | Combination of innovative compute use + Pass^3 performance |
| Suitable For | Teams exploring architecture, prompting, fine-tuning, multi-agent approaches | Teams interested in leveraging fast inference for reasoning-heavy strategies |
| Compute | Bring your own API keys for development. | Cerebras will provide increased rate limits compared with a free personal account; access details will follow soon |
| Starter Package | src/track_1_agent_under_test/ | src/track_2_agent_under_test_cerebras*/ |
| Submission Shape | A public digest-pinned GHCR agent image, scenario.toml using the official evaluator and hidden-set config, required env var names, and a 4-page technical report using the IJCAI author kit. Organizers run hidden tasks through the same A2A boundary. | Same as Track 1. |
Participant Workflow
Both tracks submit the same kind of agent. The difference is what your internal reasoning harness is allowed and expected to use.
| Step | What You Do |
|---|---|
| 1. Clone starter kit | Use the competition starter repository, run the setup script, and choose a Track 1 or Track 2 scenario folder. |
| 2. Implement the agent | Your agent receives A2A messages with user text, tool definitions, and tool results. It returns user-facing text and/or tool-call data. |
| 3. Validate locally | Run local_smoke.toml first, then public validation scenarios. Scenario TOML files are the run configs. |
| 4. Validate Docker/GHCR | Generate Docker Compose from a Docker scenario, then test the image/config style organizers will run. |
| 5. Submit | Submit the public digest-pinned GHCR image, scenario.toml using the official evaluator and hidden-set config, required env var names, and technical report. Never submit secret values. See the submission checklist. |
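
Before submitting, it can help to confirm that the image reference really is pinned by digest rather than by a mutable tag. The snippet below is a minimal sketch of such a check. The helper name `is_digest_pinned` and the example references are hypothetical, and the pattern only encodes the general `name@sha256:<64 hex characters>` shape of a digest reference on ghcr.io. It is not the organizers' validation logic.

```python
import re

# Assumed shape: ghcr.io/<owner>/<image>[:tag]@sha256:<64 lowercase hex chars>.
# This is a sanity check only, not the official submission validator.
DIGEST_PINNED_REF = re.compile(
    r"ghcr\.io/[a-z0-9._/-]+"                   # registry host and repository path
    r"(?::[A-Za-z0-9_][A-Za-z0-9._-]{0,127})?"  # optional tag (ignored when a digest is present)
    r"@sha256:[0-9a-f]{64}"                     # the immutable content digest
)


def is_digest_pinned(image_ref: str) -> bool:
    """Return True if image_ref is a ghcr.io reference pinned by sha256 digest."""
    return DIGEST_PINNED_REF.fullmatch(image_ref) is not None


if __name__ == "__main__":
    # Hypothetical references used purely for illustration.
    pinned = "ghcr.io/example-team/car-agent@sha256:" + "a" * 64
    tagged = "ghcr.io/example-team/car-agent:latest"
    print(is_digest_pinned(pinned))  # True
    print(is_digest_pinned(tagged))  # False
```

A tag such as `latest` can be repointed after submission, while a digest identifies one specific image, which is presumably why the submission asks for a digest-pinned reference.
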
The Open Track is designed for maximum flexibility: use any model, provider, framework, and architecture to build the most reliable agent possible. You can use Claude Agent SDK, LangGraph, AutoGen, custom orchestration, fine-tuning pipelines, retrieval systems, or any other harness that respects the benchmark boundary. The goal is to close the gap between frontier model capability and deployment-ready consistency on CAR-bench.
The current state of the art achieves only 58% consistent success (Pass^3), which leaves significant room for improvement through better scaffolding, prompting, planning, and agent design.
Participants build a dockerized agent under test that communicates via the A2A (Agent-to-Agent) protocol with the CAR-bench evaluator. Your agent receives user requests, tool definitions, and tool results; it returns user-facing responses and/or tool calls. The evaluator executes tools and scores correctness, policy compliance, limit-awareness, and disambiguation behavior.
Teams are ranked by Pass^3 on a hidden test set. A task scores 1 only if solved correctly in all 3 independent trials. Per-category breakdown (Base / Hallucination / Disambiguation) is shown alongside the aggregate score.
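The scoring rule above can be sketched in a few lines. This is an illustration only, not the official evaluator: the function name `pass_hat_k`, the task IDs, and the choice to aggregate as the fraction of tasks solved in every trial are assumptions made for the example.

```python
from typing import Dict, List


def pass_hat_k(results: Dict[str, List[bool]], k: int = 3) -> float:
    """Fraction of tasks solved in all k independent trials.

    `results` maps a task ID to its per-trial outcomes (True = solved).
    Aggregating as a plain mean over tasks is an assumption of this sketch.
    """
    if not results:
        raise ValueError("results must contain at least one task")
    solved_every_time = 0
    for task_id, trials in results.items():
        if len(trials) != k:
            raise ValueError(f"task {task_id!r} has {len(trials)} trials, expected {k}")
        # A task scores 1 only if every trial succeeded, otherwise 0.
        if all(trials):
            solved_every_time += 1
    return solved_every_time / len(results)


if __name__ == "__main__":
    # Hypothetical outcomes for four tasks, three trials each.
    outcomes = {
        "task_a": [True, True, True],
        "task_b": [True, False, True],   # one failure, so the task scores 0
        "task_c": [True, True, True],
        "task_d": [False, False, False],
    }
    print(pass_hat_k(outcomes))  # 0.5
```

In this example, `task_b` succeeds in two of three trials yet contributes nothing, which is why Pass^3 rewards consistency rather than average success.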
Track 2 focuses on direct Cerebras gpt-oss inference for fast-reasoning agent harnesses. Fast inference makes it practical to explore more reasoning, verification, retry, and search work while keeping inference compute bounded.
This track explores how to turn that inference speed into higher Pass^3 scores. For each baseline LLM step, where the baseline directly decides on a tool call or final answer, participants may use up to 5 sequential LLM calls. Parallel calls within each step are allowed and do not count toward this sequential limit.
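One way to keep a harness inside the sequential-call limit is to count rounds explicitly. The sketch below assumes that a batch of parallel calls counts as a single sequential call, which is one reading of the rule above and should be confirmed against the official rules. `StepBudget`, `SequentialBudgetExceeded`, and the stand-in `llm` callable are hypothetical names, not part of the starter kit.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, List

MAX_SEQUENTIAL_CALLS = 5  # per baseline LLM step, as stated in the Track 2 rules


class SequentialBudgetExceeded(RuntimeError):
    """Raised when a step tries to start more rounds than the limit allows."""


class StepBudget:
    """Counts sequential LLM rounds within one baseline step (hypothetical helper)."""

    def __init__(self, limit: int = MAX_SEQUENTIAL_CALLS) -> None:
        self.limit = limit
        self.rounds_used = 0

    def run_round(self, llm: Callable[[str], str], prompts: List[str]) -> List[str]:
        """Run one sequential round; all prompts in the round are sent in parallel."""
        if not prompts:
            return []  # nothing to send, so no round is consumed
        if self.rounds_used >= self.limit:
            raise SequentialBudgetExceeded(
                f"step already used {self.rounds_used} of {self.limit} sequential rounds"
            )
        self.rounds_used += 1
        with ThreadPoolExecutor(max_workers=len(prompts)) as pool:
            # pool.map preserves the order of the prompts in its results.
            return list(pool.map(llm, prompts))


if __name__ == "__main__":
    def fake_llm(prompt: str) -> str:
        # Stand-in for a real model call; it only echoes the prompt.
        return f"answer to: {prompt}"

    budget = StepBudget()
    drafts = budget.run_round(fake_llm, ["plan A", "plan B", "plan C"])  # round 1, three parallel calls
    verdict = budget.run_round(fake_llm, ["pick the best of: " + "; ".join(drafts)])  # round 2
    print(budget.rounds_used)  # 2
```

Creating a fresh `StepBudget` for each baseline step keeps the count scoped to that step, so a propose-then-verify pattern like the one above uses two of the five available rounds.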
Token usage is limited to 500k tokens on average per task, including input, reasoning, and output tokens. As a reference, the baseline agent uses approximately 54k tokens on average per task. Token usage must be tracked through the A2A turn_metrics token fields and may be counterchecked.
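Because the limit applies to the average across tasks, a single expensive task does not by itself break the budget. The sketch below illustrates that arithmetic. The `TaskUsage` field names are hypothetical placeholders and are not the actual A2A turn_metrics field names, and treating the limit as inclusive is an assumption.

```python
from dataclasses import dataclass
from typing import Sequence

TOKEN_LIMIT_PER_TASK = 500_000     # stated Track 2 limit, averaged over tasks
BASELINE_TOKENS_PER_TASK = 54_000  # approximate baseline usage quoted in the rules


@dataclass(frozen=True)
class TaskUsage:
    """Per-task token totals; field names are placeholders for this sketch."""
    input_tokens: int
    reasoning_tokens: int
    output_tokens: int

    @property
    def total(self) -> int:
        # The limit counts input, reasoning, and output tokens together.
        return self.input_tokens + self.reasoning_tokens + self.output_tokens


def average_tokens_per_task(usages: Sequence[TaskUsage]) -> float:
    """Mean total tokens per task across all recorded tasks."""
    if not usages:
        raise ValueError("need at least one task to compute an average")
    return sum(u.total for u in usages) / len(usages)


def within_budget(usages: Sequence[TaskUsage], limit: int = TOKEN_LIMIT_PER_TASK) -> bool:
    """True if the per-task average does not exceed the limit (inclusive by assumption)."""
    return average_tokens_per_task(usages) <= limit


if __name__ == "__main__":
    # Hypothetical usage: one cheap task (200k) and one expensive task (700k).
    runs = [
        TaskUsage(input_tokens=120_000, reasoning_tokens=60_000, output_tokens=20_000),
        TaskUsage(input_tokens=400_000, reasoning_tokens=250_000, output_tokens=50_000),
    ]
    print(average_tokens_per_task(runs))  # 450000.0
    print(within_budget(runs))            # True
```

For scale, 500k tokens is roughly nine times the approximately 54k tokens the baseline agent uses per task, which is the headroom a Track 2 harness can spend on extra reasoning, verification, or retries.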
Registration is closed for Track 2. The track remains limited to 15 participating teams.
Track 2 uses the same dockerized A2A-compatible agent format as Track 1. The reference starter kit includes Cerebras SDK/direct API gpt-oss agents, including a direct executor and a planner/executor template. Participants may replace the internal planner, verifier, retry, or ensemble logic while preserving the same A2A boundary.
During normal development, participants may see strict limits on free personal Cerebras accounts. Cerebras will provide increased rate limits for the competition compared with a free personal account; access details will follow soon. Participants may also self-host the open-source models used by the Cerebras gpt-oss executor during development, then validate the speed gain with Cerebras access.
gpt-oss inference setup described in the starter kit
Track 2 is ranked primarily by innovative use of inference-time compute headroom, that is, how meaningfully the scaffolding leverages fast inference in ways that differ from standard single-pass agent design. Pass^3 is taken into account only as relative improvement over the chosen model as baseline, i.e. the delta the scaffolding produces, not the absolute score. Evaluation is based on the required technical report alongside the measured improvement.
Set up a starter agent, run a local smoke scenario, then validate the Docker image you will submit.