Two tracks, two ways to push the frontier of LLM agent reliability. Enter one or both.
At a Glance
The comparison below answers the most common questions participants ask.
| Dimension | Track 1: Open Track | Track 2: Cerebras Fast-Reasoning |
|---|---|---|
| Goal | Maximize Pass^3 on the hidden test set using any approach | Convert ~10x faster Cerebras inference into higher Pass^3 within the same time budget per turn |
| Allowed Models | Any LLM, any provider (proprietary or open-source) | Models running on Cerebras inference infrastructure |
| Evaluation Metric | Pass^3 (task solved in all 3 trials) | Innovative use of compute headroom + Pass^3 improvement over the model's single-pass baseline |
| Prize Structure | Rank Award + Best Innovation Award | Cerebras Innovation Award |
| Suitable For | Teams exploring architecture, prompting, fine-tuning, multi-agent approaches | Teams interested in leveraging fast inference for reasoning-heavy strategies |
| Compute | Bring your own API keys for development | Cerebras free tier for development |
The Open Track is designed for maximum flexibility: use any model, any provider, and any architecture to build the most reliable agent possible. The goal is to close the gap between frontier model capability and deployment-ready consistency on CAR-bench.
Current state-of-the-art achieves only 58% consistent success (Pass^3). There is significant room for improvement through better scaffolding, prompting, planning, and agent design.
You submit a dockerized Agent-under-Test that communicates via the A2A (Agent-to-Agent) protocol with the CAR-bench evaluation environment. Your agent receives user requests, interacts with tools, and produces responses that are evaluated for correctness, policy compliance, limit-awareness, and disambiguation behavior.
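To make the format concrete, here is a minimal sketch of such an agent's entry point, assuming the harness POSTs JSON-RPC 2.0 `message/send` requests in the A2A style. The message shapes and the `handle_user_message` placeholder are illustrative only; take the actual schema from the A2A spec and the provided baseline agent.

```python
# Minimal sketch of an A2A-style Agent-under-Test entry point.
# Request/response shapes are illustrative, not the official SDK.
import json
from http.server import BaseHTTPRequestHandler, HTTPServer

def handle_user_message(text: str) -> str:
    # Placeholder for the real agent loop: call your LLM, issue tool
    # calls against the CAR-bench environment, check policies and
    # limits, and ask a disambiguation question when needed.
    return f"(agent reply to: {text})"

class AgentHandler(BaseHTTPRequestHandler):
    def do_POST(self):
        body = json.loads(self.rfile.read(int(self.headers["Content-Length"])))
        # Pull the user's text out of the message parts (illustrative field names).
        parts = body.get("params", {}).get("message", {}).get("parts", [])
        text = " ".join(p.get("text", "") for p in parts)
        reply = {
            "jsonrpc": "2.0",
            "id": body.get("id"),
            "result": {"message": {
                "role": "agent",
                "parts": [{"kind": "text", "text": handle_user_message(text)}],
            }},
        }
        payload = json.dumps(reply).encode()
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(payload)))
        self.end_headers()
        self.wfile.write(payload)

if __name__ == "__main__":
    HTTPServer(("0.0.0.0", 8080), AgentHandler).serve_forever()
```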
Teams are ranked by Pass^3 on a hidden test set. A task scores 1 only if solved correctly in all 3 independent trials. Per-category breakdown (Base / Hallucination / Disambiguation) is shown alongside the aggregate score.
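Concretely, the aggregate can be computed as below. The input format is a sketch, not the harness's actual output schema.

```python
# Sketch of the Pass^3 aggregate: a task counts only if all 3
# independent trials succeed. `results` maps task id -> per-trial
# booleans (an illustrative format, not the official one).
def pass_at_all_k(results: dict[str, list[bool]], k: int = 3) -> float:
    solved = sum(1 for trials in results.values()
                 if len(trials) >= k and all(trials[:k]))
    return solved / len(results)

results = {
    "task_a": [True, True, True],    # counts: solved in all 3 trials
    "task_b": [True, False, True],   # does not count: one failed trial
}
print(pass_at_all_k(results))  # 0.5
```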
Cerebras builds the world’s fastest AI inference infrastructure. Their output speed is ~10x faster than conventional GPU-based inference (e.g., GLM-4.7 on Cerebras: ~1,000 tokens/s vs. Claude Sonnet 4.5 on GPUs: ~97 tokens/s). This means that within the same time budget, a Cerebras-hosted model can generate roughly 10x more tokens per assistant turn.
This track explores how to turn that extra token capacity within the same time budget into higher Pass^3 scores. For example, where a conventional agent gets one reasoning pass per turn, a Cerebras-hosted agent could run multiple passes, verify its own output, or explore alternative tool-call sequences — all before the time budget runs out. The question is: which strategies best convert raw output speed into measurable reliability gains?
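As one illustration, a deadline-bounded best-of-N loop converts throughput into extra candidate-and-verify rounds per turn. The `generate` and `verify` callables stand in for your own model calls; they are not part of any provided API.

```python
import time

def best_of_n_within_budget(prompt: str, budget_s: float, generate, verify):
    """Keep sampling candidate responses until the per-turn time budget
    is spent, scoring each with a verifier pass; return the best one.
    A ~10x faster model fits roughly 10x more generate+verify rounds
    into the same wall-clock budget."""
    deadline = time.monotonic() + budget_s
    best, best_score = None, float("-inf")
    while time.monotonic() < deadline:
        candidate = generate(prompt)       # one reasoning pass / tool plan
        score = verify(prompt, candidate)  # self-check: policy, limits, tool args
        if score > best_score:
            best, best_score = candidate, score
    return best
```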
The focus is on the scaffolding architecture, not the underlying model. The time budget will be scaled to the models currently available on Cerebras’s platform.
Same A2A-compatible agent format as Track 1, but your agent runs on Cerebras inference infrastructure. During development, participants use the Cerebras free tier.
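Since Cerebras exposes an OpenAI-compatible endpoint, a minimal development client can look roughly like this. The model name is illustrative; check the current model list on the Cerebras platform.

```python
import os
from openai import OpenAI  # Cerebras serves an OpenAI-compatible API

client = OpenAI(
    base_url="https://api.cerebras.ai/v1",
    api_key=os.environ["CEREBRAS_API_KEY"],
)

resp = client.chat.completions.create(
    model="llama-3.3-70b",  # illustrative; pick a model currently on Cerebras
    messages=[{"role": "user", "content": "Plan the next tool call."}],
)
print(resp.choices[0].message.content)
```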
Ranked primarily by innovative use of compute headroom: how meaningfully the scaffolding leverages fast inference in ways that go beyond standard single-pass agent design. Pass^3 counts only as the relative improvement over the chosen model's single-pass baseline, i.e., the delta the scaffolding produces, not the absolute score. Evaluation is based on the required technical report alongside the measured improvement.
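In toy numbers (invented purely for illustration), the quantity that counts is the delta:

```python
# Track 2 rewards the improvement the scaffolding adds, not the
# absolute score. All numbers below are made up for illustration.
baseline_pass3 = 0.41   # chosen model, standard single-pass agent
scaffold_pass3 = 0.53   # same model inside your fast-inference scaffold
delta = scaffold_pass3 - baseline_pass3
print(f"scaffolding delta: {delta:+.2f}")  # +0.12
```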
Set up the baseline agent, run it on the dev set, and submit your first score in under 20 minutes.