CAR-bench Challenge @ IJCAI-ECAI 2026 - Track 1 registration is open!
IJCAI-ECAI 2026 · Aug 15–21, Bremen · Competition Track

CAR-bench Challenge:
Building Reliable LLM Agents
Under Real-World Uncertainty

Even the strongest frontier LLM baseline solves only 58% of tasks consistently (Pass^3). Build a more reliable agent and compete for $10k+ in prizes, recognition, and a spot on the CAR-bench leaderboard.
254 public tasks · 58 tools · 19 policies · 3 evaluation dimensions

Track update: Track 1 registration remains open; Track 2 registration is now closed. Track 2 agents run on Cerebras-hosted gpt-oss via API; Codex is not the runtime for submitted agents. Cerebras will provide higher rate limits than a free personal account offers. Access details will follow soon.

Timeline (all deadlines AoE, Anywhere on Earth)

May 25        Competition opens: data, docs, and baseline agent released. Get started now →
May–Jul 2026  Open development with live dev-set leaderboard.
Jul 10        1st evaluation on hidden test set. Scores returned to participants.
Jul 19        Final evaluation on hidden test set (2nd official submission). Rankings determined.
Jul 26        Technical report submission deadline (4 pages, IJCAI format).
Jul 31        Winners announced.
Aug 15–21     Presentations at IJCAI-ECAI 2026, Bremen, Germany.

Submission checklist. Final entries include:

  • A public, digest-pinned GHCR agent image
  • A scenario.toml using the official evaluator and hidden-set config
  • A 4-page technical report using the IJCAI author kit and citing CAR-bench

Track 2 reports should also include an architecture diagram for the compute-use audit. The submission Google Form is not open yet and will be announced on Discord.
View Submission Requirements
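
As a pre-submission sanity check, the sketch below tests whether an image reference is pinned by digest rather than by a mutable tag. It is not part of the official tooling: the regular expression is a simplified approximation of the container reference format, and the image name `ghcr.io/example-team/car-agent` is a made-up placeholder.

```python
import re

# Simplified pattern (an assumption, not the official validator):
#   ghcr.io/<owner>/<image>[:<tag>]@sha256:<64 lowercase hex characters>
DIGEST_PINNED_GHCR = re.compile(
    r"ghcr\.io/"
    r"[a-z0-9]+(?:[._-][a-z0-9]+)*"        # owner
    r"(?:/[a-z0-9]+(?:[._-][a-z0-9]+)*)+"  # one or more image path components
    r"(?::\w[\w.-]{0,127})?"               # optional tag, ignored once a digest is present
    r"@sha256:[0-9a-f]{64}"                # the digest that pins the exact image
)


def is_digest_pinned_ghcr(ref: str) -> bool:
    """Return True if `ref` looks like a GHCR image pinned to a sha256 digest."""
    return DIGEST_PINNED_GHCR.fullmatch(ref) is not None


# Placeholder references for illustration only.
pinned = "ghcr.io/example-team/car-agent@sha256:" + "a" * 64
tagged = "ghcr.io/example-team/car-agent:latest"

print(is_digest_pinned_ghcr(pinned))  # True
print(is_digest_pinned_ghcr(tagged))  # False
```

A tag such as `latest` can be re-pointed after submission, whereas a digest identifies one immutable image, which is presumably why the checklist asks for digest pinning.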

About the Benchmark

What is CAR-bench?

The problem: LLM agents are rapidly moving from research prototypes to real-world deployments, yet existing benchmarks evaluate them under idealized conditions: complete information, available tools, and unambiguous instructions. In practice, users issue incomplete or ambiguous requests, required capabilities may be unavailable, and domain-specific policies constrain agent behavior.

The approach: CAR-bench evaluates LLM agents as automotive in-car voice assistants across 254 public tasks spanning three complementary dimensions: Base (multi-turn task completion, 100 tasks), Hallucination (limit-awareness under missing capabilities, 98 tasks), and Disambiguation (uncertainty resolution for ambiguous requests, 56 tasks). Agents interact with an LLM-simulated user, plan and chain calls across 58 interconnected tools governed by 19 domain-specific policies, and operate over large-scale world data (48 European cities, 130K+ POIs, 1.7M+ routes).

Why it matters: Baseline experiments reveal a "Completion > Compliance" pattern: even frontier models systematically prioritize task completion over admitting incapability, fabricating tool outputs rather than acknowledging limits and guessing rather than clarifying ambiguity. CAR-bench quantifies the gap between occasional capability and deployment-ready reliability with the Pass^3 consistency metric: a task scores 1 only if it is solved in all 3 trials.
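
To make the metric concrete, here is a minimal sketch of a Pass^3 computation under the definition above. The function name `pass_k`, the task IDs, and the trial outcomes are illustrative assumptions; the official evaluator may aggregate results differently.

```python
from typing import Mapping, Sequence


def pass_k(trials: Mapping[str, Sequence[bool]], k: int = 3) -> float:
    """Fraction of tasks that were solved in every one of their k trials."""
    if not trials:
        raise ValueError("no tasks given")
    solved = 0
    for task_id, outcomes in trials.items():
        if len(outcomes) != k:
            raise ValueError(
                f"task {task_id!r} has {len(outcomes)} trials, expected {k}"
            )
        # A task counts only if all k trials succeeded.
        solved += all(outcomes)
    return solved / len(trials)


# Made-up outcomes for three hypothetical task IDs.
results = {
    "base_001": [True, True, True],
    "hallucination_001": [True, False, True],  # one failure: task scores 0
    "disambiguation_001": [True, True, True],
}

score = pass_k(results)
print(f"Pass^3 = {score:.2f}")  # Pass^3 = 0.67
```

In this toy run the agent succeeds in 8 of 9 individual trials, yet only 2 of 3 tasks count as solved, which is the gap between occasional capability and consistency that the metric is meant to expose.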

Benchmark Details →   Read the Paper →   Hugging Face →   Starter Repo →

CAR-bench benchmark overview showing the three task types and evaluation framework
Overview of the CAR-bench components. (a) An LLM-simulated user generates multi-turn messages following task instructions (1); (b) the LLM agent, guided by domain policies, interacts with (c) tools to (2a) observe the environment or (2b) modify its state, until producing an informed response (3). The environment consists of (d) mutable states, (e) fixed context variables, and (f) static databases. The user instructions show the Base task type; the task types Hallucination and Disambiguation are explained in Benchmark Details.

Competition

Two Tracks

Track 1 remains open for new registrations. Track 2 is closed to new teams.

Track 1: Open Track

Use any LLM, provider, framework, or architecture, from the Claude Agent SDK and LangGraph to fully custom harnesses. Compete for the top Pass^3 score on the hidden test set and for a Best Innovation Award focused on agent harnessing, reliability design, and efficient model use.

Learn more →

Track 2: Cerebras Fast-Reasoning

Use direct Cerebras gpt-oss inference to build fast-reasoning agents under inference-compute constraints: a bounded sequential LLM-call depth and an average token usage of at most 500k tokens per task.
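
One way to keep an eye on these constraints during development is a small usage tracker, sketched below. Only the 500k average-token figure comes from the track description; the class names, the `max_depth` value, and the simplification that every LLM call adds one level of sequential depth are assumptions, and the official audit may count things differently.

```python
from dataclasses import dataclass, field

AVG_TOKEN_LIMIT = 500_000  # from the track description: average tokens per task


@dataclass
class TaskUsage:
    tokens: int = 0  # tokens summed over all LLM calls for this task
    depth: int = 0   # sequential LLM calls (assumes no parallel calls)


@dataclass
class BudgetTracker:
    max_depth: int  # hypothetical bound; the official value is not stated here
    tasks: dict[str, TaskUsage] = field(default_factory=dict)

    def record_call(self, task_id: str, tokens: int) -> None:
        """Register one LLM call and its token count for a task."""
        usage = self.tasks.setdefault(task_id, TaskUsage())
        usage.tokens += tokens
        usage.depth += 1

    def average_tokens(self) -> float:
        """Mean token usage per task across all recorded tasks."""
        if not self.tasks:
            return 0.0
        return sum(u.tokens for u in self.tasks.values()) / len(self.tasks)

    def within_budget(self) -> bool:
        """Check the per-task depth bound and the average token limit."""
        depth_ok = all(u.depth <= self.max_depth for u in self.tasks.values())
        return depth_ok and self.average_tokens() <= AVG_TOKEN_LIMIT


tracker = BudgetTracker(max_depth=8)    # 8 is an arbitrary illustrative bound
tracker.record_call("task_a", 300_000)  # made-up token counts
tracker.record_call("task_a", 400_000)
tracker.record_call("task_b", 100_000)

print(tracker.average_tokens())  # 400000.0
print(tracker.within_budget())   # True
```

Because the token limit is stated as an average, `task_a` can exceed 500k tokens on its own as long as cheaper tasks bring the mean back under the limit.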

Registration closed. Cerebras will provide increased rate limits compared with a free personal account; access details will follow soon.

Learn more →

Awards & Prizes


Open Track

  • Rank Award - By Pass^3 on hidden test set
  • Best Innovation Award - Novel harnessing, reliability design, efficient model use

Prizes

  • $5,000 pool of Anthropic API Credits split across Rank Award and Best Innovation Award winners (details to be announced), sponsored by Anthropic
  • Certificate
  • (Optional) Oral presentation at IJCAI-ECAI 2026 competition slot
  • (Optional) Co-authorship on IJCAI 2027 demo paper, reporting on the experience of the 2026 competition
  • More prizes to be announced

Cerebras Fast-Reasoning Track

  • Cerebras Innovation Award - Best demonstration of compute-aware agent design

Prizes

  • Codex Pro: 12-month subscriptions for the top 2 teams, sponsored by OpenAI
  • Certificate
  • (Optional) Oral presentation at IJCAI-ECAI 2026 competition slot
  • (Optional) Co-authorship on IJCAI 2027 demo paper, reporting on the experience of the 2026 competition
  • More prizes to be announced

Each final submission must include a 4-page technical report using the IJCAI author kit. Submission details →

Baselines

Can You Beat the Frontier?

Baseline results using our default agent scaffold with no optimization. Your target starts here.

Model            Provider    Avg Pass^3  Base Pass^3  Hall. Pass^3  Disamb. Pass^3
Claude Opus 4.6  Anthropic   .58         .80          .48           .46
GPT-5            OpenAI      .54         .66          .60           .36
Gemini 2.5 Pro   Google      .38         .53          .34           .28
Qwen3-32B        Alibaba     .31         .45          .27           .22
xLAM-2-32B       Salesforce  .16         .26          .11           .12
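
The Avg column in the baseline table is numerically consistent with an unweighted mean of the three per-type scores. That is an observation from the published numbers, not a documented definition. The short check below recomputes it from the table rows; the helper name `macro_average` is an assumption.

```python
# Rows copied from the baseline table:
# (Avg, Base, Hallucination, Disambiguation) Pass^3 per model.
baselines = {
    "Claude Opus 4.6": (0.58, 0.80, 0.48, 0.46),
    "GPT-5": (0.54, 0.66, 0.60, 0.36),
    "Gemini 2.5 Pro": (0.38, 0.53, 0.34, 0.28),
    "Qwen3-32B": (0.31, 0.45, 0.27, 0.22),
    "xLAM-2-32B": (0.16, 0.26, 0.11, 0.12),
}


def macro_average(scores: list[float]) -> float:
    """Unweighted mean: each task type counts equally, whatever its size."""
    return sum(scores) / len(scores)


checks = {}
for model, (reported_avg, *per_type) in baselines.items():
    recomputed = macro_average(per_type)
    # Reported values have two decimals, so allow for rounding error.
    checks[model] = abs(recomputed - reported_avg) < 0.005
    print(f"{model:<16} reported={reported_avg:.2f} recomputed={recomputed:.3f}")

print(all(checks.values()))  # True
```

If the average is indeed unweighted, a gain on the 56 Disambiguation tasks moves the headline score as much as the same gain on the 100 Base tasks.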

Full Leaderboard →

Prior Recognition

Validated by the Community

CAR-bench was accepted at ACL 2026 Main, selected as Hugging Face Paper of the Day, and won 1st place at UC Berkeley’s AgentX-AgentBeats Competition (Computer-Use Track, Google DeepMind-sponsored). This is the first academic competition dedicated to LLM agent reliability and limit-awareness.

Team

Organizers

A multidisciplinary team spanning academia and industry.

Johannes Kirmayr
BMW & Augsburg Univ.
Lukas Stappen
BMW Group Research
Elisabeth André
Augsburg University
Erik Cambria
NTU Singapore
Iryna Gurevych
TU Darmstadt
Patrick Dreisch
Anthropic
Natalia Vassilieva
Cerebras Systems
Raj Tumuluri
OpenStream.ai
Varin Sikka
Stanford University

Extended Bios →

Cite

Citation

If you use CAR-bench in your research, please cite our paper.

@misc{kirmayr2026carbenchevaluatingconsistencylimitawareness,
      title={CAR-bench: Evaluating the Consistency and
             Limit-Awareness of LLM Agents under
             Real-World Uncertainty},
      author={Johannes Kirmayr and Lukas Stappen
              and Elisabeth Andr{\'e}},
      year={2026},
      eprint={2601.22027},
      archivePrefix={arXiv},
      primaryClass={cs.AI},
      url={https://huggingface.co/papers/2601.22027},
}