CAR-bench Challenge @ IJCAI-ECAI 2026 - Registration is open!
IJCAI-ECAI 2026 · Aug 15–21, Bremen · Competition Track

CAR-bench Challenge: Building Reliable LLM Agents Under Real-World Uncertainty

Even the strongest frontier LLM baseline solves only 58% of tasks consistently. Build a more reliable agent and compete for $10k+ in prizes, recognition, and a spot on the CAR-bench leaderboard.
254 public tasks · 58 tools · 19 policies · 3 evaluation dimensions

Track 2 update: Cerebras Fast-Reasoning now uses Cerebras-hosted gpt-oss via API; Codex with gpt-5.3-codex-spark is no longer the submitted-agent runtime. Codex Pro development-support plans will be allocated by June 15.

Timeline (all deadlines Anywhere on Earth, AoE)

May 25: Competition opens. Data, docs, and baseline agent released.
May–Jul 2026: Open development with live dev-set leaderboard.
Jul 10: 1st evaluation on hidden test set. Scores returned to participants.
Jul 19: Final evaluation on hidden test set (2nd official submission). Rankings determined.
Jul 26: Technical report submission deadline (4 pages, IJCAI format).
Jul 31: Winners announced.
Aug 15–21: Presentations at IJCAI-ECAI 2026, Bremen, Germany.

About the Benchmark

What is CAR-bench?

The problem: LLM agents are rapidly moving from research prototypes to real-world deployments, yet existing benchmarks evaluate them under idealized conditions - complete information, available tools, and unambiguous instructions. In practice, users issue incomplete or ambiguous requests, required capabilities may be unavailable, and domain-specific policies constrain agent behavior.

The approach: CAR-bench evaluates LLM agents as automotive in-car voice assistants across 254 public tasks spanning three complementary dimensions: Base (multi-turn task completion, 100 tasks), Hallucination (limit-awareness under missing capabilities, 98 tasks), and Disambiguation (uncertainty resolution for ambiguous requests, 56 tasks). Agents interact with an LLM-simulated user, plan and chain calls across 58 interconnected tools governed by 19 domain-specific policies, and operate over large-scale world data (48 European cities, 130K+ POIs, 1.7M+ routes).

Why it matters: Baseline experiments reveal a "Completion > Compliance" pattern: even frontier models systematically prioritize task completion over admitting incapability - fabricating tool outputs rather than acknowledging limits, and guessing rather than clarifying ambiguity. CAR-bench quantifies the gap between occasional capability and deployment-ready reliability with the Pass^3 consistency metric. A task scores 1 only if solved in all 3 trials.
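
The Pass^3 rule above can be sketched in a few lines. This is a minimal illustration, not the official evaluation code: the function name `pass_k`, the `(task_id, solved)` input format, and the example task IDs are all assumptions made for this sketch.

```python
from collections import defaultdict


def pass_k(trial_results, k=3):
    """Fraction of tasks solved in all k trials.

    trial_results: iterable of (task_id, solved) pairs, exactly k per task.
    The input format is hypothetical, chosen only for this illustration.
    """
    per_task = defaultdict(list)
    for task_id, solved in trial_results:
        per_task[task_id].append(bool(solved))
    if not per_task:
        raise ValueError("no trial results given")

    task_scores = []
    for task_id, outcomes in per_task.items():
        if len(outcomes) != k:
            raise ValueError(f"task {task_id!r} has {len(outcomes)} trials, expected {k}")
        # A task scores 1 only if every one of its k trials succeeded.
        task_scores.append(1 if all(outcomes) else 0)
    return sum(task_scores) / len(task_scores)


# Hypothetical results: four tasks, three trials each.
example_trials = [
    ("t1", True), ("t1", True), ("t1", True),     # solved consistently -> 1
    ("t2", True), ("t2", False), ("t2", True),    # one failure         -> 0
    ("t3", True), ("t3", True), ("t3", True),     # solved consistently -> 1
    ("t4", False), ("t4", False), ("t4", False),  # never solved        -> 0
]
example_pass3 = pass_k(example_trials)  # 2 of 4 tasks pass: 0.5
```

In this made-up example, 8 of the 12 individual trials succeed (about 0.67), yet Pass^3 is only 0.5. That difference is the gap between occasional capability and consistent reliability that the metric is meant to expose.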

Benchmark Details →   Read the Paper →   Hugging Face →   Starter Repo →

Figure: Overview of the CAR-bench components. (a) An LLM-simulated user generates multi-turn messages following task instructions (1); (b) the LLM agent, guided by domain policies, interacts with (c) tools to (2a) observe the environment or (2b) modify its state, until producing an informed response (3). The environment consists of (d) mutable states, (e) fixed context variables, and (f) static databases. The user instructions show the Base task type; the task types Hallucination and Disambiguation are explained in Benchmark Details.

Competition

Two Tracks

Choose the track that fits your research - or enter both.

Track 1: Open Track

Use any LLM, provider, framework, or architecture, from the Claude Agent SDK and LangGraph to fully custom harnesses. Compete for the top Pass^3 on the hidden test set and for a Best Innovation Award focused on agent harnessing, reliability design, and efficient model use.

Track 2: Cerebras Fast-Reasoning

Use direct Cerebras gpt-oss inference to build fast-reasoning agents under a shared time budget. The challenge is to turn inference speed into higher reliability through planning, verification, retries, search, or ensemble-style harnesses.

Limited to 15 participating teams. Public-tier development has strict rate limits; organizers provide elevated-rate test windows. Codex Pro plans support harness engineering and are allocated by June 15.
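
One way to picture an ensemble-style harness under a time budget is a vote that stops as soon as enough samples agree, and abstains if the budget runs out first. The sketch below is only an illustration of that idea, not the competition's reference harness: `vote_within_budget`, `sample_fn`, and the example tool-call strings are hypothetical names, and a real agent would call a model where the stub returns canned strings.

```python
import time
from collections import Counter


def vote_within_budget(sample_fn, budget_s, min_agreement=2, max_samples=5,
                       clock=time.monotonic):
    """Draw candidates until one reaches min_agreement votes.

    Returns the agreed candidate, or None if the time budget or the
    sample limit is exhausted first. The caller can treat None as a
    signal to ask the user for clarification instead of guessing.
    """
    deadline = clock() + budget_s
    votes = Counter()
    for _ in range(max_samples):
        if clock() >= deadline:
            break  # out of time: stop sampling
        candidate = sample_fn()
        votes[candidate] += 1
        if votes[candidate] >= min_agreement:
            return candidate  # enough independent samples agree
    return None  # no agreement reached: abstain


# Stub sampler standing in for repeated fast model calls (hypothetical outputs).
canned = iter(["set_temperature(21)", "set_temperature(22)", "set_temperature(21)"])
chosen_action = vote_within_budget(lambda: next(canned), budget_s=60.0)
```

With the stub above, the third sample produces a second vote for `set_temperature(21)`, so that action is returned. Faster inference allows more samples inside the same budget, which is the trade this sketch is meant to show.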

Awards & Prizes

Open Track

  • Rank Award - By Pass^3 on hidden test set
  • Best Innovation Award - Novel harnessing, reliability design, efficient model use

Prizes

  • $5,000 pool of Anthropic API Credits split across Rank Award and Best Innovation Award winners (details to be announced), sponsored by Anthropic
  • Certificate
  • (Optional) Oral presentation at IJCAI-ECAI 2026 competition slot
  • (Optional) Co-authorship on IJCAI 2027 demo paper, reporting on the experience of the 2026 competition
  • More prizes to be announced

Cerebras Fast-Reasoning Track

  • Cerebras Innovation Award - Best demonstration of compute-aware agent design

Prizes

  • Codex Pro: 12-month subscriptions for the top 2 teams, sponsored by OpenAI
  • Certificate
  • (Optional) Oral presentation at IJCAI-ECAI 2026 competition slot
  • (Optional) Co-authorship on IJCAI 2027 demo paper, reporting on the experience of the 2026 competition
  • More prizes to be announced

Each team must submit a 4-page technical report (IJCAI format) for award eligibility. Full details →

Baselines

Can You Beat the Frontier?

Baseline results using our default agent scaffold with no optimization. Your target starts here.

Model            Provider    Avg Pass^3  Base Pass^3  Hallucination Pass^3  Disambiguation Pass^3
Claude Opus 4.6  Anthropic   .58         .80          .48                   .46
GPT-5            OpenAI      .54         .66          .60                   .36
Gemini 2.5 Pro   Google      .38         .53          .34                   .28
Qwen3-32B        Alibaba     .31         .45          .27                   .22
xLAM-2-32B       Salesforce  .16         .26          .11                   .12
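
The Avg column is numerically consistent with an unweighted mean of the three per-dimension scores, rounded to two decimals. That reading is an observation from the numbers in the table, not a definition stated on this page, and the short check below (with the hypothetical helper `macro_average`) only verifies the arithmetic.

```python
# Baseline rows copied from the table: (avg, base, hallucination, disambiguation).
baselines = {
    "Claude Opus 4.6": (0.58, 0.80, 0.48, 0.46),
    "GPT-5": (0.54, 0.66, 0.60, 0.36),
    "Gemini 2.5 Pro": (0.38, 0.53, 0.34, 0.28),
    "Qwen3-32B": (0.31, 0.45, 0.27, 0.22),
    "xLAM-2-32B": (0.16, 0.26, 0.11, 0.12),
}


def macro_average(scores):
    """Unweighted mean: every dimension counts equally, whatever its task count."""
    return sum(scores) / len(scores)


# True for a model if its reported average matches the recomputed mean.
avg_matches = {
    model: round(macro_average(dims), 2) == avg
    for model, (avg, *dims) in baselines.items()
}
```

If this reading holds, a gain of a given size on the smaller Disambiguation set moves the average as much as the same gain on the larger Base set.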

Full Leaderboard →

Prior Recognition

Validated by the Community

CAR-bench was accepted to the ACL 2026 main conference, selected as Hugging Face Paper of the Day, and won 1st place at UC Berkeley’s AgentX-AgentBeats Competition (Computer-Use Track, Google DeepMind-sponsored). The CAR-bench Challenge is the first academic competition dedicated to LLM agent reliability and limit-awareness.

Team

Organizers

A multidisciplinary team spanning academia and industry.

Johannes Kirmayr - BMW & Augsburg Univ.
Lukas Stappen - BMW Group Research
Elisabeth André - Augsburg University
Erik Cambria - NTU Singapore
Iryna Gurevych - TU Darmstadt
Patrick Dreisch - Anthropic
Natalia Vassilieva - Cerebras Systems
Raj Tumuluri - OpenStream.ai
Varin Sikka - Stanford University

Extended Bios →

Cite

Citation

If you use CAR-bench in your research, please cite our paper.

@misc{kirmayr2026carbenchevaluatingconsistencylimitawareness,
      title={CAR-bench: Evaluating the Consistency and
             Limit-Awareness of LLM Agents under
             Real-World Uncertainty},
      author={Johannes Kirmayr and Lukas Stappen
              and Elisabeth Andr{\'e}},
      year={2026},
      eprint={2601.22027},
      archivePrefix={arXiv},
      primaryClass={cs.AI},
      url={https://huggingface.co/papers/2601.22027},
}