
CAR-bench Challenge: Building Reliable LLM Agents Under Real-World Uncertainty

Frontier LLMs consistently solve only 58% of real-world agent tasks. Can you do better?
250 tasks · 58 tools · 19 policies · 3 evaluation dimensions


Timeline - all deadlines AoE

May 4 Competition opens - data, docs, and baseline agent released. Get started now →
May–Jul 2026 Open development with live dev-set leaderboard.
Jul 10 First evaluation on hidden test set (1st official submission). Scores returned to participants.
Jul 19 Final evaluation on hidden test set (2nd official submission). Rankings determined.
Jul 26 Technical report submission deadline (4 pages, IJCAI format).
Jul 31 Winners announced.
Aug 15–21 Presentations at IJCAI-ECAI 2026, Bremen, Germany.

About the Benchmark

What is CAR-bench?

The problem: LLM agents are rapidly moving from research prototypes to real-world deployments, yet existing benchmarks evaluate them under idealized conditions - complete information, available tools, and unambiguous instructions. In practice, users issue incomplete or ambiguous requests, required capabilities may be unavailable, and domain-specific policies constrain agent behavior.

The approach: CAR-bench evaluates LLM agents as automotive in-car voice assistants across 250 tasks spanning three complementary dimensions: Base (multi-turn task completion, 100 tasks), Hallucination (limit-awareness under missing capabilities, 100 tasks), and Disambiguation (resolution of ambiguous requests, 50 tasks). Agents interact with an LLM-simulated user, plan and chain calls across 58 interconnected tools governed by 19 domain-specific policies, and operate over large-scale world data (48 European cities, 130K+ POIs, 1.7M+ routes).
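To make that interaction pattern concrete, here is a rough sketch of what one evaluation episode looks like. Every class, method, and parameter name below (user_sim, agent, env, step, call, and so on) is hypothetical; this only illustrates the loop described above, not the actual CAR-bench harness.

```python
# Hypothetical sketch of a CAR-bench-style episode loop, written only to
# illustrate the interaction pattern; the official harness, tool schemas,
# and termination logic live in the competition repo.
def run_episode(user_sim, agent, env, max_turns: int = 20) -> list:
    """Alternate simulated-user turns with policy-guided agent turns and tool calls."""
    transcript = []
    user_msg = user_sim.first_message()              # user issues the (possibly ambiguous) task
    for _ in range(max_turns):
        transcript.append({"role": "user", "content": user_msg})
        action = agent.step(transcript)              # agent plans under domain policies
        while getattr(action, "tool_call", None):    # chain tool calls as needed
            observation = env.call(action.tool_call) # observe or modify the environment state
            transcript.append({"role": "tool", "content": observation})
            action = agent.step(transcript)
        transcript.append({"role": "assistant", "content": action.message})
        user_msg = user_sim.reply(action.message)    # simulated user responds
        if user_sim.is_done(user_msg):               # conversation ends
            break
    return transcript
```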

Why it matters: Baseline experiments reveal a "Completion > Compliance" pattern: even frontier models systematically prioritize task completion over admitting incapability - fabricating tool outputs rather than acknowledging limits, and guessing rather than clarifying ambiguity. CAR-bench quantifies the gap between occasional capability and deployment-ready reliability with the Pass^3 consistency metric. A task scores 1 only if solved in all 3 trials.
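As a minimal illustration of the metric (not the official scoring code; the data layout below is an assumption), Pass^3 counts a task as solved only when all three independent trials succeed:

```python
from typing import Dict, List

def pass_hat_k(trial_results: Dict[str, List[bool]], k: int = 3) -> float:
    """Pass^k consistency: a task counts as solved only if all k trials succeed.

    trial_results maps a task id to its per-trial pass/fail outcomes
    (illustrative layout; the official harness may store results differently).
    """
    solved = 0
    for task_id, outcomes in trial_results.items():
        assert len(outcomes) >= k, f"need at least {k} trials for {task_id}"
        if all(outcomes[:k]):          # success in every one of the k trials
            solved += 1
    return solved / len(trial_results)

# Example: 2 of 3 tasks pass all 3 trials -> Pass^3 = 0.67
results = {
    "base_001": [True, True, True],
    "hall_017": [True, False, True],   # one failed trial -> not counted
    "disamb_042": [True, True, True],
}
print(round(pass_hat_k(results, k=3), 2))  # 0.67
```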

Benchmark Details →   Read the Paper →   Hugging Face →   GitHub →

Overview of the CAR-bench components. (a) An LLM-simulated user generates multi-turn messages following task instructions (1); (b) the LLM agent, guided by domain policies, interacts with (c) tools to (2a) observe the environment or (2b) modify its state, until producing an informed response (3). The environment consists of (d) mutable states, (e) fixed context variables, and (f) static databases. The user instructions show the Base task type; the task types Hallucination and Disambiguation are explained in Benchmark Details.

Competition

Two Tracks

Choose the track that fits your research - or enter both.

Track 1: Open Track

Use any LLM, any provider, any architecture. The goal is maximum Pass^3 on the hidden test set. A Rank Award recognizes top performance, plus a Best Innovation Award for novel approaches (e.g., efficient use of open models, creative architectures).

Learn more →

Track 2: Cerebras Fast-Reasoning

Cerebras inference is ~10x faster than conventional GPUs. Convert that extra token capacity into higher Pass^3 within the same time budget per turn - via multi-pass reasoning, self-verification, retries, or search strategies.

Learn more →
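One possible reading of compute-aware agent design (a sketch under our own assumptions, not a prescribed recipe): spend the extra throughput on drafting and self-verifying several candidate turns inside a fixed per-turn wall-clock budget. `generate` and `verify` below are placeholders for your own model and checker calls.

```python
import time

def budgeted_turn(generate, verify, prompt, budget_s: float = 5.0, max_attempts: int = 4):
    """Draft-and-verify loop: keep sampling candidate turns until one passes
    a self-check or the per-turn time budget runs out.

    `generate` and `verify` are placeholder callables; the budget and attempt
    count are arbitrary example values, not competition limits.
    """
    deadline = time.monotonic() + budget_s
    best = None
    for _ in range(max_attempts):
        if time.monotonic() >= deadline:
            break                              # respect the per-turn budget
        candidate = generate(prompt)           # fast inference makes extra drafts cheap
        if best is None:
            best = candidate                   # always keep a fallback answer
        if verify(prompt, candidate):          # self-verification / policy check
            return candidate
    return best
```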

Baselines

Can You Beat the Frontier?

Baseline results using our default agent scaffold with no optimization. Your target starts here.

Model           | Provider   | Avg Pass^3 | Base Pass^3 | Hall. Pass^3 | Disamb. Pass^3
Claude Opus 4.6 | Anthropic  | .58        | .80         | .48          | .46
GPT-5           | OpenAI     | .54        | .66         | .60          | .36
Gemini 2.5 Pro  | Google     | .38        | .53         | .34          | .28
Qwen3-32B       | Alibaba    | .31        | .45         | .27          | .22
xLAM-2-32B      | Salesforce | .16        | .26         | .11          | .12

Full Leaderboard →

Prior Recognition

Validated by the Community

CAR-bench was accepted at ACL 2026 Main, selected as Hugging Face Paper of the Day, and won 1st place at UC Berkeley’s AgentX-AgentBeats Competition (Computer-Use Track, Google DeepMind-sponsored). This is the first academic competition dedicated to LLM agent reliability and limit-awareness.

Prizes & Awards

Awards


Open Track

  • Rank Award - By Pass^3 on hidden test set
  • Best Innovation Award - Novel approaches, smaller models, efficiency

Cerebras Track

  • Cerebras Innovation Award - Best demonstration of compute-aware agent design

Prizes for all award winners

  • Sponsored prize (details to be announced)
  • Certificate
  • (Optional) Oral presentation at IJCAI-ECAI 2026 competition slot
  • (Optional) Co-authorship on an IJCAI 2027 demo paper reporting on the 2026 competition

Each team must submit a 4-page technical report (IJCAI format) for award eligibility. Full details →

Team

Organizers

A multidisciplinary team spanning academia and industry.

Johannes Kirmayr - BMW & Augsburg Univ.
Lukas Stappen - BMW Group Research
Elisabeth André - Augsburg University
Erik Cambria - NTU Singapore
Iryna Gurevych - TU Darmstadt
Patrick Dreisch - Anthropic
Natalia Vassilieva - Cerebras Systems
Raj Tumuluri - OpenStream.ai
Varin Sikka - Stanford University

Extended Bios →

Cite

Citation

If you use CAR-bench in your research, please cite our paper.

@misc{kirmayr2026carbenchevaluatingconsistencylimitawareness,
      title={CAR-bench: Evaluating the Consistency and
             Limit-Awareness of LLM Agents under
             Real-World Uncertainty},
      author={Johannes Kirmayr and Lukas Stappen
              and Elisabeth Andr{\'e}},
      year={2026},
      eprint={2601.22027},
      archivePrefix={arXiv},
      primaryClass={cs.AI},
      url={https://huggingface.co/papers/2601.22027},
}