1. What You Get

The competition starter kit provides the code and configs needed to develop, test, package, and submit a CAR-bench agent:

  • Train set: 129 public tasks (50 Base, 48 Hallucination, 31 Disambiguation) for development.
  • Public validation split: 125 public tasks (50 Base, 50 Hallucination, 25 Disambiguation) for local validation. These are not the hidden final tasks.
  • Hidden test set: held out and used only during the two official evaluation rounds (first: Jul 10, final: Jul 19).

Three task types: Base tasks require reaching a ground-truth end-state through correct tool calls and policy compliance. Hallucination tasks remove a required tool, parameter, or result - the agent must acknowledge that it cannot fulfill the request. Disambiguation tasks introduce controlled ambiguity - the agent must detect it and try to resolve it internally (e.g., via stored preferences) before asking the user. Full details are on the Benchmark Details page.

The kit also includes:

  • Evaluator: a dockerized CAR-bench evaluator that owns the simulated user, tools, environment, trajectories, and scoring.
  • Track 1 starter: a minimal agent-under-test template for any model, provider, or architecture.
  • Track 2 references: Cerebras SDK/direct API agents, including a direct executor and a planner/executor template.
  • Scenario configs: TOML files that define local, Docker, and GHCR validation runs.
  • A2A interface: the message protocol used between the evaluator and your submitted agent.
Links: GitHub Repo · Dataset on HF · Read the Paper

2. Quickstart

Start with a smoke scenario before changing the agent. Scenario TOML files are the run configs: they choose the evaluator, the agent under test, the task split, task counts, trials, and environment variables. The full, copy-pasteable setup and troubleshooting guide lives in the starter repo README; this page gives the high-level path.

Prerequisites: Python 3.11+, uv, Docker for Docker/GHCR validation, and the API credentials required by your chosen track.

Participant workflow

  1. Choose a track: Track 1 is open to any model, framework, or architecture. Track 2 is limited to 15 teams and uses direct Cerebras-hosted gpt-oss inference for fast-reasoning harnesses.
  2. Run a scenario config: start with local_smoke.toml, then move to public validation and Docker scenarios.
  3. Prepare submission: publish a public digest-pinned GHCR agent image, prepare scenario.toml with the official evaluator and hidden-set config, and list required secret names without secret values. See the submission checklist.

Common setup

Fork the starter repository first and clone your fork. Your fork is where your agent code, Dockerfile, scenario configs, and technical-report notes should live. The official submission is a public digest-pinned GHCR agent image plus scenario.toml, not a pull request.

git clone https://github.com/YOUR_ORG_OR_USERNAME/car-bench-ijcai.git
cd car-bench-ijcai
python3.11 -m venv .venv
source .venv/bin/activate
./scripts/setup_car_bench.sh
cp .env.example .env

Add the evaluator key to .env. Track 1 participants add whatever model-provider keys their agent needs. Track 2 participants add CEREBRAS_API_KEY and any model-routing variables used by their planner or executor.

Install by track

Track     Install command
Track 1   uv sync --extra track-1-agent --extra car-bench-evaluator
Track 2   uv sync --extra track-2-agent --extra car-bench-evaluator

Track 2 Cerebras setup

Track 2 reference agents call Cerebras-hosted gpt-oss directly through the Cerebras SDK. Configure the Cerebras gpt-oss executor and, for the planner/executor template, the planner model through environment variables:

CEREBRAS_API_KEY=...
TRACK2_EXECUTOR_MODEL=gpt-oss-120b
TRACK2_EXECUTOR_REASONING_EFFORT=medium
TRACK2_PLANNER_MODEL=gpt-oss-120b   # planner/executor template only
TRACK2_PLANNER_REASONING_EFFORT=high
TRACK2_PLANNER_MAX_COMPLETION_TOKENS=4096
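
The variables above could be read in agent code roughly as follows. This is a minimal sketch, not the reference agents' actual loader: the `Track2Settings` class and `load_track2_settings` function are hypothetical names, and using the example values as fallback defaults is an assumption. It reads a plain mapping (by default the process environment) and does not parse `.env` itself.

```python
import os
from dataclasses import dataclass
from typing import Mapping


@dataclass(frozen=True)
class Track2Settings:
    """Hypothetical container for the Track 2 variables listed above."""
    api_key: str
    executor_model: str
    executor_reasoning_effort: str
    planner_model: str
    planner_reasoning_effort: str
    planner_max_completion_tokens: int


def load_track2_settings(env: Mapping[str, str] = os.environ) -> Track2Settings:
    """Build settings from a mapping, failing early if the API key is missing."""
    api_key = env.get("CEREBRAS_API_KEY", "")
    if not api_key:
        raise RuntimeError("CEREBRAS_API_KEY is not set")
    # Fallback defaults mirror the example values above (an assumption).
    return Track2Settings(
        api_key=api_key,
        executor_model=env.get("TRACK2_EXECUTOR_MODEL", "gpt-oss-120b"),
        executor_reasoning_effort=env.get("TRACK2_EXECUTOR_REASONING_EFFORT", "medium"),
        planner_model=env.get("TRACK2_PLANNER_MODEL", "gpt-oss-120b"),
        planner_reasoning_effort=env.get("TRACK2_PLANNER_REASONING_EFFORT", "high"),
        planner_max_completion_tokens=int(
            env.get("TRACK2_PLANNER_MAX_COMPLETION_TOKENS", "4096")
        ),
    )


# Demo with an explicit mapping so the example does not depend on your shell.
demo = load_track2_settings(
    {"CEREBRAS_API_KEY": "dummy-key", "TRACK2_PLANNER_MAX_COMPLETION_TOKENS": "2048"}
)
print(demo.executor_model, demo.planner_max_completion_tokens)
```

Failing fast on a missing key keeps a misconfigured smoke run from spending time before the first model call.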

During ordinary development, free personal Cerebras accounts can have strict request and token rate limits. Use smoke scenarios first, keep completion-token caps tight, and schedule longer validation runs instead of launching many jobs at once. Cerebras will provide increased rate limits compared with a free personal account; access details will follow soon. Participants may self-host the open-source models used by Cerebras during development. Codex Pro is not the submitted-agent runtime for these templates.

Choose your first run

Track                      Local smoke command
Track 1                    uv run car-bench-run scenarios/track_1_agent_under_test/local_smoke.toml --show-logs
Track 2 Cerebras           uv run car-bench-run scenarios/track_2_agent_under_test_cerebras/local_smoke.toml --show-logs
Track 2 planner/executor   uv run car-bench-run scenarios/track_2_agent_under_test_cerebras_planner/local_smoke.toml --show-logs

Validation modes

  • Local smoke: fastest iteration; one train task from each task type.
  • Local test set: full public validation split with three trials.
  • Docker local build: verifies your Dockerfile and runtime assumptions.
  • GHCR validation: tests the image/config style expected for official submission.

For full commands, scenario structure, and Track 2 Cerebras setup details, read the starter repo README after cloning.

3. The Environment in Depth

The CAR-bench environment simulates a rich automotive in-car voice assistant ecosystem:

Tools (58 across 6 domains)

  • Navigation: Route planning, POI search, traffic, turn-by-turn directions
  • Vehicle Control: Climate, seat heating, windows, lights, driving modes
  • Charging: Station search, status, scheduling, range estimation
  • Productivity: Calendar, contacts, messages, reminders, notes
  • Weather: Current conditions, forecasts, alerts
  • Cross-domain: Multi-tool chains, composite queries

Policies (19 domain-specific)

Rules governing agent behavior: safety constraints, operational limits, data access restrictions, and interaction protocols. Agents must follow these policies even when users request actions that violate them.

World Data

  • 48 European cities with realistic geography
  • 130K+ Points of Interest (restaurants, hotels, gas stations, attractions, etc.)
  • 1.7M+ routes between locations
  • 100 calendar entries and 100 contacts
  • LLM-simulated users with diverse personas and interaction styles

For full environment details, see the paper or the Benchmark Details page with task types, evaluation subscores, key findings, and real agent trajectories.

4. The Pass^3 Metric

Pass^3 measures deployment-ready consistency. A task scores 1 only if the agent solves it correctly in all 3 independent trials. This is fundamentally different from Pass@3 (at least 1 success) or average accuracy.

Worked Example

Task A: "Navigate to the nearest Italian restaurant"
Trials: ✓ ✓ ✗ → Pass^3 = 0 (inconsistent - fails once, unreliable for deployment)

Task B: "Set a reminder for my meeting tomorrow"
Trials: ✓ ✓ ✓ → Pass^3 = 1 (consistent success across all trials)

This matters because real users don’t get three tries. If an agent succeeds only 2 out of 3 times, it’s not reliable enough for deployment. Pass^3 captures this: it penalizes inconsistency, not just average failure.
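
The definition and worked example above can be reproduced in a few lines. This is a minimal sketch for local bookkeeping, not the evaluator's scoring code: the function names and the `results` layout (task id mapped to a list of per-trial booleans) are assumptions.

```python
def pass_hat_k(trials: dict[str, list[bool]], k: int = 3) -> float:
    """Pass^k: fraction of tasks solved in ALL of their first k trials."""
    if not trials:
        raise ValueError("no tasks given")
    solved = 0
    for task_id, outcomes in trials.items():
        if len(outcomes) < k:
            raise ValueError(f"{task_id} has fewer than {k} trials")
        solved += all(outcomes[:k])
    return solved / len(trials)


def pass_at_k(trials: dict[str, list[bool]], k: int = 3) -> float:
    """Pass@k: fraction of tasks solved in AT LEAST ONE of their first k trials."""
    if not trials:
        raise ValueError("no tasks given")
    return sum(any(outcomes[:k]) for outcomes in trials.values()) / len(trials)


def mean_accuracy(trials: dict[str, list[bool]]) -> float:
    """Average per-trial success rate over every recorded trial."""
    flat = [ok for outcomes in trials.values() for ok in outcomes]
    return sum(flat) / len(flat)


# Task A and Task B from the worked example above.
results = {
    "task_a": [True, True, False],
    "task_b": [True, True, True],
}

print(pass_hat_k(results))     # 0.5
print(pass_at_k(results))      # 1.0
print(mean_accuracy(results))  # 5 of 6 trials succeed
```

On the same trials, Pass@3 reports a perfect score and average accuracy looks high, while Pass^3 credits only Task B.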

Key insight from baseline experiments: There is a significant gap between Pass^1 (single-trial success) and Pass^3 (consistent success) across all frontier models. Claude Opus 4.6 scores .84 Pass^1 but only .80 Pass^3 on Base tasks, and the gap widens dramatically on Hallucination (.59 vs .48) and Disambiguation (.58 vs .46) tasks.
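
One way to read these numbers is the back-of-the-envelope check below. It is an illustration, not a result from the paper, and it assumes that Pass^1 is the mean per-trial success rate and that the three trials of task i succeed independently with probability p_i.

```latex
% If every task shared the same per-trial success rate p = 0.84:
\[
  p^{3} = 0.84^{3} \approx 0.593
  \quad\text{vs. the reported Base } \mathrm{Pass}^{3} = 0.80 .
\]
% With task-specific rates p_i, convexity of x^3 on [0,1] (Jensen) gives:
\[
  \mathbb{E}\bigl[\mathrm{Pass}^{3}\bigr]
  = \frac{1}{N}\sum_{i=1}^{N} p_i^{3}
  \;\ge\;
  \Bigl(\frac{1}{N}\sum_{i=1}^{N} p_i\Bigr)^{3}
  = \bigl(\mathrm{Pass}^{1}\bigr)^{3} .
\]
```

Under these assumptions, a reported Pass^3 well above 0.593 is consistent with failures concentrating in a subset of tasks rather than being spread evenly across all of them.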

5. Beyond the Baseline - Research Directions

The baseline agent is a straightforward tool-calling scaffold with no optimization. Here are productive directions participants have explored in related challenges:

  • Advanced prompting: Chain-of-thought, self-consistency, reflection, few-shot with curated exemplars
  • Planning-execution separation: Plan the full tool-call sequence before execution; verify the plan; adapt on failure
  • Subagents and skills: Decompose complex tasks into specialized sub-tasks handled by dedicated agents
  • Context engineering: RAG over environment data, dynamic context windows, selective tool description loading
  • Fine-tuning: SFT on training-set trajectories; RL or preference optimization (e.g., RLHF/DPO) with Pass^3-style consistency as the reward or preference signal
  • Multi-agent systems: Orchestration of multiple agents with different roles (planner, executor, critic)
  • Consistency-focused strategies: Best-of-N with self-verification; majority voting; Monte Carlo tree search
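
As one illustration of the consistency-focused strategies in the list above, majority voting over several sampled candidate actions might look like the sketch below. It is not part of the starter kit: `majority_vote`, the `sample` callable, and the tool name in the demo are hypothetical, and a real agent would replace `sample` with a model call that proposes a tool call.

```python
import json
from collections import Counter
from typing import Any, Callable, Optional


def canonical(call: dict[str, Any]) -> str:
    """Serialize a candidate so that equal calls compare equal regardless of key order."""
    return json.dumps(call, sort_keys=True)


def majority_vote(
    sample: Callable[[], dict[str, Any]],
    n: int = 5,
    min_share: float = 0.6,
) -> Optional[dict[str, Any]]:
    """Draw n candidates; return the most common one if it reaches min_share, else None."""
    if n < 1:
        raise ValueError("n must be at least 1")
    counts: Counter[str] = Counter()
    first_seen: dict[str, dict[str, Any]] = {}
    for _ in range(n):
        call = sample()
        key = canonical(call)
        counts[key] += 1
        first_seen.setdefault(key, call)
    key, votes = counts.most_common(1)[0]
    return first_seen[key] if votes / n >= min_share else None


# Deterministic demo: two of three hypothetical candidates agree.
candidates = iter([
    {"tool": "set_temperature", "args": {"celsius": 21}},
    {"args": {"celsius": 21}, "tool": "set_temperature"},
    {"tool": "set_temperature", "args": {"celsius": 22}},
])
winner = majority_vote(lambda: next(candidates), n=3, min_share=0.6)
print(winner)
```

Returning `None` when no candidate reaches the threshold lets the caller fall back to a safer behavior, such as re-checking context or asking the user, instead of acting on a split vote.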

These directions are not prescriptive - they’re signposts. Any approach that improves reliable behavior is welcome, as long as it respects the benchmark boundary and does not exploit evaluator internals.

Ready?

Clone the starter repo, run a smoke scenario, then validate your agent in Docker.