Clone the starter kit, choose a track, run a scenario config, and validate the same Docker interface organizers will use.
The competition starter kit provides the code and configs needed to develop, test, package, and submit a CAR-bench agent.
Start with a smoke scenario before changing the agent. Scenario TOML files are the run configs: they choose the evaluator, the agent under test, the task split, task counts, trials, and environment variables. The full, copy-pasteable setup and troubleshooting guide lives in the starter repo README; this page gives the high-level path.
Prerequisites: Python 3.11+, uv, Docker for Docker/GHCR validation, and the API credentials required by your chosen track.
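The prerequisites can be checked up front with a short script. This is a sketch that assumes `uv` and `docker` are expected on `PATH` under those exact names; it does not check API credentials, and the function name is made up for illustration.

```python
import shutil
import sys

MIN_PYTHON = (3, 11)  # minimum version from the prerequisites above
REQUIRED_TOOLS = ("uv", "docker")  # docker is only needed for Docker/GHCR validation


def check_prerequisites(version_info=sys.version_info, which=shutil.which):
    """Return a list of problems; an empty list means every check passed."""
    problems = []
    if tuple(version_info[:2]) < MIN_PYTHON:
        problems.append(
            f"Python {MIN_PYTHON[0]}.{MIN_PYTHON[1]}+ required, "
            f"found {version_info[0]}.{version_info[1]}"
        )
    for tool in REQUIRED_TOOLS:
        # shutil.which returns None when the executable is not on PATH.
        if which(tool) is None:
            problems.append(f"'{tool}' not found on PATH")
    return problems


for problem in check_prerequisites():
    print("MISSING:", problem)
```

The `version_info` and `which` parameters exist only so the check can be exercised without changing the local machine.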
gpt-oss inference for fast-reasoning harnesses.
local_smoke.toml, then move to public validation and Docker scenarios.
scenario.toml with the official evaluator and hidden-set config, and list required secret names without secret values. See the submission checklist.
Fork the starter repository first and clone your fork. Your fork is where your agent code, Dockerfile, scenario configs, and technical-report notes should live. Official submission will be a public digest-pinned GHCR agent image plus scenario.toml, not a pull request.
```bash
git clone https://github.com/YOUR_ORG_OR_USERNAME/car-bench-ijcai.git
cd car-bench-ijcai
python3.11 -m venv .venv
source .venv/bin/activate
./scripts/setup_car_bench.sh
cp .env.example .env
```
Add the evaluator key to .env. Track 1 participants add whatever model-provider keys their agent needs. Track 2 participants add CEREBRAS_API_KEY and any model-routing variables used by their planner or executor.
| Track | Install command |
|---|---|
| Track 1 | uv sync --extra track-1-agent --extra car-bench-evaluator |
| Track 2 | uv sync --extra track-2-agent --extra car-bench-evaluator |
Track 2 reference agents call Cerebras-hosted gpt-oss directly through the Cerebras SDK. Configure the Cerebras gpt-oss executor and, for the planner/executor template, the planner model through environment variables:
```bash
CEREBRAS_API_KEY=...
TRACK2_EXECUTOR_MODEL=gpt-oss-120b
TRACK2_EXECUTOR_REASONING_EFFORT=medium
TRACK2_PLANNER_MODEL=gpt-oss-120b # planner/executor template only
TRACK2_PLANNER_REASONING_EFFORT=high
TRACK2_PLANNER_MAX_COMPLETION_TOKENS=4096
```
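Before a run, it can help to confirm that these variables are actually visible to the process. The sketch below assumes the values from `.env` have already been loaded into the environment, which it does not do itself. Which variables count as required is an assumption made for illustration. It reports names only and never prints secret values.

```python
import os

# Variable names come from the block above; treating these particular ones
# as required is an assumption made for this sketch.
REQUIRED = ("CEREBRAS_API_KEY", "TRACK2_EXECUTOR_MODEL")
PLANNER_ONLY = ("TRACK2_PLANNER_MODEL",)


def missing_variables(env, use_planner=False):
    """Return the names of required variables that are unset or blank in env."""
    names = REQUIRED + (PLANNER_ONLY if use_planner else ())
    return [name for name in names if not env.get(name, "").strip()]


# Report names only; secret values are never printed.
missing = missing_variables(os.environ, use_planner=True)
print("missing:", ", ".join(missing) if missing else "none")
```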
During ordinary development, free personal Cerebras accounts can have strict request and token rate limits. Use smoke scenarios first, keep completion-token caps tight, and schedule longer validation runs instead of launching many jobs at once. Cerebras will provide participants with higher rate limits than a free personal account; access details will follow soon. During development, participants may also self-host the open-source models that Cerebras serves. Codex Pro is not the submitted-agent runtime for these templates.
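One way to avoid launching many jobs at once is to run scenario configs strictly one after another with a pause in between. The sketch below is an illustration, not part of the starter kit: the function name and the 60-second default pause are assumptions, and the default command simply reuses the `uv run car-bench-run ... --show-logs` form from this guide's local smoke commands.

```python
import subprocess
import time


def run_sequentially(scenarios, run=None, pause_seconds=60.0, sleep=time.sleep):
    """Run scenario configs one at a time, pausing between consecutive runs.

    Returns a dict mapping each scenario path to its exit code.
    """
    if run is None:
        def run(path):
            # Command form reused from this guide's local smoke commands.
            command = ["uv", "run", "car-bench-run", path, "--show-logs"]
            return subprocess.run(command).returncode

    results = {}
    for index, path in enumerate(scenarios):
        if index > 0:
            sleep(pause_seconds)  # spread requests out instead of bursting
        results[path] = run(path)
    return results
```

Calling `run_sequentially(["scenarios/track_2_agent_under_test_cerebras/local_smoke.toml"])` would run a single smoke scenario. The `run` and `sleep` parameters allow the loop to be tested without invoking `uv`.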
| Track | Local smoke command |
|---|---|
| Track 1 | uv run car-bench-run scenarios/track_1_agent_under_test/local_smoke.toml --show-logs |
| Track 2 Cerebras | uv run car-bench-run scenarios/track_2_agent_under_test_cerebras/local_smoke.toml --show-logs |
| Track 2 planner/executor | uv run car-bench-run scenarios/track_2_agent_under_test_cerebras_planner/local_smoke.toml --show-logs |
For full commands, scenario structure, and Track 2 Cerebras setup details, read the starter repo README after cloning.
The CAR-bench environment simulates a rich automotive in-car voice assistant ecosystem:
Policies are the rules governing agent behavior: safety constraints, operational limits, data access restrictions, and interaction protocols. Agents must follow these policies even when users request actions that violate them.
For full environment details, see the paper or the Benchmark Details page with task types, evaluation subscores, key findings, and real agent trajectories.
Pass^3 measures deployment-ready consistency. A task scores 1 only if the agent solves it correctly in all 3 independent trials. This is fundamentally different from Pass@3 (at least 1 success) or average accuracy.
This matters because real users don’t get three tries. If an agent succeeds only 2 out of 3 times, it’s not reliable enough for deployment. Pass^3 captures this: it penalizes inconsistency, not just average failure.
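The difference between the three metrics is easiest to see on a small example. The sketch below uses made-up trial outcomes for four hypothetical tasks; the function names are illustrative and are not the evaluator's API.

```python
def pass_hat_k(results, k=3):
    """Pass^k: fraction of tasks solved in all k trials."""
    return sum(len(t) == k and all(t) for t in results.values()) / len(results)


def pass_at_k(results, k=3):
    """Pass@k: fraction of tasks solved in at least one of k trials."""
    return sum(len(t) == k and any(t) for t in results.values()) / len(results)


def average_accuracy(results):
    """Fraction of all individual trials that succeeded."""
    total = sum(len(t) for t in results.values())
    return sum(sum(t) for t in results.values()) / total


# Hypothetical outcomes: one list of 3 independent trials per task.
results = {
    "task_a": [True, True, True],     # consistent success
    "task_b": [True, True, False],    # 2 of 3: counts for Pass@3 only
    "task_c": [False, True, False],   # 1 of 3: counts for Pass@3 only
    "task_d": [False, False, False],  # consistent failure
}

print(pass_hat_k(results))        # 0.25
print(pass_at_k(results))         # 0.75
print(average_accuracy(results))  # 0.5
```

Only `task_a` counts toward Pass^3, even though half of all trials succeeded and three of the four tasks were solved at least once.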
The baseline agent is a straightforward tool-calling scaffold with no optimization. Here are productive directions participants have explored in related challenges:
These directions are not prescriptive; they’re signposts. Any approach that improves reliable behavior is welcome, as long as it respects the benchmark boundary and does not exploit evaluator internals.
Clone the starter repo, run a smoke scenario, then validate your agent in Docker.
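Because the official submission is a digest-pinned GHCR image, a quick format check on the image reference can catch a tag-only reference early. The pattern below is a simplified approximation, assuming the common `ghcr.io/<owner>/<name>@sha256:<64 hex characters>` shape with an optional tag before the digest. It is not the full image-reference grammar and does not check that the image exists or is public.

```python
import re

# Simplified pattern (an assumption for this sketch): lowercase repository
# path on ghcr.io, optional tag, then a sha256 digest of 64 hex characters.
DIGEST_PINNED_GHCR = re.compile(
    r"ghcr\.io/[a-z0-9][a-z0-9._/-]*(:[\w.-]+)?@sha256:[0-9a-f]{64}"
)


def is_digest_pinned(ref: str) -> bool:
    """Return True if ref looks like a digest-pinned GHCR image reference."""
    return DIGEST_PINNED_GHCR.fullmatch(ref) is not None


example_digest = "a" * 64  # placeholder digest, not a real image
print(is_digest_pinned(f"ghcr.io/example-org/example-agent@sha256:{example_digest}"))
print(is_digest_pinned("ghcr.io/example-org/example-agent:latest"))
```

The first call prints `True` and the second prints `False`, since a mutable tag such as `latest` does not pin the image contents.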