Everything you need to go from zero to first submission in under 20 minutes.
The competition provides all the tooling required to develop, test, and submit your agent:
The CAR-bench environment simulates a rich automotive in-car voice assistant ecosystem:
Rules governing agent behavior: safety constraints, operational limits, data access restrictions, and interaction protocols. Agents must follow these policies even when users request actions that violate them.
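As an illustration of the kind of policy enforcement described above, here is a minimal sketch of a guard that denies rule-violating tool calls. The tool names and the driving restriction are hypothetical examples, not part of the actual CAR-bench rule set.

```python
# Hypothetical example restrictions (not the real CAR-bench policy set).
BLOCKED_WHILE_DRIVING = {"watch_video", "edit_profile"}

def check_policy(tool_name: str, vehicle_moving: bool) -> bool:
    """Return True if the requested action is allowed in the current state."""
    if vehicle_moving and tool_name in BLOCKED_WHILE_DRIVING:
        return False  # safety constraint: deny distracting actions while driving
    return True

print(check_policy("watch_video", vehicle_moving=True))   # False: blocked by policy
print(check_policy("navigate", vehicle_moving=True))      # True: allowed
```

An agent that checks requests against such rules before acting, and explains the refusal to the user, follows the intended interaction protocol even when the user asks for a prohibited action.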
For full environment details, see the paper or the Benchmark Details page with task types, evaluation subscores, key findings, and real agent trajectories.
Pass^3 measures deployment-ready consistency. A task scores 1 only if the agent solves it correctly in all 3 independent trials. This is fundamentally different from Pass@3 (at least 1 success) or average accuracy.
This matters because real users don’t get three tries. If an agent succeeds only 2 out of 3 times, it’s not reliable enough for deployment. Pass^3 captures this: it penalizes inconsistency, not just low average accuracy.
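The difference between the two metrics can be sketched in a few lines. This is an illustrative computation over per-task trial outcomes, not the official scorer:

```python
def pass_cubed(trials: list[bool]) -> int:
    """Pass^3: score 1 only if ALL independent trials succeed."""
    return 1 if all(trials) else 0

def pass_at_3(trials: list[bool]) -> int:
    """Pass@3: score 1 if AT LEAST ONE trial succeeds."""
    return 1 if any(trials) else 0

# A task solved in 2 of 3 trials counts under Pass@3 but fails Pass^3.
trials = [True, True, False]
print(pass_cubed(trials))  # 0 — inconsistent, so no credit
print(pass_at_3(trials))   # 1 — one success is enough here
```

Averaging `pass_cubed` across all tasks gives the leaderboard-style consistency score: every flaky task drags it down, regardless of how often the agent succeeds on average.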
The baseline agent is a straightforward tool-calling scaffold with no optimization. Here are productive directions participants have explored in related challenges:
These directions are not prescriptive; they’re signposts. Any approach that improves Pass^3 is welcome.
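For orientation, a tool-calling scaffold like the baseline typically boils down to a short loop: the model proposes either a tool call or a final answer, tool results are fed back, and the loop ends on a final answer or a step budget. The sketch below is hypothetical and simplified; the actual baseline’s interfaces and message format will differ.

```python
# Hypothetical sketch of a minimal tool-calling loop (not the actual baseline code).
def run_agent(task: str, model, tools: dict, max_steps: int = 10):
    messages = [{"role": "user", "content": task}]
    for _ in range(max_steps):
        action = model(messages)  # model proposes a tool call or a final answer
        if action["type"] == "final":
            return action["content"]
        # Execute the requested tool and append its result to the transcript.
        result = tools[action["tool"]](**action["args"])
        messages.append({"role": "tool", "content": str(result)})
    return None  # step budget exhausted without a final answer
```

Most of the optimization directions above amount to changing one piece of this loop: a better system prompt changes what `model` sees, self-verification adds a check before returning `final`, and retry logic wraps the whole function.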
Clone the repo, run the baseline, and submit your first score.