Data & Starter Kit

1. What You Get

The competition provides everything you need to develop, test, and submit your agent:

Train set: 125 tasks (50 Base, 50 Hallucination, 25 Disambiguation) with full labels
Dev set: 125 tasks (50 Base, 50 Hallucination, 25 Disambiguation) with full labels
Hidden test set: held out, used during the two official evaluation rounds (1st: Jul 10, final: Jul 19)

Three task types:

Base: tasks require reaching a ground-truth end state through correct tool calls and policy compliance.
Hallucination: tasks remove a required tool, parameter, or result; the agent must acknowledge that it cannot fulfill the request.
Disambiguation: tasks introduce controlled ambiguity; the agent must detect it and try to resolve it internally (e.g., via stored preferences) before asking the user.
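The "resolve internally before asking" rule for Disambiguation tasks can be sketched as follows. This is a minimal illustration, not the benchmark's implementation: the `Resolution` type, the `resolve` function, and the flat preference dictionary are all hypothetical names chosen for this example.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Resolution:
    choice: Optional[str]    # the resolved option, if one could be determined
    question: Optional[str]  # a clarifying question for the user, if needed


def resolve(candidates: list[str], preferences: dict[str, str], slot: str) -> Resolution:
    """Resolve an ambiguous slot internally first; ask the user only as a fallback."""
    if len(candidates) == 1:
        return Resolution(candidates[0], None)  # nothing ambiguous
    preferred = preferences.get(slot)
    if preferred in candidates:
        return Resolution(preferred, None)      # resolved from a stored preference
    options = ", ".join(candidates)
    return Resolution(None, f"Which {slot} do you mean: {options}?")


# Hypothetical data: two matching chargers, one stored preference.
prefs = {"charger": "fast charger"}
print(resolve(["home charger", "fast charger"], prefs, "charger"))
print(resolve(["home charger", "fast charger"], {}, "charger"))
```

The first call resolves silently from the stored preference; the second has nothing to go on and falls back to a clarifying question.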

Baseline agents: Two reference tool-calling agents, one for the Open Track (using Claude Opus 4.6, scoring 0.58 Pass^3) and one for the Cerebras Track (using a Cerebras-hosted LLM)
Docker template: Pre-configured container for submission
A2A protocol spec: Agent-to-Agent communication protocol used by the evaluation environment

Resources: GitHub Repo | Dataset on Hugging Face

3. The Environment in Depth

The CAR-bench environment simulates a rich automotive in-car voice assistant ecosystem:

Tools (58 across 6 domains)

Navigation: Route planning, POI search, traffic, turn-by-turn directions
Vehicle Control: Climate, seat heating, windows, lights, driving modes
Charging: Station search, status, scheduling, range estimation
Productivity: Calendar, contacts, messages, reminders, notes
Weather: Current conditions, forecasts, alerts
Cross-domain: Multi-tool chains, composite queries

Policies (19 domain-specific)

Rules governing agent behavior: safety constraints, operational limits, data access restrictions, and interaction protocols. Agents must follow these policies even when users request actions that violate them.
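One way to enforce policies regardless of what the user asks is to check every proposed tool call against the policy set before executing it. The sketch below is an assumption-laden illustration: the `ToolCall` type, the `open_window` tool name, the `speed_kmh` state key, and the 120 km/h rule are invented for this example and are not taken from the benchmark's 19 policies.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass(frozen=True)
class ToolCall:
    name: str
    args: dict


# A policy inspects a proposed call plus the current state and returns a
# refusal reason, or None when the call is allowed.
Policy = Callable[[ToolCall, dict], Optional[str]]


def no_window_opening_at_speed(call: ToolCall, state: dict) -> Optional[str]:
    """Hypothetical safety rule, invented for illustration only."""
    if call.name == "open_window" and state.get("speed_kmh", 0) > 120:
        return "Windows cannot be opened above 120 km/h."
    return None


def check_policies(call: ToolCall, state: dict, policies: list[Policy]) -> Optional[str]:
    """Return the first refusal reason, or None if every policy allows the call."""
    for policy in policies:
        reason = policy(call, state)
        if reason is not None:
            return reason
    return None


call = ToolCall("open_window", {"position": "front_left"})
print(check_policies(call, {"speed_kmh": 140}, [no_window_opening_at_speed]))
print(check_policies(call, {"speed_kmh": 50}, [no_window_opening_at_speed]))
```

Running the gate before execution means a user request that violates a policy produces a refusal with a reason instead of a tool call.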

World Data

48 European cities with realistic geography
130K+ Points of Interest (restaurants, hotels, gas stations, attractions, etc.)
1.7M+ routes between locations
100 calendar entries and 100 contacts
LLM-simulated users with diverse personas and interaction styles

For full environment details, see the paper or the Benchmark Details page with task types, evaluation subscores, key findings, and real agent trajectories.

4. The Pass^3 Metric

Pass^3 measures deployment-ready consistency. A task scores 1 only if the agent solves it correctly in all 3 independent trials. This is fundamentally different from Pass@3 (at least 1 success) or average accuracy.

Worked Example

Task A: "Navigate to the nearest Italian restaurant"
Trials: ✓ ✓ ✗ → Pass^3 = 0 (inconsistent - fails once, unreliable for deployment)

Task B: "Set a reminder for my meeting tomorrow"
Trials: ✓ ✓ ✓ → Pass^3 = 1 (consistent success across all trials)

This matters because real users don’t get three tries. If an agent succeeds only 2 out of 3 times, it’s not reliable enough for deployment. Pass^3 captures this: it penalizes inconsistency, not just average failure.
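The difference between the three metrics can be made concrete with a few lines of Python applied to the worked example above. This is a minimal sketch, not the official scorer: the function names are hypothetical, and Pass^1 is assumed here to be the mean per-trial success rate.

```python
def pass_hat_3(results: dict[str, list[bool]]) -> float:
    """Pass^3: fraction of tasks solved in ALL trials."""
    return sum(all(trials) for trials in results.values()) / len(results)


def pass_at_3(results: dict[str, list[bool]]) -> float:
    """Pass@3: fraction of tasks solved in AT LEAST ONE trial."""
    return sum(any(trials) for trials in results.values()) / len(results)


def pass_hat_1(results: dict[str, list[bool]]) -> float:
    """Pass^1 (assumed definition): mean per-trial success rate across tasks."""
    return sum(sum(trials) / len(trials) for trials in results.values()) / len(results)


# Trial outcomes from the worked example: Task A fails once, Task B never fails.
results = {
    "task_a": [True, True, False],
    "task_b": [True, True, True],
}

print(f"Pass^3 = {pass_hat_3(results):.2f}")  # Pass^3 = 0.50
print(f"Pass@3 = {pass_at_3(results):.2f}")   # Pass@3 = 1.00
print(f"Pass^1 = {pass_hat_1(results):.2f}")  # Pass^1 = 0.83
```

On these two tasks, Pass@3 reports a perfect score and Pass^1 looks strong, while Pass^3 shows that only half the tasks were solved reliably.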

Key insight from baseline experiments: There is a significant gap between Pass^1 (single-trial success) and Pass^3 (consistent success) across all frontier models. Claude Opus 4.6 scores 0.84 Pass^1 but only 0.80 Pass^3 on Base tasks, and the gap widens dramatically on Hallucination (0.59 vs. 0.48) and Disambiguation (0.58 vs. 0.46) tasks.

5. Beyond the Baseline - Research Directions

The baseline agent is a straightforward tool-calling scaffold with no optimization. Here are productive directions participants have explored in related challenges:

Advanced prompting: Chain-of-thought, self-consistency, reflection, few-shot with curated exemplars
Planning-execution separation: Plan the full tool-call sequence before execution; verify the plan; adapt on failure
Subagents and skills: Decompose complex tasks into specialized sub-tasks handled by dedicated agents
Context engineering: RAG over environment data, dynamic context windows, selective tool description loading
Fine-tuning: SFT on the training data trajectories; RLHF/DPO with Pass^3 as the reward signal
Multi-agent systems: Orchestration of multiple agents with different roles (planner, executor, critic)
Consistency-focused strategies: Best-of-N with self-verification; majority voting; Monte Carlo tree search

These directions are not prescriptive - they’re signposts. Any approach that improves Pass^3 is welcome.
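As one illustration of the consistency-focused direction listed above, the following sketch applies majority voting over several sampled plans and abstains when agreement is too low. Everything here is hypothetical: `sample_plan` stands in for a call to your own agent, and the tool names and the 0.6 agreement threshold are arbitrary choices for the example.

```python
from collections import Counter
from typing import Callable, Hashable, Optional


def majority_vote(
    sample_plan: Callable[[], Hashable],
    n: int = 5,
    min_agreement: float = 0.6,
) -> Optional[Hashable]:
    """Sample n candidate plans and return the most common one,
    or None if its share of the votes is below min_agreement."""
    votes = Counter(sample_plan() for _ in range(n))
    plan, count = votes.most_common(1)[0]
    return plan if count / n >= min_agreement else None


# Stand-in for a stochastic agent: replays a fixed sequence of sampled plans.
# Each plan is a tuple of hypothetical tool names so that it is hashable.
samples = iter([
    ("search_poi", "start_navigation"),
    ("search_poi", "start_navigation"),
    ("start_navigation",),
    ("search_poi", "start_navigation"),
    ("search_poi", "get_traffic", "start_navigation"),
])

print(majority_vote(lambda: next(samples)))
```

Three of the five samples agree, which meets the threshold, so the two-step plan is returned. When no plan reaches the threshold, returning `None` gives the caller a chance to re-plan or ask the user instead of committing to an unstable answer.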
