1. What You Get

The competition provides everything you need to develop, test, and submit your agent:

  • Train set: 125 tasks (50 Base, 50 Hallucination, 25 Disambiguation) with full labels
  • Dev set: 125 tasks (50 Base, 50 Hallucination, 25 Disambiguation) with full labels
  • Hidden test set: held out, used during the two official evaluation rounds (1st: Jul 10, final: Jul 19)
  • Three task types: Base tasks require reaching a ground-truth end-state through correct tool calls and policy compliance. Hallucination tasks remove a required tool, parameter, or result - the agent must acknowledge it cannot fulfill the request. Disambiguation tasks introduce controlled ambiguity - the agent must detect it and resolve it internally (e.g., via stored preferences) before asking the user.
  • Baseline agents: Two reference tool-calling agents - one for the Open Track (using Claude Opus 4.6, scoring .58 Pass^3) and one for the Cerebras Track (using a Cerebras-hosted LLM)
  • Docker template: Pre-configured container for submission
  • A2A protocol spec: Agent-to-Agent communication protocol used by the evaluation environment
GitHub Repo · Dataset on HF

2. Quickstart

Coming soon. The GitHub repository is receiving final updates for the competition setup. The full quickstart guide with step-by-step instructions will be available when the competition opens on May 4, 2026.

3. The Environment in Depth

The CAR-bench environment simulates a rich automotive in-car voice assistant ecosystem:

Tools (58 across 6 domains)

  • Navigation: Route planning, POI search, traffic, turn-by-turn directions
  • Vehicle Control: Climate, seat heating, windows, lights, driving modes
  • Charging: Station search, status, scheduling, range estimation
  • Productivity: Calendar, contacts, messages, reminders, notes
  • Weather: Current conditions, forecasts, alerts
  • Cross-domain: Multi-tool chains, composite queries

Policies (19 domain-specific)

Rules governing agent behavior: safety constraints, operational limits, data access restrictions, and interaction protocols. Agents must follow these policies even when users request actions that violate them.

World Data

  • 48 European cities with realistic geography
  • 130K+ Points of Interest (restaurants, hotels, gas stations, attractions, etc.)
  • 1.7M+ routes between locations
  • 100 calendar entries and 100 contacts
  • LLM-simulated users with diverse personas and interaction styles

For full environment details, see the paper or the Benchmark Details page, which covers task types, evaluation subscores, key findings, and real agent trajectories.

4. The Pass^3 Metric

Pass^3 measures deployment-ready consistency. A task scores 1 only if the agent solves it correctly in all 3 independent trials. This is fundamentally different from Pass@3 (at least 1 success) or average accuracy.

Worked Example

Task A: "Navigate to the nearest Italian restaurant"
Trials: ✓ ✓ ✗ → Pass^3 = 0 (inconsistent - fails once, unreliable for deployment)

Task B: "Set a reminder for my meeting tomorrow"
Trials: ✓ ✓ ✓ → Pass^3 = 1 (consistent success across all trials)

This matters because real users don’t get three tries. If an agent succeeds only 2 out of 3 times, it’s not reliable enough for deployment. Pass^3 captures this: it penalizes inconsistency, not just average failure.
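The contrast between the three metrics can be sketched in a few lines of Python (the trial results below mirror the worked example above and are illustrative only):

```python
def pass_hat_k(trials):
    """Pass^k: 1 only if every one of the k independent trials succeeds."""
    return int(all(trials))

def pass_at_k(trials):
    """Pass@k: 1 if at least one of the k trials succeeds."""
    return int(any(trials))

def avg_accuracy(trials):
    """Mean per-trial success rate."""
    return sum(trials) / len(trials)

# Task A: succeeds twice, fails once
task_a = [True, True, False]
print(pass_hat_k(task_a), pass_at_k(task_a), round(avg_accuracy(task_a), 2))
# → 0 1 0.67

# Task B: consistent success across all trials
task_b = [True, True, True]
print(pass_hat_k(task_b), pass_at_k(task_b), avg_accuracy(task_b))
# → 1 1 1.0
```

Note how Task A looks respectable under Pass@3 (1) and average accuracy (0.67) but scores 0 under Pass^3: one failed trial is enough to mark the agent as unreliable on that task.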

Key insight from baseline experiments: There is a significant gap between Pass^1 (single-trial success) and Pass^3 (consistent success) across all frontier models. Claude Opus 4.6 scores .84 Pass^1 but only .80 Pass^3 on Base tasks, and the gap widens dramatically on Hallucination (.59 vs .48) and Disambiguation (.58 vs .46) tasks.

5. Beyond the Baseline - Research Directions

The baseline agent is a straightforward tool-calling scaffold with no optimization. Here are productive directions participants have explored in related challenges:

  • Advanced prompting: Chain-of-thought, self-consistency, reflection, few-shot with curated exemplars
  • Planning-execution separation: Plan the full tool-call sequence before execution; verify the plan; adapt on failure
  • Subagents and skills: Decompose complex tasks into specialized sub-tasks handled by dedicated agents
  • Context engineering: RAG over environment data, dynamic context windows, selective tool description loading
  • Fine-tuning: SFT on the training data trajectories; RLHF/DPO with Pass^3 as the reward signal
  • Multi-agent systems: Orchestration of multiple agents with different roles (planner, executor, critic)
  • Consistency-focused strategies: Best-of-N with self-verification; majority voting; Monte Carlo tree search

These directions are not prescriptive - they’re signposts. Any approach that improves Pass^3 is welcome.
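As one example, the "best-of-N with majority voting" strategy above can be sketched as follows. This is a minimal illustration, not part of the competition API: `generate` is a hypothetical callable wrapping your agent's LLM, and in practice the candidates would be full tool-call trajectories compared by resulting end-state rather than raw strings:

```python
from collections import Counter

def majority_vote(candidates):
    """Return the most frequent candidate; ties go to the earliest-seen one
    (Counter preserves first-insertion order among equal counts)."""
    winner, _count = Counter(candidates).most_common(1)[0]
    return winner

def best_of_n(generate, prompt, n=5):
    """Sample n candidate final answers and return the majority choice.

    `generate` is a hypothetical stand-in for the agent's sampling step;
    a real implementation would canonicalize trajectories before voting.
    """
    candidates = [generate(prompt) for _ in range(n)]
    return majority_vote(candidates)

# Toy usage with a stubbed, deterministic generator
stub = lambda prompt: "set_reminder(meeting, tomorrow)"
print(best_of_n(stub, "Set a reminder for my meeting tomorrow", n=3))
# → set_reminder(meeting, tomorrow)
```

The intuition for why this helps Pass^3: if individual samples succeed independently with probability p > 0.5, voting over N samples pushes the per-trial success rate toward 1, which compounds across the three scored trials.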

Ready?

Clone the repo, run the baseline, and submit your first score.