The final submission Google Form is not open yet. The link will be published here and announced on Discord before the first official evaluation round.

Submission Checklist

  • Agent Docker image: a public GHCR image for your agent under test, pinned by digest.
  • Scenario TOML: a scenario.toml file using the official evaluator image and your agent-under-test image/config.
  • Environment variables: required and optional env var names only. Never submit secret values.
  • Model configurability: all LLM model names, provider routes, deployment names, API bases, service tiers, and reasoning-effort selectors must be configurable through env vars.
  • Technical report: a 4-page IJCAI-format report including the CAR-bench citation. Use the IJCAI author kit. Track 2 reports should include an architecture diagram so that compute use can be audited.
  • Track selection: Track 1, Track 2, or both.

The evaluator image is official and organizer-published. Participants do not submit, modify, or self-host evaluator images for official evaluation.

Copy-Paste Scenario TOML

Use this as a starting point and replace the agent image and env vars with your team's values. Leave the [evaluator] image set to the official organizer-published evaluator. Choose your eligible track in the submission form.

[evaluator]
image = "ghcr.io/car-bench/car-bench-evaluator:latest"

[evaluator.env]
GEMINI_API_KEY = "${GEMINI_API_KEY:?Set GEMINI_API_KEY}"
LOGURU_LEVEL = "${LOGURU_LEVEL:-INFO}"

[agent_under_test]
image = "ghcr.io/your-org/your-agent@sha256:replace_with_digest"

[agent_under_test.env]
AGENT_LLM = "${AGENT_LLM:?Set AGENT_LLM}"
AGENT_API_BASE = "${AGENT_API_BASE:-}"
AGENT_API_KEY = "${AGENT_API_KEY:?Set AGENT_API_KEY}"
AGENT_TEMPERATURE = "${AGENT_TEMPERATURE:-}"
LOGURU_LEVEL = "${LOGURU_LEVEL:-INFO}"

[config]
num_trials = 3
task_split = "hidden"
tasks_base_num_tasks = -1
tasks_hallucination_num_tasks = -1
tasks_disambiguation_num_tasks = -1
max_steps = 50

Field Meanings

  • [evaluator]: must use the official, organizer-published evaluator image.
  • [evaluator.env]: may reference evaluator env var names, but organizers provide evaluator secrets for official runs.
  • [agent_under_test].image: public GHCR image pinned by digest, for example ghcr.io/team/agent@sha256:....
  • [agent_under_test.env]: env vars organizers must set. Use ${VAR:?message} for required vars and ${VAR:-} or ${VAR:-default} for optional vars. Do not include secret values.
  • [config]: use task_split = "hidden" and -1 for each task-count field so the full hidden set is selected.
  • Name environment variables however you want, but every model/provider choice needed to run the agent must be configurable through env vars.
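
The ${VAR:?message} and ${VAR:-default} forms mirror POSIX shell parameter expansion. As a rough illustration of how such references resolve, here is a small sketch that assumes shell-style semantics, where an unset or empty variable falls back to the default or raises the message. The expand function is a hypothetical name, and the official tooling may resolve references differently.

```python
import re

# Matches ${NAME}, ${NAME:-default}, and ${NAME:?message}.
_PATTERN = re.compile(r"\$\{([A-Za-z_][A-Za-z0-9_]*)(?::([?-])([^}]*))?\}")


def expand(value: str, env: dict[str, str]) -> str:
    """Expand shell-style variable references in value using env.

    Assumption: unset and empty variables are treated the same, as in POSIX
    shells for the ':-' and ':?' forms.
    """

    def substitute(match: re.Match) -> str:
        name, op, arg = match.group(1), match.group(2), match.group(3)
        current = env.get(name, "")
        if current:
            return current
        if op == "?":
            # Required variable: fail with the message from the template.
            raise ValueError(f"{name}: {arg or 'required variable is not set'}")
        # Optional variable (':-' form) or plain ${NAME}: use the default, if any.
        return arg or ""

    return _PATTERN.sub(substitute, value)
```

For example, expand("${LOGURU_LEVEL:-INFO}", {}) yields "INFO", while expand("${AGENT_LLM:?Set AGENT_LLM}", {}) raises a ValueError that carries the message "Set AGENT_LLM".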

Track 2 Notes

Organizers may run the same submitted Track 2 agent against a different hosted model deployment, so do not hard-code model IDs, deployment names, provider routes, API bases, service tiers, or reasoning-effort choices into the image.
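
One way to keep these choices out of the image is to read them from the environment at startup. The sketch below is only an illustration: it reuses the env var names from the template above (AGENT_LLM, AGENT_API_KEY, AGENT_API_BASE, AGENT_TEMPERATURE), which teams are free to rename, and ModelSettings and load_model_settings are hypothetical names.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional


@dataclass(frozen=True)
class ModelSettings:
    """Model and provider choices resolved at runtime, never baked into the image."""

    model: str
    api_key: str
    api_base: Optional[str]
    temperature: Optional[float]


def load_model_settings(env: Mapping[str, str] = os.environ) -> ModelSettings:
    # Required variables: fail fast with a clear message if any are missing.
    missing = [name for name in ("AGENT_LLM", "AGENT_API_KEY") if not env.get(name)]
    if missing:
        raise RuntimeError("Missing required env vars: " + ", ".join(missing))

    # Optional variables: an empty string is treated as "not set".
    raw_temperature = env.get("AGENT_TEMPERATURE", "")
    return ModelSettings(
        model=env["AGENT_LLM"],
        api_key=env["AGENT_API_KEY"],
        api_base=env.get("AGENT_API_BASE") or None,
        temperature=float(raw_temperature) if raw_temperature else None,
    )
```

An agent with further provider-specific settings, such as a service tier or reasoning-effort selector, could add them to the same structure under env var names of its own choosing.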

Track 2 token usage must be reported through the existing A2A turn_metrics fields: prompt_tokens, completion_tokens, and thinking_tokens. Do not add a custom sequential-depth metadata field; instead, document the sequential-call structure in the report's architecture diagram.
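
If an agent makes several LLM calls within one turn, their usage has to be summed before it is reported. The following sketch shows one way to accumulate the three counters. Only the field names prompt_tokens, completion_tokens, and thinking_tokens come from the text above; the TurnMetrics class, its methods, and the way the resulting dictionary is attached to an A2A response are assumptions made for illustration.

```python
from dataclasses import asdict, dataclass


@dataclass
class TurnMetrics:
    """Running token totals for one turn (hypothetical helper class)."""

    prompt_tokens: int = 0
    completion_tokens: int = 0
    thinking_tokens: int = 0

    def add_call(self, prompt: int, completion: int, thinking: int = 0) -> None:
        # Add the usage of a single LLM call to the totals for this turn.
        self.prompt_tokens += prompt
        self.completion_tokens += completion
        self.thinking_tokens += thinking

    def as_dict(self) -> dict[str, int]:
        # Plain dictionary using exactly the three field names listed above.
        return asdict(self)
```

A turn with two calls would call add_call twice and then pass as_dict() to whatever code builds the turn_metrics payload.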

Report

The technical report is required for every final submission. Use the IJCAI author kit, keep the main text to 4 pages, and include the CAR-bench citation. We encourage and endorse arXiv submissions as well, but arXiv is not required.

Describe the method, the model and harness architecture, validation results, important ablations or lessons learned, and any compute or rate-limit assumptions needed to interpret the submission. Track 2 reports should include an architecture diagram so that the constraint of 5 sequential calls per baseline step can be audited.