Prepare the public agent image, scenario config, and technical report that organizers need for hidden evaluation.
Provide a scenario.toml file that uses the official evaluator image and your agent-under-test image and config. The evaluator image is published by the organizers. Participants do not submit, modify, or self-host evaluator images for official evaluation.
Use the template below as a starting point and replace the agent image and env vars with your team's values. Keep the [evaluator] image set to the official organizer-published evaluator. Choose your eligible track in the submission form.
[evaluator]
image = "ghcr.io/car-bench/car-bench-evaluator:latest"
[evaluator.env]
GEMINI_API_KEY = "${GEMINI_API_KEY:?Set GEMINI_API_KEY}"
LOGURU_LEVEL = "${LOGURU_LEVEL:-INFO}"
[agent_under_test]
image = "ghcr.io/your-org/your-agent@sha256:replace_with_digest"
[agent_under_test.env]
AGENT_LLM = "${AGENT_LLM:?Set AGENT_LLM}"
AGENT_API_BASE = "${AGENT_API_BASE:-}"
AGENT_API_KEY = "${AGENT_API_KEY:?Set AGENT_API_KEY}"
AGENT_TEMPERATURE = "${AGENT_TEMPERATURE:-}"
LOGURU_LEVEL = "${LOGURU_LEVEL:-INFO}"
[config]
num_trials = 3
task_split = "hidden"
tasks_base_num_tasks = -1
tasks_hallucination_num_tasks = -1
tasks_disambiguation_num_tasks = -1
max_steps = 50
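The ${VAR:?message} and ${VAR:-default} placeholders in the template follow the pattern of POSIX shell parameter expansion. As a rough illustration only, the sketch below shows how such placeholders could resolve. It assumes shell-like semantics, in which an unset or empty variable triggers either the error message or the default. The expand function is a hypothetical helper, not part of the organizers' tooling.

```python
import re

# Matches ${VAR}, ${VAR:?message}, and ${VAR:-default}.
_PLACEHOLDER = re.compile(r"\$\{([A-Za-z_][A-Za-z0-9_]*)(?::([?-])([^}]*))?\}")


def expand(value: str, env: dict) -> str:
    """Resolve shell-style placeholders in `value` from the mapping `env`.

    Hypothetical helper: assumes POSIX-like semantics, not the official resolver.
    """

    def substitute(match):
        name, op, arg = match.group(1), match.group(2), match.group(3)
        current = env.get(name, "")
        if current:
            # The variable is set and non-empty, so it wins.
            return current
        if op == "?":
            # Required variable: fail loudly with the author's message.
            raise ValueError(f"{name}: {arg or 'required variable is not set'}")
        # ":-" falls back to the default; a bare ${VAR} falls back to "".
        return arg or ""

    return _PLACEHOLDER.sub(substitute, value)
```

Under these assumptions, expand("${LOGURU_LEVEL:-INFO}", {}) falls back to "INFO", while expand("${AGENT_LLM:?Set AGENT_LLM}", {}) raises an error carrying the message "Set AGENT_LLM".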
[evaluator]: must use the official evaluator image. Participants do not submit, modify, or self-host evaluator images for official evaluation.
[evaluator.env]: may reference evaluator env var names, but organizers provide evaluator secrets for official runs.
[agent_under_test].image: public GHCR image pinned by digest, for example ghcr.io/team/agent@sha256:....
[agent_under_test.env]: env vars organizers must set. Use ${VAR:?message} for required vars and ${VAR:-} or ${VAR:-default} for optional vars. Do not include secret values.
[config]: use task_split = "hidden" and -1 for each task-count field so the full hidden set is selected.
Organizers may run the same submitted Track 2 agent against a different hosted model deployment, so do not hard-code model IDs, deployment names, provider routes, API bases, service tiers, or reasoning-effort choices into the image.
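The field rules above lend themselves to a quick pre-submission check. The following is a minimal sketch, not an official validator: check_scenario and check_scenario_file are hypothetical helpers, the evaluator image string is copied from the template, and the digest pattern assumes a standard 64-character hexadecimal sha256 digest.

```python
import re

# Taken from the template; the organizers' published value is authoritative.
OFFICIAL_EVALUATOR_IMAGE = "ghcr.io/car-bench/car-bench-evaluator:latest"
# A GHCR reference pinned by a full sha256 digest (64 hex characters).
DIGEST_RE = re.compile(r"ghcr\.io/[^@\s]+@sha256:[0-9a-f]{64}")
# ${VAR:?message} (required) or ${VAR:-default} (optional), and nothing else.
PLACEHOLDER_RE = re.compile(r"\$\{[A-Za-z_][A-Za-z0-9_]*:(\?|-)[^}]*\}")
COUNT_FIELDS = (
    "tasks_base_num_tasks",
    "tasks_hallucination_num_tasks",
    "tasks_disambiguation_num_tasks",
)


def check_scenario(data: dict) -> list:
    """Return human-readable problems found in a parsed scenario.toml."""
    problems = []

    if data.get("evaluator", {}).get("image") != OFFICIAL_EVALUATOR_IMAGE:
        problems.append("evaluator.image is not the official evaluator image")

    agent = data.get("agent_under_test", {})
    if not DIGEST_RE.fullmatch(str(agent.get("image", ""))):
        problems.append("agent_under_test.image is not a GHCR image pinned by digest")

    # Literal values here could leak secrets; only placeholders are accepted.
    for name, value in agent.get("env", {}).items():
        if not PLACEHOLDER_RE.fullmatch(str(value)):
            problems.append(f"agent_under_test.env.{name} is not a placeholder")

    config = data.get("config", {})
    if config.get("task_split") != "hidden":
        problems.append('config.task_split must be "hidden"')
    for field in COUNT_FIELDS:
        if config.get(field) != -1:
            problems.append(f"config.{field} must be -1")
    return problems


def check_scenario_file(path: str) -> list:
    """Parse a scenario.toml file and check it (requires Python 3.11+)."""
    import tomllib  # standard library from Python 3.11

    with open(path, "rb") as fh:
        return check_scenario(tomllib.load(fh))
```

Run against the unmodified template, this sketch would flag the agent image, because sha256:replace_with_digest is not a real digest. An empty list means none of these particular checks failed. It does not mean the submission is valid.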
Track 2 token usage must be reported through existing A2A turn_metrics fields: prompt_tokens, completion_tokens, and thinking_tokens. Do not add a custom sequential-depth metadata field; document the sequential-call structure in the report architecture diagram.
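As an illustration of the reporting rule above, the sketch below accumulates token usage for one turn into the three named fields. Only the field names prompt_tokens, completion_tokens, and thinking_tokens come from the text. The TurnMetrics class, the add_call method, and the choice to sum usage across the model calls made within a turn are assumptions. How the resulting dictionary is attached to an A2A response is not shown.

```python
from dataclasses import asdict, dataclass


@dataclass
class TurnMetrics:
    """Hypothetical accumulator for one turn's token usage."""

    prompt_tokens: int = 0
    completion_tokens: int = 0
    thinking_tokens: int = 0

    def add_call(self, prompt: int, completion: int, thinking: int = 0) -> None:
        # Assumption: usage from each model call in the turn is summed.
        self.prompt_tokens += prompt
        self.completion_tokens += completion
        self.thinking_tokens += thinking

    def to_fields(self) -> dict:
        # Only the three existing fields are emitted; no custom metadata is added.
        return asdict(self)


metrics = TurnMetrics()
metrics.add_call(prompt=1200, completion=150, thinking=300)
metrics.add_call(prompt=1400, completion=90)
turn_metrics = metrics.to_fields()
```

Emitting only these three keys keeps the payload within the existing fields, which is consistent with the instruction not to add a custom sequential-depth field.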
The technical report is required for every final submission. Use the IJCAI author kit, keep the main text to 4 pages, and include the CAR-bench citation. We encourage and endorse arXiv submissions as well, but arXiv is not required.
Describe the method, model and harness architecture, validation results, important ablations or lessons learned, and any compute or rate-limit assumptions needed to interpret the submission. Track 2 reports should include an architecture diagram so the 5-sequential-call-per-baseline-step constraint can be audited.