1. Eligibility

  • The competition is open to individuals and teams worldwide.
  • There is no minimum or maximum team size, though we recommend 1–5 members.
  • Participants may be affiliated with academic institutions, industry organizations, or be independent researchers.
  • An individual may be a member of multiple teams. However, if two submissions from teams sharing a member are substantially overlapping, the worse-performing submission will be excluded from award eligibility.

2. Registration

  • Teams must register before their first submission to the official leaderboard.
  • Teams should register through the official Google Form where possible. If Google Forms is inaccessible from a participant’s location, teams may instead register by email, following the fallback instructions on the registration page.
  • Registration requires: team name, institution or organization, country or region, contact person name, contact email, team member names, optional team member emails, and track selection.
  • Track 1 registration remains open.
  • Track 2 registration is closed and remains limited to 15 participating teams.
  • Registration opened on May 25, 2026.
  • Teams may update their team members up to 7 days before the final evaluation deadline.

3. Allowed Approaches

We adopt a permissive stance: any approach is permitted unless it is explicitly prohibited in Section 4. Permitted approaches include, but are not limited to:

  • Any LLM (proprietary or open-source, any size)
  • Prompting strategies (system prompts, few-shot, chain-of-thought, self-reflection)
  • Agentic scaffolding (planning-execution separation, retry logic, verification)
  • Multi-agent systems (subagents, orchestration, ensemble methods)
  • Fine-tuning on the provided training data (SFT, RLHF, DPO, etc.)
  • Retrieval-augmented generation (RAG) over environment data
  • External knowledge bases or pre-computed lookup tables (provided they do not encode test set answers)

Cerebras Fast-Reasoning Track (Track 2): Submissions must use the direct Cerebras-hosted gpt-oss inference setup described in the starter kit, and agents must expose the same dockerized A2A interface as Track 1. Multi-pass reasoning, private planning, self-verification, retries, ensembles, and parallel calls are allowed, but the agent must not exploit evaluator internals and must respect the following inference-compute constraints:

  • Sequential calls: For each baseline LLM step, participants may use up to 5 sequential LLM calls. Parallel calls inside a step do not count toward this sequential limit.
  • Token budget: Token usage may average up to 500k tokens per task, counting input, reasoning, and output tokens.
  • Tracking: Token usage must be tracked through the A2A turn_metrics token fields and may be cross-checked by the organizers.
  • Audit: Track 2 reports should include an architecture diagram so that the sequential-call constraint can be audited.
  • Rate limits: Cerebras will provide higher rate limits than a free personal account offers; access details will follow soon.
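The Track 2 sequential-call and token limits can be checked locally before submitting. The following is a minimal sketch, not part of the starter kit: the `TaskUsage` record and `check_track2_budget` helper are hypothetical names, and the sketch assumes that the 500k-token average is taken across all tasks in a run and that you record the length of the longest sequential call chain for each baseline step yourself.

```python
from dataclasses import dataclass
from typing import List

MAX_SEQUENTIAL_CALLS_PER_STEP = 5   # limit stated in the Track 2 rules
MAX_AVG_TOKENS_PER_TASK = 500_000   # input + reasoning + output tokens


@dataclass
class TaskUsage:
    """Hypothetical per-task usage record (not a starter-kit type)."""
    task_id: str
    sequential_calls_per_step: List[int]  # sequential chain length per baseline step
    input_tokens: int
    reasoning_tokens: int
    output_tokens: int

    @property
    def total_tokens(self) -> int:
        return self.input_tokens + self.reasoning_tokens + self.output_tokens


def check_track2_budget(usages: List[TaskUsage]) -> List[str]:
    """Return human-readable violations; an empty list means within limits."""
    violations: List[str] = []
    for usage in usages:
        for step, depth in enumerate(usage.sequential_calls_per_step):
            if depth > MAX_SEQUENTIAL_CALLS_PER_STEP:
                violations.append(
                    f"{usage.task_id}: step {step} used {depth} sequential calls"
                )
    if usages:
        # Assumption: the budget is an average over all tasks in the run.
        average = sum(u.total_tokens for u in usages) / len(usages)
        if average > MAX_AVG_TOKENS_PER_TASK:
            violations.append(
                f"average of {average:.0f} tokens per task exceeds budget"
            )
    return violations
```

Running such a check on development-split logs gives an early warning, but the organizers' own accounting through turn_metrics remains authoritative.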

4. Prohibited Behaviors

The following are strictly prohibited and will result in disqualification:

  • Hard-coding answers: Encoding specific test-set answers or lookup tables that map task descriptions to solutions.
  • Exploiting evaluation infrastructure: Reverse-engineering, probing, or exploiting vulnerabilities in the evaluation system. This includes self-evaluating against the task-level metrics (e.g., checking subscores and re-prompting the LLM to correct failures before the official evaluation scores the run). Agents must produce genuine responses, not iteratively repair them against the scoring criteria.
  • Probing the hidden test set: Any attempt to extract, infer, or reconstruct hidden test set tasks through the evaluation API or any other channel.
  • Overlapping submissions: Teams sharing a member may not submit substantially overlapping agents. If detected, the worse-performing submission will be excluded from award eligibility.

5. Submission Mechanics

  • Submissions are dockerized A2A-compatible agents built from the starter kit interface.
  • Participants submit a public GHCR image for the agent under test, pinned by digest.
  • Participants submit a scenario.toml that references the official organizer-published evaluator image and contains their agent-under-test runtime config.
  • The official evaluator image must not be replaced, modified, self-hosted, or submitted as a participant artifact.
  • Submission scenarios should list required and optional environment variable or secret names, but must never contain secret values.
  • Official submission scenarios must use task_split = "hidden" and -1 for each task-count field.
  • All LLM model names, provider routes, deployment names, API bases, service tiers, and reasoning-effort selectors must be configurable through environment variables.
  • Development validation: Participants may run public local, Docker, and GHCR scenario configs themselves before submitting.
  • Hidden test set: There are two official evaluation rounds. In the first round (July 10), each team gets one submission and receives its scores. In the final round (July 19), each team gets one submission. Final rankings are determined from the July 19 round.
  • Technical report: Every final submission requires a 4-page technical report (excluding references) that uses the IJCAI author kit and cites CAR-bench. Track 2 reports should include an architecture diagram for the compute-use audit. Deadline: July 26, 2026 (AoE).
  • Submission form: The form will open soon. The Google Form link will be published on the submission page and announced on Discord before the first official evaluation round.
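Two of the submission requirements above lend themselves to a quick local check: the agent image must be pinned by digest, and LLM settings must come from environment variables. The sketch below is illustrative only. The regular expression encodes an assumed shape for a digest-pinned GHCR reference, and the environment variable names (`AGENT_LLM_MODEL`, `AGENT_LLM_API_BASE`, `AGENT_LLM_REASONING_EFFORT`) and their defaults are hypothetical, not names defined by the starter kit.

```python
import os
import re
from typing import Mapping, Optional

# Assumed shape: ghcr.io/<owner>/<name>[:tag]@sha256:<64 lowercase hex chars>
_DIGEST_REF = re.compile(
    r"ghcr\.io/[a-z0-9._/-]+(?::[A-Za-z0-9._-]+)?@sha256:[0-9a-f]{64}"
)


def is_digest_pinned(image_ref: str) -> bool:
    """True if the reference points at GHCR and ends in a sha256 digest."""
    return _DIGEST_REF.fullmatch(image_ref) is not None


def load_llm_config(env: Optional[Mapping[str, str]] = None) -> dict:
    """Read LLM settings from the environment instead of hard-coding them.

    The variable names and defaults are placeholders for illustration.
    """
    env = os.environ if env is None else env
    return {
        "model": env.get("AGENT_LLM_MODEL", "example-model"),
        "api_base": env.get("AGENT_LLM_API_BASE", ""),
        "reasoning_effort": env.get("AGENT_LLM_REASONING_EFFORT", "medium"),
    }
```

A tag-only reference such as `ghcr.io/example-team/agent:latest` fails the check because a tag can be moved to a different image after submission, whereas a digest identifies one exact image.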

6. Code Release

  • Participants are expected to release their code under a permissive open-source license (MIT, Apache 2.0, or equivalent) upon conclusion of the competition.
  • Code must be sufficient for independent reproduction of reported results.

7. Technical Reports

  • A 4-page technical report (excluding references), prepared with the IJCAI author kit, is required for every final submission.
  • Deadline: July 26, 2026 (AoE).
  • Reports should describe the approach, model and harness architecture, design decisions, validation results, ablation results or lessons learned, and any compute or rate-limit assumptions.
  • Reports must cite the CAR-bench paper.
  • Reports will be published as non-archival proceedings on the competition website. arXiv submission is encouraged and endorsed, but not required.
  • The submission portal link will be added to the submission page when available.

8. Data Use

  • The starter kit and public CAR-bench data are released under the MIT License.
  • Participants may use the training and dev data for other research, provided they cite the original paper.
  • The hidden test set will not be released during the competition. Post-competition release will be announced separately.
  • Results and rankings will be made publicly available on the competition website after the final evaluation.