Can’t find your answer? Reach us at johannes1.kirmayr@uni-a.de or on Discord.
Yes. There is no minimum team size. Solo participants are welcome and compete on equal footing with teams.
Yes. However, if two teams sharing a member submit substantially overlapping agents, the worse-performing submission will be excluded from award eligibility.
In the Open Track, any LLM is allowed: proprietary or open-source, of any size. In the Cerebras Fast-Reasoning Track, submitted agents use direct Cerebras-hosted gpt-oss inference through the starter kit. Larger closed-source models may be used for internal planner, verifier, or condenser steps if the complete agent stays within the Track 2 inference-compute constraints.
Confirmed Track 2 teams may submit to both tracks. New registrations are Track 1 only because Track 2 registration is closed. Track 2 requires the direct Cerebras gpt-oss inference setup described in the starter kit, so teams usually adapt the agent for that track.
Yes. Track 2 is limited to 15 participating teams, and registration is now closed. Confirmed Track 2 teams will receive follow-up information about Cerebras access and increased rate limits compared with a free personal account.
If your agent does not respond within the evaluator timeout, that trial is scored as a failure and evaluation continues with the remaining tasks. Track 2 also has inference-compute constraints: at most 5 sequential LLM calls per baseline LLM step, and at most 500k tokens per task on average, counting input, reasoning, and output tokens.
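The two Track 2 limits above can be checked locally before submitting. The following is a minimal sketch, assuming you log per-task usage yourself; the `TaskUsage` record and its field names are hypothetical, and the organizers' own accounting of a "baseline LLM step" may differ from this reading.

```python
from dataclasses import dataclass
from typing import List

MAX_SEQUENTIAL_CALLS_PER_STEP = 5   # limit stated in the answer above
MAX_AVG_TOKENS_PER_TASK = 500_000   # input + reasoning + output, averaged over tasks


@dataclass
class TaskUsage:
    """Hypothetical per-task usage record; field names are illustrative only."""
    calls_per_step: List[int]  # sequential LLM calls made in each baseline step
    input_tokens: int
    reasoning_tokens: int
    output_tokens: int

    @property
    def total_tokens(self) -> int:
        return self.input_tokens + self.reasoning_tokens + self.output_tokens


def within_track2_budget(tasks: List[TaskUsage]) -> bool:
    """Return True if every step respects the call limit and the
    average token usage per task stays within the token limit."""
    if not tasks:
        return True
    calls_ok = all(
        n <= MAX_SEQUENTIAL_CALLS_PER_STEP
        for task in tasks
        for n in task.calls_per_step
    )
    avg_tokens = sum(task.total_tokens for task in tasks) / len(tasks)
    return calls_ok and avg_tokens <= MAX_AVG_TOKENS_PER_TASK
```

Because the token limit is an average, a single task may exceed 500k tokens as long as the mean over all tasks does not.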
For official evaluation runs, LLM API costs are covered by the organizers as far as possible, so there is no cost to you except in special LLM setup or usage scenarios. For development, you normally use your own API keys. Free personal Cerebras accounts can have strict limits; Cerebras will provide increased rate limits for the competition compared with a free personal account, with access details to follow soon.
The hidden test set is a completely novel set that is never shared with participants. Evaluation runs in a controlled environment. Agents cannot observe or extract test set tasks. Any attempt to probe the test set is a rule violation resulting in disqualification.
Post-competition release will be decided and announced separately. The public train and validation splits are available under the MIT License.
The competition runs entirely online. You develop and submit remotely. Only the final presentations happen in person at IJCAI-ECAI 2026 in Bremen (August 15–21), and attendance is encouraged but not required for prize eligibility.
Winners who cannot attend may present remotely or designate a co-author to present on their behalf. Prize eligibility is not affected by attendance.
Hard-coding means encoding specific task answers or building lookup tables that map task descriptions to pre-determined solutions. This is prohibited. Clever prompting means crafting system prompts, few-shot examples, or reasoning strategies that help the agent generalize; this is encouraged. The litmus test: would your approach work on novel tasks it hasn’t seen? If yes, it’s prompting. If it only works on tasks whose answers you’ve memorized, it’s hard-coding.
Yes. Retrieval-augmented generation over the environment data (tools, policies, world data) is explicitly allowed and encouraged as a research direction.
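As a rough illustration of retrieval over environment data, the sketch below ranks text snippets by word overlap with a query. It is a deliberately simple stand-in for a real retriever (embeddings, BM25, and so on), and the example snippets in the usage note are invented for illustration, not taken from CAR-bench.

```python
import re
from typing import List, Set


def tokenize(text: str) -> Set[str]:
    """Lowercase the text and split it into alphanumeric word tokens."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def retrieve(query: str, documents: List[str], k: int = 2) -> List[str]:
    """Return up to k documents sharing the most words with the query.

    Documents with no overlap are dropped; ties keep their input order
    because Python's sort is stable.
    """
    query_tokens = tokenize(query)
    scored = [(len(query_tokens & tokenize(doc)), doc) for doc in documents]
    ranked = sorted(scored, key=lambda pair: pair[0], reverse=True)
    return [doc for score, doc in ranked[:k] if score > 0]
```

For example, given made-up snippets such as "Tool set_temperature: sets the cabin temperature in degrees Celsius." and "Policy: confirm with the user before sending an email.", the query "set the temperature to 21 degrees" ranks the tool description first, so only the relevant tool text needs to go into the prompt.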
Yes. You may fine-tune on the provided training data using any method (SFT, RLHF, DPO, etc.). You may also use the public validation split for model and harness selection. The public data is released under the MIT License.
Selected technical reports will be published as non-archival proceedings on the competition website. This means they won’t preclude later publication at a venue of your choice. Reports are 4 pages using the IJCAI author kit, excluding references.
The A2A (Agent-to-Agent) protocol is the structured communication format between the CAR-bench evaluator and your agent under test. Your agent receives user messages, tool definitions, and tool results. It sends back user-facing responses and/or tool calls. The evaluator is the only component that executes CAR-bench tools and computes scores. See the Data & Starter Kit page for the quickstart.
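The division of labor described above can be sketched as follows. This is not the A2A wire format: the `ToolCall` and `AgentReply` types, the dictionary keys, and the toy decision logic are all hypothetical, chosen only to show that the agent returns text and/or tool calls and never executes a tool itself.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, List, Optional


@dataclass
class ToolCall:
    """Hypothetical request asking the evaluator to run a tool."""
    name: str
    arguments: Dict[str, Any]


@dataclass
class AgentReply:
    """Hypothetical reply: user-facing text, tool calls, or both."""
    text: Optional[str] = None
    tool_calls: List[ToolCall] = field(default_factory=list)


def agent_turn(history: List[Dict[str, str]], tools: List[Dict[str, Any]]) -> AgentReply:
    """Toy policy: request the first available tool once, then report its result.

    A real agent would call an LLM here. The key point is that this function
    only describes tool calls; executing them is left to the evaluator.
    """
    last = history[-1]
    if last["role"] == "tool":
        return AgentReply(text=f"Done: {last['content']}")
    if tools:
        return AgentReply(tool_calls=[ToolCall(tools[0]["name"], {})])
    return AgentReply(text="I cannot help with that.")
```

In this sketch the evaluator would append the tool result to `history` with role `"tool"` and call `agent_turn` again, at which point the agent produces its user-facing response.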
Scenario TOML files are run configs. They specify which evaluator and agent to start or pull, which task split to run, how many tasks and trials to execute, and which environment variables, Docker images, or GHCR images to use. Start with local_smoke.toml, then move to Docker and GHCR scenarios before submission.
Submit a public digest-pinned GHCR image for your agent under test, a scenario.toml file using the official evaluator and hidden-set config, required and optional environment variable names, track selection, and a 4-page technical report using the IJCAI author kit and citing CAR-bench. Track 2 reports should include an architecture diagram for compute-use audit. Never submit secret values. Organizers run the submitted agent scenario on hidden evaluation infrastructure. See the submission checklist.
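A digest-pinned image reference names the image by its content hash (`@sha256:` followed by 64 hexadecimal characters) rather than only by a mutable tag. The check below is a minimal sketch for catching an unpinned reference before submission; the regular expression is a simplification of the full image-reference grammar, and the example names in the usage note are placeholders.

```python
import re

# Simplified pattern: ghcr.io/<owner>/<name>[:tag]@sha256:<64 lowercase hex chars>
DIGEST_PINNED = re.compile(
    r"ghcr\.io/[a-z0-9._/-]+(:[\w.-]+)?@sha256:[0-9a-f]{64}"
)


def is_digest_pinned(image_ref: str) -> bool:
    """Return True if image_ref is a GHCR reference pinned to a sha256 digest."""
    return DIGEST_PINNED.fullmatch(image_ref) is not None
```

With this helper, a reference such as `ghcr.io/example-team/car-agent:latest` is rejected because a tag can later point to a different image, while the same name followed by `@sha256:` and a full digest is accepted.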
The Track 2 reference agents call Cerebras-hosted gpt-oss directly through the Cerebras SDK or direct API. During ordinary development, free personal Cerebras accounts can have strict limits, so schedule tasks or wait between longer runs. Cerebras will provide increased rate limits for the competition compared with a free personal account, and teams may self-host the open-source executor models during development before testing the speed gain on Cerebras.
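One common way to wait between calls under strict rate limits is retry with exponential backoff. The sketch below is generic and does not use the Cerebras SDK: `RateLimited` is a hypothetical stand-in for whatever rate-limit error your client raises, and the delay values are arbitrary defaults.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


class RateLimited(Exception):
    """Hypothetical stand-in for the rate-limit error raised by your client."""


def call_with_backoff(
    fn: Callable[[], T],
    max_attempts: int = 5,
    base_delay: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call fn, doubling the wait after each rate-limit error.

    Waits base_delay, 2 * base_delay, 4 * base_delay, ... between attempts
    and re-raises the error once max_attempts calls have failed.
    """
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(max_attempts):
        try:
            return fn()
        except RateLimited:
            if attempt == max_attempts - 1:
                raise
            sleep(base_delay * 2 ** attempt)
    raise AssertionError("unreachable")
```

The `sleep` parameter is injectable so the helper can be tested without actually waiting; in normal use you would wrap your own LLM call in a zero-argument function and pass it as `fn`.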
Join the Discord for real-time help, or email us directly.