Can I participate solo?

Yes. There is no minimum team size. Solo participants are welcome and compete on equal footing with teams.

Can I be on multiple teams?

Yes. However, if two teams sharing a member submit substantially overlapping agents, the worse-performing submission will be excluded from award eligibility.

Is any LLM allowed, or are there restrictions on proprietary vs open-source?

In the Open Track, any LLM is allowed: proprietary or open-source, of any size. In the Cerebras Fast-Reasoning Track, submitted agents use direct Cerebras-hosted gpt-oss inference through the starter kit. Larger closed-source models may be used for internal planner, verifier, or condenser steps if the complete agent stays within the Track 2 inference-compute constraints.

Can I submit to both tracks with the same agent?

Confirmed Track 2 teams may submit to both tracks. New registrations are Track 1 only because Track 2 registration is closed. Track 2 requires the direct Cerebras gpt-oss inference setup described in the starter kit, so teams usually adapt the agent for that track.

Is Track 2 limited?

Yes. Track 2 is limited to 15 participating teams, and registration is now closed. Confirmed Track 2 teams will receive follow-up information about Cerebras access and increased rate limits compared with a free personal account.

What happens if my agent hangs during evaluation?

If your agent does not respond within the evaluator timeout, that trial is scored as a failure and evaluation continues with the remaining tasks. Track 2 also has inference-compute constraints: up to 5 sequential LLM calls per baseline LLM step, and an average of up to 500k input, reasoning, and output tokens per task.
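As a rough illustration of how a Track 2 agent might keep itself inside these limits during development, the sketch below counts calls per step and tokens per task. The class and method names are hypothetical, and it assumes the 500k figure is a combined total of input, reasoning, and output tokens; check the official rules for the exact accounting.

```python
from dataclasses import dataclass, field

MAX_CALLS_PER_STEP = 5             # from the FAQ: sequential LLM calls per baseline LLM step
MAX_AVG_TOKENS_PER_TASK = 500_000  # from the FAQ; assumed here to be a combined total


@dataclass
class ComputeBudget:
    """Hypothetical helper, not part of the starter kit."""
    calls_this_step: int = 0
    current_task_tokens: int = 0
    tokens_per_task: list = field(default_factory=list)

    def start_step(self) -> None:
        # Reset the per-step call counter at each baseline LLM step.
        self.calls_this_step = 0

    def record_call(self, input_tokens: int, reasoning_tokens: int, output_tokens: int) -> None:
        # Refuse a sixth sequential call within the same step.
        if self.calls_this_step >= MAX_CALLS_PER_STEP:
            raise RuntimeError("call limit for this step reached")
        self.calls_this_step += 1
        self.current_task_tokens += input_tokens + reasoning_tokens + output_tokens

    def end_task(self) -> None:
        # Close out the task and store its token total for averaging.
        self.tokens_per_task.append(self.current_task_tokens)
        self.current_task_tokens = 0
        self.calls_this_step = 0

    def average_tokens(self) -> float:
        if not self.tokens_per_task:
            return 0.0
        return sum(self.tokens_per_task) / len(self.tokens_per_task)

    def within_token_budget(self) -> bool:
        return self.average_tokens() <= MAX_AVG_TOKENS_PER_TASK
```

Because the token limit is stated as an average, a single expensive task does not necessarily break the budget as long as the mean over all tasks stays under the cap.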

Are API credits provided?

For official evaluation runs, the organizers cover LLM API costs as far as possible, so you pay nothing except in special LLM setup or usage scenarios. For development, you normally use your own API keys. Free personal Cerebras accounts can have strict limits; Cerebras will provide increased rate limits for the competition compared with a free personal account, and access details will follow soon.

How is the hidden test set protected?

The hidden test set is a completely novel set that is never shared with participants. Evaluation runs in a controlled environment. Agents cannot observe or extract test set tasks. Any attempt to probe the test set is a rule violation resulting in disqualification.

Will the test set be released after the competition?

Post-competition release will be decided and announced separately. The public train and validation splits are available under the MIT License.

Is this competition in-person only, or can I participate remotely?

The competition runs entirely online. You develop and submit remotely. Only the final presentations happen in person at IJCAI-ECAI 2026 in Bremen (August 15–21), and attendance is encouraged but not required for prize eligibility.

What if I can’t attend IJCAI-ECAI 2026 to present?

Winners who cannot attend may present remotely or designate a co-author to present on their behalf. Prize eligibility is not affected by attendance.

What counts as “hard-coding” vs “clever prompting”?

Hard-coding means encoding specific task answers or building lookup tables that map task descriptions to pre-determined solutions. This is prohibited. Clever prompting means crafting system prompts, few-shot examples, or reasoning strategies that help the agent generalize. This is encouraged. The litmus test: would your approach work on novel tasks it hasn’t seen? If yes, it’s prompting. If it only works on tasks whose answers you’ve memorized, it’s hard-coding.

Can I use retrieval / RAG over the environment data?

Yes. Retrieval-augmented generation over the environment data (tools, policies, world data) is explicitly allowed and encouraged as a research direction.
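To make the idea concrete, here is a minimal keyword-overlap retriever over a few snippets. It is only a sketch: the document IDs and texts are made up for illustration and are not actual CAR-bench data, and a real system would more likely use embeddings or BM25.

```python
import re
from collections import Counter


def tokenize(text):
    # Lowercase and split into alphanumeric tokens.
    return re.findall(r"[a-z0-9]+", text.lower())


def retrieve(query, documents, k=2):
    """Return the IDs of the k documents sharing the most tokens with the query."""
    query_counts = Counter(tokenize(query))
    scored = []
    for doc_id, text in documents.items():
        doc_counts = Counter(tokenize(text))
        score = sum(min(query_counts[t], doc_counts[t]) for t in query_counts)
        if score > 0:
            scored.append((score, doc_id))
    # Highest score first; ties broken alphabetically for determinism.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [doc_id for _, doc_id in scored[:k]]


# Made-up snippets standing in for tools, policies, and world data.
EXAMPLE_DOCS = {
    "policy_sunroof": "The sunroof may only be opened when the sunshade is open.",
    "tool_set_temperature": "Sets the cabin temperature in degrees Celsius.",
    "world_weather": "Current weather is rain in Bremen.",
}
```

The retrieved snippets would then be placed into the prompt, so the model sees only the policies and tool descriptions relevant to the current request.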

Can I use the training data for fine-tuning?

Yes. You may fine-tune on the provided training data using any method (SFT, RLHF, DPO, etc.). You may also use the public validation split for model and harness selection. The public data is released under the MIT License.

Will technical reports be published? Where?

Selected technical reports will be published as non-archival proceedings on the competition website. This means they won’t preclude later publication at a venue of your choice. Reports are 4 pages, excluding references, and use the IJCAI author kit.

How does the A2A protocol work?

The A2A (Agent-to-Agent) protocol is the structured communication format between the CAR-bench evaluator and your agent under test. Your agent receives user messages, tool definitions, and tool results. It sends back user-facing responses and/or tool calls. The evaluator is the only component that executes CAR-bench tools and computes scores. See the Data & Starter Kit page for the quickstart.
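The division of labor described above can be sketched as a single turn handler. The types and field names below are hypothetical and do not reflect the actual A2A message schema; the point is only that the agent returns text and/or tool calls and never executes a tool itself.

```python
from dataclasses import dataclass, field


@dataclass
class ToolCall:
    name: str
    arguments: dict


@dataclass
class AgentReply:
    text: str = ""                                  # user-facing response, if any
    tool_calls: list = field(default_factory=list)  # requests for the evaluator to execute


def handle_turn(user_message, tool_definitions, tool_results):
    """Toy policy: request one tool, then answer from its result.

    A real agent would pass user_message and the tool definitions to an LLM;
    this toy version ignores the message content.
    """
    if tool_results:
        # The evaluator already ran the tool; report its result to the user.
        return AgentReply(text=f"Done: {tool_results[-1]['content']}")
    available = {tool["name"] for tool in tool_definitions}
    if "get_status" in available:  # hypothetical tool name
        return AgentReply(tool_calls=[ToolCall("get_status", {})])
    return AgentReply(text="I cannot do that with the tools available.")
```

In this sketch the evaluator would call `handle_turn` once with no tool results, execute the requested tool on its side, and call it again with the result appended.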

What are scenario TOML files?

Scenario TOML files are run configs. They specify which evaluator and agent to start or pull, which task split to run, how many tasks and trials to execute, and which environment variables, Docker images, or GHCR images to use. Start with local_smoke.toml, then move to Docker and GHCR scenarios before submission.
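For orientation, the snippet below parses a made-up scenario and summarizes what it would run. Every table and key name here is an assumption chosen for illustration, as are the image names; the real schema is defined by the scenario files shipped with the starter kit.

```python
try:
    import tomllib  # standard library in Python 3.11+
except ModuleNotFoundError:
    import tomli as tomllib  # third-party backport for older Python versions

# Hypothetical scenario; table and key names are NOT the official schema.
EXAMPLE_SCENARIO = """
[evaluator]
image = "ghcr.io/example-org/evaluator:latest"

[agent]
image = "ghcr.io/example-org/my-agent:dev"
env = ["LLM_API_KEY"]

[run]
split = "validation"
num_tasks = 3
num_trials = 1
"""


def summarize(toml_text):
    """Parse a scenario string and report what would be executed."""
    config = tomllib.loads(toml_text)
    run = config["run"]
    return {
        "agent_image": config["agent"]["image"],
        "split": run["split"],
        "total_runs": run["num_tasks"] * run["num_trials"],
        "env_names": config["agent"].get("env", []),
    }
```

Note that the example lists only the names of environment variables, never their values.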

What exactly do I submit?

Submit the following: a public, digest-pinned GHCR image for your agent under test; a scenario.toml file using the official evaluator and hidden-set config; the names of required and optional environment variables; your track selection; and a 4-page technical report using the IJCAI author kit and citing CAR-bench. Track 2 reports should include an architecture diagram for compute-use audit. Never submit secret values. Organizers run the submitted agent scenario on hidden evaluation infrastructure. See the submission checklist.
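A digest-pinned image reference ends in `@sha256:` followed by 64 hexadecimal characters, rather than relying on a mutable tag such as `latest`. The helper below is a hypothetical pre-submission check, not an official validator, and the image names in the usage lines are placeholders.

```python
import re

# ghcr.io/<owner>/<name>[:tag]@sha256:<64 hex chars>
DIGEST_PINNED = re.compile(
    r"ghcr\.io/[a-z0-9._-]+(?:/[a-z0-9._-]+)+"
    r"(?::[A-Za-z0-9_][A-Za-z0-9._-]{0,127})?"
    r"@sha256:[0-9a-f]{64}"
)


def is_digest_pinned(image_ref):
    """Return True if image_ref is a GHCR reference pinned to a sha256 digest."""
    return DIGEST_PINNED.fullmatch(image_ref) is not None


# Placeholder references for illustration only.
pinned = "ghcr.io/example-org/my-agent@sha256:" + "a" * 64
tagged = "ghcr.io/example-org/my-agent:latest"
```

Here `is_digest_pinned(pinned)` is true and `is_digest_pinned(tagged)` is false, because a tag alone can be repointed to a different image after submission.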

How does Track 2 use Cerebras during development?

The Track 2 reference agents call Cerebras-hosted gpt-oss directly through the Cerebras SDK or direct API. During ordinary development, free personal Cerebras accounts can have strict limits, so use task scheduling or waiting between longer runs. Cerebras will provide increased rate limits for the competition compared with a free personal account, and teams may self-host the open-source executor models during development before testing the speed gain on Cerebras.
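One simple way to wait between calls when a free account hits its limit is exponential backoff. The wrapper below is a generic sketch: `call` and `is_rate_limited` are hypothetical callables you would supply yourself, and no Cerebras SDK function is used or assumed.

```python
import time


def call_with_backoff(call, is_rate_limited, max_attempts=5, base_delay=1.0, sleep=time.sleep):
    """Run call(), retrying with doubling delays while it fails due to rate limiting.

    call            -- zero-argument function that performs one LLM request (hypothetical)
    is_rate_limited -- function mapping an exception to True if it signals a rate limit
    sleep           -- injectable so tests can avoid real waiting
    """
    for attempt in range(max_attempts):
        try:
            return call()
        except Exception as exc:
            # Re-raise anything that is not a rate limit, or the final failure.
            if not is_rate_limited(exc) or attempt == max_attempts - 1:
                raise
            sleep(base_delay * 2 ** attempt)
```

With the defaults, the waits are 1, 2, 4, and 8 seconds before the fifth and final attempt.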

Still Have Questions?

Join the Discord for real-time help, or email us directly.