Can’t find your answer? Reach us at johannes1.kirmayr@uni-a.de or on Discord.
Yes. There is no minimum team size. Solo participants are welcome and compete on equal footing with teams.
Yes. However, if two teams sharing a member submit substantially overlapping agents, the worse-performing submission will be excluded from award eligibility.
In the Open Track, any LLM is allowed - proprietary, open-source, any size. In the Cerebras Track, your agent must use models served on Cerebras infrastructure, accessed via the free tier.
Yes, you can enter both tracks. However, the Cerebras Track requires your agent to use Cerebras infrastructure, so you may need to adapt your agent. The same team can submit different agents to each track.
Each task has a time limit. If your agent doesn’t respond within the limit, that trial is scored as a failure. The evaluation continues with remaining tasks. For dev submissions, you can debug and resubmit immediately. For test set evaluations, there is a 24-hour grace window to fix and resubmit.
For official evaluation runs, the organizers cover LLM API costs as far as possible, so there is no cost to you (except for special LLM setups or usage scenarios). For development, you use your own API keys. Cerebras Track participants use the Cerebras free tier.
The hidden test set is a completely novel set that is never shared with participants. Evaluation runs in a controlled environment. Agents cannot observe or extract test set tasks. Any attempt to probe the test set is a rule violation resulting in disqualification.
Post-competition release will be decided and announced separately. The train and dev sets are available now under the MIT License.
The competition runs entirely online. You develop and submit remotely. Only the final presentations happen in person at IJCAI-ECAI 2026 in Bremen (August 15–21), and attendance is encouraged but not required for prize eligibility.
Winners who cannot attend may present remotely or designate a co-author to present on their behalf. Prize eligibility is not affected by attendance.
Hard-coding means encoding specific task answers or building lookup tables that map task descriptions to pre-determined solutions. This is prohibited. Clever prompting means crafting system prompts, few-shot examples, or reasoning strategies that help the agent generalize - this is encouraged. The litmus test: would your approach work on novel tasks it hasn’t seen? If yes, it’s prompting. If it only works on tasks whose answers you’ve memorized, it’s hard-coding.
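To make the distinction concrete, here is an invented illustration - the task strings, answers, and prompt below are made up, not taken from the competition data:

```python
# PROHIBITED - hard-coding: a lookup table mapping memorized task
# descriptions to pre-determined answers (invented examples).
ANSWERS = {
    "Book a table for two at an Italian place": "call reserve_table(...)",
    "Cancel my Tuesday appointment": "call cancel_appointment(...)",
}

# ENCOURAGED - prompting: a system prompt that helps the agent
# generalize to tasks it has never seen.
SYSTEM_PROMPT = (
    "You are a careful assistant. Before acting, restate the user's goal, "
    "check the relevant policy, and prefer tool calls over guessing."
)
```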
Yes. Retrieval-augmented generation over the environment data (tools, policies, world data) is explicitly allowed and encouraged as a research direction.
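As a starting point, here is a minimal retrieval sketch. It assumes the environment data is available as plain-text files under a hypothetical `data/policies` directory and scores documents by naive keyword overlap; a real agent would more likely use embeddings:

```python
# A minimal retrieval sketch, assuming policy documents live as .txt files
# under data/policies (a hypothetical layout, not the starter kit's).
from pathlib import Path

def retrieve(query: str, docs_dir: str = "data/policies", k: int = 3) -> list[str]:
    """Rank documents by keyword overlap with the query and return the top k."""
    terms = set(query.lower().split())
    scored = []
    for path in Path(docs_dir).glob("*.txt"):
        text = path.read_text()
        score = sum(text.lower().count(t) for t in terms)
        scored.append((score, text))
    scored.sort(key=lambda s: s[0], reverse=True)
    return [text for _, text in scored[:k]]

# The top-k snippets can then be prepended to the agent's prompt as context.
```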
Yes. You may fine-tune on the provided training data using any method (SFT, RLHF, DPO, etc.). You may also use the dev set for validation. The training and dev data are released under the MIT License.
Selected technical reports will be published as non-archival proceedings on the competition website. This means they won’t preclude later publication at a venue of your choice. Reports are 4 pages in IJCAI format (excluding references).
The A2A (Agent-to-Agent) protocol is a structured communication format between the evaluation environment (Evaluator Agent) and your agent (Agent-under-Test). Your agent receives user messages and tool results, and sends back responses and tool calls. The starter kit includes a complete working example. See the Data & Starter Kit page for details.
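For a rough sense of the turn structure, here is a minimal sketch. The field names ("role", "parts") and the tool name below are assumptions for illustration only; the starter kit defines the actual message schema.

```python
# Illustrative only: these dicts sketch the shape of one evaluation turn.
# Field names and the tool name are invented; consult the starter kit
# for the authoritative A2A message format.

incoming = {  # sent by the Evaluator Agent to your agent
    "role": "user",
    "parts": [{"type": "text", "text": "Book me a table for two tonight."}],
}

outgoing = {  # your Agent-under-Test replies with text and/or tool calls
    "role": "agent",
    "parts": [{
        "type": "tool_call",
        "name": "find_restaurants",  # hypothetical tool
        "arguments": {"party_size": 2, "date": "tonight"},
    }],
}

next_incoming = {  # the environment then returns the tool result
    "role": "user",
    "parts": [{
        "type": "tool_result",
        "name": "find_restaurants",
        "result": ["Trattoria Bremen", "Osteria del Porto"],
    }],
}
```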
Join the Discord for real-time help, or email us directly.