This repository collects real-world coding tasks across multiple languages and tech stacks, with varying difficulty. It helps developers evaluate how durable different agentic coding products are when tackling diverse, practical programming work.
This project does not aim to crown a single “best” product. Instead, it provides a set of use cases that enable fair comparisons across products.
We continuously publish practical task sets covering multiple languages and frameworks. Each task includes a concise, tool-friendly prompt (Prompt.md or Prompt.zh.md), so you can run it in different AI coding products (e.g., Qoder, Cursor, Windsurf, Kiro, Claude Code) and observe:
- The progress each tool makes toward the goal
- The quota/credits/tokens it consumes
- The number of interactions and total elapsed time
Then, you can compare durability across products within the same model tier.
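If you want these observations to stay comparable across runs, it can help to record them in a simple structured log. Below is a minimal Python sketch of one possible way to do that; the field names, the `results.csv` path, and the example values are illustrative assumptions, not conventions defined by this evaluation set.

```python
# Minimal sketch for logging one evaluation run per row in a CSV file.
# All field names and the output path are illustrative assumptions,
# not something defined by this evaluation set.
import csv
from dataclasses import dataclass, asdict, fields
from pathlib import Path


@dataclass
class RunRecord:
    task: str               # e.g., repository/folder name of the task
    tool: str               # e.g., "Qoder", "Cursor", "Windsurf"
    model_tier: str         # keep this comparable across tools
    interactions: int       # number of prompts/iterations you issued
    credits_or_tokens: str  # whatever unit your paid plan reports
    elapsed_minutes: float  # wall-clock time until you judged it "done"
    progress_notes: str     # how far the tool got, manual fixes needed, etc.


def append_record(record: RunRecord, path: Path = Path("results.csv")) -> None:
    """Append a single run to the CSV, writing a header row on first use."""
    is_new = not path.exists()
    with path.open("a", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=[fld.name for fld in fields(RunRecord)])
        if is_new:
            writer.writeheader()
        writer.writerow(asdict(record))


if __name__ == "__main__":
    append_record(RunRecord(
        task="example-task",
        tool="ToolA",
        model_tier="mid",
        interactions=7,
        credits_or_tokens="~120k tokens",
        elapsed_minutes=42.5,
        progress_notes="Core feature done; one manual fix to a failing import.",
    ))
```

Any equivalent spreadsheet or note format works just as well; the point is simply to capture the same fields for every tool.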
To ensure representativeness, we prioritized the languages most widely used in Qoder and proportionally included their major tech stacks, which together comprise the current evaluation set.
- Visit the organization Agentic Coding Durability Evaluation Set.
- Choose a repository for the language/stack you care about.
- Open the project folder and read Prompt.md or Prompt.zh.md.
- Run the prompt in your chosen AI coding product (e.g., Qoder, Cursor, Windsurf, Kiro, Claude Code).
- Iterate until you personally judge the task “done.”
- This evaluation set intentionally does not include automated unit/integration/UI tests to decide completion.
- We rely on your human judgment, just as in real-world work, where standards for completeness, test coverage, visual acceptance, and maintainability vary from person to person.
- Keep the model tier comparable across products whenever possible.
- Record and compare consumption within your paid plan to evaluate durability (see the comparison sketch after this list).
- Use the same or equivalent model tier across tools.
- Avoid mixing unrelated tasks in the same session.
- Reset or isolate sessions when re-running the same task on another tool.
- If supported, export chat/trace logs so others can review the process.
- Record any manual steps performed outside the tool (e.g., refactors, fixes).
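If you log runs in a structured file such as the `results.csv` sketched above, a short script can aggregate them per tool so the durability comparison is easy to read. This is again only a sketch, assuming the same illustrative column names as the logging example.

```python
# Minimal sketch for comparing logged runs per tool, assuming the
# results.csv layout from the logging sketch above (illustrative only).
import csv
from collections import defaultdict
from pathlib import Path


def summarize(path: Path = Path("results.csv")) -> None:
    """Print run count, average interactions, and average elapsed time per tool."""
    runs = defaultdict(list)
    with path.open(newline="", encoding="utf-8") as f:
        for row in csv.DictReader(f):
            runs[row["tool"]].append(row)

    for tool, rows in sorted(runs.items()):
        avg_interactions = sum(int(r["interactions"]) for r in rows) / len(rows)
        avg_minutes = sum(float(r["elapsed_minutes"]) for r in rows) / len(rows)
        print(f"{tool}: {len(rows)} runs, "
              f"avg {avg_interactions:.1f} interactions, "
              f"avg {avg_minutes:.1f} min")


if __name__ == "__main__":
    summarize()
```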
- The current coverage is limited in languages, stacks, task types, difficulty range, and number of tasks.
- While human judgment is inherently subjective, at this stage we find a comparative, human-in-the-loop evaluation more appropriate and closer to real-world usage.
Contributions are welcome to expand the coverage.
We welcome two ways to contribute:
- Open an Issue in this repository to:
- Share languages/stacks and representative tasks not yet covered here
- Share negative cases where Qoder consumed credits too quickly (if possible, include a sample project and the exact prompt)
- Submit a Pull Request to the sample repositories under the Agentic Coding Durability Evaluation Set organization to:
- Enrich and improve existing projects, task prompts, and related materials
This project is licensed under the MIT License.