This repository collects real-world coding tasks across multiple languages and tech stacks, with varying difficulty. It helps developers evaluate how durable different agentic coding products are when tackling diverse, practical programming work.
This project does not aim to crown a single “best” product. Instead, it provides a set of use cases that enable fair comparisons across products.
We continuously publish practical task sets covering multiple languages and frameworks. Each task includes a concise, tool-friendly prompt (Prompt.md or Prompt.zh.md), so you can run it in different AI coding products (e.g., Qoder, Cursor, Windsurf, Kiro, Claude Code) and observe:
- The progress each tool makes toward the goal
- The quota/credits/tokens it consumes
- The number of interactions and total elapsed time
Then, you can compare durability across products within the same model tier.
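If you want these observations to stay comparable across runs, it can help to record them in a simple structured log. Below is a minimal Python sketch of one possible way to do that; the field names, the `results.csv` path, and the example values are illustrative assumptions, not conventions defined by this evaluation set.

```python
# Minimal sketch for logging one evaluation run per row in a CSV file.
# All field names and the output path are illustrative assumptions,
# not something defined by this evaluation set.
import csv
from dataclasses import dataclass, asdict, fields
from pathlib import Path


@dataclass
class RunRecord:
    task: str               # e.g., repository/folder name of the task
    tool: str               # e.g., "Qoder", "Cursor", "Windsurf"
    model_tier: str         # keep this comparable across tools
    interactions: int       # number of prompts/iterations you issued
    credits_or_tokens: str  # whatever unit your paid plan reports
    elapsed_minutes: float  # wall-clock time until you judged it "done"
    progress_notes: str     # how far the tool got, manual fixes needed, etc.


def append_record(record: RunRecord, path: Path = Path("results.csv")) -> None:
    """Append a single run to the CSV, writing a header row on first use."""
    is_new = not path.exists()
    with path.open("a", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=[fld.name for fld in fields(RunRecord)])
        if is_new:
            writer.writeheader()
        writer.writerow(asdict(record))


if __name__ == "__main__":
    append_record(RunRecord(
        task="example-task",
        tool="ToolA",
        model_tier="mid",
        interactions=7,
        credits_or_tokens="~120k tokens",
        elapsed_minutes=42.5,
        progress_notes="Core feature done; one manual fix to a failing import.",
    ))
```

Any equivalent spreadsheet or note format works just as well; the point is simply to capture the same fields for every tool.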
To ensure representativeness, we prioritized the languages most widely used in Qoder and proportionally included their major tech stacks, which together comprise the current evaluation set.
- Visit the organization Agentic Coding Durability Evaluation Set.
- Choose a repository for the language/stack you care about.
- Open the project folder and read Prompt.md or Prompt.zh.md.
- Run the prompt in your chosen AI coding product (e.g., Qoder, Cursor, Windsurf, Kiro, Claude Code).
- Iterate until you personally judge the task “done.”
- This evaluation set intentionally does not include automated unit/integration/UI tests to decide completion.
- We rely on your human judgment, just as in real-world work, where standards for completeness, test coverage, visual acceptance, and maintainability vary from person to person.
- Keep the model tier comparable across products whenever possible.
- Record and compare consumption within your paid plan to evaluate durability (see the comparison sketch after this list).
- Use the same or equivalent model tier across tools.
- Avoid mixing unrelated tasks in the same session.
- Reset or isolate sessions when re-running the same task on another tool.
- If supported, export chat/trace logs so others can review the process.
- Record any manual steps performed outside the tool (e.g., refactors, fixes).
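If you log runs in a structured file such as the `results.csv` sketched above, a short script can aggregate them per tool so the durability comparison is easy to read. This is again only a sketch, assuming the same illustrative column names as the logging example.

```python
# Minimal sketch for comparing logged runs per tool, assuming the
# results.csv layout from the logging sketch above (illustrative only).
import csv
from collections import defaultdict
from pathlib import Path


def summarize(path: Path = Path("results.csv")) -> None:
    """Print run count, average interactions, and average elapsed time per tool."""
    runs = defaultdict(list)
    with path.open(newline="", encoding="utf-8") as f:
        for row in csv.DictReader(f):
            runs[row["tool"]].append(row)

    for tool, rows in sorted(runs.items()):
        avg_interactions = sum(int(r["interactions"]) for r in rows) / len(rows)
        avg_minutes = sum(float(r["elapsed_minutes"]) for r in rows) / len(rows)
        print(f"{tool}: {len(rows)} runs, "
              f"avg {avg_interactions:.1f} interactions, "
              f"avg {avg_minutes:.1f} min")


if __name__ == "__main__":
    summarize()
```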
- The current coverage is limited in languages, stacks, task types, difficulty range, and number of tasks.
- While human judgment is inherently subjective, at this stage we find a comparative, human-in-the-loop evaluation more appropriate and closer to real-world usage.
Contributions are welcome to expand the coverage.
We welcome two ways to contribute:
- Open an Issue in this repository to:
- Share languages/stacks and representative tasks not yet covered here
- Share negative cases where Qoder consumed credits too quickly (if possible, include a sample project and the exact prompt)
- Submit a Pull Request to the sample repositories under the Agentic Coding Durability Evaluation Set organization to:
- Enrich and improve existing projects, task prompts, and related materials
This project is licensed under the MIT License.