Agentic Coding Durability Evaluation Set


This repository collects real-world coding tasks across multiple languages and tech stacks, with varying difficulty. It helps developers evaluate how durable different agentic coding products are (i.e., how far a plan's quota, credits, or tokens stretch) when tackling diverse, practical programming work.

This project does not aim to crown a single “best” product. Instead, it provides a set of use cases that enable fair comparisons across products.

Project Content

We continuously publish practical task sets covering multiple languages and frameworks. Each task includes a concise, tool-friendly prompt (Prompt.md or Prompt.zh.md), so you can run it in different AI coding products (e.g., Qoder, Cursor, Windsurf, Kiro, Claude Code) and observe:

  • The progress each tool makes toward the goal
  • The quota/credits/tokens it consumes
  • The number of interactions and total elapsed time

Then, you can compare durability across products within the same model tier.
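The evaluation set does not prescribe a recording format, but if you want a lightweight way to log these observations, a minimal sketch could look like the following (the field names are purely illustrative and are not part of the task repositories):

```python
from dataclasses import dataclass

@dataclass
class RunRecord:
    """One run of a single task prompt in a single AI coding product."""
    tool: str                # e.g. "Qoder", "Cursor", "Windsurf", "Kiro", "Claude Code"
    model_tier: str          # keep this comparable across tools
    task: str                # which task/repository the prompt came from
    progress: str            # your own judgment of how far the tool got
    credits_consumed: float  # quota/credits/tokens reported by the tool
    interactions: int        # number of prompts/turns in the session
    elapsed_minutes: float   # total wall-clock time for the session
```

Keeping one such record per tool-and-task pair makes the later per-tool comparison straightforward.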

To ensure representativeness, we prioritized the languages most widely used in Qoder and proportionally included their major tech stacks, which together comprise the current evaluation set.

How to Use

  1. Visit the organization Agentic Coding Durability Evaluation Set.
  2. Choose a repository for the language/stack you care about.
  3. Open the project folder and read Prompt.md or Prompt.zh.md.
  4. Run the prompt in your chosen AI coding product (e.g., Qoder, Cursor, Windsurf, Kiro, Claude Code).
  5. Iterate until you personally judge the task “done.”
    • This evaluation set intentionally does not include automated unit/integration/UI tests to decide completion.
    • We rely on your human judgment, just as in real-world work, where expectations for completeness, test coverage, visual acceptance, and maintainability vary from person to person.
  6. Keep the model tier comparable across products whenever possible.
  7. Record and compare consumption within your paid plan to evaluate durability (see the sketch after this list).
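
As a purely illustrative sketch (the evaluation set itself ships no tooling for this), per-tool consumption could be aggregated like this, assuming each run is logged with the fields from the record sketch above:

```python
def durability_summary(runs: list[dict]) -> dict[str, dict[str, float]]:
    """Sum consumption per tool so runs on the same model tier can be compared.

    Each run is a dict with keys matching the record sketch above:
    "tool", "credits_consumed", "interactions", "elapsed_minutes".
    """
    totals: dict[str, dict[str, float]] = {}
    for run in runs:
        t = totals.setdefault(
            run["tool"],
            {"tasks": 0.0, "credits": 0.0, "interactions": 0.0, "minutes": 0.0},
        )
        t["tasks"] += 1
        t["credits"] += run["credits_consumed"]
        t["interactions"] += run["interactions"]
        t["minutes"] += run["elapsed_minutes"]
    return totals
```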

Fairness and Rigor Suggestions

  • Use the same or equivalent model tier across tools.
  • Avoid mixing unrelated tasks in the same session.
  • Reset or isolate sessions when re-running the same task on another tool.
  • If supported, export chat/trace logs so others can review the process.
  • Record any manual steps performed outside the tool (e.g., refactors, fixes).

Limitations

  • The current coverage of languages, stacks, task types, difficulty range, and counts is limited.
  • While human judgment is inherently subjective, at this stage we find a comparative, human-in-the-loop evaluation more appropriate and closer to real-world usage.

Contributions to expand this coverage are welcome.

Contributing

We welcome two ways to contribute:

  • Open an Issue in this repository to:
    • Share languages/stacks and representative tasks not yet covered here
    • Share negative cases where Qoder consumed credits too quickly (if possible, include a sample project and the exact prompt)
  • Submit a Pull Request to the sample repositories under the Agentic Coding Durability Evaluation Set organization to:
    • Enrich and improve existing projects, task prompts, and related materials

License

This project is licensed under the MIT License.
