AGENTISSUE-BENCH

AGENTISSUE-BENCH is the first reproducible issue resolution benchmark focused on real-world agent system issues. It is designed to evaluate the efficacy of state-of-the-art software engineering (SE) agents in resolving these issues.

🗓️ Updates

2025-05: Initial benchmark release

📚 Benchmark Dataset

Through a multi-step filtering process (failure reproduction, patch reproduction, and non-flakiness verification), we collected 50 reproducible agent issues, which form AGENTISSUE-BENCH.
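The three filtering checks can be pictured with a short sketch. This is only an illustration under our own assumptions, not the benchmark's actual tooling: `keep_issue`, `test_on_buggy`, `test_on_patched`, and `repeats` are hypothetical names, and we assume that each check boils down to running an issue's failure-triggering test on the buggy and the patched version of the code.

```python
from typing import Callable


def keep_issue(test_on_buggy: Callable[[], bool],
               test_on_patched: Callable[[], bool],
               repeats: int = 3) -> bool:
    """Return True if an issue passes all three (assumed) filtering checks.

    Each callable is a hypothetical hook that runs the issue's test once
    and returns True when the test passes.
    """
    if repeats < 1:
        raise ValueError("repeats must be at least 1")

    buggy_runs = [test_on_buggy() for _ in range(repeats)]
    patched_runs = [test_on_patched() for _ in range(repeats)]

    # Failure reproduction: the test fails on the buggy version.
    failure_reproduced = not buggy_runs[0]
    # Patch reproduction: the test passes once the developer patch is applied.
    patch_reproduced = patched_runs[0]
    # Non-flakiness: repeated runs always give the same outcome.
    non_flaky = len(set(buggy_runs)) == 1 and len(set(patched_runs)) == 1

    return failure_reproduced and patch_reproduced and non_flaky
```

An issue whose test sometimes passes and sometimes fails on the same code would be rejected by the last check, even if the first run looked reproducible.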

Each issue is containerized as a Docker image and hosted on Docker Hub: 🔗 Docker Hub Repository

To retrieve the images for all issues, run:

$ python pull_images.py

To pull a specific image by tag, use:

$ python pull_images.py --tag <tag>
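For readers who prefer to see what pulling by tag amounts to, here is a minimal sketch that builds the equivalent `docker pull` commands. It is not the content of `pull_images.py`; the function names are hypothetical, and the repository name and tags in the usage note are placeholders that must be replaced with the real values from the Docker Hub repository.

```python
import subprocess
from typing import List


def build_pull_commands(repository: str, tags: List[str]) -> List[List[str]]:
    """Build one `docker pull <repository>:<tag>` command per tag."""
    return [["docker", "pull", f"{repository}:{tag}"] for tag in tags]


def pull_images(repository: str, tags: List[str], dry_run: bool = True) -> None:
    """Print the commands (dry run) or execute them with the Docker CLI."""
    for command in build_pull_commands(repository, tags):
        if dry_run:
            print(" ".join(command))
        else:
            # check=True raises CalledProcessError if a pull fails.
            subprocess.run(command, check=True)
```

Calling `pull_images("example-org/example-repo", ["tag-a"])` only prints the command; pass `dry_run=False` to run it, which requires a working Docker installation.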

To remove all pulled Docker images and containers, run:

$ python remove_images.py

To remove a specific image and container by tag:

$ python remove_images.py --tag <tag>

📊 Results

Overall Resolution Rate

The following figure compares the resolution rate on AgentIssue-Bench with that on traditional software issues:

The following table presents the overall results of SE agents on AgentIssue-Bench:

The following figure shows the distribution of AgentIssue-Bench:
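As a reading aid for the results, the sketch below shows one plausible way to compute a resolution rate. We assume here that the rate is the fraction of benchmark issues whose generated patch resolves the issue; the function name and the issue identifiers are hypothetical and the outcomes are made up purely for illustration, not taken from the evaluation.

```python
from typing import Dict


def resolution_rate(outcomes: Dict[str, bool]) -> float:
    """Fraction of issues marked as resolved (True) in `outcomes`."""
    if not outcomes:
        raise ValueError("outcomes must contain at least one issue")
    return sum(outcomes.values()) / len(outcomes)


# Hypothetical outcomes for four placeholder issues (not real results).
example_outcomes = {
    "issue-a": True,
    "issue-b": False,
    "issue-c": False,
    "issue-d": True,
}
example_rate = resolution_rate(example_outcomes)  # 2 resolved out of 4
```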

🔧 Patch Generation

We evaluate three state-of-the-art SE agents (Agentless, AutoCodeRover, and SWE-agent) on AGENTISSUE-BENCH and collect the patches they generate to resolve real-world agent issues.

🛠️ Setup Instructions

1. Clone the Repository

$ git clone https://github.com/To-D/AgentIssue-Bench.git

2. Run the Studied SE Agents

Note: Download the repo folder from 🔗Repo Link, extract the archive, and place the repo/ folder in the root directory of both Agentless and AutoCodeRover before generating patches.

Agentless

$ cd Agentless
$ conda create -n agentless python=3.12
$ conda activate agentless
$ chmod +x run_agentless.sh
$ ./run_agentless.sh

AutoCodeRover

$ cd auto-code-rover
$ conda create -n auto-code-rover python=3.12
$ conda activate auto-code-rover
$ python run_autocoderover.py

SWE-agent

$ cd SWE-agent
$ conda create -n swe_agent python=3.12
$ conda activate swe_agent
$ chmod +x gen_patches_all.sh
$ ./gen_patches_all.sh

📁 Generated Patches

The Generated Patches directory contains all patches produced in our evaluation of the SE agents with different large language models (LLMs). The patches are organized as follows:

Generated Patches/
├── swe-agent/         # Patches generated by SWE-agent
├── Agentless/         # Patches generated by Agentless
└── Auto-code-rover/   # Patches generated by Auto-code-rover

Each agent directory contains patches generated using two state-of-the-art LLMs:

claude-3-5-sonnet-20241022
gpt-4o
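To get a quick overview of a local copy of the patches, a small script along these lines can count the files under each agent directory. It is a sketch, not part of the repository: the function name is hypothetical, and it makes no assumption about how files are arranged below each agent directory, so it simply counts every file recursively.

```python
from collections import Counter
from pathlib import Path


def count_files_per_agent(root: Path) -> Counter:
    """Count all files found (recursively) under each top-level directory of `root`."""
    counts: Counter = Counter()
    for agent_dir in sorted(p for p in root.iterdir() if p.is_dir()):
        counts[agent_dir.name] = sum(
            1 for f in agent_dir.rglob("*") if f.is_file()
        )
    return counts
```

For example, `count_files_per_agent(Path("Generated Patches"))` returns a mapping from directory names such as `Agentless` to the number of files found beneath them, assuming the script is run from the repository root.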
