# SWE-Compass: Unified Evaluation Benchmark for Agentic Coding

## 🧭 Overview
SWE-Compass is a unified benchmark and dataset designed to evaluate Agentic Coding capabilities of large language models (LLMs) in realistic, multi-step software engineering workflows.
It bridges the gap between conventional static code-generation benchmarks and real-world, tool-driven development processes.
Each instance corresponds to a reproducible issue-fixing or feature-implementation task that can be executed end-to-end: cloning a repository, applying patches, running tests, and verifying solutions.
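As an illustration, the end-to-end loop for a single instance can be sketched roughly as follows. This is a simplified sketch, not the actual `scripts/run_instance.py` implementation; the `source`, `evaluation.setup_cmds`, `evaluation.test_cmd`, and `evaluation.timeout_sec` fields come from the instance schema shown further below.

```python
import subprocess
import tempfile

def run_instance_sketch(instance: dict, patch_text: str) -> bool:
    """Clone the repo, apply a candidate patch, and run the instance's tests.

    Simplified sketch only: the official runner additionally handles sandboxing,
    per-language setup, and failure classification.
    """
    src = instance["source"]
    workdir = tempfile.mkdtemp(prefix=instance["instance_id"] + "_")

    # 1. Clone the repository and pin it to the task's commit.
    subprocess.run(["git", "clone", f"https://github.com/{src['repo']}.git", workdir], check=True)
    subprocess.run(["git", "checkout", src["commit"]], cwd=workdir, check=True)

    # 2. Apply the candidate patch produced by the agent under evaluation.
    subprocess.run(["git", "apply", "-"], input=patch_text.encode(), cwd=workdir, check=True)

    # 3. Install dependencies, then run the verifying test command.
    for cmd in instance["evaluation"]["setup_cmds"]:
        subprocess.run(cmd, shell=True, cwd=workdir, check=True)
    result = subprocess.run(
        instance["evaluation"]["test_cmd"],
        shell=True,
        cwd=workdir,
        timeout=instance["evaluation"]["timeout_sec"],
    )
    return result.returncode == 0  # True means the instance is solved
```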
## Key Features
- ✅ 2,000 curated tasks from real GitHub issues and pull requests.
- 8 task types × 8 scenarios × 10 programming languages for comprehensive coverage (see the tally sketch after this list).
- Fully reproducible pipeline including setup scripts, environment dependencies, and test suites.
- Multi-dimensional evaluation of correctness, reasoning trace, and agentic efficiency.
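A quick way to inspect this coverage locally, assuming the `task_type`, `scenario`, and `language` fields documented in the schema below:

```python
from collections import Counter

from datasets import load_dataset

ds = load_dataset("Kwaipilot/SWE-Compass", split="test")

# Tally how many instances fall into each coverage dimension.
for field in ("task_type", "scenario", "language"):
    print(field, Counter(ds[field]).most_common())
```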
## 📁 Dataset Structure

```
SWE-Compass/
├── data/
│   ├── test.jsonl          # main evaluation set (~2,000 instances)
│   ├── dev.jsonl           # optional validation split
│   └── train.jsonl         # optional training data
├── scripts/
│   ├── setup_env.sh        # environment setup (dependency installation)
│   ├── run_instance.py     # run one instance end-to-end
│   └── eval_aggregate.py   # aggregate evaluation metrics
└── README.md
```
Each line of `test.jsonl` is one JSON object:

```json
{
  "instance_id": "compass_01234",
  "task_type": "bug_fixing",
  "scenario": "mono_repo_ci",
  "language": "python",
  "difficulty": "medium",
  "source": {
    "repo": "owner/project",
    "commit": "abcdef123456",
    "issue_or_pr": "PR#13091",
    "gh_url": "https://github.com/owner/project/pull/13091"
  },
  "instruction": "Fix failing test in module X caused by Y...",
  "context_files": ["path/to/file1.py", "path/to/file2.py"],
  "tools_available": ["git", "pytest", "bash"],
  "evaluation": {
    "setup_cmds": ["pip install -e .", "pytest -q"],
    "test_cmd": "pytest -q",
    "timeout_sec": 1800
  },
  "reference_patch": "diff --git a/... b/...",
  "verified": true
}
```
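The raw JSONL can also be read directly and the fields relevant for evaluation pulled out; a minimal sketch, assuming the file layout shown above:

```python
import json

# Load the evaluation split directly from the raw JSONL file.
with open("data/test.jsonl") as f:
    instances = [json.loads(line) for line in f]

# Look up the example instance used elsewhere in this README.
inst = next(i for i in instances if i["instance_id"] == "compass_01234")
print(inst["task_type"], inst["language"], inst["difficulty"])
print("setup:", inst["evaluation"]["setup_cmds"])
print("verify with:", inst["evaluation"]["test_cmd"])
```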
## 🧪 Usage

### Load via 🤗 `datasets`

```python
from datasets import load_dataset

dataset = load_dataset("Kwaipilot/SWE-Compass", split="test")
print(len(dataset), dataset[0].keys())
```
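To evaluate on a subset, the split can be filtered by any schema field, for example language and task type; a minimal sketch:

```python
# Keep only Python bug-fixing tasks, then peek at a few instructions.
subset = dataset.filter(
    lambda ex: ex["language"] == "python" and ex["task_type"] == "bug_fixing"
)
for ex in subset.select(range(min(3, len(subset)))):
    print(ex["instance_id"], ex["instruction"][:80])
```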
### Run Local Evaluation
```bash
bash scripts/setup_env.sh
python scripts/run_instance.py --data data/test.jsonl --instance_id compass_01234
python scripts/eval_aggregate.py --data data/test.jsonl --runs ./runs/your_model_outputs
```
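To sweep the whole split rather than a single instance, the per-instance runner can be driven in a loop; a sketch, assuming `run_instance.py` takes one `--instance_id` per call as shown above:

```python
import json
import subprocess

# Collect all instance ids from the evaluation split.
with open("data/test.jsonl") as f:
    ids = [json.loads(line)["instance_id"] for line in f]

# Run every instance through the provided runner script.
for instance_id in ids:
    subprocess.run(
        ["python", "scripts/run_instance.py",
         "--data", "data/test.jsonl",
         "--instance_id", instance_id],
        check=False,  # keep going even if a single instance fails
    )
```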
## 📊 Metrics
| Category | Metric | Description |
|---|---|---|
| Main | Solved@1 / Solved@k | Fraction of tasks solved within k attempts |
| Process | Tool Calls / Latency | Efficiency and reasoning stability |
| Failure Types | Build Error / Test Fail / Timeout | Root-cause classification |
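For reference, Solved@k can be estimated from n attempts per instance with the standard unbiased pass@k-style estimator; a sketch, not necessarily the exact aggregation performed by `eval_aggregate.py`:

```python
from math import comb

def solved_at_k(n: int, c: int, k: int) -> float:
    """Unbiased estimate of Solved@k given n attempts, c of which solved the task."""
    if n - c < k:
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

# Average over instances: each entry is (attempts made, attempts that solved the task).
results = [(5, 2), (5, 0), (5, 5)]
print(sum(solved_at_k(n, c, k=1) for n, c in results) / len(results))
```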
## 🧩 Citation

```bibtex
@article{xu2025swecompass,
  title   = {SWE-Compass: Towards Unified Evaluation of Agentic Coding Abilities for Large Language Models},
  author  = {Jingxuan Xu and others},
  journal = {arXiv preprint arXiv:2511.05459},
  year    = {2025}
}
```
## 🤝 Contributing
We welcome community contributions: new verified instances, environment fixes, or evaluation scripts. Please open a pull request or issue on this repository.