
SWE-Compass: Unified Evaluation Benchmark for Agentic Coding


🧭 Overview

SWE-Compass is a unified benchmark and dataset designed to evaluate the Agentic Coding capabilities of large language models (LLMs) in realistic, multi-step software engineering workflows.

It bridges the gap between conventional static code-generation benchmarks and real-world, tool-driven development processes.
Each instance corresponds to a reproducible issue-fixing or feature-implementation task that can be executed end-to-end: cloning a repository, applying patches, running tests, and verifying solutions.

Key Features

  • ≈ 2,000 curated tasks from real GitHub issues and pull requests.
  • 8 task types × 8 scenarios × 10 programming languages for comprehensive coverage.
  • Fully reproducible pipeline including setup scripts, environment dependencies, and test suites.
  • Multi-dimensional evaluation of correctness, reasoning traces, and agentic efficiency.

πŸ“ Dataset Structure

SWE-Compass/
├─ data/
│  ├─ test.jsonl              # main evaluation set (~2,000 instances)
│  ├─ dev.jsonl               # optional validation split
│  └─ train.jsonl             # optional training data
├─ scripts/
│  ├─ setup_env.sh            # environment setup (dependency installation)
│  ├─ run_instance.py         # run one instance end-to-end
│  └─ eval_aggregate.py       # aggregate evaluation metrics
└─ README.md
Each line of the JSONL files is a single instance, for example:

{
  "instance_id": "compass_01234",
  "task_type": "bug_fixing",
  "scenario": "mono_repo_ci",
  "language": "python",
  "difficulty": "medium",
  "source": {
    "repo": "owner/project",
    "commit": "abcdef123456",
    "issue_or_pr": "PR#13091",
    "gh_url": "https://github.com/owner/project/pull/13091"
  },
  "instruction": "Fix failing test in module X caused by Y...",
  "context_files": ["path/to/file1.py", "path/to/file2.py"],
  "tools_available": ["git", "pytest", "bash"],
  "evaluation": {
    "setup_cmds": ["pip install -e .", "pytest -q"],
    "test_cmd": "pytest -q",
    "timeout_sec": 1800
  },
  "reference_patch": "diff --git a/... b/...",
  "verified": true
}
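
A minimal sketch of working with the raw JSONL directly, bucketing instances by task_type and language (field names follow the schema above; the data/test.jsonl path follows the layout shown earlier):

import json
from collections import Counter

# Assumes the directory layout above: data/test.jsonl with one JSON object per line.
by_task, by_lang = Counter(), Counter()
with open("data/test.jsonl", encoding="utf-8") as f:
    for line in f:
        inst = json.loads(line)
        by_task[inst["task_type"]] += 1
        by_lang[inst["language"]] += 1

print("task types:", by_task.most_common())
print("languages:", by_lang.most_common())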

🧪 Usage

Load via 🤗 datasets

from datasets import load_dataset

dataset = load_dataset("Kwaipilot/SWE-Compass", split="test")
print(len(dataset), dataset[0].keys())
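
To evaluate on a slice of the benchmark, the split can be filtered on its metadata fields, for example keeping only Python bug-fixing tasks (a sketch; the field names follow the instance schema above):

python_bugs = dataset.filter(
    lambda ex: ex["language"] == "python" and ex["task_type"] == "bug_fixing"
)
print(len(python_bugs))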

Run Local Evaluation

bash scripts/setup_env.sh
python scripts/run_instance.py --data data/test.jsonl --instance_id compass_01234
python scripts/eval_aggregate.py --data data/test.jsonl --runs ./runs/your_model_outputs
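
Conceptually, evaluating a single instance means pinning the repository to the listed commit, applying a candidate patch, running the instance's setup and test commands, and treating a passing test suite as success. The helper below is an illustrative sketch of that flow, not the actual run_instance.py implementation:

import subprocess

def check_instance(inst, patch_path, workdir):
    """Illustrative end-to-end check for one instance (not the official harness)."""
    ev = inst["evaluation"]
    # Pin the repository to the instance's reference commit.
    subprocess.run(["git", "checkout", inst["source"]["commit"]], cwd=workdir, check=True)
    # Apply the candidate patch produced by the model under test.
    subprocess.run(["git", "apply", patch_path], cwd=workdir, check=True)
    # Install dependencies / prepare the environment.
    for cmd in ev["setup_cmds"]:
        subprocess.run(cmd, shell=True, cwd=workdir)
    # Run the test command; a zero exit code within the timeout counts as solved.
    try:
        result = subprocess.run(ev["test_cmd"], shell=True, cwd=workdir,
                                timeout=ev["timeout_sec"])
    except subprocess.TimeoutExpired:
        return False
    return result.returncode == 0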

📊 Metrics

Category        Metric                               Description
Main            Solved@1 / Solved@k                  Fraction of tasks solved within k attempts
Process         Tool Calls / Latency                 Efficiency and reasoning stability
Failure Types   Build Error / Test Fail / Timeout    Root-cause classification
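
As an illustration of the headline metric, Solved@k can be computed from per-instance pass/fail records over k independent attempts (the record format below is hypothetical):

def solved_at_k(records, k):
    # records: {instance_id: [bool, ...]}, one entry per attempt (hypothetical format).
    # An instance counts as solved if any of its first k attempts passes.
    solved = sum(1 for attempts in records.values() if any(attempts[:k]))
    return solved / len(records)

runs = {"compass_01234": [False, True], "compass_05678": [False, False]}
print(solved_at_k(runs, 1), solved_at_k(runs, 2))  # 0.0 0.5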

🧩 Citation

@article{xu2025swecompass,
  title   = {SWE-Compass: Towards Unified Evaluation of Agentic Coding Abilities for Large Language Models},
  author  = {Jingxuan Xu and others},
  journal = {arXiv preprint arXiv:2511.05459},
  year    = {2025}
}

🤝 Contributing

We welcome community contributions: new verified instances, environment fixes, or evaluation scripts. Please open a pull request or issue on this repository.
