A Python toolkit for loading, unifying, and enriching Vulnerability-Fixing Commit (VFC) datasets.
Plenty of VFC datasets exist, but they differ in schema, content, and completeness. Some were built years ago and contain stale data, such as project URLs that have since moved between hosting platforms. `vfc_datasets` loads the datasets through a single interface and yields a shared `DatasetEntry`. Transformations let you deduplicate, filter, sanitize, and enrich entries with commit data (message, diff, files changed, timestamp).
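To make the deduplication idea concrete, here is a minimal, self-contained sketch. It does not use the toolkit's API: `Entry`, `normalize_url`, and `deduplicate` are hypothetical names invented for this illustration, and the real `DatasetEntry` fields may differ. The sketch assumes that two entries refer to the same commit when their normalized repository URL and commit SHA match.

```python
from dataclasses import dataclass
from typing import Iterable, Iterator


@dataclass(frozen=True)
class Entry:
    """Hypothetical stand-in for a unified entry (not the toolkit's DatasetEntry)."""
    repo_url: str
    commit_sha: str
    is_vfc: bool
    source: str  # name of the dataset the entry came from


def normalize_url(url: str) -> str:
    """Lower-case the URL and drop a trailing slash and a '.git' suffix."""
    url = url.strip().lower().rstrip("/")
    if url.endswith(".git"):
        url = url[:-4]
    return url


def deduplicate(entries: Iterable[Entry]) -> Iterator[Entry]:
    """Yield the first entry seen for each (repository, commit) pair."""
    seen: set[tuple[str, str]] = set()
    for entry in entries:
        key = (normalize_url(entry.repo_url), entry.commit_sha.lower())
        if key not in seen:
            seen.add(key)
            yield entry


# Made-up entries: the first two point at the same commit via different URL spellings.
entries = [
    Entry("https://github.com/example/project", "abc123", True, "dataset_a"),
    Entry("https://github.com/Example/Project.git", "ABC123", True, "dataset_b"),
    Entry("https://github.com/example/other/", "def456", False, "dataset_b"),
]
unique = list(deduplicate(entries))
print(len(unique))
```

Keeping the first occurrence is only one possible policy; when datasets disagree on a label, a merge rule may be more appropriate than dropping later entries.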
Open the repo in VS Code and run Reopen in Container to use the provisioned devcontainer.
Alternatively, install into your own environment (requires Python 3.12+):
`pip install -e .`

All settings have sensible defaults. To override them, copy `.env.example` to `.env` and adjust as needed.
See `examples/` for scripts covering loading, combining, transforming, and enriching entries with commit data.
Commit-level (20 datasets)
| Year | Dataset | VFCs | Non-VFCs | Projects | Paper |
|---|---|---|---|---|---|
| 2017 | SecBench | 676 | 0 | 248 | link |
| 2019 | Devign | 10,894 | 14,978 | 2 | link |
| 2019 | MSR2019 | 1,282 | 0 | 205 | link |
| 2020 | BigVul | 4,432 | 0 | 348 | link |
| 2021 | CC900 | 3,765 | 6,347 | 910 | link |
| 2021 | CVEFixes | 13,297 | 0 | 4,249 | link |
| 2021 | CrossVul | 5,877 | 0 | 1,675 | link |
| 2021 | PatchDB | 10,691 | 23,742 | 313 | link |
| 2021 | SPIDB | 10,894 | 14,979 | 2 | link |
| 2021 | TQRG | 8,057 | 110,161 | 1,339 | link |
| 2022 | Tracer | 3,017 | 0 | — | link |
| 2022 | VCMatch | 1,669 | 0 | 10 | link |
| 2022 | VUDEnc | 1,009 | 0 | — | link |
| 2023 | PySecDB | 1,142 | 2,721 | 351 | link |
| 2024 | JavaVFC | 784 | 0 | 263 | link |
| 2024 | JavaVFCExtended | 16,837 | 0 | 2,532 | link |
| 2024 | Morefixes | 35,130 | 0 | 6,945 | link |
| 2024 | RepoSPD | 18,127 | 31,397 | 348 | link |
| 2025 | FixSeekerBalanced | 9,916 | 10,979 | 2,094 | link |
| 2025 | FixSeekerImbalanced | 9,915 | 499,150 | 2,094 | link |
Function-level (6 datasets)
| Year | Dataset | Vuln. Fns | Benign Fns | Projects | Paper |
|---|---|---|---|---|---|
| 2023 | DiverseVul | 18,945 | 311,547 | 797 | link |
| 2023 | SVEN | 800 | 0 | — | link |
| 2024 | CleanVul | 8,198 | 0 | — | link |
| 2024 | MegaVul | 20,267 | 367,147 | 992 | link |
| 2024 | PrimeVul | 6,004 | 218,529 | 755 | link |
| 2025 | ICVul | 6,276 | 9,120 | 807 | link |
Issues and PRs are very welcome, especially for new VFC datasets:
- Adding one: see `vfc_datasets/commit_level/` and `vfc_datasets/function_level/` for reference implementations.
- Suggesting one: open an issue pointing to the dataset.
If you use any of these datasets for your research, please cite the original dataset authors. Paper titles and URLs are available via `DatasetClass.metadata.paper_url`.
If you find our toolkit useful, please consider citing it as:
TODO