UzL-ITS/vfc_datasets

VFC Datasets

A Python toolkit for loading, unifying, and enriching Vulnerability-Fixing Commit (VFC) datasets.

Plenty of VFC datasets exist, but they differ in schema, content, and completeness. Some were built years ago and contain stale data, such as project URLs that have since moved between hosting platforms. vfc_datasets loads the datasets through a single interface and yields a shared DatasetEntry. Transformations let you deduplicate, filter, sanitize, and enrich entries with commit data (message, diff, files changed, timestamp).
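The idea of unifying and deduplicating entries can be sketched in a few lines. This is only an illustration under stated assumptions, not the toolkit's API: the `Entry` class, its fields, and the helpers `normalize_url`, `deduplicate`, and `only_vfcs` are hypothetical stand-ins, and the repository URLs are made up. It also assumes that repository URLs can be compared case-insensitively, which holds for GitHub but may not hold for every host.

```python
from dataclasses import dataclass
from typing import Iterable, Iterator


@dataclass(frozen=True)
class Entry:
    """Hypothetical unified record; NOT the toolkit's DatasetEntry."""
    repo_url: str
    commit_sha: str
    is_vfc: bool
    source: str  # name of the dataset the entry came from


def normalize_url(url: str) -> str:
    """Lower-case the URL and drop a trailing slash or '.git' suffix."""
    url = url.strip().lower().rstrip("/")
    return url[:-4] if url.endswith(".git") else url


def deduplicate(entries: Iterable[Entry]) -> Iterator[Entry]:
    """Yield the first entry seen for each (repository, commit) pair."""
    seen: set[tuple[str, str]] = set()
    for entry in entries:
        key = (normalize_url(entry.repo_url), entry.commit_sha.lower())
        if key not in seen:
            seen.add(key)
            yield entry


def only_vfcs(entries: Iterable[Entry]) -> Iterator[Entry]:
    """Keep only entries labeled as vulnerability-fixing commits."""
    return (entry for entry in entries if entry.is_vfc)


# Made-up sample: the first two entries describe the same commit
# but come from different datasets with differently formatted URLs.
sample = [
    Entry("https://github.com/example/project.git", "ABC123", True, "dataset_a"),
    Entry("https://github.com/example/project/", "abc123", True, "dataset_b"),
    Entry("https://github.com/example/other", "def456", False, "dataset_b"),
]

unique_vfcs = list(only_vfcs(deduplicate(sample)))
for entry in unique_vfcs:
    print(entry.source, entry.repo_url, entry.commit_sha)
```

Because both helpers are generators, they can be chained without materializing intermediate lists, which matters for datasets with hundreds of thousands of entries. See examples/ for the toolkit's actual interface.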

Installation

Open the repo in VS Code and run Reopen in Container to use the provisioned devcontainer.

Alternatively, install into your own environment (requires Python 3.12+):

pip install -e .

All settings have sensible defaults. To override them, copy .env.example to .env and adjust as needed.

Quick Start

See examples/ for scripts covering loading, combining, transforming, and enriching entries with commit data.

Supported Datasets

Commit-level (20 datasets)

| Year | Dataset | VFCs | Non-VFCs | Projects | Paper |
|------|---------|------|----------|----------|-------|
| 2017 | SecBench | 676 | 0 | 248 | link |
| 2019 | Devign | 10,894 | 14,978 | 2 | link |
| 2019 | MSR2019 | 1,282 | 0 | 205 | link |
| 2020 | BigVul | 4,432 | 0 | 348 | link |
| 2021 | CC900 | 3,765 | 6,347 | 910 | link |
| 2021 | CVEFixes | 13,297 | 0 | 4,249 | link |
| 2021 | CrossVul | 5,877 | 0 | 1,675 | link |
| 2021 | PatchDB | 10,691 | 23,742 | 313 | link |
| 2021 | SPIDB | 10,894 | 14,979 | 2 | link |
| 2021 | TQRG | 8,057 | 110,161 | 1,339 | link |
| 2022 | Tracer | 3,017 | 0 | | link |
| 2022 | VCMatch | 1,669 | 0 | 10 | link |
| 2022 | VUDEnc | 1,009 | 0 | | link |
| 2023 | PySecDB | 1,142 | 2,721 | 351 | link |
| 2024 | JavaVFC | 784 | 0 | 263 | link |
| 2024 | JavaVFCExtended | 16,837 | 0 | 2,532 | link |
| 2024 | Morefixes | 35,130 | 0 | 6,945 | link |
| 2024 | RepoSPD | 18,127 | 31,397 | 348 | link |
| 2025 | FixSeekerBalanced | 9,916 | 10,979 | 2,094 | link |
| 2025 | FixSeekerImbalanced | 9,915 | 499,150 | 2,094 | link |
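
Function-level (6 datasets)

| Year | Dataset | Vuln. Fns | Benign Fns | Projects | Paper |
|------|---------|-----------|------------|----------|-------|
| 2023 | DiverseVul | 18,945 | 311,547 | 797 | link |
| 2023 | SVEN | 800 | 0 | | link |
| 2024 | CleanVul | 8,198 | 0 | | link |
| 2024 | MegaVul | 20,267 | 367,147 | 992 | link |
| 2024 | PrimeVul | 6,004 | 218,529 | 755 | link |
| 2025 | ICVul | 6,276 | 9,120 | 807 | link |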

Contributing

Issues and PRs are very welcome, especially for new VFC datasets.

Citation

If you use any of these datasets in your research, please cite the original dataset authors. Paper titles and URLs are available via DatasetClass.metadata.paper_url.

If you find our toolkit useful, please consider citing it as:

TODO
