This project is designed to extract, analyze, and filter CVEs (Common Vulnerabilities and Exposures) from the National Vulnerability Database (NVD) and then associate them with fix commits from GitHub open-source repositories. By using the SZZ algorithm, it identifies Vulnerability Contributing Commits (VCCs) and extracts relevant commit information.
This dataset focuses on improving data quality:
- ✅ Based on real-world CVEs from NVD linked to GitHub fix commits
- 🧠 Enriched with comprehensive metadata at the function, commit, and file levels
- 🔁 Uses the SZZ algorithm to trace Vulnerability-Contributing Commits (VCCs)
- 🔍 Introduces a novel ESC (Eliminate Suspicious Commit) technique to ensure label reliability
This project processes and analyzes data from CVE, CWE, and related repositories to generate mappings between vulnerabilities and their associated commits, repositories, and files. The pipeline involves several steps: extracting repository information, processing and filtering commit data, and removing noise to focus on relevant information.
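As a rough illustration of the repository-extraction step, the sketch below splits a GitHub commit link (the kind that appears among NVD references) into a repository URL and a commit hash. The names `parse_fix_commit_url` and `FixCommitRef` and the regular expression are assumptions made for this example, not the project's actual code.

```python
import re
from typing import NamedTuple, Optional

# Matches https://github.com/<owner>/<repo>/commit/<sha> with a 7 to 40 character hex hash.
_COMMIT_URL = re.compile(
    r"^https?://github\.com/(?P<owner>[\w.-]+)/(?P<repo>[\w.-]+)"
    r"/commit/(?P<sha>[0-9a-fA-F]{7,40})"
)


class FixCommitRef(NamedTuple):
    """Hypothetical record: repository URL plus fix-commit hash."""
    repo_url: str
    fc_hash: str


def parse_fix_commit_url(url: str) -> Optional[FixCommitRef]:
    """Return the repository URL and commit hash, or None for non-commit links."""
    match = _COMMIT_URL.match(url.strip())
    if match is None:
        return None
    repo_url = f"https://github.com/{match['owner']}/{match['repo']}"
    return FixCommitRef(repo_url, match["sha"].lower())
```

References that are not commit links (issues, advisories, release pages) return `None`, so they can be dropped before any repository metadata is fetched.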
Ensure you have Python installed along with the following packages:
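`pandas`, `pydriller`, `github`, `git` (the `github` and `git` modules are typically installed from PyPI as PyGithub and GitPython, respectively)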
Set up a valid GitHub token to enable the script to fetch repository information. Replace the placeholder in the script:
- `github_token = 'your_github_token'`
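If you would rather not store the token in the source file, one option is to read it from the environment. This is only a sketch: the `load_github_token` helper and the `GITHUB_TOKEN` variable name are assumptions for illustration, and the script itself expects the `github_token` assignment shown above.

```python
import os


def load_github_token(env_var: str = "GITHUB_TOKEN") -> str:
    """Read the token from an environment variable (the variable name is an assumption)."""
    token = os.environ.get(env_var, "").strip()
    # Reject a missing value and the README placeholder alike.
    if not token or token == "your_github_token":
        raise RuntimeError(f"Set {env_var} to a valid GitHub token before running.")
    return token
```

With this helper, the assignment in the script could become `github_token = load_github_token()`.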
This project automates the extraction of CVEs from the NVD and identifies the fix commits that resolve those vulnerabilities in open-source software repositories. It then uses the SZZ algorithm to trace the VCCs (Vulnerability-Contributing Commits), the earlier commits that introduced the vulnerable code, which supports vulnerability analysis and remediation.
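The core idea of SZZ can be shown with a toy model: take the lines that the fix commit deleted or changed, look up which commit last touched each of them (the blame), and treat those commits as VCC candidates. The sketch below uses plain dictionaries in place of a real repository; `trace_vccs` and its inputs are hypothetical, and the project's own implementation may differ. On a real repository the blame would come from `git blame` on the parent of the fix commit; PyDriller exposes a similar lookup as `Git.get_commits_last_modified_lines`.

```python
from collections import defaultdict
from typing import Dict, Iterable, Set, Tuple

# (file path, line number in the parent of the fix commit)
Line = Tuple[str, int]


def trace_vccs(deleted_lines: Iterable[Line], blame: Dict[Line, str]) -> Dict[str, Set[str]]:
    """Map each file touched by the fix to the commits that last modified its deleted lines."""
    vccs: Dict[str, Set[str]] = defaultdict(set)
    for path, lineno in deleted_lines:
        commit = blame.get((path, lineno))
        if commit is not None:  # lines without blame information are skipped
            vccs[path].add(commit)
    return dict(vccs)


# Toy data: blame of the pre-fix revision and the lines removed by the fix.
blame = {
    ("src/parse.c", 10): "aaa111",
    ("src/parse.c", 11): "bbb222",
    ("src/parse.c", 12): "aaa111",
    ("src/util.c", 5): "ccc333",
}
deleted = [("src/parse.c", 10), ("src/parse.c", 12), ("src/util.c", 5)]
candidates = trace_vccs(deleted, blame)
```

Here `candidates` maps `src/parse.c` to `{"aaa111"}` and `src/util.c` to `{"ccc333"}`; commit `bbb222` is not a candidate because the fix left line 11 untouched.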
- Execution:
Update `step = 1` in `main.py` (there are 3 steps, to be run in order) and run the following command: `python main.py`
- `data_structure`: Defines the structure of the five tables.
- `vcc_extraction`: Handles the fetching of VCC URLs for commits (including the SZZ algorithm).
- `commit_extraction`: Handles the fetching of files and functions.
- `repo_extraction`: Provides methods to extract repository metadata and validate repository URLs.
- `fc_filter_ESC`: Implements filtering logic to eliminate suspicious or noisy commits.
- `utils`: Provides utility functions, including parallel processing of URLs.
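Parallel processing of URLs, as mentioned for the utilities module, could look roughly like the following. This is a minimal sketch using the standard library's thread pool; `process_urls`, its parameters, and the choice of threads over processes are assumptions, not a description of the actual `utils` code.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed
from typing import Callable, Dict, Iterable, Tuple, TypeVar

T = TypeVar("T")


def process_urls(
    urls: Iterable[str],
    worker: Callable[[str], T],
    max_workers: int = 8,
) -> Tuple[Dict[str, T], Dict[str, Exception]]:
    """Run `worker` on every URL concurrently; return (results, errors) keyed by URL."""
    results: Dict[str, T] = {}
    errors: Dict[str, Exception] = {}
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = {pool.submit(worker, url): url for url in urls}
        for future in as_completed(futures):
            url = futures[future]
            try:
                results[url] = future.result()
            except Exception as exc:  # keep going if a single URL fails
                errors[url] = exc
    return results, errors
```

Threads suit this kind of work because fetching repository data is mostly network-bound; collecting failures separately lets one unreachable repository be retried later without aborting the whole batch.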
All steps log progress and results to `log/running.log`.
- Token Management: Ensure your GitHub token has sufficient permissions to access the repositories in question.
- Custom Adjustments: If you encounter commits requiring manual correction (e.g., fc_hash adjustments), edit the script as indicated in the comments.
https://drive.google.com/file/d/1Bnnb7kJa8GEfyESIAuGXj2z0g8FvXgRk/view?usp=drive_link
If you use this repository or its outputs in your research, please cite the associated paper:
- C. Lu, T. Li, T. Dehaene and B. Lagaisse, "ICVul: A Well-labeled C/C++ Vulnerability Dataset with Comprehensive Metadata and VCCs," 2025 IEEE/ACM 22nd International Conference on Mining Software Repositories (MSR), Ottawa, ON, Canada, 2025, pp. 154-158, doi: 10.1109/MSR66628.2025.00034.
