Chaomeng-Lu/ICVul

ICVul: A Well-labeled C/C++ Vulnerability Dataset with Comprehensive Metadata and VCCs

This project is designed to extract, analyze, and filter CVEs (Common Vulnerabilities and Exposures) from the National Vulnerability Database (NVD) and then associate them with fix commits from GitHub open-source repositories. By using the SZZ algorithm, it identifies Vulnerability Contributing Commits (VCCs) and extracts relevant commit information.

This dataset focuses on improving data quality:

  • ✅ Based on real-world CVEs from NVD linked to GitHub fix commits
  • 🧠 Enriched with comprehensive metadata at the function, commit, and file levels
  • 🔁 Uses the SZZ algorithm to trace Vulnerability-Contributing Commits (VCCs)
  • 🔍 Introduces a novel ESC (Eliminate Suspicious Commit) technique to ensure label reliability

Project Overview

This project processes and analyzes data from CVE, CWE, and related repositories to generate mappings between vulnerabilities and their associated commits, repositories, and files. The pipeline involves several steps: extracting repository information, processing and filtering commit data, and removing noise to focus on relevant information.

Prerequisites

1. Python Environment

Ensure you have Python installed along with the following packages:

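  • pandas, pydriller, github, git (the last two are import names; they are normally provided by the PyGithub and GitPython packages)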

2. GitHub Token

Set up a valid GitHub token to enable the script to fetch repository information. Replace the placeholder in the script:

  • github_token = 'your_github_token'
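
As an alternative to hardcoding the token, it can be read from the environment so that it never ends up in version control. The sketch below is only an illustration: the `GITHUB_TOKEN` variable name and the `load_github_token` helper are assumptions, not part of this project.

```python
import os


def load_github_token(env_var: str = "GITHUB_TOKEN") -> str:
    """Return the GitHub token stored in an environment variable.

    The variable name is an assumption made for this sketch; the project
    itself only documents the hardcoded `github_token` placeholder.
    """
    token = os.environ.get(env_var, "").strip()
    if not token:
        raise RuntimeError(
            f"Set the {env_var} environment variable to a valid GitHub token"
        )
    return token
```

With this helper in place, the placeholder line could become `github_token = load_github_token()`.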

This project automates the extraction of CVEs from the NVD and identifies the fix commits that resolve those vulnerabilities in open-source repositories. It then uses the SZZ algorithm to trace the VCCs (Vulnerability-Contributing Commits) that introduced them, which supports vulnerability remediation and analysis.
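
The core idea behind SZZ can be sketched in a few lines: take the lines a fix commit deleted or changed, and ask which earlier commit last touched each of them. The example below is a toy, in-memory illustration and not the project's implementation; the `szz_candidates` function, the `blame` dictionary, and the sample hashes are all hypothetical stand-ins for what `git blame` would report on the parent of the fix commit.

```python
from collections import defaultdict
from typing import Dict, List, Set, Tuple

# A line is identified by (file path, line number in the fix commit's parent).
Line = Tuple[str, int]


def szz_candidates(
    deleted_lines: List[Line], blame: Dict[Line, str]
) -> Dict[str, Set[Line]]:
    """Group the lines removed by a fix commit by the commit that last touched them.

    `blame` stands in for `git blame` output on the parent of the fix commit.
    Lines with no blame entry are skipped.
    """
    candidates: Dict[str, Set[Line]] = defaultdict(set)
    for line in deleted_lines:
        commit = blame.get(line)
        if commit is not None:
            candidates[commit].add(line)
    return dict(candidates)


# Hypothetical data: three lines removed by a fix, blamed on two earlier commits.
deleted = [("src/parse.c", 42), ("src/parse.c", 43), ("src/util.c", 7)]
blame = {
    ("src/parse.c", 42): "a1b2c3",
    ("src/parse.c", 43): "a1b2c3",
    ("src/util.c", 7): "d4e5f6",
}
vccs = szz_candidates(deleted, blame)
```

On a real repository, PyDriller exposes `Git.get_commits_last_modified_lines`, which performs this blame lookup for a given commit; whether the `vcc_extraction` module calls it is not stated in this README.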

Workflow

Step 1: Extract Repository Information

Step 2: Process commit, file, and function data and the cve_fc_vcc_mapping table

Step 3: Eliminate Suspicious Commits (ESC)

  • Execution:
    Set step in main.py to 1, then 2, then 3 (the three steps must run in order), and run the following command after each change:
    python main.py
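
The internals of main.py are not shown in this README, so the following is only a hypothetical sketch of how a `step` variable could select one of the three stages. The function names and return strings are placeholders invented for illustration.

```python
from typing import Callable, Dict


# Hypothetical stand-ins for the three workflow stages.
def extract_repositories() -> str:
    return "step 1: repository information extracted"


def process_commits() -> str:
    return "step 2: commit, file, and function data processed"


def eliminate_suspicious_commits() -> str:
    return "step 3: suspicious commits eliminated"


STEPS: Dict[int, Callable[[], str]] = {
    1: extract_repositories,
    2: process_commits,
    3: eliminate_suspicious_commits,
}


def run_step(step: int) -> str:
    """Run the stage selected by `step`, rejecting values outside 1..3."""
    if step not in STEPS:
        raise ValueError(f"step must be 1, 2, or 3, got {step!r}")
    return STEPS[step]()


step = 1  # edit this value (1, then 2, then 3) and rerun the script

if __name__ == "__main__":
    print(run_step(step))
```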
    

Key Modules

  • data_structure: Defines the structure of the five tables.

  • vcc_extraction: Handles the fetching of VCC URLs for commits (including the SZZ algorithm).

  • commit_extraction: Handles the fetching of files and functions.

  • repo_extraction: Provides methods to extract repository metadata and validate repository URLs.

  • fc_filter_ESC: Implements filtering logic to eliminate suspicious or noisy commits.

  • utils: Provides utility functions, including parallel processing of URLs.
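
To illustrate what URL validation and parallel URL processing might look like, here is a minimal sketch that uses only the standard library. It is not the code in `repo_extraction` or `utils`: the `is_commit_url` and `process_urls` names, the regular expression, and the worker count are all assumptions.

```python
import re
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Dict, Iterable

# Assumed shape of a GitHub commit URL: owner/repo/commit/<7-40 hex characters>.
COMMIT_URL = re.compile(
    r"https://github\.com/[\w.-]+/[\w.-]+/commit/[0-9a-f]{7,40}"
)


def is_commit_url(url: str) -> bool:
    """Return True if the whole string looks like a GitHub commit URL."""
    return COMMIT_URL.fullmatch(url) is not None


def process_urls(
    urls: Iterable[str],
    worker: Callable[[str], object],
    max_workers: int = 8,
) -> Dict[str, object]:
    """Apply `worker` to every URL on a thread pool and map URL -> result."""
    url_list = list(urls)
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map preserves input order, so zip pairs each URL with its result.
        results = list(pool.map(worker, url_list))
    return dict(zip(url_list, results))
```

Threads suit this kind of work because fetching commit data is dominated by network waits; for example, `process_urls(urls, is_commit_url)` checks a batch of URLs, and a function that performs the actual fetch could be passed as `worker` instead.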

Logging

All steps log progress and results to log/running.log.
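
A file logger of this kind can be set up with Python's standard `logging` module. The sketch below is an assumption about how such a setup could look, not the project's actual configuration; the `setup_logging` helper, the `icvul` logger name, and the message format are hypothetical.

```python
import logging
from pathlib import Path


def setup_logging(log_path: str = "log/running.log") -> logging.Logger:
    """Return a logger that appends timestamped messages to `log_path`."""
    path = Path(log_path)
    # Create the log directory if it does not exist yet.
    path.parent.mkdir(parents=True, exist_ok=True)

    logger = logging.getLogger("icvul")  # hypothetical logger name
    logger.setLevel(logging.INFO)
    if not logger.handlers:  # avoid attaching duplicate handlers on repeated calls
        handler = logging.FileHandler(path, encoding="utf-8")
        handler.setFormatter(
            logging.Formatter("%(asctime)s %(levelname)s %(message)s")
        )
        logger.addHandler(handler)
    return logger
```

Each step could then call `setup_logging().info("...")` to record its progress in the shared log file.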

Notes

  • Token Management: Ensure your GitHub token has sufficient permissions to access the repositories in question.
  • Custom Adjustments: If you encounter commits requiring manual correction (e.g., fc_hash adjustments), edit the script as indicated in the comments.

Dataset Available (Collected in November 2024)

https://drive.google.com/file/d/1Bnnb7kJa8GEfyESIAuGXj2z0g8FvXgRk/view?usp=drive_link

Citation

If you use this repository or its outputs in your research, please cite the associated paper:

  1. C. Lu, T. Li, T. Dehaene and B. Lagaisse, "ICVul: A Well-labeled C/C++ Vulnerability Dataset with Comprehensive Metadata and VCCs," 2025 IEEE/ACM 22nd International Conference on Mining Software Repositories (MSR), Ottawa, ON, Canada, 2025, pp. 154-158, doi: 10.1109/MSR66628.2025.00034.
