Identifying Hidden Vulnerability Contributing Commits based on Ensemble Learning
Abstract
In open-source software development, the growing number of vulnerability-contributing commits poses a significant security threat to the software supply chain. To address this issue in C/C++ repositories, we propose the MG-VCC Recognizer, a deep learning-based tool that identifies vulnerability-contributing commits by capturing code semantics at multiple levels of granularity. The tool segments code and its modifications into five hierarchical levels: file, function, AST-node, hunk, and line. To capture semantics from these multiple perspectives, we use code-specific large language models for fine-tuning and code embedding. We conducted a large-scale experimental evaluation on a dataset of 4,622 commits from 65 projects. The results show that the multi-granularity model outperforms single-level classification models in both precision and recall, and that neural network ensemble models outperform voting ensemble models. Among the large language models evaluated, CodeT5p-110M-Embedding achieved the highest performance across all metrics, with precision, recall, and F1 score all at 96.03% and an accuracy of 95.80%.
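To make the contrast between the two ensemble strategies concrete, the following minimal sketch combines per-level scores for a commit in two ways: hard majority voting, and a small learned combiner. It is an illustration under stated assumptions, not the MG-VCC Recognizer itself. The level scores are invented toy values, the function names are hypothetical, and a single-layer logistic model stands in for the neural network ensemble, whose actual architecture is not described in this abstract.

```python
import math

# The five granularity levels named in the abstract.
LEVELS = ["file", "function", "ast_node", "hunk", "line"]


def majority_vote(probs, threshold=0.5):
    """Hard voting: a level votes 'VCC' when its score exceeds the threshold."""
    votes = sum(p > threshold for p in probs)
    return votes > len(probs) / 2


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def train_combiner(samples, labels, epochs=500, lr=0.5):
    """Fit a single-layer combiner (logistic regression) with plain SGD.

    Assumption: this is a stand-in for a neural network ensemble; it learns
    one weight per granularity level plus a bias.
    """
    weights = [0.0] * len(LEVELS)
    bias = 0.0
    for _ in range(epochs):
        for x, y in zip(samples, labels):
            pred = sigmoid(sum(w * xi for w, xi in zip(weights, x)) + bias)
            err = pred - y  # gradient of the log loss w.r.t. the logit
            weights = [w - lr * err * xi for w, xi in zip(weights, x)]
            bias -= lr * err
    return weights, bias


def learned_predict(probs, weights, bias, threshold=0.5):
    """Classify a commit from its per-level scores using the learned weights."""
    return sigmoid(sum(w * p for w, p in zip(weights, probs)) + bias) > threshold


# Invented toy scores (one row per commit, one column per level). In this toy
# data only the hunk and line levels are informative, so voting is outnumbered.
TOY_SCORES = [
    [0.2, 0.3, 0.4, 0.9, 0.8],
    [0.8, 0.7, 0.6, 0.1, 0.2],
    [0.3, 0.2, 0.3, 0.8, 0.9],
    [0.7, 0.8, 0.7, 0.2, 0.1],
]
TOY_LABELS = [1, 0, 1, 0]  # 1 = vulnerability-contributing, 0 = benign

weights, bias = train_combiner(TOY_SCORES, TOY_LABELS)

for scores, label in zip(TOY_SCORES, TOY_LABELS):
    print(label, majority_vote(scores), learned_predict(scores, weights, bias))
```

In this toy setting the learned combiner can weight the informative levels more heavily, whereas majority voting treats all five levels equally. This only illustrates why a learned ensemble may outperform voting; it does not reproduce the reported results.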