Identifying Hidden Vulnerability Contributing Commits based on Ensemble Learning

YuanYuan Zhai; ZiWei Chen; LingYun Situ; ShiJie Guo

YuanYuan Zhai, ZiWei Chen, LingYun Situ, ShiJie Guo. Identifying Hidden Vulnerability Contributing Commits based on Ensemble Learning[J]. Journal of Command and Control, 2025, 4(1): 23-34.

Abstract

In open-source software development, the growing number of vulnerability-contributing commits poses a significant security threat to the software supply chain. To address this issue in C/C++ repositories, we propose the MG-VCC Recognizer, a deep learning-based tool that identifies vulnerability-contributing commits by capturing code semantics at multiple levels of granularity. The tool segments code and its modifications into five hierarchical levels: file, function, AST node, hunk, and line. To capture semantics from multiple perspectives, we use code-specific large language models for fine-tuning and code embedding. We evaluated the tool on a dataset of 4622 commits from 65 projects. The results show that the multi-granularity model outperforms single-level classification models in both precision and recall, and that neural network ensemble models outperform voting ensemble models. Among the large language models evaluated, CodeT5p-110M-Embedding achieved the highest performance on all metrics, with precision, recall, and F1 score all at 96.03% and an accuracy of 95.80%.
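The abstract contrasts two ways of combining per-level predictions: voting and a neural network ensemble. The sketch below illustrates that general distinction only. It assumes each of the five granularity levels has already produced a probability that a commit is vulnerability-contributing, and it compares hard majority voting with a small learned combiner (a single logistic layer, used here as a stand-in for a neural network ensemble). The function names, the level identifiers, and the toy scores are all hypothetical and are not taken from the paper; the MG-VCC Recognizer's actual architecture is not described in the abstract.

```python
import math

# Hypothetical identifiers for the five granularity levels named in the abstract.
LEVELS = ["file", "function", "ast_node", "hunk", "line"]


def majority_vote(probs, threshold=0.5):
    """Hard voting: a level votes 'vulnerable' if its probability exceeds threshold."""
    votes = sum(1 for p in probs if p > threshold)
    return 1 if votes > len(probs) / 2 else 0


def _sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def train_combiner(samples, labels, epochs=1000, lr=0.5):
    """Fit a single logistic layer over per-level probabilities.

    Uses plain batch gradient descent on the cross-entropy loss.
    This is an illustrative stand-in, not the paper's ensemble model.
    """
    n = len(samples[0])
    weights = [0.0] * n
    bias = 0.0
    m = len(samples)
    for _ in range(epochs):
        grad_w = [0.0] * n
        grad_b = 0.0
        for x, y in zip(samples, labels):
            pred = _sigmoid(sum(w * xi for w, xi in zip(weights, x)) + bias)
            err = pred - y  # gradient of cross-entropy w.r.t. the logit
            for i in range(n):
                grad_w[i] += err * x[i]
            grad_b += err
        weights = [w - lr * g / m for w, g in zip(weights, grad_w)]
        bias -= lr * grad_b / m
    return weights, bias


def combiner_predict(weights, bias, probs):
    """Classify a commit from its per-level probabilities using the fitted layer."""
    score = _sigmoid(sum(w * p for w, p in zip(weights, probs)) + bias)
    return 1 if score > 0.5 else 0


# Invented toy scores (one row per commit, one column per entry in LEVELS).
# In this made-up data only the "hunk" column separates the two classes.
TOY_SAMPLES = [
    [0.6, 0.6, 0.6, 0.9, 0.4],
    [0.6, 0.6, 0.6, 0.1, 0.4],
    [0.4, 0.4, 0.4, 0.8, 0.6],
    [0.4, 0.4, 0.4, 0.2, 0.6],
]
TOY_LABELS = [1, 0, 1, 0]
```

Calling `train_combiner(TOY_SAMPLES, TOY_LABELS)` returns a weight per level plus a bias, which `combiner_predict` then applies to a new row of scores. On this invented data, majority voting labels the second row as vulnerable because three of its five levels score above 0.5, even though its label is 0, whereas a learned combiner is free to weight the informative level more heavily. This shows one mechanism by which a learned ensemble can differ from voting; it does not reproduce or explain the paper's reported results.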
