Mississippi State University
Date of Degree
Dissertation - Campus Access Only
Doctor of Philosophy (Ph.D.)
James Worth Bagley College of Engineering
Department of Computer Science and Engineering
As software grows more complex, developers increasingly rely on reuse and dependencies, often bringing them into their source code through code snippets. As a result, vulnerabilities are becoming harder to identify and mitigate. Although traditional analysis tools are still in use, machine learning models are being adopted to extend these efforts and combat such threats. Research in this area has introduced a range of approaches that differ in usability and predictive performance. To generalize models toward a more natural-language approach, researchers have trained them on source code to identify existing and potential vulnerabilities. Exploratory research has treated source code as plain text, creating "text-based" models. Motivated by the goal of preventing vulnerable code snippets, this dissertation examines the effectiveness of text-based machine learning models for vulnerability detection. We use datasets composed of open-source projects and vulnerability types to generate our own training and testing data from extracted function pairings. Using this data, we evaluate a series of text-based machine learning models, coupled with natural language processing (NLP) techniques and our own data processing methods. Through empirical research, we demonstrate the effectiveness of such models based on statistical evidence. From these results, we determine negative correlations and identify "cross-cutting" features. Finally, we present an analysis of models with "cross-cutting" features removed to improve performance while providing explainability for model decisions.
Napier, Kollin Ryne, "An analysis of text-based machine learning models for vulnerability detection" (2023). Theses and Dissertations. 5849.