Theses and Dissertations

ORCID

https://orcid.org/0000-0003-0025-5524

Issuing Body

Mississippi State University

Advisor

Bhowmik, Tanmay

Committee Member

Iannucci, Stefano

Committee Member

Chen, Zhiqian

Committee Member

Torri, Stephen

Date of Degree

5-12-2023

Document Type

Dissertation - Campus Access Only

Major

Computer Science

Degree Name

Doctor of Philosophy (Ph.D.)

College

James Worth Bagley College of Engineering

Department

Department of Computer Science and Engineering

Abstract

As software grows more complex, developers rely increasingly on reuse and dependencies, often by incorporating code snippets into their source code. As a result, vulnerabilities are becoming harder to identify and mitigate. Although traditional analysis tools are still in use, machine learning models are being adopted to extend these efforts and combat such threats. Research in this area has introduced a variety of approaches, which differ in usability and predictive performance. Moving toward a more natural-language treatment of code, researchers have trained models on source code to identify existing and potential vulnerabilities. Exploratory research has treated source code as plain text, producing "text-based" models. Motivated by the goal of preventing vulnerable code snippets, this dissertation examines the effectiveness of text-based machine learning models for vulnerability detection. We use datasets composed of open-source projects and vulnerability types to generate our own training and testing data from extracted function pairings. Using this data, we evaluate a series of text-based machine learning models, coupled with natural language processing (NLP) techniques and our own data processing methods. Through empirical research, we assess the effectiveness of such models on the basis of statistical evidence. From these results, we determine negative correlations and identify "cross-cutting" features. Finally, we analyze models with the "cross-cutting" features removed, with the aim of improving performance while providing explainability of model decisions.
