A Machine Learning Approach to Dataset Imputation for Software Vulnerabilities

This paper proposes a supervised machine learning approach for imputing missing categorical values in a dataset in which the majority of samples are incomplete. Twelve models were designed that can predict nine of the twelve Adversarial Tactics, Techniques, and Common Knowledge (ATT&CK) tactic categories using only the Common Attack Pattern Enumeration and Classification (CAPEC) as input. The proposed method was evaluated on a test dataset of 867 unseen samples, with classification accuracy ranging from 99.88% to 100%. The models were then used to generate a more complete dataset with no missing ATT&CK tactic features.
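
The general idea of supervised imputation described above can be illustrated with a minimal sketch: fit one model per tactic column on the rows where that column is known, then use it to fill the rows where it is missing. The abstract does not specify the model architecture, so the sketch below substitutes a simple majority-vote lookup keyed by CAPEC ID; it is not the paper's method. The field names, the two tactic columns, the CAPEC IDs, and all label values are made up for illustration.

```python
from collections import Counter, defaultdict

# Illustrative subset of tactic columns (assumption, not the paper's schema).
TACTICS = ["persistence", "defense_evasion"]

# Toy records with made-up labels; None marks a missing tactic value.
rows = [
    {"capec": 17, "persistence": 1, "defense_evasion": 0},
    {"capec": 17, "persistence": None, "defense_evasion": 0},
    {"capec": 100, "persistence": 0, "defense_evasion": None},
    {"capec": 100, "persistence": 0, "defense_evasion": 1},
    {"capec": 242, "persistence": None, "defense_evasion": None},
]


def fit_tactic_model(records, tactic):
    """Learn the most common label per CAPEC ID from rows where `tactic` is known.

    Returns a lookup table plus a global fallback label for unseen CAPEC IDs.
    This stands in for whatever classifier the paper actually trains.
    """
    votes = defaultdict(Counter)
    for record in records:
        if record[tactic] is not None:
            votes[record["capec"]][record[tactic]] += 1
    overall = Counter()
    for counter in votes.values():
        overall.update(counter)
    default = overall.most_common(1)[0][0] if overall else None
    table = {capec: c.most_common(1)[0][0] for capec, c in votes.items()}
    return table, default


def impute(records, tactics=TACTICS):
    """Return copies of `records` with missing tactic values filled in."""
    models = {t: fit_tactic_model(records, t) for t in tactics}
    completed_rows = []
    for record in records:
        filled = dict(record)
        for t in tactics:
            if filled[t] is None:
                table, default = models[t]
                filled[t] = table.get(record["capec"], default)
        completed_rows.append(filled)
    return completed_rows


completed = impute(rows)
for row in completed:
    print(row)
```

Training one model per target column keeps each prediction task independent, so a column with many missing values does not reduce the training data available for the others. In practice a held-out set of complete rows would be needed to estimate accuracy, as the paper does with its 867 unseen samples.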
