A Semi-supervised Learning Methodology for Malware Categorization using Weighted Word Embeddings

Driven by the rapid growth in the number of malicious actors, malware is now crafted, distributed, and propagated around the world with new and sophisticated techniques. Classical malware detection procedures, mostly based on signatures and heuristic searches, are being replaced with machine learning (ML) based solutions. However, some challenges remain. First, supervised approaches rely on anti-virus tags to create hand-crafted datasets, which results in a lack of taxonomy and in uncertainty about whether a given observation carries the proper label. Second, off-line and feed-forward approaches may require complex and time-consuming feature extraction. In this work, we propose a novel method that reinforces malware characterization by capturing rich relevance and contextual patterns in an n-dimensional weighted word embedding vector (WEV) space. Results show that by clustering similar WEVs via unsupervised learning, malware can be categorized into four major families, improving detection with fewer resources.
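
The idea of weighting word embeddings and then clustering them can be illustrated with a minimal sketch. It is not the authors' implementation: the TF-IDF weighting, the token sequences, the 2-D embedding values, and the helper names (`tfidf_weights`, `weighted_embedding`, `kmeans`) are all assumptions made for illustration, and the toy data uses two clusters rather than the four families reported above.

```python
import math
from collections import Counter

# Hypothetical token sequences (e.g. API call names) per sample; not from the paper.
samples = {
    "s1": ["CreateFile", "WriteFile", "RegSetValue"],
    "s2": ["CreateFile", "WriteFile", "WriteFile"],
    "s3": ["connect", "send", "recv"],
    "s4": ["connect", "send", "send"],
}

# Toy 2-D embeddings standing in for vectors a model such as word2vec would learn.
embeddings = {
    "CreateFile": (1.0, 0.1), "WriteFile": (0.9, 0.0), "RegSetValue": (0.8, 0.2),
    "connect": (0.0, 1.0), "send": (0.1, 0.9), "recv": (0.2, 0.8),
}

def tfidf_weights(docs):
    """Return {doc: {token: tf * idf}} using a smoothed IDF (assumed weighting)."""
    n = len(docs)
    df = Counter(tok for toks in docs.values() for tok in set(toks))
    weights = {}
    for name, toks in docs.items():
        tf = Counter(toks)
        weights[name] = {
            t: (c / len(toks)) * (math.log((1 + n) / (1 + df[t])) + 1.0)
            for t, c in tf.items()
        }
    return weights

def weighted_embedding(token_weights, emb):
    """Weighted average of token embeddings: one vector per sample."""
    total = sum(token_weights.values())
    dim = len(next(iter(emb.values())))
    return tuple(
        sum(w * emb[t][i] for t, w in token_weights.items()) / total
        for i in range(dim)
    )

def kmeans(points, centroids, iterations=10):
    """Plain k-means with fixed initial centroids; returns one label per point."""
    centroids = list(centroids)
    labels = []
    for _ in range(iterations):
        # Assign each point to its nearest centroid.
        labels = [
            min(range(len(centroids)), key=lambda k: math.dist(p, centroids[k]))
            for p in points
        ]
        # Move each centroid to the mean of its assigned points.
        for k in range(len(centroids)):
            members = [p for p, lab in zip(points, labels) if lab == k]
            if members:
                centroids[k] = tuple(sum(c) / len(members) for c in zip(*members))
    return labels

names = list(samples)
weights = tfidf_weights(samples)
vectors = [weighted_embedding(weights[n], embeddings) for n in names]
# Seed the two clusters with the first and third samples (arbitrary choice).
labels = kmeans(vectors, [vectors[0], vectors[2]])
print(dict(zip(names, labels)))
```

In this toy setup the file-oriented samples and the network-oriented samples fall into separate clusters; a real pipeline would learn the embeddings from disassembled or traced samples and choose the number of clusters from the data.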
