Malware detection using augmented naive Bayes with domain knowledge and under presence of class noise

Malicious software (malware) attacks on the internet are rising in both frequency and sophistication. Content-based detection can identify malware more accurately because it screens the payload for known malware signatures. New malware variants still share recurring content that can be detected through fixed substrings, especially when n-grams are combined with machine learning techniques. This paper focuses on detecting malware with a content classification technique that is augmented with domain knowledge (Snort signatures) to reduce the feature set and improve detection accuracy. On a 15-day dataset, the naive Bayes model built with domain knowledge and the 91,127 most descriptive features achieves the lowest false negative rate (around 2%). However, the presence of class noise has a significant impact on the results, even for machine learning techniques augmented with domain knowledge.
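
The idea of restricting n-gram features with signature knowledge can be illustrated with a minimal sketch. It is not the paper's implementation: the 4-byte n-gram length, the Bernoulli (presence/absence) event model, the Laplace smoothing, and the toy signatures and payloads are all assumptions chosen for illustration, and the names `ngrams`, `signature_vocabulary`, and `BernoulliNB` are hypothetical.

```python
import math


def ngrams(payload: bytes, n: int = 4) -> set:
    """Return the set of distinct byte n-grams in a payload."""
    return {payload[i:i + n] for i in range(len(payload) - n + 1)}


def signature_vocabulary(signatures, n: int = 4) -> frozenset:
    """Keep only n-grams that occur inside a known signature string.

    This stands in for the domain-knowledge step: the feature set is
    limited to substrings of signature content (assumed toy data here).
    """
    vocab = set()
    for sig in signatures:
        vocab |= ngrams(sig, n)
    return frozenset(vocab)


class BernoulliNB:
    """Naive Bayes over presence/absence of each vocabulary n-gram."""

    def __init__(self, vocab, n: int = 4, alpha: float = 1.0):
        self.vocab = frozenset(vocab)
        self.n = n
        self.alpha = alpha  # Laplace smoothing constant (assumption)

    def fit(self, payloads, labels):
        self.class_docs = {}  # label -> number of training payloads
        self.present = {}     # label -> {n-gram: payloads containing it}
        for payload, label in zip(payloads, labels):
            self.class_docs[label] = self.class_docs.get(label, 0) + 1
            seen = self.present.setdefault(label, {})
            for gram in ngrams(payload, self.n) & self.vocab:
                seen[gram] = seen.get(gram, 0) + 1
        return self

    def log_scores(self, payload: bytes) -> dict:
        grams = ngrams(payload, self.n)
        total = sum(self.class_docs.values())
        scores = {}
        for label, n_docs in self.class_docs.items():
            score = math.log(n_docs / total)  # class prior
            for gram in self.vocab:
                count = self.present[label].get(gram, 0)
                p = (count + self.alpha) / (n_docs + 2 * self.alpha)
                score += math.log(p if gram in grams else 1.0 - p)
            scores[label] = score
        return scores

    def predict(self, payload: bytes) -> str:
        scores = self.log_scores(payload)
        return max(scores, key=scores.get)


# Toy stand-ins for signature content strings (not real Snort rules).
signatures = [b"cmd.exe", b"/bin/sh"]
vocab = signature_vocabulary(signatures)

train_payloads = [
    b"GET /a?x=cmd.exe HTTP/1.0",
    b"POST /cgi;/bin/sh -i",
    b"GET /index.html HTTP/1.0",
    b"GET /images/logo.png HTTP/1.0",
]
train_labels = ["malware", "malware", "benign", "benign"]

model = BernoulliNB(vocab).fit(train_payloads, train_labels)
suspicious = model.predict(b"GET /scripts/run?c=cmd.exe+/c+dir")
harmless = model.predict(b"GET /about.html HTTP/1.0")
print(suspicious, harmless)
```

On this toy data the vocabulary shrinks to the eight 4-grams found in the two signature strings, so the classifier scores only those features instead of every n-gram seen in traffic. Mislabelled training payloads (class noise) would distort the per-class presence counts directly, which is consistent with the sensitivity the abstract reports.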
