Combining file content and file relations for cloud based malware detection

Because of the damage it does to Internet security, malware (such as viruses, worms, trojans, spyware, backdoors, and rootkits) has held the attention of both the anti-malware industry and researchers for decades. Data mining methods such as Naive Bayes and Support Vector Machines have been used for malware detection, relying on the analysis of file contents extracted from file samples, such as Application Programming Interface (API) calls, instruction sequences, and binary strings. However, besides file contents, relations among file samples can provide invaluable information about the properties of those samples; for example, a "Downloader" is always associated with many Trojans. In this paper, we study how file relations can be used to improve malware detection results, and we develop a file verdict system (named "Valkyrie") that builds on a semi-parametric classifier model to combine file content and file relations for malware detection. To the best of our knowledge, this is the first work to use both file content and file relations for malware detection. We perform a comprehensive experimental study on a large collection of PE files, obtained from the clients of anti-malware products of Comodo Security Solutions Incorporation, to compare various malware detection approaches. Promising experimental results demonstrate that our Valkyrie system outperforms other popular anti-malware software tools, such as Kaspersky AntiVirus and McAfee VirusScan, as well as alternative data mining based detection systems, in both accuracy and efficiency.
