N-gram analysis for computer virus detection

Generic computer virus detection is the need of the hour as most commercial antivirus software fail to detect unknown and new viruses. Motivated by the success of datamining/machine learning techniques in intrusion detection systems, recent research in detecting malicious executables is directed towards devising efficient non-signature-based techniques that can profile the program characteristics from a set of training examples. Byte sequences and byte n-grams are considered to be basis of feature extraction. But as the number of n-grams is going to be very large, several methods of feature selections were proposed in literature. A recent report on use of information gain based feature selection has yielded the best-known result in classifying malicious executables from benign ones. We observe that information gain models the presence of n-gram in one class and its absence in the other. Through a simple example we show that this may lead to erroneous results. In this paper, we describe a new feature selection measure, class-wise document frequency of byte n-grams. We empirically demonstrate that the proposed method is a better method for feature selection. For detection, we combine several classifiers using Dempster Shafer Theory for better classification accuracy instead of using any single classifier. Our experimental results show that such a scheme detects virus program far more efficiently than the earlier known methods.

[1]  Glenn Shafer,et al.  A Mathematical Theory of Evidence , 2020, A Mathematical Theory of Evidence.

[2]  Fred Cohen,et al.  Computer viruses—theory and experiments , 1990 .

[3]  David H. Wolpert,et al.  Stacked generalization , 1992, Neural Networks.

[4]  Philippe Smets,et al.  Belief functions: The disjunctive rule of combination and the generalized Bayesian theorem , 1993, Int. J. Approx. Reason..

[5]  W. B. Cavnar,et al.  N-gram-based text categorization , 1994 .

[6]  Jeffrey O. Kephart,et al.  Biologically Inspired Defenses Against Computer Viruses , 1995, IJCAI.

[7]  Thomas G. Dietterich What is machine learning? , 2020, Archives of Disease in Childhood.

[8]  Yiming Yang,et al.  A Comparative Study on Feature Selection in Text Categorization , 1997, ICML.

[9]  Johannes Fürnkranz,et al.  A Study Using $n$-gram Features for Text Categorization , 1998 .

[10]  Ian H. Witten,et al.  Issues in Stacked Generalization , 2011, J. Artif. Intell. Res..

[11]  Ian H. Witten,et al.  Data mining: practical machine learning tools and techniques, 3rd Edition , 1999 .

[12]  Catherine K. Murphy Combining belief functions when evidence conflicts , 2000, Decis. Support Syst..

[13]  William C. Arnold,et al.  AUTOMATICALLY GENERATED WIN32 HEURISTIC VIRUS DETECTION , 2000 .

[14]  Robert P. W. Duin,et al.  Experiments with Classifier Combining Rules , 2000, Multiple Classifier Systems.

[15]  Gary McGraw,et al.  Attacking Malicious Code: A Report to the Infosec Research Council , 2000, IEEE Software.

[16]  Salvatore J. Stolfo,et al.  Data mining methods for detection of new malicious executables , 2001, Proceedings 2001 IEEE Symposium on Security and Privacy. S&P 2001.

[17]  Salvatore J. Stolfo,et al.  USENIX Association Proceedings of the FREENIX Track : 2001 USENIX Annual , 2001 .

[18]  Ian H. Witten,et al.  Data mining: practical machine learning tools and techniques with Java implementations , 2002, SGMD.

[19]  Kari Sentz,et al.  Combination of Evidence in Dempster-Shafer Theory , 2002 .

[20]  Eric Lefevre,et al.  Belief function combination and conflict management , 2002, Inf. Fusion.

[21]  Sargur N. Srihari,et al.  Class-wise multi-classifier combination based on Dempster-Shafer theory , 2002, 7th International Conference on Control, Automation, Robotics and Vision, 2002. ICARCV 2002..

[22]  Somesh Jha,et al.  Static Analysis of Executables to Detect Malicious Patterns , 2003, USENIX Security Symposium.

[23]  Vlado Keselj,et al.  Detection of New Malicious Code Using N-grams Signatures , 2004, PST.

[24]  Vlado Keselj,et al.  N-gram-based detection of new malicious code , 2004, Proceedings of the 28th Annual International Computer Software and Applications Conference, 2004. COMPSAC 2004..

[25]  Marcus A. Maloof,et al.  Learning to detect malicious executables in the wild , 2004, KDD.

[26]  Andrew Walenstein,et al.  Malware phylogeny generation using permutations of code , 2005, Journal in Computer Virology.

[27]  Peter Szor,et al.  The Art of Computer Virus Research and Defense , 2005 .

[28]  Ulrich Ultes-Nitsche,et al.  Towards establishing a unknown virus detection technique using SOM , 2006 .

[29]  Ulrich Ultes-Nitsche,et al.  Non-signature based virus detection , 2006, Journal in Computer Virology.