Data mining methods for malware detection

This research investigates the use of data mining methods for detecting malware (malicious programs) and presents a data mining framework as an alternative to traditional signature-based detection. Traditional approaches that rely on signatures to detect malicious programs fail for new and unknown malware, for which no signatures are available. We collected, analyzed, and processed several thousand malicious and clean programs to identify the best features and build models that classify a given program as either malware or clean. Our research is closely related to information retrieval and classification techniques and borrows a number of ideas from those fields. We used a vector space model to represent the programs in our collection. Our framework includes two distinct classes of experiments. The first comprises supervised learning experiments that used a dataset of several thousand malicious and clean program samples to train, validate, and test an array of classifiers. In the second class of experiments, we proposed using sequential association analysis for feature selection and automatic signature extraction. In our experiments, we achieved a detection rate as high as 98.4% and a false positive rate as low as 1.9% on novel malware.
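
As a rough illustration of the vector space representation and the two reported metrics, the following minimal sketch maps a tokenized program to a term-frequency vector and computes detection rate and false positive rate from predicted labels. The token names, the vocabulary, and the function names are hypothetical and chosen only for illustration; the abstract does not specify the actual features, weighting scheme, or classifiers used.

```python
from collections import Counter
import math


def to_vector(tokens, vocabulary):
    """Map a tokenized program to a term-frequency vector over a fixed vocabulary."""
    counts = Counter(tokens)
    return [counts[term] for term in vocabulary]


def cosine(u, v):
    """Cosine similarity between two vectors; 0.0 if either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def detection_and_fp_rate(y_true, y_pred):
    """Return (detection rate, false positive rate); label 1 = malware, 0 = clean."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    detection_rate = tp / (tp + fn) if (tp + fn) else 0.0
    false_positive_rate = fp / (fp + tn) if (fp + tn) else 0.0
    return detection_rate, false_positive_rate


# Hypothetical vocabulary and token sequence, for illustration only.
vocabulary = ["mov", "push", "call", "xor"]
sample = to_vector(["mov", "mov", "call"], vocabulary)
rates = detection_and_fp_rate([1, 1, 0, 0], [1, 0, 1, 0])
```

Here the detection rate is the fraction of malicious samples flagged as malware, and the false positive rate is the fraction of clean samples flagged as malware, which is the usual reading of the two figures quoted above.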
