CIMDS: Adapting Postprocessing Techniques of Associative Classification for Malware Detection

Malware is software designed to infiltrate or damage a computer system without the owner's informed consent (e.g., viruses, backdoors, spyware, trojans, and worms). Nowadays, numerous attacks made by the malware pose a major security threat to computer users. Unfortunately, along with the development of the malware writing techniques, the number of file samples that need to be analyzed, named "gray list," on a daily basis is constantly increasing. In order to help our virus analysts, quickly and efficiently pick out the malicious executables from the "gray list," an automatic and robust tool to analyze and classify the file samples is needed. In our previous work, we have developed an intelligent malware detection system (IMDS) by adopting associative classification method based on the analysis of application programming interface (API) execution calls. Despite its good performance in malware detection, IMDS still faces the following two challenges: (1) handling the large set of the generated rules to build the classifier; and (2) finding effective rules for classifying new file samples. In this paper, we first systematically evaluate the effects of the postprocessing techniques (e.g., rule pruning, rule ranking, and rule selection) of associative classification in malware detection, and then, propose an effective way, i.e., CIDCPF, to detect the malware from the "gray list." To the best of our knowledge, this is the first effort on using postprocessing techniques of associative classification in malware detection. CIDCPF adapts the postprocessing techniques as follows: first applying Chi-square testing and Insignificant rule pruning followed by using Database coverage based on the Chi-square measure rule ranking mechanism and Pessimistic error estimation, and finally performing prediction by selecting the best First rule. We have incorporated the CIDCPF method into our existing IMDS system, and we call the new system as CIMDS system. Case studies are performed on the large collection of file samples obtained from the Antivirus Laboratory at Kingsoft Corporation and promising experimental results demonstrate that the efficiency and ability of detecting malware from the "gray list" of our CIMDS system outperform popular antivirus software tools, such as McAfee VirusScan and Norton Antivirus, as well as previous data-mining-based detection systems, which employed Naive Bayes, support vector machine, and decision tree techniques. In particular, our CIMDS system can greatly reduce the number of generated rules, which makes it easy for our virus analysts to identify the useful ones.

[1]  Jorma Rissanen,et al.  SLIQ: A Fast Scalable Classifier for Data Mining , 1996, EDBT.

[2]  Elena Baralis,et al.  On support thresholds in associative classification , 2004, SAC '04.

[3]  Tao Li,et al.  An intelligent PE-malware detection system based on association mining , 2008, Journal in Computer Virology.

[4]  W. J. Langford Statistical Methods , 1959, Nature.

[5]  Lilly Suriani Affendey,et al.  Intrusion detection using data mining techniques , 2010, 2010 International Conference on Information Retrieval & Knowledge Management (CAMP).

[6]  Chih-Jen Lin,et al.  LIBLINEAR: A Library for Large Linear Classification , 2008, J. Mach. Learn. Res..

[7]  HanJiawei,et al.  Exploratory mining and pruning optimizations of constrained associations rules , 1998 .

[8]  Salvatore J. Stolfo,et al.  Data mining methods for detection of new malicious executables , 2001, Proceedings 2001 IEEE Symposium on Security and Privacy. S&P 2001.

[9]  Marcus A. Maloof,et al.  Learning to detect malicious executables in the wild , 2004, KDD.

[10]  Aiko M. Hormann,et al.  Programs for Machine Learning. Part I , 1962, Inf. Control..

[11]  Ian H. Witten,et al.  Data mining: practical machine learning tools and techniques, 3rd Edition , 1999 .

[12]  Somesh Jha,et al.  Synthesizing Near-Optimal Malware Specifications from Suspicious Behaviors , 2010, 2010 IEEE Symposium on Security and Privacy.

[13]  Jau-Hwang Wang,et al.  Virus detection using data mining techinques , 2003, IEEE 37th Annual 2003 International Carnahan Conference onSecurity Technology, 2003. Proceedings..

[14]  Wynne Hsu,et al.  Pruning and summarizing the discovered associations , 1999, KDD '99.

[15]  Peter I. Cowling,et al.  MCAR: multi-class classification based on association rule , 2005, The 3rd ACS/IEEE International Conference onComputer Systems and Applications, 2005..

[16]  Wynne Hsu,et al.  Integrating Classification and Association Rule Mining , 1998, KDD.

[17]  Frans Coenen,et al.  A Novel Rule Ordering Approach in Classification Association Rule Mining , 2007, MLDM.

[18]  Ian Witten,et al.  Data Mining , 2000 .

[19]  Jianyong Wang,et al.  HARMONY: Efficiently Mining the Best Rules for Classification , 2005, SDM.

[20]  Xuxian Jiang,et al.  vEye: behavioral footprinting for self-propagating worm detection and profiling , 2008, Knowledge and Information Systems.

[21]  Osmar R. Zaïane,et al.  An associative classifier based on positive and negative rules , 2004, DMKD '04.

[22]  Ke Wang,et al.  Growing decision trees on support-less association rules , 2000, KDD '00.

[23]  Heikki Mannila,et al.  Pruning and grouping of discovered association rules , 1995 .

[24]  Andrew H. Sung,et al.  Static analyzer of vicious executables (SAVE) , 2004, 20th Annual Computer Security Applications Conference.

[25]  Frans Coenen,et al.  An evaluation of approaches to classification rule selection , 2004, Fourth IEEE International Conference on Data Mining (ICDM'04).

[26]  Dimitrios Gunopulos,et al.  Constraint-Based Rule Mining in Large, Dense Databases , 1999, Proceedings 15th International Conference on Data Engineering (Cat. No.99CB36337).

[27]  Ramakrishnan Srikant,et al.  Mining Association Rules with Item Constraints , 1997, KDD.

[28]  Elena Baralis,et al.  A lazy approach to pruning classification rules , 2002, 2002 IEEE International Conference on Data Mining, 2002. Proceedings..

[29]  Jiawei Han,et al.  Discriminative Frequent Pattern Analysis for Effective Classification , 2007, 2007 IEEE 23rd International Conference on Data Engineering.

[30]  Peter I. Cowling,et al.  MMAC: a new multi-class, multi-label associative classification approach , 2004, Fourth IEEE International Conference on Data Mining (ICDM'04).

[31]  Osmar R. Zaïane,et al.  Associative Classifiers for Medical Images , 2002, Revised Papers from MDM/KDD and PAKDD/KDMCD.

[32]  Sanjay Chawla,et al.  CCCS: a top-down associative classifier for imbalanced class distribution , 2006, KDD '06.

[33]  Jian Pei,et al.  CMAR: accurate and efficient classification based on multiple class-association rules , 2001, Proceedings 2001 IEEE International Conference on Data Mining.

[34]  Yanfang Ye,et al.  IMDS: intelligent malware detection system , 2007, KDD '07.

[35]  Jiawei Han,et al.  CPAR: Classification based on Predictive Association Rules , 2003, SDM.

[36]  Fadi A. Thabtah,et al.  A review of associative classification mining , 2007, The Knowledge Engineering Review.

[37]  Hong Shen,et al.  Construct robust rule sets for classification , 2002, KDD.

[38]  Laks V. S. Lakshmanan,et al.  Exploratory mining and pruning optimizations of constrained associations rules , 1998, SIGMOD '98.

[39]  Frans Coenen,et al.  A Novel Rule Weighting Approach in Classification Association Rule Mining , 2007 .

[40]  J. Ross Quinlan,et al.  C4.5: Programs for Machine Learning , 1992 .

[41]  Philip S. Yu,et al.  Direct Discriminative Pattern Mining for Effective Classification , 2008, 2008 IEEE 24th International Conference on Data Engineering.

[42]  Christopher Krügel,et al.  Dynamic Analysis of Malicious Code , 2006, Journal in Computer Virology.

[43]  Zhuoqing Morley Mao,et al.  Automated Classification and Analysis of Internet Malware , 2007, RAID.