Cloud-based malware detection for evolving data streams

Data stream classification for intrusion detection poses at least three major challenges. First, these data streams are typically of infinite length, which makes traditional multipass learning algorithms inapplicable. Second, they exhibit significant concept drift as attackers react and adapt to defenses. Third, data streams that have no fixed feature set, such as text streams, require an additional feature extraction and selection task, and traditional feature extraction techniques fail when the number of candidate features is too large. To address the first two challenges, this article proposes a multipartition, multichunk ensemble classifier in which a collection of v classifiers is trained from r consecutive data chunks using v-fold partitioning of the data, yielding an ensemble of such classifiers. This technique significantly reduces classification error compared to existing single-partition, single-chunk ensemble approaches, in which each classifier is trained on a single data chunk. To address the third challenge, the article proposes a feature extraction and selection technique for data streams that have no fixed feature set. The technique's scalability is demonstrated through an implementation for the Hadoop MapReduce cloud computing architecture. Both theoretical and empirical evidence demonstrate its effectiveness over other state-of-the-art stream classification techniques on synthetic data, real botnet traffic, and malicious executables.
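The multipartition, multichunk training step described above can be illustrated with a minimal Python sketch. It is not the article's implementation. The nearest-centroid base learner, the majority vote, the choice to train each classifier on all folds except one, and the names `train_mpc_batch`, `ensemble_predict`, and `make_chunk` are assumptions made for illustration; the abstract does not specify the base learner, how folds are assigned to classifiers, or how the ensemble is updated over time.

```python
import random
from collections import Counter


def train_centroid(samples):
    """Toy base learner (assumption): one mean vector per class label."""
    sums, counts = {}, Counter()
    for x, y in samples:
        acc = sums.setdefault(y, [0.0] * len(x))
        for i, xi in enumerate(x):
            acc[i] += xi
        counts[y] += 1
    return {y: [s / counts[y] for s in acc] for y, acc in sums.items()}


def predict_centroid(model, x):
    """Return the label whose centroid is closest to x (squared distance)."""
    def dist(center):
        return sum((a - b) ** 2 for a, b in zip(x, center))
    return min(model, key=lambda y: dist(model[y]))


def train_mpc_batch(chunks, v, rng):
    """Train v classifiers from r consecutive chunks via v-fold partitioning.

    The r chunks are merged and split into v folds. Classifier k is trained
    on every fold except fold k (an assumed reading of the abstract).
    """
    data = [sample for chunk in chunks for sample in chunk]
    rng.shuffle(data)
    folds = [data[i::v] for i in range(v)]
    models = []
    for held_out in range(v):
        train = [s for j, fold in enumerate(folds) if j != held_out for s in fold]
        models.append(train_centroid(train))
    return models


def ensemble_predict(models, x):
    """Combine the classifiers by simple majority vote (assumption)."""
    votes = Counter(predict_centroid(m, x) for m in models)
    return votes.most_common(1)[0][0]


def make_chunk(rng, n, shift=0.0):
    """Hypothetical synthetic chunk: two Gaussian classes in two dimensions."""
    chunk = []
    for _ in range(n):
        label = rng.choice([0, 1])
        center = shift + (4.0 if label else 0.0)
        chunk.append(([rng.gauss(center, 0.5), rng.gauss(center, 0.5)], label))
    return chunk


rng = random.Random(0)
stream = [make_chunk(rng, 50) for _ in range(2)]   # r = 2 consecutive chunks
ensemble = train_mpc_batch(stream, v=3, rng=rng)   # v = 3 classifiers
print(ensemble_predict(ensemble, [0.1, -0.2]), ensemble_predict(ensemble, [4.1, 3.8]))
```

In a streaming setting, a step like this would be repeated as each new chunk arrives, with older classifiers retired as the concept drifts; the replacement policy is left out here because the abstract does not describe it.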
