Streaming Malware Classification in the Presence of Concept Drift and Class Imbalance

Malware, or malicious software, is capable of performing any action or command that can be expressed in code and is typically used for illicit activities, such as e-mail spamming, corporate espionage, and identity theft. Most organizations rely on anti-virus software to identifymalware, which typically utilize signatures that can only identify previously-seen malware instances. We consider the detection ofmalware executables that are downloaded in streaming network data as a supervised machine learning problem. Using malwaredata collected over multiple years, we characterize the effect of concept drift and class imbalance on batch and streaming decision tree ensembles. In particular, we illustrate a surprising vulnerability generated by precisely the aspect of streaming methods that seemed most likely to help them, when compared to batch methods.

[1]  Geoff Holmes,et al.  MOA: Massive Online Analysis , 2010, J. Mach. Learn. Res..

[2]  João Gama,et al.  Decision trees for mining data streams , 2006, Intell. Data Anal..

[3]  Geoff Hulten,et al.  Mining high-speed data streams , 2000, KDD '00.

[4]  Nitesh V. Chawla,et al.  SMOTE: Synthetic Minority Over-sampling Technique , 2002, J. Artif. Intell. Res..

[5]  Alexey Tsymbal,et al.  The problem of concept drift: definitions and related work , 2004 .

[6]  Ludmila I. Kuncheva,et al.  Combining Pattern Classifiers: Methods and Algorithms , 2004 .

[7]  Matt Pietrek,et al.  An in-depth look into the win32 portable executable le format , 2002 .

[8]  Stuart J. Russell,et al.  Online bagging and boosting , 2005, 2005 IEEE International Conference on Systems, Man and Cybernetics.

[9]  Steve Kelling,et al.  Mining citizen science data to predict orevalence of wild bird species , 2006, KDD '06.

[10]  Subhash C. Bagui,et al.  Combining Pattern Classifiers: Methods and Algorithms , 2005, Technometrics.

[11]  Lawrence O. Hall,et al.  A Comparison of Decision Tree Ensemble Creation Techniques , 2007, IEEE Transactions on Pattern Analysis and Machine Intelligence.