Transductive malware label propagation: Find your lineage from your neighbors

The numerous malware variants in cyberspace pose severe threats to its security. Supervised learning techniques have been applied to automate the classification of malware variants. Supervised learning, however, performs poorly when only a few labeled malware samples are available. In this work, we propose a transductive malware classification framework that propagates label information from labeled instances to unlabeled ones. We improve the existing harmonic function approach based on the maximum confidence principle. We apply this framework to structural information collected from malware programs, and we propose a PageRank-like algorithm to evaluate the distance between two malware programs. We evaluate our method against the standard harmonic function method as well as two popular supervised learning techniques. Experimental results suggest that our method outperforms these existing approaches in classifying malware variants when only a small number of labeled samples are available.
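
To make the label propagation idea concrete, the following is a minimal sketch of the standard harmonic function baseline only, not the maximum-confidence improvement or the PageRank-like distance proposed in this work, neither of which the abstract specifies. It assumes a symmetric matrix of non-negative pairwise similarities is already available; the function names, the toy graph, and the family indices are hypothetical and chosen for illustration.

```python
def harmonic_label_propagation(weights, labels, num_classes,
                               max_iterations=1000, tol=1e-9):
    """Propagate labels over a similarity graph (harmonic function baseline).

    weights: symmetric n x n list of lists of non-negative similarities
             (assumed input; how similarities are derived is not shown here).
    labels:  dict mapping the index of each labeled sample to its class id.
    Returns a list of per-sample class score lists.
    """
    n = len(weights)
    scores = [[0.0] * num_classes for _ in range(n)]
    for index, cls in labels.items():
        scores[index][cls] = 1.0  # labeled samples stay clamped to their class

    unlabeled = [i for i in range(n) if i not in labels]
    for _ in range(max_iterations):
        max_change = 0.0
        for i in unlabeled:
            degree = sum(weights[i][j] for j in range(n) if j != i)
            if degree == 0:
                continue  # isolated sample: no neighbors to learn from
            # Harmonic property: each unlabeled score is the weighted
            # average of its neighbors' scores.
            updated = [
                sum(weights[i][j] * scores[j][c] for j in range(n) if j != i) / degree
                for c in range(num_classes)
            ]
            max_change = max(max_change,
                             max(abs(a - b) for a, b in zip(updated, scores[i])))
            scores[i] = updated
        if max_change < tol:
            break
    return scores


def predict(scores):
    """Assign each sample to the class with the highest propagated score."""
    return [max(range(len(row)), key=row.__getitem__) for row in scores]


# Toy example (hypothetical data): four samples in a chain 0 - 1 - 2 - 3,
# where sample 0 is labeled as family 0 and sample 3 as family 1.
chain = [
    [0, 1, 0, 0],
    [1, 0, 1, 0],
    [0, 1, 0, 1],
    [0, 0, 1, 0],
]
scores = harmonic_label_propagation(chain, {0: 0, 3: 1}, num_classes=2)
print(predict(scores))
```

In this toy chain, each unlabeled sample ends up with the label of the labeled sample it is closer to. The iterative update is used here only to keep the sketch dependency-free; the same fixed point can be obtained by solving a linear system over the unlabeled samples.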
