Scalable Detection of Server-Side Polymorphic Malware

Abstract. Server-side polymorphism is used by malware distributors to evade detection by anti-virus (AV) scanners. Traditional AVs have difficulty detecting this type of malware because the transformation code is not visible for security analysis. Using a tera-scale dataset of anti-virus telemetry reports pertaining to more than half a billion files, we conduct what is, to the best of our knowledge, the largest-scale analysis to date of the properties of web-borne polymorphic malware. We cluster the file population based on locality-sensitive hash (LSH) values and analyze the resulting LSH clusters. Using ground-truth labels, we identify benign and malicious clusters and analyze the differences between them in terms of the distributions of cluster size, file download numbers, and activity period, and in terms of their web domain utilization patterns. The results of this analysis are then leveraged to devise SPADE, a scalable Server-side Polymorphic mAlware DEtector that provides high-quality detection of both malicious files and malicious web domains.
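To make the clustering step concrete, the following minimal sketch groups telemetry reports by LSH value and computes the per-cluster quantities named in the abstract (cluster size, download count, activity period, and number of distinct domains). It is an illustration only, not the SPADE implementation: the `Report` record, its field names, the `cluster_stats` helper, and the toy data are all assumptions, and it treats files as belonging to the same cluster only when their LSH values are identical.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Report:
    """Hypothetical telemetry record; all field names are assumptions."""
    file_hash: str  # identifier of the downloaded file (e.g. a content hash)
    lsh: str        # locality-sensitive hash value reported for the file
    domain: str     # web domain the file was downloaded from
    day: int        # day index of the download event


def cluster_stats(reports):
    """Group reports by identical LSH value and summarize each cluster."""
    clusters = defaultdict(list)
    for r in reports:
        clusters[r.lsh].append(r)

    stats = {}
    for lsh, rs in clusters.items():
        days = [r.day for r in rs]
        stats[lsh] = {
            "size": len({r.file_hash for r in rs}),      # distinct files
            "downloads": len(rs),                        # download events
            "activity_days": max(days) - min(days) + 1,  # first to last day
            "domains": len({r.domain for r in rs}),      # distinct domains
        }
    return stats


# Toy data: cluster lshA holds three distinct files, cluster lshB one file.
reports = [
    Report("f1", "lshA", "a.example", 1),
    Report("f2", "lshA", "a.example", 1),
    Report("f3", "lshA", "b.example", 3),
    Report("f4", "lshB", "c.example", 2),
    Report("f4", "lshB", "c.example", 9),
]
stats = cluster_stats(reports)
```

Per-cluster summaries of this kind are the sort of features on which benign and malicious clusters could be compared; which features and thresholds SPADE actually uses is not stated in the abstract.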
