Better Malware Ground Truth: Techniques for Weighting Anti-Virus Vendor Labels

We examine the problem of aggregating the results of multiple anti-virus (AV) vendors' detectors into a single authoritative ground-truth label for every binary. To do so, we adapt a well-known generative Bayesian model that postulates the existence of a hidden ground truth upon which the AV labels depend. We use training based on Expectation Maximization for this fully unsupervised technique. We evaluate our method using 279,327 distinct binaries from VirusTotal, each of which appeared for the first time between January 2012 and June 2014. Our evaluation shows that our statistical model is consistently more accurate at predicting the future-derived ground truth than all unweighted rules of the form "k out of n" AV detections. In addition, we evaluate the scenario where partial ground truth is available for model building. We train a logistic regression predictor on the partial label information. Our results show that as few as a 100 randomly selected training instances with ground truth are enough to achieve 80% true positive rate for 0.1% false positive rate. In comparison, the best unweighted threshold rule provides only 60% true positive rate at the same false positive rate.

[1]  A. P. Dawid,et al.  Maximum Likelihood Estimation of Observer Error‐Rates Using the EM Algorithm , 1979 .

[2]  Salvatore J. Stolfo,et al.  Data mining methods for detection of new malicious executables , 2001, Proceedings 2001 IEEE Symposium on Security and Privacy. S&P 2001.

[3]  Sebastian Thrun,et al.  Text Classification from Labeled and Unlabeled Documents using EM , 2000, Machine Learning.

[4]  David C. Yen,et al.  Classification methods in the detection of new malicious emails , 2005, Inf. Sci..

[5]  Pushpak Bhattacharyya,et al.  A model for handling approximate, noisy or incomplete labeling in text classification , 2005, ICML.

[6]  Arun K. Pujari,et al.  N-gram analysis for computer virus detection , 2006, Journal in Computer Virology.

[7]  Marcus A. Maloof,et al.  Learning to Detect and Classify Malicious Executables in the Wild , 2006, J. Mach. Learn. Res..

[8]  Nathalie Japkowicz,et al.  A Feature Selection and Evaluation Scheme for Computer Virus Detection , 2006, Sixth International Conference on Data Mining (ICDM'06).

[9]  Zhuoqing Morley Mao,et al.  Automated Classification and Analysis of Internet Malware , 2007, RAID.

[10]  Salvatore J. Stolfo,et al.  Towards Stealthy Malware Detection , 2007, Malware Detection.

[11]  Bhavani M. Thuraisingham,et al.  A Hybrid Model to Detect Malicious Executables , 2007, 2007 IEEE International Conference on Communications.

[12]  Wenke Lee,et al.  McBoost: Boosting Scalability in Malware Collection and Analysis Using Statistical Classification of Executables , 2008, 2008 Annual Computer Security Applications Conference (ACSAC).

[13]  Chih-Jen Lin,et al.  LIBLINEAR: A Library for Large Linear Classification , 2008, J. Mach. Learn. Res..

[14]  Tao Li,et al.  An intelligent PE-malware detection system based on association mining , 2008, Journal in Computer Virology.

[15]  Julio Canto,et al.  Large scale malware collection : lessons learned , 2008 .

[16]  Panagiotis G. Ipeirotis,et al.  Get another label? improving data quality and data mining using multiple, noisy labelers , 2008, KDD.

[17]  Carsten Willems,et al.  Learning and Classification of Malware Behavior , 2008, DIMVA.

[18]  Min Zhao,et al.  SBMDS: an interpretable string based malware detection system using SVM ensemble with bagging , 2009, Journal in Computer Virology.

[19]  Bhavani M. Thuraisingham,et al.  A scalable multi-level feature extraction technique to detect malicious executables , 2007, Inf. Syst. Frontiers.

[20]  Yoseba K. Penya,et al.  N-grams-based File Signatures for Malware Detection , 2009, ICEIS.

[21]  Lior Rokach,et al.  Improving malware detection by applying multi-inducer ensemble , 2009, Comput. Stat. Data Anal..

[22]  Md. Rafiqul Islam,et al.  An automated classification system based on the strings of trojan and virus families , 2009, 2009 4th International Conference on Malicious and Unwanted Software (MALWARE).

[23]  Muhammad Zubair Shafiq,et al.  Malware detection using statistical analysis of byte-level file content , 2009, CSI-KDD '09.

[24]  Gerardo Hermosillo,et al.  Supervised learning from multiple experts: whom to trust when everyone lies a bit , 2009, ICML '09.

[25]  Javier R. Movellan,et al.  Whose Vote Should Count More: Optimal Integration of Labels from Labelers of Unknown Expertise , 2009, NIPS.

[26]  Sahin Albayrak,et al.  Static Analysis of Executables for Collaborative Malware Detection on Android , 2009, 2009 IEEE International Conference on Communications.

[27]  Dragos Gavrilut,et al.  Malware Detection Using Perceptrons and Support Vector Machines , 2009, 2009 Computation World: Future Computing, Service Computation, Cognitive, Adaptive, Content, Patterns.

[28]  Nick Feamster,et al.  Behavioral Clustering of HTTP-Based Malware and Signature Generation Using Malicious Network Traces , 2010, NSDI.

[29]  Yoseba K. Penya,et al.  Idea: Opcode-Sequence-Based Malware Detection , 2010, ESSoS.

[30]  Peng Li,et al.  On Challenges in Evaluating Malware Clustering , 2010, RAID.

[31]  Gideon S. Mann,et al.  Generalized Expectation Criteria for Semi-Supervised Learning with Weakly Labeled Data , 2010, J. Mach. Learn. Res..

[32]  Andreas Dewald,et al.  Cujo: efficient detection and prevention of drive-by-download attacks , 2010, ACSAC '10.

[33]  Pietro Perona,et al.  The Multidimensional Wisdom of Crowds , 2010, NIPS.

[34]  Igor Santos,et al.  Semi-supervised Learning for Unknown Malware Detection , 2011, DCAI.

[35]  Paul A. Watters,et al.  Zero-day Malware Detection based on Supervised Learning Algorithms of API call Signatures , 2011, AusDM.

[36]  Carsten Willems,et al.  Automatic analysis of malware behavior using machine learning , 2011, J. Comput. Secur..

[37]  Pavel Laskov,et al.  Static detection of malicious JavaScript-bearing PDF documents , 2011, ACSAC '11.

[38]  Shipeng Yu,et al.  Ranking annotators for crowdsourced labeling tasks , 2011, NIPS.

[39]  Stefano Zanero,et al.  Finding Non-trivial Malware Naming Inconsistencies , 2011, ICISS.

[40]  Benjamin Livshits,et al.  ZOZZLE: Fast and Precise In-Browser JavaScript Malware Detection , 2011, USENIX Security Symposium.

[41]  Christos Faloutsos,et al.  Polonium: Tera-Scale Graph Mining and Inference for Malware Detection , 2011 .

[42]  Jian Peng,et al.  Variational Inference for Crowdsourcing , 2012, NIPS.

[43]  Latifur Khan,et al.  A Machine Learning Approach to Android Malware Detection , 2012, 2012 European Intelligence and Security Informatics Conference.

[44]  Konrad Rieck,et al.  Autonomous learning for detection of JavaScript attacks: vision or reality? , 2012, AISec '12.

[45]  Giorgio Giacinto,et al.  A Pattern Recognition System for Malicious PDF Files Detection , 2012, MLDM.

[46]  Blaine Nelson,et al.  Poisoning Attacks against Support Vector Machines , 2012, ICML.

[47]  Christopher Krügel,et al.  FlashDetect: ActionScript 3 Malware Detection , 2012, RAID.

[48]  Gonzalo Álvarez,et al.  PUMA: Permission Usage to Detect Malware in Android , 2012, CISIS/ICEUTE/SOCO Special Sessions.

[49]  Roberto Perdisci,et al.  VAMO: towards a fully automated malware clustering validity analysis , 2012, ACSAC '12.

[50]  Lior Rokach,et al.  Mal-ID: Automatic Malware Detection Using Common Segment Analysis and Meta-Features , 2012, J. Mach. Learn. Res..

[51]  Angelos Stavrou,et al.  Malicious PDF detection using metadata and structural features , 2012, ACSAC '12.

[52]  Pavel Laskov,et al.  Detection of Malicious PDF Files Based on Hierarchical Document Structure , 2013, NDSS.

[53]  Gianluca Stringhini,et al.  Shady paths: leveraging surfing crowds to detect malicious web pages , 2013, CCS.

[54]  Jack W. Stokes,et al.  Large-scale malware classification using random projections and neural networks , 2013, 2013 IEEE International Conference on Acoustics, Speech and Signal Processing.

[55]  Konrad Rieck,et al.  Structural detection of android malware using embedded call graphs , 2013, AISec.

[56]  Niels Provos,et al.  CAMP: Content-Agnostic Malware Protection , 2013, NDSS.

[57]  Lior Rokach,et al.  Novel active learning methods for enhanced PC malware detection in windows OS , 2014, Expert Syst. Appl..

[58]  Zane Markel,et al.  Building a machine learning classifier for malware detection , 2014, 2014 Second Workshop on Anti-malware Testing Research (WATeR).

[59]  Aziz Mohaisen,et al.  AV-Meter: An Evaluation of Antivirus Scans and Labels , 2014, DIMVA.

[60]  Yuval Elovici,et al.  ALPD: Active Learning Framework for Enhancing the Detection of Malicious PDF Files , 2014, 2014 IEEE Joint Intelligence and Security Informatics Conference.

[61]  Ling Huang,et al.  Back to the Future: Malware Detection with Temporally Consistent Labels , 2015, ArXiv.