Technical Report 2016-1 — Royal Holloway, University of London. Misleading Metrics: On Evaluating Machine Learning for Malware with Confidence

Malware poses a serious and evolving threat across the Internet, and the need for automated learning-based approaches has quickly become clear. Machine learning has long been acknowledged as a promising technique to identify and classify malware threats; unfortunately, such a powerful technique is often treated as a black-box panacea, where little is understood and results, especially highly accurate ones, are accepted without questioning their quality. As a result, outcomes are often biased by the choice of empirical thresholds or by dataset-specific artifacts, which hinders the ability to define easy-to-understand error metrics and thus to compare different approaches. This setting calls for new metrics that look beyond purely quantitative measurements (e.g., precision and recall) and help to scientifically assess the soundness of the underlying machine learning tasks. To this end, we propose conformal evaluator, a framework designed to evaluate the quality of a result in terms of statistical metrics such as credibility and confidence. Credibility measures how strongly a sample supports a given prediction (e.g., a label), whereas confidence indicates how clearly that sample is distinguished from the alternative predictions. These evaluation metrics give useful insights, providing a quantifiable, per-choice level of assurance and reliability. At the core of conformal evaluator is a non-conformity measure, which, in essence, quantifies how different a sample is from a set of samples. For this reason, our framework is general enough to be applied immediately to a large class of algorithms that rely on distances to identify and classify malware, making it possible to better understand and compare machine learning results. To further support our claim, we present case studies in which the outcomes of three different algorithms are evaluated under conformal evaluator. We show how traditional metrics can mislead about the performance of different algorithms; conformal evaluator's metrics, instead, make it possible to understand the reasons behind the performance of a given algorithm and reveal the shortcomings of apparently highly accurate methods.
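To make credibility and confidence concrete, the sketch below shows one standard way such scores are computed in conformal prediction, assuming a simple distance-based non-conformity measure (the average Euclidean distance from a sample to the samples of a candidate class). The function names (`nonconformity`, `p_values`, `credibility_confidence`) and the specific measure are illustrative assumptions, not the paper's implementation.

```python
# Minimal sketch of conformal-style credibility/confidence scores.
# Assumption: a distance-based non-conformity measure and at least two
# calibration samples per class. Names are illustrative, not the authors' code.
import numpy as np

def nonconformity(sample, class_samples):
    """Non-conformity of `sample` w.r.t. a set of same-class samples:
    here, the average Euclidean distance (one possible choice)."""
    return np.mean(np.linalg.norm(class_samples - sample, axis=1))

def p_values(sample, X, y):
    """Conformal p-value of `sample` for every candidate label in `y`."""
    pvals = {}
    for label in np.unique(y):
        class_samples = X[y == label]
        # Non-conformity of each calibration sample against the rest of its
        # class (leave-one-out), and of the test sample against the class.
        alphas = np.array([
            nonconformity(class_samples[i], np.delete(class_samples, i, axis=0))
            for i in range(len(class_samples))
        ])
        alpha_test = nonconformity(sample, class_samples)
        # p-value: fraction of calibration scores at least as "strange"
        # as the test sample's score.
        pvals[label] = float(np.mean(alphas >= alpha_test))
    return pvals

def credibility_confidence(pvals):
    """Credibility = best p-value; confidence = 1 - second-best p-value."""
    ranked = sorted(pvals.values(), reverse=True)
    credibility = ranked[0]
    confidence = 1.0 - (ranked[1] if len(ranked) > 1 else 0.0)
    return credibility, confidence
```

Under this reading, the predicted family for a test sample would be the label with the highest p-value; low credibility flags samples that fit no known family well, while low confidence flags samples that fit more than one family almost equally well.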
