Malware poses a serious and challenging threat, and the sheer scale of the problem has made the need for automated, learning-based approaches clear. Swift analysis and prompt detection of these threats are among the most pressing issues plaguing the security of the Internet and its users. With more than 550,000 unique malware samples per day reported in Q4 2015, manual analysis clearly does not scale, and the field has therefore shifted towards automatic and adaptive techniques that can identify unknown, previously-unseen threats. To this end, machine learning, with a particular emphasis on clustering and classification, has long been acknowledged as a promising way to address this need in a number of security-related domains, including botnet detection [4, 10], mobile malware [1], and traditional malware [2, 3, 8, 9, 11].

The advances in the area would seem to suggest that the problem is almost solved. However, assessing the actual results of a given algorithm is problematic. With few exceptions, e.g., [1, 10, 13], the lack of publicly-available datasets hinders the ability to reproduce and compare results. Furthermore, traditional metrics (e.g., accuracy, precision, recall) may produce misleading results when used to assess the performance of a machine learning algorithm: they report statistics on correct and incorrect decisions, but do not capture the quality of those decisions, and are hence ill-suited to evaluate a given task. The problem is further exacerbated when machine learning algorithms are deployed in real-world settings, which often see new labels (malware families) and changes in the underlying data distribution (malware variants, new behaviors).

Li et al. consider this problem [6], empirically showing that high scores on traditional metrics do not necessarily imply that the underlying machine learning approach is good, and that datasets are often chosen to support the authors' claims. Their work focuses primarily on methods built specifically around the available datasets, which suffered from over-fitting. Conversely, our work tackles the problem in a broader scope, providing a way to assess the quality of a given algorithm in a scientific and rigorous manner. We aim to provide quality metrics that support the development of machine learning algorithms, offer insight into the learning process, and help predict the performance of a deployed algorithm.
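As a minimal illustration of why such metrics can mislead (a toy sketch of ours, not an experiment from the cited works; it assumes scikit-learn is available), consider an imbalanced dataset in which a degenerate classifier labels every sample as the majority family: accuracy looks strong while the minority family is missed entirely.

    # Toy sketch, not part of the paper's method: high accuracy despite a useless model.
    from sklearn.metrics import accuracy_score, precision_score, recall_score

    y_true = ["A"] * 95 + ["B"] * 5   # hypothetical ground truth: 95 samples of family A, 5 of family B
    y_pred = ["A"] * 100              # degenerate classifier: always predicts the majority family

    print(accuracy_score(y_true, y_pred))                                   # 0.95
    print(precision_score(y_true, y_pred, pos_label="B", zero_division=0))  # 0.0
    print(recall_score(y_true, y_pred, pos_label="B", zero_division=0))     # 0.0

Accuracy reaches 95%, yet the classifier never identifies a single sample of the minority family, which is precisely the kind of quality information that aggregate statistics fail to convey.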
[1] Christopher Krügel et al. Scalable, Behavior-Based Malware Clustering. NDSS, 2009.
[2] Konrad Rieck et al. DREBIN: Effective and Explainable Detection of Android Malware in Your Pocket. NDSS, 2014.
[3] Juan Caballero et al. FIRMA: Malware Clustering and Network Signature Generation with Mixed Network Behaviors. RAID, 2013.
[4] Nick Feamster et al. Behavioral Clustering of HTTP-Based Malware and Signature Generation Using Malicious Network Traces. NSDI, 2010.
[5] Peng Li et al. On Challenges in Evaluating Malware Clustering. RAID, 2010.
[6] Carsten Willems et al. Learning and Classification of Malware Behavior. DIMVA, 2008.
[7] Christian Platzer et al. MARVIN: Efficient and Comprehensive Mobile App Classification through Static and Dynamic Analysis. IEEE 39th Annual Computer Software and Applications Conference, 2015.
[8] Ian T. Jolliffe et al. Principal Component Analysis. International Encyclopedia of Statistical Science, 2002.
[9] Yajin Zhou et al. Dissecting Android Malware: Characterization and Evolution. IEEE Symposium on Security and Privacy, 2012.
[10] Guofei Gu et al. BotMiner: Clustering Analysis of Network Traffic for Protocol- and Structure-Independent Botnet Detection. USENIX Security Symposium, 2008.
[11] Roberto Perdisci et al. Scalable fine-grained behavioral clustering of HTTP-based malware. Computer Networks, 2013.