BODMAS: An Open Dataset for Learning based Temporal Analysis of PE Malware

We describe and release BODMAS, an open PE malware dataset, to facilitate research in machine-learning-based malware analysis. By closely examining existing open PE malware datasets, we identified two missing capabilities (recent, timestamped malware samples and well-curated family information) whose absence has limited researchers’ ability to study pressing issues such as concept drift and malware family evolution. We release a new dataset to fill these gaps. The BODMAS dataset contains 57,293 malware samples and 77,142 benign samples collected from August 2019 to September 2020, with carefully curated family information (581 families). We also perform a preliminary analysis to illustrate the impact of concept drift and discuss how this dataset can support existing and future research.
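Timestamps matter for studying concept drift because they allow a classifier to be trained on older samples and tested on strictly later ones. The following is a minimal sketch of such a temporal split, not the procedure used in the paper. The record layout (a `(first_seen, label)` tuple), the function name `temporal_split`, and the toy data are all hypothetical and do not reflect the actual BODMAS file format.

```python
from collections import defaultdict
from datetime import date


def temporal_split(samples, cutoff):
    """Split (first_seen, label) records around a cutoff date.

    Records seen before `cutoff` form the training set; later records are
    grouped into per-month test buckets keyed by (year, month).
    """
    train = [s for s in samples if s[0] < cutoff]
    test_by_month = defaultdict(list)
    for s in samples:
        if s[0] >= cutoff:
            test_by_month[(s[0].year, s[0].month)].append(s)
    # Sort buckets chronologically so drift can be tracked month by month.
    return train, dict(sorted(test_by_month.items()))


# Hypothetical toy records: (first-seen date, label), 1 = malware, 0 = benign.
samples = [
    (date(2019, 8, 30), 1),
    (date(2019, 9, 2), 0),
    (date(2019, 10, 15), 1),
    (date(2019, 10, 20), 0),
    (date(2019, 11, 1), 1),
]

train, test_by_month = temporal_split(samples, cutoff=date(2019, 10, 1))
```

Evaluating one fixed model on each monthly bucket in order would show whether its accuracy degrades as the test data moves further from the training period.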
