Automatic analysis of malware behavior using machine learning

Malicious software - so called malware - poses a major threat to the security of computer systems. The amount and diversity of its variants render classic security defenses ineffective, such that millions of hosts in the Internet are infected with malware in the form of computer viruses, Internet worms and Trojan horses. While obfuscation and polymorphism employed by malware largely impede detection at file level, the dynamic analysis of malware binaries during run-time provides an instrument for characterizing and defending against the threat of malicious software. In this article, we propose a framework for the automatic analysis of malware behavior using machine learning. The framework allows for automatically identifying novel classes of malware with similar behavior (clustering) and assigning unknown malware to these discovered classes (classification). Based on both, clustering and classification, we propose an incremental approach for behavior-based analysis, capable of processing the behavior of thousands of malware binaries on a daily basis. The incremental analysis significantly reduces the run-time overhead of current analysis methods, while providing accurate discovery and discrimination of novel malware variants.

[1]  David G. Stork,et al.  Pattern Classification , 1973 .

[2]  Stephanie Forrest,et al.  Intrusion Detection Using Sequences of System Calls , 1998, J. Comput. Secur..

[3]  Wenke Lee,et al.  Misleading worm signature generators using deliberate noise injection , 2006, 2006 IEEE Symposium on Security and Privacy (S&P'06).

[4]  Felix C. Freiling,et al.  Toward Automated Dynamic Malware Analysis Using CWSandbox , 2007, IEEE Secur. Priv..

[5]  Salvatore J. Stolfo,et al.  Data mining methods for detection of new malicious executables , 2001, Proceedings 2001 IEEE Symposium on Security and Privacy. S&P 2001.

[6]  Karl N. Levitt,et al.  MCF: a malicious code filter , 1995, Comput. Secur..

[7]  John Langford,et al.  Cover trees for nearest neighbor , 2006, ICML.

[8]  Carsten Willems,et al.  A Malware Instruction Set for Behavior-Based Analysis , 2010, Sicherheit.

[9]  M Damashek,et al.  Gauging Similarity with n-Grams: Language-Independent Categorization of Text , 1995, Science.

[10]  Eric Filiol,et al.  Malware Behavioral Detection by Attribute-Automata Using Abstraction from Platform and Language , 2009, RAID.

[11]  James C. Bezdek,et al.  Nearest prototype classifier designs: An experimental study , 2001, Int. J. Intell. Syst..

[12]  Philip K. Chan,et al.  Learning Patterns from Unix Process Execution Traces for Intrusion Detection , 1997 .

[13]  William M. Rand,et al.  Objective Criteria for the Evaluation of Clustering Methods , 1971 .

[14]  Zhuoqing Morley Mao,et al.  Automated Classification and Analysis of Internet Malware , 2007, RAID.

[15]  Kymie M. C. Tan,et al.  Undermining an Anomaly-Based Intrusion Detection System Using Common Exploits , 2002, RAID.

[16]  Peter Szor,et al.  The Art of Computer Virus Research and Defense , 2005 .

[17]  Dawn Xiaodong Song,et al.  Limits of Learning-based Signature Generation with Adversaries , 2008, NDSS.

[18]  Somesh Jha,et al.  A semantics-based approach to malware detection , 2007, POPL '07.

[19]  Somesh Jha,et al.  Static Analysis of Executables to Detect Malicious Patterns , 2003, USENIX Security Symposium.

[20]  Alexander Dekhtyar,et al.  Information Retrieval , 2018, Lecture Notes in Computer Science.

[21]  Felix C. Freiling,et al.  Learning More about the Underground Economy: A Case-Study of Keyloggers and Dropzones , 2009, ESORICS.

[22]  Somesh Jha,et al.  Semantics-aware malware detection , 2005, 2005 IEEE Symposium on Security and Privacy (S&P'05).

[23]  Shigeo Abe DrEng Pattern Classification , 2001, Springer London.

[24]  Wenke Lee,et al.  Ether: malware analysis via hardware virtualization extensions , 2008, CCS.

[25]  Zhenkai Liang,et al.  BitBlaze: A New Approach to Computer Security via Binary Analysis , 2008, ICISS.

[26]  Konrad Rieck,et al.  Linear-Time Computation of Similarity Measures for Sequential Data , 2008, J. Mach. Learn. Res..

[27]  Stephanie Forrest,et al.  A sense of self for Unix processes , 1996, Proceedings 1996 IEEE Symposium on Security and Privacy.

[28]  David A. Wagner,et al.  Mimicry attacks on host-based intrusion detection systems , 2002, CCS '02.

[29]  Engin Kirda,et al.  Insights into current malware behavior , 2009 .

[30]  Wenke Lee,et al.  K-Tracer: A System for Extracting Kernel Malware Behavior , 2009, NDSS.

[31]  Marc Dacier,et al.  Automatic Handling of Protocol Dependencies and Reaction to 0-Day Attacks with ScriptGen Based Honeypots , 2006, RAID.

[32]  Michael R. Anderberg,et al.  Cluster Analysis for Applications , 1973 .

[33]  Christopher Krügel,et al.  Dynamic Analysis of Malicious Code , 2006, Journal in Computer Virology.

[34]  David S. Johnson,et al.  Computers and Intractability: A Guide to the Theory of NP-Completeness , 1978 .

[35]  Somesh Jha,et al.  A semantics-based approach to malware detection , 2008, TOPL.

[36]  Christopher Krügel,et al.  Scalable, Behavior-Based Malware Clustering , 2009, NDSS.

[37]  Saumya K. Debray,et al.  Obfuscation of executable code to improve resistance to static disassembly , 2003, CCS '03.

[38]  Wenke Lee,et al.  PolyUnpack: Automating the Hidden-Code Extraction of Unpack-Executing Malware , 2006, 2006 22nd Annual Computer Security Applications Conference (ACSAC'06).

[39]  Gerard Salton,et al.  A vector space model for automatic indexing , 1975, CACM.

[40]  Salvatore J. Stolfo,et al.  Towards Stealthy Malware Detection , 2007, Malware Detection.

[41]  Christopher Krügel,et al.  Effective and Efficient Malware Detection at the End Host , 2009, USENIX Security Symposium.

[42]  Stefan Savage,et al.  An inquiry into the nature and causes of the wealth of internet miscreants , 2007, CCS '07.

[43]  U. Bayer,et al.  TTAnalyze: A Tool for Analyzing Malware , 2006 .

[44]  Carsten Willems,et al.  Learning and Classification of Malware Behavior , 2008, DIMVA.

[45]  Gunnar Rätsch,et al.  An introduction to kernel-based learning algorithms , 2001, IEEE Trans. Neural Networks.

[46]  Klaus-Robert Müller,et al.  From outliers to prototypes: Ordering data , 2006, Neurocomputing.

[47]  Jonathon T. Giffin,et al.  Automatic Reverse Engineering of Malware Emulators , 2009, 2009 30th IEEE Symposium on Security and Privacy.

[48]  Christopher Krügel,et al.  Automating Mimicry Attacks Using Static Binary Analysis , 2005, USENIX Security Symposium.

[49]  Stephen M. Omohundro,et al.  Five Balltree Construction Algorithms , 2009 .

[50]  Gary Carpenter 동적 사용자를 위한 Scalable 인증 그룹 키 교환 프로토콜 , 2005 .

[51]  Nello Cristianini,et al.  Kernel Methods for Pattern Analysis , 2004 .

[52]  Galen C. Hunt,et al.  Detours: binary interception of Win32 functions , 1999 .

[53]  Marcus A. Maloof,et al.  Learning to Detect and Classify Malicious Executables in the Wild , 2006, J. Mach. Learn. Res..

[54]  Wenke Lee,et al.  Polymorphic Blending Attacks , 2006, USENIX Security Symposium.

[55]  Xu Chen,et al.  Towards an understanding of anti-virtualization and anti-debugging behavior in modern malware , 2008, 2008 IEEE International Conference on Dependable Systems and Networks With FTCS and DCC (DSN).

[56]  Felix C. Freiling,et al.  The Nepenthes Platform: An Efficient Approach to Collect Malware , 2006, RAID.

[57]  Teofilo F. GONZALEZ,et al.  Clustering to Minimize the Maximum Intercluster Distance , 1985, Theor. Comput. Sci..

[58]  Wenke Lee,et al.  McBoost: Boosting Scalability in Malware Collection and Analysis Using Statistical Classification of Executables , 2008, 2008 Annual Computer Security Applications Conference (ACSAC).

[59]  Christopher Krügel,et al.  Exploring Multiple Execution Paths for Malware Analysis , 2007, 2007 IEEE Symposium on Security and Privacy (SP '07).

[60]  Kymie M. C. Tan,et al.  "Why 6?" Defining the operational limits of stide, an anomaly-based intrusion detector , 2002, Proceedings 2002 IEEE Symposium on Security and Privacy.

[61]  Christopher Krügel,et al.  Limits of Static Analysis for Malware Detection , 2007, Twenty-Third Annual Computer Security Applications Conference (ACSAC 2007).

[62]  Gregory R. Andrews,et al.  Binary Obfuscation Using Signals , 2007, USENIX Security Symposium.

[63]  Klaus-Robert Müller,et al.  Approximate Tree Kernels , 2010, J. Mach. Learn. Res..

[64]  Thomas G. Dietterich What is machine learning? , 2020, Archives of Disease in Childhood.

[65]  Van-Hau Pham,et al.  on the Advantages of Deploying a Large Scale Distributed Honeypot Platform , 2005 .