Modeling High-Level Behavior Patterns for Precise Similarity Analysis of Software

The analysis of software similarity has many applications such as detecting code clones, software plagiarism, code theft, and polymorphic malware. Because often source code is unavailable and code obfuscation is used to avoid detection, there has been much research on developing effective models to capture runtime behavior to aid detection. Existing models focus on low-level information such as dependency or purely occurrence of function calls, and suffer from poor precision, poor scalability, or both. To overcome limitations of existing models, this paper introduces a precise and succinct behavior representation that characterizes high-level object-accessing patterns as regular expressions. We first distill a set of high-level patterns (the alphabet S of the regular language) based on two pieces of information: function call patterns to access objects and type state information of the objects. Then we abstract a runtime trace of a program P into a regular expression e over the pattern alphabet S to produce P's behavior signature. We show that software instances derived from the same code exhibit similar behavior signatures and develop effective algorithms to cluster and match behavior signatures. To evaluate the effectiveness of our behavior model, we have applied it to the similarity analysis of polymorphic malware. Our results on a large malware collection demonstrate that our model is both precise and succinct for effective and scalable matching and detection of polymorphic malware.

[1]  Somesh Jha,et al.  Semantics-aware malware detection , 2005, 2005 IEEE Symposium on Security and Privacy (S&P'05).

[2]  Ziheng Yang PAML 4: phylogenetic analysis by maximum likelihood. , 2007, Molecular biology and evolution.

[3]  Xiangyu Zhang,et al.  Matching execution histories of program versions , 2005, ESEC/FSE-13.

[4]  Somesh Jha,et al.  Mining specifications of malicious behavior , 2008, ISEC '08.

[5]  Christopher Krügel,et al.  Scalable, Behavior-Based Malware Clustering , 2009, NDSS.

[6]  Zhendong Su,et al.  Automatic mining of functionally equivalent code fragments via random testing , 2009, ISSTA.

[7]  Jens Stoye,et al.  Simple and flexible detection of contiguous repeats using a suffix tree , 2002, Theor. Comput. Sci..

[8]  Qinghua Zhang,et al.  MetaAware: Identifying Metamorphic Malware , 2007, Twenty-Third Annual Computer Security Applications Conference (ACSAC 2007).

[9]  Kang G. Shin,et al.  Large-scale malware indexing using function-call graphs , 2009, CCS.

[10]  Zhuoqing Morley Mao,et al.  Automated Classification and Analysis of Internet Malware , 2007, RAID.

[11]  Anil K. Jain,et al.  Data clustering: a review , 1999, CSUR.

[12]  Bjorn De Sutter,et al.  Matching Control Flow of Program Versions , 2007, 2007 IEEE International Conference on Software Maintenance.

[13]  Jiong Yang,et al.  CLUSEQ: efficient and effective sequence clustering , 2003, Proceedings 19th International Conference on Data Engineering (Cat. No.03CH37405).

[14]  Rajiv Gupta,et al.  Detecting virus mutations via dynamic matching , 2009, 2009 IEEE International Conference on Software Maintenance.

[15]  Christopher Krügel,et al.  Effective and Efficient Malware Detection at the End Host , 2009, USENIX Security Symposium.

[16]  M. Gouy,et al.  Inferring pattern and process: maximum-likelihood implementation of a nonhomogeneous model of DNA sequence evolution for phylogenetic analysis. , 1998, Molecular biology and evolution.

[17]  Carsten Willems,et al.  Learning and Classification of Malware Behavior , 2008, DIMVA.

[18]  It Informatics,et al.  Kaspersky Anti-Virus , 2010 .

[19]  Sencun Zhu,et al.  Behavior based software theft detection , 2009, CCS.

[20]  David Schuler,et al.  A dynamic birthmark for java , 2007, ASE.

[21]  Sencun Zhu,et al.  Detecting Software Theft via System Call Based Birthmarks , 2009, 2009 Annual Computer Security Applications Conference.