Credible, resilient, and scalable detection of software plagiarism using authority histograms

Software plagiarism has become a serious threat to the health of the software industry. A software birthmark captures unique characteristics of a program that can be used to analyze the similarity between two programs and to provide proof of plagiarism. In this paper, we propose a novel birthmark, Authority Histograms (AH), which satisfies three essential requirements for a good birthmark: resiliency, credibility, and scalability. Existing birthmarks fail to satisfy all three simultaneously. AH reflects not only the frequency of API calls but also their call order, whereas previous birthmarks rarely consider the two together. This property yields more accurate plagiarism detection, making our birthmark more resilient and credible than previously proposed birthmarks. By using random walk with restart to generate AH, we make our proposal applicable even to large programs. Extensive experiments with a set of Windows applications verify that both the credibility and the resiliency of AH exceed those of existing birthmarks; AH therefore provides improved accuracy in detecting plagiarism. Moreover, the construction and comparison phases of AH complete within a reasonable time.
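
To make the idea concrete, the following is a minimal sketch of how random walk with restart can turn an API call graph into a per-API score vector that is sensitive to both call frequency (edge weights) and call order (edge direction). It is not the paper's implementation: the function names `rwr_scores` and `cosine_similarity`, the uniform restart distribution, the restart probability of 0.15, the handling of APIs without outgoing calls, and the use of cosine similarity for comparison are all assumptions made for illustration.

```python
import math
from collections import defaultdict


def rwr_scores(edges, restart=0.15, iters=100):
    """Score each API by random walk with restart (power iteration).

    edges: list of (caller_api, callee_api) pairs; a pair repeated k times
    acts as an edge of weight k, so call frequency shapes the walk.
    restart and the uniform restart distribution are illustrative assumptions.
    """
    nodes = sorted({n for edge in edges for n in edge})
    out = defaultdict(lambda: defaultdict(float))
    for caller, callee in edges:
        out[caller][callee] += 1.0

    restart_vec = {x: 1.0 / len(nodes) for x in nodes}
    score = dict(restart_vec)
    for _ in range(iters):
        nxt = {x: restart * restart_vec[x] for x in nodes}
        for u in nodes:
            total = sum(out[u].values())
            if total == 0.0:
                # Assumption: an API with no outgoing calls sends its
                # mass back through the restart distribution.
                for x in nodes:
                    nxt[x] += (1.0 - restart) * score[u] * restart_vec[x]
            else:
                for v, w in out[u].items():
                    nxt[v] += (1.0 - restart) * score[u] * w / total
        score = nxt
    return score


def cosine_similarity(a, b):
    """Compare two score vectors (assumed measure, not from the paper)."""
    keys = set(a) | set(b)
    dot = sum(a.get(k, 0.0) * b.get(k, 0.0) for k in keys)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


# Hypothetical call graphs for two programs (API names are placeholders).
original = [("CreateFile", "ReadFile"), ("CreateFile", "ReadFile"),
            ("CreateFile", "CloseHandle")]
suspect = list(original)
print(cosine_similarity(rwr_scores(original), rwr_scores(suspect)))
```

In this sketch the scores of a program always sum to 1, and an API that is called twice from the same caller ends up with a higher score than one called once, which is the frequency-plus-order behavior the abstract describes at a high level.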
