Clonewise - Detecting Package-Level Clones Using Machine Learning

Developers sometimes maintain an internal copy of another software or fork development of an existing project. This practice can lead to software vulnerabilities when the embedded code is not kept up to date with upstream sources. We propose an automated solution to identify clones of packages without any prior knowledge of these relationships. We then correlate clones with vulnerability information to identify outstanding security problems. This approach motivates software maintainers to avoid using cloned packages and link against system wide libraries. We propose over 30 novel features that enable us to use to use pattern classification to accurately identify package-level clones. To our knowledge, we are the first to consider clone detection as a classification problem. Our results show our system, Clonewise, compares well to manually tracked databases. Based on our work, over 30 unknown package clones and vulnerabilities have been identified and patched.

[1]  Michael J. Wise,et al.  YAP3: improved detection of similarities in computer program and other texts , 1996, SIGCSE '96.

[2]  Dongmei Zhang,et al.  Code clone detection experience at microsoft , 2011, IWSC '11.

[3]  Chris Tyler Fedora Linux , 2006 .

[4]  L JonesEdward Metrics based plagarism monitoring , 2001 .

[5]  Pat Langley,et al.  Estimating Continuous Distributions in Bayesian Classifiers , 1995, UAI.

[6]  Jens Krinke,et al.  Identifying similar code with program dependence graphs , 2001, Proceedings Eighth Working Conference on Reverse Engineering.

[7]  Jesse D. Kornblum Identifying almost identical files using context triggered piecewise hashing , 2006, Digit. Investig..

[8]  Ian H. Witten,et al.  The WEKA data mining software: an update , 2009, SKDD.

[9]  Philip S. Yu,et al.  GPLAG: detection of software plagiarism by program dependence graph analysis , 2006, KDD '06.

[10]  Zhendong Su,et al.  DECKARD: Scalable and Accurate Tree-Based Detection of Code Clones , 2007, 29th International Conference on Software Engineering (ICSE'07).

[11]  SaltonGerard,et al.  Term-weighting approaches in automatic text retrieval , 1988 .

[12]  Shinji Kusumoto,et al.  CCFinder: A Multilinguistic Token-Based Code Clone Detection System for Large Scale Source Code , 2002, IEEE Trans. Software Eng..

[13]  Michael Philippsen,et al.  Finding Plagiarisms among a Set of Programs with JPlag , 2002, J. Univers. Comput. Sci..

[14]  Yuanyuan Zhou,et al.  CP-Miner: finding copy-paste and related bugs in large-scale software code , 2006, IEEE Transactions on Software Engineering.

[15]  Seong-Bae Park,et al.  Program Plagiarism Detection Using Parse Tree Kernels , 2006, PRICAI.

[16]  Chanchal K. Roy,et al.  A Survey on Software Clone Detection Research , 2007 .

[17]  Hwan-Gue Cho,et al.  A source code linearization technique for detecting plagiarized programs , 2007, ITiCSE.

[18]  Heejung Kim,et al.  MeCC: memory comparison-based clone detector , 2011, 2011 33rd International Conference on Software Engineering (ICSE).

[19]  Yuanyuan Zhou,et al.  CP-Miner: A Tool for Finding Copy-paste and Related Bugs in Operating System Code , 2004, OSDI.

[20]  LiXuelong,et al.  Asymmetric Bagging and Random Subspace for Support Vector Machines-Based Relevance Feedback in Image Retrieval , 2006 .

[21]  Stan Jarzabek,et al.  A Data Mining Approach for Detecting Higher-Level Clones in Software , 2009, IEEE Transactions on Software Engineering.

[22]  Katsuro Inoue,et al.  Very-Large Scale Code Clone Analysis and Visualization of Open Source Programs Using Distributed CCFinder: D-CCFinder , 2007, 29th International Conference on Software Engineering (ICSE'07).

[23]  Zhendong Su,et al.  Context-based detection of clone-related bugs , 2007, ESEC-FSE '07.

[24]  Geoffrey I. Webb,et al.  PRICAI 2006: Trends in Artificial Intelligence, 9th Pacific Rim International Conference on Artificial Intelligence, Guilin, China, August 7-11, 2006, Proceedings , 2006, PRICAI.

[25]  Leo Breiman,et al.  Random Forests , 2001, Machine Learning.

[26]  Stéphane Ducasse,et al.  A language independent approach for detecting duplicated code , 1999, Proceedings IEEE International Conference on Software Maintenance - 1999 (ICSM'99). 'Software Maintenance for Business Change' (Cat. No.99CB36360).

[27]  Nathalie Japkowicz,et al.  The class imbalance problem: A systematic study , 2002, Intell. Data Anal..

[28]  Harold W. Kuhn,et al.  The Hungarian method for the assignment problem , 1955, 50 Years of Integer Programming.

[29]  J. Ross Quinlan,et al.  C4.5: Programs for Machine Learning , 1992 .