Selecting Features in Origin Analysis

When applying a machine-learning approach to develop classifiers in a new domain, an important question is what measurements to take and how to use them to construct informative features. This paper develops a novel set of machine-learning classifiers for files taken from software projects; the target classifications are based on origin analysis. Our approach adapts the output of four copy-analysis tools to generate a number of different measurements. Combining these measures with the files on which they operate generates a large set of features in a semi-automatic manner. Standard attribute selection and classifier training techniques then yield a pool of high-quality classifiers (accuracy in the range of 90%), together with information on the most relevant features.
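
The feature-generation and attribute-selection steps described above can be illustrated with a minimal sketch. It is not the paper's implementation: the measure names, file roles, toy values, and the use of a median-split information gain as the ranking criterion are all assumptions made for illustration.

```python
from itertools import product
from math import log2
from statistics import median

# Hypothetical measure names and file roles (assumptions, not from the paper).
MEASURES = ["similarity", "shared_blocks"]
FILE_PAIRS = [("candidate", "source_a"), ("candidate", "source_b")]


def feature_names(measures, file_pairs):
    """Cross every measure with every file pair, giving one feature per combination."""
    return [f"{m}({a},{b})" for m, (a, b) in product(measures, file_pairs)]


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    n = len(labels)
    if n == 0:
        return 0.0
    counts = {}
    for label in labels:
        counts[label] = counts.get(label, 0) + 1
    return -sum(c / n * log2(c / n) for c in counts.values())


def information_gain(values, labels):
    """Entropy reduction from splitting the instances at the median feature value."""
    cut = median(values)
    low = [l for v, l in zip(values, labels) if v <= cut]
    high = [l for v, l in zip(values, labels) if v > cut]
    n = len(labels)
    remainder = len(low) / n * entropy(low) + len(high) / n * entropy(high)
    return entropy(labels) - remainder


def rank_features(rows, labels, names):
    """Score each feature column and return (name, gain) pairs, best first."""
    scores = {
        name: information_gain([row[i] for row in rows], labels)
        for i, name in enumerate(names)
    }
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)


# Toy data (invented): one row per file, one column per generated feature.
names = feature_names(MEASURES, FILE_PAIRS)
rows = [
    [0.9, 0.8, 3, 1],
    [0.8, 0.7, 1, 3],
    [0.2, 0.6, 2, 2],
    [0.1, 0.9, 4, 4],
]
labels = ["split", "split", "other", "other"]
ranked = rank_features(rows, labels, names)

for name, gain in ranked:
    print(f"{name}: {gain:.3f}")
```

In this sketch, the features that survive ranking would then be passed to any standard classifier trainer; the ranking criterion here is a stand-in for whichever attribute selection method is actually used.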
