Paper Information - Selecting Features in Origin Analysis

When applying a machine-learning approach to develop classifiers in a new domain, an important question is which measurements to take and how to use them to construct informative features. This paper develops a novel set of machine-learning classifiers for files taken from software projects; the target classifications are based on origin analysis. Our approach adapts the output of four copy-analysis tools to generate a number of different measurements. By combining the measures with the files on which they operate, a large set of features is generated in a semi-automatic manner. Standard attribute selection and classifier training techniques then yield a pool of high-quality classifiers (accuracy of around 90%), together with information on the most relevant features.
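The feature-generation and selection steps described above can be illustrated with a minimal sketch. Everything named here is an assumption made for illustration: the measure names, the file roles, and the `separation` score are hypothetical, and the score is only a simple stand-in for the standard attribute selection techniques the abstract refers to, not the paper's actual method.

```python
from itertools import product
from statistics import mean

# Hypothetical measurement names and file roles (not taken from the paper).
MEASURES = ["similarity", "shared_trigrams", "clone_lines"]
FILE_PAIRS = [("candidate", "predecessor"), ("candidate", "successor")]


def feature_names(measures, file_pairs):
    """Cross every measure with every pair of files it could be applied to."""
    return [f"{m}({a},{b})" for m, (a, b) in product(measures, file_pairs)]


def separation(values, labels):
    """Crude relevance score: absolute gap between the two class means."""
    pos = [v for v, y in zip(values, labels) if y]
    neg = [v for v, y in zip(values, labels) if not y]
    return abs(mean(pos) - mean(neg))


def rank_features(rows, labels, names):
    """Return feature names sorted from most to least separating."""
    scores = {
        name: separation([row[i] for row in rows], labels)
        for i, name in enumerate(names)
    }
    return sorted(names, key=lambda n: scores[n], reverse=True)
```

Calling `feature_names(MEASURES, FILE_PAIRS)` yields one feature per combination of measure and file pair, and `rank_features` then orders those features by how well each one separates the two classes in a labelled sample. A real study would more likely rely on an established attribute-selection and classifier toolkit than on this toy score.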
