Source code identifier splitting using Yahoo image and web search engine

Source-code or program identifiers are sequence of characters consisting of one or more tokens representing domain concepts. Splitting or tokenizing identifiers that does not contain explicit markers or clues (such as came-casing or using underscore as a token separator) is a technically challenging problem. In this paper, we present a technique for automatic tokenization and splitting of source-code identifiers using Yahoo web search and image search similarity distance. We present an algorithm that decides the split position based on various factors such as conceptual correlations and semantic relatedness between the left and right splits strings of a given identifier, popularity of the token and its length. The number of hits or search results returned by the web and image search engine serves as a proxy to measures such as term popularity and correlation. We perform a series of experiments to validate the proposed approach and present performance results.

[1]  Dawn J Lawrie,et al.  AN EMPIRICAL COMPARISON OF TECHNIQUES FOR EXTRACTING CONCEPT ABBREVIATIONS FROM IDENTIFIERS , 2006 .

[2]  David W. Binkley,et al.  Expanding identifiers to normalize source code vocabulary , 2011, 2011 27th IEEE International Conference on Software Maintenance (ICSM).

[3]  David W. Binkley,et al.  Extracting Meaning from Abbreviated Identifiers , 2007, Seventh IEEE International Working Conference on Source Code Analysis and Manipulation (SCAM 2007).

[4]  Nenghai Yu,et al.  Flickr distance , 2008, ACM Multimedia.

[5]  Gang Lu,et al.  A new semantic similarity measuring method based on web search engines , 2010 .

[6]  Nenghai Yu,et al.  Flickr Distance: A Relationship Measure for Visual Concepts , 2012, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[7]  Yijun Yu,et al.  Improving the Tokenisation of Identifier Names , 2011, ECOOP.

[8]  Yann-Gaël Guéhéneuc,et al.  TIDIER: an identifier splitting approach using speech recognition techniques , 2013, J. Softw. Evol. Process..

[9]  Danushka Bollegala,et al.  Mining for personal name aliases on the web , 2008, WWW.

[10]  Paul M. B. Vitányi,et al.  The Google Similarity Distance , 2004, IEEE Transactions on Knowledge and Data Engineering.

[11]  Emily Hill,et al.  Mining source code to automatically split identifiers for software analysis , 2009, 2009 6th IEEE International Working Conference on Mining Software Repositories.

[12]  Preslav Nakov,et al.  A study of using search engine page hits as a proxy for n-gram frequencies , 2005 .

[13]  David W. Binkley,et al.  Normalizing Source Code Vocabulary , 2010, 2010 17th Working Conference on Reverse Engineering.

[14]  Emily Hill,et al.  AMAP: automatically mining abbreviation expansions in programs to enhance software maintenance tools , 2008, MSR '08.

[15]  John A. Keane,et al.  Using Web-Search Results to Measure Word-Group Similarity , 2008, COLING.

[16]  Yann-Gaël Guéhéneuc,et al.  Recognizing Words from Source Code Identifiers Using Speech Recognition Techniques , 2010, 2010 14th European Conference on Software Maintenance and Reengineering.