Improving the Tokenisation of Identifier Names

Identifier names are the main vehicle for semantic information during program comprehension. Identifier names are tokenised into their semantic constituents by tools supporting program comprehension tasks, including concept location and requirements traceability. We present an approach to the automated tokenisation of identifier names that improves on existing techniques in two ways. First, it improves tokenisation accuracy for identifier names of a single case and those containing digits. Second, performance gains over existing techniques are achieved using smaller oracles. Accuracy was evaluated by comparing the output of our algorithm to manual tokenisations of 28,000 identifier names drawn from 60 open source Java projects totalling 16.5 MSLOC. We also undertook a study of the typographical features of identifier names (single case, use of digits, etc.) per object-oriented construct (class names, method names, etc.), thus providing an insight into naming conventions in industrial-scale object-oriented code. Our tokenisation tool and datasets are publicly available.

[1]  Scott W. Ambler,et al.  The Elements of Java Style , 2000 .

[2]  Jeremy Singer,et al.  Exploiting the Correspondence between Micro Patterns and Class Names , 2008, 2008 Eighth IEEE International Working Conference on Source Code Analysis and Manipulation.

[3]  Einar W. Høst,et al.  Debugging Method Names , 2009, ECOOP.

[4]  Jan Jürjens,et al.  Extracting Domain Ontologies from Domain Specific APIs , 2008, 2008 12th European Conference on Software Maintenance and Reengineering.

[5]  Yijun Yu,et al.  Exploring the Influence of Identifier Names on Code Quality: An Empirical Study , 2010, 2010 14th European Conference on Software Maintenance and Reengineering.

[6]  Vladimir I. Levenshtein,et al.  Binary codes capable of correcting deletions, insertions, and reversals , 1965 .

[7]  Yann-Gaël Guéhéneuc,et al.  Mining the Lexicon Used by Programmers during Sofware Evolution , 2007, 2007 IEEE International Conference on Software Maintenance.

[8]  Giuliano Antoniol,et al.  Recovering Traceability Links between Code and Documentation , 2002, IEEE Trans. Software Eng..

[9]  Yann-Gaël Guéhéneuc,et al.  Recognizing Words from Source Code Identifiers Using Speech Recognition Techniques , 2010, 2010 14th European Conference on Software Maintenance and Reengineering.

[10]  Andrian Marcus,et al.  Static techniques for concept location in object-oriented code , 2005, 13th International Workshop on Program Comprehension (IWPC'05).

[11]  Dawn J Lawrie,et al.  AN EMPIRICAL COMPARISON OF TECHNIQUES FOR EXTRACTING CONCEPT ABBREVIATIONS FROM IDENTIFIERS , 2006 .

[12]  Scott W. Ambler,et al.  The Elements of Java™ Style: Index , 2000 .

[13]  Paolo Tonella,et al.  Nomen est omen: analyzing the language of function identifiers , 1999, Sixth Working Conference on Reverse Engineering (Cat. No.PR00303).

[14]  Ewan D. Tempero,et al.  Indexing the Java API Using Source Code , 2008, 19th Australian Conference on Software Engineering (aswec 2008).

[15]  M. G. Rekoff,et al.  On reverse engineering , 1985, IEEE Transactions on Systems, Man, and Cybernetics.

[16]  David W. Binkley,et al.  Quantifying identifier quality: an analysis of trends , 2006, Empirical Software Engineering.

[17]  Sophia Drossopoulou ECOOP 2009 - Object-Oriented Programming, 23rd European Conference, Genoa, Italy, July 6-10, 2009. Proceedings , 2009, ECOOP.

[18]  Einar W. Høst,et al.  The Java Programmer's Phrase Book , 2009, SLE.

[19]  Emily Hill,et al.  Mining source code to automatically split identifiers for software analysis , 2009, 2009 6th IEEE International Working Conference on Mining Software Repositories.

[20]  Paolo Tonella,et al.  Natural Language Parsing of Program Element Names for Concept Extraction , 2010, 2010 IEEE 18th International Conference on Program Comprehension.

[21]  Stéphane Ducasse,et al.  Semantic clustering: Identifying topics in source code , 2007, Inf. Softw. Technol..