Automatic identifier inconsistency detection using code dictionary

Inconsistent identifiers make source code difficult for developers to understand. Large software systems written by several developers are particularly vulnerable to identifier inconsistency. Unfortunately, inconsistent identifiers that are already used in source code are not easy to detect. Although several techniques have been proposed to address this issue, many of them raise false alarms because they do not accept domain words and idiom identifiers that are widely used in programming practice. This paper proposes an approach to detecting inconsistent identifiers based on a custom code dictionary. It first automatically builds a Code Dictionary from the existing API documents of popular Java projects by using a Natural Language Processing (NLP) parser. This dictionary records domain words with their dominant part-of-speech (POS) and idiom identifiers. This set of domain words and idioms improves detection accuracy by reducing false alarms. The approach then takes a target program and detects its inconsistent identifiers by leveraging the Code Dictionary. We provide CodeAmigo, a GUI-based tool that supports our approach. We evaluated our approach on seven Java-based open-source and proprietary projects. The results show that the approach can detect inconsistent identifiers with 85.4% precision and 83.59% recall. In addition, we interviewed developers who used our approach, and the interviews confirmed that inconsistent identifiers frequently and inevitably occur in most software projects. The interviewees also stated that our approach can help detect inconsistent identifiers that would have been missed through manual detection.
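To make the role of the dictionary concrete, the following is a minimal sketch of one check that such a dictionary could support: comparing the POS with which a word is used in an identifier against the word's dominant POS, while skipping idiom identifiers. It is not CodeAmigo's implementation. The dictionary contents, the idiom set, the function names, and the assumption that POS tags arrive from an external tagger are all hypothetical and chosen only for illustration.

```python
import re

# Hypothetical miniature Code Dictionary: word -> dominant part of speech.
# The real dictionary is built from API documents; these entries are made up.
DOMINANT_POS = {
    "get": "verb",
    "set": "verb",
    "sort": "verb",
    "name": "noun",
    "index": "noun",
}

# Hypothetical idiom identifiers that are accepted without further checks.
IDIOMS = {"i", "j", "e", "tmp", "args"}


def split_identifier(identifier):
    """Split a camelCase or snake_case identifier into lowercase words."""
    parts = re.findall(r"[A-Z]+(?![a-z])|[A-Z]?[a-z]+|\d+", identifier)
    return [part.lower() for part in parts]


def find_pos_inconsistencies(tagged_identifiers):
    """Report words whose observed POS differs from the dominant POS.

    tagged_identifiers maps an identifier to a list of POS tags, one per
    word, assumed to come from an external POS tagger (not shown here).
    Words missing from the dictionary are skipped rather than flagged.
    """
    findings = []
    for identifier, tags in tagged_identifiers.items():
        if identifier in IDIOMS:
            continue  # idioms are never reported, which avoids false alarms
        for word, tag in zip(split_identifier(identifier), tags):
            dominant = DOMINANT_POS.get(word)
            if dominant is not None and dominant != tag:
                findings.append((identifier, word, tag, dominant))
    return findings


if __name__ == "__main__":
    sample = {
        "getName": ["verb", "noun"],
        "nameSort": ["noun", "noun"],   # "sort" used as a noun
        "i": ["noun"],                  # idiom, skipped
        "indexUser": ["verb", "noun"],  # "index" used as a verb
    }
    for finding in find_pos_inconsistencies(sample):
        print(finding)
```

In this sketch, unknown words are deliberately ignored so that only words with a recorded dominant POS can trigger a report; a stricter policy is equally possible and would trade recall against false alarms.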
