REPENT: Analyzing the Nature of Identifier Renamings

Source code lexicon plays a paramount role in software quality: poor lexicon can lead to poor comprehensibility and even increase software fault-proneness. For this reason, renaming a program entity, i.e., altering the entity's identifier, is an important activity during software evolution. Developers rename when they feel that the name of an entity is no longer consistent with its functionality, or when the name may be misleading. A survey that we performed with 71 developers suggests that 39 percent rename anywhere from a few times per week to almost every day, and that 92 percent of the participants consider that renaming is not straightforward. However, despite the cost associated with renaming, renamings are seldom if ever documented: for example, fewer than 1 percent of the renamings in the five programs that we studied are documented. This explains why participants largely agree on the usefulness of automatically documenting renamings. In this paper we propose REnaming Program ENTities (REPENT), an approach to automatically document (that is, detect and classify) identifier renamings in source code. REPENT detects renamings based on a combination of source code differencing and data flow analyses. Using a set of natural language tools, REPENT classifies renamings along the dimensions of a taxonomy that we defined. Using the documented renamings, developers will be able to, for example, look up renamed methods that are part of the public API (as these impact client applications), or look for inconsistencies between the name and the implementation of an entity that underwent a high-risk renaming (e.g., towards the opposite meaning). We evaluate the accuracy and completeness of REPENT on the evolution history of five open-source Java programs. The study indicates a precision of 88 percent and a recall of 92 percent. In addition, we report an exploratory study that investigates and discusses how identifiers are renamed in the five programs, according to our taxonomy.
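To make the idea of detecting renamings through source code differencing more concrete, the following is a minimal sketch and not REPENT's actual algorithm. It assumes a purely line-based diff (Python's `difflib`), treats every alphanumeric token as an identifier (including language keywords), and performs no data flow analysis. The function name `candidate_renamings` and the heuristic "a replaced line in which exactly one identifier pair differs" are illustrative assumptions.

```python
import difflib
import re

# Assumption: any token of this shape counts as an identifier (keywords included).
IDENT = re.compile(r"[A-Za-z_][A-Za-z_0-9]*")


def candidate_renamings(old_src: str, new_src: str) -> set:
    """Return (old_name, new_name) pairs suggested by a line-level diff.

    Illustrative heuristic only: a replaced line is a candidate when the
    old and new versions are identical except for one identifier pair.
    """
    old_lines = old_src.splitlines()
    new_lines = new_src.splitlines()
    matcher = difflib.SequenceMatcher(a=old_lines, b=new_lines, autojunk=False)

    candidates = set()
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        # Consider only replaced hunks whose old and new sides align line by line.
        if tag != "replace" or (i2 - i1) != (j2 - j1):
            continue
        for old_line, new_line in zip(old_lines[i1:i2], new_lines[j1:j2]):
            old_ids = IDENT.findall(old_line)
            new_ids = IDENT.findall(new_line)
            if len(old_ids) != len(new_ids):
                continue
            # Everything that is not an identifier must be unchanged.
            if IDENT.sub("", old_line) != IDENT.sub("", new_line):
                continue
            diffs = {(o, n) for o, n in zip(old_ids, new_ids) if o != n}
            if len(diffs) == 1:
                candidates |= diffs
    return candidates


if __name__ == "__main__":
    before = "int cnt = 0;\ncnt = cnt + 1;\nreturn cnt;"
    after = "int counter = 0;\ncounter = counter + 1;\nreturn counter;"
    print(candidate_renamings(before, after))
```

A real detector would have to go further, for instance by confirming that the old and new names refer to the same entity, which is where the data flow analysis mentioned above comes in.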
