TRIS: A Fast and Accurate Identifiers Splitting and Expansion Algorithm

Understanding source code identifiers, by identifying words composing them, is a necessary step for many program comprehension, reverse engineering, or redocumentation tasks. To this aim, researchers have proposed several identifier splitting and expansion approaches such as Samurai, TIDIER and more recently GenTest. The ultimate goal of such approaches is to help disambiguating conceptual information encoded in compound (or abbreviated) identifiers. This paper presents TRIS, TRee-based Identifier Splitter, a two-phases approach to split and expand program identifiers. First, TRIS pre-compiles transformed dictionary words into a tree representation, associating a cost to each transformation. In a second phase, it maps the identifier splitting/expansion problem into a minimization problem, i.e., the search of the shortest path (optimal split/expansion) in a weighted graph. We apply TRIS to a sample of 974 identifiers extracted from JHotDraw, 3,085 from Lynx, and to a sample of 489 identifiers extracted from 340 C programs. Also, we compare TRIS with GenTest on a set of 2,663 mixed Java, C and C++ identifiers. We report evidence that TRIS split (and expansion) is more accurate than state-of-the-art approaches and that it is also efficient in terms of computation time.

[1]  Robert D. Macredie,et al.  The effects of comments and identifier names on program comprehensibility: an experimental investigation , 1996, J. Program. Lang..

[2]  Rudolf Ferenc,et al.  Using the Conceptual Cohesion of Classes for Fault Prediction in Object-Oriented Systems , 2008, IEEE Transactions on Software Engineering.

[3]  David J. Sheskin,et al.  Handbook of Parametric and Nonparametric Statistical Procedures , 1997 .

[4]  Emily Hill,et al.  Mining source code to automatically split identifiers for software analysis , 2009, 2009 6th IEEE International Working Conference on Mining Software Repositories.

[5]  Yann-Gaël Guéhéneuc,et al.  TIDIER: an identifier splitting approach using speech recognition techniques , 2013, J. Softw. Evol. Process..

[6]  Markus Pizka,et al.  Concise and consistent naming , 2005, 13th International Workshop on Program Comprehension (IWPC'05).

[7]  R. Grissom,et al.  Effect sizes for research: A broad practical approach. , 2005 .

[8]  David W. Binkley,et al.  Effective identifier names for comprehension and memory , 2007, Innovations in Systems and Software Engineering.

[9]  Yann-Gaël Guéhéneuc,et al.  Recognizing Words from Source Code Identifiers Using Speech Recognition Techniques , 2010, 2010 14th European Conference on Software Maintenance and Reengineering.

[10]  K. Goulden,et al.  Effect Sizes for Research: A Broad Practical Approach , 2006 .

[11]  Paolo Tonella,et al.  Nomen est omen: analyzing the language of function identifiers , 1999, Sixth Working Conference on Reverse Engineering (Cat. No.PR00303).

[12]  S. Holm A Simple Sequentially Rejective Multiple Test Procedure , 1979 .

[13]  Ahmed E. Hassan,et al.  Examining the evolution of code comments in PostgreSQL , 2006, MSR '06.

[14]  David W. Binkley,et al.  What’s in a Name? A Study of Identifiers , 2006, 14th IEEE International Conference on Program Comprehension (ICPC'06).

[15]  Markus Pizka,et al.  Concise and Consistent Naming , 2005, IWPC.

[16]  David W. Binkley,et al.  Expanding identifiers to normalize source code vocabulary , 2011, 2011 27th IEEE International Conference on Software Maintenance (ICSM).

[17]  Hermann Ney,et al.  The use of a one-stage dynamic programming algorithm for connected word recognition , 1984 .

[18]  David W. Binkley,et al.  Normalizing Source Code Vocabulary , 2010, 2010 17th Working Conference on Reverse Engineering.