Normalizing Source Code Vocabulary

Tools based on Information Retrieval (IR) complement traditional static and dynamic analysis tools by exploiting the natural language found within a program's text. Tools incorporating IR have tackled problems, such as feature location, that previously required considerable human effort. However, to reap the full benefit of IR-based techniques, the language used across all software artifacts (e.g., requirements and design documents, test plans, and the source code) must be consistent. Vocabulary normalization aligns the vocabulary found in source code with that found in other software artifacts. Normalization both splits an identifier into its constituent parts and expands each part into a full dictionary word to match the vocabulary in other artifacts. An algorithm for normalization is presented. Its current implementation incorporates a greatly improved splitter that exploits a collection of resources including several dictionaries, frequency distributions derived from a corpus of programs, and co-occurrence data. An empirical study of this new splitter, GenTest, on almost 8000 identifiers finds that it correctly splits 82%, outperforming the current state of the art. A preliminary experiment with the normalization algorithm finds that it improves the scores the FLAT feature locator assigns to relevant code from 0.60 to 0.95 on a scale from 0 to 1.
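
The two steps of normalization, splitting and expansion, can be illustrated with a minimal Python sketch. This is not the GenTest splitter or the normalization algorithm described above: the tiny word list, the abbreviation table, and the function names (split_hard, split_soft, normalize) are all hypothetical, and the fewest-words heuristic merely stands in for the dictionaries, corpus frequencies, and co-occurrence data that a real implementation would draw on.

```python
import re

# Hypothetical toy resources, assumed only for this illustration.
DICTIONARY = {"string", "length", "count", "file", "name", "pointer", "buffer"}
EXPANSIONS = {"str": "string", "len": "length", "cnt": "count",
              "ptr": "pointer", "buf": "buffer"}
KNOWN = DICTIONARY | set(EXPANSIONS)


def split_hard(identifier):
    """Split at explicit markers: underscores, digits, and camelCase."""
    parts = re.findall(r"[A-Z]+(?![a-z])|[A-Z]?[a-z]+", identifier)
    return [p.lower() for p in parts]


def split_soft(token):
    """Split a same-case token into the fewest known words.

    Returns the token unchanged (as a one-element list) when no
    segmentation into known words exists.
    """
    best = {0: []}  # best[i] = fewest-word segmentation of token[:i]
    for end in range(1, len(token) + 1):
        for start in range(end):
            piece = token[start:end]
            if start in best and piece in KNOWN:
                candidate = best[start] + [piece]
                if end not in best or len(candidate) < len(best[end]):
                    best[end] = candidate
    return best.get(len(token), [token])


def normalize(identifier):
    """Split an identifier, then expand each part to a dictionary word."""
    words = []
    for token in split_hard(identifier):
        for part in split_soft(token):
            words.append(EXPANSIONS.get(part, part))
    return words


print(normalize("strlen"))       # ['string', 'length']
print(normalize("fileNameBuf"))  # ['file', 'name', 'buffer']
```

In this sketch an identifier such as strlen, which has no underscore or case change to mark the word boundary, is the hard case: it can only be split by consulting the word lists, which is why the quality of those resources matters so much to a splitter.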
