Extracting Meaning from Abbreviated Identifiers

Informative identifiers are made up of full (natural language) words and (meaningful) abbreviations. Readers of programs typically have little trouble understanding the purpose of identifiers composed of full words. In addition, those familiar with the code can usually determine the meaning of the abbreviations used in its identifiers. However, to a reader faced with unfamiliar code, abbreviations often carry little useful information. Furthermore, tools that focus on the natural language used in the code perform poorly in the presence of abbreviations. One approach to providing meaning to programmers and tools is to translate (expand) abbreviations into full words. This paper presents a methodology for expanding identifiers and evaluates the process on a code base of just over 35 million lines of code. For example, using phrase extraction, fs_exists is expanded to file_status_exists, illustrating how the expansion process can facilitate comprehension. On average, 16 percent of the identifiers in a program are expanded. Finally, as an example application, the approach is used to improve the syntactic identification of violations of Deissenboeck and Pizka's rules for concise and consistent identifier construction.
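To make the fs_exists example concrete, the following is a minimal Python sketch of acronym expansion driven by phrases found in nearby text. It is not the methodology evaluated in the paper: the function names, the restriction to two-word phrases, and the use of a single comment string as the phrase source are all assumptions made for illustration.

```python
import re


def split_identifier(identifier):
    """Split an identifier on underscores and lower-to-upper camelCase boundaries."""
    parts = re.split(r"_+|(?<=[a-z0-9])(?=[A-Z])", identifier)
    return [part.lower() for part in parts if part]


def phrases_from_text(text):
    """Map two-letter acronyms to the first adjacent word pair that spells them.

    Assumption: only two-word phrases are considered, and the first match wins.
    """
    words = re.findall(r"[a-z]+", text.lower())
    table = {}
    for first, second in zip(words, words[1:]):
        table.setdefault(first[0] + second[0], [first, second])
    return table


def expand_identifier(identifier, nearby_text):
    """Expand acronym-like tokens using phrases mined from nearby_text.

    Tokens that already appear as words in the text are kept unchanged,
    as are tokens for which no matching phrase is found.
    """
    known_words = set(re.findall(r"[a-z]+", nearby_text.lower()))
    phrases = phrases_from_text(nearby_text)
    expanded = []
    for token in split_identifier(identifier):
        if token in known_words or token not in phrases:
            expanded.append(token)
        else:
            expanded.extend(phrases[token])
    return "_".join(expanded)


comment = "Return true if the file status record exists."
print(expand_identifier("fs_exists", comment))
```

With the comment shown, the token fs matches the adjacent words "file status", while exists is already a full word and is left alone. A realistic expander would also need to rank competing candidate phrases and handle longer abbreviations, which this sketch deliberately leaves out.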
