Automatic Extraction of a WordNet-Like Identifier Network from Software

A large part of the time allocated to software maintenance is spent on program comprehension. Many approaches that use the program structure or the external documentation have been created to assist program comprehension. However, the identifiers of the program are an important source of information that is still not widely used for this purpose. In this article, we propose an approach, based on Natural Language Processing techniques, that automatically extracts concepts from software identifiers and organizes them in a WordNet-like structure that we call lexical views. These lexical views give useful insight into the overall software architecture and can be used to improve the results of many software engineering tasks. The proposal is evaluated on a corpus of 24 open-source programs.
