Automatically mining software-based, semantically-similar words from comment-code mappings

Many software development and maintenance tools involve matching natural language words across different software artifacts (e.g., traceability) or between user-submitted queries and software artifacts (e.g., code search). Because the queries and the various artifacts were likely created by different people, the effectiveness of these tools is often improved by expanding queries and adding related words to textual artifact representations. Synonyms, along with other word relations that indicate semantic similarity, are particularly useful for overcoming the mismatch in vocabularies. However, experience shows that many words are semantically similar in computer science contexts but not in typical natural language documents. In this paper, we present an automatic technique for mining words that are semantically similar specifically in the software context. We leverage the role of leading comments for methods and the conventions programmers follow in writing them. Our evaluation shows that a high proportion of the mined comment-code word mappings that do not already occur in WordNet are indeed viewed as semantically similar word pairs in computer science.
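The abstract does not spell out the mining procedure, so the following is only a minimal sketch of the general idea of a comment-code word mapping, not the authors' implementation. It assumes (as an illustration) that a method's leading comment begins with a verb describing the method's action and that the method name also begins with a verb; pairing the two when they differ yields candidate similar words. The helper names, the crude third-person "s" stripping, and the sample data are all hypothetical, and the WordNet filtering step mentioned above is omitted.

```python
import re
from collections import Counter


def split_identifier(name):
    """Split a camelCase or snake_case identifier into lowercase words."""
    parts = re.findall(r"[A-Z]+(?![a-z])|[A-Z]?[a-z]+|\d+", name)
    return [p.lower() for p in parts]


def leading_comment_verb(comment):
    """Return the first word of a comment, crudely normalized.

    Assumption: the comment starts with a verb such as "Finds" or "Find".
    The trailing-"s" stripping is a rough stand-in for real lemmatization.
    """
    words = re.findall(r"[A-Za-z]+", comment)
    if not words:
        return None
    word = words[0].lower()
    if len(word) > 3 and word.endswith("s") and not word.endswith("ss"):
        word = word[:-1]
    return word


def mine_pairs(methods):
    """Count (comment verb, method-name verb) pairs that differ.

    `methods` is an iterable of (leading_comment, method_name) tuples.
    """
    counts = Counter()
    for comment, name in methods:
        comment_verb = leading_comment_verb(comment)
        name_words = split_identifier(name)
        if comment_verb is None or not name_words:
            continue
        name_verb = name_words[0]
        if comment_verb != name_verb:
            counts[(comment_verb, name_verb)] += 1
    return counts


# Hypothetical sample input, invented for illustration only.
sample = [
    ("Finds the user with the given id.", "searchUser"),
    ("Removes the entry from the cache.", "deleteEntry"),
    ("Find all matching files.", "searchFiles"),
    ("Gets the name.", "getName"),
]

pairs = mine_pairs(sample)
for pair, count in pairs.most_common():
    print(pair, count)
```

Pairs that recur across many methods would be stronger candidates than one-off pairs; a fuller version would use a part-of-speech tagger and a lemmatizer rather than taking the first word of each side.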
