Inferring semantically related words from software context

Code search is an integral part of software development and program comprehension. The difficulty of code search lies in the inability to guess the exact words used in the code. Therefore, it is crucial for keyword-based code search to expand queries with semantically related words, e.g., synonyms and abbreviations, to increase the search effectiveness. However, it is limited to rely on resources such as English dictionaries and WordNet to obtain semantically related words in software, because many words that are semantically related in software are not semantically related in English. This paper proposes a simple and general technique to automatically infer semantically related words in software by leveraging the context of words in comments and code. We achieve a reasonable accuracy in seven large and popular code bases written in C and Java. Our further evaluation against the state of art shows that our technique can achieve a higher precision and recall.

[1]  David W. Binkley,et al.  Expanding identifiers to normalize source code vocabulary , 2011, 2011 27th IEEE International Conference on Software Maintenance (ICSM).

[2]  Patrick Pantel,et al.  Discovery of inference rules for question-answering , 2001, Natural Language Engineering.

[3]  Denys Poshyvanyk,et al.  Source Code Exploration with Google , 2006, 2006 22nd IEEE International Conference on Software Maintenance.

[4]  Emily Hill,et al.  AMAP: automatically mining abbreviation expansions in programs to enhance software maintenance tools , 2008, MSR '08.

[5]  Dekang Lin,et al.  An Information-Theoretic Definition of Similarity , 1998, ICML.

[6]  Tao Xie,et al.  An approach to detecting duplicate bug reports using natural language and execution information , 2008, 2008 ACM/IEEE 30th International Conference on Software Engineering.

[7]  Martha Palmer,et al.  Verb Semantics and Lexical Selection , 1994, ACL.

[8]  Emily Hill,et al.  Identifying Word Relations in Software: A Comparative Study of Semantic Similarity Tools , 2008, 2008 16th IEEE International Conference on Program Comprehension.

[9]  Lori Pollock,et al.  Integrating natural language and program structure information to improve software search and exploration , 2010 .

[10]  Siau-Cheng Khoo,et al.  A discriminative model approach for accurate duplicate bug report retrieval , 2010, 2010 ACM/IEEE 32nd International Conference on Software Engineering.

[11]  Lori Pollock,et al.  Investigating how to effectively combine static concern location techniques , 2011, SUITE '11.

[12]  David W. Binkley,et al.  Improving identifier informativeness using part of speech information , 2011, MSR '11.

[13]  Yuanyuan Zhou,et al.  aComment: mining annotations from comments and code to detect interrupt related concurrency bugs , 2011, 2011 33rd International Conference on Software Engineering (ICSE).

[14]  Lori L. Pollock,et al.  Towards supporting on-demand virtual remodularization using program graphs , 2006, AOSD.

[15]  Tao Xie,et al.  Identifying security bug reports via text mining: An industrial case study , 2010, 2010 7th IEEE Working Conference on Mining Software Repositories (MSR 2010).

[16]  Oscar Nierstrasz,et al.  Assigning bug reports using a vocabulary-based expertise model of developers , 2009, 2009 6th IEEE International Working Conference on Mining Software Repositories.

[17]  Tao Xie,et al.  Inferring Resource Specifications from Natural Language API Documentation , 2009, 2009 IEEE/ACM International Conference on Automated Software Engineering.

[18]  Seung-won Hwang,et al.  CosTriage: A Cost-Aware Triage Algorithm for Bug Reporting Systems , 2011, AAAI.

[19]  Per Runeson,et al.  Detection of Duplicate Defect Reports Using Natural Language Processing , 2007, 29th International Conference on Software Engineering (ICSE'07).

[20]  Peter Membrey,et al.  The Linux Kernel , 2009 .

[21]  Emily Hill,et al.  Using natural language program analysis to locate and understand action-oriented concerns , 2007, AOSD.

[22]  David W. Binkley,et al.  Normalizing Source Code Vocabulary , 2010, 2010 17th Working Conference on Reverse Engineering.

[23]  Philip Resnik,et al.  Using Information Content to Evaluate Semantic Similarity in a Taxonomy , 1995, IJCAI.

[24]  Emily Hill,et al.  Improving source code search with natural language phrasal representations of method signatures , 2011, 2011 26th IEEE/ACM International Conference on Automated Software Engineering (ASE 2011).

[25]  Jeannette M. Wing,et al.  Signature matching: a tool for using software libraries , 1995, TSEM.

[26]  Denys Poshyvanyk,et al.  JIRiSS - an Eclipse plug-in for Source Code Exploration , 2006, 14th IEEE International Conference on Program Comprehension (ICPC'06).

[27]  Paolo Tonella,et al.  Natural Language Parsing of Program Element Names for Concept Extraction , 2010, 2010 IEEE 18th International Conference on Program Comprehension.

[28]  Kris De Volder,et al.  Navigating and querying code without getting lost , 2003, AOSD '03.

[29]  R. Holmes,et al.  Using structural context to recommend source code examples , 2005, Proceedings. 27th International Conference on Software Engineering, 2005. ICSE 2005..

[30]  Yuanyuan Zhou,et al.  /*icomment: bugs or bad comments?*/ , 2007, SOSP.

[31]  Gail C. Murphy,et al.  Who should fix this bug? , 2006, ICSE.

[32]  Amy J. Ko,et al.  Eliciting design requirements for maintenance-oriented IDEs: a detailed study of corrective and perfective maintenance tasks , 2005, Proceedings. 27th International Conference on Software Engineering, 2005. ICSE 2005..

[33]  Ted Pedersen,et al.  Extended Gloss Overlaps as a Measure of Semantic Relatedness , 2003, IJCAI.

[34]  Yuanyuan Zhou,et al.  Have things changed now?: an empirical study of bug characteristics in modern open source software , 2006, ASID '06.

[35]  Susan T. Dumais,et al.  The vocabulary problem in human-system communication , 1987, CACM.

[36]  Premkumar T. Devanbu,et al.  Recommending random walks , 2007, ESEC-FSE '07.

[37]  David W. Conrath,et al.  Semantic Similarity Based on Corpus Statistics and Lexical Taxonomy , 1997, ROCLING/IJCLCLP.

[38]  Emily Hill,et al.  Automatically capturing source code context of NL-queries for software maintenance and reuse , 2009, 2009 IEEE 31st International Conference on Software Engineering.