Navigating source code with words

The hierarchical organization of information has proven beneficial in learning, in part because it maps well onto human memory. Exploiting this organizational strategy may help engineers cope with large software systems. In fact, such a strategy is already present in source code, where it is manifested in the class hierarchies of object-oriented programs. However, an engineer who must fix a bug, or otherwise locate the implementation of a particular feature, is less interested in the syntactic organization of the code than in its conceptual organization. A conceptual hierarchy would therefore be of clear benefit. Fortunately, such a view can be extracted automatically from the source code. The hierarchy-generating tool HierIT performs this task using an information-theoretic approach to identify “content-bearing” words and associate them hierarchically. The resulting hierarchy helps an engineer better understand the concepts contained in a software system. To study the value that such hierarchies bring, an experiment was conducted that evaluates them both quantitatively and qualitatively. The quantitative evaluation first considers the Expected Mutual Information Measure (EMIM) between the set of topic words and the natural language extracted from the source code. It then considers the Best Case Tree Walk (BCTW), which captures how “expensive” it is to find interesting documents. Finally, the hierarchies are considered qualitatively by investigating their perceived usefulness in a case study involving three engineers.
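
To make the EMIM criterion concrete, the following is a minimal sketch of one common way to compute it, not the HierIT implementation. It assumes that each document is a list of words extracted from the source code, that P(t) and P(v) are estimated as document frequencies, that P(t, v) is the fraction of documents containing both words, and that self-pairs are skipped. The function name `emim` and the toy documents are hypothetical.

```python
import math
from collections import Counter


def emim(documents, topic_words):
    """Sum P(t, v) * log(P(t, v) / (P(t) * P(v))) over topic words t
    and vocabulary words v, using document-level co-occurrence.

    Assumption: probabilities are estimated from document frequencies.
    """
    n = len(documents)
    doc_sets = [set(doc) for doc in documents]
    # Document frequency of every word in the vocabulary.
    df = Counter(word for words in doc_sets for word in words)

    total = 0.0
    for t in topic_words:
        if t not in df:
            continue  # a topic word that never occurs contributes nothing
        for v in df:
            if v == t:
                continue  # assumption: self-pairs are skipped
            both = sum(1 for words in doc_sets if t in words and v in words)
            if both == 0:
                continue  # zero joint probability contributes nothing
            p_joint = both / n
            total += p_joint * math.log(p_joint / ((df[t] / n) * (df[v] / n)))
    return total


# Hypothetical documents: words taken from four source files.
docs = [
    ["open", "file", "read"],
    ["open", "file", "write"],
    ["draw", "shape"],
    ["draw", "line"],
]
print(emim(docs, ["file"]))
print(emim(docs, ["shape"]))
```

Under these assumptions, a topic word that co-occurs consistently with much of the vocabulary ("file" in the toy data) scores higher than one that co-occurs with few words ("shape"), which matches the intuition that good topic words carry more information about the surrounding text.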
