Concept location using formal concept analysis and information retrieval

The article addresses the problem of concept location in source code by proposing an approach that combines Formal Concept Analysis and Information Retrieval. In the proposed approach, Latent Semantic Indexing, an Information Retrieval technique, is used to map textual descriptions of software features or bug reports to relevant parts of the source code, which are presented as a ranked list of source code elements. Given the ranked list, the approach selects the most relevant attributes from the best-ranked documents, clusters the results, and presents them as a concept lattice generated using Formal Concept Analysis. The approach is evaluated through a large case study on concept location in the source code of six open-source systems, using several hundred features and bugs. The empirical study analyzes various configurations of the generated concept lattices, and the results indicate that the approach is effective in organizing the different concepts, and the relationships among them, that are present in the subset of the search results. The proposed concept location method has been shown to outperform a standalone Information Retrieval based concept location technique by reducing the number of irrelevant search results across all the systems and lattice configurations evaluated, potentially reducing programmers' effort during software maintenance tasks that involve concept location.
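
The lattice-building step described above can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes the Latent Semantic Indexing ranking and the attribute selection have already been done, and it starts from a small hand-made context that maps each top-ranked source code element to its selected terms. The function name `formal_concepts` and the method and term names in the example are hypothetical.

```python
def formal_concepts(context):
    """Return all formal concepts (extent, intent) of a binary context.

    context maps each object (here, a source code element) to the set of
    attributes (here, terms) selected for it. This is an illustrative
    sketch, not the procedure used in the article.
    """
    all_attrs = frozenset().union(*context.values())

    # Every intent is an intersection of object attribute sets; the full
    # attribute set is the intent of the bottom concept.
    intents = {all_attrs}
    for attrs in context.values():
        row = frozenset(attrs)
        intents |= {row & intent for intent in intents}

    # The extent of an intent is the set of objects having all its attributes.
    concepts = []
    for intent in intents:
        extent = frozenset(o for o, attrs in context.items() if intent <= attrs)
        concepts.append((extent, intent))

    # Largest extents first, so the top of the lattice comes first.
    return sorted(concepts, key=lambda c: (-len(c[0]), sorted(c[1])))


# Hypothetical top-ranked methods with their selected terms.
context = {
    "openFile": {"file", "open"},
    "saveFile": {"file", "save"},
    "printDoc": {"print"},
}

for extent, intent in formal_concepts(context):
    print(sorted(extent), sorted(intent))
```

In this toy context, the concept with intent {"file"} groups openFile and saveFile, which is the kind of grouping a developer could use to skip the unrelated printDoc result instead of reading the ranked list linearly.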
