Inferring Links between Concerns and Methods with Multi-abstraction Vector Space Model

Concern localization refers to the process of locating code units that match a particular textual description. It takes as input textual documents such as bug reports and feature requests and outputs a list of candidate code units that are relevant to the bug reports or feature requests. Many information retrieval (IR) based concern localization techniques have been proposed in the literature. These techniques typically represent code units and textual descriptions as a bag of tokens at one level of abstraction, e.g., each token is a word, or each token is a topic. In this work, we propose a multi-abstraction concern localization technique named MULAB. MULAB represents a code unit and a textual description at multiple abstraction levels. Similarity of a textual description and a code unit is now made by considering all these abstraction levels. We combine a vector space model and multiple topic models to compute the similarity and apply a genetic algorithm to infer semi-optimal topic model configurations. We have evaluated our solution on 136 concerns from 8 open source Java software systems. The experimental results show that MULAB outperforms the state-of-art baseline PR, which is proposed by Scanniello et al. in terms of effectiveness and rank.

[1]  F. Wilcoxon Individual Comparisons by Ranking Methods , 1945 .

[2]  David E. Goldberg,et al.  Genetic Algorithms in Search Optimization and Machine Learning , 1988 .

[3]  Peter J. Rousseeuw,et al.  Finding Groups in Data: An Introduction to Cluster Analysis , 1990 .

[4]  John H. Holland,et al.  Adaptation in Natural and Artificial Systems: An Introductory Analysis with Applications to Biology, Control, and Artificial Intelligence , 1992 .

[5]  N. Cliff Ordinal methods for behavioral data analysis , 1996 .

[6]  Emden R. Gansner,et al.  Bunch: a clustering tool for the recovery and maintenance of software system structures , 1999, Proceedings IEEE International Conference on Software Maintenance - 1999 (ICSM'99). 'Software Maintenance for Business Change' (Cat. No.99CB36360).

[7]  Steffen Staab,et al.  Ontology-based Text Document Clustering , 2002, Künstliche Intell..

[8]  Andrian Marcus,et al.  Recovering documentation-to-source-code traceability links using latent semantic indexing , 2003, 25th International Conference on Software Engineering, 2003. Proceedings..

[9]  Michael I. Jordan,et al.  Latent Dirichlet Allocation , 2001, J. Mach. Learn. Res..

[10]  Hans-Peter Kriegel,et al.  Density-Based Clustering in Spatial Databases: The Algorithm GDBSCAN and Its Applications , 1998, Data Mining and Knowledge Discovery.

[11]  M. Harman,et al.  A robust search-based approach to project management in the presence of abandonment, rework, error and uncertainty , 2004, 10th International Symposium on Software Metrics, 2004. Proceedings..

[12]  Wei Zhao,et al.  SNIAFL: towards a static non-interactive approach to feature location , 2004, Proceedings. 26th International Conference on Software Engineering.

[13]  Mark Harman,et al.  10 th International Software Metrics Symposium (METRICS 2004) , 2004 .

[14]  Paolo Tonella,et al.  Evolutionary testing of classes , 2004, ISSTA '04.

[15]  Mark Steyvers,et al.  Finding scientific topics , 2004, Proceedings of the National Academy of Sciences of the United States of America.

[16]  Gail C. Murphy,et al.  Coping with an open bug repository , 2005, eclipse '05.

[17]  Hinrich Schütze,et al.  Introduction to information retrieval , 2008 .

[18]  Kiarash Mahdavi,et al.  Allowing Overlapping Boundaries in Source Code using a Search Based Approach to Concept Binding , 2006, 2006 22nd IEEE International Conference on Software Maintenance.

[19]  Dapeng Liu,et al.  A Combined Concept Location Method for Java Programs , 2007, 31st Annual International Computer Software and Applications Conference (COMPSAC 2007).

[20]  Yann-Gaël Guéhéneuc,et al.  Feature Location Using Probabilistic Ranking of Methods Based on Execution Scenarios and Information Retrieval , 2007, IEEE Transactions on Software Engineering.

[21]  Martin P. Robillard,et al.  Representing concerns in source code , 2007, TSEM.

[22]  Mark Harman,et al.  Search Algorithms for Regression Test Case Prioritization , 2007, IEEE Transactions on Software Engineering.

[23]  Alfred V. Aho,et al.  CERBERUS: Tracing Requirements to Source Code Using Information Retrieval, Dynamic Analysis, and Program Analysis , 2008, 2008 16th IEEE International Conference on Program Comprehension.

[24]  Tim Menzies,et al.  On the use of relevance feedback in IR-based concept location , 2009, 2009 IEEE International Conference on Software Maintenance.

[25]  Andrew McCallum,et al.  Rethinking LDA: Why Priors Matter , 2009, NIPS.

[26]  Richard N. Taylor,et al.  Software traceability with topic modeling , 2010, 2010 ACM/IEEE 32nd International Conference on Software Engineering.

[27]  Bogdan Dit,et al.  Integrating information retrieval, execution and link analysis algorithms to improve feature location in software , 2012, Empirical Software Engineering.

[28]  Giuseppe Scanniello,et al.  Clustering Support for Static Concept Location in Source Code , 2011, 2011 IEEE 19th International Conference on Program Comprehension.

[29]  Stephen W. Thomas Mining software repositories using topic models , 2011, 2011 33rd International Conference on Software Engineering (ICSE).

[30]  David Lo,et al.  Search-based fault localization , 2011, 2011 26th IEEE/ACM International Conference on Automated Software Engineering (ASE 2011).

[31]  Siau-Cheng Khoo,et al.  Towards more accurate retrieval of duplicate bug reports , 2011, 2011 26th IEEE/ACM International Conference on Automated Software Engineering (ASE 2011).

[32]  Zhenchang Xing,et al.  Concern Localization using Information Retrieval: An Empirical Study on Linux Kernel , 2011, 2011 18th Working Conference on Reverse Engineering.

[33]  Avinash C. Kak,et al.  Retrieval from software libraries for bug localization: a comparative study of generic and composite text models , 2011, MSR '11.

[34]  Claire Le Goues,et al.  GenProg: A Generic Method for Automatic Software Repair , 2012, IEEE Transactions on Software Engineering.

[35]  Jian Zhou,et al.  Where should the bugs be fixed? More accurate information retrieval-based bug localization based on bug reports , 2012, 2012 34th International Conference on Software Engineering (ICSE).

[36]  Yuanyuan Zhang,et al.  Search-based software engineering: Trends, techniques and applications , 2012, CSUR.

[37]  Gabriele Bavota,et al.  Automatic query reformulations for text retrieval in software engineering , 2013, 2013 35th International Conference on Software Engineering (ICSE).

[38]  Günther Ruhe,et al.  Search Based Software Engineering , 2013, Lecture Notes in Computer Science.

[39]  Andrian Marcus,et al.  On the Relationship between the Vocabulary of Bug Reports and Source Code , 2013, 2013 IEEE International Conference on Software Maintenance.

[40]  Gerardo Canfora,et al.  Multi-objective Cross-Project Defect Prediction , 2013, 2013 IEEE Sixth International Conference on Software Testing, Verification and Validation.

[41]  David Lo,et al.  Multi-abstraction Concern Localization , 2013, 2013 IEEE International Conference on Software Maintenance.

[42]  Andrea De Lucia,et al.  How to effectively use topic models for software engineering tasks? An approach based on Genetic Algorithms , 2013, 2013 35th International Conference on Software Engineering (ICSE).

[43]  Mark Harman,et al.  Searching for better configurations: a rigorous approach to clone evaluation , 2013, ESEC/FSE 2013.

[44]  Jane Cleland-Huang,et al.  Improving trace accuracy through data-driven configuration and composition of tracing features , 2013, ESEC/FSE 2013.

[45]  David Lo,et al.  Cross-language bug localization , 2014, ICPC 2014.

[46]  Giuseppe Scanniello,et al.  Link analysis algorithms for static concept location: an empirical assessment , 2014, Empirical Software Engineering.

[47]  David Lo,et al.  Compositional Vector Space Models for Improved Bug Localization , 2014, 2014 IEEE International Conference on Software Maintenance and Evolution.

[48]  David Lo,et al.  Version history, similar report, and structure: putting them together for improved bug localization , 2014, ICPC 2014.

[49]  David Lo,et al.  Information retrieval and spectrum based bug localization: better together , 2015, ESEC/SIGSOFT FSE.

[50]  David Lo,et al.  Collective Personalized Change Classification With Multiobjective Search , 2016, IEEE Transactions on Reliability.

[51]  David Lo,et al.  Enhancing Automated Program Repair with Deductive Verification , 2016, 2016 IEEE International Conference on Software Maintenance and Evolution (ICSME).

[52]  David Lo,et al.  History Driven Program Repair , 2016, 2016 IEEE 23rd International Conference on Software Analysis, Evolution, and Reengineering (SANER).

[53]  Stefanie Seiler,et al.  Finding Groups In Data , 2016 .

[54]  David Lo,et al.  HYDRA: Massively Compositional Model for Cross-Project Defect Prediction , 2016, IEEE Transactions on Software Engineering.