Toward Optimal Selection of Information Retrieval Models for Software Engineering Tasks

Information Retrieval (IR) plays a pivotal role in diverse Software Engineering (SE) tasks, e.g., bug localization and triaging, bug report routing, code retrieval, and requirements analysis. SE tasks operate on diverse types of documents, including code, text, stack traces, and structured, semi-structured, and unstructured metadata that often contain specialized vocabularies. Because the performance of any IR-based tool critically depends on the underlying document types, and given the diversity of SE corpora, it is essential to understand which models work best for which types of SE documents and tasks. We empirically investigate the interaction between IR models and document types for two representative SE tasks (bug localization and relevant project search), chosen because they require a diverse set of SE artifacts (mixtures of code and text), and confirm that the models' performance varies significantly with the mix of document types. Leveraging this insight, we propose a generalized framework, SRCH, to automatically select the most favorable IR model(s) for a given SE task. We evaluate SRCH on these two tasks and confirm its effectiveness. Our preliminary user study shows that SRCH's intelligent adaptation of the IR model(s) to the task at hand not only improves precision and recall for SE tasks but may also improve user satisfaction.
