Learning to Rank Improves IR in SE

Learning to Rank (LtR) encompasses a class of machine learning techniques developed to automatically learn how to better rank the documents returned by an information retrieval (IR) search. Such techniques offer great promise to software engineers because they adapt better to the wide variation in the documents and queries found in software corpora. To encourage greater use of LtR in software maintenance and evolution research, this paper explores the value that LtR brings to two common maintenance problems: feature location and traceability. When compared to the worst, median, and best models identified from among hundreds of alternative models for performing feature location, LtR consistently provides a statistically significant improvement in MAP, MRR, and MnDCG scores. Looking forward, a further motivation for the use of LtR is its ability to enable the development of software-specific retrieval models.
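As a point of reference for the metrics named above, the following sketch shows how MAP (mean average precision) and MRR (mean reciprocal rank) are conventionally computed over a set of queries. The ranked lists and relevance sets are toy data for illustration only; they are not results or code from the paper.

```python
# Illustrative computation of MAP and MRR over binary relevance judgments.
# Toy data only -- not the paper's corpora or implementation.

def average_precision(ranked, relevant):
    """Average precision for one query: mean of precision@k taken at each hit."""
    hits, score = 0, 0.0
    for k, doc in enumerate(ranked, start=1):
        if doc in relevant:
            hits += 1
            score += hits / k
    return score / len(relevant) if relevant else 0.0

def reciprocal_rank(ranked, relevant):
    """1 / rank of the first relevant document (0.0 if none is retrieved)."""
    for k, doc in enumerate(ranked, start=1):
        if doc in relevant:
            return 1.0 / k
    return 0.0

# Two hypothetical queries: a ranked list of document ids and the set of
# truly relevant ids for each.
queries = [
    (["d3", "d1", "d7"], {"d1"}),        # relevant doc first appears at rank 2
    (["d2", "d5", "d4"], {"d2", "d4"}),  # relevant docs at ranks 1 and 3
]

map_score = sum(average_precision(r, rel) for r, rel in queries) / len(queries)
mrr_score = sum(reciprocal_rank(r, rel) for r, rel in queries) / len(queries)
print(round(map_score, 3), round(mrr_score, 3))  # prints: 0.667 0.75
```

MnDCG additionally discounts relevant documents by the logarithm of their rank and normalizes by the ideal ranking, so it rewards placing the most relevant documents earliest in the list.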
