Paper Information - Mining Unstructured Software Repositories Using IR Models

Mining software repositories, the process of analyzing data related to software development practices, is an emerging field that aims to aid development teams in their day-to-day tasks. However, data in many software repositories is currently unused because it is unstructured, and therefore difficult to mine and analyze. Information Retrieval (IR) techniques, which were developed specifically to handle unstructured data, have recently been used by researchers to mine and analyze the unstructured data in software repositories, with some success. The main contribution of this thesis is the idea that the research and practice of using IR models to mine unstructured software repositories can be improved by going beyond the current state of affairs. First, we propose new applications of IR models to existing software engineering tasks. Specifically, we present a technique to prioritize test cases based on their IR similarity, giving highest priority to those test cases that are most dissimilar. In another new application of IR models, we empirically recover how developers use their mailing list while developing software. Next, we show how the use of advanced IR techniques can improve results. Using a framework for combining disparate IR models, we find that bug localization performance can be improved by 14–56% on average, compared to the best individual IR model. In addition, by using topic evolution models on the history of source code, we can uncover the evolution of source code concepts with an accuracy of 87–89%. Finally, we show the risks of current research, which uses IR models as black boxes without fully understanding their assumptions and parameters. We show that data duplication in source code has undesirable effects for IR models, and that by eliminating the duplication, the accuracy of IR models improves. Additionally, we find that in the bug localization task, an unwise choice of parameter values results in an accuracy of only 1%, where optimal parameters can achieve an accuracy of 55%. Through empirical case studies on real-world systems, we show that all of our proposed techniques and methodologies significantly improve the state-of-the-art.
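
The abstract does not spell out how test cases are prioritized by IR similarity, so the following is only a minimal sketch of the general idea, not the thesis's implementation. It assumes TF-IDF vectors over the test source text, cosine distance as the dissimilarity measure, and a greedy "farthest first" ordering; the function names and the sample tests are hypothetical.

```python
import math
import re
from collections import Counter


def tfidf_vectors(docs):
    """Turn each text into a sparse TF-IDF vector (dict: term -> weight)."""
    token_lists = [re.findall(r"[a-z]+", d.lower()) for d in docs]
    n = len(docs)
    # Document frequency: in how many texts does each term occur?
    df = Counter(t for tokens in token_lists for t in set(tokens))
    vectors = []
    for tokens in token_lists:
        tf = Counter(tokens)
        # Smoothed IDF (an assumption), so common terms keep a positive weight.
        vectors.append(
            {t: c * (math.log((1 + n) / (1 + df[t])) + 1.0) for t, c in tf.items()}
        )
    return vectors


def cosine_distance(a, b):
    """1 - cosine similarity of two sparse vectors; 1.0 if either is empty."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 1.0
    return 1.0 - dot / (norm_a * norm_b)


def prioritize(tests):
    """Order test names so that each pick is as dissimilar as possible
    from the tests already picked. `tests` maps name -> source text."""
    names = list(tests)
    if not names:
        return []
    vecs = dict(zip(names, tfidf_vectors([tests[n] for n in names])))

    def avg_distance(name):
        others = [m for m in names if m != name]
        total = sum(cosine_distance(vecs[name], vecs[m]) for m in others)
        return total / max(len(others), 1)

    # Seed with the test that is, on average, farthest from all others.
    order = [max(names, key=avg_distance)]
    remaining = [n for n in names if n != order[0]]
    while remaining:
        # Pick the test whose nearest already-chosen test is farthest away.
        nxt = max(
            remaining,
            key=lambda n: min(cosine_distance(vecs[n], vecs[m]) for m in order),
        )
        order.append(nxt)
        remaining.remove(nxt)
    return order


# Hypothetical test suite: two login tests and two export tests.
sample_tests = {
    "test_login_ok": "login user password session",
    "test_login_bad": "login user password error",
    "test_export_csv": "export report csv file",
    "test_export_pdf": "export report pdf file",
}
print(prioritize(sample_tests))
```

With this sample suite, the first two tests chosen come from different functional areas (one login test and one export test), which is the intuition behind running dissimilar tests early: they are more likely to exercise different parts of the system.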

Stephen W. Thomas
