Predicting source code changes by mining change history

Software developers are often faced with modification tasks that involve source which is spread across a code base. Some dependencies between source code, such as those between source code written in different languages, are difficult to determine using existing static and dynamic analyses. To augment existing analyses and to help developers identify relevant source code during a modification task, we have developed an approach that applies data mining techniques to determine change patterns - sets of files that were changed together frequently in the past - from the change history of the code base. Our hypothesis is that the change patterns can be used to recommend potentially relevant source code to a developer performing a modification task. We show that this approach can reveal valuable dependencies by applying the approach to the Eclipse and Mozilla open source projects and by evaluating the predictability and interestingness of the recommendations produced for actual modification tasks on these systems.

[1]  D HerbslebJames,et al.  Two case studies of open source software development , 2002 .

[2]  Amir Michail,et al.  Data mining library reuse patterns in user-selected applications , 1999, 14th IEEE International Conference on Automated Software Engineering.

[3]  Philip S. Yu,et al.  Using a Hash-Based Method with Transaction Trimming for Mining Association Rules , 1997, IEEE Trans. Knowl. Data Eng..

[4]  Paul J. Layzell,et al.  Facilitating program comprehension by mining association rules from source code , 2003, 11th IEEE International Workshop on Program Comprehension, 2003..

[5]  Rajeev Motwani,et al.  Beyond market baskets: generalizing association rules to correlations , 1997, SIGMOD '97.

[6]  Keith Brian Gallagher,et al.  Using Program Slicing in Software Maintenance , 1991, IEEE Trans. Software Eng..

[7]  Rakesh Agarwal,et al.  Fast Algorithms for Mining Association Rules , 1994, VLDB 1994.

[8]  Harald C. Gall,et al.  Analyzing and relating bug report data for feature tracking , 2003, 10th Working Conference on Reverse Engineering, 2003. WCRE 2003. Proceedings..

[9]  Jens Krinke,et al.  Identifying similar code with program dependence graphs , 2001, Proceedings Eighth Working Conference on Reverse Engineering.

[10]  Harald C. Gall,et al.  Populating a Release History Database from version control and bug tracking systems , 2003, International Conference on Software Maintenance, 2003. ICSM 2003. Proceedings..

[11]  Andreas Zeller,et al.  How history justifies system architecture (or not) , 2003, Sixth International Workshop on Principles of Software Evolution, 2003. Proceedings..

[12]  Brenda S. Baker,et al.  A Program for Identifying Duplicated Code , 1992 .

[13]  D. L. Parnas,et al.  On the criteria to be used in decomposing systems into modules , 1972, Software Pioneers.

[14]  Gail C. Murphy,et al.  Hipikat: recommending pertinent software development artifacts , 2003, 25th International Conference on Software Engineering, 2003. Proceedings..

[15]  Andreas Zeller,et al.  Mining Version Histories to Guide Software Changes , 2004 .

[16]  Joseph Robert Horgan,et al.  Dynamic program slicing , 1990, PLDI '90.

[17]  Stan Matwin,et al.  Supporting maintenance of legacy software with data mining techniques , 2000, CASCON.

[18]  David W. Binkley,et al.  Program slicing , 2008, 2008 Frontiers of Software Maintenance.

[19]  Jian Pei,et al.  Mining frequent patterns without candidate generation , 2000, SIGMOD '00.

[20]  Ettore Merlo,et al.  Experiment on the automatic detection of function clones in a software system using metrics , 1996, 1996 Proceedings of International Conference on Software Maintenance.

[21]  David B. Leblang The CM challenge: configuration management that works , 1995 .

[22]  J. Herbsleb,et al.  Two case studies of open source software development: Apache and Mozilla , 2002, TSEM.

[23]  Ramakrishnan Srikant,et al.  Fast Algorithms for Mining Association Rules in Large Databases , 1994, VLDB.

[24]  Osmar R. Zaïane,et al.  Incremental mining of frequent patterns without candidate generation or support constraint , 2003, Seventh International Database Engineering and Applications Symposium, 2003. Proceedings..

[25]  Robert S. Arnold,et al.  Software Change Impact Analysis , 1996 .

[26]  Tomasz Imielinski,et al.  Mining association rules between sets of items in large databases , 1993, SIGMOD Conference.

[27]  Boris Magnusson,et al.  Fine Grained Version Control of Configurations in COOP/Orm , 1996, SCM.

[28]  Amir Michail,et al.  Data mining library reuse patterns using generalized association rules , 2000, Proceedings of the 2000 International Conference on Software Engineering. ICSE 2000 the New Millennium.

[29]  Audris Mockus,et al.  Globalization by Chunking: A Quantitative Approach , 2001, IEEE Softw..

[30]  Farhad Mavaddat,et al.  Architectural design recovery using data mining techniques , 2000, Proceedings of the Fourth European Conference on Software Maintenance and Reengineering.