Declarative visitors to ease fine-grained source code mining with full history on billions of AST nodes

Software repositories contain a vast wealth of information about software development. Mining these repositories has proven useful for detecting patterns in software development, testing hypotheses for new software engineering approaches, etc. Specifically, mining source code has yielded significant insights into software development artifacts and processes. Unfortunately, mining source code at a large scale remains a difficult task. Previous approaches had to either limit the scope of the projects studied, limit the scope of the mining task to be more coarse-grained, or sacrifice studying the history of the code due to both human and computational scalability issues. In this paper, we address the substantial challenges of mining source code: a) at a very large scale; b) at a fine-grained level of detail; and c) with full history information. To address these challenges, we present domain-specific language features for source code mining. Our language features are inspired by object-oriented visitors and provide a default depth-first traversal strategy along with two expressions for defining custom traversals. We provide an implementation of these features in the Boa infrastructure for software repository mining and describe a strategy for generating Java code from them. To show the usability of our domain-specific language features, we reproduced over 40 source code mining tasks from two large-scale previous studies in just 2 person-weeks. The resulting code for these tasks shows a 2.0x to 4.8x reduction in code size. Finally, we perform a small controlled experiment to gain insights into how easily mining tasks written using our language features can be understood without prior training. Study participants understood a substantial share of the tasks (77%), taking about 3 minutes per task.
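
To make the visitor idea concrete, the following is a minimal Python sketch of a visitor with a default depth-first traversal that a mining task can override. It is not Boa syntax and not the generated Java code; `Node`, `Visitor`, `MethodCounter`, and `TopLevelMethodCounter` are hypothetical names introduced only for illustration, and the convention that a `before` hook returns `False` to skip a subtree is an assumption of this sketch.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Node:
    """Hypothetical AST node: a kind label plus child nodes."""
    kind: str
    children: List["Node"] = field(default_factory=list)


class Visitor:
    """Visitor with a default depth-first traversal (illustrative only)."""

    def before(self, node: Node) -> bool:
        # Return False to skip the default traversal of this node's children.
        return True

    def after(self, node: Node) -> None:
        pass

    def visit(self, node: Node) -> None:
        if self.before(node):
            for child in node.children:
                self.visit(child)
        self.after(node)


class MethodCounter(Visitor):
    """Counts every method, relying on the default traversal."""

    def __init__(self) -> None:
        self.count = 0

    def before(self, node: Node) -> bool:
        if node.kind == "Method":
            self.count += 1
        return True


class TopLevelMethodCounter(MethodCounter):
    """Custom traversal: count methods but do not descend into their bodies."""

    def before(self, node: Node) -> bool:
        super().before(node)
        return node.kind != "Method"


# A file with one class holding two methods; the first method contains
# an anonymous class that declares a method of its own.
inner = Node("Class", [Node("Method")])
tree = Node("File", [Node("Class", [Node("Method", [inner]), Node("Method")])])

all_methods = MethodCounter()
all_methods.visit(tree)

top_level = TopLevelMethodCounter()
top_level.visit(tree)
```

With the default traversal, the counter reaches the nested method as well; the custom variant stops at each method node and so ignores declarations nested inside method bodies. The task author writes only the hooks, which is the kind of saving the abstract attributes to declarative visitors.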
