Incremental Map-Reduce on Repository History

Work on Mining Software Repositories (MSR) typically involves processing abstractions of resources on individual revisions. Correspondingly, processing abstractions of resource changes often requires working with all revisions of the repository history to guarantee a high resolution of the measured changes. Abstractions of resources and abstractions of resource changes are often closely related, to the point that they can be used interchangeably in the processing. In practice, approaches that process abstractions over high revision counts face a scalability challenge. In this work, we address this challenge by incrementalizing the processing of repository resources and the corresponding abstractions. Our work is inspired by incrementalization theory, including insights on Abelian groups, group homomorphisms, and indexing. We provide a map-reduce interface that enables calls to foreign functionality and offers convenient operations for processing abstractions, such as mapping, filtering, group-wise aggregation, and joining. Apache Spark is used for distribution. We compare the scalability of our approach with available MSR approaches, namely LISA, which reduces redundancy, and DJ-Rex, which migrates an analysis to a distributed map-reduce framework.
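The role of Abelian groups and group homomorphisms in incrementalization can be illustrated with a minimal sketch. It is not the interface contributed by this work: the bag representation, the function names (`add`, `count_by_extension`), and the sample revisions are all assumptions made for illustration. Bags with integer multiplicities form an Abelian group under pointwise addition, where negative multiplicities encode deletions. A group-wise aggregation h that is a homomorphism satisfies h(old + delta) = h(old) + h(delta), so the result for a new revision can be obtained from the previous result and the change alone.

```python
from collections import defaultdict


def add(a, b):
    """Group operation on bags: pointwise addition of multiplicities.

    Entries whose multiplicity becomes zero are dropped, so that equal
    bags have equal dictionary representations.
    """
    out = defaultdict(int)
    for bag in (a, b):
        for key, n in bag.items():
            out[key] += n
    return {key: n for key, n in out.items() if n != 0}


def count_by_extension(bag):
    """Group-wise aggregation: number of lines per file extension.

    This is a homomorphism on bags, i.e. it distributes over `add`.
    """
    out = defaultdict(int)
    for (path, _line), n in bag.items():
        out[path.rsplit(".", 1)[-1]] += n
    return {key: n for key, n in out.items() if n != 0}


# Hypothetical revision 1: a bag of (path, line) pairs.
rev1 = {("a.py", "x = 1"): 1, ("a.py", "y = 2"): 1, ("b.md", "# Title"): 1}

# Hypothetical change to revision 2: one deleted line (-1), two added lines (+1).
delta = {("a.py", "y = 2"): -1, ("c.py", "z = 3"): 1, ("b.md", "text"): 1}

rev2 = add(rev1, delta)

# Full recomputation on revision 2 ...
full = count_by_extension(rev2)

# ... versus the incremental update, which only inspects the change.
incremental = add(count_by_extension(rev1), count_by_extension(delta))

print(full == incremental)
```

In this sketch the incremental path touches only the three changed elements instead of the whole revision, which is the kind of saving that matters when abstractions are processed over many revisions.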
