Big Data Software Analytics with Apache Spark

At the beginning of every research effort, researchers in empirical software engineering must extract data from raw data sources and transform it into the input format their tools expect. This step is time-consuming and error-prone, while the artifacts it produces (code, intermediate datasets) are usually not of scientific value. In recent years, Apache Spark has emerged as a solid foundation for data science and has taken the big data analytics domain by storm. We believe that the primitives exposed by Apache Spark can help software engineering researchers create and share reproducible, high-performance data analysis pipelines. In our technical briefing, we discuss, through a hands-on case study, how researchers can profit from Apache Spark.
