An Empirical Analysis of the Docker Container Ecosystem on GitHub

Docker allows packaging an application with its dependencies into a standardized, self-contained unit (a so-called container), which can be used for software development and to run the application on any system. Dockerfiles are declarative definitions of an environment that aim to enable reproducible builds of the container. They can often be found in source code repositories and enable the hosted software to come to life in its execution environment. We conduct an exploratory empirical study with the goal of characterizing the Docker ecosystem, prevalent quality issues, and the evolution of Dockerfiles. We base our study on a data set of over 70000 Dockerfiles, and contrast this general population with samplings that contain the Top-100 and Top-1000 most popular Docker-using projects. We find that most quality issues (28.6%) arise from missing version pinning (i.e., specifying a concrete version for dependencies). Further, we were not able to build 34% of Dockerfiles from a representative sample of 560 projects. Integrating quality checks, e.g., to issue version pinning warnings, into the container build process could result into more reproducible builds. The most popular projects change more often than the rest of the Docker population, with 5.81 revisions per year and 5 lines of code changed on average. Most changes deal with dependencies, that are currently stored in a rather unstructured manner. We propose to introduce an abstraction that, for instance, could deal with the intricacies of different package managers and could improve migration to more light-weight images.

[1]  Diomidis Spinellis,et al.  Does Your Configuration Code Smell? , 2016, 2016 IEEE/ACM 13th Working Conference on Mining Software Repositories (MSR).

[2]  Harald C. Gall,et al.  The making of cloud applications: an empirical study on software development for the cloud , 2014, ESEC/SIGSOFT FSE.

[3]  Georgios Gousios,et al.  TravisTorrent: Synthesizing Travis CI and GitHub for Full-Stack Research on Continuous Integration , 2017, 2017 IEEE/ACM 14th International Conference on Mining Software Repositories (MSR).

[4]  Pasi Kuvaja,et al.  Continuous deployment of software intensive products and services: A systematic mapping study , 2017, J. Syst. Softw..

[5]  Philipp Leitner,et al.  All the Services Large and Micro: Revisiting Industrial Practice in Services Computing , 2015, ICSOC Workshops.

[6]  Harald C. Gall,et al.  Change Distilling:Tree Differencing for Fine-Grained Source Code Change Extraction , 2007, IEEE Transactions on Software Engineering.

[7]  Jez Humble,et al.  Continuous Delivery: Reliable Software Releases Through Build, Test, and Deployment Automation , 2010 .

[8]  Georgios Gousios,et al.  Oops, my tests broke the build: An analysis of Travis CI builds with GitHub , 2016, PeerJ Prepr..

[9]  Harald C. Gall,et al.  An empirical study on principles and practices of continuous delivery and deployment , 2016, PeerJ Prepr..

[10]  Carl Boettiger,et al.  An introduction to Docker for reproducible research , 2014, OPSR.

[11]  Dharmesh Kakadia,et al.  Virtualization vs Containerization to Support PaaS , 2014, 2014 IEEE International Conference on Cloud Engineering.

[12]  Matias Martinez,et al.  Fine-grained and accurate source code differencing , 2014, ASE.

[13]  Harald C. Gall,et al.  Towards quality gates in continuous delivery and deployment , 2016, 2016 IEEE 24th International Conference on Program Comprehension (ICPC).

[14]  Florian Rosenberg,et al.  Testing Idempotence for Infrastructure as Code , 2013, Middleware.

[15]  Ramakrishnan Rajamony,et al.  An updated performance comparison of virtual machines and Linux containers , 2015, 2015 IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS).

[16]  Lucas Chaufournier,et al.  Containers and Virtual Machines at Scale: A Comparative Study , 2016, Middleware.

[17]  Miika Komu,et al.  Hypervisors vs. Lightweight Virtualization: A Performance Comparison , 2015, 2015 IEEE International Conference on Cloud Engineering.

[18]  Audris Mockus,et al.  A Large-Scale Empirical Study of the Relationship between Build Technology and Build Maintenance , 2014, Empirical Software Engineering.

[19]  Shane McIntosh,et al.  An empirical study of build maintenance effort , 2011, 2011 33rd International Conference on Software Engineering (ICSE).

[20]  Harald C. Gall,et al.  Using Docker Containers to Improve Reproducibility in Software and Web Engineering Research , 2016, ICWE.

[21]  Bram Adams,et al.  Co-evolution of Infrastructure and Source Code - An Empirical Study , 2015, 2015 IEEE/ACM 12th Working Conference on Mining Software Repositories.

[22]  Rajkumar Buyya,et al.  Article in Press Future Generation Computer Systems ( ) – Future Generation Computer Systems Cloud Computing and Emerging It Platforms: Vision, Hype, and Reality for Delivering Computing as the 5th Utility , 2022 .