Handling Duplicates in Dockerfiles Families: Learning from Experts

Docker is becoming a popular tool used by developers and end-users to deploy and run software applications. Dockerfiles are now part of software projects in the same way as other software artefacts such as source code or configuration files. Many projects even maintain families of Dockerfiles rather than a single Dockerfile; the Python project, for example, simultaneously maintains a family of 43 Dockerfiles, each targeting specific versions or dependencies. In this paper, we investigate whether the traditional maintenance challenge of handling duplicates arises in such projects, since this challenge is classical in software development, even for non-code software artefacts. Our goal is to give practitioners a clear explanation of why duplicates arise in projects and of the different means to handle them, with their pros and cons. To do so, we observe the practices of expert Dockerfile maintainers of Official Docker projects (128 projects) and survey 25 maintainers from our corpus. We show that duplicates in Dockerfiles are frequent in our corpus, and that developers are aware of their existence, face them frequently, and have a split opinion regarding them (error-prone but easy to maintain with the right tools). Finally, we show that some maintainers manage to limit duplicates by using ad-hoc tools. These tools, while sometimes hard to set up, can reduce the amount of duplicates by up to 85%.
