Towards Reproducible and Reusable Deep Learning Systems Research Artifacts

This paper discusses results and insights from the 1st ReQuEST workshop, a collective effort to promote reusability, portability and reproducibility of deep learning research artifacts within the Architecture/PL/Systems communities. ReQuEST (Reproducible Quality-Efficient Systems Tournament) exploits the open-source Collective Knowledge framework (CK) to unify benchmarking, optimization, and co-design of deep learning systems implementations and to exchange results via a live multi-objective scoreboard. Systems evaluated under ReQuEST are diverse and include an FPGA-based accelerator, optimized deep learning libraries for x86 and ARM systems, and distributed inference in the Amazon cloud and over a cluster of Raspberry Pis. We finally discuss the limitations of our approach and how we plan to improve upon them for the upcoming SysML artifact evaluation effort.

1 ReQuEST Overview

The quest to continually optimize deep learning systems has introduced new deep learning models, frameworks, DSLs, libraries, compilers and hardware architectures. In this frantically changing environment, it has become critical to quickly reproduce, deploy, and build on top of existing research. While open-sourcing research artifacts is one step in the right direction, it is not sufficient to guarantee ease of reproducibility and reusability. To enable reproducible and reusable research, we need to provide complete, customizable, and portable workflows that combine off-the-shelf and custom layers of the system stack and deploy them in a push-button fashion to generate end-to-end metrics of importance.

In an effort to promote reproducible, reusable, and portable workflows in deep learning systems research, we introduced the ReQuEST workshop at ACM ASPLOS 2018 (a venue for multidisciplinary systems research spanning computer architecture and hardware, programming languages and compilers, operating systems and networking). The goal was to have computer architecture, compiler, and systems researchers submit deep learning research artifacts (code, data, and experiments) using a unified Collective Knowledge (CK) workflow framework Fursin et al. (2016) to produce a multi-objective scoreboard ranking submissions under varied cost metrics, including: accuracy on the ImageNet validation set (50,000 images), latency (seconds per image), throughput (images per second), platform price (dollars), and peak power consumption (Watts). To keep the task of collecting artifacts tractable, we focused on a single problem, ImageNet classification, but gave complete freedom over which models, frameworks, libraries, compilers and hardware platforms were used to solve it.

The most important difference between ReQuEST and related workshops and tournaments such as DawnBench daw (2018) and LPIRC lpi (2015) is that we not only publish final results but also share portable and customizable workflows (i.e. not just Docker images) with all related research components (models, data sets, libraries) so that the community can immediately reuse, improve, and build upon them.

Figure 1: We leverage the open Collective Knowledge workflow framework (CK) and the rigorous ACM artifact evaluation (AE) methodology to let the community collaboratively explore quality vs. efficiency trade-offs for rapidly evolving workloads across diverse systems.
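To make the multi-objective ranking behind the scoreboard concrete, the sketch below shows one way to extract a quality-vs-efficiency Pareto frontier from per-submission metrics. The submission names, metric values, and the choice of accuracy and latency as the two objectives are illustrative assumptions, not actual ReQuEST results.

```python
# Illustrative sketch: select Pareto-optimal submissions trading off
# ImageNet top-1 accuracy (higher is better) against latency (lower is better).
# Submission names and metric values are made up for illustration.

submissions = [
    {"id": "fpga-accelerator",  "accuracy": 0.68, "latency_s": 0.010},
    {"id": "arm-optimized-lib", "accuracy": 0.71, "latency_s": 0.120},
    {"id": "x86-distributed",   "accuracy": 0.76, "latency_s": 0.045},
]

def dominates(a, b):
    """True if submission a is at least as good as b on both metrics
    and strictly better on at least one."""
    return (a["accuracy"] >= b["accuracy"] and a["latency_s"] <= b["latency_s"]
            and (a["accuracy"] > b["accuracy"] or a["latency_s"] < b["latency_s"]))

# Keep only submissions not dominated by any other submission.
pareto = [s for s in submissions
          if not any(dominates(o, s) for o in submissions if o is not s)]

for s in sorted(pareto, key=lambda s: s["latency_s"]):
    print(f'{s["id"]}: top-1 {s["accuracy"]:.2f}, {s["latency_s"] * 1000:.0f} ms/image')
```

The live scoreboard extends this idea to further cost metrics such as platform price and peak power, so a submission can be competitive on one trade-off curve without winning on all of them.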
The first iteration of the ReQuEST workshop led to five artifact submissions that were unified under the CK framework and evaluated (reproduced) by the organizers. What the submissions lacked in quantity, they made up for in diversity: (1) they spanned architecture, compilers, and systems research; (2) they utilized x86, ARM, and FPGA-based platforms; and (3) they were deployed on single-node systems as well as distributed nodes.

2 Unifying Artifacts and Workflows with CK

ReQuEST aims to promote reproducibility of experimental results and reusability/customization of systems research artifacts by standardizing evaluation methodologies and facilitating the deployment of efficient solutions on heterogeneous platforms. For that reason, packaging artifacts (scripts, libraries, frameworks, data sets, models) and experimental results requires a bit more involvement than sharing some CSV/JSON files or checking out a given GitHub repository. That is why we build our competition on top of CK Fursin et al. (2016) to provide unified evaluation and a real-time leaderboard of submissions. CK is an open-source portable workflow framework used as the standard ACM artifact evaluation methodology at ACM and IEEE systems conferences (CGO, PPoPP, PACT, Supercomputing).

CK works as a Python wrapper framework that helps users share their code and data as customizable and reusable plugins with a common JSON API, meta descriptions and an integrated package manager, adaptable to user platforms running Linux, Windows, MacOS and Android. Researchers can then quickly prototype experimental workflows from shared components, crowdsource benchmarking and autotuning across diverse models, data sets and platforms, exchange results via public scoreboards, and generate interactive reports ck (2018). A minimal usage sketch is given after the component list in Section 3.

3 Artifact Submissions Overview

The ReQuEST-ASPLOS’18 proceedings, available in the ACM Digital Library, include five papers with Artifact Appendices and a set of ACM reproducibility badges.

The CK repositories for all ReQuEST-ASPLOS’18 artifacts are documented and available at the following link: https://github.com/ctuning/ck-request-asplos18-results. The interactive live scoreboard can be accessed at the following URL: http://cKnowledge.org/request-results. The proceedings are accompanied by snapshots of Collective Knowledge workflows covering a very diverse model/software/hardware stack:

• Models: MobileNets, ResNet-18, ResNet-50, Inception-v3, VGG16, AlexNet, SSD.
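As a concrete illustration of the workflow reuse described in Section 2, the sketch below drives CK from Python via its JSON API to pull the shared ReQuEST repository and enumerate its experiment entries. The `ck.kernel.access` call and the pull/list actions follow CK's general usage, but this is only a hedged sketch: exact module and entry names vary per artifact, so consult the repository READMEs for the authoritative commands.

```python
# Hedged sketch of driving CK through its Python JSON API.
# The repository name comes from this paper; other identifiers (e.g. which
# experiment entries exist) depend on the individual artifacts.

import ck.kernel as ck  # provided by the 'ck' pip package


def run(request):
    """Call the CK kernel and fail loudly on a non-zero return code."""
    r = ck.access(request)
    if r['return'] > 0:
        raise RuntimeError(r.get('error', 'unknown CK error'))
    return r


# Pull the shared ReQuEST workflows and results.
run({'action': 'pull',
     'module_uoa': 'repo',
     'data_uoa': 'ck-request-asplos18-results'})

# List the experiment entries that back the live scoreboard.
experiments = run({'action': 'list',
                   'module_uoa': 'experiment',
                   'out': ''})
for entry in experiments.get('lst', []):
    print(entry['data_uoa'])
```

The same actions are available from the command line (e.g. `ck pull repo:ck-request-asplos18-results`), which is how the artifact READMEs typically document their reproduction steps.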