The Case for Unifying Data Loading in Machine Learning Clusters

Training machine learning models involves iteratively fetching and pre-processing batches of data. Conventionally, popular ML frameworks implement data loading within each job and focus on improving the performance of a single job. However, this approach is inefficient in shared clusters, where multiple training jobs are likely to access the same data and duplicate the same operations. To illustrate this, we present a case study that reveals that in hyper-parameter tuning experiments, up to 89% of I/O and 97% of pre-processing work is redundant and can be eliminated. Based on this observation, we make the case for unifying data loading in machine learning clusters by consolidating the isolated, per-job data loading systems into a single shared system. Such an architecture removes the redundancies that arise when each job loads data in isolation. We introduce OneAccess, a unified data access layer, and present a prototype implementation that shows a 47.3% improvement in I/O cost when sharing data across jobs. Finally, we discuss open research challenges in designing and developing a unified data loading layer that can run across frameworks on shared multi-tenant clusters, including how to handle distributed data access, support diverse sampling schemes, and exploit new storage media.
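
To make the redundancy argument concrete, the sketch below (our illustration, not the OneAccess implementation; all names are hypothetical) contrasts per-job data loading, where every hyper-parameter trial re-reads and re-preprocesses the same dataset, with a shared loader that performs that work once and reuses it across trials.

```python
import functools

# Hypothetical illustration: several hyper-parameter trials sweep over the
# same dataset. With per-job loading, every trial repeats the read + decode
# work; with a shared loader, that work happens once and is reused.

DATASET = [f"sample_{i}.jpg" for i in range(1000)]  # stand-in for file paths

def read_and_preprocess(path):
    # Placeholder for the disk I/O and decode/augment work each job repeats.
    return hash(path) % 256

# Per-job loading: cost = num_trials * len(DATASET) reads and preprocess calls.
def run_trial_isolated(lr):
    return sum(read_and_preprocess(p) for p in DATASET) * lr

# Shared loading: the cache is shared across trials, so cost = len(DATASET).
@functools.lru_cache(maxsize=None)
def shared_read_and_preprocess(path):
    return read_and_preprocess(path)

def run_trial_shared(lr):
    return sum(shared_read_and_preprocess(p) for p in DATASET) * lr

if __name__ == "__main__":
    for lr in [0.1, 0.01, 0.001, 0.0001]:
        run_trial_shared(lr)
    # 4 trials over 1000 samples: isolated loading performs 4000 reads,
    # shared loading performs 1000 -- a 75% reduction in this toy setting,
    # in the spirit of the larger redundancy figures reported above.
```

In a real cluster the shared cache would have to span processes, frameworks, and machines rather than a single Python process, which is precisely the gap a unified data access layer such as OneAccess aims to fill.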
