Scaling Deep Learning for Cancer with Advanced Workflow Storage Integration

Cancer Deep Learning Environment (CANDLE) benchmarks and workflows combine the power of exascale computing with neural network-based machine learning to address a range of loosely connected problems in cancer research. This application area poses unique challenges to the exascale computing environment. Here, we focus on one such challenge in CANDLE workflows: saving neural network model representations to persistent storage. In this paper, we provide background on the problem, describe our solution, the Model Cache, and present performance results from running the system on a test cluster, the ANL/LCRC Blues cluster, and the petascale supercomputer Cori at NERSC. We also sketch next steps for this promising workflow storage solution.
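To make the storage challenge concrete, the sketch below shows the kind of per-task model save that a CANDLE-style hyperparameter-search workflow issues, and that a Model Cache would mediate; at scale, many thousands of such writes hit shared storage concurrently. The network, file layout, and use of Keras HDF5 serialization here are illustrative assumptions, not the paper's actual Model Cache interface.

```python
# Minimal sketch, assuming Keras/TensorFlow and HDF5 serialization.
# Not the paper's Model Cache API; paths and names are hypothetical.
import os
from tensorflow import keras

def build_model(learning_rate=1e-3):
    # Small stand-in network; CANDLE benchmarks train much larger models.
    model = keras.Sequential([
        keras.layers.Dense(64, activation="relu", input_shape=(100,)),
        keras.layers.Dense(1, activation="sigmoid"),
    ])
    model.compile(optimizer=keras.optimizers.Adam(learning_rate),
                  loss="binary_crossentropy")
    return model

def save_model_representation(model, run_id, out_dir="model_store"):
    # Each workflow task persists its trained model representation;
    # this is the write traffic a Model Cache would absorb and manage.
    os.makedirs(out_dir, exist_ok=True)
    path = os.path.join(out_dir, f"{run_id}.h5")
    model.save(path)  # architecture + weights + optimizer state
    return path

model = build_model()
save_model_representation(model, run_id="hp_trial_0001")
```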
