CEML: A Coordinated Runtime System for Efficient Machine Learning on Heterogeneous Computing Systems

Heterogeneous computing is rapidly emerging as a promising solution for efficient machine learning. Despite extensive prior work, however, system-software support for efficient machine learning on heterogeneous computing systems remains largely unexplored. To bridge this gap, we propose CEML, a coordinated runtime system for efficient machine learning on heterogeneous computing systems. CEML dynamically analyzes the performance and power characteristics of the target machine-learning application and robustly adapts the system state to enhance its efficiency. Our quantitative evaluation demonstrates that CEML significantly improves the efficiency of machine-learning applications on a full heterogeneous computing system.
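
The abstract does not detail CEML's adaptation algorithm, but the sketch below illustrates the general shape of such a coordinated runtime: sample the application's performance and power, score each candidate system state by an efficiency metric, and switch to the best one. Everything here is an assumption made for illustration, not CEML's actual design: the knobs (big-core count and frequency on a big.LITTLE-style platform), the throughput-per-watt metric, and the stubbed sensor functions.

```python
import itertools
import random  # stands in for real sensors in this sketch

# Hypothetical tuning knobs; a real heterogeneous system state would
# also cover little cores, GPU frequency, thread placement, etc.
BIG_CORES = [1, 2, 4]
FREQ_GHZ = [0.8, 1.4, 2.0]

def measure_throughput(big, freq):
    # Placeholder: a real runtime would read application progress
    # (e.g., heartbeat-style per-iteration counters) under this state.
    return big * freq * random.uniform(0.9, 1.1)

def measure_power(big, freq):
    # Placeholder: a real runtime would sample on-board power sensors.
    return 0.5 + big * freq ** 2 * random.uniform(0.9, 1.1)

def best_state():
    """Return the (big cores, frequency) pair maximizing throughput per watt."""
    def efficiency(state):
        big, freq = state
        return measure_throughput(big, freq) / measure_power(big, freq)
    return max(itertools.product(BIG_CORES, FREQ_GHZ), key=efficiency)

if __name__ == "__main__":
    big, freq = best_state()
    print(f"selected state: {big} big cores @ {freq} GHz")
```

A deployed runtime would repeat this decision periodically rather than once, so that the system state tracks phase changes in the machine-learning workload; the exhaustive search shown here is only viable because the illustrative state space is tiny.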
