Microarchitecture-Aware Code Generation for Deep Learning on Single-ISA Heterogeneous Multi-Core Mobile Processors

Single-ISA heterogeneous multi-core processors are widely used in mobile computing, yet typical code generation optimizes for a single target core, leaving the code less well suited to the other cores in the processor. We present a microarchitecture-aware code generation methodology to mitigate this issue. First, we adopt function multi-versioning (FMV) so that application code can use each core at full capacity regardless of its microarchitecture. Second, we add a simple but powerful backend optimization pass to the compiler to further boost performance on the cores where it applies. Building on these two schemes, we developed an automated flow that analyzes the program and generates multiple versions of its hot functions, each tailored to a different microarchitecture. At runtime, the core executing the code selects the version best suited to it to maximize computational performance. Measurements confirm that the methodology improves the performance of the Cortex-A55 and Cortex-A75 cores in Samsung’s next-generation Exynos 9820 processor by 11.2% and 17.9%, respectively, when running TensorFlow Lite.
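The runtime selection described above can be illustrated with a minimal C sketch. It is not the paper's implementation: the names `core_kind`, `detect_core`, `set_core_for_demo`, and `dot_i8` are hypothetical, the core query is a stub, and the hand-unrolled variants merely stand in for versions that a compiler would generate and tune for each microarchitecture.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical core kinds, named after the cores mentioned in the abstract. */
typedef enum { CORE_GENERIC = 0, CORE_A55, CORE_A75, CORE_KIND_COUNT } core_kind;

typedef int32_t (*dot_fn)(const int8_t *a, const int8_t *b, size_t n);

/* Baseline version: a plain loop with a single accumulator. */
static int32_t dot_generic(const int8_t *a, const int8_t *b, size_t n) {
    int32_t acc = 0;
    for (size_t i = 0; i < n; ++i) acc += (int32_t)a[i] * b[i];
    return acc;
}

/* Stand-in for a version tuned for a little core: two accumulators. */
static int32_t dot_a55(const int8_t *a, const int8_t *b, size_t n) {
    int32_t acc0 = 0, acc1 = 0;
    size_t i = 0;
    for (; i + 2 <= n; i += 2) {
        acc0 += (int32_t)a[i] * b[i];
        acc1 += (int32_t)a[i + 1] * b[i + 1];
    }
    for (; i < n; ++i) acc0 += (int32_t)a[i] * b[i]; /* leftover element */
    return acc0 + acc1;
}

/* Stand-in for a version tuned for a big core: four accumulators. */
static int32_t dot_a75(const int8_t *a, const int8_t *b, size_t n) {
    int32_t acc[4] = {0, 0, 0, 0};
    size_t i = 0;
    for (; i + 4 <= n; i += 4)
        for (size_t j = 0; j < 4; ++j) acc[j] += (int32_t)a[i + j] * b[i + j];
    for (; i < n; ++i) acc[0] += (int32_t)a[i] * b[i]; /* leftover elements */
    return acc[0] + acc[1] + acc[2] + acc[3];
}

/* One entry per core kind; every version computes the same result. */
static const dot_fn dot_versions[CORE_KIND_COUNT] = { dot_generic, dot_a55, dot_a75 };

/* Stub: a real system would ask the platform which core the thread runs on. */
static core_kind demo_core = CORE_GENERIC;
static void set_core_for_demo(core_kind k) { demo_core = k; }
static core_kind detect_core(void) { return demo_core; }

/* Dispatcher: picks the version that matches the currently running core. */
static int32_t dot_i8(const int8_t *a, const int8_t *b, size_t n) {
    core_kind k = detect_core();
    assert((unsigned)k < (unsigned)CORE_KIND_COUNT);
    return dot_versions[k](a, b, n);
}
```

Callers use only `dot_i8`; which variant runs is decided on each call, so a thread that migrates between core types picks up the matching version on its next call. A real implementation would also need to keep the cost of the core query low relative to the work done in the hot function.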
