Energy Efficient Computing on Multi-core Processors: Vectorization and Compression Techniques

Over the past few years, energy consumption has become the main limiting factor for computing in general. This has led CPU vendors to aggressively promote parallel computing on multiple cores without significantly increasing the thermal design power of the processor. However, achieving maximum performance and energy efficiency from the available resources on multi-core and many-core platforms requires efficient exploitation of existing and emerging architectural features at the application level. This thesis studies some of these existing and emerging technologies to identify their potential for achieving high performance and energy efficiency for a set of Smart Grid applications on Intel multi-core and many-core platforms. The first part of the thesis explores the energy-efficiency impact of different multi-core programming techniques for a selected set of benchmarks and Smart Grid applications on Intel Sandy Bridge and Haswell multi-core processors. These techniques include thread-level parallelism using OpenMP, task-based parallelism using OmpSs, data parallelism using SIMD (Single Instruction, Multiple Data) instruction sets, code optimizations, and the use of existing optimized math libraries. In our initial case studies, SIMD vectorization proved very effective in providing both high performance and energy efficiency. However, SIMD vectorization can also exert pressure on the available memory bandwidth for some applications, such as the Powel Time-Series Kernel, causing under-utilization of the computing resources and thus energy-inefficient execution. In the second part of this research, we investigate opportunities for improving the performance of SIMD vectorization for memory-bound applications using SIMD data compression, SIMD software prefetching, SIMD shuffling, code blocking, and other code transformation techniques.
The key idea is to reduce the
