ad-heap: An Efficient Heap Data Structure for Asymmetric Multicore Processors

The heap is one of the most fundamental data structures in computer science. Unfortunately, heaps have long failed to obtain ideal performance gains on widely used throughput-oriented processors, for two reasons: (1) the heap property dictates that operations between any parent node and its child nodes must be executed sequentially, and (2) heaps, even d-heaps (d-ary or d-way heaps), cannot supply wide enough data parallelism to these processors. Recent research has proposed more versatile asymmetric multicore processors (AMPs) that consist of two types of cores (latency-oriented cores with high single-thread performance and throughput-oriented cores with wide vector processing capability), a unified memory address space, and a faster synchronization mechanism among cores with different ISAs. To leverage AMPs for the heap data structure, in this paper we propose ad-heap, an efficient heap data structure that introduces an implicit bridge structure and properly apportions workloads between the two types of cores. We implement a batch k-selection algorithm and conduct experiments on simulated AMP environments composed of real CPUs and GPUs. On two representative platforms, ad-heap obtains up to 1.5x and 3.6x speedups, respectively, over the optimal AMP scheduling method that executes the fastest d-heaps on the standalone CPUs and GPUs in parallel.
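
The ad-heap's internal bridge structure cannot be reconstructed from this abstract alone, but its baseline, the d-heap, is standard. As a point of reference, the following is a minimal C++ sketch of a d-ary min-heap plus a k-selection routine built on repeated extract-min; the branching factor D, all identifiers, and the selection strategy are illustrative assumptions, not the paper's implementation. Note that the D-way child comparison in sift-down is the step a wide throughput-oriented core can accelerate, while the parent-child comparison chain stays inherently sequential.

    // Minimal d-ary min-heap sketch (illustrative; not the paper's ad-heap).
    // The branching factor D and all identifiers are assumptions.
    #include <cstddef>
    #include <iostream>
    #include <utility>
    #include <vector>

    template <std::size_t D = 4>  // D = branching factor of the d-heap
    class DHeap {
    public:
        void insert(int key) {
            data_.push_back(key);
            sift_up(data_.size() - 1);
        }

        // Precondition: the heap is non-empty.
        int extract_min() {
            int min = data_.front();
            data_.front() = data_.back();
            data_.pop_back();
            if (!data_.empty()) sift_down(0);
            return min;
        }

        bool empty() const { return data_.empty(); }

    private:
        std::vector<int> data_;

        // Walk upward: the parent of node i is (i - 1) / D. This chain of
        // parent-child comparisons is inherently sequential (reason (1) above).
        void sift_up(std::size_t i) {
            while (i > 0) {
                std::size_t parent = (i - 1) / D;
                if (data_[parent] <= data_[i]) break;
                std::swap(data_[parent], data_[i]);
                i = parent;
            }
        }

        // Walk downward: children of node i are D*i + 1 .. D*i + D. Finding the
        // minimum of up to D children is the data-parallel step (reason (2) above)
        // that a wide vector core can help with when D is large.
        void sift_down(std::size_t i) {
            for (;;) {
                std::size_t smallest = i;
                std::size_t first_child = D * i + 1;
                for (std::size_t c = first_child;
                     c < first_child + D && c < data_.size(); ++c) {
                    if (data_[c] < data_[smallest]) smallest = c;
                }
                if (smallest == i) break;
                std::swap(data_[i], data_[smallest]);
                i = smallest;
            }
        }
    };

    // k-selection: return the k smallest keys via k successive extract-min calls.
    template <std::size_t D>
    std::vector<int> k_select(DHeap<D>& heap, std::size_t k) {
        std::vector<int> result;
        while (k-- > 0 && !heap.empty()) result.push_back(heap.extract_min());
        return result;
    }

    int main() {
        DHeap<4> heap;
        for (int key : {9, 3, 7, 1, 8, 2, 5}) heap.insert(key);
        for (int key : k_select(heap, 3)) std::cout << key << ' ';  // prints: 1 2 3
        std::cout << '\n';
    }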
