ASPaS: A Framework for Automatic SIMDization of Parallel Sorting on x86-based Many-core Processors

Because modern compilers struggle to vectorize applications on architectures with vector extensions, programmers resort to programming vector registers by hand with intrinsics to achieve better performance. However, the continued growth in register width and the evolving library of intrinsics make such manual optimization tedious and error-prone. We therefore propose a framework for the Automatic SIMDization of Parallel Sorting (ASPaS) on x86-based multicore and manycore processors. ASPaS takes any sorting network and a given instruction set architecture (ISA) as inputs and automatically generates vectorized code for that sorting network. By formalizing the sort function as a sequence of comparators and the transpose and merge functions as sequences of vector-matrix multiplications, ASPaS maps these functions to operations from a selected "pattern pool" that is based on the characteristics of parallel sorting, and then generates the vectorized code with the real ISA intrinsics. Our performance evaluation on the Intel Xeon Phi coprocessor shows that the sorting codes automatically generated by ASPaS can outperform the sorting implementations from STL, Boost, and Intel TBB.
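The idea of treating the sort function as a sequence of comparators can be illustrated with a minimal, portable C++ sketch. It is not the code ASPaS generates and it uses no real intrinsics. The names `Vec`, `kLanes`, `network4`, `compare_exchange`, `sort_columns`, and `transpose` are hypothetical, the vector width of four lanes is an assumption, and `std::array` stands in for a vector register. Each comparator becomes one lane-wise min and one lane-wise max, so applying the network to four "registers" sorts every column at once. A plain transpose (written here as loops rather than as the vector-matrix formulation the abstract describes) then turns sorted columns into sorted rows.

```cpp
#include <algorithm>
#include <array>
#include <cstddef>
#include <utility>
#include <vector>

// All names below are hypothetical and chosen for illustration only.
constexpr std::size_t kLanes = 4;        // assumed vector width (elements per register)
using Vec = std::array<int, kLanes>;     // stand-in for one SIMD register
using Comparator = std::pair<std::size_t, std::size_t>;

// A standard 5-comparator sorting network for 4 inputs,
// written as (index receiving the min, index receiving the max).
inline std::vector<Comparator> network4() {
    return {{0, 1}, {2, 3}, {0, 2}, {1, 3}, {1, 2}};
}

// One comparator applied to two registers: lane-wise min into a, max into b.
// On a real ISA this would be a pair of vector min/max instructions.
inline void compare_exchange(Vec& a, Vec& b) {
    for (std::size_t i = 0; i < kLanes; ++i) {
        const int lo = std::min(a[i], b[i]);
        const int hi = std::max(a[i], b[i]);
        a[i] = lo;
        b[i] = hi;
    }
}

// Apply the whole network; afterwards every column is sorted top to bottom.
inline void sort_columns(std::array<Vec, 4>& regs,
                         const std::vector<Comparator>& net) {
    for (const Comparator& c : net) {
        compare_exchange(regs[c.first], regs[c.second]);
    }
}

// Transpose the 4x4 block so that sorted columns become sorted rows.
inline std::array<Vec, 4> transpose(const std::array<Vec, 4>& m) {
    std::array<Vec, 4> t{};
    for (std::size_t r = 0; r < 4; ++r) {
        for (std::size_t c = 0; c < kLanes; ++c) {
            t[c][r] = m[r][c];
        }
    }
    return t;
}
```

Calling `sort_columns(regs, network4())` followed by `transpose(regs)` yields four sorted runs of four elements each. A merge stage, not shown here, would then combine those runs into a single sorted sequence.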
