GPU accelerate parallel Odd-Even merge sort: An OpenCL method

Sorting with odd-even merge sort is a basic problem in the area of computer supported cooperative work in design. However, it is inefficient on the CPU because of its high complexity, O(n lg² n). In this paper, we present a novel implementation based on the OpenCL programming model on recent GPUs (graphics processing units). Our implementation is based on Knuth's algorithm, with some modifications. Because of limitations in OpenCL, we use a flag variable to avoid direct backward control flow. As a result, our implementation achieves an 18× speedup compared with the CPU C++ STL quick sort. It also achieves almost linear speedup on next-generation GPUs, because each iteration is completely parallel. This performance makes odd-even merge sort effective in practice. Furthermore, the approach used in this paper to coordinate thousands of processing units working in parallel can also be applied in other cooperation areas.
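The structure described above can be sketched on the CPU as follows. This is a minimal illustration of Knuth's merge exchange sort (Algorithm M in The Art of Computer Programming, Volume 3), not the paper's OpenCL kernel. The function names and the `more` flag are hypothetical: the abstract does not say how its flag variable is used, so replacing Knuth's backward "go to M3" with a flag-controlled loop is an assumption about one possible reading. Each call to `compare_exchange_pass` touches disjoint pairs of keys, which is why a pass could in principle map to one work-item per index on a GPU.

```python
def merge_exchange_passes(n):
    """Yield the (p, r, d) parameters of each compare-exchange pass
    of Knuth's Algorithm M for n keys."""
    if n < 2:
        return
    t = (n - 1).bit_length()          # t = ceil(lg n) for n >= 2
    p = 1 << (t - 1)
    while p > 0:
        q, r, d = 1 << (t - 1), 0, p
        more = True                   # assumed flag: stands in for "go to M3"
        while more:
            yield p, r, d
            if q != p:
                d, q, r = q - p, q >> 1, p
            else:
                more = False          # leave the inner loop without a backward jump
        p >>= 1


def compare_exchange_pass(keys, p, r, d):
    """One pass. The pairs (i, i + d) are disjoint, so every index i
    is independent of the others within the pass."""
    for i in range(len(keys) - d):
        if (i & p) == r and keys[i] > keys[i + d]:
            keys[i], keys[i + d] = keys[i + d], keys[i]


def odd_even_merge_sort(keys):
    """Sort keys in place by running every pass in order; returns keys."""
    for p, r, d in merge_exchange_passes(len(keys)):
        compare_exchange_pass(keys, p, r, d)
    return keys
```

For n = 8 keys the generator yields t(t + 1)/2 = 6 passes with t = 3. Each pass performs at most n comparisons, which is consistent with the O(n lg² n) total work quoted in the abstract, while the number of sequential steps grows only as lg² n when a pass runs fully in parallel.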
