A Study of Parallel Sorting Algorithms Using CUDA and OpenMP

[1]  K. Srinathan,et al.  A performance prediction model for the CUDA GPGPU platform , 2009, 2009 International Conference on High Performance Computing (HiPC).

[2]  Sweta Kumari,et al.  A parallel selection sorting algorithm on GPUs using binary search , 2014, 2014 International Conference on Advances in Engineering & Technology Research (ICAETR - 2014).

[3]  Jennifer Widom,et al.  PARALLEL AND DISTRIBUTED SYSTEMS , 2010 .

[4]  Kai Petersen,et al.  Systematic Mapping Studies in Software Engineering , 2008, EASE.

[5]  Ahmet Uyar.  Parallel merge sort with double merging , 2014, 2014 IEEE 8th International Conference on Application of Information and Communication Technologies (AICT).

[6]  Tarek A. El-Ghazawi,et al.  Exploiting concurrent kernel execution on graphic processing units , 2011, 2011 International Conference on High Performance Computing & Simulation.

[7]  Gudula Rünger,et al.  A Partitioning Algorithm for Parallel Sorting on Distributed Memory Systems , 2011, 2011 IEEE International Conference on High Performance Computing and Communications.

[8]  Prabhakar Misra,et al.  Performance Evaluation of Concurrent Lock-free Data Structures on GPUs , 2012, 2012 IEEE 18th International Conference on Parallel and Distributed Systems.

[9]  Yan Yang,et al.  Quick-merge sort algorithm based on Multi-core linux , 2013, Proceedings 2013 International Conference on Mechatronic Sciences, Electric Engineering and Computer (MEC).

[10]  Dhabaleswar K. Panda,et al.  Optimizing MPI Communication on Multi-GPU Systems Using CUDA Inter-Process Communication , 2012, 2012 IEEE 26th International Parallel and Distributed Processing Symposium Workshops & PhD Forum.

[11]  Ulf Assarsson,et al.  Fast parallel GPU-sorting using a hybrid algorithm , 2008, J. Parallel Distributed Comput..

[12]  E. Wes Bethel,et al.  Sort-first, distributed memory parallel visualization and rendering , 2003, IEEE Symposium on Parallel and Large-Data Visualization and Graphics, 2003. PVG 2003.

[13]  Laurie A. Smith King,et al.  Kernel Specialization for Improved Adaptability and Performance on Graphics Processing Units (GPUs) , 2013, 2013 IEEE 27th International Symposium on Parallel and Distributed Processing.

[14]  Andrew A. Davidson,et al.  Efficient parallel merge sort for fixed and variable length keys , 2012, 2012 Innovative Parallel Computing (InPar).

[15]  Laxmikant V. Kalé,et al.  Highly scalable parallel sorting , 2010, 2010 IEEE International Symposium on Parallel & Distributed Processing (IPDPS).

[16]  Yitzhak Birk,et al.  Merge Path - Parallel Merging Made Simple , 2012, 2012 IEEE 26th International Parallel and Distributed Processing Symposium Workshops & PhD Forum.

[17]  Michel Barlaud,et al.  Fast k nearest neighbor search using GPU , 2008, 2008 IEEE Computer Society Conference on Computer Vision and Pattern Recognition Workshops.

[18]  Kenneth E. Batcher,et al.  Sorting networks and their applications , 1968, AFIPS Spring Joint Computing Conference.

[19]  Lubos Brim,et al.  Computing Strongly Connected Components in Parallel on CUDA , 2011, 2011 IEEE International Parallel & Distributed Processing Symposium.

[20]  Yi Yang,et al.  A GPGPU compiler for memory optimization and parallelism management , 2010, PLDI '10.

[21]  Manuel Ujaldon.  High performance computing and simulations on the GPU using CUDA , 2012, 2012 International Conference on High Performance Computing & Simulation (HPCS).

[22]  Andrew Sohn,et al.  Load balanced parallel radix sort , 1998, ICS '98.

[23]  Jie Cheng,et al.  CUDA by Example: An Introduction to General-Purpose GPU Programming , 2010, Scalable Comput. Pract. Exp..

[24]  Honesty C. Young,et al.  A Low Communication Sort Algorithm for a Parallel Database Machine , 1989, VLDB.

[25]  Liu Shenghui,et al.  Internal sorting algorithm for large-scale data based on GPU-assisted , 2013, Proceedings of 2013 2nd International Conference on Measurement, Information and Control.

[26]  D. Panda,et al.  Extending OpenSHMEM for GPU Computing , 2013, 2013 IEEE 27th International Symposium on Parallel and Distributed Processing.

[27]  Jon Louis Bentley,et al.  Engineering a sort function , 1993, Softw. Pract. Exp..

[28]  Henry Wong,et al.  Analyzing CUDA workloads using a detailed GPU simulator , 2009, 2009 IEEE International Symposium on Performance Analysis of Systems and Software.

[29]  Chen Ding,et al.  Adaptive data partition for sorting using probability distribution , 2004, International Conference on Parallel Processing, 2004. ICPP 2004.

[30]  Dongseung Kim,et al.  Parallel Merge Sort with Load Balancing , 2004, International Journal of Parallel Programming.

[31]  C. Leopold,et al.  A User's Experience with Parallel Sorting and OpenMP , 2004 .

[32]  Michael Garland,et al.  Designing efficient sorting algorithms for manycore GPUs , 2009, 2009 IEEE International Symposium on Parallel & Distributed Processing.

[33]  Hermann Lederer,et al.  Parallel Computing: From Multicores and GPU's to Petascale , 2010 .

[34]  Xiaoming Li,et al.  An Empirically Optimized Radix Sort for GPU , 2009, 2009 IEEE International Symposium on Parallel and Distributed Processing with Applications.

[35]  Vitaly Osipov,et al.  GPU sample sort , 2010, 2010 IEEE International Symposium on Parallel & Distributed Processing (IPDPS).

[36]  Steve Mann,et al.  Mediated reality using computer graphics hardware for computer vision , 2002, Proceedings. Sixth International Symposium on Wearable Computers.

[37]  Malcolm Atkinson,et al.  High Performance Computing, Networking, Storage and Analysis (SCC), 2012 SC Companion: , 2012 .

[38]  Zongmin Ma,et al.  Count Sort for GPU Computing , 2009, 2009 15th International Conference on Parallel and Distributed Systems.

[39]  Jie Zhang,et al.  The performance analysis and research of sorting algorithm based on OpenMP , 2011, 2011 International Conference on Multimedia Technology.

[40]  Ci Linlin,et al.  Two parallel strategies of split-merge algorithm for image segmentation , 2007, 2007 International Conference on Wavelet Analysis and Pattern Recognition.

[41]  Takakazu Kurokawa,et al.  High-Performance Symmetric Block Ciphers on CUDA , 2011, 2011 Second International Conference on Networking and Computing.

[42]  Masaki Matsumoto,et al.  Automatic Optimization of Thread Mapping for a GPGPU Programming Framework , 2014, 2014 Second International Symposium on Computing and Networking.

[43]  Henry Fuchs,et al.  A sorting classification of parallel rendering , 1994, IEEE Computer Graphics and Applications.

[44]  Fumihiko Ino,et al.  An improved binary-swap compositing for sort-last parallel rendering on distributed memory multiprocessors , 2003, Parallel Comput..

[45]  Sam White,et al.  A CUDA-MPI Hybrid Bitonic Sorting Algorithm for GPU Clusters , 2012, 2012 41st International Conference on Parallel Processing Workshops.

[46]  Michael Allen,et al.  Parallel programming: techniques and applications using networked workstations and parallel computers , 1998 .

[47]  Richard Cole,et al.  Parallel merge sort , 1988, 27th Annual Symposium on Foundations of Computer Science (sfcs 1986).

[48]  Norbert Luttenberger,et al.  A Novel Sorting Algorithm for Many-core Architectures Based on Adaptive Bitonic Sort , 2012, 2012 IEEE 26th International Parallel and Distributed Processing Symposium.

[49]  Keqin Li,et al.  Parallel Algorithms for Approximate String Matching with k Mismatches on CUDA , 2012, 2012 IEEE 26th International Parallel and Distributed Processing Symposium Workshops & PhD Forum.

[50]  Timothy J. Rolfe.  A specimen of parallel programming: parallel merge sort implementation , 2010, INROADS.

[51]  Daniel Weiskopf,et al.  Sort-First Parallel Volume Rendering , 2011, IEEE Transactions on Visualization and Computer Graphics.

[52]  Yohan Jin,et al.  2013 IEEE Conference on Computer Vision and Pattern Recognition Workshops , 2022 .

[53]  Fabrizio Silvestri,et al.  Sorting using BItonic netwoRk wIth CUDA , 2009, LSDS-IR@SIGIR.

[54]  Seong Jin Cho,et al.  Parallel quick sort algorithms analysis using OpenMP 3.0 in embedded system , 2011, 2011 11th International Conference on Control, Automation and Systems.

[55]  Federico Silla,et al.  Performance of CUDA Virtualized Remote GPUs in High Performance Clusters , 2011, 2011 International Conference on Parallel Processing.

[56]  Robert Sedgewick.  Quicksort with Equal Keys , 1977, SIAM J. Comput..

[57]  Jianbin Fang,et al.  A Comprehensive Performance Comparison of CUDA and OpenCL , 2011, 2011 International Conference on Parallel Processing.

[58]  Cláudio T. Silva,et al.  Out-of-core sort-first parallel rendering for cluster-based tiled displays , 2002, Parallel Comput..

[59]  Toshio Nakatani,et al.  AA-Sort: A New Parallel Sorting Algorithm for Multi-Core SIMD Processors , 2007, 16th International Conference on Parallel Architecture and Compilation Techniques (PACT 2007).

[60]  Robert M. Farber,et al.  CUDA Application Design and Development , 2011 .

[61]  Wang Xiang,et al.  Analysis of the Time Complexity of Quick Sort Algorithm , 2011, 2011 International Conference on Information Management, Innovation Management and Industrial Engineering.

[62]  Ronald Duarte,et al.  On the performance and energy-efficiency of multi-core SIMD CPUs and CUDA-enabled GPUs , 2013, 2013 IEEE International Symposium on Workload Characterization (IISWC).

[63]  Baifeng Wu,et al.  A Novel Parallel Approach of Radix Sort with Bucket Partition Preprocess , 2012, 2012 IEEE 14th International Conference on High Performance Computing and Communication & 2012 IEEE 9th International Conference on Embedded Software and Systems.

[64]  Yun Liang,et al.  Register and thread structure optimization for GPUs , 2013, 2013 18th Asia and South Pacific Design Automation Conference (ASP-DAC).

[65]  Raphael Landaverde,et al.  An investigation of Unified Memory Access performance in CUDA , 2014, 2014 IEEE High Performance Extreme Computing Conference (HPEC).

[66]  Max Grossman,et al.  Professional CUDA C Programming , 2014 .

[67]  Minyi Guo.  Editorial: Parallel and Distributed Processing with Applications , 2004, The Journal of Supercomputing.

[68]  Yue Zhao,et al.  High-Performance and Real-Time Volume Rendering in CUDA , 2009, 2009 2nd International Conference on Biomedical Engineering and Informatics.

[69]  Paolo Prinetto,et al.  A software-based self test of CUDA Fermi GPUs , 2013, 2013 18th IEEE European Test Symposium (ETS).

[70]  Paolo Prinetto,et al.  Fault mitigation strategies for CUDA GPUs , 2013, 2013 IEEE International Test Conference (ITC).

[71]  Soo Saw Meng,et al.  Sorting very large text data in multi GPUs , 2012, 2012 IEEE International Conference on Control System, Computing and Engineering.

[72]  Jie Cheng,et al.  Programming Massively Parallel Processors. A Hands-on Approach , 2010, Scalable Comput. Pract. Exp..

[73]  Erika Hernández Rubio,et al.  FLAP: Tool to generate CUDA code from sequential C code , 2014, 2014 International Conference on Electronics, Communications and Computers (CONIELECOMP).

[74]  Norbert Luttenberger,et al.  Parallel external sorting for CUDA-enabled GPUs with load balancing and low transfer overhead , 2010, 2010 IEEE International Symposium on Parallel & Distributed Processing, Workshops and Phd Forum (IPDPSW).

[75]  Ali Yazici,et al.  Implementation of Sorting Algorithms with CUDA: An Empirical Study , 2016 .

[76]  Kenneth Moreland,et al.  Sort-last parallel rendering for viewing extremely large data sets on tile displays , 2001, Proceedings IEEE 2001 Symposium on Parallel and Large-Data Visualization and Graphics (Cat. No.01EX520).

[77]  Andrew S. Grimshaw,et al.  Revisiting sorting for GPGPU stream architectures , 2010, 2010 19th International Conference on Parallel Architectures and Compilation Techniques (PACT).