Modular Design of High-Throughput, Low-Latency Sorting Units

High-throughput, low-latency sorting is a key requirement in many applications that process large amounts of data. This paper presents efficient techniques for designing high-throughput, low-latency sorting units. Our sorting architectures use modular design techniques that hierarchically construct large sorting units from smaller building blocks. The sorting units are optimized for situations in which only the M largest numbers from N inputs are needed, a situation that commonly occurs in scientific computing, data mining, network processing, digital signal processing, and high-energy physics. We use the proposed techniques to design parameterized, pipelined, and modular sorting units. A detailed analysis of these sorting units indicates that, as the number of inputs increases, their resource requirements scale linearly, their latencies scale logarithmically, and their frequencies remain almost constant. When synthesized to a 65-nm TSMC technology, a pipelined 256-to-4 sorting unit with 19 stages can perform more than 2.7 billion sorts per second with a latency of about 7 ns per sort. We also propose iterative sorting techniques, in which a small sorting unit is used several times to find the largest values.
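
The hierarchical idea described above can be illustrated in software. The following is a minimal sketch, not the hardware architecture from the paper: it assumes a small sorting block of four inputs and a binary tree of merge steps that each keep only the M largest values. The names `sort_block`, `merge_top_m`, `top_m`, and the `block_size` parameter are hypothetical and chosen only for illustration.

```python
from typing import List


def sort_block(values: List[int]) -> List[int]:
    """Small building block: return the inputs in descending order."""
    return sorted(values, reverse=True)


def merge_top_m(a: List[int], b: List[int], m: int) -> List[int]:
    """Merge two descending lists, keeping only the m largest values."""
    out: List[int] = []
    i = j = 0
    while len(out) < m and (i < len(a) or j < len(b)):
        # Take from a when b is exhausted or a's head is at least as large.
        if j >= len(b) or (i < len(a) and a[i] >= b[j]):
            out.append(a[i])
            i += 1
        else:
            out.append(b[j])
            j += 1
    return out


def top_m(values: List[int], m: int, block_size: int = 4) -> List[int]:
    """Find the m largest of the inputs by combining small sorted blocks.

    Assumption: block_size = 4 and pairwise merging are illustrative
    choices, not parameters taken from the paper.
    """
    if not values:
        return []
    # Level 0: sort each small block and discard all but its m largest.
    level = [
        sort_block(values[k:k + block_size])[:m]
        for k in range(0, len(values), block_size)
    ]
    # Higher levels: merge neighbours pairwise until one list remains.
    while len(level) > 1:
        merged = []
        for k in range(0, len(level), 2):
            if k + 1 < len(level):
                merged.append(merge_top_m(level[k], level[k + 1], m))
            else:
                merged.append(level[k])  # odd list out passes through
        level = merged
    return level[0]
```

Because every merge step discards all but M values, the data carried between levels stays bounded by M per list, and the number of merge levels grows with the logarithm of the number of blocks. As a rough consistency check on the reported figures, and assuming one sort is accepted per clock cycle, 19 pipeline stages at 2.7 GHz give 19 / (2.7 × 10^9 Hz) ≈ 7.0 ns, in line with the stated latency of about 7 ns.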
