Architectural optimizations for high performance and energy efficient Smith-Waterman implementation on FPGAs using OpenCL

Smith-Waterman is a dynamic programming algorithm that plays a key role in the modern genomics pipeline, as it is guaranteed to find the optimal local alignment between two strings of data. The state of the art includes many hardware acceleration solutions that exploit the high degree of parallelism available in this algorithm. The majority of these implementations use heuristics to increase the performance of the system at the expense of the accuracy of the result. In this work, we present an implementation of the pure version of the algorithm. We include the key architectural optimizations needed to achieve the highest possible performance for a given platform, and we leverage the Berkeley roofline model to track the performance and guide the optimizations. To achieve scalability, our custom design comprises systolic arrays, data compression features and shift registers, while a custom port mapping strategy aims to maximize performance. Our designs are built using an OpenCL-based design entry, namely Xilinx SDAccel, in conjunction with Xilinx Virtex 7 and Kintex Ultrascale platforms. Our final design achieves a performance of 42.47 GCUPS (giga cell updates per second) with an energy efficiency of 1.6988 GCUPS/W. This represents an improvement of 1.72x in performance and energy efficiency over previously published FPGA implementations, and an 8.49x improvement in energy efficiency over comparable GPU implementations.
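
For readers unfamiliar with the recurrence behind the GCUPS metric, the following is a minimal software sketch of the Smith-Waterman score computation, not the FPGA design described above. The scoring values (`match=2`, `mismatch=-1`, `gap=-1`), the linear gap penalty, and the function names `smith_waterman_score` and `gcups` are assumptions chosen for illustration; the abstract does not state the scoring scheme used. Each inner-loop iteration is one "cell update", and cells on the same anti-diagonal of the matrix do not depend on each other, which is the parallelism a systolic array can exploit.

```python
def smith_waterman_score(query, target, match=2, mismatch=-1, gap=-1):
    """Return the best local alignment score of two strings.

    Illustrative scoring only (linear gap penalty); these parameter
    values are assumptions, not taken from the paper.
    """
    prev = [0] * (len(target) + 1)  # previous row of the score matrix
    best = 0
    for i in range(1, len(query) + 1):
        curr = [0] * (len(target) + 1)
        for j in range(1, len(target) + 1):
            sub = match if query[i - 1] == target[j - 1] else mismatch
            # One cell update: local alignment scores never drop below 0.
            curr[j] = max(
                0,
                prev[j - 1] + sub,  # diagonal: match or mismatch
                prev[j] + gap,      # up: gap in target
                curr[j - 1] + gap,  # left: gap in query
            )
            best = max(best, curr[j])
        prev = curr
    return best


def gcups(query_len, target_len, seconds):
    """Giga cell updates per second for one full score matrix."""
    return query_len * target_len / seconds / 1e9


print(smith_waterman_score("TACGT", "ACGTT"))  # shared substring "ACGT"
print(gcups(2000, 1_000_000, 1.0))
```

As a consistency check on the reported figures, 42.47 GCUPS at 1.6988 GCUPS/W corresponds to a power draw of about 25 W.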
