Heterogeneous Cloud Framework for Big Data Genome Sequencing

Next-generation genome sequencing with short and long reads is an emerging field in numerous scientific and big data research domains. However, data sizes and researchers' ease of access to them keep growing, while most current methods rely on a single acceleration approach and therefore cannot meet the requirements imposed by explosive data scales and complexities. In this paper, we propose a novel FPGA-based acceleration solution that runs a MapReduce framework across multiple hardware accelerators. Combining hardware acceleration with the MapReduce execution flow can greatly accelerate the task of aligning short reads to a known reference genome. To evaluate performance and other metrics, we conducted a theoretical speedup analysis on a MapReduce programming platform, which shows that the proposed architecture has the potential to deliver substantial speedup for large-scale genome sequencing applications. As a practical study, we also built a hardware prototype on a real Xilinx FPGA chip and evaluated speedup, sensitivity, mapping quality, error rate, and hardware cost. Experimental results demonstrate that the proposed platform can efficiently accelerate next-generation sequencing with satisfactory accuracy and acceptable hardware cost.
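As a rough illustration of the MapReduce execution flow described above, the following Python sketch splits reads across workers, has each worker emit (read, position) pairs, and merges them in a reduce step. It is only a sketch under stated assumptions: the function names (`map_reads`, `reduce_hits`, `align`) are hypothetical, the workers are plain software loops standing in for hardware accelerators, and exact string matching replaces whatever alignment algorithm the accelerators actually implement.

```python
from collections import defaultdict


def map_reads(reference, reads):
    """Map step: emit a (read_id, position) pair for every exact occurrence.

    Assumption: exact matching stands in for the real alignment kernel.
    """
    pairs = []
    for read_id, read in reads:
        start = reference.find(read)
        while start != -1:
            pairs.append((read_id, start))
            # Continue searching one base further to catch overlapping hits.
            start = reference.find(read, start + 1)
    return pairs


def reduce_hits(pairs):
    """Reduce step: group the emitted positions by read identifier."""
    hits = defaultdict(list)
    for read_id, position in pairs:
        hits[read_id].append(position)
    return {read_id: sorted(positions) for read_id, positions in hits.items()}


def align(reference, reads, num_workers=2):
    """Split reads round-robin over workers, map each chunk, then reduce.

    Assumption: each chunk would be handled by one accelerator; here the
    chunks are processed sequentially in software.
    """
    chunks = [reads[i::num_workers] for i in range(num_workers)]
    pairs = []
    for chunk in chunks:
        pairs.extend(map_reads(reference, chunk))
    return reduce_hits(pairs)


if __name__ == "__main__":
    reference = "ACGTACGTGA"
    reads = [("r0", "ACGT"), ("r1", "GTGA"), ("r2", "TTTT")]
    print(align(reference, reads))
```

Reads with no occurrence in the reference (such as `r2` in the usage example) simply do not appear in the result. Because the map step is independent per read, the chunks could in principle be dispatched to separate accelerators without changing the reduce step.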
