Computing Platforms for Big Biological Data Analytics: Perspectives and Challenges

The last decade has witnessed an explosion in the amount of available biological sequence data, driven by the rapid progress of high-throughput sequencing projects. However, the volume of biological data has become so large that traditional data analysis platforms and methods can no longer keep pace with the analysis tasks required in the life sciences. As a result, both biologists and computer scientists face the challenge of extracting deep insight into biological function from big biological data, which in turn requires massive computational resources. High performance computing (HPC) platforms are therefore needed, along with efficient and scalable algorithms that can take advantage of them. In this paper, we survey the state-of-the-art HPC platforms for big biological data analytics. We first list the characteristics of big biological data and of popular computing platforms. We then provide a taxonomy of biological data analysis applications and survey how they have been mapped onto various computing platforms. After that, we present a case study that compares the efficiency of different computing platforms on the classical biological sequence alignment problem. Finally, we discuss open issues in big biological data analytics.
