Computational Strategies for Scalable Genomics Analysis

The revolution in next-generation DNA sequencing technologies is driving explosive growth in genomics data, which poses a significant challenge to the computing infrastructure and software algorithms used for genomics analysis. Various big data technologies have been explored to scale current bioinformatics solutions up or out so that they can mine these large genomics datasets. In this review, we survey some of these developments in the application of parallel and distributed computing and specialized hardware to genomics. We comment on the pros and cons of each strategy in terms of ease of development, robustness, scalability, and efficiency. Although this review is written for readers in genomics and bioinformatics, it may also be informative for computer scientists interested in genomics applications.
