Accelerating Genome Analysis: A Primer on an Ongoing Journey

Genome analysis fundamentally starts with a process known as read mapping, in which sequenced fragments of an organism's genome are compared against a reference genome. Read mapping is currently a major bottleneck in the genome analysis pipeline, because state-of-the-art sequencing technologies can sequence a genome much faster than existing computational techniques can analyze it. We describe the ongoing journey toward significantly improving the performance of read mapping. We explain state-of-the-art algorithmic methods and hardware-based acceleration approaches. Algorithmic approaches exploit the structure of the genome as well as the structure of the underlying hardware. Hardware-based acceleration approaches exploit specialized microarchitectures or various execution paradigms (e.g., processing inside or near memory). We conclude with the challenges of adopting these hardware-accelerated read mappers.
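
To make the read mapping step concrete, the following is a minimal sketch of one common way to structure it: index the reference by short substrings (k-mers), use exact k-mer matches from the read to propose candidate locations, then verify each candidate with an edit-distance computation. It is an illustration under stated assumptions, not the method of any particular mapper discussed here. The function names (`build_kmer_index`, `edit_distance`, `map_read`) and the parameters `k` and `max_edits` are hypothetical, and the verification compares the read against a reference window of the same length, which is a simplification that real aligners do not make.

```python
from collections import defaultdict


def build_kmer_index(reference, k):
    """Map every k-mer in the reference to the list of positions where it starts."""
    index = defaultdict(list)
    for pos in range(len(reference) - k + 1):
        index[reference[pos:pos + k]].append(pos)
    return index


def edit_distance(a, b):
    """Levenshtein distance via the classic quadratic-time dynamic program."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete a character of a
                            curr[j - 1] + 1,      # insert a character of b
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return prev[-1]


def map_read(read, reference, index, k, max_edits):
    """Return (position, edits) for the best candidate location, or None.

    Assumption: each candidate is verified against a reference window of
    exactly len(read) characters, which keeps the sketch short.
    """
    candidates = set()
    for offset in range(len(read) - k + 1):
        # Seeding: every exact k-mer hit implies a possible read start position.
        for hit in index.get(read[offset:offset + k], ()):
            start = hit - offset
            if 0 <= start <= len(reference) - len(read):
                candidates.add(start)
    best = None
    for start in sorted(candidates):
        # Verification: the expensive step that acceleration work targets.
        edits = edit_distance(read, reference[start:start + len(read)])
        if edits <= max_edits and (best is None or edits < best[1]):
            best = (start, edits)
    return best


if __name__ == "__main__":
    reference = "ACGTACGTTAGCCGATAGGCTTAACG"
    index = build_kmer_index(reference, k=4)
    # The read is reference[8:18] with one substituted base.
    print(map_read("TAGCTGATAG", reference, index, k=4, max_edits=2))
```

In this sketch the quadratic-time verification dominates the cost once many candidate locations survive seeding, which is one reason filtering candidates before verification and accelerating the dynamic program are natural targets for optimization.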
