Darwin: A Hardware-acceleration Framework for Genomic Sequence Alignment

Genomics is set to transform medicine and our understanding of life in fundamental ways. But the growth in genomics data has been overwhelming, far outpacing Moore’s Law. Third-generation sequencing technologies are providing new insights into the genomic contribution to diseases with complex mutation events, but they have prohibitively high computational costs. Over 1,300 CPU hours are required to align reads from a 54× coverage of the human genome to a reference (estimated using [1]), and over 15,600 CPU hours to assemble the reads de novo [2]. This paper proposes “Darwin”, a hardware-accelerated framework for genomic sequence alignment that, without sacrificing sensitivity, provides 125× and 15.6× speedups over state-of-the-art software counterparts for reference-guided and de novo assembly of third-generation sequencing reads, respectively. For pairwise alignment of sequences, Darwin is over 39,000× more energy-efficient than software. Darwin uses (i) a novel filtration strategy, called D-SOFT, to reduce the search space for sequence alignment at high speed, and (ii) a hardware-accelerated version of GACT, a novel algorithm that generates near-optimal alignments of arbitrarily long genomic sequences using constant memory for trace-back. Darwin is adaptable, with tunable speed and sensitivity to match emerging sequencing technologies and to meet the requirements of genomic applications beyond read assembly.
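The abstract does not spell out how D-SOFT narrows the search space, so the following is only a minimal software sketch of one plausible seed-based diagonal filter, not the authors' implementation. It assumes that exact k-mer seed hits are grouped into diagonal bins and that a bin becomes a candidate once seeds in it cover enough distinct query bases. The function names and the parameters `k`, `bin_size`, and `threshold` are hypothetical and chosen for illustration.

```python
from collections import defaultdict


def build_seed_index(reference, k):
    """Map every k-mer in the reference to the positions where it starts."""
    index = defaultdict(list)
    for i in range(len(reference) - k + 1):
        index[reference[i:i + k]].append(i)
    return index


def dsoft_like_filter(reference, query, k=4, bin_size=8, threshold=8):
    """Return diagonal bins whose seed hits cover >= threshold query bases.

    Assumption (not from the abstract): a hit at reference position i and
    query position j falls in bin (i - j) // bin_size, and each query base
    is counted at most once per bin.
    """
    index = build_seed_index(reference, k)
    covered = defaultdict(int)   # bin -> number of distinct query bases covered
    last_end = defaultdict(int)  # bin -> query position just past the last covered base
    for j in range(len(query) - k + 1):
        for i in index.get(query[j:j + k], ()):
            b = (i - j) // bin_size
            start = max(j, last_end[b])  # skip bases already counted in this bin
            end = j + k
            if end > start:
                covered[b] += end - start
                last_end[b] = end
    return sorted(b for b, count in covered.items() if count >= threshold)


if __name__ == "__main__":
    ref = "ACGTTGCAAGGCTTAACCGGTTACGATC"
    read = ref[10:22]  # a 12-base read taken from reference offset 10
    print(dsoft_like_filter(ref, read))
```

Only the bins returned by such a filter would need to be passed on to the more expensive alignment stage. Raising `threshold` trades sensitivity for speed, which is the kind of tunability the abstract refers to.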

[1]  Glenn Tesler,et al.  Mapping single molecule sequencing reads using basic local alignment with successive refinement (BLASR): application and theory , 2012, BMC Bioinformatics.

[2]  Daniel G. Brown,et al.  Vector seeds: An extension to spaced seeds , 2005, J. Comput. Syst. Sci..

[3]  Knut Reinert,et al.  SeqAn An efficient, generic C++ library for sequence analysis , 2008, BMC Bioinformatics.

[4]  Michael Roberts,et al.  Reducing storage requirements for biological sequence comparison , 2004, Bioinform..

[5]  Satnam Singh,et al.  Synthesis of a Parallel Smith-Waterman Sequence Alignment Kernel into FPGA Hardware , 2009 .

[6]  Nan Li,et al.  Comparison of the two major classes of assembly algorithms: overlap-layout-consensus and de-bruijn-graph. , 2012, Briefings in functional genomics.

[7]  Yvan Saeys,et al.  Scalable hardware accelerator for comparing DNA and protein sequences , 2006, InfoScale '06.

[8]  Margaret Martonosi,et al.  Graphicionado: A high-performance and energy-efficient accelerator for graph analytics , 2016, 2016 49th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO).

[9]  Akihiko Konagaya,et al.  High Speed Homology Search with FPGAs , 2001, Pacific Symposium on Biocomputing.

[10]  Siu-Ming Yiu,et al.  SOAP2: an improved ultrafast tool for short read alignment , 2009, Bioinform..

[11]  Mauricio O. Carneiro,et al.  From FastQ Data to High‐Confidence Variant Calls: The Genome Analysis Toolkit Best Practices Pipeline , 2013, Current protocols in bioinformatics.

[12]  Tom Royce,et al.  A comprehensive catalogue of somatic mutations from a human cancer genome , 2010, Nature.

[13]  Walid A. Najjar,et al.  Compiler generated systolic arrays for wavefront algorithm acceleration on FPGAs , 2008, 2008 International Conference on Field Programmable Logic and Applications.

[14]  International Human Genome Sequencing Consortium.  Initial sequencing and analysis of the human genome , 2001, Nature.

[15]  Eugene W. Myers,et al.  Efficient Local Alignment Discovery amongst Noisy Long Reads , 2014, WABI.

[16]  David A. Eccles,et al.  MinION Analysis and Reference Consortium: Phase 1 data release and analysis , 2015, F1000Research.

[17]  Richard Durbin,et al.  Fast and accurate long-read alignment with Burrows–Wheeler transform , 2010, Bioinform..

[18]  Daniel S. Hirschberg,et al.  A linear space algorithm for computing maximal common subsequences , 1975, Commun. ACM.

[19]  Dan Gusfield,et al.  Algorithms on Strings, Trees, and Sequences - Computer Science and Computational Biology , 1997 .

[20]  Jason Cong,et al.  A Novel High-Throughput Acceleration Engine for Read Alignment , 2015, 2015 IEEE 23rd Annual International Symposium on Field-Programmable Custom Computing Machines.

[21]  Eugene W. Myers,et al.  A fast bit-vector algorithm for approximate string matching based on dynamic programming , 1998, JACM.

[22]  Joseph M. Lancaster,et al.  A Banded Smith-Waterman FPGA Accelerator for Mercury BLASTP , 2007, 2007 International Conference on Field Programmable Logic and Applications.

[23]  Kali T. Witherspoon,et al.  Excess of rare, inherited truncating mutations in autism , 2015, Nature Genetics.

[24]  S. Koren,et al.  Canu: scalable and accurate long-read assembly via adaptive k-mer weighting and repeat separation , 2016, bioRxiv.

[25]  Siu-Ming Yiu,et al.  Compressed indexing and local alignment of DNA , 2008, Bioinform..

[26]  Mechthild Prinz,et al.  Prediction of eye and skin color in diverse populations using seven SNPs. , 2011, Forensic science international. Genetics.

[27]  Wen Tang,et al.  Accelerating Millions of Short Reads Mapping on a Heterogeneous Architecture with FPGA Accelerator , 2012, 2012 IEEE 20th International Symposium on Field-Programmable Custom Computing Machines.

[28]  R. Durbin,et al.  Mapping Quality Scores Mapping Short DNA Sequencing Reads and Calling Variants Using P , 2022 .

[29]  Norman P. Jouppi,et al.  CACTI 3.0: an integrated cache timing, power, and area model , 2001 .

[30]  Eugene W. Myers,et al.  Basic local alignment search tool , 1990, Journal of Molecular Biology.

[31]  Michael Eisenstein,et al.  Oxford Nanopore announcement sets sequencing sector abuzz , 2012, Nature Biotechnology.

[32]  Robert S. Harris,et al.  Improved pairwise alignment of genomic DNA , 2007 .

[33]  M S Waterman,et al.  Identification of common molecular subsequences. , 1981, Journal of molecular biology.

[34]  Michael C. Schatz,et al.  Oxford Nanopore Sequencing, Hybrid Error Correction, and de novo Assembly of a Eukaryotic Genome , 2015 .

[35]  W. J. Kent,et al.  BLAT--the BLAST-like alignment tool. , 2002, Genome research.

[36]  O. Gotoh.  An improved algorithm for matching biological sequences. , 1982, Journal of molecular biology.

[37]  Bin Ma,et al.  On spaced seeds for similarity search , 2004, Discret. Appl. Math..

[38]  Chao Wang,et al.  Accelerating the Next Generation Long Read Mapping with the FPGA-Based System , 2014, IEEE/ACM Transactions on Computational Biology and Bioinformatics.

[39]  David Haussler,et al.  Long-read sequence assembly of the gorilla genome , 2016, Science.

[40]  Lukas Wagner,et al.  A Greedy Algorithm for Aligning DNA Sequences , 2000, J. Comput. Biol..

[41]  E. Myers,et al.  Basic local alignment search tool. , 1990, Journal of molecular biology.

[42]  Carl Ebeling,et al.  Hardware Acceleration of Short Read Mapping , 2012, 2012 IEEE 20th International Symposium on Field-Programmable Custom Computing Machines.

[43]  Heng Li.  Aligning sequence reads, clone sequences and assembly contigs with BWA-MEM , 2013, 1303.3997.

[44]  Philip Heng Wai Leong,et al.  A Smith-Waterman Systolic Cell , 2003, FPL.

[45]  Niranjan Nagarajan,et al.  Fast and sensitive mapping of nanopore sequencing reads with GraphMap , 2016, Nature Communications.

[46]  Gregory Kucherov,et al.  YASS: enhancing the sensitivity of DNA similarity search , 2005, Nucleic Acids Res..

[47]  Xianyang Jiang,et al.  A Reconfigurable Accelerator for Smith–Waterman Algorithm , 2007, IEEE Transactions on Circuits and Systems II: Express Briefs.

[48]  M. Feldman,et al.  Worldwide Human Relationships Inferred from Genome-Wide Patterns of Variation , 2008 .

[49]  Gonzalo Navarro,et al.  A guided tour to approximate string matching , 2001, CSUR.

[50]  Koen Bertels,et al.  A parallel FPGA design of the Smith-Waterman traceback , 2010, 2010 International Conference on Field-Programmable Technology.

[51]  S. Turner,et al.  Real-time DNA sequencing from single polymerase molecules. , 2010, Methods in enzymology.

[52]  Manuel Serrano,et al.  The Hallmarks of Aging , 2013, Cell.

[53]  Jean L. Chang,et al.  Initial sequence of the chimpanzee genome and comparison with the human genome , 2005, Nature.

[54]  Kun-Mao Chao,et al.  Aligning two sequences within a specified diagonal band , 1992, Comput. Appl. Biosci..

[55]  Susan L. Graham,et al.  Gprof: A call graph execution profiler , 1982, SIGPLAN '82.

[56]  G. Ginsburg,et al.  The path to personalized medicine. , 2002, Current opinion in chemical biology.

[57]  J. Landolin,et al.  Assembling large genomes with single-molecule sequencing and locality-sensitive hashing , 2014, Nature Biotechnology.

[58]  Jeremy Buhler,et al.  Efficient large-scale sequence comparison by locality-sensitive hashing , 2001, Bioinform..

[59]  Kiyoshi Asai,et al.  PBSIM: PacBio reads simulator - toward accurate genome assembly , 2013, Bioinform..

[60]  M. Schatz,et al.  Big Data: Astronomical or Genomical? , 2015, PLoS biology.

[61]  Onur Mutlu,et al.  Optimal seed solver: optimizing seed selection in read mapping , 2015, Bioinform..

[62]  D. McCormick.  Sequence the Human Genome , 1986, Bio/Technology.

[63]  Xiaoyu Chen,et al.  Manta: rapid detection of structural variants and indels for germline and cancer sequencing applications , 2016, Bioinform..

[64]  A. Kasarskis,et al.  A window into third-generation sequencing. , 2010, Human molecular genetics.

[65]  S. Koren,et al.  One chromosome, one contig: complete microbial genomes from long-read sequencing and assembly. , 2015, Current opinion in microbiology.