Darwin

Genomics is transforming medicine and our understanding of life in fundamental ways. Genomics data, however, is far outpacing Moore»s Law. Third-generation sequencing technologies produce 100X longer reads than second generation technologies and reveal a much broader mutation spectrum of disease and evolution. However, these technologies incur prohibitively high computational costs. Over 1,300 CPU hours are required for reference-guided assembly of the human genome, and over 15,600 CPU hours are required for de novo assembly. This paper describes "Darwin" --- a co-processor for genomic sequence alignment that, without sacrificing sensitivity, provides up to $15,000X speedup over the state-of-the-art software for reference-guided assembly of third-generation reads. Darwin achieves this speedup through hardware/algorithm co-design, trading more easily accelerated alignment for less memory-intensive filtering, and by optimizing the memory system for filtering. Darwin combines a hardware-accelerated version of D-SOFT, a novel filtering algorithm, alignment at high speed, and with a hardware-accelerated version of GACT, a novel alignment algorithm. GACT generates near-optimal alignments of arbitrarily long genomic sequences using constant memory for the compute-intensive step. Darwin is adaptable, with tunable speed and sensitivity to match emerging sequencing technologies and to meet the requirements of genomic applications beyond read assembly.

[1]  Guojie Zhang,et al.  Genomics: Bird sequencing project takes off , 2015, Nature.

[2]  S. O’Brien,et al.  The Genome 10K Project: a way forward. , 2015, Annual review of animal biosciences.

[3]  Eugene W. Myers,et al.  Efficient Local Alignment Discovery amongst Noisy Long Reads , 2014, WABI.

[4]  Manuel Serrano,et al.  The Hallmarks of Aging , 2013, Cell.

[5]  James C. Hoe,et al.  CONNECT: re-examining conventional wisdom for designing nocs in the context of FPGAs , 2012, FPGA '12.

[6]  Mechthild Prinz,et al.  Prediction of eye and skin color in diverse populations using seven SNPs. , 2011, Forensic science international. Genetics.

[7]  Cory Y. McLean,et al.  Human-specific loss of regulatory DNA and the evolution of human-specific traits , 2011, Nature.

[8]  Richard Durbin,et al.  Fast and accurate long-read alignment with Burrows–Wheeler transform , 2010, Bioinform..

[9]  Tom Royce,et al.  A comprehensive catalogue of somatic mutations from a human cancer genome , 2010, Nature.

[10]  S. Turner,et al.  Real-Time DNA Sequencing from Single Polymerase Molecules , 2009, Science.

[11]  Siu-Ming Yiu,et al.  Compressed indexing and local alignment of DNA , 2008, Bioinform..

[12]  Yvan Saeys,et al.  Scalable hardware accelerator for comparing DNA and protein sequences , 2006, InfoScale '06.

[13]  Daniel G. Brown,et al.  Vector seeds: An extension to spaced seeds , 2005, J. Comput. Syst. Sci..

[14]  Michael Roberts,et al.  Reducing storage requirements for biological sequence comparison , 2004, Bioinform..

[15]  D. Haussler,et al.  Ultraconserved Elements in the Human Genome , 2004, Science.

[16]  Akihiko Konagaya,et al.  High Speed Homology Search with FPGAs , 2001, Pacific Symposium on Biocomputing.

[17]  Jeremy Buhler,et al.  Efficient large-scale sequence comparison by locality-sensitive hashing , 2001, Bioinform..

[18]  Gonzalo Navarro,et al.  A guided tour to approximate string matching , 2001, CSUR.

[19]  J. V. Moran,et al.  Initial sequencing and analysis of the human genome. , 2001, Nature.

[20]  Eugene W. Myers,et al.  A fast bit-vector algorithm for approximate string matching based on dynamic programming , 1998, JACM.

[21]  Dan Gusfield,et al.  Algorithms on Strings, Trees, and Sequences - Computer Science and Computational Biology , 1997 .

[22]  J. Thompson,et al.  CLUSTAL W: improving the sensitivity of progressive multiple sequence alignment through sequence weighting, position-specific gap penalties and weight matrix choice. , 1994, Nucleic acids research.

[23]  Kun-Mao Chao,et al.  Aligning two sequences within a specified diagonal band , 1992, Comput. Appl. Biosci..

[24]  Daniel S. Hirschberg,et al.  A linear space algorithm for computing maximal common subsequences , 1975, Commun. ACM.

[25]  Kiyoshi Asai,et al.  PBSIM: PacBio reads simulator - toward accurate genome assembly , 2013, Bioinform..

[26]  M. Feldman,et al.  Worldwide Human Relationships Inferred from Genome-Wide Patterns of Variation , 2008 .

[27]  Gustavo Glusman,et al.  Initial sequence of the chimpanzee genome and comparison with the human genome , 2005 .