Accelerating Sequence Alignment to Graphs

Aligning DNA sequences to an annotated reference is a key step in genotyping. Recent studies have demonstrated improved inference when reads are aligned to a variation graph, i.e., a reference sequence augmented with known genetic variations. Given a variation graph in the form of a directed acyclic string graph, the sequence-to-graph alignment problem seeks the best-matching path in the graph for an input query sequence. Solving this problem exactly with a sequential dynamic programming algorithm takes time proportional to the product of the graph size and the query length, which makes it difficult to scale to high-throughput DNA sequencing data. In this work, we propose the first parallel algorithm for computing sequence-to-graph alignments that leverages multiple cores and single-instruction multiple-data (SIMD) operations. We take advantage of the available inter-task parallelism and provide a novel blocked approach to computing the score matrix while ensuring high memory locality. On a 48-core Intel Xeon Skylake processor, the proposed algorithm achieves a peak performance of 317 billion cell updates per second (GCUPS) and demonstrates near-linear weak and strong scaling on up to 48 cores. It delivers significant performance gains over existing algorithms and reduces run time from multiple days to three hours for the problem of optimally aligning high-coverage long (PacBio/ONT) or short (Illumina) DNA reads to an MHC human variation graph containing 10 million vertices.

Availability: The implementation of our algorithm is available at https://github.com/ParBLiSS/PaSGAL. Data sets used for evaluation are available at https://alurulab.cc.gatech.edu/PaSGAL.
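The sequential dynamic programming formulation mentioned in the abstract can be illustrated with a minimal sketch. It is not the PaSGAL implementation: it is a plain single-threaded local alignment over a character-labeled DAG, under several assumptions made for illustration only. These are that each vertex carries a single character, that vertices are numbered in topological order, and that scoring uses linear gap penalties with hypothetical values (`match=1`, `mismatch=-1`, `gap=-1`). The function name `align_to_dag` and its parameters are likewise hypothetical.

```python
def align_to_dag(labels, preds, query, match=1, mismatch=-1, gap=-1):
    """Best local alignment score of `query` against any path in a DAG.

    labels[v] -- single character stored at vertex v (assumption: one char per vertex)
    preds[v]  -- list of predecessor vertices u, each with u < v
                 (assumption: vertices are already in topological order)
    """
    m = len(query)
    # score[v][j]: best score of a local alignment ending at vertex v
    # and at query position j (column 0 stays at zero).
    score = [[0] * (m + 1) for _ in labels]
    zero_row = [0] * (m + 1)
    best = 0
    for v, ch in enumerate(labels):
        # A source vertex behaves as if it had one all-zero predecessor row.
        prev_rows = [score[u] for u in preds[v]] or [zero_row]
        row = score[v]
        for j in range(1, m + 1):
            sub = match if ch == query[j - 1] else mismatch
            cell = 0  # local alignment: scores never drop below zero
            for p in prev_rows:
                # Diagonal move (match/mismatch) or skipping the graph character.
                cell = max(cell, p[j - 1] + sub, p[j] + gap)
            # Skipping the query character while staying at the same vertex.
            cell = max(cell, row[j - 1] + gap)
            row[j] = cell
            best = max(best, cell)
    return best


# A tiny variation graph: A -> (C | G) -> T, i.e. a reference with one SNP.
snp_labels = ["A", "C", "G", "T"]
snp_preds = [[], [0], [0], [1, 2]]
print(align_to_dag(snp_labels, snp_preds, "AGT"))
```

Each vertex does work proportional to the query length for every incoming edge, so the total cost grows with the product of graph size and query length. That product is the cell-update count that the GCUPS figure measures.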
