A fast adaptive algorithm for computing whole-genome homology maps

Motivation: Whole-genome alignment is an important problem in genomics for comparing different species, mapping draft assemblies to reference genomes and identifying repeats. However, for large plant and animal genomes, this task remains compute- and memory-intensive. In addition, current practical methods offer no guarantees on the characteristics of their output alignments, which makes them hard to tune for different application requirements.

Results: We introduce an approximate algorithm for computing local alignment boundaries between long DNA sequences. Given a minimum alignment length and an identity threshold, our algorithm computes the desired alignment boundaries and identity estimates using k-mer-based statistics, and maintains sufficient probabilistic guarantees on the output sensitivity. Further, to prioritize higher-scoring alignment intervals, we develop a plane-sweep-based filtering technique which is theoretically optimal and practically efficient. Implementing these ideas resulted in a fast and accurate assembly-to-genome and genome-to-genome mapper. With it, we were able to map an error-corrected whole-genome NA12878 human assembly to the hg38 human reference genome in about 1 min total execution time and <4 GB memory using eight CPU threads, a significant improvement in memory usage over competing methods. Recall accuracy of computed alignment boundaries was consistently found to be [value missing in source] on multiple datasets. Finally, we performed a sensitive self-alignment of the human genome to compute all duplications of length ≥1 Kbp and [value missing in source] identity. The reported output achieves good recall and covers twice as many bases as the current UCSC browser's segmental duplication annotation.

Availability and implementation: https://github.com/marbl/MashMap
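To make the phrase "identity estimates using k-mer-based statistics" concrete, the following is a minimal sketch, not the MashMap implementation. It assumes the Mash-style relation between the Jaccard similarity J of two k-mer sets and sequence identity, identity ≈ 1 + (1/k) ln(2J / (1 + J)), and it compares full k-mer sets, whereas a practical mapper would likely work on a sampled subset of k-mers. The function names `kmers`, `jaccard` and `identity_estimate` are hypothetical and chosen only for illustration.

```python
import math


def kmers(seq, k):
    """Return the set of all length-k substrings of seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def jaccard(a, b):
    """Jaccard similarity of two sets; 0.0 when both are empty."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def identity_estimate(jaccard_value, k):
    """Mash-style identity estimate from a Jaccard value (assumed model).

    identity ~= 1 + (1/k) * ln(2J / (1 + J)), clamped to [0, 1].
    """
    if jaccard_value <= 0.0:
        return 0.0
    est = 1.0 + math.log(2.0 * jaccard_value / (1.0 + jaccard_value)) / k
    return max(0.0, min(1.0, est))


if __name__ == "__main__":
    k = 5
    ref = "ACGTACGTGACCTGAAGTCCATGGA"
    qry = "ACGTACGTGACCTGTAGTCCATGGA"  # one substitution relative to ref
    j = jaccard(kmers(ref, k), kmers(qry, k))
    print("Jaccard:", j, "estimated identity:", identity_estimate(j, k))
```

A mapper built on this idea would keep a candidate region only if its estimated identity meets the user-supplied threshold; the short sequences and small k here are purely illustrative and far below realistic genome-scale settings.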
