Canonical, stable, general mapping using context schemes

MOTIVATION: Sequence mapping is the cornerstone of modern genomics. However, most existing sequence mapping algorithms are insufficiently general.

RESULTS: We introduce context schemes: a method that allows the unambiguous recognition of a reference base in a query sequence by testing the query for substrings from an algorithmically defined set. Context schemes map a base only when there is a unique best mapping, and they define this criterion uniformly for all reference bases. Mappings under context schemes can also be made stable, so that extending the query string (e.g. by increasing read length) will not alter the mapping of previously mapped positions. Context schemes are general in several senses. They natively support the detection of arbitrarily complex, novel rearrangements relative to the reference. They can scale over orders of magnitude in query sequence length. Finally, they are trivially extensible to more complex reference structures, such as graphs, that incorporate additional variation. We demonstrate empirically the existence of high-performance context schemes, and present efficient context scheme mapping algorithms.

AVAILABILITY AND IMPLEMENTATION: The software test framework created for this study is available from https://registry.hub.docker.com/u/adamnovak/sequence-graphs/.

CONTACT: anovak@soe.ucsc.edu

SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
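To make the idea of recognizing a reference base by testing the query for substrings more concrete, the following is a minimal toy sketch in Python. It is not the authors' algorithm or software: the rule used here (a query position maps only if every query substring that covers it and occurs exactly once in the reference agrees on the same reference position) is a simplifying assumption chosen for illustration, and the names `find_all`, `map_query` and `min_len` are hypothetical.

```python
def find_all(text, pattern):
    """Return every start index at which pattern occurs in text (overlaps allowed)."""
    starts = []
    i = text.find(pattern)
    while i != -1:
        starts.append(i)
        i = text.find(pattern, i + 1)
    return starts


def map_query(reference, query, min_len=3):
    """Map each query position to a reference position, or None.

    Toy rule (an assumption for illustration, not the published scheme):
    query position i maps to reference position r only if at least one
    query substring of length >= min_len covering i occurs exactly once
    in the reference, and all such substrings imply the same r.
    """
    n = len(query)
    candidates = [set() for _ in range(n)]
    for start in range(n):
        for end in range(start + min_len, n + 1):
            hits = find_all(reference, query[start:end])
            if not hits:
                # A longer substring from this start cannot match either.
                break
            if len(hits) == 1:
                offset = hits[0] - start
                for i in range(start, end):
                    candidates[i].add(i + offset)
    # Map only where the evidence is unique; otherwise leave unmapped.
    return [next(iter(c)) if len(c) == 1 else None for c in candidates]


if __name__ == "__main__":
    print(map_query("ACGTTGCA", "CGTT"))
    print(map_query("ACGACG", "ACG"))
```

In this sketch a position with no uniquely matching context, or with conflicting contexts, is left unmapped rather than assigned a best guess, which mirrors the abstract's statement that mapping happens only when there is a unique best mapping. The brute-force substring search is quadratic or worse and is meant only to show the criterion, not an efficient mapping algorithm.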
