Aligning reads against a graph-based reference genome

As sequencing technologies improve, we are able to produce ever larger amounts of genetic data. One of the models used to organize and store this data is the reference genome, a structure which collects such information to form a representative sample of the genome of a given species. To account for the variation which appears as the amount of data increases, new models for representing reference genomes are necessary. Graphs can express complex interrelationships between elements, a property which naturally addresses the problem of variation. The newest human reference genome, GRCh38, already incorporates graph-like features through the introduction of alternate paths through variable regions. Methods created for interacting with the existing structures are traditionally centered around linear data representations, realized as sets of text string operations. To allow a complete transition, these methods must be adapted to the domain of graphs. An important string operation in the context of genetic data is sequence alignment. For reference genomes, this technique can be used to map new data against the reference. In this thesis, we present a new method for aligning text strings against graph-based reference genomes. The method is built on the concept of context-based mapping, a technique proposed to standardize uniqueness in structures which do not have an innate coordinate system. We have made the method accessible through a tool which is available online. We test the feasibility of our approach through performance comparisons with existing methods, examining both accuracy and efficiency. The results show several traits of the approach which outperform other proposed solutions. We argue that the method provides a viable solution to the most general version of the problem, and thereby a basis for more specific biological applications.
