Probabilistic Methods for Processing High-Throughput Sequencing Signals

High-throughput sequencing has the potential to answer many of the big questions in biology and medicine. It can be used to determine the ancestry of species, to chart complex ecosystems and to understand and diagnose disease. However, going from raw sequencing data to biological or medical insights is far from trivial. A key challenge is that sequencing methods cannot read the input sequences in their entirety. Due to technological constraints, they instead provide the sequences of a very large number of fragments of the input molecules. Furthermore, not all nucleotides in these fragments are measured correctly, so the final output of a typical experiment consists of hundreds of millions of error-containing sequence fragments. This thesis concerns the development of methods for transforming such a raw sequencing signal into a simpler representation from which biological inferences can then be made. Importantly, because the fragments are short and contain errors, there may be significant uncertainty associated with the signal. By using probabilistic models, we are able to quantify this uncertainty and propagate it to downstream analyses. The first chapter describes a new method for reconstructing transcript sequences from RNA sequencing data. The method is based on a novel sparse prior distribution over transcript abundances and is markedly more accurate than existing approaches. The second chapter describes a new method for calling genotypes from a fixed set of candidate variants. The method queries the reads using a graph representation of the variants and thereby mitigates the reference bias that characterises standard genotyping methods. In the last chapter, we apply this method to call the genotypes of 50 deeply sequenced parent-offspring trios from the GenomeDenmark project.
By estimating the genotypes on a set of candidate variants obtained from both a standard mapping-based approach and de novo assemblies, we are able to find considerably more structural variation than previous studies.
