Information theoretic alignment free variant calling

While traditional methods for calling variants across whole genome sequence data rely on alignment to an appropriate reference sequence, alternative techniques are needed when a suitable reference does not exist. We present a novel alignment- and assembly-free variant calling method based on information-theoretic principles, designed to detect variants that have strong statistical evidence for their ability to segregate samples in a given dataset. Our method uses the context surrounding a particular nucleotide to define variants. Given a set of reads, we model the probability of observing a given nucleotide, conditioned on the surrounding prefix and suffix of length k, as a multinomial distribution. We then estimate which of these contexts are stable intra-sample and varying inter-sample using a statistic based on the Kullback–Leibler divergence.

The utility of the variant calling method was evaluated through analysis of a pair of bacterial datasets and a mouse dataset. We found that our variants are highly informative for supervised learning tasks, with performance similar to standard reference-based calls and another reference-free method (DiscoSNP++). Comparisons against reference-based calls showed that our method was able to capture very similar population structure on the bacterial dataset. The algorithm's focus on discriminatory variants makes it suitable for many common analysis tasks for organisms that are too diverse to be mapped back to a single reference sequence.
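The context model described above can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the use of maximum-likelihood multinomial estimates, and the scoring rule (the mean Kullback–Leibler divergence of each sample's nucleotide distribution from the pooled distribution) are assumptions made for illustration. The sketch measures only inter-sample variation and does not model intra-sample stability.

```python
from collections import Counter, defaultdict
from math import log

NUCLEOTIDES = "ACGT"


def context_counts(reads, k):
    """Count the middle nucleotide seen in each (prefix, suffix) context of length k."""
    counts = defaultdict(Counter)
    for read in reads:
        for i in range(k, len(read) - k):
            middle = read[i]
            if middle not in NUCLEOTIDES:
                continue  # skip ambiguous bases such as N
            context = (read[i - k:i], read[i + 1:i + 1 + k])
            counts[context][middle] += 1
    return counts


def distribution(counter):
    """Maximum-likelihood multinomial estimate over A, C, G, T."""
    total = sum(counter[n] for n in NUCLEOTIDES)
    return [counter[n] / total for n in NUCLEOTIDES]


def kl_divergence(p, q):
    """KL(p || q) in nats; terms with p_i == 0 contribute nothing."""
    return sum(pi * log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def context_scores(samples, k):
    """Score each context by the mean KL divergence of the per-sample
    distributions from the pooled distribution (an illustrative statistic,
    assumed here; not the statistic from the paper).

    samples: dict mapping a sample name to a list of read strings.
    """
    per_sample = {name: context_counts(reads, k) for name, reads in samples.items()}

    # Pool counts across samples; the pooled distribution is nonzero
    # wherever any per-sample distribution is nonzero, so KL is finite.
    pooled = defaultdict(Counter)
    for counts in per_sample.values():
        for context, counter in counts.items():
            pooled[context].update(counter)

    scores = {}
    for context, pooled_counter in pooled.items():
        q = distribution(pooled_counter)
        divergences = [
            kl_divergence(distribution(counts[context]), q)
            for counts in per_sample.values()
            if context in counts  # only samples that observed this context
        ]
        scores[context] = sum(divergences) / len(divergences)
    return scores


if __name__ == "__main__":
    # Toy data: the two samples differ at a single position (G vs T).
    samples = {
        "s1": ["TGACGTAGC"] * 5,
        "s2": ["TGACTTAGC"] * 5,
    }
    ranked = sorted(context_scores(samples, k=1).items(), key=lambda kv: -kv[1])
    for context, score in ranked:
        print(context, round(score, 3))
```

In this toy input, the context with prefix `C` and suffix `T` carries `G` in one sample and `T` in the other, so it receives a positive score, while a context observed with the same nucleotide in both samples scores zero.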
