GenoSNP: a variational Bayes within-sample SNP genotyping algorithm that does not require a reference population

Current genotyping algorithms typically call genotypes by clustering allele-specific intensity data one single nucleotide polymorphism (SNP) at a time. This approach assumes the availability of a large number of control samples that have been assayed on the same array and platform. We have developed a SNP genotyping algorithm for the Illumina Infinium SNP genotyping assay that is entirely within-sample and requires neither a population of control samples nor parameters derived from such a population. Our algorithm exhibits high concordance with current methods and >99% call accuracy on HapMap samples. Because it calls genotypes using only within-sample information, the method is computationally light and practical for studies involving small sample sizes, and it provides a valuable independent quality control metric for population-based approaches. Availability: http://www.stats.ox.ac.uk/~giannoul/GenoSNP/
