Smarter clustering methods for SNP genotype calling

MOTIVATION: Most genotyping technologies for single nucleotide polymorphism (SNP) markers use standard clustering methods to 'call' the SNP genotypes. These methods do not always separate the genotype clusters of a SNP well, because they do not exploit specific features of the genotype calling problem. In particular, pedigree information is ignored even when family data are available. Furthermore, prior information about the distribution of the measurements within each cluster can guide the choice of an appropriate model-based clustering method and can significantly improve the genotype calls. One genotyping problem that has not been discussed in the literature is the genotyping of trisomic individuals, such as individuals with Down syndrome. Calling trisomic genotypes is a more complicated problem, and external information becomes very important.

RESULTS: In this article, we discuss the impact of incorporating external information into clustering algorithms to call genotypes for both disomic and trisomic data. We also propose two new methods for calling genotypes from family data. The first is a modification of the K-means method that uses pedigree information by updating all members of a family together. The second is a likelihood-based method that combines a Gaussian or beta-mixture model with pedigree information. We compare these two methods with several existing methods in simulation studies and on a real dataset generated by the Illumina platform (www.illumina.com).

AVAILABILITY: The R code for the family-based genotype calling methods (SNPCaller) is available for download at http://watson.hgen.pitt.edu/register.
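As an illustration of what updating all members of a family together could look like in a K-means setting, the following is a minimal sketch for disomic father-mother-child trios. It is not the SNPCaller implementation. The one-dimensional intensity measurement per individual, the initial centers, the genotype coding (0 = AA, 1 = AB, 2 = BB) and the function names are all assumptions made for this example. The only change from ordinary K-means is the assignment step, which picks the best joint genotype for the whole trio among Mendelian-consistent combinations instead of assigning each individual to its nearest center.

```python
from itertools import product

# Number of B alleles a parent with genotype 0 (AA), 1 (AB) or 2 (BB) can transmit.
TRANSMITTED = {0: (0,), 1: (0, 1), 2: (1,)}


def mendelian_consistent(father, mother, child):
    """True if the child genotype can arise from the two parental genotypes."""
    return any(a + b == child
               for a in TRANSMITTED[father]
               for b in TRANSMITTED[mother])


# All (father, mother, child) genotype combinations allowed by Mendelian inheritance.
CONSISTENT_TRIOS = [g for g in product(range(3), repeat=3)
                    if mendelian_consistent(*g)]


def family_kmeans(trios, centers=(0.0, 0.5, 1.0), max_iter=100):
    """Hypothetical family-aware K-means on 1-D measurements.

    trios   : list of (father, mother, child) measurements
    centers : assumed starting centers for the AA, AB and BB clusters
    Returns the joint genotype call for each trio and the final centers.
    """
    centers = list(centers)
    calls = None
    for _ in range(max_iter):
        # Assignment step: choose the Mendelian-consistent joint genotype that
        # minimises the summed squared distance over all three family members.
        new_calls = [
            min(CONSISTENT_TRIOS,
                key=lambda g: sum((x - centers[k]) ** 2
                                  for x, k in zip(trio, g)))
            for trio in trios
        ]
        if new_calls == calls:
            break  # assignments are stable
        calls = new_calls
        # Update step: ordinary K-means mean update; empty clusters keep their center.
        for k in range(3):
            members = [x for trio, g in zip(trios, calls)
                       for x, gk in zip(trio, g) if gk == k]
            if members:
                centers[k] = sum(members) / len(members)
    return calls, centers
```

In this sketch, a child whose measurement falls between the AA and AB centers is still called AA when both parents are clearly AA, because the AB call would make the trio Mendelian-inconsistent. A real implementation would also need to handle larger pedigrees, missing family members and the possibility of genuine Mendelian errors, none of which are covered here.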
