PlatinumCNV: A Bayesian Gaussian mixture model for genotyping copy number polymorphisms using SNP array signal intensity data

We present a statistical model for allele-specific patterns of copy number polymorphisms (CNPs) in commercial single nucleotide polymorphism (SNP) array data. The model is based on the observation that, at each SNP locus, fluorescent signal intensities tend to cluster into clouds of samples sharing the same allele-specific copy number (ASCN) genotype. Because instrumental error blurs the boundaries between these clusters, the model allows cluster memberships to overlap, using a Bayesian Gaussian mixture model (GMM). The approach is flexible, accommodating both absolute scale differences and X/Y scale imbalances in the fluorescent signal intensities. The resulting model is also robust to unobserved ASCN genotypes, which can be problematic for ordinary GMMs. We illustrate the utility of the model by applying it to commercial SNP array intensity data obtained from the Illumina HumanHap 610K platform. We retrieved more than 4,000 allele-specific CNPs, although 99% of them showed rather simple allele-specific CNP patterns, with only a single aneuploid haplotype among the normal haplotypes. Genotyping accuracy was assessed by two approaches: quantitative PCR and replicated subjects. Both approaches indicated a mean genotyping error rate of 1%. We also carried out a preliminary genome-wide association study of three hematological traits. The results suggest that the model could form the foundation for new, more effective statistical methods for mapping both disease genes and quantitative trait loci with genome-wide CNPs. The methods described in this work are implemented in a software package, PlatinumCNV, available on the Internet. Genet. Epidemiol. 35:831–844, 2011. © 2011 Wiley Periodicals, Inc.
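The idea of overlapping cluster memberships can be sketched in a few lines. The example below is an illustration only, not the PlatinumCNV implementation: it assumes that each ASCN genotype (a copies of allele A, b copies of allele B) has a cluster center at (scale_x * a, scale_y * b), with a shared isotropic noise level sigma and fixed mixing weights. The function names `ascn_genotypes` and `responsibilities`, and all parameter values, are hypothetical. Tying the centers to the genotype through two scale factors is one simple way to represent an X/Y scale imbalance and to keep a center defined for a genotype that no sample carries.

```python
import numpy as np


def ascn_genotypes(max_copies=3):
    """Enumerate (copies of A, copies of B) with total copy number <= max_copies."""
    return [(a, t - a) for t in range(max_copies + 1) for a in range(t, -1, -1)]


def responsibilities(xy, genotypes, scale_x=1.0, scale_y=1.0, sigma=0.15, weights=None):
    """Soft membership of each sample in each genotype cluster.

    Assumption (not from the paper): cluster means are (scale_x * a, scale_y * b)
    and all clusters share one isotropic standard deviation `sigma`.
    """
    xy = np.asarray(xy, dtype=float)
    means = np.array([(scale_x * a, scale_y * b) for a, b in genotypes], dtype=float)
    k = len(genotypes)
    if weights is None:
        weights = np.full(k, 1.0 / k)  # uniform mixing weights by default
    else:
        weights = np.asarray(weights, dtype=float)

    # Squared distance from every sample to every cluster mean: shape (n, k).
    sq_dist = ((xy[:, None, :] - means[None, :, :]) ** 2).sum(axis=2)

    # Unnormalized log posterior; the shared Gaussian constant cancels on normalization.
    log_post = np.log(weights) - sq_dist / (2.0 * sigma ** 2)
    log_post -= log_post.max(axis=1, keepdims=True)  # numerical stability
    post = np.exp(log_post)
    return post / post.sum(axis=1, keepdims=True)


if __name__ == "__main__":
    genotypes = ascn_genotypes(3)
    samples = [[1.0, 1.0], [2.2, 0.9]]  # made-up (X, Y) intensities
    post = responsibilities(samples, genotypes, scale_x=1.1, scale_y=0.9)
    for row in post:
        print(genotypes[int(row.argmax())], float(row.max()))
```

Each row of the returned matrix sums to one, so a sample lying between two clouds receives split membership rather than a hard call. A full Bayesian treatment would also place priors on the scales, variances, and mixing weights and estimate them from the data, which this sketch does not attempt.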
