Hybridization modeling of oligonucleotide SNP arrays for accurate DNA copy number estimation

Affymetrix SNP arrays are widely used for single-nucleotide polymorphism (SNP) genotype calling and for inferring DNA copy number variation. Although numerous methods achieve high accuracy in both tasks, most studies have paid little attention to modeling the hybridization of probes to off-target allele sequences, which can substantially affect accuracy. In this study, we address this issue and demonstrate that hybridization with mismatch nucleotides (HWMMN) occurs in all SNP probe-sets and has a critical effect on the estimation of allelic concentrations (ACs). We model sequence binding through the binding free energy, derive the binding affinity from it, and develop a probe intensity composite representation (PICR) model. The PICR model allows the ACs at a given SNP to be estimated by statistical regression. Using cell-line data with known true copy numbers, we further demonstrate that the PICR model achieves reasonable accuracy in copy number estimation at a single SNP locus, by taking the ratio of the estimated AC of each sample to that of the reference sample, and that it can reveal subtle genotype structure of SNPs at abnormal loci. We also demonstrate with HapMap data that the PICR model yields accurate SNP genotype calls consistently across samples, laboratories and even array platforms.
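The regression and ratio steps described above can be illustrated with a minimal sketch. It is not the published PICR implementation: it assumes, for illustration only, that each probe intensity is a linear combination of the two allelic concentrations weighted by probe-specific binding affinities plus a constant background, that affinity follows a Boltzmann factor of the binding free energy, and that the reference sample is diploid. All function names, the temperature, and the synthetic numbers are hypothetical.

```python
import numpy as np

GAS_CONSTANT = 1.987e-3  # kcal / (mol * K)


def affinity_from_free_energy(delta_g, temperature=318.15):
    """Assumed Boltzmann form: affinity = exp(-dG / (R * T)).

    The temperature (45 C) is a placeholder, not a value from the paper.
    """
    return np.exp(-np.asarray(delta_g, dtype=float) / (GAS_CONSTANT * temperature))


def estimate_allelic_concentrations(intensities, affinity_a, affinity_b):
    """Least-squares fit of I_k = b + a_kA * N_A + a_kB * N_B.

    Returns (N_A, N_B, b): the two allelic concentrations and the background.
    """
    intensities = np.asarray(intensities, dtype=float)
    design = np.column_stack([affinity_a, affinity_b, np.ones_like(intensities)])
    coef, *_ = np.linalg.lstsq(design, intensities, rcond=None)
    n_a, n_b, background = coef
    return n_a, n_b, background


def copy_number(sample_ac, reference_ac, reference_copies=2.0):
    """Copy number as the sample-to-reference ratio of total AC,
    scaled by the copy number assumed for the reference (diploid here)."""
    return reference_copies * sum(sample_ac) / sum(reference_ac)


# Synthetic probe-set: three probes target allele A, three target allele B.
# Each probe also binds the off-target allele with a smaller affinity.
aff_a = np.array([1.0, 0.9, 1.1, 0.2, 0.3, 0.25])
aff_b = np.array([0.2, 0.3, 0.25, 1.0, 0.9, 1.1])
background = 0.5

reference_intensity = background + aff_a * 1.0 + aff_b * 1.0  # genotype AB
sample_intensity = background + aff_a * 2.0 + aff_b * 1.0     # genotype AAB

ref_a, ref_b, _ = estimate_allelic_concentrations(reference_intensity, aff_a, aff_b)
smp_a, smp_b, _ = estimate_allelic_concentrations(sample_intensity, aff_a, aff_b)
estimated_cn = copy_number((smp_a, smp_b), (ref_a, ref_b))
```

In this noise-free toy case the fit recovers the simulated concentrations, so `estimated_cn` comes out at three copies. The off-target affinities are what make the mismatch hybridization explicit: setting them to zero would attribute all of a probe's signal to its target allele.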
