Copy number variation signature to predict human ancestry

Background

Copy number variations (CNVs) are genomic structural variants that are found in healthy populations and have been associated with disease susceptibility. Existing CNV detection methods typically operate on a sample-by-sample basis, which is not ideal for large datasets, where common CNVs must then be estimated by comparing CNV frequencies across the individual samples. Here we describe a simple and novel approach to locate genome-wide CNVs common to a specific population, using human ancestry as the phenotype.

Results

We used our previously published Genome Alteration Detection Analysis (GADA) algorithm to identify common ancestry CNVs (caCNVs) and built a caCNV model to predict population structure. We identified a 73-caCNV signature using a training set of 225 healthy individuals of European, Asian, and African ancestry. The signature was validated on an independent test set of 300 individuals with similar ancestral backgrounds. Using the 73-caCNV signature, the error rate in predicting ancestry in this test set was 2%. Several of the caCNVs identified had previously been confirmed experimentally to vary by ancestry. Our signature also contains a caCNV region with a single microRNA (MIR270), which represents the first reported variation of a microRNA by ancestry.

Conclusions

We developed a new methodology to identify common CNVs and demonstrated its performance by building a caCNV signature that predicts human ancestry with high accuracy. Our approach could be extended to large case–control studies to identify CNV signatures for other phenotypes, such as disease susceptibility and drug response.
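To make the train-then-validate workflow concrete, the following is a minimal sketch of how a CNV signature could be used to predict a group label and how a test-set error rate is computed. It assumes a plain nearest-centroid classifier, which is an illustrative choice and not necessarily the model used in this work. The function names, the three-region signature, the group labels `pop_A`, `pop_B`, `pop_C`, and all copy-number values are hypothetical toy inputs, not data from the study.

```python
from collections import defaultdict


def fit_centroids(samples, labels):
    """Average the copy-number vector of each group (one centroid per label)."""
    groups = defaultdict(list)
    for vec, lab in zip(samples, labels):
        groups[lab].append(vec)
    return {
        lab: [sum(col) / len(vecs) for col in zip(*vecs)]
        for lab, vecs in groups.items()
    }


def predict(centroids, vec):
    """Return the label whose centroid is closest in squared Euclidean distance."""
    def sq_dist(centroid):
        return sum((a - b) ** 2 for a, b in zip(vec, centroid))
    return min(centroids, key=lambda lab: sq_dist(centroids[lab]))


def error_rate(centroids, samples, labels):
    """Fraction of samples whose predicted label differs from the true label."""
    wrong = sum(predict(centroids, v) != lab for v, lab in zip(samples, labels))
    return wrong / len(labels)


# Hypothetical toy data: each row is one individual, each column is the
# copy number at one signature region. These values are made up.
train_x = [[2, 1, 2], [2, 0, 2], [3, 2, 2], [4, 2, 2], [2, 2, 0], [2, 2, 1]]
train_y = ["pop_A", "pop_A", "pop_B", "pop_B", "pop_C", "pop_C"]

# Independent toy test set, kept separate from the training individuals.
test_x = [[2, 1, 2], [3, 2, 2], [2, 2, 1], [3, 2, 0]]
test_y = ["pop_A", "pop_B", "pop_C", "pop_C"]

centroids = fit_centroids(train_x, train_y)
test_error = error_rate(centroids, test_x, test_y)
```

The key design point mirrored here is that the centroids are fitted only on the training individuals, and the error rate is measured only on individuals the model has not seen.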
