Detecting Population-Differentiation Copy Number Variants in Human Population Tree by Sparse Group Selection

Copy-number variants (CNVs) account for a substantial proportion of human genetic variation. Understanding CNV diversity across populations is a computational challenge because CNV patterns are often shared by several related populations and occur in only a subgroup of individuals within each population. This paper introduces a tree-guided sparse group selection algorithm (treeSGS) to detect population-differentiation CNV markers of subgroups across populations organized by a phylogenetic tree of human populations. The treeSGS algorithm detects CNV markers of the populations associated with nodes at all levels of the tree, so that the evolutionary relations among the populations are incorporated for more accurate detection of population-differentiation CNVs. We applied the treeSGS algorithm to the 1,179 samples from the 11 populations in the HapMap3 CNV data. The treeSGS algorithm accurately identifies CNV markers of each population and of the collections of populations organized under the branches of the human population tree, as validated by consistency among family trios and by SNP characterizations of the CNV regions. Further comparison between the detected CNV markers and other population-differentiation CNVs reported in the 1000 Genomes data and other recent studies also shows that treeSGS can significantly improve the current annotations of population-differentiation CNV markers. The treeSGS package is available at https://github.com/kuanglab/treeSGS.
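To make the idea of tree-guided group selection concrete, the following is a minimal sketch of group soft-thresholding applied over the nodes of a population tree. It is not the treeSGS formulation, which the abstract does not spell out; it illustrates the general tree-structured group-lasso shrinkage step, applied leaves first. The function names, the four-population toy tree, the per-population signal values, and the threshold `lam` are all hypothetical.

```python
import math


def group_shrink(values, idx, threshold):
    """Group soft-thresholding: shrink the L2 norm of values[idx] by `threshold`.

    If the group's norm is at or below the threshold, the whole group is zeroed.
    """
    norm = math.sqrt(sum(values[i] ** 2 for i in idx))
    scale = 0.0 if norm <= threshold else 1.0 - threshold / norm
    for i in idx:
        values[i] *= scale


def tree_group_prox(values, groups, lam):
    """Shrink every tree node's group in turn, smallest groups (leaves) first.

    In a tree, each child group is a strict subset of its parent, so sorting
    by group size guarantees children are processed before their ancestors.
    """
    out = list(values)
    for idx in sorted(groups, key=len):
        group_shrink(out, idx, lam)
    return out


# Hypothetical toy tree over four populations (indices 0..3):
# leaves {0}, {1}, {2}, {3}; internal nodes {0,1} and {2,3}; root {0,1,2,3}.
groups = [(0,), (1,), (2,), (3,), (0, 1), (2, 3), (0, 1, 2, 3)]

# Hypothetical per-population signal for one CNV region: strong in
# populations 0 and 1 (a shared branch), near zero in populations 2 and 3.
signal = [2.0, 1.8, 0.1, -0.05]

selected = tree_group_prox(signal, groups, lam=0.3)
nonzero_populations = [i for i, v in enumerate(selected) if v != 0]
print(nonzero_populations)
```

In this toy setting the weak signals in populations 2 and 3 are removed at the leaf level, while the branch containing populations 0 and 1 keeps nonzero values. This is the kind of branch-level pattern a tree-guided penalty is designed to preserve.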
