Manifold Learning for Human Population Structure Studies

Population genetic data produced by next-generation sequencing platforms are extremely high-dimensional. However, the "intrinsic dimensionality" of sequence data, which determines the structure of populations, is much lower. This motivates us to use locally linear embedding (LLE), which projects high-dimensional genomic data into a low-dimensional, neighborhood-preserving embedding, as a general framework for population structure and historical inference. To facilitate the application of LLE to population genetic analysis, we systematically investigate several important properties of LLE and reveal the connection between LLE and principal component analysis (PCA). Identifying a set of markers and genomic regions that can be used for population structure analysis will provide invaluable information for population genetics and association studies. In addition to identifying LLE-correlated or PCA-correlated structure-informative markers, we have developed a new statistic that integrates the information content of a genomic region so that its collective association with population structure can be studied, together with a LASSO algorithm to search for such regions across the genome. We applied the developed methodologies to a low-coverage pilot dataset from the 1000 Genomes Project and a Phase III Mexico dataset from HapMap. We observed that 25.1%, 44.9% and 21.4% of the common variants and 89.2%, 92.4% and 75.1% of the rare variants were LLE-correlated markers in CEU, YRI and ASI, respectively. This shows that rare variants, which are often private to specific populations, have much higher power to identify population substructure than common variants. These preliminary results demonstrate that next-generation sequencing offers rich resources, and that LLE provides a powerful tool, for population structure analysis.
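To make the embedding step concrete, the following is a minimal sketch of standard LLE (in the Roweis and Saul formulation) applied to a genotype matrix. It is not the authors' implementation: the function name `lle_embed`, the neighborhood size, the regularization constant, and the simulated two-population genotypes are all assumptions chosen for illustration.

```python
import numpy as np


def lle_embed(X, n_neighbors=10, n_components=2, reg=1e-3):
    """Standard LLE on the rows of X (individuals x markers).

    Hypothetical helper for illustration; parameter defaults are assumptions.
    """
    X = np.asarray(X, dtype=float)
    n = X.shape[0]

    # Step 1: find the nearest neighbors of each individual (Euclidean).
    sq = np.sum(X ** 2, axis=1)
    d2 = sq[:, None] + sq[None, :] - 2.0 * (X @ X.T)
    np.fill_diagonal(d2, np.inf)  # an individual is not its own neighbor
    nbrs = np.argsort(d2, axis=1)[:, :n_neighbors]

    # Step 2: weights that best reconstruct each point from its neighbors,
    # constrained to sum to one.
    W = np.zeros((n, n))
    for i in range(n):
        Z = X[nbrs[i]] - X[i]          # neighbors centered on point i
        C = Z @ Z.T                    # local Gram matrix
        tr = np.trace(C)
        C += np.eye(n_neighbors) * reg * (tr if tr > 0 else 1.0)
        w = np.linalg.solve(C, np.ones(n_neighbors))
        W[i, nbrs[i]] = w / w.sum()

    # Step 3: low-dimensional coordinates from the bottom eigenvectors of
    # M = (I - W)^T (I - W), skipping the first (smallest-eigenvalue) one.
    I_minus_W = np.eye(n) - W
    M = I_minus_W.T @ I_minus_W
    _, vecs = np.linalg.eigh(M)        # eigenvalues in ascending order
    return vecs[:, 1:n_components + 1]


# Toy data (simulated, not from the paper): 30 individuals from each of two
# populations, 200 biallelic markers coded as 0/1/2 minor-allele counts.
rng = np.random.default_rng(0)
freq_a = rng.uniform(0.05, 0.5, size=200)
freq_b = np.clip(freq_a + rng.normal(0.0, 0.15, size=200), 0.01, 0.99)
genotypes = np.vstack([
    rng.binomial(2, freq_a, size=(30, 200)),
    rng.binomial(2, freq_b, size=(30, 200)),
])

embedding = lle_embed(genotypes, n_neighbors=10, n_components=2)
print(embedding.shape)
```

Each row of `embedding` holds the low-dimensional coordinates of one individual. In practice the neighborhood size strongly affects the result, and a neighborhood graph that splits into disconnected components makes the bottom eigenvectors ambiguous, so the value used here should be treated as a placeholder.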
