Separating Populations with Wide Data: A Spectral Analysis

We consider the problem of partitioning a small data sample drawn from a mixture of k product distributions. We are interested in the case where individual features are of low average quality γ, and we want to use as few features as possible to partition the sample correctly. We analyze a spectral technique that approximately optimizes the total data size needed to perform this partitioning correctly, that is, the product of the number of data points n and the number of features K, as a function of 1/γ for K > n. Our goal is motivated by an application in clustering individuals according to their population of origin using markers, when the divergence between any two of the populations is small.
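
To make the setting concrete, the following is a minimal sketch of a generic spectral two-way partition in the wide regime K > n. It is an illustration under stated assumptions, not the algorithm analyzed here: the two-component Bernoulli product mixture, the per-feature gap `delta` (a hypothetical stand-in, not the quantity γ), the sample sizes, and all function names are chosen for the example. The sketch centers the n x K data matrix, forms the n x n Gram matrix, finds its leading eigenvector by power iteration, and splits the sample by the sign of that eigenvector's entries.

```python
import random


def simulate(n=30, K=1500, delta=0.15, seed=0):
    """Draw n points from a balanced mixture of two Bernoulli product
    distributions over K features (all parameters are illustrative)."""
    rng = random.Random(seed)
    base = [rng.uniform(0.3, 0.7) for _ in range(K)]
    signs = [rng.choice((-1.0, 1.0)) for _ in range(K)]
    # The two populations differ by 2 * delta in every feature.
    probs = [
        [b + delta * s for b, s in zip(base, signs)],
        [b - delta * s for b, s in zip(base, signs)],
    ]
    labels = [0] * (n // 2) + [1] * (n - n // 2)
    rng.shuffle(labels)
    X = [[1.0 if rng.random() < p else 0.0 for p in probs[z]] for z in labels]
    return X, labels


def spectral_two_way_partition(X, iters=200, seed=1):
    """Split the rows of X by the sign of the leading eigenvector of the
    centered Gram matrix (equivalently, the top left singular vector)."""
    n, K = len(X), len(X[0])
    col_mean = [sum(row[j] for row in X) / n for j in range(K)]
    Xc = [[x - m for x, m in zip(row, col_mean)] for row in X]

    # n x n Gram matrix; cheaper than a K x K covariance when K > n.
    G = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i, n):
            g = sum(a * b for a, b in zip(Xc[i], Xc[j]))
            G[i][j] = G[j][i] = g

    # Power iteration for the leading eigenvector of G.
    rng = random.Random(seed)
    v = [rng.gauss(0.0, 1.0) for _ in range(n)]
    for _ in range(iters):
        w = [sum(G[i][j] * v[j] for j in range(n)) for i in range(n)]
        norm = sum(x * x for x in w) ** 0.5
        v = [x / norm for x in w]
    return [1 if x > 0 else 0 for x in v]


def accuracy_up_to_swap(pred, labels):
    """Fraction of correctly grouped points, ignoring which side is called 0."""
    acc = sum(p == z for p, z in zip(pred, labels)) / len(labels)
    return max(acc, 1.0 - acc)


if __name__ == "__main__":
    X, labels = simulate()
    pred = spectral_two_way_partition(X)
    print("accuracy up to label swap:", accuracy_up_to_swap(pred, labels))
```

In this toy setup the gap between populations is deliberately large relative to the noise, so the sign split should recover the two groups well. Shrinking `delta` while increasing `K` gives a rough feel for the trade-off between feature quality and the number of features.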
