Separating Populations with Wide Data: A Spectral Analysis

In this paper, we consider the problem of partitioning a small data sample drawn from a mixture of k product distributions. We are interested in the case where individual features have low average quality γ, and we want to use as few of them as possible to partition the sample correctly. We analyze a spectral technique that approximately optimizes the total data size (the product of the number of data points n and the number of features K) needed to perform this partitioning correctly, as a function of 1/γ for K > n. Our goal is motivated by an application in clustering individuals according to their population of origin using markers, when the divergence between any two of the populations is small.
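As a rough illustration of the setting, the sketch below partitions a synthetic sample from a mixture of two product distributions with a generic spectral step. It is not the algorithm analyzed in the paper. The function names, the per-feature gap `delta` (a stand-in for feature quality, not the paper's definition of γ), the balanced populations, and the choice of power iteration on the Gram matrix are all assumptions made for this toy example.

```python
import random


def sample_mixture(n, K, delta, rng):
    """Draw n points with K binary features from two balanced populations.

    Assumption: every feature has success probability 0.5 + delta in
    population 0 and 0.5 - delta in population 1.
    """
    labels = [0] * (n // 2) + [1] * (n - n // 2)
    rows = []
    for lab in labels:
        p = 0.5 + delta if lab == 0 else 0.5 - delta
        rows.append([1.0 if rng.random() < p else 0.0 for _ in range(K)])
    return rows, labels


def spectral_bipartition(rows, rng, iters=200):
    """Split points by the sign of the top eigenvector of the centered Gram matrix."""
    n, K = len(rows), len(rows[0])
    # Center each feature so the population difference dominates the spectrum.
    means = [sum(r[j] for r in rows) / n for j in range(K)]
    X = [[r[j] - means[j] for j in range(K)] for r in rows]
    # The n x n Gram matrix is the smaller object to work with when K > n.
    G = [[sum(a * b for a, b in zip(X[i], X[j])) for j in range(n)]
         for i in range(n)]
    # Power iteration from a random start to approximate the top eigenvector.
    u = [rng.gauss(0.0, 1.0) for _ in range(n)]
    for _ in range(iters):
        v = [sum(G[i][j] * u[j] for j in range(n)) for i in range(n)]
        norm = sum(x * x for x in v) ** 0.5
        u = [x / norm for x in v]
    return [0 if x >= 0 else 1 for x in u]


def agreement(pred, labels):
    """Fraction of points labeled consistently, up to swapping the two labels."""
    same = sum(p == t for p, t in zip(pred, labels))
    return max(same, len(labels) - same) / len(labels)


rng = random.Random(0)
rows, labels = sample_mixture(n=40, K=400, delta=0.2, rng=rng)
pred = spectral_bipartition(rows, rng)
acc = agreement(pred, labels)
```

Shrinking `delta` while holding n and K fixed is a simple way to see the partition degrade, which is the regime where the trade-off between n and K discussed above becomes relevant.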
