Reliable Learning of Bernoulli Mixture Models

In this paper, we derive a set of sufficient conditions for reliable clustering of data produced by Bernoulli Mixture Models (BMM) when the number of clusters is unknown. A BMM describes a random binary vector whose components, given the cluster, are independent Bernoulli trials with cluster-specific frequencies. The problem of clustering BMM data arises in many real-world applications, most notably in population genetics, where researchers aim to infer population structure from multilocus genotype data. Our results give a minimum dataset size and a minimum number of Bernoulli trials (or genotyped loci) per sample that together guarantee the existence of a clustering algorithm with sufficient accuracy. Moreover, the mathematical intuitions and tools behind our work can help researchers design more effective and theoretically plausible heuristic methods for similar problems.
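As an illustration of the data model only (not of the paper's conditions or of any clustering algorithm), the following minimal sketch draws samples from a BMM. The function name `sample_bmm`, the mixing weights, and the frequency values are all assumptions chosen for the example; the abstract does not specify them.

```python
import random

def sample_bmm(n_samples, weights, frequencies, seed=0):
    """Draw binary vectors from a Bernoulli mixture (illustrative sketch).

    weights[k]        -- assumed prior probability of cluster k
    frequencies[k][j] -- success probability of trial j within cluster k
    """
    rng = random.Random(seed)
    n_clusters = len(weights)
    labels, data = [], []
    for _ in range(n_samples):
        # Pick a cluster according to the mixing weights.
        k = rng.choices(range(n_clusters), weights=weights)[0]
        # Given the cluster, each component is an independent Bernoulli trial.
        x = [1 if rng.random() < p else 0 for p in frequencies[k]]
        labels.append(k)
        data.append(x)
    return labels, data

# Hypothetical example: two clusters, four trials (e.g. loci) per sample.
weights = [0.5, 0.5]
frequencies = [[0.9, 0.8, 0.1, 0.2],
               [0.1, 0.3, 0.9, 0.7]]
labels, data = sample_bmm(1000, weights, frequencies)
```

In the clustering problem the abstract describes, only `data` would be observed; `labels`, the cluster frequencies, and the number of clusters are unknown.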
