Augmented Latent Dirichlet Allocation (LDA) Topic Model with Gaussian Mixture Topics

Latent Dirichlet allocation (LDA) is a statistical model often used to discover topics or themes in a large collection of documents. In the LDA model, topics are modeled as discrete distributions over a finite vocabulary of words. LDA is also a popular choice for modeling other datasets spanning a discrete domain, such as population genetics and social networks. To model data spanning a continuous domain with LDA, however, the data must first be discretized, and these discrete approximations can lose information and misrepresent the true structure of the underlying data. We present an augmented version of the LDA topic model in which topics are represented by Gaussian mixture models (GMMs), which are multi-modal distributions over a continuous domain. We denote this augmented model GMM-LDA and use Gibbs sampling to infer its parameters. We demonstrate the utility of the GMM-LDA model by applying it to the problem of clustering sleep states in electroencephalography (EEG) data. Results demonstrate superior clustering performance with our GMM-LDA algorithm compared to standard LDA and other clustering algorithms.
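To illustrate the key idea of continuous-domain topics, the sketch below evaluates a continuous observation directly under two Gaussian-mixture "topics" rather than discretizing it into vocabulary bins first. This is a minimal illustration, not the paper's implementation: the topic parameters and the single scalar feature are hypothetical, and the full GMM-LDA model would infer such parameters via Gibbs sampling.

```python
import math

def gauss_pdf(x, mu, sigma):
    # Univariate Gaussian density.
    return math.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * math.sqrt(2 * math.pi))

def gmm_pdf(x, weights, means, sigmas):
    # Mixture density: weighted sum of Gaussian component densities.
    return sum(w * gauss_pdf(x, m, s) for w, m, s in zip(weights, means, sigmas))

# Two hypothetical topics, each a 2-component GMM over a continuous feature
# (e.g., a spectral power value from an EEG epoch). Parameters are
# illustrative placeholders, not values from the paper.
topic_a = dict(weights=[0.6, 0.4], means=[1.0, 4.0], sigmas=[0.5, 0.8])
topic_b = dict(weights=[0.5, 0.5], means=[2.5, 6.0], sigmas=[0.7, 0.6])

x = 1.2  # a raw continuous observation; no discretization step is needed
like_a = gmm_pdf(x, **topic_a)
like_b = gmm_pdf(x, **topic_b)
# Per-topic likelihoods like these would feed the topic-assignment step
# of a Gibbs sampler in place of LDA's discrete word-topic counts.
```

Because each topic is multi-modal, a single topic can explain observations clustered around several distinct values, which a unimodal Gaussian topic could not.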