A nonparametric model for online topic discovery with word embeddings

Abstract With the explosive growth of short documents generated by streaming textual sources (e.g., Twitter), latent topic discovery has become a critical task for short text stream clustering. However, most online clustering models determine the probability of producing a new topic from a manually set hyper-parameter or threshold, which becomes a barrier to achieving better topic discovery results. Moreover, topics generated by existing models often cover a wide portion of the vocabulary, which is ill-suited to online social media analysis. We therefore propose a nonparametric model (NPMM) that exploits auxiliary word embeddings to infer the number of topics and employs a “spike and slab” function to alleviate the sparsity problem of topic-word distributions in online short text analysis. NPMM automatically decides whether a given document belongs to an existing topic, using the squared Mahalanobis distance as the measure. Hence, the proposed model does not require tuning a hyper-parameter to obtain the probability of generating new topics. Additionally, we propose a nonparametric sampling strategy to discover representative terms for each topic. To perform inference, we introduce a one-pass Gibbs sampling algorithm based on the Cholesky decomposition of covariance matrices, which can be further sped up using a Metropolis-Hastings step. Our experiments demonstrate that NPMM significantly outperforms state-of-the-art algorithms.
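As a rough illustration of the distance computation mentioned above, the following sketch evaluates the squared Mahalanobis distance between a document embedding and a topic through a Cholesky factor of the topic's covariance matrix. It is not the NPMM inference procedure: the function names `squared_mahalanobis` and `nearest_topic`, the representation of a topic as a `(mean, cov)` pair, and the use of a single embedding vector per document are all assumptions made for this example.

```python
import numpy as np


def squared_mahalanobis(x, mean, cov):
    """Return (x - mean)^T cov^{-1} (x - mean) without forming cov^{-1}.

    Hypothetical helper: cov must be symmetric positive definite.
    """
    chol = np.linalg.cholesky(cov)        # cov = chol @ chol.T, chol is lower triangular
    z = np.linalg.solve(chol, x - mean)   # solve chol @ z = (x - mean)
    return float(z @ z)                   # z^T z equals the squared distance


def nearest_topic(x, topics):
    """Return (index, distance) of the topic closest to x.

    `topics` is an assumed representation: a list of (mean, cov) pairs.
    """
    distances = [squared_mahalanobis(x, mean, cov) for mean, cov in topics]
    best = int(np.argmin(distances))
    return best, distances[best]


if __name__ == "__main__":
    topics = [
        (np.zeros(2), np.eye(2)),
        (np.array([10.0, 10.0]), np.diag([4.0, 1.0])),
    ]
    doc_embedding = np.array([8.0, 10.0])
    print(nearest_topic(doc_embedding, topics))
```

Solving against the Cholesky factor avoids an explicit matrix inverse, which is the usual reason to pair the two. How the resulting distance is turned into a decision about opening a new topic is specified by the model itself and is not reproduced here.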
