Short Text Topic Modeling with Topic Distribution Quantization and Negative Sampling Decoder

Topic models have long been effective at discovering latent semantics in long documents. For short texts, however, they generally suffer from data sparsity caused by extremely limited word co-occurrences, and thus tend to yield repetitive or trivial topics of low quality. To address this issue, we propose a novel neural topic model in the autoencoding framework with a new topic distribution quantization approach that generates peakier distributions better suited to modeling short texts. Beyond the encoding side, we further tackle this issue in decoding with a novel negative sampling decoder that learns from negative samples to avoid yielding repetitive topics. We observe that our model substantially improves short text topic modeling performance. Through extensive experiments on real-world datasets, we demonstrate that our model outperforms both strong traditional and neural baselines under extreme data sparsity, producing high-quality topics.
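The abstract does not spell out the quantization mechanism, but "topic distribution quantization" in the spirit of discrete representation learning can be illustrated as snapping an encoder-produced topic distribution to the nearest entry in a codebook of low-entropy distributions. The sketch below is a minimal NumPy illustration under that assumption; the codebook construction, distance metric, and joint training are hypothetical stand-ins, not the paper's exact formulation.

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

# Hypothetical codebook of K "peaky" topic distributions over T topics.
# Scaling the logits sharpens each row toward low entropy (an assumption;
# in practice such a codebook would be learned jointly with the encoder).
T, K = 10, 32
codebook = softmax(rng.normal(size=(K, T)) * 5.0)

def quantize(theta):
    """Snap a topic distribution theta to its nearest codebook entry
    (Euclidean distance), yielding a peakier distribution for decoding."""
    d = ((codebook - theta) ** 2).sum(axis=1)
    return codebook[d.argmin()]

theta = softmax(rng.normal(size=T))  # smooth distribution from an encoder
theta_q = quantize(theta)            # quantized, peakier distribution
```

In a full model, gradients would flow through the quantization step via a straight-through estimator, as in discrete representation learning; this sketch only shows the forward snap-to-codebook operation.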
