Clustering-Based Online News Topic Detection and Tracking Through Hierarchical Bayesian Nonparametric Models

In this paper, we propose a clustering-based online news topic detection and tracking (TDT) approach built on a hierarchical Bayesian nonparametric framework that allows topics to be shared across different news stories in a corpus. Our approach is formulated using the hierarchical Pitman-Yor process mixture model with the inverted Beta-Liouville (IBL) distribution as its component density, which has been shown to model text data better than the widely used Gaussian distribution. Moreover, we develop a convergence-guaranteed online learning algorithm, based on variational Bayes, that can effectively learn the proposed TDT model from a stream of news stories. The merits of our TDT approach are illustrated by comparing it with other well-defined clustering-based TDT approaches on different news data sets.
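As an illustration of the nonparametric prior named above, the following is a minimal sketch of the standard truncated stick-breaking construction of a single-level Pitman-Yor process, in which the k-th stick proportion is drawn from Beta(1 - d, θ + k·d). It is not the hierarchical model, the IBL component density, or the online variational algorithm of this paper. The function name, the parameter names `discount` and `concentration`, the truncation level, and the seed are all assumptions chosen for illustration.

```python
import random


def pitman_yor_stick_breaking(discount, concentration, truncation, seed=0):
    """Draw truncated stick-breaking weights from a Pitman-Yor process prior.

    Hypothetical helper for illustration only: it samples mixing weights for
    a single-level Pitman-Yor process, not the hierarchical model in the paper.
    """
    if not 0.0 <= discount < 1.0:
        raise ValueError("discount must lie in [0, 1)")
    if concentration <= -discount:
        raise ValueError("concentration must exceed -discount")
    if truncation < 1:
        raise ValueError("truncation must be at least 1")

    rng = random.Random(seed)
    weights = []
    remaining = 1.0  # length of the stick not yet assigned to a component
    for k in range(1, truncation + 1):
        if k == truncation:
            # Assign all remaining mass to the last component so that the
            # truncated weights sum to one.
            fraction = 1.0
        else:
            # V_k ~ Beta(1 - d, theta + k * d)
            fraction = rng.betavariate(1.0 - discount,
                                       concentration + k * discount)
        weights.append(remaining * fraction)
        remaining *= 1.0 - fraction
    return weights


if __name__ == "__main__":
    w = pitman_yor_stick_breaking(discount=0.5, concentration=1.0,
                                  truncation=20)
    print(len(w), round(sum(w), 6))
```

Setting `discount` to 0 recovers the stick-breaking construction of the Dirichlet process. Larger discounts produce heavier-tailed, power-law-like weight sequences, which is the usual motivation for preferring Pitman-Yor priors on text.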
