A Hybrid Framework for News Clustering Based on the DBSCAN-Martingale and LDA

Nowadays there is an important need by journalists and media monitoring companies to cluster news in large amounts of web articles, in order to ensure fast access to their topics or events of interest. Our aim in this work is to identify groups of news articles that share a common topic or event, without a priori knowledge of the number of clusters. The estimation of the correct number of topics is a challenging issue, due to the existence of “noise”, i.e. news articles which are irrelevant to all other topics. In this context, we introduce a novel density-based news clustering framework, in which the assignment of news articles to topics is done by the well-established Latent Dirichlet Allocation, but the estimation of the number of clusters is performed by our novel method, called “DBSCAN-Martingale”, which allows for extracting noise from the dataset and progressively extracts clusters from an OPTICS reachability plot. We evaluate our framework and the DBSCAN-Martingale on the 20newsgroups-mini dataset and on 220 web news articles, which are references to specific Wikipedia pages. Among twenty methods for news clustering, without knowing the number of clusters k, the framework of DBSCAN-Martingale provides the correct number of clusters and the highest Normalized Mutual Information.

[1]  W. Krzanowski,et al.  A Criterion for Determining the Number of Groups in a Data Set Using Sum-of-Squares Clustering , 1988 .

[2]  ChengXiang Zhai,et al.  Unsupervised Feature Selection for Multi-View Clustering on Text-Image Web News Data , 2014, CIKM.

[3]  Zhiyong Lu,et al.  Automatic Extraction of Clusters from Hierarchical Clustering Representations , 2003, PAKDD.

[4]  Hal Daumé,et al.  A Co-training Approach for Multi-view Spectral Clustering , 2011, ICML.

[5]  L. Hubert,et al.  A general statistical framework for assessing categorical clustering in free recall. , 1976 .

[6]  Ricardo J. G. B. Campello,et al.  Density-Based Clustering Based on Hierarchical Density Estimates , 2013, PAKDD.

[7]  Geoffrey H. Ball,et al.  ISODATA, A NOVEL METHOD OF DATA ANALYSIS AND PATTERN CLASSIFICATION , 1965 .

[8]  Charu C. Aggarwal,et al.  A Survey of Text Clustering Algorithms , 2012, Mining Text Data.

[9]  André Hardy,et al.  An examination of procedures for determining the number of clusters in a data set , 1994 .

[10]  T. Frey,et al.  A Cluster Analysis of the D 2 Matrix of White Spruce Stands in Saskatchewan Based on the Maximum-Minimum Principle , 1972 .

[11]  Michael I. Jordan,et al.  Hierarchical Dirichlet Processes , 2006 .

[12]  J. Dunn Well-Separated Clusters and Optimal Fuzzy Partitions , 1974 .

[13]  Johannes Schneider,et al.  Fast parameterless density-based clustering via random projections , 2013, CIKM.

[14]  Michael I. Jordan,et al.  Revisiting k-means: New Algorithms via Bayesian Nonparametrics , 2011, ICML.

[15]  Michael I. Jordan,et al.  Latent Dirichlet Allocation , 2001, J. Mach. Learn. Res..

[16]  Hans-Peter Kriegel,et al.  OPTICS: ordering points to identify the clustering structure , 1999, SIGMOD '99.

[17]  Ali S. Hadi,et al.  Finding Groups in Data: An Introduction to Chster Analysis , 1991 .

[18]  Richard O. Duda,et al.  Pattern classification and scene analysis , 1974, A Wiley-Interscience publication.

[19]  T. Caliński,et al.  A dendrite method for cluster analysis , 1974 .

[20]  G. W. Milligan,et al.  An examination of procedures for determining the number of clusters in a data set , 1985 .

[21]  Vincent Kanade,et al.  Clustering Algorithms , 2021, Wireless RF Energy Transfer in the Massive IoT Era.

[22]  Michalis Vazirgiannis,et al.  Quality Scheme Assessment in the Clustering Process , 2000, PKDD.

[23]  Donald W. Bouldin,et al.  A Cluster Separation Measure , 1979, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[24]  Hans-Peter Kriegel,et al.  A Density-Based Algorithm for Discovering Clusters in Large Spatial Databases with Noise , 1996, KDD.

[25]  Malika Charrad,et al.  NbClust: An R Package for Determining the Relevant Number of Clusters in a Data Set , 2014 .

[26]  Martin F. Porter,et al.  An algorithm for suffix stripping , 1997, Program.

[27]  M. Cugmas,et al.  On comparing partitions , 2015 .