Using maximal spanning trees and word similarity to generate hierarchical clusters of non-redundant RSS news articles

RSS news articles whose content is partially or completely duplicated are common on the Internet, which forces Web users to sort through them to identify non-redundant information. This manual filtering process is time-consuming and tedious. In this paper, we present a new filtering and clustering approach, called FICUS, which first identifies and eliminates redundant RSS news articles using a fuzzy set information retrieval approach and then clusters the remaining non-redundant articles according to their degrees of resemblance. FICUS organizes the clusters in a tree hierarchy. The content of each cluster is captured by representative keywords drawn from its articles, so that searching for and retrieving similar RSS news articles is fast and efficient. FICUS is simple: it uses pre-defined word-correlation factors to determine related (words in) RSS news articles and to filter redundant ones, and it relies on well-known yet simple mathematical models, such as the standard deviation, the vector space model, and probability theory, to generate clusters of non-redundant RSS news articles. Experiments on test sets of RSS news articles covering various topics, downloaded from different online sources, verify the accuracy of FICUS in eliminating redundant RSS news articles, clustering similar articles together, and separating articles that differ in content. Further empirical studies show that FICUS outperforms well-known approaches adopted for clustering RSS news articles.
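The abstract does not give formulas, so the following Python sketch only illustrates the general pipeline it describes: filter redundant articles with a fuzzy-set word similarity, then cluster the rest with a maximal spanning tree. It is not the FICUS implementation. The correlation table `CORR`, the articles in `docs`, the thresholds `REDUNDANT` and `CUT`, and all function names are hypothetical. The membership formula is the common fuzzy-set retrieval form, assumed here rather than taken from the paper, and the standard-deviation, vector space, and keyword-labeling steps are omitted.

```python
from itertools import combinations

# Hypothetical word-correlation factors in [0, 1] (toy values, not the paper's).
CORR = {
    frozenset(["storm", "hurricane"]): 0.8,
    frozenset(["coast", "shore"]): 0.7,
}

def corr(a, b):
    """Correlation factor of two words; identical words correlate fully."""
    return 1.0 if a == b else CORR.get(frozenset([a, b]), 0.0)

def membership(word, doc):
    """Fuzzy degree to which `word` belongs to `doc`: 1 - prod(1 - corr)."""
    prod = 1.0
    for other in doc:
        prod *= 1.0 - corr(word, other)
    return 1.0 - prod

def similarity(d1, d2):
    """Mean membership in each direction; the smaller value is returned."""
    s12 = sum(membership(w, d2) for w in d1) / len(d1)
    s21 = sum(membership(w, d1) for w in d2) / len(d2)
    return min(s12, s21)

class UnionFind:
    """Disjoint-set structure used for Kruskal's algorithm and for clusters."""
    def __init__(self, n):
        self.parent = list(range(n))
    def find(self, x):
        while self.parent[x] != x:
            self.parent[x] = self.parent[self.parent[x]]
            x = self.parent[x]
        return x
    def union(self, a, b):
        ra, rb = self.find(a), self.find(b)
        if ra == rb:
            return False
        self.parent[ra] = rb
        return True

def maximal_spanning_tree(n, edges):
    """Kruskal's algorithm, taking the heaviest edges first."""
    uf = UnionFind(n)
    return [(w, i, j) for w, i, j in sorted(edges, reverse=True) if uf.union(i, j)]

def clusters_from_tree(n, tree, cut):
    """Drop tree edges weaker than `cut`; connected components are clusters."""
    uf = UnionFind(n)
    for w, i, j in tree:
        if w >= cut:
            uf.union(i, j)
    groups = {}
    for i in range(n):
        groups.setdefault(uf.find(i), []).append(i)
    return sorted(groups.values())

# Toy articles, already reduced to keyword sets (hypothetical data).
docs = [
    {"storm", "coast"},
    {"hurricane", "shore"},
    {"election", "vote"},
    {"vote", "election"},  # duplicate of the previous article
]
REDUNDANT = 0.95  # assumed threshold: at or above this, an article is redundant
CUT = 0.5         # assumed threshold: weaker tree edges separate clusters

# Step 1: keep an article only if it is not redundant with one already kept.
kept = []
for idx, doc in enumerate(docs):
    if all(similarity(doc, docs[k]) < REDUNDANT for k in kept):
        kept.append(idx)

# Step 2: build the similarity graph over kept articles and cluster its tree.
edges = [(similarity(docs[kept[a]], docs[kept[b]]), a, b)
         for a, b in combinations(range(len(kept)), 2)]
tree = maximal_spanning_tree(len(kept), edges)
clusters = [[kept[i] for i in group]
            for group in clusters_from_tree(len(kept), tree, CUT)]
print(kept, clusters)
```

In this toy run the fourth article is dropped as a duplicate of the third, and the two weather articles end up in one cluster because their words are correlated even though they share no exact term. A full hierarchy could be obtained by repeating the cut at several thresholds, which is again an assumption about how the tree might be used rather than a detail stated in the abstract.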
