Mining fuzzy frequent itemsets for hierarchical document clustering

As the number of text documents on the Internet grows explosively, hierarchical document clustering has proven useful for grouping similar documents across a wide range of applications. However, most document clustering methods still struggle with high dimensionality, scalability, accuracy, and the generation of meaningful cluster labels. In this paper, we present an effective Fuzzy Frequent Itemset-Based Hierarchical Clustering (F^2IHC) approach, which uses a fuzzy association rule mining algorithm to improve the clustering accuracy of the Frequent Itemset-Based Hierarchical Clustering (FIHC) method. In our approach, key terms are first extracted from the document set, and each document is pre-processed into the representation required by the subsequent mining process. Then, a fuzzy association rule mining algorithm for text is employed to discover a set of highly related fuzzy frequent itemsets, whose key terms serve as the labels of the candidate clusters. Finally, the documents are clustered into a hierarchical cluster tree by referring to these candidate clusters. We have conducted experiments to evaluate the performance of the approach on the Classic4, Hitech, Re0, Reuters, and Wap datasets. The experimental results show that our approach retains the merits of FIHC while also improving its clustering accuracy.
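
To make the mining step concrete, the following Python sketch illustrates one way fuzzy frequent itemsets could be discovered from term frequencies. It is a minimal illustration under stated assumptions, not the F^2IHC implementation: the three fuzzy regions (Low, Mid, High), their breakpoints, the use of the minimum as the fuzzy intersection, the Apriori-style level-wise search, and all function names (`memberships`, `fuzzify`, `fuzzy_support`, `mine`) are hypothetical choices made for the example, and the toy documents are invented.

```python
from itertools import combinations


def memberships(freq, low=1.0, mid=3.0, high=5.0):
    """Membership degrees of a term frequency in three fuzzy regions.

    The piecewise-linear shape and the breakpoints are assumptions
    chosen for illustration only.
    """
    if freq <= 0:
        return {"Low": 0.0, "Mid": 0.0, "High": 0.0}
    if freq <= low:
        return {"Low": 1.0, "Mid": 0.0, "High": 0.0}
    if freq <= mid:
        m = (freq - low) / (mid - low)
        return {"Low": 1.0 - m, "Mid": m, "High": 0.0}
    if freq <= high:
        h = (freq - mid) / (high - mid)
        return {"Low": 0.0, "Mid": 1.0 - h, "High": h}
    return {"Low": 0.0, "Mid": 0.0, "High": 1.0}


def fuzzify(doc):
    """Map a {term: count} document to {(term, region): degree}."""
    return {
        (term, region): degree
        for term, freq in doc.items()
        for region, degree in memberships(freq).items()
        if degree > 0.0
    }


def fuzzy_support(itemset, fuzzy_docs):
    """Average, over all documents, of the minimum membership in the itemset."""
    total = sum(min(doc.get(item, 0.0) for item in itemset) for doc in fuzzy_docs)
    return total / len(fuzzy_docs)


def mine(fuzzy_docs, min_support, max_size=2):
    """Level-wise (Apriori-style) search for fuzzy frequent itemsets."""
    items = sorted({item for doc in fuzzy_docs for item in doc})
    candidates = [(item,) for item in items]
    frequent = {}
    for size in range(1, max_size + 1):
        # Keep the candidates whose fuzzy support reaches the threshold.
        level = {}
        for itemset in candidates:
            support = fuzzy_support(itemset, fuzzy_docs)
            if support >= min_support:
                level[itemset] = support
        frequent.update(level)
        # Build larger candidates from surviving items, never pairing two
        # regions of the same term, and pruning by the Apriori property.
        survivors = sorted({item for itemset in level for item in itemset})
        candidates = [
            cand
            for cand in combinations(survivors, size + 1)
            if len({term for term, _ in cand}) == size + 1
            and all(sub in level for sub in combinations(cand, size))
        ]
    return frequent


# Toy corpus (invented): each document is a bag of key-term counts.
docs = [
    {"fuzzy": 4, "mining": 2},
    {"fuzzy": 5, "mining": 3, "cluster": 1},
    {"cluster": 4, "tree": 2},
]
fuzzy_docs = [fuzzify(doc) for doc in docs]
frequent = mine(fuzzy_docs, min_support=0.4)

for itemset, support in frequent.items():
    print(itemset, round(support, 3))
```

In this sketch, each frequent itemset would play the role of a candidate cluster whose terms act as its label; the construction of the hierarchical cluster tree from those candidates is not shown.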
