An Integration of Fuzzy Association Rules and WordNet for Document Clustering

With the rapid growth in the number of text documents, document clustering has become one of the main techniques for organizing large document collections into a small number of meaningful clusters. However, document clustering still faces several challenges, such as high dimensionality, scalability, accuracy, meaningful cluster labels, and the extraction of semantics from text. To improve the quality of document clustering results, we propose an effective Fuzzy Frequent Itemset-based Document Clustering (F2IDC) approach that combines fuzzy association rule mining with the background knowledge embedded in WordNet. A term hierarchy generated from WordNet is used to discover fuzzy frequent itemsets, which serve as candidate cluster labels for grouping documents. We have conducted experiments to evaluate our approach on the Reuters-21578 dataset. The experimental results show that the proposed method achieves better clustering accuracy than FIHC, HFTC, and UPGMA.
