FP-growth approach for document clustering

As the amount of text data stored in computer repositories grows every day, a reliable way to group or categorize text documents is needed more than ever. Most existing document clustering techniques cluster documents using a group of keywords from each document. In this thesis, we use a sense-based approach to cluster documents instead of relying only on keyword frequency: we use the relationships between the keywords. The relationships are retrieved from the WordNet ontology and represented in the form of a graph. The document-graphs, which reflect the essence of the documents, are searched to find the frequent subgraphs. To discover the frequent subgraphs, we use the Frequent Pattern Growth (FP-growth) approach, which was originally designed to discover frequent patterns. The common frequent subgraphs discovered by the FP-growth approach are later used to cluster the documents. The FP-growth approach requires the creation of an FP-tree. Mining an FP-tree created for a conventional transaction database is easier than mining one built from large document-graphs, mostly because the itemsets in a transaction database are smaller than the edge lists of our document-graphs. The original FP-tree mining procedure is also easier because the items of a traditional transaction database are stand-alone entities with no direct connection to each other. In contrast, when we look for subgraphs in graphs, the items (edges) become related to each other through connectivity. This computational cost makes the original FP-growth approach somewhat inefficient for text documents. We modify the FP-growth approach so that frequent subgraphs can be generated from the FP-tree, and we then cluster documents using these subgraphs.
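To make the idea concrete, the following is a minimal sketch, not the modified algorithm developed in the thesis. It assumes each document-graph is given as a set of undirected edges stored as sorted tuples, treats every edge as an item, runs plain FP-growth over the edge lists, and then keeps only the frequent edge sets that form a connected subgraph. The names `Node`, `build_tree`, `fp_growth`, `is_connected`, and `frequent_subgraphs` are hypothetical, and the connectivity check is applied as a simple post-filter, which is an assumption made for illustration.

```python
from collections import defaultdict


class Node:
    """One FP-tree node: an item (here, an edge) with a count and a parent link."""
    def __init__(self, item, parent):
        self.item = item
        self.parent = parent
        self.count = 0
        self.children = {}


def build_tree(transactions, min_support):
    """Build an FP-tree from (items, count) pairs; return (header table, supports)."""
    support = defaultdict(int)
    for items, count in transactions:
        for item in set(items):
            support[item] += count
    frequent = {i: s for i, s in support.items() if s >= min_support}
    root = Node(None, None)
    header = defaultdict(list)  # item -> all tree nodes holding that item
    for items, count in transactions:
        # Keep frequent items only, ordered by descending support (ties by item).
        ordered = sorted((i for i in set(items) if i in frequent),
                         key=lambda i: (-frequent[i], i))
        node = root
        for item in ordered:
            child = node.children.get(item)
            if child is None:
                child = Node(item, node)
                node.children[item] = child
                header[item].append(child)
            child.count += count
            node = child
    return header, frequent


def fp_growth(transactions, min_support, suffix=frozenset()):
    """Yield (itemset, support) for every frequent itemset."""
    header, frequent = build_tree(transactions, min_support)
    for item, sup in frequent.items():
        pattern = suffix | {item}
        yield pattern, sup
        # Conditional pattern base: the prefix path of each node holding `item`.
        conditional = []
        for node in header[item]:
            path, parent = [], node.parent
            while parent is not None and parent.item is not None:
                path.append(parent.item)
                parent = parent.parent
            if path:
                conditional.append((path, node.count))
        yield from fp_growth(conditional, min_support, pattern)


def is_connected(edges):
    """True if the edge set forms a single connected subgraph."""
    edges = list(edges)
    if not edges:
        return False
    adjacency = defaultdict(set)
    for u, v in edges:
        adjacency[u].add(v)
        adjacency[v].add(u)
    start = edges[0][0]
    seen, stack = {start}, [start]
    while stack:
        for neighbour in adjacency[stack.pop()] - seen:
            seen.add(neighbour)
            stack.append(neighbour)
    return len(seen) == len(adjacency)


def frequent_subgraphs(document_graphs, min_support):
    """Map each frequent, connected edge set to the number of documents containing it."""
    transactions = [(edges, 1) for edges in document_graphs]
    return {pattern: sup
            for pattern, sup in fp_growth(transactions, min_support)
            if is_connected(pattern)}
```

For example, given three small document-graphs in which the edge ("animal", "dog") appears in all three and ("animal", "entity") and ("plant", "tree") appear in two each, a minimum support of 2 keeps the pair {("animal", "dog"), ("animal", "entity")} because the two edges share a vertex, but discards {("animal", "dog"), ("plant", "tree")}, which is frequent as an itemset yet disconnected as a graph. This is the sense in which connectivity makes edge items dependent on each other, unlike the stand-alone items of a transaction database.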
