AN EFFICIENT ASSOCIATION RULE BASED HIERARCHICAL ALGORITHM FOR TEXT CLUSTERING

The amount of information available today has grown too large to handle easily, and whether we are actually retrieving useful information from it remains an open question. Text clustering is one technique that helps organize information so that it can be obtained more efficiently. This paper presents a new technique for clustering text documents based on association rules. In this approach, the text documents are preprocessed and the associations between the text files are found using the Apriori algorithm. The associated text files are then clustered using a hierarchical clustering algorithm. For comparison, the text files are also clustered using the hierarchical algorithm alone, and the results of both methods are evaluated. The algorithms are tested on the benchmark data set Reuters-21578. The experimental results show that the Association Rule Based Hierarchical Clustering method (ARBHC) produces better results and improved cluster quality compared with the plain hierarchical method.

KEYWORDS: Text clustering, Association rule, Hierarchical algorithm, Apriori.
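The pipeline described above (preprocess, mine associations with Apriori, then cluster hierarchically) can be illustrated with a minimal sketch. It is not the authors' implementation, and several details are assumptions made only for illustration: each term is treated as a transaction whose items are the ids of the documents containing it, so a frequent pair of document ids stands for two "associated" text files; the support count of that pair is used as the similarity for single-link agglomerative clustering; and the names `preprocess`, `apriori`, `agglomerate`, `STOPWORDS`, the toy documents, and `min_support=2` are all hypothetical.

```python
from itertools import combinations

STOPWORDS = {"the", "a", "of", "in", "and", "to", "on"}  # hypothetical tiny stopword list

def preprocess(text):
    """Lowercase, split on whitespace, keep alphabetic non-stopword terms."""
    return {w for w in text.lower().split() if w.isalpha() and w not in STOPWORDS}

def apriori(transactions, min_support):
    """Return {itemset: support count} for itemsets found in >= min_support transactions."""
    counts = {}
    for t in transactions:
        for item in t:
            key = frozenset([item])
            counts[key] = counts.get(key, 0) + 1
    current = {s: c for s, c in counts.items() if c >= min_support}
    frequent = dict(current)
    k = 2
    while current:
        # Join step: merge frequent (k-1)-itemsets into candidate k-itemsets.
        candidates = {a | b for a, b in combinations(current, 2) if len(a | b) == k}
        # Prune step: every (k-1)-subset of a candidate must itself be frequent.
        candidates = {c for c in candidates
                      if all(frozenset(s) in current for s in combinations(c, k - 1))}
        current = {}
        for c in candidates:
            n = sum(1 for t in transactions if c <= t)
            if n >= min_support:
                current[c] = n
        frequent.update(current)
        k += 1
    return frequent

def agglomerate(n_docs, pair_support, n_clusters):
    """Single-link agglomerative clustering; similarity = support of a document pair."""
    clusters = [{i} for i in range(n_docs)]

    def sim(a, b):
        return max(pair_support.get(frozenset((i, j)), 0) for i in a for j in b)

    while len(clusters) > n_clusters:
        i, j = max(combinations(range(len(clusters)), 2),
                   key=lambda p: sim(clusters[p[0]], clusters[p[1]]))
        if sim(clusters[i], clusters[j]) == 0:
            break  # no associated documents left to merge
        clusters[i] |= clusters[j]
        del clusters[j]  # safe: i < j, so index i is unaffected
    return clusters

# Toy documents (assumed for illustration; not taken from Reuters-21578).
docs = [
    "oil prices rise crude market",
    "crude oil market prices fall",
    "wheat grain harvest export",
    "grain wheat export tariff",
]
terms_per_doc = [preprocess(d) for d in docs]

# Assumption: one transaction per term, holding the ids of documents that contain it.
vocabulary = set().union(*terms_per_doc)
transactions = [frozenset(i for i, terms in enumerate(terms_per_doc) if term in terms)
                for term in vocabulary]

frequent = apriori(transactions, min_support=2)
pair_support = {s: c for s, c in frequent.items() if len(s) == 2}
clusters = agglomerate(len(docs), pair_support, n_clusters=2)
print(clusters)  # [{0, 1}, {2, 3}]
```

In this toy run the two oil documents share four terms and the two grain documents share three, so Apriori reports those two document pairs as frequent and the clustering step merges each pair. A realistic setup would need proper tokenization, stemming, and a support threshold tuned to the corpus.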
