Document Topic Generation in Text Mining by Using Cluster Analysis with EROCK

Clustering is a useful technique in textual data mining. Cluster analysis divides objects into meaningful groups based on the similarity between them. The World Wide Web (WWW) returns copious material in response to any user-provided query, and manually extracting the information actually required from this material is tedious for the user. This paper proposes a scheme that addresses this problem with the help of cluster analysis. In particular, the ROCK algorithm is studied and modified. ROCK generates better clusters than other clustering algorithms for data with categorical attributes. We present an enhanced version of ROCK, called Enhanced ROCK (EROCK), with an improved similarity measure and better storage efficiency. An evaluation of the proposed algorithm on standard text documents shows improved performance.
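
To make the link-based idea behind ROCK concrete, the following minimal Python sketch computes neighbor sets and link counts for documents represented as sets of terms. It illustrates the original ROCK notion of links (the number of neighbors two points share), not the EROCK similarity measure, which is not specified here. The function names, the use of Jaccard similarity, the threshold `theta`, the toy documents, and the choice not to count a document as its own neighbor are all assumptions made for illustration.

```python
from itertools import combinations


def jaccard(a, b):
    """Jaccard similarity between two sets of terms."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def neighbors(docs, theta):
    """Map each document index to the indices whose similarity is >= theta.

    Assumption: a document is not counted as its own neighbor.
    """
    nbrs = {i: set() for i in range(len(docs))}
    for i, j in combinations(range(len(docs)), 2):
        if jaccard(docs[i], docs[j]) >= theta:
            nbrs[i].add(j)
            nbrs[j].add(i)
    return nbrs


def links(docs, theta):
    """Count the common neighbors (links) for every pair of documents."""
    nbrs = neighbors(docs, theta)
    return {
        (i, j): len(nbrs[i] & nbrs[j])
        for i, j in combinations(range(len(docs)), 2)
    }


# Hypothetical toy documents, each reduced to a set of terms.
docs = [
    {"cluster", "rock", "link"},
    {"cluster", "rock", "text"},
    {"cluster", "rock", "mining"},
    {"soccer", "goal"},
]
link_counts = links(docs, theta=0.5)
```

In this sketch, pairs of documents with many links are candidates for merging into the same cluster, while a document that shares no neighbors with the others stays apart. A full agglomerative procedure would repeatedly merge the pair of clusters with the best link-based goodness score, which is omitted here.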
