An Efficient Clustering Algorithm for Text Mining Using Greedy Approach

Text clustering is a text mining technique that groups text documents into clusters based on similarity of content. Organizing documents this way makes them easier to understand, easier to search for relevant information, easier to process, and more efficient in their use of communication bandwidth and storage space. The clustering problem can be defined as follows: given a dataset of N records, each of dimensionality d, partition the data into subsets such that a specific criterion is optimized. The most widely used criterion is distortion. Each record is assigned to a single cluster, and distortion is the average squared Euclidean distance between a record and its corresponding cluster center. Optimizing this criterion therefore minimizes the sum of the squared distances of each record from its corresponding center. A greedy approach has been proposed to address this clustering problem. Global K-Means clustering is used to minimize the distortion by partitioning the data into k non-overlapping regions identified by their centers. K-Means is arguably the most popular text clustering algorithm. However, like other algorithms, it has its own weaknesses. We explore the K-Means algorithm and its variants and discuss their appropriateness for text clustering. Finally, the results are discussed with respect to the choice of Global K-Means for text clustering.
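The distortion criterion and the greedy construction of centers can be illustrated with a minimal sketch. It assumes the incremental form of Global K-Means as it is commonly described: start from the single center at the data mean, then add one center at a time, trying every record as the starting position of the new center and keeping the candidate with the lowest distortion. The abstract does not give the authors' exact procedure, so the function names (`distortion`, `kmeans`, `global_kmeans`) and the stopping rule are illustrative assumptions, and records are assumed to be already converted to numeric vectors.

```python
from typing import List, Sequence

Point = Sequence[float]


def sq_dist(a: Point, b: Point) -> float:
    """Squared Euclidean distance between two records."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def distortion(points: List[Point], centers: List[Point]) -> float:
    """Average squared distance from each record to its nearest center."""
    return sum(min(sq_dist(p, c) for c in centers) for p in points) / len(points)


def kmeans(points: List[Point], centers: List[Point], max_iter: int = 100) -> List[List[float]]:
    """Plain K-Means (Lloyd iterations) starting from the given centers."""
    centers = [list(c) for c in centers]
    for _ in range(max_iter):
        # Assignment step: index of the nearest center for every record.
        labels = [
            min(range(len(centers)), key=lambda j: sq_dist(p, centers[j]))
            for p in points
        ]
        # Update step: move each center to the mean of its members.
        new_centers = []
        for j, old in enumerate(centers):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:
                new_centers.append([sum(col) / len(members) for col in zip(*members)])
            else:
                new_centers.append(old)  # keep an empty cluster's center in place
        if new_centers == centers:  # assumed stopping rule: centers stopped moving
            break
        centers = new_centers
    return centers


def global_kmeans(points: List[Point], k: int) -> List[List[float]]:
    """Greedy, incremental construction of k centers (illustrative sketch)."""
    # With one cluster, the distortion-minimizing center is the data mean.
    centers = [[sum(col) / len(points) for col in zip(*points)]]
    for _ in range(2, k + 1):
        best_err, best_centers = None, None
        # Greedy step: try every record as the initial position of the new center.
        for p in points:
            candidate = kmeans(points, centers + [list(p)])
            err = distortion(points, candidate)
            if best_err is None or err < best_err:
                best_err, best_centers = err, candidate
        centers = best_centers
    return centers
```

Because each added center requires one K-Means run per record, this sketch trades speed for independence from random initialization; it is meant to show the idea, not to scale to large document collections.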
