Using Mahout for Clustering Wikipedia's Latest Articles: A Comparison between K-means and Fuzzy C-means in the Cloud

This paper compares k-means and fuzzy c-means for clustering a large, noisy, realistic dataset. We made the comparison using a free cloud computing solution, Apache Mahout/Hadoop, and Wikipedia's latest articles. In the past, the use of these two algorithms was restricted to small datasets. As a result, studies were based on artificial datasets that do not represent a real document clustering situation. In this ongoing research, we found that on a noisy dataset fuzzy c-means can lead to worse cluster quality than k-means. We also found that k-means does not always converge faster. Finally, we found that Mahout is a promising clustering technology, but its preprocessing tools are not developed enough for efficient dimensionality reduction. From our experience, the use of Apache Mahout is premature.
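To make the difference between the two algorithms concrete, the following minimal sketch contrasts the hard assignment step of k-means with the soft membership step of fuzzy c-means. It is an illustration in plain Python, not the Mahout implementation used in the paper; the function names, the toy points, the fixed centers, and the fuzzifier value m = 2 are all assumptions chosen for the example.

```python
import math


def dist(a, b):
    """Euclidean distance between two points given as tuples."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def kmeans_assign(points, centers):
    """Hard assignment: each point gets the index of its nearest center."""
    return [
        min(range(len(centers)), key=lambda j: dist(p, centers[j]))
        for p in points
    ]


def fcm_memberships(points, centers, m=2.0):
    """Soft assignment: u[i][j] = 1 / sum_k (d_ij / d_ik) ** (2 / (m - 1)).

    m > 1 is the fuzzifier (m = 2.0 is an assumed, commonly used value).
    Each row of the result sums to 1.
    """
    exponent = 2.0 / (m - 1.0)
    memberships = []
    for p in points:
        d = [dist(p, c) for c in centers]
        if any(x == 0.0 for x in d):
            # The point coincides with one or more centers: split the full
            # membership among those centers to avoid dividing by zero.
            hits = [1.0 if x == 0.0 else 0.0 for x in d]
            total = sum(hits)
            memberships.append([h / total for h in hits])
            continue
        row = [
            1.0 / sum((d[j] / d[k]) ** exponent for k in range(len(centers)))
            for j in range(len(centers))
        ]
        memberships.append(row)
    return memberships


# Toy data (assumed): two tight groups plus one point halfway between them.
points = [(0.0, 0.0), (1.0, 0.0), (4.0, 0.0), (5.0, 0.0), (2.5, 0.0)]
centers = [(0.5, 0.0), (4.5, 0.0)]

labels = kmeans_assign(points, centers)
u = fcm_memberships(points, centers)

for p, label, row in zip(points, labels, u):
    print(p, "k-means cluster:", label, "fuzzy memberships:", [round(v, 3) for v in row])
```

The point halfway between the two centers is forced into a single cluster by k-means, whereas fuzzy c-means spreads its membership evenly across both. On noisy text data many documents sit in such ambiguous positions, which gives one intuition for why soft memberships do not automatically translate into better cluster quality.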
