W-kmeans: Clustering News Articles Using WordNet

Document clustering is a powerful technique that has been widely used for organizing data into smaller, more manageable information kernels. Several approaches have been proposed, but they suffer from problems such as synonymy, ambiguity, and the lack of descriptive labels for the generated clusters. We propose an enhancement of the standard k-means algorithm that uses external knowledge from WordNet hypernyms in two ways: enriching the "bag of words" before the clustering process, and assisting the label generation procedure after it. Our experiments revealed a significant improvement over standard k-means on a corpus of news articles derived from major news portals. Moreover, the cluster labeling process generates useful, high-quality cluster tags.
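The twofold use of hypernyms can be illustrated with a minimal sketch. It is not the paper's implementation: the abstract does not say how hypernyms are selected or weighted, so `TOY_HYPERNYMS` (a hand-written stand-in for WordNet lookups), `enrich`, `cosine`, and `label_cluster` are all hypothetical names chosen for illustration, and the k-means step itself is omitted.

```python
from collections import Counter

# Hypothetical stand-in for WordNet hypernym lookups (not real WordNet data).
TOY_HYPERNYMS = {
    "striker": ["player"],
    "goalkeeper": ["player"],
    "senator": ["politician"],
}


def enrich(tokens, hypernyms=TOY_HYPERNYMS):
    """Build a bag of words and add one count per hypernym of each token."""
    bag = Counter(tokens)
    for token in tokens:
        for hyper in hypernyms.get(token, []):
            bag[hyper] += 1
    return bag


def cosine(a, b):
    """Cosine similarity between two bags of words (Counter objects)."""
    dot = sum(count * b[term] for term, count in a.items() if term in b)
    norm_a = sum(v * v for v in a.values()) ** 0.5
    norm_b = sum(v * v for v in b.values()) ** 0.5
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def label_cluster(bags, hypernyms=TOY_HYPERNYMS, top_n=1):
    """Label a cluster with its most frequent hypernym terms (an assumed rule)."""
    hyper_terms = {h for hs in hypernyms.values() for h in hs}
    totals = Counter()
    for bag in bags:
        for term, count in bag.items():
            if term in hyper_terms:
                totals[term] += count
    return [term for term, _ in totals.most_common(top_n)]


doc_a = ["striker", "scores"]
doc_b = ["goalkeeper", "saves"]

plain_similarity = cosine(Counter(doc_a), Counter(doc_b))
enriched_similarity = cosine(enrich(doc_a), enrich(doc_b))
cluster_label = label_cluster([enrich(doc_a), enrich(doc_b)])
```

In this toy case the two documents share no surface words, so their plain similarity is zero, while the shared hypernym gives the enriched bags a non-zero similarity and also supplies a candidate cluster label.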
