Knowledge-based vector space model for text clustering

This paper presents a new knowledge-based vector space model (VSM) for text clustering. In the new model, semantic relationships between terms (e.g., words or concepts) are incorporated into the vector representation of text documents. The aim is to calculate the dissimilarity between two documents more effectively so that text clustering results improve. The semantic relationship between two terms is defined by their similarity, and this similarity is used to re-weight term frequencies in the VSM. We study two similarity measures for computing the semantic relationship between two terms, each based on a different approach. The first approach relies on existing ontologies such as WordNet and MeSH: we define a new similarity measure that combines the edge-counting technique, the average distance and the position weighting method to compute the similarity of two terms from an ontology hierarchy. The second approach uses text corpora to construct the relationships between terms and then calculates their semantic similarities. Three clustering algorithms (bisecting k-means, feature weighting k-means and a hierarchical clustering algorithm) were used to cluster real-world text data represented in the new knowledge-based VSM. The experimental results show that clustering performance with the new model was substantially better than with the traditional term-based VSM.
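The re-weighting idea can be illustrated with a minimal sketch. The abstract does not give the exact formula, so the scheme below is an assumption: each term's new weight is its own frequency plus the frequencies of other terms scaled by their similarity to it. The function names, the toy vocabulary and the hand-set similarity values are all hypothetical, and cosine dissimilarity stands in for whatever document dissimilarity the paper actually uses.

```python
import math

def reweight(tf, sim):
    """Re-weight term frequencies with a term-term similarity matrix.

    Assumed scheme (not taken from the paper): new_tf[i] = sum_j sim[i][j] * tf[j],
    with sim[i][i] == 1 so a term always keeps its own frequency.
    """
    n = len(tf)
    return [sum(sim[i][j] * tf[j] for j in range(n)) for i in range(n)]

def cosine_dissimilarity(u, v):
    """Return 1 - cosine similarity; 1.0 if either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 1.0
    return 1.0 - dot / (norm_u * norm_v)

# Toy vocabulary: ["car", "automobile", "banana"].
# Similarity values are made up for illustration; in practice they would
# come from an ontology-based or corpus-based measure.
SIM = [
    [1.0, 0.9, 0.0],
    [0.9, 1.0, 0.0],
    [0.0, 0.0, 1.0],
]

doc_a = [2, 0, 0]  # mentions "car" twice
doc_b = [0, 3, 0]  # mentions "automobile" three times

# Term-based VSM: the documents share no term, so they look unrelated.
plain = cosine_dissimilarity(doc_a, doc_b)

# Knowledge-based VSM (sketch): synonyms pull the documents together.
semantic = cosine_dissimilarity(reweight(doc_a, SIM), reweight(doc_b, SIM))

print(f"term-based dissimilarity:      {plain:.4f}")
print(f"knowledge-based dissimilarity: {semantic:.4f}")
```

In this toy case the two documents share no terms, so the term-based vectors are orthogonal, while the re-weighted vectors are nearly parallel because "car" and "automobile" are treated as highly similar. Any clustering algorithm that operates on document vectors could then be run on the re-weighted vectors without modification.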
