Text document clustering using global term context vectors

Despite the advantages of the traditional vector space model (VSM) representation, it has known deficiencies that stem from the term independence assumption. The high dimensionality and sparsity of the text feature space, and phenomena such as polysemy and synonymy, can only be handled if a way is provided to measure term similarity. Many approaches have been proposed that map document vectors onto a new feature space where learning algorithms can achieve better solutions. This paper presents the global term context vector-VSM (GTCV-VSM) method for text document representation. It extends VSM in four steps: (i) it captures local contextual information for each term occurrence in the term sequences of documents; (ii) it combines the local contexts of all occurrences of a term to define the global context of that term; (iii) it constructs a proper semantic matrix from the global contexts of all terms; (iv) it uses this matrix to linearly map traditional VSM (Bag of Words, BOW) document vectors onto a ‘semantically smoothed’ feature space where problems such as text document clustering can be solved more efficiently. We present an experimental study demonstrating that clustering results improve when the proposed GTCV-VSM representation is used instead of traditional VSM-based approaches.
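Steps (i) to (iv) can be illustrated with a minimal sketch. The abstract does not specify how local contexts are formed, weighted, or combined, so everything concrete here is an assumption made for illustration: a fixed symmetric window of size `window`, raw co-occurrence counts as the local context, summation followed by L2 normalization as the global context, and the mapping d' = S d with the global context vectors as rows of S. The function names (`global_term_context_vectors`, `bow_vector`, `smooth`, `cosine`) and the toy corpus are hypothetical and are not taken from the paper.

```python
import math
from collections import Counter


def global_term_context_vectors(docs, vocab, window=2):
    """Steps (i)-(iii), under the assumptions stated above.

    Returns a V x V matrix whose i-th row is the L2-normalized sum of the
    windowed co-occurrence counts over all occurrences of term i.
    """
    index = {term: i for i, term in enumerate(vocab)}
    size = len(vocab)
    matrix = [[0.0] * size for _ in range(size)]
    for tokens in docs:
        for pos, term in enumerate(tokens):
            if term not in index:
                continue
            lo = max(0, pos - window)
            hi = min(len(tokens), pos + window + 1)
            # Local context of this occurrence: the terms inside the window
            # (the term itself included), added into the term's global context.
            for ctx in tokens[lo:hi]:
                if ctx in index:
                    matrix[index[term]][index[ctx]] += 1.0
    for row in matrix:
        norm = math.sqrt(sum(x * x for x in row))
        if norm > 0.0:
            for j in range(size):
                row[j] /= norm
    return matrix


def bow_vector(tokens, vocab):
    """Plain term-frequency (BOW) vector over the given vocabulary."""
    counts = Counter(tokens)
    return [float(counts[term]) for term in vocab]


def smooth(bow, semantic_matrix):
    """Step (iv): linear map d' = S d (assumed form of the mapping)."""
    return [sum(s * x for s, x in zip(row, bow)) for row in semantic_matrix]


def cosine(a, b):
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na > 0.0 and nb > 0.0 else 0.0


# Toy corpus: "car" and "automobile" never co-occur but share a context.
vocab = ["car", "automobile", "engine", "road"]
corpus = [["car", "engine", "road"], ["automobile", "engine", "road"]]
S = global_term_context_vectors(corpus, vocab, window=2)

bow_car = bow_vector(["car"], vocab)
bow_auto = bow_vector(["automobile"], vocab)
smooth_car = smooth(bow_car, S)
smooth_auto = smooth(bow_auto, S)

print(cosine(bow_car, bow_auto))        # similarity in the plain BOW space
print(cosine(smooth_car, smooth_auto))  # similarity after the semantic mapping
```

In this toy case the two single-term documents are orthogonal in the BOW space, yet they have a positive cosine similarity after the mapping because "car" and "automobile" occur in similar contexts. This is the kind of synonymy effect the abstract attributes to measuring term similarity. The smoothed vectors could then be passed to any standard clustering algorithm.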
