Enriching Document Representation with the Deviations of Word Co-occurrence Frequencies

Recent strategies enrich a document's representation with the relatedness of every word in the collection to that document, so as to reveal the semantic relatedness between documents. When this relatedness is restricted to the expected frequency with which each word would occur in the document, the traditional weighted sum of word vectors is proven to give upper bounds on these expected frequencies. However, duplicate counts typically arise when the word vectors are summed, which weakens the discriminative power of the enriched document vectors. A complementary strategy that yields lower bounds on the expected frequencies is obtained by keeping the maximum value of the word vectors on each dimension. Combining the lower bounds with the deviations of word co-occurrence frequencies, a novel method is proposed to remove the duplicate counts present in the upper bounds. As a result, the proposed method smooths the generated document vectors better than the weighted-sum strategy. Extensive experiments verify that document clustering incorporating the proposed method achieves a significant performance improvement over existing strategies.
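The two bounds described above can be sketched numerically. In the toy example below, the co-occurrence matrix, the term frequencies, and the mixing weight `alpha` are all made-up illustrations; the actual paper derives its correction from the deviations of co-occurrence frequencies, not from a fixed interpolation.

```python
import numpy as np

# Toy word co-occurrence vectors: rows = words in the document,
# columns = vocabulary dimensions. C[i, j] approximates how often
# word i co-occurs with word j in the collection.
C = np.array([
    [2.0, 1.0, 0.0, 1.0],
    [1.0, 3.0, 1.0, 0.0],
    [0.0, 1.0, 2.0, 1.0],
])

# Term frequencies of the three words in one document.
tf = np.array([2.0, 1.0, 1.0])

# Upper bound: weighted sum of the word vectors. Contributions to the
# same dimension are accumulated, so duplicate counts inflate it.
upper = tf @ C

# Lower bound: element-wise maximum of the weighted word vectors,
# keeping only the single largest contribution per dimension.
lower = np.max(tf[:, None] * C, axis=0)

# Loosely in the spirit of the paper: smooth between the bounds so that
# duplicate counts in the sum are discounted. alpha is a placeholder.
alpha = 0.5
enriched = alpha * upper + (1 - alpha) * lower

print(upper)     # [5. 6. 3. 3.]
print(lower)     # [4. 3. 2. 2.]
print(enriched)  # [4.5 4.5 2.5 2.5]
```

The enriched vector lies between the two bounds on every dimension, which is the smoothing effect the abstract refers to.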
