Word importance-based similarity of documents metric (WISDM): Fast and scalable document similarity metric for analysis of scientific documents

We present the Word importance-based similarity of documents metric (WISDM), a novel, fast, and scalable method for computing document similarity/distance in the analysis of scientific documents. It builds on recent advances in word embeddings, combining learned word vectors with traditional count-based models to achieve near state-of-the-art precision at substantially lower computational cost. The method first selects, from each of two text documents, the words that carry the most information, forming one word set per document. It then uses an existing word embeddings model to obtain vector representations of the selected words. In the final step, it arranges each set of word vectors into a matrix and computes the closeness of the two matrices using a correlation coefficient. The metric was evaluated on three tasks relevant to the analysis of scientific documents, across three data sets of open-access scientific research. The results demonstrate that WISDM achieves a significant speed-up over state-of-the-art metrics with only a marginal drop in precision.
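The three-step pipeline described above can be illustrated with a minimal sketch. This is not the authors' implementation: it assumes tf-idf as the word-importance criterion and the RV coefficient as the matrix correlation measure, and all function names (`top_k_words`, `rv_coefficient`, `wisdm_similarity`) as well as the `embeddings` and `idf` lookups are hypothetical stand-ins for whatever word-weighting scheme and pretrained embedding model (e.g. word2vec or GloVe) is actually used.

```python
import numpy as np
from collections import Counter

def top_k_words(tokens, idf, k=20):
    # Step 1 (assumed): rank words by a tf-idf score and keep the top k.
    tf = Counter(tokens)
    ranked = sorted(tf, key=lambda w: tf[w] * idf.get(w, 0.0), reverse=True)
    return ranked[:k]

def rv_coefficient(X, Y):
    # Step 3 (assumed): RV coefficient between two matrices with the same
    # number of rows; both are column-centred first.
    X = X - X.mean(axis=0)
    Y = Y - Y.mean(axis=0)
    Sx, Sy = X @ X.T, Y @ Y.T
    num = np.trace(Sx @ Sy)
    den = np.sqrt(np.trace(Sx @ Sx) * np.trace(Sy @ Sy))
    return num / den

def wisdm_similarity(doc_a, doc_b, embeddings, idf, k=20):
    # Step 2: look up vectors for the selected words and stack them
    # into one (num_words x embedding_dim) matrix per document.
    A = np.array([embeddings[w] for w in top_k_words(doc_a, idf, k) if w in embeddings])
    B = np.array([embeddings[w] for w in top_k_words(doc_b, idf, k) if w in embeddings])
    n = min(len(A), len(B))  # the RV coefficient needs equal row counts
    return rv_coefficient(A[:n], B[:n])
```

Under this formulation, a document compared against itself scores exactly 1, and unrelated documents score closer to 0; the cost is dominated by the k×k matrix products rather than a pairwise transport problem, which is the source of the speed-up over optimal-transport-style metrics.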
