Text Similarity Estimation Based on Word Embeddings and Matrix Norms for Targeted Marketing

The prevalent way to estimate the similarity of two documents based on word embeddings is to apply the cosine similarity measure to the two centroids obtained from the embedding vectors associated with the words in each document. Motivated by an industrial application from the domain of youth marketing, where this approach produced only mediocre results, we propose an alternative way of combining the word vectors using matrix norms. The evaluation shows superior results for most of the investigated matrix norms in comparison to both the classical cosine measure and several other document similarity estimates.
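To make the contrast concrete, the following is a minimal sketch of both ideas, not the authors' implementation. It assumes each document is given as a list of word embedding vectors. The centroid baseline follows the description above. The matrix-norm variant is only one plausible reading: it takes the Frobenius norm of the matrix of pairwise dot products between the two documents' word vectors and normalizes it by the corresponding self-similarity norms. The function names, the choice of the Frobenius norm, and the normalization are assumptions made for illustration.

```python
import math


def centroid(vectors):
    """Component-wise mean of a list of equal-length vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def cosine(u, v):
    """Cosine of the angle between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def centroid_similarity(doc_a, doc_b):
    """Baseline: cosine similarity of the two document centroids."""
    return cosine(centroid(doc_a), centroid(doc_b))


def frobenius_cross_norm(doc_a, doc_b):
    """Frobenius norm of the matrix of pairwise word-vector dot products."""
    return math.sqrt(
        sum(
            sum(x * y for x, y in zip(a, b)) ** 2
            for a in doc_a
            for b in doc_b
        )
    )


def matrix_norm_similarity(doc_a, doc_b):
    """Assumed variant: cross-document norm scaled by the self-similarity norms.

    Zero-only documents are not handled and would cause a division by zero.
    """
    cross = frobenius_cross_norm(doc_a, doc_b)
    self_a = frobenius_cross_norm(doc_a, doc_a)
    self_b = frobenius_cross_norm(doc_b, doc_b)
    return cross / math.sqrt(self_a * self_b)


# Toy two-dimensional "embeddings", invented purely for illustration.
doc_a = [[1.0, 0.0], [0.0, 1.0]]
doc_c = [[1.0, 0.0], [1.0, 0.0]]

print(centroid_similarity(doc_a, doc_c))
print(matrix_norm_similarity(doc_a, doc_c))
```

In this sketch the centroid baseline collapses each document to a single vector before comparing, whereas the matrix-norm variant aggregates all pairwise word similarities, so information about individual word pairs is not averaged away first.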
