Synergistic union of Word2Vec and lexicon for domain specific semantic similarity

Semantic similarity measures are an important part of Natural Language Processing tasks. However, semantic similarity measures built for general use do not perform well within specific domains. Therefore, in this study we introduce a domain-specific semantic similarity measure created by the synergistic union of word2vec, a word embedding method used for semantic similarity calculation, and lexicon-based (lexical) semantic similarity methods. We show that the proposed methodology outperforms both word embedding methods trained on a generic corpus and word embedding methods trained on a domain-specific corpus when neither uses lexical semantic similarity methods to augment the results. Further, we show that text lemmatization can improve the performance of word embedding methods.
