Challenging the Boundaries of Unsupervised Learning for Semantic Similarity

Semantic analysis plays a crucial role in text analytics research. Calculating the semantic similarity between sentences is a long-standing problem in natural language processing, and it varies significantly with the domain of operation. In this paper, we present a methodology that can be applied across multiple domains by incorporating corpus-based statistics into a standardized semantic similarity algorithm. To calculate the semantic similarity between words and sentences, the proposed method follows an edge-based approach using a lexical database. When tested on both benchmark standards and a mean human similarity dataset, the methodology achieves a high correlation for both word similarity ($r = 0.8753$) and sentence similarity ($r = 0.8793$) with respect to the Rubenstein and Goodenough standard, and for the SICK dataset ($r = 0.8324$, after eliminating the outliers, which constitute 3.75% of the 4927 statement pairs), outperforming other unsupervised models.
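As a rough illustration of what an edge-based measure and a correlation-based evaluation involve, the following minimal sketch counts edges between two words in a tiny hand-made taxonomy and computes Pearson's r between two score lists. The taxonomy, the function names, and the 1 / (1 + distance) scoring are assumptions made for this example only; they are not the proposed algorithm, which additionally relies on a full lexical database and corpus statistics.

```python
from math import sqrt

# Hypothetical toy taxonomy (child -> parent); a real system would
# query a lexical database instead of this hand-made dictionary.
HYPERNYM = {
    "car": "vehicle",
    "bicycle": "vehicle",
    "vehicle": "artifact",
    "fork": "artifact",
    "artifact": "entity",
    "bird": "animal",
    "animal": "entity",
}


def path_to_root(word):
    """Return the chain of nodes from `word` up to the taxonomy root."""
    path = [word]
    while path[-1] in HYPERNYM:
        path.append(HYPERNYM[path[-1]])
    return path


def edge_distance(a, b):
    """Number of edges between a and b via their lowest common ancestor."""
    depth_b = {node: i for i, node in enumerate(path_to_root(b))}
    for i, node in enumerate(path_to_root(a)):
        if node in depth_b:
            return i + depth_b[node]
    raise ValueError("words share no common ancestor")


def edge_similarity(a, b):
    """Illustrative scoring choice: shorter paths give higher similarity."""
    return 1.0 / (1.0 + edge_distance(a, b))


def pearson_r(xs, ys):
    """Pearson correlation between two equally long score lists."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)
```

In an evaluation of this kind, `pearson_r` would be applied to the scores produced by the similarity function and the corresponding mean human ratings for the same word or sentence pairs.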
