A novel method based on symbolic regression for interpretable semantic similarity measurement

Abstract Automatically measuring semantic similarity between textual expressions means calculating, in line with human judgment, the degree of likeness between two text fragments that have few or no features in common. In recent times, several machine learning methods have established a new state of the art in terms of accuracy, but little or no attention has been paid to their interpretability, i.e., the extent to which an end user can understand what causes the output of these approaches. Although interpretable solutions based on symbolic regression already exist in the field of clustering, we propose here a new approach that can reach high levels of interpretability without sacrificing accuracy in the context of semantic textual similarity. A complete empirical evaluation using several benchmark datasets shows that our approach yields promising results in a wide range of scenarios.
