HESML: A scalable ontology-based semantic similarity measures library with a set of reproducible experiments and a replication dataset

This work is a detailed companion reproducibility paper for the methods and experiments proposed by Lastra-Díaz and García-Serrano in (2015, 2016) [56–58]. It introduces the following contributions: (1) a new and efficient representation model for taxonomies, called PosetHERep, which is an adaptation of the half-edge data structure commonly used to represent discrete manifolds and planar graphs; (2) a new Java software library called the Half-Edge Semantic Measures Library (HESML), based on PosetHERep, which implements most ontology-based semantic similarity measures and Information Content (IC) models reported in the literature; (3) a set of reproducible experiments on word similarity, based on HESML and ReproZip, which aim to reproduce exactly the experimental surveys in the three aforementioned works; (4) a replication framework and dataset, called WNSimRep v1, which aims to assist the exact replication of most methods reported in the literature; and finally, (5) a set of scalability and performance benchmarks for semantic measures libraries. PosetHERep and HESML are motivated by several drawbacks of the current semantic measures libraries, especially in performance and scalability, as well as in the evaluation of new methods and the replication of most previous methods. The reproducible experiments introduced herein are motivated by the lack of a set of large, self-contained and easily reproducible experiments for replicating and confirming previously reported results. Likewise, the WNSimRep v1 dataset is motivated by the discovery of several contradictory results and difficulties in reproducing previously reported methods and experiments.
PosetHERep proposes a memory-efficient representation for taxonomies which scales linearly with the size of the taxonomy and provides an efficient implementation of most taxonomy-based algorithms used by the semantic measures and IC models, whilst HESML provides an open framework to aid research in the area by offering a simpler and more efficient software architecture than the current software libraries. Finally, we show that HESML outperforms the state-of-the-art libraries, and that their performance and scalability could be significantly improved, without caching, by using PosetHERep.
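To make the idea of a half-edge representation for taxonomies more concrete, the following is a minimal sketch, not the PosetHERep or HESML implementation. It is written in Python rather than Java for brevity, and every name in it (`HalfEdge`, `Vertex`, `Taxonomy`, `add_is_a`, `upward`) is a hypothetical choice made for illustration. It assumes that each is-a link is stored as a pair of opposite half-edges, and that the half-edges leaving a vertex are visited by repeatedly following `opposite.next`, as in the classical half-edge structure.

```python
from typing import Dict, Iterator, List, Optional, Set


class HalfEdge:
    """One direction of an is-a link; its twin is stored in `opposite`."""

    def __init__(self, target: "Vertex", upward: bool) -> None:
        self.target = target   # vertex this half-edge points to
        self.upward = upward   # True: child -> parent, False: parent -> child
        self.opposite: Optional["HalfEdge"] = None
        self.next: Optional["HalfEdge"] = None  # a half-edge leaving `target`


class Vertex:
    def __init__(self, name: str) -> None:
        self.name = name
        self.first_out: Optional[HalfEdge] = None  # any half-edge leaving this vertex

    def outgoing(self) -> Iterator[HalfEdge]:
        """Walk the cycle of half-edges leaving this vertex via opposite.next."""
        start = self.first_out
        edge = start
        while edge is not None:
            yield edge
            edge = edge.opposite.next
            if edge is start:
                break


class Taxonomy:
    """Hypothetical container: one Vertex per concept, two HalfEdges per link."""

    def __init__(self) -> None:
        self.vertices: Dict[str, Vertex] = {}
        self.half_edge_count = 0

    def vertex(self, name: str) -> Vertex:
        if name not in self.vertices:
            self.vertices[name] = Vertex(name)
        return self.vertices[name]

    def add_is_a(self, child: str, parent: str) -> None:
        c, p = self.vertex(child), self.vertex(parent)
        up, down = HalfEdge(p, True), HalfEdge(c, False)
        up.opposite, down.opposite = down, up
        self._splice(c, up)     # `up` leaves the child
        self._splice(p, down)   # `down` leaves the parent
        self.half_edge_count += 2

    @staticmethod
    def _splice(vertex: Vertex, out: HalfEdge) -> None:
        """Insert `out` into the outgoing cycle of `vertex` in constant time."""
        first = vertex.first_out
        if first is None:
            out.opposite.next = out
            vertex.first_out = out
        else:
            out.opposite.next = first.opposite.next
            first.opposite.next = out

    def parents(self, name: str) -> List[str]:
        return [e.target.name for e in self.vertices[name].outgoing() if e.upward]

    def children(self, name: str) -> List[str]:
        return [e.target.name for e in self.vertices[name].outgoing() if not e.upward]

    def ancestors(self, name: str) -> Set[str]:
        """Inclusive ancestor set, computed by traversal with no cached tables."""
        seen: Set[str] = set()
        stack = [name]
        while stack:
            current = stack.pop()
            if current not in seen:
                seen.add(current)
                stack.extend(self.parents(current))
        return seen


# Toy taxonomy, invented for this example only.
taxonomy = Taxonomy()
for child, parent in [("animal", "entity"), ("plant", "entity"),
                      ("dog", "animal"), ("cat", "animal")]:
    taxonomy.add_is_a(child, parent)

print(sorted(taxonomy.children("animal")))
print(sorted(taxonomy.ancestors("dog")))
```

In this sketch the storage is one object per concept plus two half-edges per is-a link, which is one way to obtain memory that grows linearly with the taxonomy, and queries such as the ancestor set are answered by walking the structure rather than by looking up precomputed tables.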

[1]  Montserrat Batet,et al.  Ontology-based semantic clustering , 2011, AI Commun..

[2]  Philip Resnik,et al.  Semantic Similarity in a Taxonomy: An Information-Based Measure and its Application to Problems of Ambiguity in Natural Language , 1999, J. Artif. Intell. Res..

[3]  Juan J. Lastra-Díaz,et al.  HESML_vs_SML: scalability and performance benchmarks between the HESML V1R2 and SML 0.9 semantic measures libraries , 2016 .

[4]  Ted Pedersen,et al.  Using Measures of Semantic Relatedness for Word Sense Disambiguation , 2003, CICLing.

[5]  Martin Bichler,et al.  More than bin packing: Dynamic resource allocation strategies in cloud data centers , 2015, Inf. Syst..

[6]  David Sánchez,et al.  Ontology-Based Anonymization of Categorical Values , 2010, MDAI.

[7]  Mohamed Ali Hadj Taieb,et al.  FM3S: Features-Based Measure of Sentences Semantic Similarity , 2015, HAIS.

[8]  Jorge Martínez Gil,et al.  Evolutionary algorithm based on different semantic similarity functions for synonym recognition in the biomedical domain , 2013, Knowl. Based Syst..

[9]  M. Baker 1,500 scientists lift the lid on reproducibility , 2016, Nature.

[10]  Hisham Al-Mubaid,et al.  Measuring Semantic Similarity Between Biomedical Concepts Within Multiple Ontologies , 2009, IEEE Transactions on Systems, Man, and Cybernetics, Part C (Applications and Reviews).

[11]  Junzhong Gu,et al.  A New Model of Information Content Based on Concept's Topology for Measuring Semantic Similarity in WordNet , 2012 .

[12]  David McLean,et al.  An Approach for Measuring Semantic Similarity between Words Using Multiple Information Sources , 2003, IEEE Trans. Knowl. Data Eng..

[13]  David Sánchez,et al.  Improving Semantic Relatedness Assessments: Ontologies Meet Textual Corpora , 2016, KES.

[14]  Zhongqing Yu,et al.  A New Model of Information Content for Measuring the Semantic Similarity between Concepts , 2013, 2013 International Conference on Cloud Computing and Big Data.

[15]  David Sánchez,et al.  A Review on Semantic Similarity , 2015 .

[16]  Christiane Fellbaum,et al.  Book Reviews: WordNet: An Electronic Lexical Database , 1999, CL.

[17]  Oh,et al.  SEAL — A Framework for Developing SEmantic PortALs , 2001 .

[18]  Thomas Lengauer,et al.  A new measure for functional similarity of gene products based on Gene Ontology , 2006, BMC Bioinformatics.

[19]  Thomas Lengauer,et al.  Improving disease gene prioritization using the semantic similarity of Gene Ontology terms , 2010, Bioinform..

[20]  Ted Pedersen,et al.  Evaluating measures of semantic similarity and relatedness to disambiguate terms in biomedical text , 2013, J. Biomed. Informatics.

[21]  Dennis Shasha,et al.  A collaborative approach to computational reproducibility , 2016, Inf. Syst..

[22]  Leif Kobbelt,et al.  OpenMesh: A Generic and Efficient Polygon Mesh Data Structure , 2002 .

[23]  Ted Pedersen,et al.  Extended Gloss Overlaps as a Measure of Semantic Relatedness , 2003, IJCAI.

[24]  Phillip W. Lord,et al.  Semantic Similarity in Biomedical Ontologies , 2009, PLoS Comput. Biol..

[25]  Sylvie Ranwez,et al.  The semantic measures library and toolkit: fast computation of semantic similarity and relatedness using biomedical ontologies , 2014, Bioinform..

[26]  Biswanath Dutta,et al.  A Novel Information Theoretic Framework for Finding Semantic Similarity in WordNet , 2016, ArXiv.

[27]  Junzhong Gu,et al.  A New Model of Information Content for Semantic Similarity in WordNet , 2008, 2008 Second International Conference on Future Generation Communication and Networking Symposia.

[28]  Jan Mendling,et al.  On the refactoring of activity labels in business process models , 2012, Inf. Syst..

[29]  G. Miller,et al.  Contextual correlates of semantic similarity , 1991 .

[30]  Jennifer S Trueblood,et al.  A quantum geometric model of similarity. , 2013, Psychological review.

[31]  Dirk Merkel,et al.  Docker: lightweight Linux containers for consistent development and deployment , 2014 .

[32]  J. Jośe,et al.  Intrinsic Semantic Spaces for the representation of documents and semantic annotated data , 2014 .

[33]  John B. Goodenough,et al.  Contextual correlates of synonymy , 1965, CACM.

[34]  Ted Pedersen,et al.  Using WordNet-based Context Vectors to Estimate the Semantic Relatedness of Concepts , 2006 .

[35]  Juan J. Lastra-Díaz,et al.  HESML V1R2 Java software library of ontology-based semantic similarity measures and information content models , 2016 .

[36]  Reality check on reproducibility , 2016, Nature.

[37]  Ana M. García-Serrano,et al.  A refinement of the well-founded Information Content models with a very detailed experimental survey on WordNet. , 2016 .

[38]  David Sánchez,et al.  Ontology-based information content computation , 2011, Knowl. Based Syst..

[39]  Tomonobu Ozaki,et al.  A method for supporting retrieval of articles on protein structure analysis considering users’ intention , 2011, BMC Bioinformatics.

[40]  Junzhong Gu,et al.  New model of semantic similarity measuring in wordnet , 2008, 2008 3rd International Conference on Intelligent System and Knowledge Engineering.

[41]  Junzhong Gu,et al.  Measuring Semantic Similarity of Word Pairs Using Path and Information Content , 2014 .

[42]  Mohamed Ali Hadj Taieb,et al.  SISR: System for integrating semantic relatedness and similarity measures , 2018 .

[43]  Giuseppe Pirrò,et al.  A semantic similarity metric combining features and intrinsic information content , 2009, Data Knowl. Eng..

[44]  Vasile Rus,et al.  Lemon and Tea Are Not Similar: Measuring Word-to-Word Similarity by Combining Different Methods , 2015, CICLing.

[45]  John P. A. Ioannidis,et al.  A manifesto for reproducible science , 2017, Nature Human Behaviour.

[46]  G. G. Meyer,et al.  Lecture notes in business information processing , 2009 .

[47]  Dekang Lin,et al.  An Information-Theoretic Definition of Similarity , 1998, ICML.

[48]  Max J. Egenhofer,et al.  Determining Semantic Similarity among Entity Classes from Different Ontologies , 2003, IEEE Trans. Knowl. Data Eng..

[49]  Abdelhak Imoussaten,et al.  On the consideration of a bring-to-mind model for computing the Information Content of concepts defined into ontologies , 2015, 2015 IEEE International Conference on Fuzzy Systems (FUZZ-IEEE).

[50]  Martha Palmer,et al.  Verb Semantics and Lexical Selection , 1994, ACL.

[51]  Graeme Hirst,et al.  Evaluating WordNet-based Measures of Lexical Semantic Relatedness , 2006, CL.

[52]  Dennis Shasha,et al.  ReproZip: Computational Reproducibility With Ease , 2016, SIGMOD Conference.

[53]  A. Tversky Features of Similarity , 1977 .

[54]  Xiaopei Zhang,et al.  Wikipedia-based information content and semantic similarity computation , 2017, Inf. Process. Manag..

[55]  Abdelmajid Ben Hamadou,et al.  A new semantic relatedness measurement using WordNet features , 2013, Knowledge and Information Systems.

[56]  Deyi Xiong,et al.  Semantic Similarity from Natural Language and Ontology Analysis , 2016, Computational Linguistics.

[57]  Juan J. Lastra-Díaz,et al.  HESML V1R3 Java software library of ontology-based semantic similarity measures and information content models , 2017 .

[58]  Ana M. García-Serrano,et al.  Formal concept analysis for topic detection: A clustering quality experimental analysis , 2017, Inf. Syst..

[59]  Jorge Martínez Gil CoTO: A novel approach for fuzzy aggregation of semantic similarity measures , 2016, Cogn. Syst. Res..

[60]  Abdelmajid Ben Hamadou,et al.  Taxonomy-based information content and wordnet-wiktionary-wikipedia glosses for semantic relatedness , 2015, Applied Intelligence.

[61]  Ana M. García-Serrano,et al.  Linked Data-based Conceptual Modelling for Recommendation: A FCA-Based Approach , 2014, EC-Web.

[62]  Jian-Huang Lai,et al.  Exploring information from the topology beneath the Gene Ontology terms to improve semantic similarity measures. , 2016, Gene.

[63]  Jim Euchner Design , 2014, Catalysis from A to Z.

[64]  Tony Veale,et al.  An Intrinsic Information Content Metric for Semantic Similarity in WordNet , 2004, ECAI.

[65]  Ted Pedersen,et al.  Empiricism Is Not a Matter of Faith , 2008, Computational Linguistics.

[66]  Xiao Hua Chen,et al.  A WordNet-based semantic similarity measurement combining edge-counting and information content theory , 2015, Eng. Appl. Artif. Intell..

[67]  Christiane Fellbaum,et al.  Lexical Chains as Representations of Context for the Detection and Correction of Malapropisms , 1998 .

[68]  Carole A. Goble,et al.  Investigating Semantic Similarity Measures Across the Gene Ontology: The Relationship Between Sequence and Annotation , 2003, Bioinform..

[69]  Mário J. Silva,et al.  Measuring semantic similarity between Gene Ontology terms , 2007, Data Knowl. Eng..

[70]  Antske Fokkens,et al.  Offspring from Reproduction Problems: What Replication Failure Teaches Us , 2013, ACL.

[71]  Stefania Montani,et al.  Retrieval and clustering for supporting business process adjustment and analysis , 2014, Inf. Syst..

[72]  Felix Hill,et al.  SimLex-999: Evaluating Semantic Models With (Genuine) Similarity Estimation , 2014, CL.

[73]  Juan J. Lastra-Díaz,et al.  WordNet-based word similarity reproducible experiments based on HESML V1R1 and ReproZip , 2016 .

[74]  Masaki Aono,et al.  Metric of intrinsic information content for measuring semantic similarity in an ontology , 2010, APCCM.

[75]  Markus Krötzsch,et al.  Wikidata , 2014, Commun. ACM.

[76]  Martin Bichler,et al.  Reproducible experiments on dynamic resource allocation in cloud data centers , 2016, Inf. Syst..

[77]  Abdelmajid Ben Hamadou,et al.  Ontology-based approach for measuring semantic similarity , 2014, Eng. Appl. Artif. Intell..

[78]  M. Ashburner,et al.  Gene Ontology: tool for the unification of biology , 2000, Nature Genetics.

[79]  Carlo Strapparava,et al.  Corpus-based and Knowledge-based Measures of Text Semantic Similarity , 2006, AAAI.

[80]  Ted Pedersen,et al.  Measures of semantic similarity and relatedness in the biomedical domain , 2007, J. Biomed. Informatics.

[81]  Cynthia Brandt,et al.  Semantic similarity in the biomedical domain: an evaluation across knowledge sources , 2012, BMC Bioinformatics.

[82]  David W. Conrath,et al.  Semantic Similarity Based on Corpus Statistics and Lexical Taxonomy , 1997, ROCLING/IJCLCLP.

[83]  Mark de Berg,et al.  Computational geometry: algorithms and applications , 1997 .

[84]  Ana M. García-Serrano,et al.  A novel family of IC-based similarity measures with a detailed experimental survey on WordNet , 2015, Eng. Appl. Artif. Intell..

[85]  Steffen Staab,et al.  Taxonomy Learning - Factoring the Structure of a Taxonomy into a Semantic Classification Decision , 2002, COLING.

[86]  Xiaomei Wu,et al.  Improving the Measurement of Semantic Similarity between Gene Ontology Terms and Gene Products: Insights from an Edge- and IC-Based Hybrid Method , 2013, PloS one.

[87]  Jan Mendling,et al.  Simplifying process model abstraction: Techniques for generating model names , 2014, Inf. Syst..

[88]  Samuel Fernando,et al.  A Semantic Similarity Approach to Paraphrase Detection , 2008 .

[89]  Francisco M. Couto,et al.  Enhancement of Chemical Entity Identification in Text Using Semantic Similarity Validation , 2013, PloS one.

[90]  Ted Pedersen,et al.  WordNet::Similarity - Measuring the Relatedness of Concepts , 2004, NAACL.

[91]  Sebastian Ahrndt,et al.  Design and Use of a Semantic Similarity Measure for Interoperability Among Agents , 2016, MATES.

[92]  Kurt Mehlhorn,et al.  Review of algorithms and data structures: the basic toolbox by Kurt Mehlhorn and Peter Sanders , 2011, SIGA.

[93]  Pablo Castells,et al.  An Adaptation of the Vector-Space Model for Ontology-Based Information Retrieval , 2007, IEEE Transactions on Knowledge and Data Engineering.

[94]  V. Ramachandran,et al.  Priority Queues and Dijkstra's Algorithm , 2007 .

[95]  Mohamed Ali Hadj Taieb,et al.  Computing semantic similarity between biomedical concepts using new information content approach , 2016, J. Biomed. Informatics.

[96]  Wineke A. M. van Lent,et al.  Similarity of business process models : metrics and evaluation , 2009 .

[97]  Federica Mandreoli,et al.  Knowledge-based sense disambiguation (almost) for all structures , 2011, Inf. Syst..

[98]  Benjamin C. M. Fung,et al.  Subject-based semantic document clustering for digital forensic investigations , 2013, Data Knowl. Eng..

[99]  Wanli Zuo,et al.  An Approach for Calculating Semantic Similarity between Words Using WordNet , 2011, 2011 Second International Conference on Digital Manufacturing & Automation.

[100]  Helena Sofia Pinto,et al.  The Next Generation of Similarity Measures that Fully Explore the Semantics in Biomedical Ontologies , 2013, J. Bioinform. Comput. Biol..

[101]  Nicola J. Mulder,et al.  Gene Ontology semantic similarity tools: survey on features and challenges for biological knowledge discovery , 2016, Briefings Bioinform..

[102]  Ahmad Abdollahzadeh Barforoush,et al.  A new word sense similarity measure in wordnet , 2008, 2008 International Multiconference on Computer Science and Information Technology.

[103]  David Sánchez,et al.  Ontology-based semantic similarity: A new feature-based approach , 2012, Expert Syst. Appl..

[104]  Mounira Harzallah,et al.  A generic framework for comparing semantic similarities on a subsumption hierarchy , 2008, ECAI.

[105]  David Sánchez,et al.  An ontology-based measure to compute semantic similarity in biomedicine , 2011, J. Biomed. Informatics.

[106]  Philip Resnik,et al.  Using Information Content to Evaluate Semantic Similarity in a Taxonomy , 1995, IJCAI.

[107]  Christiane Fellbaum,et al.  Combining Local Context and Wordnet Similarity for Word Sense Identification , 1998 .

[108]  Xiao-Ying Liu,et al.  Measuring Semantic Similarity in Wordnet , 2007, 2007 International Conference on Machine Learning and Cybernetics.

[109]  Euripides G. M. Petrakis,et al.  X-Similarity: Computing Semantic Similarity between Concepts from Different Ontologies , 2006, J. Digit. Inf. Manag..

[110]  Joseph G. Davis,et al.  A semantic similarity measure for linked data: An information content-based approach , 2016, Knowl. Based Syst..

[111]  Ted Pedersen,et al.  Information Content Measures of Semantic Similarity Perform Better Without Sense-Tagged Text , 2010, NAACL.

[112]  M. Dolores del Castillo,et al.  SyMSS: A syntax-based measure for short-text semantic similarity , 2011, Data Knowl. Eng..

[113]  George A. Miller,et al.  WordNet: A Lexical Database for English , 1995, HLT.

[114]  Eneko Agirre,et al.  A Study on Similarity and Relatedness Using Distributional and WordNet-based Approaches , 2009, NAACL.

[115]  Jérôme Euzenat,et al.  A Feature and Information Theoretic Framework for Semantic Similarity and Relatedness , 2010, SEMWEB.

[116]  David Sánchez,et al.  A New Model to Compute the Information Content of Concepts from Taxonomic Knowledge , 2012, Int. J. Semantic Web Inf. Syst..

[117]  Roy Rada,et al.  Development and application of a metric on semantic nets , 1989, IEEE Trans. Syst. Man Cybern..

[118]  Ted Pedersen,et al.  UMLS-Interface and UMLS-Similarity : Open Source Software for Measuring Paths and Semantic Similarity , 2009, AMIA.

[119]  Mario Cannataro,et al.  Semantic similarity analysis of protein data: assessment with biological features and issues , 2012, Briefings Bioinform..

[120]  Ana M. García-Serrano,et al.  A new family of information content models with an experimental survey on WordNet , 2015, Knowl. Based Syst..

[121]  Sylvie Ranwez,et al.  The Semantic Measures Library: Assessing Semantic Similarity from Knowledge Representation Analysis , 2014, NLDB.

[122]  Montserrat Batet,et al.  Utility preserving query log anonymization via semantic microaggregation , 2013, Inf. Sci..

[123]  Valerie V. Cross,et al.  Using semantic similarity in ontology alignment , 2011, OM.

[124]  Pablo Castells,et al.  An Ontology-Based Information Retrieval Model , 2005, ESWC.

[125]  Juan J. Lastra-Díaz,et al.  HESML V1R1 Java software library of ontology-based semantic similarity measures and information content models , 2016 .

[126]  Jan Mendling,et al.  Activity labeling in process modeling: Empirical insights and recommendations , 2010, Inf. Syst..

[127]  C.W.J. van Miltenburg Wordnet-based similarity metrics for adjectives , 2016 .

[128]  David Sánchez,et al.  Semantic similarity estimation in the biomedical domain: An ontology-based information-theoretic perspective , 2011, J. Biomed. Informatics.

[129]  Nuno Seco,et al.  Design, Implementation and Evaluation of a New Semantic Similarity Metric Combining Features and Intrinsic Information Content , 2008, OTM Conferences.

[130]  Ming Che Lee,et al.  A novel sentence similarity measure for semantic-based expert systems , 2011, Expert Syst. Appl..