Biomedical ontology alignment: an approach based on representation learning

Background: While representation learning techniques have shown great promise in application to a number of different NLP tasks, they have had little impact on the problem of ontology matching. Unlike past work that has focused on feature engineering, we present a novel representation learning approach that is tailored to the ontology matching task. Our approach is based on embedding ontological terms in a high-dimensional Euclidean space. This embedding is derived on the basis of a novel phrase retrofitting strategy through which semantic similarity information becomes inscribed onto fields of pre-trained word vectors. The resulting framework also incorporates a novel outlier detection mechanism based on a denoising autoencoder that is shown to improve performance.

Results: An ontology matching system derived using the proposed framework achieved an F-score of 94% on an alignment scenario involving the Adult Mouse Anatomical Dictionary and the Foundational Model of Anatomy ontology (FMA) as targets. This compares favorably with the best performing systems on the Ontology Alignment Evaluation Initiative anatomy challenge. We performed additional experiments on aligning FMA to NCI Thesaurus and to SNOMED CT based on a reference alignment extracted from the UMLS Metathesaurus. Our system obtained overall F-scores of 93.2% and 89.2% for these experiments, thus achieving state-of-the-art results.

Conclusions: Our proposed representation learning approach leverages terminological embeddings to capture semantic similarity. Our results provide evidence that the approach produces embeddings that are especially well tailored to the ontology matching task, demonstrating a novel pathway for the problem.
