Knowledge based word-concept model estimation and refinement for biomedical text mining.

Text mining of scientific literature has been essential for setting up large public biomedical databases, which are widely used by the research community. In the biomedical domain, the existence of a large number of terminological resources and knowledge bases (KBs) has enabled a myriad of machine learning methods for different text mining tasks. Unfortunately, KBs have been devised for human interpretation rather than for text mining, so the performance of KB-based methods is usually lower than that of supervised machine learning methods. The disadvantage of supervised methods, though, is that they require labeled training data and are therefore not practical for large-scale biomedical text mining systems. KB-based methods do not have this limitation. In this paper, we describe a novel method to generate word-concept probabilities from a KB, which can serve as a basis for several text mining tasks. This method takes into account not only the underlying patterns within the descriptions contained in the KB but also those in texts available from large unlabeled corpora such as MEDLINE. The parameters of the model were estimated without training data. Patterns from MEDLINE were built using MetaMap for entity recognition and were related using co-occurrences. The word-concept probabilities were evaluated on the task of word sense disambiguation (WSD). The results showed that our method achieved higher accuracy than other state-of-the-art approaches when evaluated on the MSH WSD data set. We also evaluated our method on the task of document ranking using MEDLINE citations. These results also showed an increase in performance over existing baseline retrieval approaches.
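
To make the idea of word-concept probabilities more concrete, the following is a minimal Python sketch and not the estimation procedure of the paper. It assumes that each concept is described by a bag of words taken from its KB description, that P(word | concept) is estimated with simple additive smoothing, and that an ambiguous term is resolved by picking the candidate concept whose word distribution best explains the context. The concept identifiers, the toy descriptions, the smoothing constant `alpha`, and the `floor` value for unseen words are all hypothetical and chosen only for illustration; the refinement with MEDLINE co-occurrences mentioned above is not modeled here.

```python
import math
from collections import Counter


def estimate_word_concept_probs(descriptions, alpha=0.1):
    """Estimate P(word | concept) from tokenized concept descriptions.

    descriptions: dict mapping a concept id to a list of tokens.
    alpha: additive smoothing constant (an assumption of this sketch).
    """
    vocab = {w for tokens in descriptions.values() for w in tokens}
    probs = {}
    for concept, tokens in descriptions.items():
        counts = Counter(tokens)
        total = len(tokens) + alpha * len(vocab)
        # Every vocabulary word gets non-zero mass for every concept.
        probs[concept] = {w: (counts[w] + alpha) / total for w in vocab}
    return probs


def disambiguate(context, candidates, probs, floor=1e-9):
    """Return the candidate concept with the highest context log-likelihood.

    Words outside the KB vocabulary receive the same small floor
    probability under every concept, so they do not change the ranking.
    """
    def score(concept):
        return sum(math.log(probs[concept].get(w, floor)) for w in context)

    return max(candidates, key=score)


# Toy KB with two hypothetical senses of the ambiguous term "cold".
toy_kb = {
    "C_cold_temperature": ["low", "temperature", "weather", "freezing"],
    "C_common_cold": ["viral", "infection", "nose", "throat", "cough"],
}

probs = estimate_word_concept_probs(toy_kb)
best = disambiguate(
    ["patient", "cough", "infection"],
    ["C_cold_temperature", "C_common_cold"],
    probs,
)
print(best)
```

In this toy setting the context words "cough" and "infection" appear only in the description of the second concept, so that concept receives the higher score. A real system would draw candidate concepts and descriptions from the KB and would need a principled treatment of out-of-vocabulary words.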
