A framework for ontology-based question answering with application to parasite immunology

Background: Large quantities of biomedical data are being produced at a rapid pace for a variety of organisms. With ontologies proliferating, data is increasingly being stored using the RDF data model and queried using RDF-based query languages. While existing systems facilitate querying in various ways, the scientist must still map the question in his or her mind to the interface used by the system. The field of natural language processing has long investigated the challenges of designing retrieval systems based on natural language. Recent efforts seek to bring the ability to pose natural language questions to RDF data querying systems while leveraging the associated ontologies. These systems analyze the input question and extract triples (subject, relationship, object), mapping them, where possible, to RDF triples in the data. However, in the biomedical context, relationships between entities are not always explicit in the question, and they are often complex, involving many intermediate concepts.

Results: We present a new framework, OntoNLQA, for querying RDF data annotated using ontologies, which allows questions to be posed in natural language. OntoNLQA answers natural language questions in five steps. It differs from previous systems in how some of these steps are realized. In particular, it introduces a novel approach for discovering the sophisticated semantic associations that may exist between the key terms of a natural language question, in order to build an intuitive query and retrieve precise answers. We apply this framework to parasite immunology data, leading to a system called AskCuebee that allows parasitologists to pose genomic, proteomic and pathway questions in natural language related to the parasite Trypanosoma cruzi. We separately evaluate the accuracy of each component of OntoNLQA as implemented in AskCuebee and the accuracy of the whole system. AskCuebee answers 68% of the questions in a corpus of 125 questions, and 60% of the questions in a new, previously unseen corpus. If we allow simple corrections by the scientists, this proportion increases to 92%.

Conclusions: We introduce a novel framework for question answering and apply it to parasite immunology data. The framework translates questions to RDF triple queries by combining machine learning, lexical similarity matching with ontology classes, properties and instances for specificity, and discovery of the associations between them. Evaluations demonstrate that the approach performs well and improves on previous systems. OntoNLQA thus offers a viable framework for building question answering systems in other biomedical domains.
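
As a rough illustration of what discovering an association between two key terms can involve, the following minimal sketch searches a toy ontology schema for the shortest chain of properties linking two classes and turns that chain into SPARQL-style triple patterns. Everything here is an assumption made for illustration: the schema, the class and property names, the helper functions, and the use of breadth-first search are hypothetical and are not taken from OntoNLQA or AskCuebee.

```python
from collections import deque

# Hypothetical toy schema: (domain class, property, range class).
SCHEMA = [
    ("Gene", "encodes", "Protein"),
    ("Protein", "participates_in", "Pathway"),
    ("Gene", "located_on", "Chromosome"),
]


def build_graph(schema):
    """Index the schema so each property can be followed in both directions."""
    graph = {}
    for subj, prop, obj in schema:
        graph.setdefault(subj, []).append((prop, obj, False))
        graph.setdefault(obj, []).append((prop, subj, True))  # inverse traversal
    return graph


def find_association(schema, start, goal):
    """Breadth-first search for the shortest chain of schema triples
    connecting `start` to `goal`. Returns None if no chain exists."""
    graph = build_graph(schema)
    queue = deque([(start, [])])
    seen = {start}
    while queue:
        node, path = queue.popleft()
        if node == goal:
            return path
        for prop, nxt, inverse in graph.get(node, []):
            if nxt in seen:
                continue
            seen.add(nxt)
            # Keep each triple in its declared direction, even when the
            # edge was traversed backwards.
            triple = (nxt, prop, node) if inverse else (node, prop, nxt)
            queue.append((nxt, path + [triple]))
    return None


def to_sparql_patterns(path):
    """Render a chain of triples as SPARQL-style triple patterns."""
    return [f"?{s.lower()} :{p} ?{o.lower()} ." for s, p, o in path]


if __name__ == "__main__":
    chain = find_association(SCHEMA, "Gene", "Pathway")
    for pattern in to_sparql_patterns(chain):
        print(pattern)
```

In this toy case a question mentioning only a gene and a pathway yields a query that passes through the intermediate class `Protein`, even though no protein is named in the question. A real system would additionally need to rank competing chains and map question terms to ontology classes, which this sketch does not attempt.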

[87]  James H. Martin,et al.  Speech and Language Processing: An Introduction to Natural Language Processing, Computational Linguistics, and Speech Recognition , 2000 .