Knowledge Rich Natural Language Queries over Structured Biological Databases

Increasingly, keyword, natural language, and NoSQL queries are used for information retrieval from both traditional and non-traditional databases such as web, document, GIS, legal, and health databases. While their popularity is undeniable, their engineering is far from simple. For the most part, a semantics- and intent-preserving mapping of a well-understood natural language query, expressed over a structured database schema, to a structured query language is still a difficult task, and research to tame the complexity is intense. In this paper, we propose a multi-level knowledge-based middleware that facilitates such mappings by separating the conceptual level from the physical level. We augment these multi-level abstractions with a concept reasoner and a query strategy engine to dynamically link arbitrary natural language queries to well-defined structured queries. We demonstrate the feasibility of our approach by presenting a Datalog-based prototype system, called BioSmart, that can compute responses to arbitrary natural language queries over arbitrary databases once a syntactic classification of the natural language query is made.
