Semantic Retrieval for the Accurate Identification of Relational Concepts in Massive Textbases

This paper introduces a novel framework for the accurate retrieval of relational concepts from very large text collections. Before retrieval, a deep parser and a term recognizer annotate every sentence with predicate-argument structures and ontological identifiers. At run time, user requests are converted into region algebra queries over these annotations. Structural matching against the pre-computed semantic annotations makes the retrieval of relational concepts both accurate and efficient. The framework was applied to a text retrieval system for MEDLINE. Experiments on the retrieval of biomedical correlations showed that the retrieval cost is small enough for real-time applications and that retrieval precision is significantly improved.
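To make the idea of region algebra queries over pre-computed annotations more concrete, the following is a minimal sketch rather than the system's actual implementation. It assumes annotations are stored as tagged token spans, and it uses hypothetical tag names (`sentence`, `pred`, `arg1`, `arg2`, `entity`), hypothetical ontology identifiers (`GENE:p53`, `PROC:apoptosis`), and a toy two-sentence corpus. Only a containment operator is shown; a full region algebra has more operators.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Region:
    start: int        # inclusive token offset
    end: int          # exclusive token offset
    tag: str          # annotation type (hypothetical names)
    value: str = ""   # predicate lemma or ontology ID, if any


def contains(outer: Region, inner: Region) -> bool:
    """True if `inner` lies entirely within `outer`."""
    return outer.start <= inner.start and inner.end <= outer.end


def containing(a: List[Region], b: List[Region]) -> List[Region]:
    """Region algebra 'containing': regions of `a` that contain some region of `b`."""
    return [x for x in a if any(contains(x, y) for y in b)]


def select(regions: List[Region], tag: str, value: Optional[str] = None) -> List[Region]:
    """Pick annotations by tag and, optionally, by attribute value."""
    return [r for r in regions if r.tag == tag and (value is None or r.value == value)]


def find_relation(ann: List[Region], predicate: str, subj_id: str, obj_id: str) -> List[Region]:
    """Sentences where `predicate` has `subj_id` inside arg1 and `obj_id` inside arg2."""
    subj_args = containing(select(ann, "arg1"), select(ann, "entity", subj_id))
    obj_args = containing(select(ann, "arg2"), select(ann, "entity", obj_id))
    preds = containing(containing(select(ann, "pred", predicate), subj_args), obj_args)
    return containing(select(ann, "sentence"), preds)


def cooccurrence(ann: List[Region], id_a: str, id_b: str) -> List[Region]:
    """Baseline: sentences that merely mention both entities."""
    sents = select(ann, "sentence")
    return containing(containing(sents, select(ann, "entity", id_a)),
                      select(ann, "entity", id_b))


# Toy annotations, as if produced offline by a parser and a term recognizer.
# Sentence 1 (tokens 0-4):  "p53 activates apoptosis ."
# Sentence 2 (tokens 4-10): "Apoptosis is studied with p53 ."
annotations = [
    Region(0, 4, "sentence"),
    Region(0, 4, "pred", "activate"),
    Region(0, 1, "arg1"),
    Region(2, 3, "arg2"),
    Region(0, 1, "entity", "GENE:p53"),
    Region(2, 3, "entity", "PROC:apoptosis"),
    Region(4, 10, "sentence"),
    Region(4, 10, "pred", "study"),
    Region(4, 5, "arg2"),
    Region(4, 5, "entity", "PROC:apoptosis"),
    Region(8, 9, "entity", "GENE:p53"),
]

structural = find_relation(annotations, "activate", "GENE:p53", "PROC:apoptosis")
baseline = cooccurrence(annotations, "GENE:p53", "PROC:apoptosis")
print(len(structural), len(baseline))
```

In this toy setting the structural query returns only the first sentence, while the co-occurrence baseline returns both, which illustrates why matching on predicate-argument structure can raise precision. Because the annotations are computed ahead of time, the run-time work reduces to set operations over stored spans.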
