An Automatic Unsupervised Querying Algorithm for Efficient Information Extraction in Biomedical Domain

In bioinformatics, extracting a relation such as protein-protein interactions from a large database of text documents is a challenging task. A major issue in biomedical information extraction is how to efficiently process the sheer volume of unstructured biomedical text. Often, only a small fraction of the documents in such a corpus contain information relevant to the extraction task. We propose a novel query expansion algorithm that automatically discovers the characteristics of documents that are useful for extracting a target relation. Our technique introduces a hybrid query re-weighting algorithm that combines a modified Robertson-Sparck Jones query ranking algorithm with a keyphrase extraction algorithm. It also adopts a novel query translation technique that incorporates part-of-speech (POS) categories into query translation. We conduct a series of experiments and report the results. They show that our technique retrieves more documents containing protein-protein pairs from MEDLINE as the number of iterations increases. We also compare our technique with SLIPPER, a supervised rule-based query expansion technique. The results show that our technique outperforms SLIPPER by 17.90% to 29.98% over four iterations.
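As a rough illustration of the term-ranking idea mentioned above, the sketch below computes the standard Robertson-Sparck Jones relevance weight with 0.5 smoothing and ranks candidate query terms by it. This is the textbook formula, not the modified variant used in the paper, which the abstract does not specify. The function names `rsj_weight` and `rank_terms` and the toy term counts are assumptions made for illustration only.

```python
import math

def rsj_weight(r, R, n, N):
    """Standard Robertson-Sparck Jones weight with 0.5 smoothing.

    r: number of relevant documents containing the term
    R: number of relevant documents
    n: number of documents containing the term
    N: number of documents in the collection
    """
    numerator = (r + 0.5) * (N - n - R + r + 0.5)
    denominator = (R - r + 0.5) * (n - r + 0.5)
    return math.log(numerator / denominator)

def rank_terms(term_stats, R, N):
    """Rank candidate terms by weight, highest first.

    term_stats maps each term to a tuple (r, n) as defined above.
    """
    weights = {t: rsj_weight(r, R, n, N) for t, (r, n) in term_stats.items()}
    return sorted(weights.items(), key=lambda kv: kv[1], reverse=True)

# Hypothetical counts: 10 documents judged useful out of 1000 retrieved.
stats = {
    "interacts": (8, 20),    # frequent in useful documents, rare overall
    "protein": (9, 600),     # frequent everywhere
    "the": (10, 1000),       # appears in every document
}
ranked = rank_terms(stats, R=10, N=1000)
for term, weight in ranked:
    print(f"{term}\t{weight:.3f}")
```

Terms that are concentrated in the useful documents receive high weights and are natural candidates for the expanded query, while terms that occur in nearly every document receive low or negative weights.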
