Automated extraction of information in molecular biology

We review data mining techniques in molecular biology, specifically those that extract information from the scientific literature itself. As more of the biological literature is published electronically, there is an opportunity, and even a need, to summarize the literature automatically in a customized way, for example by associating keywords with a topic. These keywords can be extracted from publications relevant to that topic. Keyword extraction can be automated and optimized, either to keep literature pointers up‐to‐date or to filter relevant information from the literature. To illustrate these points, OMIM (Online Mendelian Inheritance in Man), a database of human inherited diseases, was linked to the literature, and keywords were derived that covered distinct aspects: genetic information on the one hand, and disease‐specific protein and phenotypic information on the other. These keywords were then used to extract information that helps keep disease entries up‐to‐date.
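As a rough illustration of associating keywords with a topic, the following Python sketch ranks words by how much more often they occur in documents about the topic than in a background set. It is not the method used in this work: the scoring rule (difference in document frequency), the stopword list, the function names, and the toy sentences are all assumptions made up for the example.

```python
import re
from collections import Counter

# Small illustrative stopword list (an assumption, not taken from the paper).
STOPWORDS = {"the", "of", "a", "an", "in", "and", "is", "to", "with", "for", "are", "by"}


def tokenize(text):
    """Return the set of distinct lowercase words in a text, minus stopwords."""
    return set(re.findall(r"[a-z][a-z0-9-]+", text.lower())) - STOPWORDS


def document_frequency(docs):
    """Count, for each word, the number of documents that contain it."""
    counts = Counter()
    for doc in docs:
        counts.update(tokenize(doc))
    return counts


def extract_keywords(topic_docs, background_docs, top_n=5):
    """Rank words by (fraction of topic docs) minus (fraction of background docs)."""
    topic_df = document_frequency(topic_docs)
    background_df = document_frequency(background_docs)
    scores = {}
    for word, count in topic_df.items():
        topic_frac = count / len(topic_docs)
        background_frac = background_df.get(word, 0) / len(background_docs)
        scores[word] = topic_frac - background_frac
    # Highest score first; ties are broken alphabetically for reproducibility.
    ranked = sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))
    return [word for word, score in ranked[:top_n] if score > 0]


# Toy sentences invented for this example; real input would be abstracts
# linked to a database entry and a sample of unrelated abstracts.
topic_docs = [
    "Mutations in the CFTR gene cause cystic fibrosis",
    "Cystic fibrosis patients carry CFTR mutations affecting chloride transport",
]
background_docs = [
    "The kinase gene is expressed in yeast",
    "Mutations in yeast alter the cell cycle",
]

keywords = extract_keywords(topic_docs, background_docs, top_n=3)
print(keywords)
```

Words that are common everywhere, such as "gene" in this toy data, score zero and are dropped, while words concentrated in the topic documents rise to the top. Re-running the ranking as new publications appear is one simple way such keyword lists could be kept current.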
