A Case Study of iHOP and Other Language Processing Systems

Text mining is the process of extracting (novel) interesting and non-trivial information and knowledge from unstructured text (Google™ search result for "define: text mining"). Information retrieval, natural language processing, information extraction, and text mining provide methodologies to shift the burden of tracing and relating data contained in text from the human user to the computer. The emergence of high-throughput techniques has allowed the biosciences to shift their research focus to systems biology, increasing the demands on text mining and on the extraction of information from heterogeneous sources. This chapter introduces the most fundamental uses of language processing methods in biology and presents the basic resources openly available in the field. The search for information about a common disease, chronic myeloid leukemia, is used to exemplify their capabilities. Tools such as PubMed, eTBLAST, METIS, EBIMed, MEDIE, MarkerInfoFinder, HCAD, iHOP, Chilibot, and G2D, selected from a comprehensive list of currently available systems, provide users with a basic platform for performing complex operations on information accumulated in text.
