Use of figures in literature mining for biomedical digital libraries

The maintenance of biomedical digital libraries, including organism databases and protein databases, requires the analysis of a large number of documents. Much of this work is done manually: curators read large numbers of biomedical documents while updating and annotating organism databases such as MGI (Mouse Genome Informatics) and FlyBase (a database of the fruit-fly genome). We summarize the annotation process in organism databases and describe some of the roles played by the Gene Ontology and by document databases such as PubMed. Efforts are ongoing to automate parts of the annotation process. Biomedical text mining contests, such as the TREC Genomics Track (Hersh et al., 2004, 2005), define annotation subtasks and provide training and test data. So far, these efforts have focused on the text content of documents. We are investigating the analysis of figures in biomedical documents; the information derived from figure analysis may later be combined with the information derived from text analysis. We present an algorithm for using figures in document triage, the task of determining which documents are relevant to a given annotation task. In our triage algorithm, we segment figures into subfigures and classify each subfigure into one of four categories: graphical, gel, fluorescence microscopy, or other microscopy. A secondary classification into subcategories is performed by clustering, using clusters created from the subfigures in the labeled training data. The classifications of all subfigures in a document are combined to form a document descriptor. The document descriptor is then classified by a naive Bayes classifier as either relevant or irrelevant to the given annotation task.
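The last two steps of the triage algorithm (combining subfigure classifications into a document descriptor, then classifying the descriptor with naive Bayes) can be illustrated with a minimal sketch. The abstract does not specify how the descriptor is built, so the sketch assumes it is a vector of per-category subfigure counts and uses a multinomial naive Bayes model with Laplace smoothing. The names `make_descriptor` and `NaiveBayesTriage` and the toy training data are hypothetical, and the secondary clustering step is omitted.

```python
import math
from collections import Counter

# The four primary subfigure categories named in the abstract.
CATEGORIES = ("graphical", "gel", "fluorescence", "other_microscopy")


def make_descriptor(subfigure_labels):
    """Assumed descriptor: number of subfigures per category in one document."""
    counts = Counter(subfigure_labels)
    return [counts[c] for c in CATEGORIES]


class NaiveBayesTriage:
    """Multinomial naive Bayes over count descriptors (illustrative only)."""

    def __init__(self, alpha=1.0):
        self.alpha = alpha          # Laplace smoothing constant
        self.log_prior = {}
        self.log_likelihood = {}

    def fit(self, descriptors, labels):
        n_docs = len(labels)
        for cls in sorted(set(labels)):
            rows = [d for d, y in zip(descriptors, labels) if y == cls]
            self.log_prior[cls] = math.log(len(rows) / n_docs)
            # Smoothed per-category totals across all documents of this class.
            totals = [sum(col) + self.alpha for col in zip(*rows)]
            norm = sum(totals)
            self.log_likelihood[cls] = [math.log(t / norm) for t in totals]
        return self

    def predict(self, descriptor):
        def score(cls):
            return self.log_prior[cls] + sum(
                n * lp for n, lp in zip(descriptor, self.log_likelihood[cls])
            )
        return max(self.log_prior, key=score)


# Toy, made-up training documents: (subfigure labels, triage label).
training = [
    (["gel", "gel", "fluorescence"], "relevant"),
    (["gel", "fluorescence", "fluorescence", "graphical"], "relevant"),
    (["graphical", "graphical", "other_microscopy"], "irrelevant"),
    (["graphical", "graphical", "graphical"], "irrelevant"),
]
model = NaiveBayesTriage().fit(
    [make_descriptor(labels) for labels, _ in training],
    [y for _, y in training],
)

print(model.predict(make_descriptor(["gel", "fluorescence"])))
print(model.predict(make_descriptor(["graphical", "graphical"])))
```

A count vector is only one plausible way to combine subfigure classifications; proportions or cluster-level subcategory counts would fit the same classifier interface.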
