Extracting information from text and images for location proteomics

There is extensive interest in automating the collection, organization and summarization of biological data. Data in the form of figures and their accompanying captions in the literature present special challenges for such efforts. Building on our previously developed search engines for finding fluorescence microscope images that depict protein subcellular patterns, we have introduced text mining and optical character recognition (OCR) techniques for caption understanding and figure-text matching. The aim is a robust, comprehensive toolset for extracting information about protein subcellular localization from the text and images found in online journals. Our current system can generate assertions such as "Figure N depicts a localization of type L for protein P in cell type C".
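
To make the assertion template concrete, the following is a minimal sketch of how such an assertion could be represented as a record. It is not the system's actual data model: the class name `LocalizationAssertion`, its field names, and the placeholder values are all assumptions introduced here for illustration, and only the sentence template comes from the text above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LocalizationAssertion:
    """Hypothetical record for one extracted assertion (names are illustrative)."""

    figure: str        # N: figure or panel label, e.g. taken from the caption
    localization: str  # L: subcellular location pattern
    protein: str       # P: protein name found in the caption text
    cell_type: str     # C: cell type named in the caption text

    def render(self) -> str:
        # Fill the sentence template quoted in the abstract.
        return (
            f"Figure {self.figure} depicts a localization of type "
            f"{self.localization} for protein {self.protein} "
            f"in cell type {self.cell_type}"
        )


# Placeholder values only; these are not results reported by the paper.
example = LocalizationAssertion(
    figure="N", localization="L", protein="P", cell_type="C"
)
print(example.render())
```

Keeping the four slots as separate fields, rather than storing only the rendered sentence, would let assertions drawn from captions and from image analysis be compared or merged field by field.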
