Figure Text Extraction in Biomedical Literature

Background: Figures are ubiquitous in biomedical full-text articles, and they represent important biomedical knowledge. However, the sheer volume of biomedical publications has made it necessary to develop computational approaches for accessing figures. Therefore, we are developing the Biomedical Figure Search engine (http://figuresearch.askHERMES.org) to allow bioscientists to access figures efficiently. Since text frequently appears in figures, automatically extracting such text may assist the task of mining information from figures. Little research, however, has been conducted exploring text extraction from biomedical figures.

Methodology: We first evaluated an off-the-shelf Optical Character Recognition (OCR) tool on its ability to extract text from figures appearing in biomedical full-text articles. We then developed a Figure Text Extraction Tool (FigTExT) to improve the performance of the OCR tool for figure text extraction through the use of three innovative components: image preprocessing, character recognition, and text correction. We first developed image preprocessing to enhance image quality and to improve text localization. We then applied the off-the-shelf OCR tool to the improved text localization output for character recognition. Finally, we developed and evaluated a novel text correction framework that takes advantage of figure-specific lexicons.

Results/Conclusions: The evaluation on 382 figures (9,643 figure texts in total) randomly selected from PubMed Central full-text articles shows that FigTExT performed with 84% precision, 98% recall, and 90% F1-score for text localization and with 62.5% precision, 51.0% recall, and 56.2% F1-score for figure text extraction. When limiting figure texts to those judged by domain experts to be important content, FigTExT performed with 87.3% precision, 68.8% recall, and 77% F1-score.
FigTExT significantly improved the performance of the off-the-shelf OCR tool we used, which on its own performed with 36.6% precision, 19.3% recall, and 25.3% F1-score for text extraction. In addition, our results show that FigTExT can extract text that does not appear in figure captions or other associated text, further suggesting the potential utility of FigTExT for improving figure search.
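
The text correction step described in the Methodology can be illustrated with a minimal sketch. The abstract does not specify how the correction framework works, so the following is not FigTExT's implementation: it assumes a figure-specific lexicon is simply a list of words (for example, drawn from the figure's caption), and it replaces each OCR token with the nearest lexicon word by Levenshtein edit distance. The names `edit_distance`, `correct_token`, and the `max_dist` threshold are hypothetical.

```python
from typing import Iterable


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance computed with the standard dynamic-programming recurrence."""
    prev = list(range(len(b) + 1))  # distances from the empty prefix of a
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def correct_token(token: str, lexicon: Iterable[str], max_dist: int = 2) -> str:
    """Return the closest lexicon word within max_dist edits, else the token unchanged.

    Assumption: the lexicon is a plain word list; max_dist is an illustrative threshold.
    Ties are resolved in favor of the word that appears first in the lexicon.
    """
    best, best_d = token, max_dist + 1
    for word in lexicon:
        d = edit_distance(token.lower(), word.lower())
        if d < best_d:
            best, best_d = word, d
    return best
```

For instance, with a hypothetical lexicon `["protein", "kinase"]`, the OCR output `pr0tein` is one substitution away from `protein` and would be corrected, while a token with no lexicon word within two edits is left as is. Restricting candidates to a small figure-specific word list keeps the search cheap and reduces the chance of "correcting" a token to an unrelated word.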
