DeTEXT: A Database for Evaluating Text Extraction from Biomedical Literature Figures

Hundreds of millions of figures are available in the biomedical literature, representing important biomedical experimental evidence. Since text is a rich source of information in figures, automatically extracting such text may assist in the task of mining figure information. A high-quality ground truth standard can greatly facilitate the development of an automated system. This article describes DeTEXT: A database for evaluating text extraction from biomedical literature figures. It is the first publicly available, human-annotated, high-quality, large-scale figure-text dataset, with 288 full-text articles, 500 biomedical figures, and 9308 text regions. We describe how figures were selected from open-access full-text biomedical articles and how the annotation guidelines and annotation tools were developed. We also discuss the inter-annotator agreement and the reliability of the annotations, summarize the statistics of the DeTEXT data, and make evaluation protocols for DeTEXT available. Finally, we lay out the challenges we observed in the automated detection and recognition of figure text and discuss research directions in this area. DeTEXT is publicly available for download at http://prir.ustb.edu.cn/DeTEXT/.
