Automated data extraction of bar chart raster images

Objective: To develop software that uses optical character recognition to automatically extract data from bar charts for meta-analysis. Methods: We used a multistep data extraction approach comprising figure extraction, text detection, and image disassembly. We processed PubMed Central papers reporting clinical trials on macular degeneration, a blinding disease with a heavy disease burden and many clinical trials. Bar chart characteristics were extracted both automatically and manually, and the two sets of results were compared for accuracy and with a Bland-Altman analysis. Results: In the Bland-Altman analysis, 91.8% of data points fell within the limits of agreement. Measured against manual extraction, automated extraction achieved the following accuracies: X-axis labels, 79.5%; Y-tick values, 88.6%; Y-axis label, 88.6%; bar values with <5% error, 88.0%. Discussion: Automated data extraction agreed with manual data extraction for most data points. A major source of error was the optical character recognition library misreading 7s as 2s. Adding redundancy checks in the form of a deep neural network would likely improve our bar detection accuracy. Further refinement of this method is justified so that it can also extract tabulated and line graph data, facilitating automated data gathering for meta-analysis.
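
The comparison of automated and manual bar values can be illustrated with a minimal sketch. It is not the authors' implementation. It assumes the conventional Bland-Altman limits of agreement (mean difference ± 1.96 standard deviations of the differences) and assumes that "<5% error" means relative error measured against the manual value; the abstract specifies neither. The function names and the sample values are hypothetical.

```python
from statistics import mean, stdev


def bland_altman(automated, manual):
    """Return bias, limits of agreement, and fraction of points inside them.

    Assumes the conventional limits: bias +/- 1.96 * SD of the differences.
    """
    if len(automated) != len(manual) or len(automated) < 2:
        raise ValueError("need two equally sized samples with at least 2 points")
    diffs = [a - m for a, m in zip(automated, manual)]
    bias = mean(diffs)
    sd = stdev(diffs)  # sample standard deviation of the differences
    lower = bias - 1.96 * sd
    upper = bias + 1.96 * sd
    within = sum(lower <= d <= upper for d in diffs) / len(diffs)
    return {"bias": bias, "lower": lower, "upper": upper, "within": within}


def fraction_within_relative_error(automated, manual, tolerance=0.05):
    """Fraction of bar values whose relative error vs. manual is below tolerance.

    Pairs with a manual value of zero are skipped, since the relative
    error is undefined there (an assumption of this sketch).
    """
    pairs = [(a, m) for a, m in zip(automated, manual) if m != 0]
    if not pairs:
        raise ValueError("no pairs with a non-zero manual value")
    hits = sum(abs(a - m) / abs(m) < tolerance for a, m in pairs)
    return hits / len(pairs)


if __name__ == "__main__":
    # Made-up illustrative values, not data from the study.
    automated = [10.0, 20.5, 29.0, 41.0, 50.0]
    manual = [10.0, 20.0, 30.0, 40.0, 50.0]
    print(bland_altman(automated, manual))
    print(fraction_within_relative_error(automated, manual))
```

Because the limits of agreement are derived from the differences themselves, a single large OCR error (for example a 7 read as a 2) widens the limits as well as shifting the bias, so the fraction of points inside the limits is best read alongside the relative-error accuracy.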
