Scalable algorithms for scholarly figure mining and semantics

Most scholarly papers contain one or more figures. Often these figures show experimental results, e.g., line graphs are used to compare various methods. Compared to the text of the paper, figures and their semantics have received relatively little attention. This has significantly limited semantic search capabilities in scholarly search engines. Here, we report scalable algorithms for generating semantic metadata for figures. Our system has four sequential modules: 1. Extraction of each figure, its caption, and its mentions in the text; 2. Binary classification of figures as compound (containing sub-figures) or not; 3. Three-class classification of non-compound figures as line graph, bar graph, or other; and 4. Automatic processing of line graphs to generate a textual summary. Each step generates a metadata file that carries richer information than the one from the previous step. The algorithms are scalable, yet each individual step has an accuracy greater than 80%.
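The staged design described above can be illustrated with a minimal sketch. It assumes, for illustration only, that each module is available as a plain function; the names `run_pipeline`, `is_compound`, `figure_type`, and `summarize`, as well as the metadata keys, are hypothetical and are not taken from the system itself. The sketch shows only the control flow: later modules run only on the figures the earlier ones let through, and each stage emits a metadata snapshot that extends the previous one.

```python
from typing import Any, Callable, Dict, List

Metadata = Dict[str, Any]


def run_pipeline(
    figure: Metadata,
    is_compound: Callable[[Metadata], bool],   # hypothetical module 2 classifier
    figure_type: Callable[[Metadata], str],    # hypothetical module 3 classifier
    summarize: Callable[[Metadata], str],      # hypothetical module 4 summarizer
) -> List[Metadata]:
    """Return one metadata snapshot per executed stage."""
    snapshots: List[Metadata] = []

    # Module 1: extraction output (figure id, caption, mentions).
    meta: Metadata = {
        "id": figure["id"],
        "caption": figure["caption"],
        "mentions": list(figure["mentions"]),
    }
    snapshots.append(dict(meta))

    # Module 2: compound vs. non-compound.
    meta["compound"] = is_compound(figure)
    snapshots.append(dict(meta))
    if meta["compound"]:
        return snapshots  # compound figures are not classified further

    # Module 3: line graph, bar graph, or other.
    meta["type"] = figure_type(figure)
    snapshots.append(dict(meta))
    if meta["type"] != "line graph":
        return snapshots  # only line graphs are summarized

    # Module 4: textual summary of the line graph.
    meta["summary"] = summarize(figure)
    snapshots.append(dict(meta))
    return snapshots


if __name__ == "__main__":
    # Stub classifiers standing in for the real models.
    example = {"id": "fig1", "caption": "Accuracy vs. epochs", "mentions": ["See Fig. 1."]}
    stages = run_pipeline(
        example,
        is_compound=lambda f: False,
        figure_type=lambda f: "line graph",
        summarize=lambda f: "Placeholder summary.",
    )
    for i, snapshot in enumerate(stages, start=1):
        print(i, sorted(snapshot))
```

Passing the classifiers in as callables keeps the sketch independent of any particular model, so each stub can be swapped for a trained classifier without changing the control flow.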
