Identifying the Central Figure of a Scientific Paper

Publishers are increasingly using graphical abstracts to facilitate scientific search, especially across disciplinary boundaries. Graphical abstracts can be presented on various media, are easily shared, and are information-rich. However, only a very small fraction of scientific publications are equipped with graphical abstracts. What can we do with the vast majority of papers that have no selected graphical abstract? In this paper, we first hypothesize that scientific papers actually include a "central figure" that serves as a graphical abstract. These figures convey the key results and provide a visual identity for the paper. Using survey data collected from 6,263 authors regarding 8,353 papers over 15 years, we find that over 87% of papers are considered to contain a central figure, and that these central figures are primarily used to summarize important results, explain the key methods, or provide additional discussion. We then train a model to automatically recognize the central figure, achieving a top-3 accuracy of 78% and an exact-match accuracy of 34%. We find that the primary boost in accuracy comes from figure captions that resemble the abstract. We make all our data and results publicly available at https://github.com/viziometrics/centraul_figure. Our goal is to automate central figure identification to improve search engine performance and to help scientists connect ideas across the literature.
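The two evaluation measures and the caption-abstract signal mentioned above can be illustrated with a minimal sketch. This is not the model described in the paper: it assumes a plain bag-of-words cosine similarity as a stand-in for "captions that resemble the abstract", and the function names (`rank_figures`, `top_k_accuracy`) and the toy inputs are hypothetical. Exact-match accuracy corresponds to `k=1` and top-3 accuracy to `k=3`.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def cosine(text_a, text_b):
    """Bag-of-words cosine similarity (an assumed stand-in similarity)."""
    ca, cb = Counter(tokenize(text_a)), Counter(tokenize(text_b))
    dot = sum(ca[t] * cb[t] for t in ca)
    norm_a = math.sqrt(sum(v * v for v in ca.values()))
    norm_b = math.sqrt(sum(v * v for v in cb.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def rank_figures(abstract, captions):
    """Return figure indices ordered from most to least abstract-like caption."""
    scores = [cosine(abstract, c) for c in captions]
    return sorted(range(len(captions)), key=lambda i: scores[i], reverse=True)


def top_k_accuracy(rankings, labels, k):
    """Fraction of papers whose labeled central figure is in the top k."""
    hits = sum(1 for ranked, label in zip(rankings, labels) if label in ranked[:k])
    return hits / len(labels)


# Toy data (hypothetical): one abstract and three figure captions.
abstract = "We train a model to recognize the central figure of a paper"
captions = [
    "Photo of the lab equipment",
    "Accuracy of the model that recognizes the central figure",
    "Map of survey respondents",
]
ranking = rank_figures(abstract, captions)
print(ranking)                               # ranked figure indices
print(top_k_accuracy([ranking], [1], k=1))   # exact match against label 1
```

In a real evaluation, `rankings` would hold one ranked list per paper and `labels` the author-selected central figure for each paper.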
