OCR++: A Robust Framework For Information Extraction from Scholarly Articles

This paper proposes OCR++, an open-source framework designed for a variety of information extraction tasks on scholarly articles, covering metadata (title, author names, affiliation and e-mail), structure (section headings and body text, table and figure headings, URLs and footnotes) and bibliography (citation instances and references). We analyze a diverse set of scientific articles written in English to understand generic writing patterns and formulate rules to develop this hybrid framework. Extensive evaluations show that the proposed framework outperforms existing state-of-the-art tools by a large margin in structural information extraction, and also improves on them in metadata and bibliography extraction, both in terms of accuracy (around 50% improvement) and processing time (around 52% improvement). A user experience study with 30 researchers indicates that they found the system very helpful. We also discuss two novel use cases, including the automatic extraction of links to public datasets from proceedings, which could further accelerate advances in digital libraries. The output of the framework can be exported as a whole into structured TEI-encoded documents. Our framework is accessible online at this http URL
