PDFMEF: A Multi-Entity Knowledge Extraction Framework for Scholarly Documents and Semantic Search

We introduce PDFMEF, a multi-entity knowledge extraction framework for scholarly documents in PDF format. It is implemented as a framework that encapsulates open-source extraction tools. Currently, it uses PDFBox and TET for full-text extraction, the scholarly document filter described in [5] for document classification, GROBID for header extraction, ParsCit for citation extraction, PDFFigures for figure and table extraction, and the method described in [27] for algorithm extraction. Although the framework can be run as a whole, the extraction tool in each module is highly customizable: users can replace a default extractor with a tool they prefer by writing a thin wrapper that implements the corresponding abstract interface. The framework is designed to be scalable and can run in parallel using multiprocessing in Python. Experiments indicate that the system with its default setup is CPU-bound and has a small memory footprint, which makes it best suited to multi-core machines. In the best case, on a dedicated 16-core server, it takes 1.3 seconds on average to process one PDF document. PDFMEF is used to index extracted information, to help users quickly locate relevant results in published scholarly documents, and to efficiently construct a large knowledge base for building a semantic scholarly search engine. Part of it runs in the CiteSeerX digital library search engine.
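
The wrapper-based customization and the process-level parallelism described above can be illustrated with a minimal Python sketch. Everything in it is an assumption made for illustration: the names `Extractor`, `TitleGuessExtractor`, `process_one`, and `process_all` are hypothetical and are not PDFMEF's actual interface, and the "tool" being wrapped is a trivial filename heuristic standing in for a real extractor such as GROBID or ParsCit.

```python
from abc import ABC, abstractmethod
from multiprocessing import Pool


class Extractor(ABC):
    """Hypothetical abstract extractor; not PDFMEF's real base class."""

    @abstractmethod
    def extract(self, path: str) -> dict:
        """Return the extracted fields for the document at `path`."""


class TitleGuessExtractor(Extractor):
    """Thin wrapper around a stand-in tool (a filename heuristic)."""

    def extract(self, path: str) -> dict:
        # Strip the directory and the extension, then turn underscores into spaces.
        stem = path.rsplit("/", 1)[-1].rsplit(".", 1)[0]
        return {"path": path, "title_guess": stem.replace("_", " ")}


def process_one(path: str) -> dict:
    # Each call builds its own extractor, so workers share no state.
    return TitleGuessExtractor().extract(path)


def process_all(paths, workers=4):
    # Distribute documents over a pool of worker processes.
    with Pool(processes=workers) as pool:
        return pool.map(process_one, paths)
```

Swapping in a different tool would then amount to writing another subclass of the hypothetical `Extractor` and using it in `process_one`. When `process_all` is called from a script, the call is normally placed under an `if __name__ == "__main__":` guard so that worker processes can import the module safely.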

[1]  Roni Rosenfeld, et al.  Learning Hidden Markov Model Structure for Information Extraction, 1999.

[2]  Edward A. Fox, et al.  Automatic document metadata extraction using support vector machines, 2003, 2003 Joint Conference on Digital Libraries, 2003. Proceedings.

[3]  Ramanathan V. Guha, et al.  Semantic search, 2003, WWW '03.

[4]  David A. Ferrucci, et al.  UIMA: an architectural approach to unstructured information processing in the corporate research environment, 2004, Natural Language Engineering.

[5]  Hui Han, et al.  Automatic acknowledgement indexing: expanding the semantics of contribution in the CiteSeer digital library, 2005, K-CAP '05.

[6]  Kun Bai, et al.  Automatic extraction of table metadata from digital documents, 2006, Proceedings of the 6th ACM/IEEE-CS Joint Conference on Digital Libraries (JCDL '06).

[7]  Vinay K. Chaudhri, et al.  Enabling experts to build knowledge bases from science textbooks, 2007, K-CAP '07.

[8]  C. Lee Giles, et al.  ParsCit: an Open-source CRF Reference String Parsing Package, 2008, LREC.

[9]  Patrice Lopez, et al.  GROBID: Combining Automatic Bibliographic Data Recognition and Term Extraction for Scholarship Publications, 2009, ECDL.

[10]  Peter Clark, et al.  Large-scale extraction and use of knowledge from text, 2009, K-CAP '09.

[11]  Peder Olesen Larsen, et al.  The rate of growth in scientific publication and the decline in coverage provided by Science Citation Index, 2010, Scientometrics.

[12]  C. Lee Giles, et al.  Identifying, Indexing, and Ranking Chemical Formulae and Chemical Names in Digital Documents, 2011, TOIS.

[13]  Prasenjit Mitra, et al.  An algorithm search engine for software developers, 2011, SUITE '11.

[14]  Oren Etzioni.  Search needs a shake-up, 2011, Nature.

[15]  Hagit Shatkay, et al.  An Automatic System for Extracting Figures and Captions in Biomedical PDF Documents, 2011, 2011 IEEE International Conference on Bioinformatics and Biomedicine.

[16]  Oren Etzioni, et al.  Constructing a Textual KB from a Biology TextBook, 2012, AKBC-WEKEX@NAACL-HLT.

[17]  C. Lee Giles, et al.  Figure Metadata Extraction from Digital Documents, 2013, 2013 12th International Conference on Document Analysis and Recognition.

[18]  C. Lee Giles, et al.  Automatic Detection of Pseudocodes in Scholarly Documents Using Machine Learning, 2013, 2013 12th International Conference on Document Analysis and Recognition.

[19]  Jöran Beel, et al.  Evaluation of header metadata extraction approaches and tools for scientific PDF documents, 2013, JCDL '13.

[20]  Lior Rokach, et al.  A figure search engine architecture for a chemistry digital library, 2013, JCDL '13.

[21]  Madian Khabsa, et al.  Digital commons, 2020, Internet Policy Rev.

[22]  Cornelia Caragea, et al.  Automatic Identification of Research Articles from Crawled Documents, 2014, WSDM 2014.

[23]  Cornelia Caragea, et al.  CiteSeerX: AI in a Digital Library Search Engine, 2014, AI Mag.

[24]  Christopher Andreas Clark, et al.  Looking Beyond Text: Extracting Figures, Tables and Captions from Computer Science Papers, 2015, AAAI Workshop: Scholarly Big Data.