Evaluation and Comparison of Open Source Bibliographic Reference Parsers: A Business Use Case

Bibliographic reference parsing refers to extracting machine-readable metadata, such as author names, the title, or the journal name, from bibliographic reference strings. Many approaches to this problem have been proposed, including regular expressions, knowledge bases, and supervised machine learning. Many open source reference parsers based on various algorithms are also available. In this paper, we apply, evaluate and compare ten reference parsing tools in a specific business use case. The tools are Anystyle-Parser, Biblio, CERMINE, Citation, Citation-Parser, GROBID, ParsCit, PDFSSA4MET, Reference Tagger and Science Parse, and we compare them both out of the box and tuned to the project-specific data. According to our evaluation, the best performing out-of-the-box tool is GROBID (F1 0.89), followed by CERMINE (F1 0.83) and ParsCit (F1 0.75). We also found that even though machine learning-based tools and tools based on rules or regular expressions achieve similar average precision (0.77 for ML-based tools vs. 0.76 for non-ML-based tools), the ML-based tools achieve recall three times higher than the non-ML-based tools (0.66 vs. 0.22). Our study also confirms that tuning the models to task-specific data improves quality: the retrained versions of the reference parsers outperform their out-of-the-box counterparts in all cases; F1 increased by 3% for GROBID (0.92 vs. 0.89), by 11% for CERMINE (0.92 vs. 0.83), and by 16% for ParsCit (0.87 vs. 0.75).
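To make the reported metrics concrete, the following is a minimal sketch (not the paper's actual evaluation code) of how field-level precision, recall, and F1 can be computed by comparing a parser's extracted metadata against a gold-standard record; the field names and matching criterion (exact string match) are illustrative assumptions:

```python
def field_metrics(predicted, gold):
    """Compute precision, recall and F1 over extracted metadata fields.

    predicted and gold are dicts mapping field names (e.g. 'author',
    'title', 'journal') to their string values. A predicted field counts
    as correct only if its value exactly matches the gold value.
    """
    true_positives = sum(
        1 for field, value in predicted.items()
        if gold.get(field) == value
    )
    # Precision: fraction of extracted fields that are correct.
    precision = true_positives / len(predicted) if predicted else 0.0
    # Recall: fraction of gold fields that were correctly extracted.
    recall = true_positives / len(gold) if gold else 0.0
    # F1: harmonic mean of precision and recall.
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1


# Hypothetical example: the parser gets author and title right,
# misses the journal, and extracts a wrong year.
gold = {"author": "J. Smith", "title": "A Study",
        "journal": "Nature", "year": "2010"}
predicted = {"author": "J. Smith", "title": "A Study", "year": "2011"}

p, r, f1 = field_metrics(predicted, gold)
# p = 2/3, r = 2/4 = 0.5, f1 = 4/7 ≈ 0.571
```

A high precision with low recall, as observed here for the rule-based tools, corresponds to a parser that extracts few fields but gets most of them right; the ML-based tools' higher recall reflects that they recover many more of the gold fields.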
