Evaluating Reference String Extraction Using Line-Based Conditional Random Fields: A Case Study with German Language Publications

The extraction of individual reference strings from the reference section of scientific publications is an important step in the citation extraction pipeline. Current approaches divide this task into two steps by first detecting the reference section areas and then grouping the text lines in such areas into reference strings. We propose a classification model that considers every line in a publication as a potential part of a reference string. By applying line-based conditional random fields rather than constructing the graphical model based on individual words, dependencies and patterns that are typical in reference sections provide strong features while the overall complexity of the model is reduced. We evaluated our novel approach RefExt against various state-of-the-art tools (CERMINE, GROBID, and ParsCit) and a gold standard which consists of 100 German language full text publications from the social sciences. The evaluation demonstrates that we are able to outperform state-of-the-art tools which rely on the identification of reference section areas.

[1]  Nir Friedman,et al.  Probabilistic Graphical Models - Principles and Techniques , 2009 .

[2]  Martin Korner Reference String Extraction Using Line-Based Conditional Random Fields , 2017 .

[3]  C. Lee Giles,et al.  ParsCit: an Open-source CRF Reference String Parsing Package , 2008, LREC.

[4]  Andrew McCallum,et al.  Information extraction from research papers using conditional random fields , 2006, Inf. Process. Manag..

[5]  Cornelia Caragea,et al.  CiteSeerX: AI in a Digital Library Search Engine , 2014, AI Mag..

[6]  Marcos André Gonçalves,et al.  FLUX-CIM: flexible unsupervised extraction of citation metadata , 2007, JCDL '07.

[7]  Dominika Tkaczyk,et al.  CERMINE: automatic extraction of structured metadata from scientific literature , 2015, International Journal on Document Analysis and Recognition (IJDAR).

[8]  Andrei Voronkov,et al.  PDFX: fully-automated PDF-to-XML conversion of scientific literature , 2013, ACM Symposium on Document Engineering.

[9]  Siegfried Handschuh,et al.  Reference information extraction and processing using conditional random fields , 2012 .

[10]  Andrew McCallum,et al.  Conditional Random Fields: Probabilistic Models for Segmenting and Labeling Sequence Data , 2001, ICML.

[11]  Henk F. Moed,et al.  Citation Analysis in Research Evaluation , 1899 .

[12]  Patrice Lopez,et al.  GROBID: Combining Automatic Bibliographic Data Recognition and Term Extraction for Scholarship Publications , 2009, ECDL.

[13]  Robert E. Mercer,et al.  Method Mention Extraction from Scientific Research Papers , 2012, COLING.

[14]  Philipp Mayr,et al.  Digital Library Research in Action: Supporting Information Retrieval in Sowiport , 2015, D Lib Mag..