ParsCit: an Open-source CRF Reference String Parsing Package

We describe ParsCit, a freely available, open-source reference string parsing package. At the core of ParsCit is a trained conditional random field (CRF) model that labels the token sequence of a reference string. A heuristic model wraps this core, adding functionality to identify reference strings in a plain text file and to retrieve their citation contexts. The package comes with utilities to run it as a web service or as a standalone utility. We evaluate ParsCit on three distinct reference string datasets and show that it compares well with previously published work.
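
As a rough illustration of what labeling the token sequence of a reference string involves, the sketch below tokenizes a reference string and computes simple per-token features of the kind a CRF sequence labeler typically consumes. It is not ParsCit's code: the function names (`tokenize`, `token_features`, `featurize`) and the feature set are assumptions chosen for illustration, and no trained model is included.

```python
import re

def tokenize(reference):
    # Whitespace tokenization; punctuation stays attached to its token.
    return reference.split()

def token_features(tokens, i):
    # Hypothetical feature set for illustration, not ParsCit's actual features.
    tok = tokens[i]
    core = tok.strip(".,;:()[]\"'")  # token without surrounding punctuation
    return {
        "lower": core.lower(),
        "is_capitalized": core[:1].isupper(),
        "is_year": bool(re.fullmatch(r"(19|20)\d{2}", core)),
        "has_digit": any(c.isdigit() for c in core),
        "ends_with_period": tok.endswith("."),
        # Position in the string, scaled to [0, 1]; fields such as the date
        # often appear near the end of a reference.
        "relative_position": round(i / max(len(tokens) - 1, 1), 2),
        # Neighboring tokens give the labeler local context.
        "prev_lower": tokens[i - 1].lower() if i > 0 else "<BOS>",
        "next_lower": tokens[i + 1].lower() if i < len(tokens) - 1 else "<EOS>",
    }

def featurize(reference):
    # Returns the tokens and one feature dictionary per token.
    tokens = tokenize(reference)
    return tokens, [token_features(tokens, i) for i in range(len(tokens))]

if __name__ == "__main__":
    tokens, feats = featurize("A. McCallum. Conditional Random Fields. ICML, 2001.")
    for tok, f in zip(tokens, feats):
        print(tok, f["is_capitalized"], f["is_year"], f["relative_position"])
```

A trained CRF would take one such feature sequence per reference string and assign each token a field label such as author, title, or date; the heuristic steps of locating reference strings in a document and retrieving citation contexts are outside this sketch.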