Improved bibliographic reference parsing based on repeated patterns

Parsing details such as author names and titles out of the bibliographic references of scientific publications is an important problem that has received considerable attention recently. However, most existing techniques are tailored to the highly standardized reference styles used in the last two to three decades, and they do not perform well on the wide variety of reference styles found in older, historic publications. They are therefore of limited use for creating comprehensive bibliographies that cover both historic and contemporary scientific publications. This paper presents RefParse, a generic approach to bibliographic reference parsing that is independent of any specific reference style. Its core feature is an inference mechanism that exploits the regularities inherent in any list of references to deduce its format. In addition, our approach learns the names of authors, journals, and publishers, which improves accuracy in scenarios where human users double-check parsing results to ensure data quality. Our evaluation shows that our approach performs comparably to existing ones on contemporary reference lists and also works well on older ones.
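The idea of exploiting regularities within a single reference list can be illustrated with a small sketch. The code below is not the RefParse algorithm, whose details are not given here; it is a simplified illustration under our own assumptions. The functions `signature` and `dominant_signature` and the sample references are hypothetical. Each reference is reduced to a coarse structural signature (word runs, numbers, four-digit years, punctuation), and the most frequent signature in the list is taken as a guess at the list's format.

```python
import re
from collections import Counter


def signature(reference: str) -> str:
    """Reduce a reference string to a coarse structural signature.

    Hypothetical helper for illustration only: runs of words collapse
    to "W", four-digit numbers become "YEAR", other numbers "NUM",
    and punctuation characters are kept as they are.
    """
    # Letters (Unicode-aware), digit runs, or single punctuation marks.
    tokens = re.findall(r"[^\W\d_]+|\d+|[^\w\s]", reference)
    parts = []
    for tok in tokens:
        if tok.isdigit():
            kind = "YEAR" if len(tok) == 4 else "NUM"
        elif tok[0].isalpha():
            kind = "W"
        else:
            kind = tok  # punctuation carries the structural signal
        if kind == "W" and parts and parts[-1] == "W":
            continue  # collapse consecutive words into one "W"
        parts.append(kind)
    return " ".join(parts)


def dominant_signature(references):
    """Return the most frequent signature and how often it occurs."""
    counts = Counter(signature(ref) for ref in references)
    return counts.most_common(1)[0]


if __name__ == "__main__":
    # Made-up sample references, not taken from any evaluation data.
    sample = [
        "Smith, J. (1999) A title here. Journal Name 12: 34-56.",
        "Doe, A. (2004) Another title. Other Journal 3: 1-9.",
        "Roe, B. 1887. Old style title. Some Proc. 4, 100.",
    ]
    print(dominant_signature(sample))
```

In this toy setting, references that share a style collapse to the same signature regardless of the actual names and titles, so the dominant signature hints at the layout of the whole list without assuming any particular style in advance.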
