A simple method for citation metadata extraction using hidden Markov models

This paper describes a simple method for extracting metadata fields from citations using hidden Markov models (HMMs). The method is easy to implement and can achieve levels of precision and recall on heterogeneous citations comparable to or greater than those of other HMM-based methods. It consists largely of string manipulation and otherwise depends only on an implementation of the Viterbi algorithm. Because such implementations are widely available, the method can be adopted by diverse digital library systems.
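To make the two ingredients named in the abstract concrete, the following is a minimal sketch of how string manipulation and Viterbi decoding might fit together. It is an illustration under stated assumptions, not the paper's model: the state set (`author`, `date`, `title`), the symbol classes produced by `symbol`, and every probability are hypothetical values chosen by hand, whereas a real system would estimate them from labeled citations.

```python
import math
import re


def symbol(token):
    """Map a raw token to a coarse symbol class (hypothetical classes)."""
    if re.fullmatch(r"(19|20)\d\d", token):
        return "YEAR"
    if re.fullmatch(r"[A-Z]\.", token):
        return "INITIAL"
    if token[0].isupper():
        return "CAP"
    return "LOWER"


def viterbi(observations, states, start_p, trans_p, emit_p, floor=1e-12):
    """Return the most likely state sequence, computed in log space."""
    def lg(p):
        # Floor probabilities so unseen events do not produce log(0).
        return math.log(max(p, floor))

    # best[t][s]: log-probability of the best path ending in state s at step t
    best = [{s: lg(start_p.get(s, 0.0)) + lg(emit_p[s].get(observations[0], 0.0))
             for s in states}]
    back = [{}]
    for t in range(1, len(observations)):
        best.append({})
        back.append({})
        for s in states:
            prev, score = max(
                ((p, best[t - 1][p] + lg(trans_p[p].get(s, 0.0))) for p in states),
                key=lambda pair: pair[1],
            )
            best[t][s] = score + lg(emit_p[s].get(observations[t], 0.0))
            back[t][s] = prev

    # Follow the back-pointers from the best final state.
    path = [max(states, key=lambda s: best[-1][s])]
    for t in range(len(observations) - 1, 0, -1):
        path.append(back[t][path[-1]])
    path.reverse()
    return path


# Hand-picked toy parameters (assumptions, not values from the paper).
states = ["author", "date", "title"]
start_p = {"author": 0.9, "date": 0.05, "title": 0.05}
trans_p = {
    "author": {"author": 0.6, "date": 0.3, "title": 0.1},
    "date": {"author": 0.05, "date": 0.15, "title": 0.8},
    "title": {"author": 0.05, "date": 0.05, "title": 0.9},
}
emit_p = {
    "author": {"CAP": 0.5, "INITIAL": 0.45, "LOWER": 0.04, "YEAR": 0.01},
    "date": {"YEAR": 0.97, "CAP": 0.01, "INITIAL": 0.01, "LOWER": 0.01},
    "title": {"CAP": 0.3, "LOWER": 0.65, "INITIAL": 0.01, "YEAR": 0.04},
}

tokens = "Smith J. 1999 Extraction of metadata".split()
tags = viterbi([symbol(t) for t in tokens], states, start_p, trans_p, emit_p)
print(list(zip(tokens, tags)))
# tags == ['author', 'author', 'date', 'title', 'title', 'title']
```

Working in log space avoids numeric underflow on long citations, and the small probability floor is one simple way to handle symbols a state never emitted in training. Both are design choices of this sketch rather than details taken from the paper.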
