Metadata Extraction from Bibliographies Using Bigram HMM

In recent years, huge volumes of research papers have become available on the World Wide Web. Metadata provides a good approach for organizing and retrieving these resources. Accordingly, automatic extraction of metadata from these papers and their bibliographies is valuable and has been widely studied. In this paper, we use a bigram HMM (Hidden Markov Model) to automatically extract metadata (e.g. title, author, date, journal, pages) from bibliographies written in various styles. Unlike the traditional HMM, which uses only word frequency, this model also considers the bigram sequential relation between words and their position within text fields. We evaluated the model on a real corpus downloaded from the Web and compared it with other methods. Experiments show that the bigram HMM yields the best results and seems to be the most promising candidate for metadata extraction from bibliographies.
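To make the idea of a bigram emission concrete, the following is a minimal sketch and not the paper's implementation. It assumes that the emission probability of a word inside a field is a linear interpolation of a bigram estimate P(word | field, previous word) and a unigram estimate P(word | field), with add-alpha smoothing. The class name `BigramHMM`, the parameters `lam` and `alpha`, and the toy references are all hypothetical, and the position information mentioned above is left out for brevity.

```python
import math
from collections import defaultdict


class BigramHMM:
    """Toy field tagger: states are metadata fields, emissions use word bigrams."""

    def __init__(self, lam=0.5, alpha=0.1):
        self.lam = lam      # assumed weight of the bigram estimate
        self.alpha = alpha  # assumed add-alpha smoothing constant

    def fit(self, sequences):
        # sequences: list of references, each a list of (token, field) pairs
        self.states = sorted({s for seq in sequences for _, s in seq})
        self.vocab = {w for seq in sequences for w, _ in seq}
        self.start = defaultdict(float)
        self.trans = defaultdict(lambda: defaultdict(float))
        self.uni = defaultdict(lambda: defaultdict(float))
        self.bi = defaultdict(lambda: defaultdict(float))
        for seq in sequences:
            prev_w = prev_s = None
            for w, s in seq:
                if prev_s is None:
                    self.start[s] += 1
                else:
                    self.trans[prev_s][s] += 1
                self.uni[s][w] += 1
                if prev_s == s:  # count bigrams only inside one field
                    self.bi[(s, prev_w)][w] += 1
                prev_w, prev_s = w, s
        return self

    def _prob(self, counts, key, size):
        # Add-alpha smoothed relative frequency over `size` outcomes.
        total = sum(counts.values())
        return (counts.get(key, 0.0) + self.alpha) / (total + self.alpha * size)

    def _emit(self, s, w, prev_w, same_field):
        size = len(self.vocab) + 1  # one extra slot for unseen words
        p_uni = self._prob(self.uni.get(s, {}), w, size)
        if not same_field:
            return p_uni  # first word of a field: unigram estimate only
        p_bi = self._prob(self.bi.get((s, prev_w), {}), w, size)
        return self.lam * p_bi + (1.0 - self.lam) * p_uni

    def decode(self, tokens):
        # Viterbi search in log space over field sequences.
        n = len(self.states)
        score = {
            s: math.log(self._prob(self.start, s, n))
            + math.log(self._emit(s, tokens[0], None, False))
            for s in self.states
        }
        back = []
        for t in range(1, len(tokens)):
            new_score, ptr = {}, {}
            for s in self.states:
                best_r, best = None, -math.inf
                for r in self.states:
                    cand = (
                        score[r]
                        + math.log(self._prob(self.trans.get(r, {}), s, n))
                        + math.log(self._emit(s, tokens[t], tokens[t - 1], r == s))
                    )
                    if cand > best:
                        best_r, best = r, cand
                new_score[s], ptr[s] = best, best_r
            score = new_score
            back.append(ptr)
        path = [max(score, key=score.get)]
        for ptr in reversed(back):
            path.append(ptr[path[-1]])
        return path[::-1]


# Made-up training references, for illustration only.
train = [
    [("Smith", "author"), ("J", "author"), ("Hidden", "title"),
     ("Markov", "title"), ("Models", "title"), ("1999", "date")],
    [("Lee", "author"), ("K", "author"), ("Markov", "title"),
     ("Chains", "title"), ("2001", "date")],
]
model = BigramHMM().fit(train)
tokens = ["Smith", "J", "Hidden", "Markov", "Models", "1999"]
predicted = model.decode(tokens)
print(list(zip(tokens, predicted)))
```

On this toy data the decoder recovers the gold field labels of the first training reference. Setting `lam=0` reduces the sketch to a word-frequency HMM, which gives a simple way to see what the bigram term contributes.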
