BibPro: A Citation Parser Based on Sequence Alignment

The dramatic increase in the number of academic publications has led to a growing demand for efficient organization of these resources to meet researchers' needs. As a result, a number of network services have compiled databases from public resources scattered over the Internet. However, different conferences and journals adopt different citation styles. Accurately extracting metadata from a citation string formatted in one of thousands of different styles is an interesting problem that has attracted a great deal of research attention in recent years. In this paper, based on the notion of sequence alignment, we present a citation parser called BibPro that extracts the components of a citation string. To demonstrate the efficacy of BibPro, we conducted experiments on three benchmark data sets. The results show that BibPro achieved over 90 percent accuracy on each benchmark. Even when citations and associated metadata retrieved from the web are used as training data, our experiments show that BibPro still achieves reasonable performance.
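To make the notion of sequence alignment concrete, the following is a minimal sketch, not BibPro's actual implementation. It assumes that a citation string is first reduced to a sequence of token-class symbols and that this sequence is then globally aligned against the symbol sequence of a known citation using the classic Needleman-Wunsch dynamic program. The encoding scheme, the scoring values (match=1, mismatch=-1, gap=-1), and the function names encode and align are all hypothetical choices made for illustration.

```python
import re


def encode(citation):
    """Reduce a citation string to token-class symbols (hypothetical scheme).

    "N" = number, "A" = capitalized word, "a" = other word;
    punctuation characters are kept as their own symbols.
    """
    symbols = []
    for token in re.findall(r"\w+|[^\w\s]", citation):
        if token.isdigit():
            symbols.append("N")
        elif token[0].isalnum() or token[0] == "_":
            symbols.append("A" if token[0].isupper() else "a")
        else:
            symbols.append(token)
    return symbols


def align(seq_a, seq_b, match=1, mismatch=-1, gap=-1):
    """Needleman-Wunsch global alignment of two symbol sequences.

    Returns (score, pairs), where pairs is a list of (i, j) index pairs
    and None marks a gap on that side. Scoring values are assumptions.
    """
    rows, cols = len(seq_a) + 1, len(seq_b) + 1
    score = [[0] * cols for _ in range(rows)]
    for i in range(1, rows):
        score[i][0] = i * gap
    for j in range(1, cols):
        score[0][j] = j * gap

    def sub(i, j):
        # Substitution score for aligning seq_a[i-1] with seq_b[j-1].
        return match if seq_a[i - 1] == seq_b[j - 1] else mismatch

    # Fill the dynamic-programming table.
    for i in range(1, rows):
        for j in range(1, cols):
            score[i][j] = max(
                score[i - 1][j - 1] + sub(i, j),
                score[i - 1][j] + gap,
                score[i][j - 1] + gap,
            )

    # Trace back from the bottom-right cell to recover one optimal alignment.
    pairs = []
    i, j = len(seq_a), len(seq_b)
    while i > 0 or j > 0:
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + sub(i, j):
            pairs.append((i - 1, j - 1))
            i, j = i - 1, j - 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            pairs.append((i - 1, None))
            i -= 1
        else:
            pairs.append((None, j - 1))
            j -= 1
    pairs.reverse()
    return score[-1][-1], pairs


if __name__ == "__main__":
    known = encode("Smith, J. Parsing citations, 2008")
    query = encode("Lee, K. Aligning strings, 2004")
    total, pairs = align(query, known)
    print(total, pairs)
```

In a sketch like this, the field labels attached to the known citation's tokens (author, title, year, and so on) could be transferred to the query tokens they align with. How BibPro actually builds and matches its templates is not specified in this abstract.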
