CEBBIP: a parser of bibliographic information in chinese electronic books

Bibliographic information is essential for many digital library applications, such as citation analysis, academic searching and topic discovery. And bibliographic data extraction has attracted a great deal of attention in recent years. In this paper, we address the problem of automatic extraction of bibliographic data in Chinese electronic book and propose a tool called CEBBIP* for the task, which includes three main systems: data preprocessing, data parsing and data postprocessing. In the data preprocessing system, the tool adopts a rules-based method to locate citation data in a book and to segment citation data into citation strings of individual referencing literature. And a learning-based approach, Conditional Random Fields (CRF), is employed to parse citation strings in the data parsing system. Finally, the tool takes advantage of document intrinsic local format consistency to enhance citation data segmentation and parsing through clustering techniques. CEBBIP has been used in a commercial E-book production system. Experimental results show that CEBBIP's precision rate is very high. More specially, adopting the document intrinsic local format consistency obviously improves the citation data segmenting and parsing accuracy.

[1]  Zhi Tang,et al.  A mixed approach to book splitting , 2008, Electronic Imaging.

[2]  Jan-Ming Ho,et al.  Extracting Citation Metadata from Online Publication Lists Using BLAST , 2004, PAKDD.

[3]  Jan-Ming Ho,et al.  BibPro: A Citation Parser Based on Sequence Alignment Techniques , 2008, 22nd International Conference on Advanced Information Networking and Applications - Workshops (aina workshops 2008).

[4]  Abdel Belaïd,et al.  A segmentation method for bibliographic references by contextual tagging of fields , 2003, Seventh International Conference on Document Analysis and Recognition, 2003. Proceedings..

[5]  Roni Rosenfeld,et al.  Learning Hidden Markov Model Structure for Information Extraction , 1999 .

[6]  C. Lee Giles,et al.  CiteSeer: an autonomous Web agent for automatic retrieval and identification of interesting publications , 1998, AGENTS '98.

[7]  Edward A. Fox,et al.  Automatic document metadata extraction using support vector machines , 2003, 2003 Joint Conference on Digital Libraries, 2003. Proceedings..

[8]  Shih-Hung Wu,et al.  Reference metadata extraction using a hierarchical knowledge representation framework , 2007, Decis. Support Syst..

[9]  Irwin King,et al.  Bibliographic Attributes Extraction with Layer-upon-Layer Tagging , 2007 .

[10]  Andrew McCallum,et al.  Accurate Information Extraction from Research Papers using Conditional Random Fields , 2004, NAACL.

[11]  Anil K. Jain,et al.  Data clustering: a review , 1999, CSUR.

[12]  Atsuhiro Takasu,et al.  Bibliographic attribute extraction from erroneous references based on a statistical model , 2003, 2003 Joint Conference on Digital Libraries, 2003. Proceedings..

[13]  Marcos André Gonçalves,et al.  FLUX-CIM: flexible unsupervised extraction of citation metadata , 2007, JCDL '07.