Paper information - Citation recognition for scientific publications in digital libraries

A method based on part-of-speech (PoS) tagging is used to recognize the structure of bibliographic references. The method operates on a roughly structured ASCII file produced by OCR. Because reference structure is heterogeneous, the method works bottom-up, without an a priori model, assembling structural elements from basic tags into subfields and then fields. Significant tags are first grouped into homogeneous classes according to their categories and then reduced to canonical forms corresponding to record fields: "authors", "title", "conference name", "date", etc. Unlabeled tokens are integrated into one field or another either by applying PoS correction rules or by using an inter- or intra-field model generated from well-detected records. The prototype performs satisfactorily on different record layouts and character recognition qualities. Without manual intervention, 96.6% of words are correctly attributed, and about 75.9% of the 2,575 references are completely segmented.
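The bottom-up idea described above (basic tags, then a correction rule, then merging into fields) can be illustrated with a minimal Python sketch. This is not the authors' implementation: the tag names (`INITIAL`, `NAME`, `YEAR`, `PAGES`, `WORD`), the regular expressions, the single correction rule, and the sample reference are all assumptions made for illustration, and the inter- and intra-field models mentioned in the abstract are not reproduced.

```python
import re
from itertools import groupby

# Hypothetical basic tags; the abstract does not list the paper's tag set.
TAG_PATTERNS = [
    ("INITIAL", re.compile(r"^[A-Z]\.,?$")),               # e.g. "J."
    ("YEAR", re.compile(r"^\(?(19|20)\d{2}\)?[.,]?$")),    # e.g. "1999."
    ("PAGES", re.compile(r"^\d+-\d+[.,]?$")),              # e.g. "12-34,"
]

# Assumed mapping from tags to record fields; "text" marks unresolved tokens.
TAG_TO_FIELD = {
    "INITIAL": "authors",
    "NAME": "authors",
    "YEAR": "date",
    "PAGES": "pages",
    "WORD": "text",
}


def basic_tag(token):
    """Assign the first matching basic tag, or WORD if no pattern fires."""
    for tag, pattern in TAG_PATTERNS:
        if pattern.match(token):
            return tag
    return "WORD"


def segment(reference):
    """Tag tokens, apply one correction rule, then merge runs into fields."""
    tokens = reference.split()
    tags = [basic_tag(t) for t in tokens]

    # Illustrative correction rule: a plain word right after an initial
    # is treated as a surname.
    corrected = list(tags)
    for i in range(1, len(tags)):
        if tags[i] == "WORD" and tags[i - 1] == "INITIAL":
            corrected[i] = "NAME"

    fields = [TAG_TO_FIELD[t] for t in corrected]

    # Bottom-up grouping: consecutive tokens with the same field are merged.
    result = []
    for field, group in groupby(zip(fields, tokens), key=lambda pair: pair[0]):
        result.append((field, " ".join(tok for _, tok in group)))
    return result


# Made-up reference string, used only to exercise the sketch.
print(segment("J. Doe, A. Smith. A sample title about parsing. 1999."))
```

On the made-up input, the author run, the untagged middle run, and the year come out as three separate segments. A real system would still need further rules or a learned model to decide which field the leftover "text" run belongs to.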

Abdel Belaïd | Dominique Besagni

[1] Abdel Belaïd. Retrospective document conversion: application to the library domain, 1998, International Journal on Document Analysis and Recognition.

[2] Abdel Belaïd. Recognition of table of contents for electronic library consulting, 2001, International Journal on Document Analysis and Recognition.

[3] Richard M. Schwartz, et al. Coping with Ambiguity and Unknown Words through Probabilistic Models, 1993, CL.

[4] M. Suzuki, et al. Automatic reference linking in distributed digital libraries, 2003, 2003 Conference on Computer Vision and Pattern Recognition Workshop.

[5] Bernard Mérialdo, et al. Tagging English Text with a Probabilistic Model, 1994, CL.

[6] Julian M. Kupiec, et al. Robust part-of-speech tagging using a hidden Markov model, 1992.

[7] Ian Marshall, et al. Choice of grammatical word-class without global syntactic analysis: Tagging words in the lob corpus, 1983, Comput. Humanit.

[8] Steven J. DeRose, et al. Grammatical Category Disambiguation by Statistical Optimization, 1988, CL.

[9] C. Lee Giles, et al. Digital Libraries and Autonomous Citation Indexing, 1999, Computer.

[10] B. C. Griffith, et al. The Structure of Scientific Literatures I: Identifying and Graphing Specialties, 1974.

[11] Eric Brill, et al. A Simple Rule-Based Part of Speech Tagger, 1992, HLT.

[12] Atro Voutilainen, et al. Tagging accurately - Don't guess if you know, 1994, ANLP.