Paper information - Approximate matching for OCR-processed bibliographic data

Approximate matching for OCR-processed bibliographic data

This paper presents a method for matching bibliographic entries in the reference lists of academic papers, obtained as document images, against records in bibliographic databases. Its main concern is handling the erroneous bibliographic data produced by a document understanding methodology. Despite string errors, the method can find a candidate record set in the bibliographic databases by means of approximate matching, which is performed as exact matching of k substrings of length m chosen from the bibliographic strings in the references and in the databases. For an OCR accuracy of α, theoretical analysis shows that the accuracy of the presented method is 1 - (1 - α^m)^k, under the assumption that OCR errors occur randomly and independently in the string. The method is applied to the references of 187 Japanese articles and achieves an accuracy of 94.05%.
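The idea in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names, the evenly spaced choice of substring positions, and the use of plain substring containment as the "exact match" are all assumptions made for illustration, since the abstract does not say how the k substrings are chosen or indexed. Only the formula 1 - (1 - α^m)^k comes from the abstract itself.

```python
def match_probability(alpha: float, m: int, k: int) -> float:
    """Formula from the abstract: chance that at least one of k
    length-m substrings is free of OCR errors, given per-character
    accuracy alpha and independent, random errors."""
    return 1.0 - (1.0 - alpha ** m) ** k


def sample_substrings(text: str, m: int, k: int) -> list[str]:
    """Pick k substrings of length m at evenly spaced offsets.
    The spacing rule is a hypothetical choice, not taken from the paper."""
    if len(text) < m or k < 1:
        return []
    last_start = len(text) - m
    if k == 1:
        starts = [0]
    else:
        starts = [round(i * last_start / (k - 1)) for i in range(k)]
    return [text[s:s + m] for s in starts]


def find_candidates(ocr_text: str, records: list[str], m: int, k: int) -> list[str]:
    """Return every record that contains at least one sampled substring
    exactly. A real system would likely use an index instead of a scan."""
    keys = sample_substrings(ocr_text, m, k)
    return [rec for rec in records if any(key in rec for key in keys)]


if __name__ == "__main__":
    records = [
        "Fuzzy Full-Text Searches in OCR Databases",
        "The Third Annual Test of OCR Accuracy",
    ]
    # The OCR output contains one wrong character ("Searchcs").
    ocr_text = "Fuzzy Full-Text Searchcs in OCR Databases"
    print(find_candidates(ocr_text, records, m=6, k=4))
    print(match_probability(alpha=0.95, m=5, k=3))
```

In this sketch a single OCR error spoils only the substrings that overlap it, so the remaining substrings can still retrieve the correct record. Raising k increases the chance that at least one substring is clean, while raising m makes each substring more selective but more likely to contain an error.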
