Approximate matching for OCR-processed bibliographic data

This paper presents a method for matching bibliographic entries in the reference lists of academic papers, obtained as document images, with records in bibliographic databases. Its main concern is handling the erroneous bibliographic data produced by a document understanding methodology. The presented method can find a set of candidate records in the reference databases despite errors in the strings. It does so by approximate matching, which is performed as exact matching of k substrings of length m chosen from the bibliographic strings in the references and in the databases. For an OCR accuracy of $\alpha$, theoretical analysis shows that the accuracy of the presented method is $1-(1-\alpha^m)^k$, under the assumption that OCR errors occur randomly and independently in the string. The method is applied to the references of 187 Japanese articles and achieves an accuracy of 94.05%.
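The accuracy formula can be read as follows, taking α as a per-character accuracy (which the form of the expression suggests): a substring of length m is read without error with probability α^m, so all k substrings are corrupted with probability (1-α^m)^k, and at least one survives for exact matching with probability 1-(1-α^m)^k. The Python sketch below illustrates this idea and is not the authors' implementation. The abstract does not say how the k substrings are chosen, so the sketch assumes disjoint consecutive chunks taken from the start of the OCR string, and it assumes a record is a candidate if it contains at least one chunk. The function names `expected_accuracy`, `disjoint_substrings`, and `candidate_records` are hypothetical.

```python
def expected_accuracy(alpha: float, m: int, k: int) -> float:
    """Probability that at least one of k length-m substrings is error-free,
    assuming independent errors and per-character accuracy alpha."""
    return 1.0 - (1.0 - alpha ** m) ** k


def disjoint_substrings(text: str, m: int, k: int) -> list[str]:
    """Return up to k non-overlapping substrings of length m.

    Assumption: chunks are taken consecutively from the start of the string;
    the source does not specify the selection rule.
    """
    chunks = []
    for start in range(0, len(text) - m + 1, m):
        if len(chunks) == k:
            break
        chunks.append(text[start:start + m])
    return chunks


def candidate_records(ocr_text: str, records: list[str], m: int, k: int) -> list[str]:
    """Keep every record that contains at least one chunk exactly."""
    keys = disjoint_substrings(ocr_text, m, k)
    return [rec for rec in records if any(key in rec for key in keys)]


if __name__ == "__main__":
    # Illustrative data only: the OCR string has one misread character ("q" for "g").
    records = [
        "Approximate matching for OCR-processed bibliographic data",
        "Unrelated title about graphs",
    ]
    ocr_text = "Approximate matchinq for OCR-processed data"
    print(candidate_records(ocr_text, records, m=5, k=4))
    print(expected_accuracy(alpha=0.9, m=5, k=4))
```

In this sketch the misread character corrupts only one of the four chunks, so the other chunks still match the correct record exactly. Larger k raises the chance that some chunk survives, while larger m makes each chunk more selective but less likely to be error-free.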