Quality enhancement in information extraction from scanned documents

Digitizing printed documents is an important step in constructing a large document archive. Although various document image analysis techniques, such as optical character recognition (OCR), have been developed, real document archive systems still require error handling. This paper addresses the problem from the perspective of quality enhancement and proposes a robust method for extracting references from academic articles that have been scanned and marked up with OCR. We applied the proposed method to articles from various journals, and the experiments showed that it achieved a recognition accuracy of more than 94%. This paper also discusses manual correction and experimentally investigates the relationship between extraction accuracy and cost reduction.
