Paper information - Pattern matcher for OCR-corrupted documents and its evaluation

Document classification is a fundamental prerequisite for document routing, document understanding, and information extraction. Pattern matchers with rule-based components are already in use at news agencies, where the input is electronic text. Classifying OCR documents, however, means dealing with the ambiguities of the underlying OCR engine. Ambiguities in character segmentation and character classification mean that the OCR process yields a directed graph of characters rather than a single string: the so-called character hypothesis lattice (CHL). This paper deals with techniques for enhancing the pattern matcher so that it can cope with CHLs.
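To make the notion of a character hypothesis lattice more concrete, the following is a minimal illustrative sketch and not the matcher described in the paper. It assumes a toy lattice whose nodes are segmentation points and whose edges each carry one hypothesised character, with no confidence scores. The names `CharLattice`, `lattice_contains`, and `build_example` are hypothetical and chosen only for this example.

```python
class CharLattice:
    """Toy character hypothesis lattice (assumed structure, for illustration).

    Nodes are segmentation points; each edge carries one hypothesised
    character (or short string) proposed by the OCR engine.
    """

    def __init__(self):
        self.edges = {}  # node -> list of (label, next_node)

    def add_edge(self, src, label, dst):
        if not label:
            raise ValueError("edge label must be non-empty")
        self.edges.setdefault(src, []).append((label, dst))
        self.edges.setdefault(dst, [])  # make sure the target node exists


def lattice_contains(lattice, word):
    """Return True if some path in the lattice spells `word`.

    The path may start at any node, so this behaves like a substring
    search over every reading the lattice allows.
    """

    def walk(node, rest):
        if not rest:
            return True
        return any(
            rest.startswith(label) and walk(nxt, rest[len(label):])
            for label, nxt in lattice.edges[node]
        )

    return any(walk(start, word) for start in lattice.edges)


def build_example():
    """Lattice for a scanned word that may read 'modern' or 'modem'."""
    lat = CharLattice()
    lat.add_edge(0, "m", 1)
    lat.add_edge(1, "o", 2)
    lat.add_edge(1, "0", 2)  # classification ambiguity: 'o' vs '0'
    lat.add_edge(2, "d", 3)
    lat.add_edge(3, "e", 4)
    lat.add_edge(4, "r", 5)  # segmentation ambiguity: 'r' + 'n' ...
    lat.add_edge(5, "n", 6)
    lat.add_edge(4, "m", 6)  # ... or a single 'm'
    return lat
```

In this sketch a keyword rule fires if any reading of the lattice contains the keyword, so both "modern" and "modem" match the same scanned word. A realistic matcher would presumably also weigh hypothesis confidences, which this example leaves out.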

Stefan Agne | Hans-Guenther Hein
