Pattern matcher for OCR-corrupted documents and its evaluation

Document classification is a fundamental technology that precedes document routing, document understanding, and information extraction. Pattern matchers with rule-based components are already in use at news agencies, where the input is electronic text. Classification of OCR documents, however, must deal with the ambiguities of the underlying OCR engine. Ambiguities in character segmentation and character classification mean that the OCR process yields a directed graph of characters, the so-called character hypothesis lattice (CHL), rather than a single character string. This paper describes techniques that enhance the pattern matcher so that it can cope with CHLs.
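
To make the idea of a character hypothesis lattice concrete, the following is a minimal sketch and not the matcher described in this paper. It assumes that lattice nodes are segmentation points and that each edge carries one character hypothesis; the names CharLattice, add and find_word are hypothetical and chosen for illustration only. The sketch reports every pair of nodes between which some path through the lattice spells a given word.

```python
from collections import defaultdict


class CharLattice:
    """Illustrative lattice: nodes are segmentation points (ints),
    edges are single-character hypotheses (an assumption of this sketch)."""

    def __init__(self):
        self.edges = defaultdict(list)  # node -> list of (char, next_node)

    def add(self, src, char, dst):
        self.edges[src].append((char, dst))

    def nodes(self):
        found = set(self.edges)
        for outgoing in self.edges.values():
            for _, dst in outgoing:
                found.add(dst)
        return found


def find_word(lattice, word):
    """Return sorted (start, end) node pairs where some path spells `word`."""
    hits = set()
    for start in lattice.nodes():
        stack = [(start, 0)]  # (current node, characters matched so far)
        seen = set()
        while stack:
            node, i = stack.pop()
            if (node, i) in seen:
                continue
            seen.add((node, i))
            if i == len(word):
                hits.add((start, node))
                continue
            # .get avoids creating empty entries in the defaultdict
            for char, dst in lattice.edges.get(node, ()):
                if char == word[i]:
                    stack.append((dst, i + 1))
    return sorted(hits)


# Example lattice for a scanned word that may read "modern" or "modem".
lattice = CharLattice()
for src, char, dst in [
    (0, "m", 1),
    (1, "o", 2), (1, "0", 2),   # classification ambiguity: "o" vs "0"
    (2, "d", 3),
    (3, "e", 4),
    (4, "r", 5), (5, "n", 6),   # segmentation hypothesis: "r" + "n"
    (4, "m", 6),                # segmentation hypothesis: single "m"
]:
    lattice.add(src, char, dst)

print(find_word(lattice, "modern"))
print(find_word(lattice, "modem"))
```

In this sketch both readings are found on the same span of the lattice, which is the kind of ambiguity a CHL-aware pattern matcher has to handle. A realistic matcher would presumably also weigh hypotheses by OCR confidence, which is omitted here.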