Document highlighting - message classification in printed business letters

This paper presents the INFOCLAS system, which applies statistical methods of information retrieval primarily to the classification of German business letters into message types such as order, offer, and confirmation. INFOCLAS is a first step towards document understanding. It currently comprises three modules: the central indexer (extraction and weighting of indexing terms), the classifier (classification of business letters into given types), and the focuser (highlighting of relevant letter parts). The system employs several knowledge sources, including a database of about 100 letters, word frequency statistics for German, message-type-specific words, morphological knowledge, and the underlying document model. As output, the system produces a set of weighted hypotheses about the type of the letter at hand or highlights relevant text (text focus). Classification of documents allows the automatic distribution or archiving of letters and is also an excellent starting point for higher-level document analysis.
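
To make the three-module structure concrete, the following is a minimal Python sketch of an indexer, a classifier, and a focuser working together. It is not the INFOCLAS implementation: the weighting formula (term frequency scaled by the log of inverse general-language frequency), the word lists in `TYPE_WORDS`, the frequencies in `GENERAL_FREQ`, and all function names are assumptions chosen for illustration, and morphological analysis and the document model are left out entirely.

```python
import math
import re
from collections import Counter

# Hypothetical message-type-specific words (illustrative only).
TYPE_WORDS = {
    "order": {"bestellen", "bestellung", "liefern"},
    "offer": {"angebot", "anbieten", "preis"},
    "confirmation": {"bestätigen", "bestätigung"},
}

# Hypothetical relative frequencies of common German words.
GENERAL_FREQ = {"die": 0.03, "und": 0.03, "wir": 0.01, "sie": 0.01}
DEFAULT_FREQ = 1e-5  # assumed frequency for words missing from the table


def index_terms(text):
    """Indexer: weight each term by its frequency in the letter,
    scaled down for words that are common in general German."""
    tokens = re.findall(r"\w+", text.lower())
    counts = Counter(tokens)
    total = len(tokens)
    return {
        term: (count / total) * math.log(1.0 / GENERAL_FREQ.get(term, DEFAULT_FREQ))
        for term, count in counts.items()
    }


def classify(weights):
    """Classifier: return (message type, normalized score) pairs, best first."""
    scores = {
        label: sum(w for term, w in weights.items() if term in words)
        for label, words in TYPE_WORDS.items()
    }
    norm = sum(scores.values())
    if norm == 0:
        return [(label, 0.0) for label in scores]
    ranked = [(label, score / norm) for label, score in scores.items()]
    return sorted(ranked, key=lambda pair: pair[1], reverse=True)


def focus(text, label):
    """Focuser: keep sentences containing a word specific to the given type."""
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    words = TYPE_WORDS[label]
    return [s for s in sentences if words & set(re.findall(r"\w+", s.lower()))]


letter = ("Hiermit bestellen wir 20 Drucker. "
          "Bitte liefern Sie die Ware bis Montag. "
          "Mit freundlichen Grüßen")
hypotheses = classify(index_terms(letter))
highlighted = focus(letter, hypotheses[0][0])
```

Here `hypotheses` plays the role of the set of weighted hypotheses about the letter type, and `highlighted` plays the role of the text focus for the best-ranked type. A real system would match on word stems rather than surface forms, which is where morphological knowledge would come in.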
