Using IR techniques for text classification in document analysis

This paper presents the INFOCLAS system, which applies statistical methods of information retrieval to the classification of German business letters into corresponding message types such as order, offer, enclosure, etc. INFOCLAS is a first step towards document understanding, proceeding to a classification-driven extraction of information. The system is composed of two main modules: the central indexer (extraction and weighting of indexing terms) and the classifier (classification of business letters into given types). The system employs several knowledge sources, including a letter database, word frequency statistics for German, lists of message-type-specific words, morphological knowledge, and the underlying document structure. As output, the system produces a set of weighted hypotheses about the type of the letter at hand. Classification of documents allows the automatic distribution or archiving of letters and is also an excellent starting point for higher-level document analysis.
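The two-module design described above (an indexer that extracts and weights terms, and a classifier that outputs weighted hypotheses) can be illustrated with a minimal sketch. The abstract does not specify the weighting scheme or the matching function, so the tf-idf weights, the cosine similarity, the function names, and the toy English letters below are all assumptions made for illustration; this is not the INFOCLAS implementation, and it omits the morphological and structural knowledge sources.

```python
import math
from collections import Counter, defaultdict


def tokenize(text):
    """Lowercase the text and keep purely alphabetic whitespace-separated tokens."""
    return [w for w in text.lower().split() if w.isalpha()]


def build_index(labelled_letters):
    """Indexer (assumed tf-idf): return idf weights and one summed
    tf-idf term profile per message type."""
    n_docs = len(labelled_letters)
    doc_freq = Counter()
    for _, text in labelled_letters:
        doc_freq.update(set(tokenize(text)))
    # Smoothed inverse document frequency; rarer terms get larger weights.
    idf = {t: math.log((1 + n_docs) / (1 + df)) + 1.0 for t, df in doc_freq.items()}
    profiles = defaultdict(Counter)
    for label, text in labelled_letters:
        for term, tf in Counter(tokenize(text)).items():
            profiles[label][term] += tf * idf[term]
    return idf, dict(profiles)


def cosine(a, b):
    """Cosine similarity between two sparse term-weight mappings."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def classify(text, idf, profiles):
    """Classifier: return (message type, score) hypotheses, best first."""
    # Terms unseen during indexing carry no weight and are skipped.
    vec = {t: tf * idf[t] for t, tf in Counter(tokenize(text)).items() if t in idf}
    scores = [(label, cosine(vec, profile)) for label, profile in profiles.items()]
    return sorted(scores, key=lambda s: s[1], reverse=True)


# Hypothetical toy "letter database"; real input would be German business letters.
TRAINING = [
    ("order", "we hereby order ten printers for delivery next week"),
    ("order", "please deliver the items we order below"),
    ("offer", "we are pleased to offer you the following price"),
    ("offer", "our offer includes a special price and discount"),
]

IDF, PROFILES = build_index(TRAINING)
hypotheses = classify("we would like to order two printers", IDF, PROFILES)
for label, score in hypotheses:
    print(f"{label}: {score:.3f}")
```

Returning the full ranked list, rather than only the best label, mirrors the abstract's description of the output as a set of weighted hypotheses that later analysis stages could confirm or reject.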
