Advances in Document Classification by Voting of Competitive Approaches

This paper presents a complex approach to the content-based categorization of printed German business letters into pre-defined message types such as order, invoice, and offer. The results of two competing classifiers are combined by a voting component that embodies knowledge about the strengths and weaknesses of each classifier. The individual classifiers differ strongly in their basic assumptions: the first considers layout and typographic information with respect to certain keywords, while the second is a more conventional text categorization approach that relies on textual features alone. Because the categorization tool is embedded in a document analysis system, highly precise classification is essential: it drives the subsequent goal-directed extraction of structured information, which in turn allows the document to be integrated into a company's current business workflow.
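As a rough illustration of how a voting component might combine two classifiers using knowledge about their strengths and weaknesses, the following sketch weights each classifier's score by a per-class reliability value and rejects low-confidence decisions. This is not the paper's method: the function name `vote`, the reliability tables, the default reliability of 0.5, and the rejection threshold are all hypothetical choices made for the example.

```python
from typing import Dict, Optional

Scores = Dict[str, float]  # message type -> classifier confidence in [0, 1]


def vote(
    layout_scores: Scores,
    text_scores: Scores,
    layout_reliability: Scores,
    text_reliability: Scores,
    reject_threshold: float = 0.5,
) -> Optional[str]:
    """Combine two classifiers by reliability-weighted voting (illustrative only).

    The reliability tables stand in for "knowledge about strengths and
    weaknesses": how far each classifier is trusted for each message type.
    Classes missing from a table get a neutral weight of 0.5 (an assumption).
    """
    classes = sorted(set(layout_scores) | set(text_scores))
    if not classes:
        return None

    combined = {}
    for c in classes:
        combined[c] = (
            layout_reliability.get(c, 0.5) * layout_scores.get(c, 0.0)
            + text_reliability.get(c, 0.5) * text_scores.get(c, 0.0)
        )

    best = max(classes, key=lambda c: combined[c])
    # Reject instead of guessing when the combined evidence is weak,
    # since downstream extraction depends on a precise class label.
    return best if combined[best] >= reject_threshold else None


if __name__ == "__main__":
    decision = vote(
        layout_scores={"invoice": 0.9, "order": 0.1},
        text_scores={"invoice": 0.4, "order": 0.6},
        layout_reliability={"invoice": 0.9, "order": 0.5},
        text_reliability={"invoice": 0.6, "order": 0.8},
    )
    print(decision)
```

Returning `None` for weak evidence reflects the abstract's emphasis on precision: in a setting like this, a rejected letter can be routed to manual handling rather than being passed to extraction with a wrong message type.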
