Learning of Pattern-Based Rules for Document Classification

Automatic processing of office documents, such as orders, invoices, or offers, holds significant potential for saving costs. Because such domains have a high percentage of special vocabulary, purely statistical approaches fail in automatic classification. The inherent structure and the short texts of these documents require specific approaches. We propose a rule-based method to classify mixed stacks of documents into a set of hierarchically organized classes. Rules are learned by extracting patterns of different types from a document sample. The paper focuses on the architecture and on the learning process, presents results in comparison with other techniques, and gives an outlook on how to further improve the system.