Rule-based versus training-based extraction of index terms from business documents: how to combine the results

Current systems for automatic extraction of index terms from business documents take either a rule-based or a training-based approach. Since each approach has its own advantages and disadvantages, it seems natural to combine them to get the best of both worlds. We present a combination method consisting of three steps (selection, normalization, and combination) that relies on comparable scores produced during extraction. Furthermore, we develop novel evaluation metrics that support the assessment of each step in an existing extraction system. Our methods were evaluated on an example extraction system with three individual extractors and a corpus of 12,000 scanned business documents.
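To make the three steps concrete, the following is a minimal sketch of one way such a pipeline could look. It is not the method from the paper: the `Candidate` type, the extractor names, the top-k selection, the min-max normalization over assumed known score ranges, and the sum-based combination are all illustrative assumptions.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Candidate:
    extractor: str  # hypothetical extractor name, e.g. "rules" or "trained"
    field: str      # index field, e.g. "invoice_number"
    value: str      # extracted index term
    score: float    # raw, extractor-specific score


def select(cands, top_k=2):
    """Selection (assumed): keep the top_k candidates per extractor and field."""
    groups = defaultdict(list)
    for c in cands:
        groups[(c.extractor, c.field)].append(c)
    kept = []
    for group in groups.values():
        kept.extend(sorted(group, key=lambda c: c.score, reverse=True)[:top_k])
    return kept


def normalize(cands, score_ranges):
    """Normalization (assumed): min-max scale raw scores to [0, 1] per extractor."""
    out = []
    for c in cands:
        lo, hi = score_ranges[c.extractor]
        s = (c.score - lo) / (hi - lo) if hi > lo else 0.0
        out.append(Candidate(c.extractor, c.field, c.value, min(1.0, max(0.0, s))))
    return out


def combine(cands):
    """Combination (assumed): sum the scores of agreeing values, pick the best per field."""
    totals = defaultdict(float)
    for c in cands:
        totals[(c.field, c.value)] += c.score
    best = {}
    for (field, value), total in totals.items():
        if field not in best or total > best[field][1]:
            best[field] = (value, total)
    return best


# Made-up example data: two extractors with different raw score scales.
candidates = [
    Candidate("rules", "invoice_number", "4711", 80.0),
    Candidate("rules", "invoice_number", "2012", 40.0),
    Candidate("trained", "invoice_number", "4711", 0.5),
    Candidate("trained", "invoice_number", "2012", 0.75),
]
ranges = {"rules": (0.0, 100.0), "trained": (0.0, 1.0)}
result = combine(normalize(select(candidates), ranges))
print(result["invoice_number"][0])
```

In this toy input the trained extractor prefers "2012", but once both score scales are mapped to a common range the agreement on "4711" outweighs it. This illustrates why comparable scores matter before results are combined.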
