Interactive Definition and Tuning of One-Class Classifiers for Document Image Classification

With the growing mass of data, document image classification systems must meet new requirements, such as processing heterogeneous data streams efficiently. In general, little is known in advance about the content of the streams to be processed. Furthermore, because labelled data are costly to obtain, the classification model has to be learned from a few labelled examples. In this context, we believe that combining one-class classifiers is an attractive way to quickly define and tune classification systems dedicated to different document streams. The main advantage of one-class classifiers is that the classifier models are independent of one another, so document classes can easily be removed, added or modified, and such a reconfiguration has no impact on the other classifiers. Each classifier can also use its own set of features, different from those of the other classifiers, whether it handles the same class or a different one. In return, because only one class is well specified during the learning step, one-class classifiers must be defined carefully to perform well: selecting representative training examples and discriminative features is more difficult with only positive examples. To overcome these difficulties, we have defined a complete framework offering different methods that help a system designer define and tune one-class classifier models. The aim is to ease the selection of good training examples and suitable features for the class to be recognized in the document stream. To that end, the proposed methods compute different measures to evaluate the relevance of the available features and training examples.
Moreover, a visualization of the decision space for the selected examples and features is proposed to support this choice, and the model parameters can be tuned automatically for the class to be recognized when a validation stream is available. The relevance of the proposed framework is illustrated on two use cases (a real data stream and a public data set).
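
The independence property described above can be illustrated with a minimal sketch. It is not the framework proposed in the paper: the centroid-and-radius model, the `slack` parameter, and the names `CentroidOneClass` and `classify` are assumptions made only for illustration. Each class is learned from positive examples alone, uses its own feature subset, and can be added to or removed from the pool without touching the other models.

```python
import math
from dataclasses import dataclass
from typing import Dict, List, Optional, Sequence


@dataclass
class CentroidOneClass:
    """Toy one-class model (illustrative assumption, not the paper's model).

    Accepts a sample when it lies within `radius` of the training centroid,
    measured only on this class's own feature subset.
    """
    feature_idx: Sequence[int]
    centroid: List[float]
    radius: float

    @classmethod
    def fit(cls, positives, feature_idx, slack=1.1):
        # Learn from positive examples only; `slack` widens the boundary a bit.
        rows = [[x[i] for i in feature_idx] for x in positives]
        centroid = [sum(col) / len(rows) for col in zip(*rows)]
        radius = slack * max(math.dist(r, centroid) for r in rows)
        return cls(feature_idx, centroid, radius)

    def score(self, x) -> float:
        # Positive inside the accepted region, negative outside.
        projected = [x[i] for i in self.feature_idx]
        return self.radius - math.dist(projected, self.centroid)


def classify(models: Dict[str, CentroidOneClass], x) -> Optional[str]:
    """Return the best-scoring class that accepts x, or None (reject)."""
    scores = {name: m.score(x) for name, m in models.items()}
    accepted = {name: s for name, s in scores.items() if s >= 0}
    return max(accepted, key=accepted.get) if accepted else None


# Two hypothetical document classes, each with its own feature subset.
models = {
    "invoice": CentroidOneClass.fit(
        [[0.9, 0.1, 5.0], [1.1, 0.2, 5.2]], feature_idx=[0, 1]),
    "form": CentroidOneClass.fit(
        [[5.0, 5.0, 0.1], [5.2, 4.8, 0.0]], feature_idx=[1, 2]),
}
```

Deleting an entry (`del models["invoice"]`) or adding a new one leaves every other model unchanged, and a sample rejected by all models is simply left unclassified, which is what makes this kind of pool convenient for streams whose content is only partly known.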
