Hybrid Feature Selection for Historical Document Layout Analysis

In this paper we propose a novel hybrid feature selection method for historical Document Image Analysis (DIA). The method cascades an adapted greedy forward selection with genetic selection. We apply it to the task of historical document layout analysis on three handwritten datasets of diverse nature. The documents contain complex layouts, different handwriting styles, and various signs of decay. The task is to segment each page into four areas: periphery, background, text block, and decoration. The proposed method selected significantly fewer features and achieved significantly lower error rates than using all features. Compared to several conventional feature selection methods, the proposed method is competitive with respect to both the number of selected features and the resulting error rates. In addition, we found that some features, e.g., Gradient, Laplacian, and local binary patterns (LBP), are selected by most of the feature selection methods, and we offer some explanations for this. This finding offers a clue for layout analysis of handwritten documents in general.
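To make the cascading idea concrete, the following is a minimal Python sketch of a two-stage selection: a plain greedy forward pass that builds a candidate subset, followed by a small genetic search restricted to that subset. It is not the adapted algorithm of this paper, whose details are not given in the abstract. The function names, the GA parameters, the `noise_a`/`noise_b` features, and the toy scoring function are all assumptions made for illustration; in a real wrapper setting the score would typically be a classifier's validation accuracy on the segmentation task.

```python
import random


def greedy_forward(features, score):
    """Stage 1 (illustrative): add the single best feature until the score stops improving."""
    selected, best = [], float("-inf")
    remaining = list(features)
    while remaining:
        top_score, top_feat = max((score(selected + [f]), f) for f in remaining)
        if top_score <= best:
            break  # no remaining feature improves the current subset
        selected.append(top_feat)
        remaining.remove(top_feat)
        best = top_score
    return selected


def genetic_refine(candidates, score, generations=30, pop_size=20, p_mut=0.1, seed=0):
    """Stage 2 (illustrative): evolve bit masks over the stage-1 candidates only."""
    rng = random.Random(seed)
    n = len(candidates)

    def decode(mask):
        return [f for f, bit in zip(candidates, mask) if bit]

    def fitness(mask):
        return score(decode(mask))

    # Seed the population with the full stage-1 subset plus random masks.
    population = [[1] * n] + [
        [rng.randint(0, 1) for _ in range(n)] for _ in range(pop_size - 1)
    ]
    for _ in range(generations):
        population.sort(key=fitness, reverse=True)
        parents = population[: pop_size // 2]  # elitism: the best half survives
        children = []
        while len(parents) + len(children) < pop_size:
            a, b = rng.sample(parents, 2)
            cut = rng.randint(1, n - 1) if n > 1 else 0  # one-point crossover
            child = a[:cut] + b[cut:]
            child = [bit ^ 1 if rng.random() < p_mut else bit for bit in child]
            children.append(child)
        population = parents + children
    return decode(max(population, key=fitness))


# Hypothetical toy score: reward three "useful" features, penalise subset size.
USEFUL = {"gradient", "laplacian", "lbp"}


def toy_score(subset):
    chosen = set(subset)
    return len(chosen & USEFUL) - 0.1 * len(chosen)


all_features = ["gradient", "laplacian", "lbp", "noise_a", "noise_b"]
greedy = greedy_forward(all_features, toy_score)
refined = genetic_refine(greedy, toy_score)
print(sorted(refined))
```

Because the genetic stage keeps the best half of each generation and starts from the full stage-1 subset, the refined subset in this sketch never scores worse than the greedy result. Restricting the genetic search to the greedy candidates keeps the search space small, which is one plausible motivation for a cascade of this kind.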
