A New Dataset Size Reduction Approach for PCA-Based Classification in OCR Application

A major problem for pattern recognition systems is the large volume of training data, which often contains duplicate and near-duplicate samples. To address this problem, several dataset size reduction and dimensionality reduction techniques have been introduced. The algorithms currently used for dataset size reduction usually remove samples near the class centers, or support vector samples lying between different classes. However, samples near a class center carry valuable information about the characteristics of the class, and support vectors are important for evaluating system efficiency. This paper reports on the use of the Modified Frequency Diagram technique for dataset size reduction. In the proposed technique, a training dataset is rearranged and then sieved. The sieved training dataset, together with automatic feature extraction/selection using Principal Component Analysis (PCA), is used in an optical character recognition (OCR) application. Experimental results on Hoda, one of the largest standard OCR datasets of handwritten Farsi/Arabic numerals, show a recognition accuracy of about 97%. When a sieved version of the dataset, only half the size of the initial training dataset, was used, recognition speed increased by a factor of 2.28 while accuracy decreased by only 0.7%.
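The overall pipeline (reduce the training set, extract features with PCA, then classify) can be illustrated with a minimal sketch. The abstract does not specify how the Modified Frequency Diagram rearranges and sieves the data, so the function `sieve_by_rank` below is a hypothetical stand-in, not the authors' method: it orders each class by distance to the class mean and keeps evenly spaced samples, so that both central and boundary samples survive. The nearest-neighbor classifier, the function names, and the parameters `keep_ratio` and `n_components` are likewise assumptions made for illustration.

```python
import numpy as np


def sieve_by_rank(X, y, keep_ratio=0.5):
    """Hypothetical sieve (NOT the paper's Modified Frequency Diagram).

    Within each class, samples are ordered by distance to the class mean
    and an evenly spaced subset is kept, so samples near the center and
    samples near the class boundary are both retained.
    """
    keep = []
    for label in np.unique(y):
        idx = np.flatnonzero(y == label)
        dist = np.linalg.norm(X[idx] - X[idx].mean(axis=0), axis=1)
        ordered = idx[np.argsort(dist)]
        n_keep = max(1, int(round(len(ordered) * keep_ratio)))
        # Evenly spaced, distinct positions in the ordered list.
        picks = (np.arange(n_keep) * len(ordered)) // n_keep
        keep.extend(ordered[picks].tolist())
    keep = np.array(sorted(keep))
    return X[keep], y[keep]


def fit_pca(X, n_components):
    """Return the training mean and the top principal directions (via SVD)."""
    mean = X.mean(axis=0)
    _, _, vt = np.linalg.svd(X - mean, full_matrices=False)
    return mean, vt[:n_components]


def project(X, mean, components):
    """Project samples onto the principal directions."""
    return (X - mean) @ components.T


def nearest_neighbor_predict(train_feats, train_labels, test_feats):
    """Assumed classifier: 1-nearest-neighbor in PCA space."""
    diff = test_feats[:, None, :] - train_feats[None, :, :]
    sq_dist = (diff ** 2).sum(axis=2)
    return train_labels[sq_dist.argmin(axis=1)]


def run_pipeline(X_train, y_train, X_test, keep_ratio=0.5, n_components=4):
    """Sieve the training set, fit PCA on the sieved set, classify X_test."""
    X_s, y_s = sieve_by_rank(X_train, y_train, keep_ratio)
    mean, comps = fit_pca(X_s, n_components)
    preds = nearest_neighbor_predict(
        project(X_s, mean, comps), y_s, project(X_test, mean, comps)
    )
    return preds, len(X_s)
```

Because a nearest-neighbor classifier compares each test sample against every retained training sample, halving the training set roughly halves the comparison work in this sketch. That is only a loose analogy to the speed-up reported above, which depends on the authors' actual classifier and sieve.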
