An improved fast edit approach for two-string approximated mean computation applied to OCR

This paper presents a new fast algorithm for computing an approximation to the mean of two character strings that represent a 2D shape, and applies it to a new Wilson-based editing procedure. The approximate mean is built from symbols taken from the two original strings. A greedy variant of the algorithm is also studied, which further reduces the time required to compute the approximate mean. The new dataset editing scheme relaxes the deletion criterion of the Wilson editing procedure: not every instance misclassified by its nearest neighbors is pruned. Instead, an artificial instance is added to the dataset in the hope that the original instance will be classified correctly in the future. The artificial instance is the approximate mean of the misclassified sample and its nearest neighbor of the same class. Experiments on three widely known contour databases show that the proposed algorithm performs very well when computing the mean of two strings and outperforms methods proposed by other authors. In particular, the low computational cost of the heuristic approach makes it well suited to long strings. The results also show that the proposed preprocessing scheme reduces the classification error in about 83% of trials. There is empirical evidence that using the greedy approximation to compute the approximate mean does not affect the performance of the editing procedure.
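The relaxed editing idea described above can be illustrated with a minimal Python sketch. It is not the paper's algorithm: the abstract does not give the exact construction of the approximate mean or the rule for choosing between pruning and adding, so the sketch assumes unit-cost Levenshtein distance, builds the mean by applying half of the operations of one optimal edit script, and prunes a misclassified instance only when it has no same-class neighbor. The names `edit_script`, `approx_mean`, and `relaxed_wilson_edit` are hypothetical.

```python
from collections import Counter


def edit_script(a, b):
    """Unit-cost Levenshtein distance plus one optimal edit script from a to b."""
    n, m = len(a), len(b)
    d = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        d[i][0] = i
    for j in range(m + 1):
        d[0][j] = j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d[i][j] = min(d[i - 1][j] + 1,                        # deletion
                          d[i][j - 1] + 1,                        # insertion
                          d[i - 1][j - 1] + (a[i - 1] != b[j - 1]))  # match/substitution
    # Backtrace from the bottom-right corner to recover the operations.
    ops, i, j = [], n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0 and d[i][j] == d[i - 1][j - 1] + (a[i - 1] != b[j - 1]):
            kind = "keep" if a[i - 1] == b[j - 1] else "sub"
            ops.append((kind, a[i - 1], b[j - 1]))
            i, j = i - 1, j - 1
        elif i > 0 and d[i][j] == d[i - 1][j] + 1:
            ops.append(("del", a[i - 1], None))
            i -= 1
        else:
            ops.append(("ins", None, b[j - 1]))
            j -= 1
    ops.reverse()
    return d[n][m], ops


def approx_mean(a, b):
    """Assumed construction: apply the first half of the edit operations to a."""
    dist, ops = edit_script(a, b)
    budget = dist // 2
    out = []
    for kind, x, y in ops:
        if kind == "keep":
            out.append(x)
        elif budget > 0:          # operation applied: follow string b
            budget -= 1
            if kind != "del":
                out.append(y)
        elif kind != "ins":       # operation not applied: stay with string a
            out.append(x)
    return "".join(out)


def relaxed_wilson_edit(samples, k=1):
    """samples: list of (string, label). Returns the edited list (assumed rule)."""
    edited = []
    for i, (s, label) in enumerate(samples):
        others = [(edit_script(s, t)[0], t, l)
                  for j, (t, l) in enumerate(samples) if j != i]
        others.sort(key=lambda r: r[0])
        predicted = Counter(l for _, _, l in others[:k]).most_common(1)[0][0]
        if predicted == label:
            edited.append((s, label))
            continue
        same_class = [t for _, t, l in others if l == label]
        if same_class:
            # Keep the instance and add an artificial one between it and
            # its nearest same-class neighbor, instead of pruning it.
            edited.append((s, label))
            edited.append((approx_mean(s, same_class[0]), label))
        # Assumption: with no same-class neighbor the instance is pruned.
    return edited
```

With this construction, the mean of strings `a` and `b` at distance `D` lies at distance `D // 2` from `a` and `D - D // 2` from `b`, which follows from the triangle inequality. The dynamic-programming step costs time proportional to the product of the string lengths, which is the cost a greedy variant would aim to avoid on long contours.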
