Towards Versatile Document Analysis Systems

The research goal of highly versatile document analysis systems, capable of performing useful functions on the great majority of document images, seems to be receding, even in the face of decades of research. One family of nearly universally applicable capabilities includes document image content extraction tools able to locate regions containing handwriting, machine-print text, graphics, line-art, logos, photographs, noise, etc. Solving this problem in its full generality requires coping with a vast diversity of document and image types. The severity of the methodological problems is suggested by the lack of agreement within the R&D community on even what is meant by a representative set of samples in this context. Even when this is agreed, it is often not clear how sufficiently large sets for training and testing can be collected and ground-truthed. Perhaps this can be alleviated by discovering a principled way to amplify sample sets using synthetic variations. We will then need classification methodologies capable of learning automatically from these huge sample sets in spite of their poorly parameterized (or unparameterizable) distributions. Perhaps fast expected-time approximate k-nearest neighbors classifiers are a good solution, even if they tend to require enormous data structures: hashed k-d trees seem promising. We discuss these issues and report recent progress towards their resolution.

Keywords: versatile document analysis systems, DAS methodology, document image content extraction, classification, k-nearest neighbors, k-d trees, CART, spatial data structures, computational geometry, hashing
