Safely selecting subsets of training data

Highly versatile classifiers for document analysis systems require representative training sets, which can be dauntingly large and often strain conventional trainable classifier technologies. We propose selecting a small subset of the training data, matched to each particular test set, with the aim of improving speed without loss of accuracy. Because selection must occur on-line, we cannot use classifiers that require off-line training. Fortunately, nearest-neighbor classifiers support on-line training; we use a fast approximate kNN technique based on hashed k-D trees. The distribution of samples across k-D bins can be used to measure the similarity between any two document images: for any given test image, we select the three most similar training images. In experiments on a document image content extraction system, our algorithm pruned 118 training images to three, yielding a speedup by a factor of 17 with no loss of accuracy. Further experiments with an oracle and with manual selection suggest that it may be possible to improve accuracy as well.
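
The selection step can be illustrated with a minimal Python sketch. It is not the implementation described above: a uniform grid cell stands in for a hashed k-D tree bin, histogram intersection is assumed as the similarity measure (the abstract does not name one), and the names `bin_key`, `bin_histogram`, `similarity`, `select_training_images`, and the `cell` parameter are all hypothetical.

```python
from collections import Counter
import math


def bin_key(sample, cell=0.25):
    """Map a feature vector to a grid cell.

    Assumption: a uniform grid is used here as a simple stand-in
    for the bins of a hashed k-D tree.
    """
    return tuple(math.floor(x / cell) for x in sample)


def bin_histogram(samples, cell=0.25):
    """Return the normalized distribution of an image's samples over bins."""
    counts = Counter(bin_key(s, cell) for s in samples)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {key: n / total for key, n in counts.items()}


def similarity(hist_a, hist_b):
    """Histogram intersection, in [0, 1] (an assumed similarity measure)."""
    return sum(min(p, hist_b.get(key, 0.0)) for key, p in hist_a.items())


def select_training_images(test_samples, training, n=3, cell=0.25):
    """Return the names of the n training images most similar to the test image.

    `training` maps an image name to its list of feature vectors.
    Ties are broken by name so the result is deterministic.
    """
    test_hist = bin_histogram(test_samples, cell)
    scored = [
        (similarity(test_hist, bin_histogram(samples, cell)), name)
        for name, samples in training.items()
    ]
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [name for _, name in scored[:n]]
```

Only the selected images would then be loaded into the nearest-neighbor classifier for that test image. Because each histogram is built in one pass over an image's samples, the selection itself needs no off-line training.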
