Using Lucene to index and search the digitized 1940 US Census

An improved approach to searching large digitized document archives is described, in which Lucene indices are incorporated into a framework developed to provide automatic searchable access to the 1940 US Census, a collection of digitized handwritten forms. Rather than attempting to recognize the handwritten text in the images, the framework uses Word Spotting feature vectors to describe the content of each cell. Rather than being handled as regular ASCII text, each query is rendered as an image, and a ranked list of matching results is presented to the user. Among the preprocessing steps the framework requires, an index must be compiled to provide fast access to the feature vectors. The advantages and drawbacks of using Lucene to index these vectors, compared with other indexing methods, are discussed in light of the challenges posed by digitized document collections of considerable size. Copyright © 2014 John Wiley & Sons, Ltd.
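
The query-by-image idea summarized above can be illustrated with a minimal sketch. It is not the framework's implementation: the abstract does not specify the feature layout or the distance measure, so the three-dimensional vectors, the Euclidean distance, the cell identifiers, and the function names `euclidean` and `rank_cells` are all assumptions made for illustration. The sketch ranks cells by a brute-force scan, which is the linear-time baseline that a compiled index is meant to avoid on a large collection.

```python
import math
from typing import Dict, List, Sequence, Tuple


def euclidean(a: Sequence[float], b: Sequence[float]) -> float:
    """Euclidean distance between two equal-length feature vectors."""
    if len(a) != len(b):
        raise ValueError("feature vectors must have the same length")
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def rank_cells(query: Sequence[float],
               cells: Dict[str, Sequence[float]],
               top_k: int = 3) -> List[Tuple[str, float]]:
    """Return the top_k (cell id, distance) pairs, closest first."""
    scored = [(cell_id, euclidean(query, vec)) for cell_id, vec in cells.items()]
    # Sort by distance, breaking ties by cell id for a stable ranking.
    scored.sort(key=lambda pair: (pair[1], pair[0]))
    return scored[:top_k]


# Hypothetical feature vectors for four form cells (made-up values).
cells = {
    "sheet1/row3/name": [0.9, 0.1, 0.4],
    "sheet1/row4/name": [0.2, 0.8, 0.5],
    "sheet2/row1/name": [0.85, 0.15, 0.35],
    "sheet2/row7/name": [0.1, 0.1, 0.9],
}

# Hypothetical vector extracted from the rendered query image.
query_vector = [0.88, 0.12, 0.38]

ranking = rank_cells(query_vector, cells, top_k=2)
for cell_id, distance in ranking:
    print(f"{cell_id}\t{distance:.4f}")
```

Every query in this sketch touches every stored vector, so its cost grows linearly with the number of cells. That cost is the motivation for compiling an index over the feature vectors beforehand.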

[1] Luigi Marini, et al. Using Lucene to index and search the digitized 1940 US Census, 2013, Concurr. Comput. Pract. Exp.
