Document image dataset indexing and compression using connected components clustering

We present a method for indexing and compressing document image datasets by clustering connected components. The method extracts connected components from each image in the dataset and clusters them to build a hash table that serves as a compressed index of the dataset. Clustering is based on component similarity, which is estimated by comparing shape features extracted from the components. The hash table is then saved to a text file, which is further compressed with any available compression method. Component encoding in the hash table is storage efficient: each component is stored as its contour points plus a reduced number of interior points that are sufficient to reconstruct it. We evaluate the method's indexing and compression performance on four document image datasets. Experimental results show that the index significantly improves efficiency when used for document image retrieval. In addition, a comparative evaluation against two compression standards, the ZIP and XZ formats, shows competitive performance. Our compression rates are below 20%, and the compression errors are very low, on the order of 10^-6 % per image.
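The pipeline described above (component extraction, clustering into a hash table, text serialization, general-purpose compression) can be illustrated with a minimal sketch. It is not the authors' implementation and makes several simplifying assumptions: images are binary lists of lists, two components are clustered together only when their pixel sets match exactly (a stand-in for the shape-feature similarity), every pixel is stored instead of contour points plus a reduced set of interior points, the text format is JSON, and XZ compression comes from Python's `lzma` module. All function names are hypothetical.

```python
import json
import lzma
from collections import deque


def connected_components(image):
    """Return the 8-connected foreground components as lists of (row, col)."""
    rows, cols = len(image), len(image[0])
    seen = [[False] * cols for _ in range(rows)]
    components = []
    for r in range(rows):
        for c in range(cols):
            if image[r][c] == 0 or seen[r][c]:
                continue
            queue, pixels = deque([(r, c)]), []
            seen[r][c] = True
            while queue:
                y, x = queue.popleft()
                pixels.append((y, x))
                for dy in (-1, 0, 1):
                    for dx in (-1, 0, 1):
                        ny, nx = y + dy, x + dx
                        if (0 <= ny < rows and 0 <= nx < cols
                                and image[ny][nx] and not seen[ny][nx]):
                            seen[ny][nx] = True
                            queue.append((ny, nx))
            components.append(pixels)
    return components


def build_index(images):
    """Hash table: shape -> list of (image_id, top, left) occurrences."""
    table = {}
    for image_id, image in enumerate(images):
        for pixels in connected_components(image):
            top = min(y for y, _ in pixels)
            left = min(x for _, x in pixels)
            # Assumption: exact pixel match replaces shape-feature similarity.
            shape = tuple(sorted((y - top, x - left) for y, x in pixels))
            table.setdefault(shape, []).append((image_id, top, left))
    return table


def serialize(table):
    """Write the table as JSON text, then compress the text with XZ."""
    entries = [{"shape": shape, "at": places} for shape, places in table.items()]
    text = json.dumps(entries, separators=(",", ":"))
    return lzma.compress(text.encode("utf-8"))


def reconstruct(blob, image_id, rows, cols):
    """Rebuild one image by stamping each stored shape at its occurrences."""
    entries = json.loads(lzma.decompress(blob).decode("utf-8"))
    image = [[0] * cols for _ in range(rows)]
    for entry in entries:
        for img, top, left in entry["at"]:
            if img == image_id:
                for dy, dx in entry["shape"]:
                    image[top + dy][left + dx] = 1
    return image


# Toy page: two identical L-shaped components and one single-pixel component.
page = [
    [1, 1, 0, 0, 1, 1],
    [1, 0, 0, 0, 1, 0],
    [0, 0, 0, 0, 0, 0],
    [0, 0, 1, 0, 0, 0],
]
index = build_index([page])
blob = serialize(index)
restored = reconstruct(blob, 0, 4, 6)
```

In this toy case the two L-shaped components share one hash-table entry, so the shape is stored once and only its positions are repeated. The same table doubles as an index, since looking up a shape returns every location where it occurs. Exact matching makes the sketch lossless, whereas similarity-based clustering as described in the abstract trades a small reconstruction error for fewer stored shapes.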
