Automatically detecting and classifying noises in document images

Noise removal in document images usually follows one of two approaches. In the first, a human classifies the noise present in an image and chooses a filter accordingly. In the second, a batch of filters is applied blindly to the image. The first approach, although widely used, may itself introduce noise during filtering when the noise is misclassified or the filter parameters are unsuitable. This paper presents a new paradigm for document image filtering. It aims at a more accurate and computationally efficient document cleanup by pre-characterizing the noise present in the document, based on a set of human-labeled training samples. The project currently focuses on pre-characterizing the following types of noise: back-to-front interference (bleed-through), skew and orientation, blur, and framing.
