Segmentation and classification of digitized document images (original title: Segmentation et classification dans les images de documents numérisés)

The work in this thesis was carried out in the context of analyzing and processing images of printed documents in order to automate the creation of press reviews. The images output by the scanner are processed without any a priori information or human intervention. To characterize them, we present a system for analyzing composite color documents that segments each image into colorimetrically homogeneous zones and adapts the text extraction algorithms to the local characteristics of each zone. The colorimetric and textual information provided by this system feeds a method for the physical segmentation of digitized press pages. The blocks resulting from this decomposition are then classified, which makes it possible, among other things, to detect advertising zones. Continuing and extending the classification work of the first part, we present a new classification and ranking engine that is generic, fast, and easy to use. This approach differs from the great majority of existing methods, which rely on a priori knowledge of the data and depend on abstract parameters that are difficult for the user to determine. From colorimetric characterization to article tracking, by way of advertisement detection, all of the approaches presented have been combined to build an application for content-based classification of digitized press documents.
