The roles of document structure in document image retrieval and classification

Title of Dissertation: The Roles of Document Structure in Document Image Retrieval and Classification

Christian Kwang-Un Shin, Doctor of Philosophy, 2000

Dissertation directed by: Professor Azriel Rosenfeld, Department of Computer Science

Dissertation submitted to the Faculty of the Graduate School of The University of Maryland, College Park in partial fulfillment of the requirements for the degree of Doctor of Philosophy, 2000

Advisory Committee: Professor Azriel Rosenfeld, Chairman/Advisor; Doctor David S. Doermann; Professor Larry S. Davis; Doctor Daniel DeMenthon; Professor Kyu Yong Choi, Dean’s Representative

© Copyright by Christian Kwang-Un Shin, 2000

Current document management and database systems provide text search and retrieval capabilities, but generally lack the ability to utilize the documents’ logical and physical structures. This dissertation defines a general framework for describing the physical and logical structure of documents, and describes a general system for document image retrieval that is able to make use of document structure. It discusses the use of structural similarity for retrieval; it defines a measure of structural similarity between document images based on content area overlap, and also compares similarity ratings based on this measure with human relevance judgments. Finally, it investigates document type classification using features related to physical layout structure, and using both decision-tree and self-organizing map classifiers; in these experiments too, ground truth was provided by human judgments.
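The abstract describes a structural similarity measure based on content area overlap but does not give its definition here. The following is only a minimal sketch of one way such a measure could be computed, not the dissertation's own formula. It assumes each page layout is a list of axis-aligned content boxes in page-normalized coordinates and scores two layouts by the ratio of shared content area to total content area on a coarse grid. The names `Box`, `content_mask`, and `overlap_similarity`, the grid resolution, and the intersection-over-union scoring are all illustrative assumptions.

```python
from typing import List, Tuple

# Hypothetical representation: (x0, y0, x1, y1) with coordinates in [0, 1],
# i.e. normalized by page width and height.
Box = Tuple[float, float, float, float]


def content_mask(boxes: List[Box], grid: int = 100) -> List[List[bool]]:
    """Mark every grid cell whose center falls inside at least one box."""
    mask = [[False] * grid for _ in range(grid)]
    for x0, y0, x1, y1 in boxes:
        for row in range(grid):
            cy = (row + 0.5) / grid
            if not (y0 <= cy < y1):
                continue
            for col in range(grid):
                cx = (col + 0.5) / grid
                if x0 <= cx < x1:
                    mask[row][col] = True
    return mask


def overlap_similarity(a: List[Box], b: List[Box], grid: int = 100) -> float:
    """Shared content area divided by combined content area (assumed scoring)."""
    mask_a, mask_b = content_mask(a, grid), content_mask(b, grid)
    shared = combined = 0
    for row_a, row_b in zip(mask_a, mask_b):
        for cell_a, cell_b in zip(row_a, row_b):
            shared += cell_a and cell_b
            combined += cell_a or cell_b
    # Two empty pages are treated as identical; this is a convention choice.
    return shared / combined if combined else 1.0


if __name__ == "__main__":
    left_column = [(0.0, 0.0, 0.5, 1.0)]
    full_page = [(0.0, 0.0, 1.0, 1.0)]
    print(overlap_similarity(left_column, full_page))
```

Rasterizing onto a grid avoids exact polygon-union arithmetic when boxes within one page overlap each other, at the cost of resolution set by `grid`. A measure like this ignores content type (text, graphics, tables); whether and how the dissertation weights by type is not stated in the abstract.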
