Document page retrieval based on geometric layout features

Today, keyword retrieval is the most standard and popular retrieval method and is widely used in many applications. However, keyword retrieval cannot always satisfy every type of information search, because a multimedia society must manage many kinds of information resources, such as image data and graphics data, in addition to word-dependent information. Methods suited to the characteristics of the data resources, such as structure, design, application, usage, and volume, are therefore necessary for users' information-based activities to succeed. In this paper, we address a document page retrieval method based on the characteristics of page layout structure. Although keyword retrieval is an excellent means of document page retrieval, we must pay attention to cases in which keywords are not necessarily effective: it is not easy for foreigners to use keywords in a different language, and it is difficult for children to recall words they do not know. The main idea of our method is to focus on the geometric/positional relationships between characteristic components when identifying document pages. Moreover, our original viewpoint is to introduce the inverted index, commonly used in conventional information retrieval systems, rather than to make use of the structural/spatial relationships between characteristic components that are standard in traditional map retrieval systems.
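As a rough illustration of how an inverted index can be applied to layout features rather than words, consider the following minimal sketch. It is not the method of the paper: the abstract does not specify how components are encoded, so the grid quantization, the `(kind, x, y)` component format, the vote-count ranking, and all function names here are assumptions made only for illustration.

```python
from collections import defaultdict

GRID = 4  # assumption: the page is quantized into GRID x GRID cells


def layout_terms(components, grid=GRID):
    """Turn components (kind, x, y), with x and y in [0, 1), into discrete terms.

    Each term (kind, row, col) plays the role a keyword plays in a text index.
    """
    terms = set()
    for kind, x, y in components:
        col = min(int(x * grid), grid - 1)
        row = min(int(y * grid), grid - 1)
        terms.add((kind, row, col))
    return terms


def build_index(pages):
    """Build the inverted index: layout term -> set of page ids containing it."""
    index = defaultdict(set)
    for page_id, components in pages.items():
        for term in layout_terms(components):
            index[term].add(page_id)
    return index


def search(index, query_components):
    """Rank pages by how many query layout terms they share (simple voting)."""
    votes = defaultdict(int)
    for term in layout_terms(query_components):
        for page_id in index.get(term, ()):
            votes[page_id] += 1
    # Highest vote count first; ties broken by page id for determinism.
    return sorted(votes.items(), key=lambda kv: (-kv[1], kv[0]))


# Hypothetical pages: each component is (kind, x, y) in normalized coordinates.
pages = {
    "p1": [("figure", 0.1, 0.1), ("table", 0.6, 0.7)],
    "p2": [("figure", 0.6, 0.1), ("table", 0.1, 0.7)],
    "p3": [("figure", 0.15, 0.2), ("text", 0.5, 0.5)],
}
index = build_index(pages)

# Query: "a figure near the top left and a table toward the bottom right".
results = search(index, [("figure", 0.12, 0.15), ("table", 0.65, 0.72)])
print(results)  # [('p1', 2), ('p3', 1)]
```

In this sketch a user can describe a remembered page by the rough positions of its components, without recalling any keyword, and lookup cost depends on the posting lists touched rather than on comparing the query against every stored page.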
