Effectively Searching Maps in Web Documents

Maps are an important source of information in archaeology and other sciences. Users search for historical maps to trace the recorded political geography of a region across different eras, to find out exactly where archaeological artifacts were discovered, and so on. Currently, they have to use a generic search engine and add the term "map" to their other keywords. This crude method generates a significant number of false positives that the user must sift through to reach the desired results. To reduce this manual effort, we propose an automatic map identification, indexing, and retrieval system that lets users search and retrieve maps appearing in a large corpus of digital documents using simple keyword queries. We identify features that help distinguish maps from other figures in digital documents and show how a Support-Vector-Machine-based classifier can be used to identify maps. We propose map-level metadata (e.g., captions and references to the maps in the text) and document-level metadata (e.g., title, abstract, citations, and how recent the publication is), and show how both can be automatically extracted and indexed. Our novel ranking algorithm weights the metadata fields differently and also uses the document-level metadata to help rank retrieved maps. Empirical evaluations show which features should be selected and which metadata fields should be weighted more heavily. We also demonstrate improved retrieval results in comparison to adaptations of existing methods for map retrieval. Our map search engine has been deployed in an online map-search system that is part of the Blind-Review digital library system.
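To make the idea of weighting metadata fields differently more concrete, the following is a minimal sketch of field-weighted keyword ranking. It is not the ranking algorithm of the paper: the field names, the weight values, and the length-normalized term-frequency score are all assumptions chosen for illustration, and the abstract does not specify any of them.

```python
from collections import Counter

# Hypothetical field weights (assumed for illustration only); map-level
# fields are weighted above document-level fields in this sketch.
FIELD_WEIGHTS = {
    "caption": 3.0,         # map-level metadata
    "reference_text": 2.0,  # map-level metadata
    "title": 1.5,           # document-level metadata
    "abstract": 1.0,        # document-level metadata
}


def tokenize(text):
    """Lowercase and split on whitespace (a deliberately simple tokenizer)."""
    return text.lower().split()


def field_score(query_terms, text):
    """Length-normalized count of query-term occurrences in one field."""
    tokens = tokenize(text)
    if not tokens:
        return 0.0
    counts = Counter(tokens)
    return sum(counts[term] for term in query_terms) / len(tokens)


def score_map(query, record, weights=FIELD_WEIGHTS):
    """Weighted sum of per-field scores for one map record."""
    terms = tokenize(query)
    return sum(
        weight * field_score(terms, record.get(field, ""))
        for field, weight in weights.items()
    )


def rank_maps(query, records, weights=FIELD_WEIGHTS):
    """Return the records sorted by descending weighted score."""
    return sorted(records, key=lambda r: score_map(query, r, weights), reverse=True)
```

Under these assumed weights, a map whose caption matches the query ranks above a map whose enclosing document only mentions the query terms in its abstract. In practice the weights would be tuned empirically, as the evaluation described above suggests.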
