Design and Implementation of a Geographic Search Engine

In this paper, we describe the design and initial implementation of a geographic search engine prototype for Germany, based on a large crawl of the .de domain. Geographic search engines provide a flexible interface to the Web that allows users to constrain and order search results in an intuitive manner, by focusing a query on a particular geographic region. Geographic search technology has recently received significant commercial interest, but there has been only a limited amount of academic work. Our prototype performs massive extraction of geographic features from crawled data, which are then mapped to coordinates and aggregated across link and site structure. This assigns to each web page a set of relevant locations, called the geographic footprint of the page. The resulting footprint data is then integrated into a high-performance query processor on a cluster-based architecture. We discuss the various techniques, both new and existing, that are used for recognizing, matching, mapping, and aggregating geographic features, and describe how to integrate geographic query processing into a standard search architecture and interface.
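
The pipeline summarized above (extract geographic terms, map them to coordinates, aggregate across links, then use the footprint at query time) can be illustrated with a minimal sketch. Everything concrete here is an assumption made for illustration and is not taken from the paper: the tiny `GAZETTEER` with approximate coordinates, the grid resolution `CELL_DEG`, the `damping` factor, and the representation of a footprint as a dictionary from grid cell to weight.

```python
from collections import defaultdict

# Hypothetical gazetteer: place name -> approximate (lat, lon).
GAZETTEER = {
    "berlin": (52.52, 13.40),
    "hamburg": (53.55, 9.99),
    "munich": (48.14, 11.58),
}

CELL_DEG = 0.5  # assumed grid resolution in degrees


def to_cell(lat, lon):
    """Map a coordinate to the grid cell that contains it."""
    return (int(lat // CELL_DEG), int(lon // CELL_DEG))


def extract_footprint(text):
    """Build a page footprint: grid cell -> weight, one unit per matched term."""
    footprint = defaultdict(float)
    for token in text.lower().split():
        coords = GAZETTEER.get(token.strip(".,;:!?"))
        if coords is not None:
            footprint[to_cell(*coords)] += 1.0
    return dict(footprint)


def aggregate(own, neighbours, damping=0.5):
    """Add damped footprints of linked or same-site pages to a page's own."""
    merged = defaultdict(float, own)
    for fp in neighbours:
        for cell, weight in fp.items():
            merged[cell] += damping * weight
    return dict(merged)


def geo_score(footprint, query_cells):
    """Sum the footprint weight that falls inside the query region."""
    return sum(w for cell, w in footprint.items() if cell in query_cells)


page = extract_footprint("Hotels in Berlin, near Berlin center")
linked = extract_footprint("Hamburg harbour tours")
combined = aggregate(page, [linked])
score = geo_score(combined, {to_cell(52.52, 13.40)})
```

In a full system this geographic score would be combined with a conventional text relevance score; how the prototype does that is not stated in the abstract, so the sketch stops at the geographic component.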
