Towards a Search System for the Web Exploiting Spatial Data of a Web Document

In this paper, we describe our work in progress in the scope of information retrieval exploiting the spatial data extracted from web documents. We discuss problems of a search for web documents by geographic distance, where the geographic distance of a document is determined automatically using information extraction methods. We present here our approach of building a distributed search system, which deals with several problems of this area. Search by geographic distance is useful, for example if we are looking for the nearest restaurant, hotel or any other business near our location (reference point). Almost every company today presents its business on the Internet sharing business information along with contact information. There can be miscellaneous geographic information extracted from the contact information (but no only from it) and used to compute geographic distance of a document. Under a document's geographic distance, we understand the distance between a search reference point and a geographic location related to the document. In our approach, we chose postal addresses and GPS coordinates for spatial data extraction. The reference point can be dynamically changed and one document can be related to more than one geographic location. Geographic locations are automatically discovered in document's textual content. Document is then indexed by all its known geographic locations, so later when searching, the document can be found near different geographic locations to which it is related. Domain of the search is automatically built by crawling through linked web documents.