Evaluation on geospatial information extraction and retrieval: Mining thematic maps from web source

The World Wide Web has easily become the largest repository of natural language text data. We are particularly interested in state-of-the-art methods for exploiting geospatial information on the web. The survey covers extraction methods, retrieval, visualization, and further possible mining or knowledge discovery scenarios, with the aim of producing thematic maps automatically from the web corpus. We found that web-based Geographic Information Retrieval (GIR) methods that return selected relevant areas instead of points are still lacking, even though area modeling is common in GIS. We also found that most GIR methods still focus on places and buildings rather than on themes or information about an area. This indicates that state-of-the-art GIR methods are not yet sufficient for the thematic extraction and retrieval needed to generate thematic maps from a web corpus. Bayesian topic models such as Latent Dirichlet Allocation may serve as a good basis for such use cases.
