Discovery of Environmental Nodes in the Web

The analysis and processing of environmental information is widely considered to be of great importance to humanity. This article addresses the problem of discovering web resources that provide environmental measurements. To solve this domain-specific search problem, we combine state-of-the-art search techniques with advanced text processing and supervised machine learning. Specifically, we generate domain-specific queries from empirical information and apply machine-learning-driven query expansion to enrich the initial queries with domain-specific terms. Multiple variations of these queries are submitted to a general-purpose web search engine to achieve high recall, and a post-processing module based on supervised machine learning improves the precision of the final results. In this work, we focus on the discovery of weather forecast websites and evaluate our technique by discovering weather nodes for southern Finland.
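The query-variation step described above can be illustrated with a minimal sketch. It assumes that variations are formed by combining base queries with optional expansion terms and place names, and that the result lists returned for each query are merged by union to favour recall. The function names, the example terms, and the place names are hypothetical and are not taken from the article. The call to the search engine and the supervised post-processing classifier are not shown.

```python
from itertools import product


def build_queries(base_terms, expansion_terms, locations):
    """Combine every base term with every optional expansion term and location.

    The empty string stands for "no expansion term", so the unexpanded
    query is always among the variations.
    """
    queries = []
    options = [""] + list(expansion_terms)
    for base, extra, place in product(base_terms, options, locations):
        # Drop the empty slot so queries contain no double spaces.
        queries.append(" ".join(part for part in (base, extra, place) if part))
    return queries


def merge_results(result_lists):
    """Union the URLs returned for all queries, keeping first-seen order."""
    seen = set()
    merged = []
    for results in result_lists:
        for url in results:
            if url not in seen:
                seen.add(url)
                merged.append(url)
    return merged


# Illustrative inputs only; a real system would learn the expansion terms.
queries = build_queries(
    base_terms=["weather forecast"],
    expansion_terms=["temperature", "wind"],
    locations=["Helsinki", "Espoo"],
)
```

Submitting every variation and taking the union of the results trades precision for recall, which is why a filtering step such as the supervised classifier mentioned in the abstract is needed afterwards.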
