Focussed crawling of environmental web resources: A pilot study on the combination of multimedia evidence

This work investigates the use of focussed crawling techniques for the discovery of environmental multimedia Web resources that provide air quality measurements and forecasts. Focussed crawlers automatically navigate the hyperlinked structure of the Web and select the hyperlinks to follow by estimating their relevance to a given topic, based on evidence obtained from the already downloaded pages. Given that air quality measurements and particularly air quality forecasts are presented not only in textual form, but are most commonly encoded as multimedia, mainly in the form of heatmaps, we propose the combination of textual and visual evidence for predicting the benefit of fetching an unvisited Web resource. First, text classification is applied to select the relevant hyperlinks based on their anchor text, a surrounding text window, and URL terms. Further hyperlinks are selected by combining their text classification score with an image classification score that indicates the presence of heatmaps in their source page. A pilot evaluation indicates that the combination of textual and visual evidence results in improvements in the crawling precision over the use of textual features alone.
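
The two-stage link selection described above can be illustrated with a minimal sketch. It is not the implementation evaluated in this work: the `Link` class, the `select_link` function, the three threshold values, and the example URLs are all hypothetical, and the two scores are assumed to have been computed beforehand by a text classifier and a heatmap classifier that each output a confidence in [0, 1]. The rule used to combine the scores (accepting a borderline text score when the source page appears to contain a heatmap) is one plausible reading of the combination, chosen for illustration only.

```python
from dataclasses import dataclass


@dataclass
class Link:
    url: str
    text_score: float     # assumed text classifier confidence (anchor text, text window, URL terms)
    heatmap_score: float  # assumed image classifier confidence that the source page shows a heatmap


def select_link(link: Link, t_high: float = 0.5, t_low: float = 0.3,
                heatmap_min: float = 0.5) -> bool:
    """Decide whether a hyperlink should be followed.

    All thresholds are illustrative assumptions, not values from the study.
    """
    # Stage 1: textual evidence alone is strong enough.
    if link.text_score >= t_high:
        return True
    # Stage 2: borderline textual evidence is accepted only when
    # the source page is likely to contain a heatmap.
    return link.text_score >= t_low and link.heatmap_score >= heatmap_min


# Hypothetical candidate links extracted from already downloaded pages.
links = [
    Link("http://example.org/air-quality-forecast", 0.8, 0.0),
    Link("http://example.org/maps/ozone", 0.4, 0.9),
    Link("http://example.org/contact", 0.4, 0.1),
    Link("http://example.org/gallery", 0.1, 0.9),
]

selected = [link.url for link in links if select_link(link)]
print(selected)
```

In this sketch the second link is followed only because of the visual evidence, while the fourth is rejected despite a high heatmap score because its textual evidence is too weak, which reflects the idea that visual evidence supplements rather than replaces the text classifier.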
