Focussed crawling of environmental Web resources based on the combination of multimedia evidence

Focussed crawlers enable the automatic discovery of Web resources about a given topic by navigating the Web link structure and selecting which hyperlinks to follow, estimating each hyperlink's relevance to the topic from evidence obtained from the already downloaded pages. This work proposes a classifier-guided focussed crawling approach that estimates the relevance of a hyperlink to an unvisited Web resource by combining two kinds of evidence: textual evidence representing its local context, namely the textual content appearing in its vicinity in the parent page, and visual evidence associated with its global context, namely the presence of images relevant to the topic within the parent page. The approach is applied to the discovery of environmental Web resources that provide air quality measurements and forecasts, since such measurements (and particularly the forecasts) are not only provided in textual form, but are also commonly encoded as multimedia, mainly in the form of heatmaps. Our evaluation experiments indicate that incorporating visual evidence in the link selection process of the focussed crawler is more effective than using textual features alone, particularly in conjunction with hyperlink exploration strategies that allow for the discovery of highly relevant pages that lie behind apparently irrelevant ones.
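
To make the link selection idea concrete, the following is a minimal sketch of a best-first crawl frontier that ranks hyperlinks by a combination of textual and visual evidence. It is not the implementation evaluated in this work: the weighted linear fusion, the weight `alpha`, the `Page` and `Link` types, and the in-memory `toy_web` are all assumptions introduced for illustration, and the classifier outputs are hard-coded numbers standing in for a real text classifier and a real heatmap detector.

```python
import heapq
from dataclasses import dataclass, field


@dataclass
class Link:
    target: str
    text_score: float  # assumed text-classifier confidence in [0, 1] for the link's local context


@dataclass
class Page:
    url: str
    has_relevant_image: bool  # assumed output of an image classifier on the parent page
    links: list = field(default_factory=list)


def link_score(text_score, parent_has_image, alpha=0.7):
    """Hypothetical fusion: weighted sum of textual and visual evidence."""
    visual_score = 1.0 if parent_has_image else 0.0
    return alpha * text_score + (1.0 - alpha) * visual_score


def crawl(web, seeds, budget, alpha=0.7):
    """Visit up to `budget` pages, always following the highest-scoring link."""
    # heapq is a min-heap, so scores are negated; seeds get the top score.
    frontier = [(-1.0, url) for url in seeds]
    heapq.heapify(frontier)
    seen = set(seeds)
    visited = []
    while frontier and len(visited) < budget:
        _, url = heapq.heappop(frontier)
        page = web.get(url)
        if page is None:  # page could not be fetched in this toy setting
            continue
        visited.append(url)
        for link in page.links:
            if link.target not in seen:
                seen.add(link.target)
                score = link_score(link.text_score, page.has_relevant_image, alpha)
                heapq.heappush(frontier, (-score, link.target))
    return visited


# Toy link graph: "hub_img" contains a topic-relevant image, "hub_text" does not.
toy_web = {
    "hub_text": Page("hub_text", False, [Link("x", 0.6)]),
    "hub_img": Page("hub_img", True, [Link("y", 0.5)]),
    "x": Page("x", False),
    "y": Page("y", False),
}

if __name__ == "__main__":
    print(crawl(toy_web, ["hub_text", "hub_img"], budget=4, alpha=0.7))
    print(crawl(toy_web, ["hub_text", "hub_img"], budget=4, alpha=1.0))
```

With `alpha=1.0` the crawler relies on textual evidence alone and visits `x` before `y`; with `alpha=0.7` the relevant image in the parent page lifts `y` above `x`. The sketch does not model exploration strategies that follow links through apparently irrelevant pages.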
