iCrawl: Improving the Freshness of Web Collections by Integrating Social Web and Focused Web Crawling

Researchers in the Digital Humanities and journalists need to monitor, collect, and analyze fresh online content about current events, such as the Ebola outbreak or the Ukraine crisis, on demand. However, existing focused crawling approaches consider only topical aspects and ignore temporal ones, so they cannot produce Web collections that are both thematically coherent and fresh. Social Media in particular provide a rich source of fresh content that state-of-the-art focused crawlers do not use. In this paper we address the problem of collecting fresh and relevant Web and Social Web content for a topic of interest by seamlessly integrating Web and Social Media in a novel focused crawler. The crawler collects Web and Social Media content in a single system and uses the stream of fresh Social Media content to guide the crawl.
