Bootstrapping Domain-Specific Content Discovery on the Web

The ability to continuously discover domain-specific content from the Web is critical for many applications. While focused crawling strategies have been shown to be effective for discovery, configuring a focused crawler is difficult and time-consuming. Given a domain of interest D, subject-matter experts (SMEs) must search for relevant websites and collect two sets of pages: representative Web pages that serve as training examples for a classifier that recognizes pages in D, and pages that seed the crawl. In this paper, we propose DISCO, an approach designed to bootstrap domain-specific search. Given a small set of websites, DISCO aims to discover a large collection of relevant websites. DISCO uses a ranking-based framework that mimics the way users search for information on the Web: it iteratively discovers new pages, distills them, and ranks them. It also applies multiple discovery strategies, including keyword-based and related queries issued to search engines, as well as backward and forward crawling. By systematically combining these strategies, DISCO is able to attain high harvest rates and coverage for a variety of domains. We perform extensive experiments in four social-good domains, using data gathered by SMEs in the respective domains, and show that our approach is effective and outperforms state-of-the-art methods.
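The iterative discover, distill, and rank loop described above can be illustrated with a minimal sketch. This is not DISCO's implementation: the in-memory `TOY_WEB`, the Jaccard keyword score, the use of deduplication as the distillation step, and all function names are assumptions chosen for illustration, and only forward and backward crawling are modeled (no search-engine queries).

```python
from typing import Callable, Dict, List, Set

# Hypothetical stand-in for the Web: site -> keywords and outgoing links.
TOY_WEB: Dict[str, dict] = {
    "seed.org": {"words": {"wildlife", "trafficking", "ivory"}, "links": ["a.org", "news.com"]},
    "a.org": {"words": {"wildlife", "ivory", "poaching"}, "links": ["b.org"]},
    "b.org": {"words": {"poaching", "wildlife", "rangers"}, "links": []},
    "news.com": {"words": {"sports", "weather"}, "links": ["seed.org"]},
}

def forward_crawl(site: str) -> List[str]:
    """Follow the outgoing links of a site."""
    return list(TOY_WEB[site]["links"])

def backward_crawl(site: str) -> List[str]:
    """Find sites that link to the given site (backlinks)."""
    return [s for s, info in TOY_WEB.items() if site in info["links"]]

def jaccard(a: Set[str], b: Set[str]) -> float:
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def score(site: str, seeds: List[str]) -> float:
    """Assumed relevance score: best keyword overlap with any seed site."""
    return max(jaccard(TOY_WEB[site]["words"], TOY_WEB[s]["words"]) for s in seeds)

def discover(seeds: List[str],
             strategies: List[Callable[[str], List[str]]],
             iterations: int = 3) -> List[str]:
    seen: Set[str] = set(seeds)
    discovered: Set[str] = set()
    frontier: List[str] = list(seeds)
    for _ in range(iterations):
        # Discover: apply every strategy to every frontier site.
        candidates: Set[str] = set()
        for site in frontier:
            for strategy in strategies:
                candidates.update(strategy(site))
        # Distill (simplified here to dropping sites already seen).
        new_sites = candidates - seen
        if not new_sites:
            break
        seen |= new_sites
        discovered |= new_sites
        # The next round expands from the newly found sites.
        frontier = sorted(new_sites)
    # Rank: most similar to the seeds first; ties broken by name.
    return sorted(discovered, key=lambda s: (-score(s, seeds), s))

if __name__ == "__main__":
    print(discover(["seed.org"], [forward_crawl, backward_crawl]))
```

In this toy setting the off-topic site is still discovered but ends up at the bottom of the ranking, which is the role a ranking step plays in place of a hard accept or reject decision.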
