Paper Information - Bootstrapping Domain-Specific Content Discovery on the Web

The ability to continuously discover domain-specific content from the Web is critical for many applications. While focused crawling strategies have been shown to be effective for discovery, configuring a focused crawler is difficult and time-consuming. Given a domain of interest D, subject-matter experts (SMEs) must search for relevant websites and collect two sets of pages: representative Web pages that serve as training examples for a classifier that recognizes pages in D, and pages that seed the crawl. In this paper, we propose DISCO, an approach designed to bootstrap domain-specific search. Given a small set of websites, DISCO aims to discover a large collection of relevant websites. DISCO uses a ranking-based framework that mimics the way users search for information on the Web: it iteratively discovers new pages, distills them, and ranks them. It also applies multiple discovery strategies, including keyword-based and related queries issued to search engines, as well as backward and forward crawling. By systematically combining these strategies, DISCO is able to attain high harvest rates and coverage for a variety of domains. We perform extensive experiments in four social-good domains, using data gathered by SMEs in the respective domains, and show that our approach is effective and outperforms state-of-the-art methods.
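The iterative discover, distill, and rank loop described in the abstract can be illustrated with a minimal sketch. This is not DISCO's actual implementation: the function names (`discover`, `forward`, `backward`, `describe`), the toy link graph, and the Jaccard term-overlap score are all assumptions made for illustration, and the search-engine strategies (keyword-based and related queries) are omitted because they would require a real search API.

```python
from typing import Callable, Dict, Iterable, List, Set, Tuple

# Hypothetical type: a discovery strategy maps one website to candidate websites.
Strategy = Callable[[str], Iterable[str]]


def jaccard(a: Set[str], b: Set[str]) -> float:
    """Term-overlap similarity; a stand-in for the paper's ranking function."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def discover(
    seeds: List[str],
    strategies: List[Strategy],
    describe: Callable[[str], Set[str]],
    rounds: int = 2,
    top_k: int = 2,
) -> List[Tuple[str, float]]:
    """Iteratively discover, distill, and rank candidate websites."""
    seed_terms = [describe(s) for s in seeds]
    scores: Dict[str, float] = {}
    seen: Set[str] = set(seeds)
    frontier = list(seeds)
    for _ in range(rounds):
        # Discover: apply every strategy to every site in the frontier.
        candidates: Set[str] = set()
        for site in frontier:
            for strategy in strategies:
                candidates.update(strategy(site))
        # Distill: keep only sites that have not been seen before.
        new_sites = sorted(candidates - seen)
        if not new_sites:
            break
        seen.update(new_sites)
        # Rank: score each new site by its best similarity to any seed.
        for site in new_sites:
            terms = describe(site)
            scores[site] = max(jaccard(terms, t) for t in seed_terms)
        # Expand only the top-ranked new sites in the next round.
        frontier = sorted(new_sites, key=lambda s: (-scores[s], s))[:top_k]
    return sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))


# Toy data (entirely made up) standing in for real crawling and page content.
TOY_LINKS = {"a.org": ["b.org", "x.com"], "b.org": ["c.org"], "x.com": ["y.com"]}
TOY_BACKLINKS = {"a.org": ["d.org"]}
TOY_TERMS = {
    "a.org": {"relief", "aid", "flood"},
    "b.org": {"relief", "aid", "storm"},
    "c.org": {"aid", "flood"},
    "d.org": {"relief", "flood", "aid"},
    "x.com": {"shoes", "sale"},
    "y.com": {"sale"},
}


def forward(site: str) -> List[str]:
    """Forward crawling: follow outgoing links of a site."""
    return TOY_LINKS.get(site, [])


def backward(site: str) -> List[str]:
    """Backward crawling: find sites that link to a site."""
    return TOY_BACKLINKS.get(site, [])


def describe(site: str) -> Set[str]:
    """Return the term set that represents a site's content."""
    return TOY_TERMS.get(site, set())


ranked = discover(["a.org"], [forward, backward], describe)
for site, score in ranked:
    print(f"{site}\t{score:.2f}")
```

In this toy run, the low-scoring site `x.com` is not expanded in the second round, so its neighbor `y.com` is never visited. This shows how ranking between rounds can keep the discovery process focused on the domain, under the simplifying assumptions stated above.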
