Paper information - Which webpage should we crawl first? Social media-based webpage source importance guidance

Social media has proven to be an important and rich asset for collecting webpages about events. Ensuring full Web archive coverage of an event is not an easy task, for several reasons. First, events differ in impact and importance. Big events tend to last for a long time, impact multiple places, and even spark a range of debates about diverse topics. Second, building a Web collection that fully covers an event requires sampling an unbiased set of webpages from the WWW (which is huge, heterogeneous, and dynamically changing). The size of the WWW makes it difficult to find an unbiased set of webpages using manual techniques for collecting, curating, and sampling. Fortunately, focused crawlers have proven effective in automating and accelerating the process of collecting webpages, starting from a set of seed URLs. However, the ability of a focused crawler to find relevant and diverse webpages depends on the quality (content quality and linking structure quality) and the broad coverage (seed URLs from different webpage sources and publishing venues/genres) of the seed URLs.

Edward A. Fox | Mohamed M. Farag