A method for language-specific Web crawling and its evaluation

Many countries have launched Web archiving projects aimed at the long-term preservation of Web information, which is now regarded as culturally and socially valuable. However, because the Web is borderless, it is difficult to comprehensively gather information originating in a specific nation or culture. This paper proposes an efficient method for selectively collecting Web pages written in a specific language. First, a linguistic graph analysis of real Web data obtained from a large crawl is conducted to derive a crawling guideline that makes use of per-server language attributes. The guideline is then formulated as several variations of link selection strategy. Simulation-based evaluation reveals that one of the strategies, which accepts newly discovered Web servers cautiously, shows superior results in terms of harvest rate/coverage and runtime efficiency. © 2007 Wiley Periodicals, Inc. Syst Comp Jpn, 38(2): 10–20, 2007; Published online in Wiley InterScience (www.interscience.wiley.com). DOI 10.1002/scj.20693
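The abstract does not spell out the link selection strategies, so the following is only a minimal sketch of what a simulated, server-aware strategy might look like, not the authors' method. The toy graph `GRAPH`, the function `simulate`, the threshold `min_ratio`, and the acceptance rule (follow a link to an unseen server only from a page in the target language; follow a link to a known server only if enough of its crawled pages are in the target language) are all assumptions made for illustration. Harvest rate is taken here as the fraction of crawled pages in the target language, and coverage as the fraction of all target-language pages that were crawled.

```python
from collections import defaultdict, deque
from urllib.parse import urlsplit

# Hypothetical toy Web graph: URL -> (page language, outgoing links).
GRAPH = {
    "http://a.example/1": ("ja", ["http://a.example/2", "http://b.example/1"]),
    "http://a.example/2": ("ja", ["http://c.example/1"]),
    "http://b.example/1": ("en", ["http://b.example/2", "http://c.example/2"]),
    "http://b.example/2": ("en", []),
    "http://c.example/1": ("ja", ["http://c.example/2"]),
    "http://c.example/2": ("ja", []),
}


def host(url):
    """Return the server (host) part of a URL."""
    return urlsplit(url).netloc


def simulate(seeds, target="ja", min_ratio=0.5):
    """Simulate a breadth-first crawl over GRAPH with per-server link selection.

    Returns (harvest_rate, coverage) for the target language.
    """
    stats = defaultdict(lambda: [0, 0])  # host -> [target-language pages, pages crawled]
    queue, seen, crawled = deque(seeds), set(seeds), []

    while queue:
        url = queue.popleft()
        lang, links = GRAPH[url]
        crawled.append(url)
        counts = stats[host(url)]
        counts[0] += int(lang == target)
        counts[1] += 1

        for link in links:
            if link in seen or link not in GRAPH:
                continue
            link_host = host(link)
            if link_host in stats:
                # Known server: rely on its observed language ratio.
                hits, total = stats[link_host]
                accept = hits / total >= min_ratio
            else:
                # Unseen server: accept only when referred by a relevant page.
                accept = lang == target
            if accept:
                seen.add(link)
                queue.append(link)

    relevant = sum(1 for u in crawled if GRAPH[u][0] == target)
    total_relevant = sum(1 for lang, _ in GRAPH.values() if lang == target)
    return relevant / len(crawled), relevant / total_relevant


harvest_rate, coverage = simulate(["http://a.example/1"])
print(f"harvest rate = {harvest_rate:.2f}, coverage = {coverage:.2f}")
```

In this toy run the crawler fetches one English page (the entry point of a server it had not yet seen) and then stops following links into that server, while still reaching every Japanese page through other routes. Raising or lowering `min_ratio` is one way to explore the trade-off between harvest rate and coverage in such a simulation.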