Many countries have launched Web archiving projects aimed at the long-term preservation of Web information, which is now regarded as culturally and socially valuable. Because of its borderless character, however, the Web poses obstacles to comprehensively gathering information originating in a specific nation or culture. This paper proposes an efficient method for selectively collecting Web pages written in a specific language. First, a linguistic graph analysis of real Web data obtained from a large crawl is conducted to derive a crawling guideline based on per-server language attributes. The guideline is then developed into several variations of link selection strategies. Simulation-based evaluation reveals that one of the strategies, which carefully accepts newly discovered Web servers, yields superior results in terms of harvest rate, coverage, and runtime efficiency. © 2007 Wiley Periodicals, Inc. Syst Comp Jpn, 38(2): 10–20, 2007; Published online in Wiley InterScience (www.interscience.wiley.com). DOI 10.1002/scj.20693
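The strategy the abstract highlights — maintaining language attributes per Web server and accepting links to newly discovered servers only with care — can be illustrated with a minimal sketch. The class name, the 0.5 threshold, and the rule "accept an unseen server only when the link comes from a target-language page" are illustrative assumptions, not the paper's actual algorithm:

```python
from collections import defaultdict
from urllib.parse import urlparse

class SelectiveFrontier:
    """Toy link-selection frontier: per-server language statistics
    decide whether a discovered link should be enqueued.
    (Hypothetical sketch; not the paper's exact strategy.)"""

    def __init__(self, threshold=0.5):
        # Minimum fraction of target-language pages a known server must have.
        self.threshold = threshold
        # host -> [target_language_pages, total_pages_seen]
        self.stats = defaultdict(lambda: [0, 0])

    def record(self, url, is_target_language):
        """Update per-server counts after a page has been fetched and
        its language identified (e.g. by n-gram classification)."""
        host = urlparse(url).netloc
        counts = self.stats[host]
        counts[1] += 1
        if is_target_language:
            counts[0] += 1

    def should_follow(self, link_url, source_is_target):
        """Decide whether to enqueue link_url, given whether the page
        that contains the link is in the target language."""
        host = urlparse(link_url).netloc
        target, total = self.stats[host]
        if total == 0:
            # Unseen server: accept cautiously, only when discovered
            # from a target-language page.
            return source_is_target
        return target / total >= self.threshold
```

In this sketch, servers with a proven track record of target-language content are always followed, while unknown servers are admitted only through target-language referring pages, which keeps the crawl focused without freezing its ability to discover new hosts.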