Paper Information - Simulation Study of Language Specific Web Crawling

Simulation Study of Language Specific Web Crawling

The Web has been recognized as an important part of our cultural heritage, and many nations have started archiving their national web spaces for future generations. A key technology for data acquisition in these archiving projects is web crawling. Crawling culturally and/or linguistically specific resources from the borderless Web raises many challenging issues. In this paper, we propose language-specific web crawling and evaluate language-specific crawling strategies on a web crawling simulator.

Masaru Kitsuregawa | Takayuki Tamura | Kulwadee Somboonviwat
