On URL Normalization
Because syntactically different URLs can refer to the same resource on the Web, standards bodies have ongoing efforts to define URL normalization. This paper considers three additional URL normalization steps beyond those specified in the standard URL normalization. The idea behind our work is to further reduce false negatives in URL normalization while allowing a limited level of false positives. Two metrics are defined to analyze the effect of each normalization step. We ran an experiment on over 170 million URLs collected from real web pages and report the statistical results in this paper.
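To make the notion of "syntactically different URLs for the same resource" concrete, the following is a minimal sketch of a few syntax-based normalizations of the kind described in the generic URI syntax standard: lowercasing the scheme and host, dropping a default port, resolving dot segments, and uppercasing percent-encoding hex digits. It does not implement the three additional steps or the two metrics studied in the paper, which the abstract does not spell out. The names `normalize`, `remove_dot_segments`, and `DEFAULT_PORTS` are hypothetical, and the sketch assumes absolute http/https URLs without IPv6 literals.

```python
import re
from urllib.parse import urlsplit, urlunsplit

# Assumption: only these two schemes matter for the illustration.
DEFAULT_PORTS = {"http": 80, "https": 443}


def remove_dot_segments(path: str) -> str:
    """Resolve '.' and '..' segments in an absolute path (simplified)."""
    segments = path.split("/")
    out = []
    for seg in segments:
        if seg == ".":
            continue
        if seg == "..":
            # Never pop the leading empty segment that marks the root.
            if len(out) > 1:
                out.pop()
            continue
        out.append(seg)
    # A trailing '.' or '..' refers to a directory, so keep a trailing slash.
    if segments[-1] in (".", ".."):
        out.append("")
    return "/".join(out)


def _upper_hex(text: str) -> str:
    """Uppercase the hex digits of percent-encoded triplets, e.g. %7e -> %7E."""
    return re.sub(r"%[0-9a-fA-F]{2}", lambda m: m.group(0).upper(), text)


def normalize(url: str) -> str:
    """Return a syntactically normalized form of an absolute http(s) URL."""
    parts = urlsplit(url)
    scheme = parts.scheme.lower()
    host = (parts.hostname or "").lower()
    # Preserve any userinfo exactly as written.
    userinfo, at, _ = parts.netloc.rpartition("@")
    netloc = f"{userinfo}{at}{host}"
    # Keep the port only when it differs from the scheme's default.
    if parts.port is not None and DEFAULT_PORTS.get(scheme) != parts.port:
        netloc = f"{netloc}:{parts.port}"
    # An empty path is equivalent to "/" for http(s) URLs.
    path = _upper_hex(remove_dot_segments(parts.path)) or "/"
    query = _upper_hex(parts.query)
    return urlunsplit((scheme, netloc, path, query, parts.fragment))


if __name__ == "__main__":
    a = "HTTP://Example.COM:80/a/b/../c/./d.html?q=%7euser"
    b = "http://example.com/a/c/d.html?q=%7Euser"
    print(normalize(a) == normalize(b))
```

In this setting, a false negative is a pair of URLs that refer to the same resource but still differ after normalization, and a false positive is a pair of distinct resources whose URLs are normalized to the same string. The steps sketched here are conservative; more aggressive steps trade a few false positives for fewer false negatives.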