Finding Near-Replicas of Documents and Servers on the Web

We consider how to efficiently compute the overlap between all pairs of web documents. This information can be used to improve web crawlers, web archivers, and the presentation of search results, among other applications. We report statistics on how common replication is on the web, and on the cost of computing the above information for a relatively large subset of the web – about 24 million web pages, corresponding to about 150 gigabytes of textual information.
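To make the notion of pairwise overlap concrete, the following is a minimal illustrative sketch, not the paper's actual method: it measures overlap between two documents as the fraction of shared k-word chunks (shingles). The function names, the chunk size `k=4`, and the sample documents are all assumptions introduced for illustration.

```python
from itertools import combinations

def shingles(text, k=4):
    # Set of all contiguous k-word sequences (shingles) in the text.
    words = text.lower().split()
    return {tuple(words[i:i + k]) for i in range(len(words) - k + 1)}

def overlap(a, b):
    # Fraction of a's shingles that also occur in b (0.0 to 1.0).
    sa, sb = shingles(a), shingles(b)
    if not sa:
        return 0.0
    return len(sa & sb) / len(sa)

docs = {
    "d1": "the quick brown fox jumps over the lazy dog",
    "d2": "the quick brown fox jumps over a sleeping cat",
    "d3": "completely unrelated text about something else",
}

# Naive all-pairs comparison; at web scale this quadratic scan is the
# bottleneck that efficient near-replica detection must avoid.
for (n1, t1), (n2, t2) in combinations(docs.items(), 2):
    print(n1, n2, round(overlap(t1, t2), 2))
```

At web scale one would replace the literal shingle sets with small fingerprints (hashes) and avoid the quadratic pairwise scan, which is precisely the efficiency question the paper addresses.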