Paper Information - Differences and Identities in Document Retrieval in an Annotation Environment

Digital annotation of web pages raises two problems that traditional annotation does not face, both stemming from the dynamic and open nature of the Web. The first is that a document may be replicated across multiple sites, so that the same content can be retrieved at different URLs or through different queries. A web page must therefore be associated with all the annotations pertaining to its content, even those created while the content was accessed under a different URL. The second concerns changes to individual HTML pages over time, which often consist of insertions, deletions, or movements of page segments. Annotations attached to portions that have moved within the page should still be retrieved and shown to the user. To reduce the impact of these phenomena on the usefulness of annotation, our annotation system MADCOW incorporates two algorithms: one assesses the identity of two pages under different URLs, and the other detects the differences between two versions of a page under the same URL. Both then take the appropriate actions to retrieve all pertinent annotations.
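
The abstract does not describe how the two algorithms work, so the following is only an illustrative sketch of the two tasks, not MADCOW's actual method. It assumes that page identity can be approximated by the Jaccard resemblance of word shingles, and that a page version can be treated as a list of text segments aligned with Python's standard `difflib`. All function names, the shingle size `k`, and the `threshold` value are hypothetical choices made for this example.

```python
from difflib import SequenceMatcher


def shingles(text, k=3):
    """Return the set of k-word shingles of a text (whitespace tokenised)."""
    words = text.lower().split()
    if len(words) < k:
        return {tuple(words)} if words else set()
    return {tuple(words[i:i + k]) for i in range(len(words) - k + 1)}


def resemblance(text_a, text_b, k=3):
    """Jaccard similarity of the two shingle sets, a value in [0, 1]."""
    a, b = shingles(text_a, k), shingles(text_b, k)
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def same_content(text_a, text_b, threshold=0.9):
    """Assumed decision rule: treat two pages as the same document
    when their resemblance reaches the (hypothetical) threshold."""
    return resemblance(text_a, text_b) >= threshold


def relocate(old_segments, new_segments, old_index):
    """Map a segment index in the old version to its index in the new one.

    Returns None when the annotated segment no longer appears unchanged.
    """
    matcher = SequenceMatcher(None, old_segments, new_segments, autojunk=False)
    # Segments that kept their relative order fall inside a matching block.
    for block in matcher.get_matching_blocks():
        if block.a <= old_index < block.a + block.size:
            return block.b + (old_index - block.a)
    # A segment moved across other content is not aligned by the matcher,
    # so fall back to the first exact occurrence of its text.
    try:
        return new_segments.index(old_segments[old_index])
    except ValueError:
        return None
```

Under these assumptions, an annotation stored against segment 1 of `["intro", "body", "footer"]` would be re-attached to segment 2 after a banner is inserted at the top of the page, and an annotation on a deleted segment would come back as `None` so the system can decide how to handle it. The exact-match fallback picks the first occurrence, which is ambiguous when the same segment text appears more than once.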
