Paper Information - Detecting Near-Duplicate Relations in User Generated Forum Content


A web forum is a large database of community knowledge, with information on the most recent events and developments. Unfortunately, this knowledge is presented in a format that humans understand easily but machines cannot process automatically. However, long-term observation of several forums suggests that there are several distinct types of postings and of relations between them. One frequent and particularly annoying relation between two contributions is the near-duplicate relation. In this paper we propose work to detect and utilize relations between contributions, concentrating on near-duplication. We propose ideas on how to calculate similarity and build groups of similar threads, and thus make near-duplicates in forums evident. One of the core theses is that information from forum and thread structure can be applied to improve existing near-duplicate detection approaches. In addition, the proposed work presents qualitative and quantitative results of applying these principles, thereby determining which features are actually useful in the near-duplicate detection process. We also propose several sample applications that benefit from forum near-duplicate detection.
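As a rough illustration of the similarity and grouping steps mentioned in the abstract, the sketch below compares threads by the Jaccard overlap of their word shingles and merges threads above a threshold into groups. This is not the authors' method: the shingle size `k`, the `threshold` value, and all function names are assumptions made for this example, and the forum and thread structure features that the paper proposes to exploit are not modeled here.

```python
from itertools import combinations


def shingles(text, k=3):
    """Return the set of k-word shingles of a text (lowercased, whitespace-split)."""
    words = text.lower().split()
    if len(words) < k:
        # Short texts yield a single shingle (or none if the text is empty).
        return {tuple(words)} if words else set()
    return {tuple(words[i:i + k]) for i in range(len(words) - k + 1)}


def jaccard(a, b):
    """Jaccard similarity of two sets; defined as 0.0 when both are empty."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def group_near_duplicates(threads, threshold=0.5, k=3):
    """Group thread ids whose shingle sets are pairwise-linked above `threshold`.

    `threads` maps a thread id to its text. Returns a list of sets of ids,
    keeping only groups with more than one member. The pairwise comparison
    is quadratic and only suitable for small collections.
    """
    shingle_sets = {tid: shingles(text, k) for tid, text in threads.items()}
    parent = {tid: tid for tid in threads}

    def find(x):
        # Union-find lookup with path halving.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for a, b in combinations(threads, 2):
        if jaccard(shingle_sets[a], shingle_sets[b]) >= threshold:
            parent[find(a)] = find(b)

    groups = {}
    for tid in threads:
        groups.setdefault(find(tid), set()).add(tid)
    return [g for g in groups.values() if len(g) > 1]
```

Calling `group_near_duplicates` on a dictionary of thread texts returns the sets of thread ids that are linked by high shingle overlap; threads without a similar partner are left out of the result.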

Alexander Löser | Klemens Muthmann
