Detecting Near-Duplicate Relations in User Generated Forum Content

A web forum is a large database of community knowledge, with information on the most recent events and developments. Unfortunately, this knowledge is presented in a format that humans understand easily but machines cannot process automatically. Long-term observation of several forums nevertheless suggests that there are several distinct types of postings and of relations between them. One frequent and particularly annoying relation between two contributions is the near-duplicate relation. In this paper we propose work to detect and utilize contribution relations, concentrating on near-duplication. We propose ideas on how to calculate similarity and build groups of similar threads, and thus make near-duplicates in forums evident. One of the core theses is that information from forum and thread structure can be applied to improve existing near-duplicate detection approaches. In addition, the proposed work shows the qualitative and quantitative results of applying such principles, thereby determining which features are actually useful in the near-duplicate detection process. We also propose several sample applications that benefit from forum near-duplicate detection.
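To make the idea of calculating similarity and building groups of similar threads concrete, the following is a minimal sketch and not the method proposed in this paper. It assumes a text-only baseline: each thread is reduced to a set of word shingles, pairs are compared by Jaccard similarity, and threads whose similarity reaches a threshold are merged into one group. The function names, the shingle size `k`, and the threshold value are hypothetical choices made for illustration; forum and thread structure features are not modeled here.

```python
from itertools import combinations


def shingles(text, k=3):
    """Return the set of k-word shingles of a text (lowercased, whitespace-split)."""
    words = text.lower().split()
    if len(words) < k:
        # Texts shorter than k words yield a single short shingle (or nothing).
        return {tuple(words)} if words else set()
    return {tuple(words[i:i + k]) for i in range(len(words) - k + 1)}


def jaccard(a, b):
    """Jaccard similarity of two sets; defined as 0.0 when both are empty."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def group_near_duplicates(threads, threshold=0.5, k=3):
    """Group thread ids whose shingle sets are similar (assumed threshold).

    `threads` maps a thread id to its text. Groups are connected components
    of the "similarity >= threshold" graph, built with a small union-find.
    Only groups with more than one member are returned.
    """
    ids = list(threads)
    parent = {t: t for t in ids}

    def find(x):
        # Follow parent links to the root, compressing the path on the way.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    sets = {t: shingles(threads[t], k) for t in ids}
    # Pairwise comparison is quadratic; fine for a sketch, not for a large forum.
    for a, b in combinations(ids, 2):
        if jaccard(sets[a], sets[b]) >= threshold:
            parent[find(a)] = find(b)

    groups = {}
    for t in ids:
        groups.setdefault(find(t), []).append(t)
    return [sorted(g) for g in groups.values() if len(g) > 1]


if __name__ == "__main__":
    example = {
        "t1": "how do I reset my router password",
        "t2": "how do I reset my router password please",
        "t3": "best pizza places in town tonight",
    }
    print(group_near_duplicates(example))
```

In this sketch the pairwise comparison is the bottleneck. Structural signals such as sub-forum, posting time, or reply counts could in principle be used to restrict which pairs are compared or to reweight the score, but how to do so is exactly what the proposed work sets out to evaluate.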
