Aggregating sentence-level features for Chinese near-duplicate document detection

Detecting near-duplicate documents efficiently is an indispensable capability for many applications, such as search engines, information retrieval systems, and recommendation systems. In this paper, we propose a novel content representation method for near-duplicate document detection in a large collection of Chinese documents. The proposed method, called multi-aggregation fingerprint (MAF), consists of sentence-level feature extraction and multi-feature aggregation. Compared with individual terms, sentences are more representative and carry richer, more complete information. We therefore extract the crucial information of each sentence to form its sentence feature. To improve the accuracy and efficiency of near-duplicate document detection, we exploit both the holistic characteristics of sentence features across the dataset and the statistical information of the sentence features belonging to a single document. Accordingly, we split the sentence feature space based on the distribution of features in the dataset. Each sentence feature is assigned to the nearest partition of the feature space, and the sentence features of a document are aggregated into a compact, global fingerprint. Experimental results show that the proposed MAF method achieves competitive results on the Chinese document dataset.
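
The assign-then-aggregate step described above can be illustrated with a minimal sketch. It is not the MAF implementation: the abstract does not specify the feature encoding, the partitioning procedure, or the aggregation rule. The sketch assumes that sentence features are numeric vectors, that the partition centers (`centroids`) were obtained beforehand from the dataset (for example by k-means), and that aggregation sums the residuals per partition in a VLAD-like manner. All function names are hypothetical.

```python
from math import sqrt


def nearest_partition(feature, centroids):
    """Return the index of the centroid closest to `feature` (squared Euclidean distance)."""
    def sq_dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    return min(range(len(centroids)), key=lambda k: sq_dist(feature, centroids[k]))


def aggregate_fingerprint(sentence_features, centroids):
    """Aggregate a document's sentence features into one fixed-length vector.

    Assumption: residuals (feature minus centroid) are summed per partition,
    the per-partition sums are concatenated, and the result is L2-normalized.
    """
    dim = len(centroids[0])
    blocks = [[0.0] * dim for _ in centroids]
    for feature in sentence_features:
        k = nearest_partition(feature, centroids)
        for i in range(dim):
            blocks[k][i] += feature[i] - centroids[k][i]
    flat = [v for block in blocks for v in block]
    norm = sqrt(sum(v * v for v in flat))
    return [v / norm for v in flat] if norm > 0 else flat


def similarity(fp_a, fp_b):
    """Dot product of two fingerprints (cosine similarity when both are L2-normalized)."""
    return sum(x * y for x, y in zip(fp_a, fp_b))


# Toy data: two partition centers and two documents with two sentence features each.
centroids = [[0.0, 0.0], [10.0, 10.0]]
doc_a = [[1.0, 0.0], [9.0, 10.0]]
doc_b = [[1.0, 0.0], [9.0, 10.0]]

fp_a = aggregate_fingerprint(doc_a, centroids)
fp_b = aggregate_fingerprint(doc_b, centroids)
```

Under these assumptions the fingerprint length depends only on the number of partitions and the feature dimension, not on the number of sentences, so documents of different lengths can be compared with a single vector operation.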
