Paper Information - Batch Text Similarity Search with MapReduce

Batch Text Similarity Search with MapReduce

Batch text similarity search aims to find, for each query in a user's batch of text queries, the texts that are similar to it. It is widely used in real-world applications such as plagiarism detection, and is attracting growing attention as the volume of text on the web increases. Existing approaches, such as FuzzyJoin, support neither varying similarity thresholds nor online batch text similarity search. This paper proposes a two-stage algorithm, built on inverted index structures, that addresses the batch text similarity search problem. Experimental results on real datasets demonstrate the efficiency and scalability of the method.
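The abstract does not spell out the two stages, so the following is only a rough, single-process sketch of how an inverted index can serve a batch of queries with a threshold chosen at query time. It is not the authors' algorithm and does not use MapReduce. The whitespace tokenizer, the Jaccard measure, the split into candidate generation and verification, and all function names (`tokenize`, `build_index`, `batch_search`) are assumptions made for illustration.

```python
from collections import defaultdict


def tokenize(text):
    """Lowercase whitespace tokenizer returning a token set (an assumption)."""
    return set(text.lower().split())


def build_index(corpus):
    """Build an inverted index mapping each token to the ids of documents containing it."""
    index = defaultdict(set)
    for doc_id, text in corpus.items():
        for token in tokenize(text):
            index[token].add(doc_id)
    return index


def batch_search(queries, corpus, index, threshold):
    """For each query, return (doc_id, jaccard) pairs with jaccard >= threshold.

    The two steps below are a generic candidate-generation / verification
    split, assumed for illustration; they are not taken from the paper.
    """
    results = {}
    for query_id, query in queries.items():
        q_tokens = tokenize(query)
        # Step 1: use the index to count tokens shared with each document.
        overlap = defaultdict(int)
        for token in q_tokens:
            for doc_id in index.get(token, ()):
                overlap[doc_id] += 1
        # Step 2: turn overlap counts into Jaccard scores and apply the threshold.
        matches = []
        for doc_id, shared in overlap.items():
            doc_size = len(tokenize(corpus[doc_id]))
            jaccard = shared / (len(q_tokens) + doc_size - shared)
            if jaccard >= threshold:
                matches.append((doc_id, jaccard))
        results[query_id] = sorted(matches, key=lambda m: -m[1])
    return results
```

Because the threshold is an argument to `batch_search` rather than something fixed when the index is built, the same index can be reused for queries with different thresholds, which is the kind of flexibility the abstract says existing approaches lack.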

Li Ju | Rui Li | Zhuo Peng | Zhiwei Yu | Chaokun Wang

[1] Roberto J. Bayardo, et al. Scaling up all pairs similarity search, 2007, WWW '07.

[2] Jeffrey Xu Yu, et al. Efficient similarity joins for near duplicate detection, 2008, WWW.

[3] Sunita Sarawagi, et al. Efficient set joins on similarity predicates, 2004, SIGMOD '04.

[4] Raghav Kaushik, et al. Efficient exact set-similarity joins, 2006, VLDB.

[5] James Lewis, et al. Text similarity: an alternative way to search MEDLINE, 2006.

[6] Jun Zhang, et al. Similarity Search for Web Services, 2004, VLDB.

[7] Jimmy J. Lin. Brute force and indexed approaches to pairwise document similarity comparisons with MapReduce, 2009, SIGIR.

[8] Chen Li, et al. Efficient parallel set-similarity joins using MapReduce, 2010, SIGMOD Conference.

[9] Luis Gravano, et al. Approximate String Joins in a Database (Almost) for Free, 2001, VLDB.

[10] Surajit Chaudhuri, et al. A Primitive Operator for Similarity Joins in Data Cleaning, 2006, 22nd International Conference on Data Engineering (ICDE'06).

[11] Christian Böhm, et al. Fast parallel similarity search in multimedia databases, 1997, SIGMOD '97.

[12] Sanjay Ghemawat, et al. MapReduce: Simplified Data Processing on Large Clusters, 2004, OSDI.