SimSeerX: a similar document search engine

The need to find similar documents arises in many settings, such as plagiarism detection and research paper recommendation. Manually constructing queries to find similar documents can be overly complex, which motivates the use of whole documents as queries. This paper introduces SimSeerX, a search engine for similar document retrieval that receives whole documents as queries and returns a ranked list of similar documents. Key to the design of SimSeerX is that it is able to work with multiple similarity functions and document collections. We present the architecture and interface of SimSeerX, show its applicability with three different similarity functions, and demonstrate its scalability on a collection of 3.5 million academic documents.
