A Text Similarity Meta-Search Engine Based on Document Fingerprints and Search Results Records

The retrieval of similar documents from the Web using full documents as input, rather than key-term queries, is not currently supported by traditional Web search engines. One approach to the problem is to fingerprint the document's content into a set of queries that are submitted to a list of Web search engines. The results are then merged, their URLs are fetched, and their content is compared with the given document using text comparison algorithms. However, requesting the content of results from multiple Web servers can take a significant amount of time and effort. In this work, we estimate a similarity function between the given document and the retrieved results. The function uses as variables features drawn from the search engine results records, such as rankings, titles, and snippets, thereby avoiding the bottleneck of requesting content from external Web servers. We created a collection of around 10,000 search engine results by generating queries from 2,000 crawled Web documents. We then fitted the similarity function using the cosine similarity between the input document and the content of each result as the target variable. The execution times of the exact and approximate solutions were compared. Our approximate solution reduced computation time by 86% while maintaining an acceptable level of precision with respect to the exact solution of the Web document retrieval problem.
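
The pipeline described above (fingerprint the document into queries, collect search results, then score each result against the input document using only its result-record features) can be sketched as follows. This is a minimal, self-contained illustration assuming a simple frequency-based fingerprinting heuristic and hand-picked linear weights; the function names and the weights are hypothetical placeholders, not the fitted model reported in the paper.

```python
# Minimal sketch of the approach: fingerprint-based query generation and a
# similarity estimate built only from search-result-record features
# (rank, title, snippet). Hypothetical illustration; no live search calls.

import math
import re
from collections import Counter


def tokenize(text: str) -> list[str]:
    """Lowercase the text and split it into word tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def cosine_similarity(a: Counter, b: Counter) -> float:
    """Cosine similarity between two term-frequency vectors."""
    if not a or not b:
        return 0.0
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * \
           math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def fingerprint_queries(document: str,
                        num_queries: int = 5,
                        terms_per_query: int = 4) -> list[str]:
    """Turn a document into a handful of key-term queries (a simple
    frequency-based stand-in for the fingerprinting step)."""
    tf = Counter(tokenize(document))
    top_terms = [t for t, _ in tf.most_common(num_queries * terms_per_query)]
    return [" ".join(top_terms[i:i + terms_per_query])
            for i in range(0, len(top_terms), terms_per_query)]


def estimate_similarity(document: str, rank: int,
                        title: str, snippet: str) -> float:
    """Estimate document/result similarity from result-record features only,
    avoiding any fetch of the result URL. The linear weights below are
    illustrative, not fitted coefficients."""
    doc_vec = Counter(tokenize(document))
    rank_score = 1.0 / rank  # Zipf-like decay with result rank
    title_score = cosine_similarity(doc_vec, Counter(tokenize(title)))
    snippet_score = cosine_similarity(doc_vec, Counter(tokenize(snippet)))
    return 0.2 * rank_score + 0.3 * title_score + 0.5 * snippet_score


if __name__ == "__main__":
    doc = ("Retrieving similar Web documents using document fingerprints "
           "and metasearch over search engine results records.")
    print(fingerprint_queries(doc))
    print(estimate_similarity(
        doc, rank=1,
        title="Document similarity retrieval on the Web",
        snippet="A metasearch approach that fingerprints documents into "
                "queries and scores the returned results."))
```

In the actual method the combination of features is learned against the cosine similarity between the input document and the fetched content of each result; the fixed weights above merely stand in for that fitted model.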
