Hypergeometric Language Model and Zipf-Like Scoring Function for Web Document Similarity Retrieval

Retrieving Web documents similar to a given document differs in many respects from information retrieval driven by queries that search engine users write themselves. This work proposes a new method for similar-document retrieval on the Web, based on generative language models and meta search engines. Probabilistic language models serve as a random query generator for the given document, and the generated queries are submitted to a customizable set of Web search engines. Once all results are gathered, they are evaluated with a proposed scoring function based on Zipf's law. The results show that the proposed query generation and scoring procedure solves the problem with acceptable levels of precision.
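The pipeline described above can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not give the exact language model or scoring formula, so the sketch assumes that query terms are drawn without replacement from the document's tokens (which makes the term counts in each query hypergeometrically distributed) and that a result at rank r in any engine's list contributes 1 / r^s to its score. The function names `generate_queries` and `zipf_scores`, the exponent `s`, and the example URLs are all hypothetical.

```python
import random
from collections import defaultdict


def generate_queries(text, num_queries=5, terms_per_query=3, seed=0):
    """Generate random queries from a document (illustrative assumption).

    Tokens are sampled without replacement from the document's token list,
    so the number of times a term appears in a query follows a
    (multivariate) hypergeometric distribution.
    """
    rng = random.Random(seed)
    # Keep purely alphabetic tokens only; a real system would use a
    # proper tokenizer and stop-word handling.
    tokens = [t for t in text.lower().split() if t.isalpha()]
    k = min(terms_per_query, len(tokens))
    return [rng.sample(tokens, k) for _ in range(num_queries)]


def zipf_scores(result_lists, s=1.0):
    """Merge ranked result lists with a Zipf-like weight (assumed form).

    Each result at 1-based rank r adds 1 / r**s to its URL's score.
    Returns (url, score) pairs sorted by descending score.
    """
    scores = defaultdict(float)
    for results in result_lists:
        for rank, url in enumerate(results, start=1):
            scores[url] += 1.0 / rank ** s
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)


if __name__ == "__main__":
    doc = "the quick brown fox jumps over the lazy dog"
    for query in generate_queries(doc, num_queries=3):
        print(" ".join(query))

    # Hypothetical ranked lists, one per (query, engine) pair.
    lists = [["a", "b", "c"], ["b", "a"], ["b", "d"]]
    for url, score in zipf_scores(lists):
        print(url, round(score, 3))
```

In this sketch, documents that appear near the top of many result lists accumulate the largest scores, which is the intuition behind a rank-based, Zipf-like weighting. Submitting the queries to actual search engines is left out, since it depends on each engine's API.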
