Individual Query Cardinality Estimation using Multiple Query Combinations on a Search Engine's Corpus

Most modern search engines feature keyword-based search interfaces. These interfaces are usually found on websites belonging to enterprises or governments, or on news, blog, and social media sites that hold a large corpus of documents. Such document collections are not easily indexed by web search engines and are regarded as hidden web databases. Through their keyword search interfaces, these databases offer third parties opportunities for data analysis. A significant amount of research has already been carried out on analyzing and extracting aggregate information about these hidden document corpora. Most of this research, however, focuses on high-level, big-picture information about the database; little attention has been paid to extracting analytical information specific to individual queries. This paper addresses that gap and draws on ideas from existing research to formulate a technique for query cardinality estimation, i.e., estimating the number of documents in a search engine's corpus that match a query. We experimentally assess the effectiveness of our method by building a search engine on the Reuters-21578 document corpus. For a given keyword, the count of matching documents is estimated solely by sending search queries through the interface.
