Discovering the Skyline of Web Databases

Many web databases are "hidden" behind proprietary search interfaces that enforce a top-$k$ output constraint: each query returns at most $k$ of the matching tuples, selected according to a proprietary ranking function. In this paper, we initiate research into the novel problem of skyline discovery over top-$k$ hidden web databases. Because the skyline includes the top-ranked tuple for every ranking function that is monotonic in the attribute values, discovering it provides critical insight into the database and can enable a wide variety of innovative third-party applications over one or multiple web databases. We show that the critical factor affecting the cost of skyline discovery is the type of search-interface controls the website provides. Accordingly, we develop efficient algorithms for the three most popular types (one-ended range, free range, and point predicates) and then combine them to support web databases that feature a mixture of these types. Rigorous theoretical analysis and extensive real-world online and offline experiments demonstrate the effectiveness of the proposed techniques and their superiority over baseline solutions.
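To make the two central notions concrete, the following minimal Python sketch illustrates dominance, the skyline, and a simulated top-$k$ interface with one-ended range predicates. It is an illustration only and not one of the discovery algorithms developed in the paper: the names `dominates`, `skyline`, and `top_k`, the toy data, and the convention that smaller attribute values are preferred are all assumptions made for the example. The brute-force `skyline` function also assumes full access to the data, which a hidden database does not grant.

```python
from typing import Callable, List, Optional, Sequence, Tuple

Row = Tuple[float, ...]


def dominates(a: Row, b: Row) -> bool:
    """True if a is no worse than b on every attribute and strictly
    better on at least one (assumption: smaller values are preferred)."""
    return all(x <= y for x, y in zip(a, b)) and any(x < y for x, y in zip(a, b))


def skyline(rows: Sequence[Row]) -> List[Row]:
    """Brute-force skyline: keep every row that no other row dominates."""
    return [r for r in rows if not any(dominates(o, r) for o in rows)]


def top_k(rows: Sequence[Row], k: int, score: Callable[[Row], float],
          upper: Optional[Row] = None) -> List[Row]:
    """Hypothetical hidden-database query: filter by one-ended range
    predicates (attribute_i <= upper[i]), then return only the k
    best-scoring matches, mimicking the top-k output constraint."""
    matches = [r for r in rows
               if upper is None or all(x <= u for x, u in zip(r, upper))]
    return sorted(matches, key=score)[:k]


# Toy database with two attributes, e.g. (price, distance).
db: List[Row] = [(1, 9), (3, 3), (9, 1), (4, 4), (8, 8)]
sky = skyline(db)

# Three monotonic ranking functions with different attribute weights.
rankers = [
    lambda r: r[0] + r[1],
    lambda r: 10 * r[0] + r[1],
    lambda r: r[0] + 10 * r[1],
]

# The top-ranked tuple under each monotonic ranking is a skyline tuple.
winners = [top_k(db, 1, f)[0] for f in rankers]
print(sky)
print(winners)
```

In this toy run, each of the three rankings returns a different top-1 tuple, and all three lie on the skyline, which is the property the abstract relies on. The `upper` argument shows how a third party could narrow a query when the interface returns only $k$ tuples at a time.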
