Top-k vectorial aggregation queries in a distributed environment

Given a large set of objects in a distributed database, the goal of a top-k query is to determine the k highest-scoring objects and return them to the user. Efficient top-k ranking over distributed databases has been the focus of recent research, with most current algorithms operating on the assumption that each node holds a single attribute, or a small subset, of each object's numerical attributes. However, in many important setups each node might instead hold a full d-dimensional vector of numerical attributes for each object. Examples include website activity in distributed servers, sales statistics for a retail chain, or share price information in different stock markets. For these setups, we define a novel ranking problem, top-k vectorial aggregation queries, where each object's score is determined by first aggregating the attribute vectors held for it and then applying the scoring function over the aggregated vector. Our communication-efficient algorithm uses a blend of geometric and skyline-related machinery, some of which is newly developed, as well as an algorithmic framework for defining generic local constraints. Previous algorithms have reduced data sharing by defining local thresholds for each attribute, but such tailored solutions might perform poorly in this setting. Experimental results on real-world data demonstrate that our algorithm maintains low latency, with a communication cost up to four orders of magnitude lower than that of existing solutions.
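
To make the problem definition concrete, the following is a minimal sketch of a naive centralized baseline, not the communication-efficient algorithm described above. It assumes that aggregation is a component-wise sum (the abstract does not name the aggregation operator), and the names `top_k_vectorial`, `nodes`, and `score` are hypothetical, introduced only for illustration.

```python
import heapq


def top_k_vectorial(nodes, score, k):
    """Naive baseline: collect every vector centrally, then rank.

    nodes: list of dicts, one per node, mapping object id -> attribute vector.
    score: function applied to the aggregated vector of one object.
    Assumption: vectors are aggregated by component-wise sum.
    """
    totals = {}
    for local in nodes:
        for obj, vec in local.items():
            if obj in totals:
                totals[obj] = [a + b for a, b in zip(totals[obj], vec)]
            else:
                totals[obj] = list(vec)
    # Score is applied only after aggregation, as in the problem definition.
    return heapq.nlargest(k, totals, key=lambda o: score(totals[o]))


# Two nodes, 2-dimensional vectors, and a non-linear scoring function.
nodes = [
    {"a": (1, 2), "b": (3, 0)},
    {"a": (2, 2), "b": (0, 1), "c": (5, 5)},
]
result = top_k_vectorial(nodes, lambda v: v[0] * v[1], k=2)
```

This baseline ships every vector to a single site, which is exactly the communication cost a distributed algorithm tries to avoid. With a non-linear scoring function such as the product used here, per-node scores cannot simply be summed, which is why the score is applied to the aggregated vector.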