Efficient Processing of Ranking Queries in Novel Applications

Ranking queries, which return only the highest-ranked subset of the results matching a user query, have been studied extensively in the past decade due to their importance in a wide range of applications. In this thesis, we study ranking queries in novel environments and settings where they have not been considered so far. With advances in sensor technology, these small devices are now present in nearly every corner of human life. Millions of them are deployed in various places and send data continuously. Sensors, which previously monitored mainly environmental phenomena or production chains, have now found their way into our daily lives as well; health monitoring is one example of how much we rely on the continuous observation of measurements. As Web technology evolves and facilitates the transmission of data streams, sensors are no longer the sole producers of streaming data. Web 2.0 has escalated the production of user-generated content, which appears in the form of annotated posts in a Weblog (blog), pictures and videos, or small textual snippets reflecting the current activity or status of users, and which can be regarded as natural items of a temporal stream.

A major part of this thesis is devoted to developing novel methods that help keep track of this ever-increasing flow of information by continuously monitoring ranking queries over it, particularly when traditional approaches fail to meet the newly raised requirements. We consider the ranking problem when the information flow is not synchronized among its sources. This is a recurring situation, since sensors are run by different organizations, measure moving entities, or are simply represented by users, who are inherently not synchronizable. Our methods are designed in particular for handling unsynchronized streams, calculating an object's score based on both its currently observed contribution to the registered queries and the contribution it might have in the future.
While this uncertainty in score calculation causes linear growth in the space necessary for providing exact results, we are able to define criteria that allow unpromising objects to be evicted as early as possible. We also leverage statistical properties that reflect the correlation between multiple streams to predict future behavior and thereby obtain tighter bounds on the best possible contribution of an object, which dramatically limits the necessary storage. To achieve this, we make use of small statistical synopses that are periodically refreshed at runtime.

Furthermore, we consider user-generated queries in the context of Web 2.0 applications that aim at filtering data streams of textual documents based on personal interests. In this case, the dimensionality of the data, the large cardinality of the subscribed queries, and the desire to consume recent information raise new challenges. We develop new approaches that filter the information efficiently and provide real-time updates to the queries users have subscribed to. Our methods rely on a novel ordering of user queries in traditional inverted lists, which allows the system to effectively prune those queries for which a new piece of information is of no interest.

Finally, we investigate high-quality search over user-generated content in Web 2.0 applications in the form of images or videos. These resources are inherently dispersed all over the globe and can therefore best be managed in a purely distributed peer-to-peer network, which eliminates single points of failure. Search in such a huge repository of high-dimensional data involves evaluating ranking queries in the form of nearest neighbor queries. Therefore, we study ranking queries in high-dimensional spaces, where the index of the objects is maintained in a purely distributed fashion.
Our solution meets the two major requirements for distributing the index and evaluating ranking queries: the underlying peer-to-peer network remains load-balanced, and query evaluation is efficient because similar objects are assigned to nearby peers.
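
The eviction idea for unsynchronized streams mentioned above can be illustrated with a minimal sketch. It is not the thesis's actual algorithm: it assumes, purely for illustration, that an object's score is the sum of non-negative per-stream contributions and that an upper bound on each stream's contribution is known. The function name `prune_candidates`, the `stream_max` bounds, and the data layout are all hypothetical. An object is kept only while its best possible score can still reach the k-th best score observed so far.

```python
import heapq


def prune_candidates(observed, stream_max, k):
    """Return the objects that may still enter the top-k.

    observed:   {object: {stream: contribution seen so far}}
    stream_max: {stream: assumed upper bound on any single contribution}
    k:          number of results the ranking query asks for
    """
    # Lower bound: contributions already observed (future ones are >= 0).
    lower = {obj: sum(seen.values()) for obj, seen in observed.items()}

    # Upper bound: observed part plus the best case for every stream
    # that has not yet reported a value for this object.
    upper = {
        obj: lower[obj]
        + sum(bound for s, bound in stream_max.items() if s not in observed[obj])
        for obj in observed
    }

    # With fewer than k candidates nothing can be evicted safely.
    if len(lower) < k:
        return set(observed)

    # k-th largest lower bound: the score an object must be able to reach.
    threshold = heapq.nlargest(k, lower.values())[-1]
    return {obj for obj in observed if upper[obj] >= threshold}


if __name__ == "__main__":
    stream_max = {"s1": 10, "s2": 10}
    observed = {
        "a": {"s1": 9, "s2": 8},  # fully observed, score 17
        "b": {"s1": 3},           # at best 3 + 10 = 13
        "c": {"s2": 9},           # at best 9 + 10 = 19
    }
    print(sorted(prune_candidates(observed, stream_max, k=1)))  # ['a', 'c']
```

In this example object `b` can be dropped because even its best case (13) cannot reach the current leader (17), while `c` must be retained until its missing contribution arrives. Tighter per-stream bounds, such as those derived from the statistical synopses described above, would shrink the retained set further.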
