A Publish/Subscribe Approach to Processing Continuous Queries over Sensor Streams

With technological advances, the sources of available information have become increasingly diverse. Recently, a new source of information has gained importance: sensor data. Sensors are devices that sense their environment in various ways and generally report a numeric result. A sensor reports values continuously, so the flow of information is also continuous, like a stream. As the field has developed, the usage paradigm has shifted from stand-alone sensors to interconnected sensors, or sensor networks. Sensors have become more complex, generating larger quantities of data and carrying wireless communication modules for transmitting it. Initially, data from sensor networks was stored first and processed afterwards, so classical database technologies could be used. However, the focus soon shifted towards reacting to sensor data in real time. A user query that reacts in real time to a stream of data is called a continuous query; answering such a query requires processing it continuously as new values appear in the sensor stream. As sensor networks and sensor-based applications became more popular, users identified the need to query sensor data pertaining to different sensor networks. This setting, of interconnected sensor networks, consists of more powerful computational devices, connected by wired links, which can process and relay sensor data. Users can launch queries at any node to query sensor events coming from any part of the interconnected network. In this setting, the number of data sources (sensors) is orders of magnitude smaller than the number of user queries, which is itself orders of magnitude smaller than the full content of the (sensor) data streams, and communication becomes by far the greatest bottleneck. In this thesis, we present our research on reducing the communication cost generated by applications accessing large-scale interconnected sensor networks.
Our first contribution is a probabilistic algorithm for detecting and exploiting subsumption of queries over correlated data sources. This technique reduces the communication traffic generated by query forwarding in an interconnected sensor network by filtering out queries that are subsumed by a set of existing queries. It also reduces the number of results that need to be transmitted. We propose an efficient algorithm for forwarding the elements of the result sets, based on publish/subscribe data dissemination. To support the general setting of distributed data sources in an interconnected sensor network, we propose a Filter-Split-Forward approach that adapts set subsumption to the case of join queries over distributed data sources. The approach relies on filter-split-forward phases for efficient query filtering and placement inside the network, and on efficient publish/subscribe forwarding of matching events. We also propose distributed adaptations of state-of-the-art solutions for continuous query processing over multiple data sources. We adapt these techniques to require only local interactions between nodes, without relying on global knowledge or a centralized server. We show how our approach achieves lower traffic through query subsumption and efficient event dissemination. In many applications using sensor data, users are interested only in the most relevant events. To that end, we present our solutions for processing top-k queries over distributed sensor data streams in the presence of query subsumption. We analyze the impact of query subsumption on top-k processing. We propose different strategies for incorporating query subsumption into top-k processing, in order to obtain sufficiently accurate result sets while keeping network traffic low.
We show that the best tradeoff is achieved by updating, throughout the network, the values of k both for the queries that result from splitting a query between nodes and for the set of queries subsuming a query. With this work, we contribute a framework for increasing the efficiency of continuous query processing over distributed data sources for a wide range of applications, such as environmental and living-space monitoring, network and traffic monitoring, and, more generally, all sensor-enhanced monitoring applications.
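
To make the notion of a query being subsumed by a set of existing queries more concrete, the following is a minimal illustrative sketch and not the algorithm developed in the thesis. It assumes that a query is a conjunction of range predicates over numeric sensor attributes, and it estimates set subsumption by Monte Carlo sampling: random events that match the new query are drawn, and the query is reported as probably subsumed if every sampled event is matched by at least one existing query. The names Query, matches and probably_subsumed, as well as the uniform sampling and the sample count, are assumptions made for illustration only.

```python
import random
from typing import Dict, List, Tuple

# Hypothetical query model: attribute name -> (low, high) range predicate.
Query = Dict[str, Tuple[float, float]]


def matches(query: Query, event: Dict[str, float]) -> bool:
    """Return True if the event satisfies every range predicate of the query."""
    return all(
        attr in event and low <= event[attr] <= high
        for attr, (low, high) in query.items()
    )


def probably_subsumed(
    new_query: Query,
    existing: List[Query],
    samples: int = 1000,
    seed: int = 0,
) -> bool:
    """Monte Carlo estimate of whether new_query is covered by the union of
    the existing queries. A False answer is exact (a counterexample event was
    found); a True answer is only probabilistic."""
    rng = random.Random(seed)
    for _ in range(samples):
        # Draw an event uniformly from the region selected by new_query.
        event = {
            attr: rng.uniform(low, high)
            for attr, (low, high) in new_query.items()
        }
        if not any(matches(q, event) for q in existing):
            return False  # this event would be missed, so forward the query
    return True  # no uncovered event found; the query can probably be filtered


existing_queries = [{"temp": (0.0, 30.0)}, {"temp": (25.0, 60.0)}]
covered = probably_subsumed({"temp": (10.0, 50.0)}, existing_queries)
not_covered = probably_subsumed({"temp": (10.0, 70.0)}, existing_queries)
```

In this example, neither existing query alone covers the range 10 to 50, but their union does, so a node that has already forwarded both would not need to forward the new query. The range 10 to 70 extends beyond the union, so that query would still have to be forwarded.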