Distributed Sampling Storage for Statistical Analysis of Massive Sensor Data

Cyber-physical systems connect the cyber world to the physical world through massively networked sensors that monitor physical phenomena. Many services are expected to use this sensor data, which reflects the state of the physical world, by means of information technology. To meet this expectation, a storage system must provide timely access to massive data while keeping storage costs low. We propose a data storage scheme for storing and querying massive sensor data. The scheme is scalable because it adopts a distributed architecture, is fault-tolerant without costly data replication, and lets users efficiently select multi-scale random data samples for statistical analysis. We implemented a prototype system based on the scheme and evaluated its sampling performance. The results show that the prototype exhibits lower latency than a conventional distributed storage system.
