Towards Prediction with Partial Data in Sensor-based Big Data Applications

Many emerging big data applications, such as those in smart electric grids, transportation, avionics, manufacturing, and remote medical and environmental monitoring, involve sensors for tracking, monitoring, and control. These sensors are generally geographically dispersed and are expected to periodically send acquired information back to centrally located nodes or processing centers. In many cases, sensor data is not available at the central nodes at the frequency required for fast, real-time modeling and decision-making. For example, while many of these sensors can collect information at high speed, logging data every minute or so, the physical limitations of the transmission networks, especially latency, limit how often sensor data can be transmitted back to the central nodes. Consumers may also limit frequent transmission of information from sensors located on their premises because of security and privacy concerns. Finally, data may fail to reach the central nodes because of faults in the sensors or transmission systems. All these scenarios raise the issue of data veracity in big data applications. While the volume, variety, and velocity aspects of big data have been the focus of much recent research, veracity has received less attention. In this paper, we propose a novel solution to the problem of making short-term predictions (up to a few hours ahead) in the absence of real-time data from sensors. A key implication of our work is that, by using real-time data from only a small subset of influential sensors, we are able to make predictions for all sensors. We thus reduce unnecessary transmissions from sensors and provide a practical solution to data veracity in many sensor-based big data applications. We use real-world electricity consumption data from smart meters to empirically demonstrate the usefulness of our method.

Keywords: data veracity, short-term prediction, prediction
