Dominant Data Set Selection Algorithms for Electricity Consumption Time-Series Data Analysis Based on Affine Transformation

With the explosive growth of time-series data (TSD), the volume of TSD already exceeds the scale and capability of many Internet of Things (IoT)-based applications. Moreover, redundancy persists in TSD because of the correlation between information acquired from different sources. In this article, we propose a cohort of dominant data set selection algorithms for electricity consumption TSD. The focus is on identifying the dominant data set: a small data set that can nevertheless represent the kernel information carried by the TSD with an arbitrarily small error rate of less than $\varepsilon$. Furthermore, we prove that selecting the minimum dominant data set is an NP-complete problem. The affine transformation model is introduced to define the linear correlation relationship between TSD objects. Our proposed framework consists of a scanning selection algorithm with $O(n^{3})$ time complexity and a greedy selection algorithm with $O(n^{4})$ time complexity, each of which selects the dominant data set based on the linear correlation distance between TSD objects. The proposed algorithms are evaluated on real electricity consumption data from the city of Harbin, China. The experimental results show that the proposed algorithms not only reduce the size of the extracted kernel data set but also preserve TSD integrity in terms of accuracy and efficiency.
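The abstract does not spell out the selection procedure, so the following is only a minimal sketch of the general idea, not the authors' algorithm. It assumes that the linear correlation distance between two series can be approximated by the relative residual of a least-squares affine fit y ≈ a·x + b, and that greedy selection works like greedy set cover: repeatedly pick the series that represents the most not-yet-represented series within ε. The names `affine_fit_error` and `greedy_dominant_set` are hypothetical, and no claim is made that this sketch matches the stated time complexities.

```python
import math


def affine_fit_error(x, y):
    """Relative residual of the least-squares affine fit y ~ a*x + b.

    Assumption: this stands in for the paper's linear correlation
    distance, which the abstract does not define.
    """
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    var_x = sum((xi - mean_x) ** 2 for xi in x)
    cov_xy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    # A constant x cannot explain any variation in y, so fall back to a = 0.
    a = cov_xy / var_x if var_x > 0 else 0.0
    b = mean_y - a * mean_x
    residual = math.sqrt(sum((yi - (a * xi + b)) ** 2 for xi, yi in zip(x, y)))
    scale = math.sqrt(sum(yi ** 2 for yi in y))
    return residual / scale if scale > 0 else residual


def greedy_dominant_set(series, eps):
    """Return indices of a (not necessarily minimum) dominant set.

    Assumption: a set-cover style greedy rule; every series is either
    chosen or reconstructible from a chosen one with error <= eps.
    """
    n = len(series)
    # covers[i] = indices of series that series i can represent within eps.
    covers = [
        {j for j in range(n)
         if i == j or affine_fit_error(series[i], series[j]) <= eps}
        for i in range(n)
    ]
    uncovered = set(range(n))
    chosen = []
    while uncovered:
        # Pick the series covering the most still-uncovered series.
        best = max(range(n), key=lambda i: len(covers[i] & uncovered))
        chosen.append(best)
        uncovered -= covers[best]
    return chosen


# Toy data: the second series is exactly 2*x + 1 of the first,
# while the third is not close to an affine image of either.
toy = [[1, 2, 3, 4], [3, 5, 7, 9], [4, 1, 3, 2]]
print(greedy_dominant_set(toy, eps=0.05))  # [0, 2]
```

In the toy run, the first series represents the second exactly, so only the first and third series need to be kept. Since the minimum dominant set problem is stated to be NP-complete, a greedy rule like this can only be expected to return a small set, not necessarily the smallest one.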
