Scalable Forecasting Techniques Applied to Big Electricity Time Series

This paper presents scalable methods for forecasting very long time series, such as those sampled at high frequency. Scalability is achieved with the Apache Spark framework for distributed computing, using its existing MLlib machine learning library. Since MLlib does not support multivariate regression, the forecasting problem is split into h forecasting subproblems, where h is the number of future values to predict. Representative forecasting methods of different kinds are then evaluated: tree-based models, two ensemble techniques (gradient-boosted trees and random forests), and linear regression as a reference method. Finally, the methodology is tested on a real-world dataset of Spanish electricity load data recorded at ten-minute intervals.
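
The split into h subproblems can be illustrated with a minimal single-machine sketch. It is not the paper's Spark/MLlib implementation: it uses plain Python, and the window length `w`, the helper names (`make_windows`, `fit_linear`, `fit_direct`, `predict_direct`) and the normal-equations solver are all assumptions made for illustration. Each of the h models is an ordinary single-output linear regression trained on the same w past values but a different future step.

```python
from typing import List, Sequence, Tuple

Vector = List[float]


def make_windows(series: Sequence[float], w: int, h: int) -> Tuple[List[Vector], List[Vector]]:
    """Slide over the series: w past values are features, the next h are targets."""
    X, Y = [], []
    for i in range(len(series) - w - h + 1):
        X.append(list(series[i:i + w]))
        Y.append(list(series[i + w:i + w + h]))
    return X, Y


def fit_linear(X: List[Vector], y: Vector) -> Vector:
    """Least-squares fit with intercept, solving the normal equations
    by Gauss-Jordan elimination (hypothetical helper, not MLlib)."""
    rows = [[1.0] + list(r) for r in X]
    n = len(rows[0])
    # Augmented matrix [A^T A | A^T y]
    M = [
        [sum(r[i] * r[j] for r in rows) for j in range(n)]
        + [sum(r[i] * t for r, t in zip(rows, y))]
        for i in range(n)
    ]
    for c in range(n):
        # Partial pivoting: bring the largest entry of column c to the diagonal
        p = max(range(c, n), key=lambda r: abs(M[r][c]))
        M[c], M[p] = M[p], M[c]
        for r in range(n):
            if r != c:
                f = M[r][c] / M[c][c]
                M[r] = [a - f * b for a, b in zip(M[r], M[c])]
    return [M[i][n] / M[i][i] for i in range(n)]


def fit_direct(series: Sequence[float], w: int, h: int) -> List[Vector]:
    """Train h independent single-output models, one per future step."""
    X, Y = make_windows(series, w, h)
    return [fit_linear(X, [row[j] for row in Y]) for j in range(h)]


def predict_direct(models: List[Vector], last_window: Sequence[float]) -> Vector:
    """Apply each of the h models to the same window of w past values."""
    feats = [1.0] + list(last_window)
    return [sum(c * x for c, x in zip(coef, feats)) for coef in models]
```

For example, `predict_direct(fit_direct(series, w=2, h=2), series[-2:])` with `series = [t * t for t in range(12)]` returns two values close to 144 and 169, because each future step of a quadratic series is an exact linear function of the two preceding values. Any single-output regressor could replace `fit_linear` without changing the surrounding structure.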
