A novel Spark-based multi-step forecasting algorithm for big data time series

Abstract: This paper presents several scalable methods for forecasting big data time series, that is, time series measured at high frequency. The methods are designed to handle arbitrary prediction horizons. To make them scalable, the Apache Spark framework is used for distributed computing, and the prediction methods are built on Spark's MLlib machine learning library. Since the library does not support multivariate regression, the forecasting problem is formulated as h prediction sub-problems, where h is the prediction horizon, that is, the number of future values to predict. Several representative methods have been chosen: decision trees, two tree-based ensemble techniques (gradient-boosted trees and random forests), and linear regression as a reference method for comparison. Finally, the methodology has been tested on a real time series of electricity demand in Spain, with measurements taken at ten-minute intervals.
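
The decomposition into h sub-problems described in the abstract can be illustrated with a minimal sketch in plain Python. It is not the paper's Spark implementation: the window length `w`, the helper names `build_subproblems`, `fit_multistep` and `forecast`, and the toy `MeanOffsetRegressor` are all assumptions introduced here for illustration. In practice, each sub-problem would be fitted with a single-output regressor such as a decision tree, a tree ensemble or a linear model.

```python
from typing import List, Sequence, Tuple

# One sub-problem: feature rows (windows of past values) and one target per row.
Dataset = Tuple[List[List[float]], List[float]]


def build_subproblems(series: Sequence[float], w: int, h: int) -> List[Dataset]:
    """Return h datasets; dataset j-1 maps a window of w past values
    to the value observed j steps after the end of that window."""
    n_windows = len(series) - w - h + 1
    if n_windows <= 0:
        raise ValueError("series too short for the given w and h")
    datasets: List[Dataset] = []
    for j in range(1, h + 1):
        X = [list(series[i:i + w]) for i in range(n_windows)]
        y = [series[i + w + j - 1] for i in range(n_windows)]
        datasets.append((X, y))
    return datasets


class MeanOffsetRegressor:
    """Hypothetical stand-in for a real single-output regressor.

    It predicts the last value of the window plus the mean offset
    between target and last window value seen during training.
    """

    def fit(self, X: List[List[float]], y: List[float]) -> "MeanOffsetRegressor":
        self.offset = sum(t - row[-1] for row, t in zip(X, y)) / len(y)
        return self

    def predict(self, row: Sequence[float]) -> float:
        return row[-1] + self.offset


def fit_multistep(series: Sequence[float], w: int, h: int,
                  make_model=MeanOffsetRegressor) -> list:
    """Fit one independent model per step of the prediction horizon."""
    return [make_model().fit(X, y) for X, y in build_subproblems(series, w, h)]


def forecast(models: list, last_window: Sequence[float]) -> List[float]:
    """Predict the next h values, one per fitted model."""
    return [m.predict(last_window) for m in models]
```

For example, `fit_multistep(series, w=3, h=2)` trains two models on the same windows but with different targets, and `forecast(models, series[-3:])` returns the two predicted future values. Because the h models are independent of one another, they can be trained separately, which is what allows single-output learners to be reused for a multi-step forecast.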
