Effective Prediction of Missing Data on Apache Spark over Multivariable Time Series

Larger volumes of data are being generated in many areas than ever before. In practice, however, some values in the collected data are almost always missing, which makes it harder to extract the full value of these large-scale data sets. For multivariable time series in particular, most existing methods are either infeasible or inefficient at predicting the missing data. In this paper, we take up the challenge of missing data prediction in multivariable time series by employing improved matrix factorization techniques. Our approaches are designed to exploit both the internal patterns of each time series and the information shared by time series across multiple sources. Based on this idea, we impose three different regularization terms to constrain the objective functions of matrix factorization and build five corresponding models. Extensive experiments on real-world data sets and a synthetic data set demonstrate that the proposed approaches effectively improve the performance of missing data prediction in multivariable time series. Furthermore, we demonstrate how to take advantage of the high processing power of Apache Spark to perform missing data prediction in large-scale multivariable time series.
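As a rough illustration of the general idea of matrix factorization for missing data prediction, the following sketch fits a low-rank model to the observed entries of a series-by-time matrix and uses the model to fill the gaps. It is not one of the five models of the paper: the abstract does not specify the three regularization terms, so a plain L2 penalty is assumed here, and the function names, rank, learning rate, and toy data are all hypothetical. The sketch also runs in NumPy on a single machine rather than on Apache Spark.

```python
import numpy as np

def factorize_with_missing(X, mask, rank=2, lam=0.1, lr=0.01, epochs=2000, seed=0):
    """Fit X ~ W @ H on observed entries only (mask == True) by gradient descent.

    The L2 penalty weighted by `lam` is an assumed stand-in; it is not one of
    the regularization terms proposed in the paper.
    """
    rng = np.random.default_rng(seed)
    n_series, n_steps = X.shape
    W = 0.1 * rng.standard_normal((n_series, rank))
    H = 0.1 * rng.standard_normal((rank, n_steps))
    X0 = np.where(mask, X, 0.0)          # replace missing values (e.g. NaN) with 0
    for _ in range(epochs):
        R = mask * (W @ H - X0)          # residual restricted to observed entries
        grad_W = R @ H.T + lam * W
        grad_H = W.T @ R + lam * H
        W -= lr * grad_W
        H -= lr * grad_H
    return W, H

def impute(X, mask, **kwargs):
    """Keep observed values and fill missing ones with the low-rank estimate."""
    W, H = factorize_with_missing(X, mask, **kwargs)
    return np.where(mask, X, W @ H)

# Toy data (hypothetical): six series that mix two shared periodic signals.
t = np.linspace(0.0, 4.0 * np.pi, 40)
S = np.vstack([np.sin(t), np.cos(t)])
A = np.array([[1.0, 0.5], [0.5, 1.0], [1.0, -1.0],
              [-0.5, 1.0], [2.0, 0.3], [0.2, -1.5]])
X_true = A @ S

# Hide roughly one entry in five using a fixed, staggered pattern.
rows, cols = np.indices(X_true.shape)
mask = (2 * rows + 3 * cols) % 5 != 0
X_obs = np.where(mask, X_true, np.nan)

X_hat = impute(X_obs, mask)
rmse_missing = np.sqrt(np.mean((X_hat[~mask] - X_true[~mask]) ** 2))
rmse_zero_fill = np.sqrt(np.mean(X_true[~mask] ** 2))
print("RMSE on missing entries:", rmse_missing)
print("RMSE of zero-filling baseline:", rmse_zero_fill)
```

Because every series in the toy data is a mixture of the same two signals, the observed entries of the other series carry information about each gap, which is the cross-series effect the abstract refers to. Regularizers that also encode the internal temporal pattern of each series would be added to the two gradient expressions.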
