CorClustST - Correlation-based clustering of big spatio-temporal datasets

Abstract: Increasing amounts of high-velocity spatio-temporal data reinforce the need for clustering algorithms that are effective for big data processing and data reduction. Because the spatio-temporal clustering algorithms currently in use produce results that are difficult to compare, we propose an alternative spatio-temporal clustering technique based on empirical spatial correlations over time. As a key feature, CorClustST makes it easy to compare and interpret clustering results for different scenarios, such as multiple underlying variables or varying time frames. In a test case, we show that the clustering strategy successfully identifies increasing spatial correlations of wind power forecast errors in Europe for longer forecast horizons. Finally, we present an extension of the clustering algorithm that allows for a large-scale parallel implementation and helps to circumvent memory limitations. The proposed method will be especially helpful for researchers who aim to preprocess big spatio-temporal datasets and who intend to compare clustering results and spatial dependencies for different scenarios.
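To make the idea of clustering on empirical spatial correlations more concrete, the following is a minimal illustrative sketch and not the CorClustST algorithm itself, whose details are not given in this abstract. It assumes that two locations count as neighbours when they lie within a distance `eps` of each other and their time series have a Pearson correlation of at least `rho`, and that clusters are formed greedily around the location with the most remaining neighbours. The function names, both thresholds, and the toy data are hypothetical.

```python
import math


def pearson(x, y):
    """Pearson correlation of two equally long sequences (0.0 if one is constant)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    if vx == 0 or vy == 0:
        return 0.0
    return cov / math.sqrt(vx * vy)


def correlation_clusters(coords, series, eps, rho):
    """Greedy grouping of locations (an assumed scheme, for illustration only).

    coords: list of (x, y) positions, one per location
    series: list of time series, one per location
    eps:    assumed maximum distance between neighbours
    rho:    assumed minimum correlation between neighbours
    """
    n = len(coords)
    neighbours = {i: set() for i in range(n)}
    for i in range(n):
        for j in range(i + 1, n):
            close = math.dist(coords[i], coords[j]) <= eps
            if close and pearson(series[i], series[j]) >= rho:
                neighbours[i].add(j)
                neighbours[j].add(i)

    labels = [None] * n
    unassigned = set(range(n))
    cluster_id = 0
    while unassigned:
        # Pick the unassigned location with the most unassigned neighbours;
        # ties are broken by the lowest index.
        centre = max(sorted(unassigned),
                     key=lambda i: len(neighbours[i] & unassigned))
        members = {centre} | (neighbours[centre] & unassigned)
        for m in members:
            labels[m] = cluster_id
        unassigned -= members
        cluster_id += 1
    return labels


# Toy data: three nearby locations with similar series, two distant ones.
coords = [(0, 0), (1, 0), (0, 1), (10, 10), (11, 10)]
series = [
    [1, 2, 3, 4, 5],
    [2, 4, 6, 8, 11],
    [1, 3, 2, 5, 6],
    [5, 1, 4, 2, 3],
    [6, 1, 5, 2, 2],
]
labels = correlation_clusters(coords, series, eps=2.0, rho=0.8)
print(labels)
```

Because the thresholds are expressed as a distance and a correlation rather than in data-specific units, the same call can be repeated for another variable or time frame and the resulting groupings compared directly, which is the kind of comparability the abstract emphasises.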
