Efficient Learning from Massive Spatial-Temporal Data Through Selective Support Vector Propagation

The proposed approach addresses learning from large spatial-temporal data streams by sequentially training support vector machines (SVMs) on a series of smaller spatial data subsets collected over shorter periods. A set of representatives is selected from the support vectors of an SVM trained on data of limited spatial-temporal coverage. These representatives are merged with newly arrived data, which also correspond to a limited space-time segment, and a new SVM is learned from both sources. Relying on selected representatives instead of propagating all support vectors to the next iteration allows efficient learning of semi-global SVMs in a non-stationary series of correlated spatial datasets. The proposed method is evaluated on a challenging geoinformatics problem: aerosol retrieval from the Multi-angle Imaging SpectroRadiometer (MISR) instrument aboard the Terra satellite. Regional features were discovered that allowed spatial partitioning of the continental US into several semi-global regions. The resulting semi-global SVM models were reused for efficient estimation of aerosol optical depth from radiances, with a high level of accuracy on data cycles spanning several months. The results provide evidence that SVMs trained as proposed have an extended spatial and temporal range of applicability compared to SVM models trained on samples collected over shorter periods. In addition, the computational cost of training a semi-global SVM with selective support vector propagation (SSVP) was much lower than that of training a global model using spatial observations from the entire period.
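
The propagation loop described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: it uses scikit-learn's `SVR` as the learner, synthetic data in place of MISR radiances, and a hypothetical selection rule (keep the `k` support vectors with the largest absolute dual coefficients), since the abstract does not specify how representatives are chosen. The names `select_representatives`, `train_ssvp`, and `make_segments` are invented for this example.

```python
import numpy as np
from sklearn.svm import SVR


def select_representatives(model, X, y, k):
    """Keep at most k support vectors, ranked by |dual coefficient|.

    Assumption: this ranking rule is illustrative only; the abstract
    does not state how representatives are selected.
    """
    sv_idx = model.support_                     # indices of support vectors in X
    weight = np.abs(model.dual_coef_).ravel()   # one coefficient per support vector
    top = sv_idx[np.argsort(weight)[::-1][:k]]
    return X[top], y[top]


def train_ssvp(segments, k=20, **svr_params):
    """Train one SVR per segment, carrying at most k representatives forward."""
    carried_X = carried_y = None
    model = None
    train_sizes = []
    for X_new, y_new in segments:
        if carried_X is None:
            X_train, y_train = X_new, y_new
        else:
            # Merge the propagated representatives with the newly arrived segment.
            X_train = np.vstack([carried_X, X_new])
            y_train = np.concatenate([carried_y, y_new])
        model = SVR(**svr_params).fit(X_train, y_train)
        train_sizes.append(len(y_train))
        carried_X, carried_y = select_representatives(model, X_train, y_train, k)
    return model, train_sizes


def make_segments(n_segments=4, n_per_segment=200, seed=0):
    """Synthetic stand-in for a slowly drifting series of data segments."""
    rng = np.random.default_rng(seed)
    segments = []
    for t in range(n_segments):
        X = rng.uniform(-3, 3, size=(n_per_segment, 2))
        noise = rng.normal(0, 0.05, n_per_segment)
        y = np.sin(X[:, 0] + 0.1 * t) + 0.5 * X[:, 1] + noise
        segments.append((X, y))
    return segments


segments = make_segments()
model, train_sizes = train_ssvp(segments, k=20, kernel="rbf", C=10.0, epsilon=0.05)
print(train_sizes)  # size of each training set: new segment plus carried representatives
```

Because each training set holds one segment plus at most `k` carried points, the cost of every fit stays bounded by the segment size rather than growing with the length of the stream, which is the efficiency argument made in the abstract.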
