Divide and Recombine Approaches for Fitting Smoothing Spline Models with Large Datasets

ABSTRACT Spline smoothing is a widely used nonparametric method that allows data to speak for themselves. Because of its complexity and flexibility, fitting smoothing spline models is usually computationally intensive, which may become prohibitive with large datasets. To overcome memory and CPU limitations, we propose four divide and recombine (D&R) approaches for fitting cubic splines with large datasets. We consider two approaches to dividing the data: random and sequential. For each division approach, we consider two approaches to recombining. These D&R approaches are implemented in parallel without communication. Extensive simulations show that these D&R approaches are scalable and perform comparably to the method that uses the whole data. The sequential D&R approaches are spatially adaptive, which leads to better performance than the whole-data method when the underlying function is spatially inhomogeneous.
