A comparison between Kriging and radial basis function networks for nonlinear prediction

Predictions by Kriging and radial basis function (RBF) networks with Gaussian kernels are compared. Kriging is a semi-parametric approach that does not rely on any specific model structure, which makes it much more flexible than approaches based on parametric behavioural models. On the other hand, accurate predictions are obtained for short training sequences, which is not the case for nonparametric prediction methods based on neural networks. Examples are presented to illustrate the effectiveness of the method.

1. INTRODUCTION

We consider the situation where the relationship between the input and output sequences $\{x_k\}$ and $\{y_k\}$ of a SISO system $\mathcal{S}$ is dominated by nonlinear characteristics. When using a parametric nonlinear model, one first has to choose a suitable model structure [1]. Second, once a structure has been chosen, one has to estimate its parameters. Traditional parametric representations for nonlinear unknown structures are the Volterra, Wiener or NARMAX (Nonlinear AutoRegressive Moving Average model with eXogenous inputs) models. They generally involve a very large number of unknown parameters [2, 3], so that one has to collect a large amount of data (training sequence) to be able to estimate these parameters.

Linear prediction by Kriging can be considered as a general statistical tool for modelling spatial observations, with or without observation errors [4, 5, 6]. Kriging is based on a semi-parametric model which allows much more flexibility than parametric models, since no specific model structure is used: the model contains a linear regression part (parametric) and a non-parametric part considered as the realization of a random process. Assuming that the process is Gaussian, the parameters of its covariance matrix can be estimated by maximum likelihood. It happens that the choice of the linear regression has little influence on the predictive properties of the model, see, e.g., [7]. The memory length $m_x$ of the input thus corresponds to the only important prior choice concerning the unknown structure, and a prior over-estimation of $m_x$ only results in heavier computations.

An alternative approach to parametric, or semi-parametric, models is to use a nonparametric model. Nonparametric approaches based on neural networks, for instance Radial Basis Function (RBF) networks, are becoming more and more popular. A reason is that, in principle, neural networks can approximate any continuous behaviour with arbitrary precision [8]. However, besides the problem of having to choose the structure of the network, there remains the major issue of choosing a suitable (and long enough) training sequence [9]. If the linear part of the Kriging model is reduced to a constant term, Kriging corresponds to an RBF network with Gaussian kernels, with the centers of the kernels corresponding to the inputs in the training data set, see Section 3.2.

2. PREDICTION BY KRIGING

Let $\{x_k\}$ and $\{y_k\}$ be the input and output sequences of a system $\mathcal{S}$, which are observed for $k = 1, \ldots, n$. Prediction by Kriging consists in interpolating these data by constructing the best linear unbiased predictor at new unsampled values of $\mathbf{x}$, that is $\mathbf{x}_{n+1}, \mathbf{x}_{n+2}, \ldots$

2.1. Prediction

When there are no observation errors, the observations $y_k$ are modelled by
$$y_k = f^T(\mathbf{x}_k)\beta + z(\mathbf{x}_k)\,,$$
where $\mathbf{x}_k$ denotes the vector formed by lagged scalar inputs
$$\mathbf{x}_k = (x_k, x_{k-1}, \ldots, x_{k-m_x+1})^T\,,$$
$f^T(\mathbf{x}_k)\beta$ is the linear regressor with $\beta \in \mathbb{R}^p$ the vector of unknown parameters, and $z(\mathbf{x}_k)$ is a realization of a stochastic process.
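As an illustration of this model setup, the short sketch below builds the lagged input vectors $\mathbf{x}_k$ from a scalar input sequence for a chosen memory length $m_x$; the function name and the value $m_x = 3$ are hypothetical, introduced only for this example.

```python
import numpy as np

def lagged_inputs(x, m_x):
    """Stack the lagged vectors x_k = (x_k, x_{k-1}, ..., x_{k-m_x+1})^T
    for every time index k with a full history available."""
    x = np.asarray(x, dtype=float)
    # Row i contains (x_k, x_{k-1}, ..., x_{k-m_x+1}) for k = i + m_x - 1
    return np.column_stack([x[m_x - 1 - j:len(x) - j] for j in range(m_x)])

# Example: scalar input sequence and memory length m_x = 3 (hypothetical values)
x_seq = np.array([0.1, 0.5, -0.2, 0.8, 0.3, -0.7])
X = lagged_inputs(x_seq, m_x=3)
print(X.shape)  # (4, 3): one lagged vector per usable time index
```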
When $f(\mathbf{x}) \equiv 1$, which is often suitable, the method is usually called simple Kriging [10]. We shall see in Section 4 that the choice of the regressor is not crucial. The process $z(\cdot)$ is assumed to have zero mean and covariance defined by
$$\mathrm{E}\{z(\mathbf{x})z(\mathbf{x}')\} = W(\mathbf{x}, \mathbf{x}')\,.$$
We assume spatial stationarity, that is
$$W(\mathbf{x}, \mathbf{x}') = V(\mathbf{x} - \mathbf{x}') = \sigma_z^2\, R(\mathbf{x} - \mathbf{x}')\,,$$
with $R(\mathbf{x}) = R(-\mathbf{x})$. A typical choice is
$$R(\mathbf{x} - \mathbf{x}') = \exp\left(-\sum_{i=1}^{m_x} \theta_i\, |x_i - x'_i|^{p_i}\right)\,. \quad (1)$$
The function $R(\cdot)$ is continuous at 0, which corresponds to a process continuous in the mean-square sense. The choice of the functional form of the covariance is important, since it influences the predictive ability of the method. The form (1) allows enough flexibility through the parameters $\theta_i$ and $p_i$, which correspond respectively to a correlation and a smoothness parameter (see [11]). Other covariance functions are considered, e.g., in [12].

Let $\mathbf{y}_n$ denote the vector of observations in the training sample, $\mathbf{y}_n = (y_1, \ldots, y_n)^T$, and define $F_n$ as
$$F_n = \begin{pmatrix} f^T(\mathbf{x}_1) \\ \vdots \\ f^T(\mathbf{x}_n) \end{pmatrix}\,.$$
We predict $y(\mathbf{x})$ at a given value of $\mathbf{x}$ by $\hat{y}(\mathbf{x}) = \mathbf{c}^T(\mathbf{x})\,\mathbf{y}_n$. One can show [12] that by minimizing the mean-square error of this linear predictor under the unbiasedness condition
$$f^T(\mathbf{x}) = \mathbf{c}^T(\mathbf{x})\, F_n\,,$$
one gets
$$\hat{y}(\mathbf{x}) = f^T(\mathbf{x})\hat{\beta} + \mathbf{r}^T(\mathbf{x})\, V_n^{-1} (\mathbf{y}_n - F_n\hat{\beta})\,, \quad (2)$$
where $V_n = \sigma_z^2 R_n$ is the covariance matrix for $\mathbf{z}_n = (z(\mathbf{x}_1), \ldots, z(\mathbf{x}_n))^T$, with
$$[R_n]_{ij} = R(\mathbf{x}_i - \mathbf{x}_j)\,, \quad (3)$$
$\mathbf{r}(\mathbf{x}) = \mathrm{E}\{z(\mathbf{x})\,\mathbf{z}_n\}$, that is $[\mathbf{r}(\mathbf{x})]_i = \sigma_z^2 R(\mathbf{x} - \mathbf{x}_i)$, and where
$$\hat{\beta} = (F_n^T R_n^{-1} F_n)^{-1} F_n^T R_n^{-1} \mathbf{y}_n \quad (4)$$
is the least-squares estimator of $\beta$. This predictor is a perfect interpolator: $\hat{y}(\mathbf{x}_k) = y_k$, $k = 1, \ldots, n$. The mean-square error for the prediction is
$$\rho^2(\mathbf{x}) = \sigma_z^2 - [f^T(\mathbf{x})\ \ \mathbf{r}^T(\mathbf{x})] \begin{pmatrix} \mathbf{0} & F_n^T \\ F_n & V_n \end{pmatrix}^{-1} \begin{pmatrix} f(\mathbf{x}) \\ \mathbf{r}(\mathbf{x}) \end{pmatrix}\,.$$
It satisfies $\rho^2(\mathbf{x}_k) = 0$, $k = 1, \ldots, n$. Assuming a normal distribution for the process $z(\mathbf{x})$, confidence intervals can be constructed for the prediction. For instance, a 95% confidence interval is given by
$$\mathrm{Prob}\{y(\mathbf{x}) \in [\hat{y}(\mathbf{x}) - 1.96\,\rho(\mathbf{x}),\ \hat{y}(\mathbf{x}) + 1.96\,\rho(\mathbf{x})]\} \simeq 0.95\,.$$

When observation errors are present, the observations are modelled as
$$y_k = f^T(\mathbf{x}_k)\beta + z(\mathbf{x}_k) + \varepsilon_k\,,$$
with $\{\varepsilon_k\}$ an i.i.d. sequence of errors with zero mean and variance $\sigma_\varepsilon^2$, and $z(\cdot)$ a stochastic process independent of $\{\varepsilon_k\}$. Define $V_n = \sigma_\varepsilon^2 I_n + \sigma_z^2 R_n$, with $I_n$ the $n$-dimensional identity matrix and $R_n$ given by (3). The prediction at $\mathbf{x}$ is then still given by (2). When $\sigma_\varepsilon^2 \neq 0$, this predictor is not a perfect interpolator. We assume in the rest of this paper that observation errors are negligible.

2.2. Estimation

The prediction $\hat{y}(\mathbf{x})$ depends on the parameters $\theta_i$ and $p_i$ in the covariance function (1). The case $p_i = 1$, $i = 1, \ldots, m_x$, corresponds to the product of Ornstein-Uhlenbeck processes, which are continuous but not differentiable everywhere. When $p_i = 2$, $i = 1, \ldots, m_x$, the process has infinitely differentiable paths (in the mean-square sense). A classical assumption is $p_i \in [1, 2]$, $i = 1, \ldots, m_x$. Assuming that the stochastic process $z(\cdot)$ is Gaussian, one can estimate the $\theta_i$'s and $p_i$'s by maximum likelihood, together with $\beta$ and $\sigma_z^2$. Elementary calculations give
$$\{\hat{\theta}, \hat{p}\} = \arg\min_{\theta \in \mathbb{R}_+^{m_x},\ p \in [1,2]^{m_x}} \left[\, n \ln(\hat{\sigma}_z^2) + \ln\det(R_n) \,\right]\,, \quad (5)$$
where
$$\hat{\sigma}_z^2 = \frac{1}{n}\, (\mathbf{y}_n - F_n\hat{\beta})^T R_n^{-1} (\mathbf{y}_n - F_n\hat{\beta})$$
and $\hat{\beta}$ given by (4) respectively correspond to the maximum-likelihood estimators of $\sigma_z^2$ and $\beta$. Numerical optimization methods are required for the solution of (5). The problem is sometimes difficult (see, e.g., [13]), but numerical simulations show that a precise determination of the estimates is not necessary to get an accurate prediction. In particular, local optima are generally acceptable.
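To make these computations concrete, the sketch below implements the correlation (1), the estimator (4), the predictor (2) and the criterion minimized in (5) for the constant-regressor (simple Kriging) case with the smoothness parameters frozen at $p_i = 2$. It is a minimal illustration under these assumptions, using plain matrix inversion and a small jitter term for numerical conditioning; it makes no claim of matching the authors' numerical implementation, and the optimization over $\theta$ is not shown.

```python
import numpy as np

def corr_matrix(X1, X2, theta, p=2.0):
    """Correlation (1): R(x - x') = exp(-sum_i theta_i * |x_i - x'_i|^p)."""
    d = np.abs(X1[:, None, :] - X2[None, :, :]) ** p      # pairwise |x_i - x'_i|^p
    return np.exp(-np.tensordot(d, theta, axes=([2], [0])))

def kriging_fit(X, y, theta, p=2.0):
    """Return beta_hat (4), the ML estimate of sigma_z^2, and the criterion of (5)."""
    n = len(y)
    F = np.ones((n, 1))                                    # f(x) = 1: simple Kriging
    R = corr_matrix(X, X, theta, p) + 1e-10 * np.eye(n)    # jitter for conditioning
    Ri = np.linalg.inv(R)
    beta = np.linalg.solve(F.T @ Ri @ F, F.T @ Ri @ y)     # eq. (4)
    resid = y - F @ beta
    sigma2 = (resid @ Ri @ resid) / n                      # ML estimate of sigma_z^2
    crit = n * np.log(sigma2) + np.linalg.slogdet(R)[1]    # criterion minimized in (5)
    return beta, sigma2, crit

def kriging_predict(Xnew, X, y, theta, p=2.0):
    """BLUP (2) at the rows of Xnew, for the constant-regressor case."""
    n = len(y)
    F = np.ones((n, 1))
    R = corr_matrix(X, X, theta, p) + 1e-10 * np.eye(n)
    Ri = np.linalg.inv(R)
    beta, _, _ = kriging_fit(X, y, theta, p)
    r = corr_matrix(Xnew, X, theta, p)                     # correlations r(x) / sigma_z^2
    return (F[0] @ beta) + r @ Ri @ (y - F @ beta)         # eq. (2); sigma_z^2 cancels

# Toy usage with hypothetical data and a fixed theta (no ML search over theta here)
rng = np.random.default_rng(0)
X = rng.uniform(-1, 1, size=(20, 2))
y = np.sin(3 * X[:, 0]) + 0.5 * X[:, 1] ** 2
theta = np.array([1.0, 1.0])
print(kriging_predict(X[:3], X, y, theta))                 # reproduces y[:3] (interpolation)
```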
It is recommended in practice to impose constraints on $\theta$, such as $\theta_i > 0$, to preserve the positive-definite character of $R_n$ during the optimization. Freezing the $p_i$'s at 2 is often acceptable.

3. RADIAL BASIS FUNCTION NETWORK

3.1. Structure of RBF network

A radial basis function network consists of an input layer of source nodes, a single hidden layer of nonlinear processing units, and an output layer of linear weights, as depicted in Fig. 1.
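As a sketch of this structure, the code below evaluates the output of a single-output RBF network with Gaussian kernels: the hidden layer computes Gaussian activations centered at fixed points, and the output layer combines them linearly. The centers, widths and weights used here are hypothetical placeholders introduced only for illustration; how they are chosen is discussed later in the paper.

```python
import numpy as np

def rbf_output(x, centers, widths, weights, bias=0.0):
    """Single-output RBF network: y(x) = bias + sum_j w_j exp(-||x - c_j||^2 / (2 s_j^2))."""
    x = np.atleast_2d(x)                                            # (n_points, dim)
    d2 = ((x[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)   # squared distances to centers
    phi = np.exp(-d2 / (2.0 * widths ** 2))                         # hidden-layer Gaussian activations
    return bias + phi @ weights                                     # linear output layer

# Hypothetical example: 3 hidden units in a 2-dimensional input space
centers = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]])
widths = np.array([0.5, 0.5, 0.5])
weights = np.array([1.0, -0.5, 0.3])
print(rbf_output(np.array([[0.2, 0.1]]), centers, widths, weights))
```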