Nonlinear gated experts for time series: discovering regimes and avoiding overfitting

In the analysis and prediction of real-world systems, two of the key problems are nonstationarity (often in the form of switching between regimes) and overfitting (particularly serious for noisy processes). This article addresses these problems using gated experts, consisting of a (nonlinear) gating network and several (also nonlinear) competing experts. Each expert learns to predict the conditional mean, and each expert adapts its width to match the noise level in its regime. The gating network learns to predict the probability of each expert, given the input. This article focuses on the case where the gating network bases its decision on information from the inputs. This can be contrasted with hidden Markov models, where the decision is based on the previous state(s) (i.e., on the output of the gating network at the previous time step), as well as with averaging over several predictors. In contrast, gated experts soft-partition the input space, each expert learning to model only its own region. This article discusses the underlying statistical assumptions, derives the weight update rules, and compares the performance of gated experts to standard methods on three time series: (1) a computer-generated series, obtained by randomly switching between two nonlinear processes; (2) a time series from the Santa Fe Time Series Competition (the light intensity of a laser in a chaotic state); and (3) the daily electricity demand of France, a real-world multivariate problem with structure on several time scales. The main results are: (1) the gating network correctly discovers the different regimes of the process; (2) the widths associated with each expert are important for the segmentation task (and they can be used to characterize the sub-processes); and (3) there is less overfitting compared to single networks (homogeneous multilayer perceptrons), since the experts learn to match their variances to the (local) noise levels. This can be viewed as matching the local complexity of the model to the local complexity of the data.
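To make the probabilistic setup concrete, the following minimal sketch (Python/NumPy; the function and variable names are illustrative, not taken from the article) shows the mixture density assumed by gated experts and the posterior "responsibilities" that drive EM-style weight updates: each expert is a Gaussian predictor with its own conditional mean and adaptive width, and the gating network supplies the input-dependent mixing probabilities.

```python
import numpy as np

def gaussian(y, mean, sigma):
    # 1-D Gaussian density with an expert-specific width (noise level) sigma
    return np.exp(-0.5 * ((y - mean) / sigma) ** 2) / (np.sqrt(2.0 * np.pi) * sigma)

def posterior_responsibilities(gate_probs, expert_means, expert_sigmas, y):
    """Posterior probability that each expert generated target y (E-step-like quantity).

    gate_probs    : (K,) gating-network outputs g_k(x), summing to 1
    expert_means  : (K,) conditional means predicted by the experts for input x
    expert_sigmas : (K,) adaptive widths of the experts
    """
    joint = gate_probs * gaussian(y, expert_means, expert_sigmas)
    return joint / joint.sum()

# Toy usage: two experts with equal prior gate probabilities; the second expert's
# mean is closer to the target and its width matches a noisier regime, so it
# receives almost all of the credit (and hence most of the gradient signal).
g = np.array([0.5, 0.5])
mu = np.array([0.0, 1.0])
sigma = np.array([0.1, 0.3])
print(posterior_responsibilities(g, mu, sigma, y=0.9))
```

Under these (assumed Gaussian) densities, the same responsibilities that segment the series also weight each expert's parameter updates, which is why the adaptive widths matter both for regime discovery and for limiting overfitting.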
