Combining Recurrent, Convolutional, and Continuous-time Models with Linear State-Space Layers

Recurrent neural networks (RNNs), temporal convolutions, and neural differential equations (NDEs) are popular families of deep learning models for time-series data, each with unique strengths and tradeoffs in modeling power and computational efficiency. We introduce a simple sequence model inspired by control systems that generalizes these approaches while addressing their shortcomings. The Linear State-Space Layer (LSSL) maps a sequence u ↦ y by simply simulating a linear continuous-time state-space representation ẋ = Ax + Bu, y = Cx + Du. Theoretically, we show that LSSL models are closely related to the three aforementioned families of models and inherit their strengths. For example, they generalize convolutions to continuous-time, explain common RNN heuristics, and share features of NDEs such as time-scale adaptation. We then incorporate and generalize recent theory on continuous-time memorization to introduce a trainable subset of structured matrices A that endow LSSLs with long-range memory. Empirically, stacking LSSL layers into a simple deep neural network obtains state-of-the-art results across time series benchmarks for long temporal dependencies in sequential image classification, real-world healthcare regression tasks, and speech. On a difficult speech classification task with length-16000 sequences, LSSL outperforms prior approaches by 24 accuracy points, and even outperforms baselines that use handcrafted features on 100x shorter sequences.
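To make the core idea concrete, the minimal sketch below simulates a discretized version of the state-space system ẋ = Ax + Bu, y = Cx + Du as a simple recurrence, using a bilinear (Tustin) discretization with a fixed step size. This is an illustrative toy, not the paper's implementation: the single input/output channel, the step size dt, and names such as lssl_forward are assumptions made here for clarity.

```python
import numpy as np

def discretize_bilinear(A, B, dt):
    """Bilinear (Tustin) discretization of x' = Ax + Bu.
    Returns (A_bar, B_bar) so that x_k = A_bar x_{k-1} + B_bar u_k."""
    n = A.shape[0]
    I = np.eye(n)
    inv = np.linalg.inv(I - (dt / 2) * A)
    A_bar = inv @ (I + (dt / 2) * A)
    B_bar = inv @ (dt * B)
    return A_bar, B_bar

def lssl_forward(A, B, C, D, u, dt=1.0):
    """Map an input sequence u (length L, scalar channel) to an output
    sequence y by unrolling the discretized state-space recurrence."""
    A_bar, B_bar = discretize_bilinear(A, B, dt)
    x = np.zeros(A.shape[0])
    y = np.empty(len(u))
    for k, u_k in enumerate(u):
        x = A_bar @ x + B_bar.flatten() * u_k   # state update
        y[k] = C @ x + D * u_k                  # readout
    return y

# Toy usage with a 4-dimensional state and random system matrices.
rng = np.random.default_rng(0)
A = -np.eye(4) + 0.1 * rng.standard_normal((4, 4))
B = rng.standard_normal((4, 1))
C = rng.standard_normal(4)
D = 0.0
u = np.sin(np.linspace(0, 10, 100))
y = lssl_forward(A, B, C, D, u, dt=0.1)
```

The same input-output map can equivalently be written as a discrete convolution of u with the kernel K_k = C Ā^k B̄, which is the link between the recurrent and convolutional views of the layer.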
