Model-Attentive Ensemble Learning for Sequence Modeling

Medical time-series datasets have unique characteristics that make prediction tasks challenging. Most notably, patient trajectories often contain longitudinal variations in their input-output relationships, generally referred to as temporal conditional shift. Designing sequence models capable of adapting to such time-varying distributions remains an open problem. To address this, we present Model-Attentive Ensemble learning for Sequence modeling (MAES). MAES is a mixture of time-series experts that leverages an attention-based gating mechanism to specialize the experts on different sequence dynamics and adaptively weight their predictions. We demonstrate that MAES significantly outperforms popular sequence models on datasets subject to temporal shift.
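
To make the gating idea concrete, below is a minimal sketch of an attention-gated mixture of recurrent experts. This is an illustrative assumption-laden sketch, not the authors' implementation: the choice of GRU experts, the scoring function, and all names and dimensions are placeholders.

```python
# Illustrative sketch (assumptions, not the MAES reference code): a mixture of
# recurrent experts whose per-step predictions are combined with an
# attention-based gate computed from the input sequence.
import torch
import torch.nn as nn


class AttentiveMixtureOfExperts(nn.Module):
    def __init__(self, input_dim, hidden_dim, output_dim, num_experts):
        super().__init__()
        # Each expert is an independent sequence model (here, a GRU).
        self.experts = nn.ModuleList(
            [nn.GRU(input_dim, hidden_dim, batch_first=True) for _ in range(num_experts)]
        )
        self.heads = nn.ModuleList(
            [nn.Linear(hidden_dim, output_dim) for _ in range(num_experts)]
        )
        # Gating network: a query encoder plus a scoring layer that attends
        # over expert hidden states at every time step.
        self.query = nn.GRU(input_dim, hidden_dim, batch_first=True)
        self.score = nn.Linear(2 * hidden_dim, 1)

    def forward(self, x):
        # x: (batch, time, input_dim)
        q, _ = self.query(x)                                   # (batch, time, hidden)
        expert_outs, scores = [], []
        for gru, head in zip(self.experts, self.heads):
            h, _ = gru(x)                                      # (batch, time, hidden)
            expert_outs.append(head(h))                        # (batch, time, output)
            scores.append(self.score(torch.cat([q, h], -1)))   # (batch, time, 1)
        # Softmax over experts gives time-varying mixture weights.
        weights = torch.softmax(torch.stack(scores, dim=-1), dim=-1)   # (batch, time, 1, E)
        preds = torch.stack(expert_outs, dim=-1)                       # (batch, time, output, E)
        # Per-time-step weighted combination of expert predictions.
        return (preds * weights).sum(dim=-1)                           # (batch, time, output)
```

Training all experts and the gate jointly lets the attention weights route different sequence dynamics to different experts; the exact expert architectures and gating inputs used by MAES may differ from this sketch.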
