Greedy Layer-Wise Training of Long Short Term Memory Networks

Recent developments in Recurrent Neural Networks (RNNs), such as Long Short Term Memory (LSTM), have shown promising potential for modeling sequential data. Nevertheless, training an LSTM is not trivial when the architecture contains multiple layers. This difficulty originates from how the LSTM is initialized: starting from a poor initialization, gradient-based optimization often appears to converge to poor local solutions. In this paper, we explore an unsupervised pretraining mechanism for LSTM initialization, following the philosophy that unsupervised pretraining acts as a regularizer that guides the subsequent supervised training. We propose a novel encoder-decoder-based learning framework that initializes a multi-layer LSTM in a greedy layer-wise manner, in which each added LSTM layer is trained to retain the main information in the representation produced by the layers below it. A multi-layer LSTM trained with our method outperforms one trained with random initialization, with clear advantages on several tasks. Moreover, multi-layer LSTMs converge 4 times faster with our greedy layer-wise training method.
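
To make the greedy layer-wise scheme concrete, below is a minimal sketch assuming a PyTorch-style implementation: each added LSTM layer is pretrained as the encoder of a sequence autoencoder that reconstructs its own input, then its outputs become the input to the next layer. The per-timestep linear decoder, the helper names `pretrain_layer` and `greedy_pretrain`, and all hyperparameters are illustrative assumptions, not the paper's exact design.

```python
# Sketch of greedy layer-wise LSTM pretraining (illustrative, not the paper's exact setup).
import torch
import torch.nn as nn


def pretrain_layer(inputs, input_size, hidden_size, epochs=10, lr=1e-3):
    """Pretrain one LSTM layer to retain the information in `inputs`.

    inputs: tensor of shape (batch, time, input_size).
    Returns the pretrained LSTM layer (the encoder); the decoder is discarded.
    """
    encoder = nn.LSTM(input_size, hidden_size, batch_first=True)
    decoder = nn.Linear(hidden_size, input_size)  # illustrative per-timestep decoder
    optimizer = torch.optim.Adam(
        list(encoder.parameters()) + list(decoder.parameters()), lr=lr
    )
    criterion = nn.MSELoss()

    for _ in range(epochs):
        optimizer.zero_grad()
        hidden_seq, _ = encoder(inputs)          # (batch, time, hidden_size)
        reconstruction = decoder(hidden_seq)     # (batch, time, input_size)
        loss = criterion(reconstruction, inputs) # retain the previous representation
        loss.backward()
        optimizer.step()
    return encoder


def greedy_pretrain(x, layer_sizes):
    """Stack LSTM layers one at a time, pretraining each on the
    representation produced by the layers below it."""
    layers = []
    current, input_size = x, x.size(-1)
    for hidden_size in layer_sizes:
        layer = pretrain_layer(current, input_size, hidden_size)
        with torch.no_grad():
            current, _ = layer(current)          # input for the next layer
        layers.append(layer)
        input_size = hidden_size
    return layers  # used to initialize the multi-layer LSTM before supervised fine-tuning


# Example: greedily pretrain a 3-layer LSTM on random sequences (batch 32, length 20, dim 8).
x = torch.randn(32, 20, 8)
pretrained_layers = greedy_pretrain(x, layer_sizes=[64, 64, 64])
```

The pretrained layers serve only as an initialization; the stacked LSTM is subsequently fine-tuned end to end with the supervised objective.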