Hierarchical Subsampling Networks

So far we have focused on recurrent neural networks with a single hidden layer (or a set of disconnected hidden layers, in the case of bidirectional or multidirectional networks). As discussed in Section 3.2, this structure is in principle able to approximate any sequence-to-sequence function arbitrarily well, and should therefore be sufficient for any sequence labelling task. In practice, however, it tends to struggle with very long sequences. One problem is that, because the entire network is activated at every step of the sequence, the computational cost can become prohibitively high. Another is that the information in long sequences tends to be more spread out, and sequences with longer-range interdependencies are generally harder to learn from.
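To make the computational point concrete, the sketch below is a minimal illustration (an assumption for exposition, not the architecture developed later in this chapter) of subsampling between two plain tanh recurrent layers: consecutive hidden vectors from the first layer are concatenated into non-overlapping windows of width w, so the second layer is activated only once every w steps of the original sequence, and its sequential cost shrinks by roughly a factor of w.

```python
import numpy as np

def simple_rnn(x, W_in, W_rec):
    """Run a plain tanh RNN over a sequence x of shape (T, d_in)."""
    T = x.shape[0]
    h = np.zeros(W_rec.shape[0])
    outputs = np.zeros((T, W_rec.shape[0]))
    for t in range(T):
        h = np.tanh(x[t] @ W_in + h @ W_rec)
        outputs[t] = h
    return outputs

def subsample(h, w):
    """Concatenate non-overlapping windows of w consecutive hidden vectors,
    shortening the sequence by a factor of w (any incomplete final window
    is dropped in this simplified version)."""
    T, d = h.shape
    T_trunc = (T // w) * w
    return h[:T_trunc].reshape(T_trunc // w, w * d)

# Illustrative sizes: 1000-step input, 50 hidden units, window width 5.
rng = np.random.default_rng(0)
T, d_in, d_h, w = 1000, 20, 50, 5

x = rng.standard_normal((T, d_in))
W_in1  = rng.standard_normal((d_in, d_h)) * 0.1
W_rec1 = rng.standard_normal((d_h, d_h)) * 0.1
h1 = simple_rnn(x, W_in1, W_rec1)      # first layer runs for 1000 steps

x2 = subsample(h1, w)                  # sequence length reduced to 200
W_in2  = rng.standard_normal((w * d_h, d_h)) * 0.1
W_rec2 = rng.standard_normal((d_h, d_h)) * 0.1
h2 = simple_rnn(x2, W_in2, W_rec2)     # second layer runs for only 200 steps

print(h1.shape, h2.shape)              # (1000, 50) (200, 50)
```

The window width and the plain tanh recurrence here are placeholders; the only point of the sketch is that each higher layer operates on a progressively shorter sequence, which is what makes very long inputs tractable.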