An Investigation on the Effectiveness of Multimodal Fusion and Temporal Feature Extraction in Reactive and Spontaneous Behavior Generative RNN Models for Listener Agents