Dynamic Spatio-Temporal Modular Network for Video Question Answering