Learning to Collocate Visual-Linguistic Neural Modules for Image Captioning