Tackling Vision Language Tasks Through Learning Inner Monologues