Visual Representation Learning with Transformer: A Sequence-to-Sequence Perspective