MAFormer: A Transformer Network with Multi-scale Attention Fusion for Visual Recognition