Existing crowd counting approaches usually estimate a density map with a deep convolutional neural network and derive the crowd count from it. Background noise, however, can cause some of these approaches to misrecognize pedestrian heads. To address this, some approaches estimate an attention map to mask the background noise. Because background noise is complex and stochastic, a single attention map is insufficient to recognize it. We therefore propose the Cascade Residual Attention Network (CRANet), a softer and more reasonable design that cascades several effective residual attention modules to mask background noise. In addition, because the Euclidean loss treats each pixel in isolation, we design a novel Pyramid Structural Similarity Loss to train CRANet. We evaluate the proposed approach on three crowd datasets, and the experimental results demonstrate that it achieves state-of-the-art performance.