Recently, many deep-learning-based studies have explored the potential quality improvement of compressed videos. These methods mostly exploit either spatial or temporal information to perform frame-level enhancement. However, they fail to combine different spatial-temporal cues to adaptively exploit adjacent patches when enhancing the current patch, and therefore achieve limited gains, especially on videos with scene changes and strong motion. To overcome these limitations, we propose a patch-wise spatial-temporal quality enhancement network that first extracts spatial and temporal features, and then recalibrates and fuses them. Specifically, we design a temporal and spatial-wise attention-based feature distillation structure that adaptively exploits adjacent patches to distill patch-wise temporal features. To adaptively enhance each patch with spatial and temporal information, a channel and spatial-wise attention fusion block is proposed to achieve patch-wise recalibration and fusion of the spatial and temporal features. Experimental results demonstrate that our network achieves a peak signal-to-noise ratio improvement of 0.55–0.69 dB over the compressed videos at different quantization parameters, outperforming state-of-the-art approaches.
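
To make the fusion step concrete, the following is a minimal PyTorch sketch of a channel and spatial-wise attention fusion block in the spirit described above. The module name `CSAFusionBlock`, the layer sizes, and the exact attention formulation (SE-style channel attention followed by CBAM-style spatial attention) are illustrative assumptions, not the authors' published implementation.

```python
# Hypothetical sketch, not the paper's actual code: recalibrate and fuse
# spatial and temporal features for one patch with channel- and
# spatial-wise attention.
import torch
import torch.nn as nn


class CSAFusionBlock(nn.Module):
    """Patch-wise recalibration and fusion of spatial/temporal features."""

    def __init__(self, channels: int = 64, reduction: int = 16):
        super().__init__()
        fused = 2 * channels  # spatial + temporal features, concatenated
        # Channel attention: global average pooling + bottleneck (SE-style).
        self.channel_att = nn.Sequential(
            nn.AdaptiveAvgPool2d(1),
            nn.Conv2d(fused, fused // reduction, kernel_size=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(fused // reduction, fused, kernel_size=1),
            nn.Sigmoid(),
        )
        # Spatial attention: conv over channel-pooled maps (CBAM-style).
        self.spatial_att = nn.Sequential(
            nn.Conv2d(2, 1, kernel_size=7, padding=3),
            nn.Sigmoid(),
        )
        # 1x1 conv fuses the recalibrated features back to `channels`.
        self.fuse = nn.Conv2d(fused, channels, kernel_size=1)

    def forward(self, spatial_feat: torch.Tensor,
                temporal_feat: torch.Tensor) -> torch.Tensor:
        x = torch.cat([spatial_feat, temporal_feat], dim=1)  # (N, 2C, H, W)
        x = x * self.channel_att(x)  # per-channel recalibration
        # Pool across channels to build the spatial attention map.
        avg_map = x.mean(dim=1, keepdim=True)
        max_map = x.amax(dim=1, keepdim=True)
        x = x * self.spatial_att(torch.cat([avg_map, max_map], dim=1))
        return self.fuse(x)


# Usage: fuse 64-channel spatial/temporal features of a 64x64 patch.
block = CSAFusionBlock(channels=64)
s = torch.randn(1, 64, 64, 64)
t = torch.randn(1, 64, 64, 64)
out = block(s, t)  # (1, 64, 64, 64)
```

Because both attention maps are computed from the patch's own features, the block can weight the temporal branch down on scene changes (where adjacent patches are unreliable) and up on well-aligned motion, which is the adaptive behavior the abstract motivates.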