CSANet for Video Semantic Segmentation With Inter-Frame Mutual Learning

Video semantic segmentation aims at generating temporally consistent segmentation results and remains a very challenging task in the deep learning era. In this work, we improve prior approaches from two aspects. On the network architecture level, we present the cross and self-attention network (CSANet). As opposed to prior methods, CSANet not only propagates temporal features from adjacent frames, but is also designed to aggregate spatial context within the current frame, which is shown to effectively improve the consistency and robustness of the extracted deep features. On the loss function level, we further propose an inter-frame mutual learning strategy that encourages the cross-attention module to focus on semantically correlated context regions, allowing the segmentation results at different frames to be collaboratively improved. By combining these two novel designs, we show that our proposed method delivers state-of-the-art performance on the Cityscapes and CamVid benchmarks.
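
To make the cross/self-attention aggregation concrete, the following is a minimal PyTorch sketch of a module that attends from the current frame to an adjacent frame (temporal propagation) and to itself (spatial context), then fuses both. The module name, tensor shapes, and summation-based fusion are illustrative assumptions and not the authors' exact CSANet implementation.

```python
# Sketch only: standard scaled dot-product attention used for both the
# cross-frame and within-frame paths; all design details are assumptions.
import torch
import torch.nn as nn


class CrossSelfAttention(nn.Module):
    def __init__(self, channels: int):
        super().__init__()
        # 1x1 convolutions produce per-pixel query/key/value embeddings.
        self.query = nn.Conv2d(channels, channels, kernel_size=1)
        self.key = nn.Conv2d(channels, channels, kernel_size=1)
        self.value = nn.Conv2d(channels, channels, kernel_size=1)

    def _attend(self, q_feat: torch.Tensor, kv_feat: torch.Tensor) -> torch.Tensor:
        # Scaled dot-product attention between all pixel locations.
        b, c, h, w = q_feat.shape
        q = self.query(q_feat).flatten(2).transpose(1, 2)   # (B, HW, C)
        k = self.key(kv_feat).flatten(2)                     # (B, C, HW)
        v = self.value(kv_feat).flatten(2).transpose(1, 2)   # (B, HW, C)
        attn = torch.softmax(q @ k / c ** 0.5, dim=-1)       # (B, HW, HW)
        out = (attn @ v).transpose(1, 2).reshape(b, c, h, w)
        return out

    def forward(self, cur_feat: torch.Tensor, prev_feat: torch.Tensor) -> torch.Tensor:
        # Cross-attention: propagate temporal context from the adjacent frame.
        cross = self._attend(cur_feat, prev_feat)
        # Self-attention: aggregate spatial context within the current frame.
        self_ctx = self._attend(cur_feat, cur_feat)
        # Fuse both context sources with the original features (assumed fusion).
        return cur_feat + cross + self_ctx
```

In this sketch, the cross-attention path corresponds to the temporal feature propagation described above, while the self-attention path corresponds to the spatial context aggregation within the current frame.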