Real Image Super-Resolution Using Token Based Contextual Attention

Current state-of-the-art (SOTA) image super-resolution (SR) methods rely heavily on deep neural networks (DNNs), and many of them use attention mechanisms to regulate feature channels. While these models perform well on benchmark datasets where low-resolution (LR) images are constructed from high-resolution (HR) references with a known blur kernel, real image SR is more challenging because the LR-HR pairs are both captured by real cameras, with complex blur kernels and noise statistics. Moreover, current methods are trained on small image patches, where channel attention is computed from statistics of the full patch. Such models perform impressively when the test image is small or is cropped into small patches, but perform poorly when tested on full-size real images. To alleviate these issues, we propose a new token-based attention module with innovative contextual encoding that makes SR models robust to the image patch size at test time. The dot-product attention between tokens efficiently describes the affinity between different regions of an image. Combined with the proximity relationship captured by the contextual encoding, it leads to better global SR results on full-size images. Comprehensive experiments demonstrate the superior performance of the proposed scheme.
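
As an illustration only, the idea that dot-product attention between tokens describes the affinity between image regions can be sketched with generic scaled dot-product self-attention. The abstract does not specify how tokens are formed, which projections are used, or how the contextual encoding enters, so the function name `dot_product_attention`, the identity query/key/value projections, and the toy token values below are all assumptions and not the proposed module.

```python
import math


def dot_product_attention(tokens):
    """Generic scaled dot-product self-attention over region tokens.

    Assumption: identity query/key/value projections, for illustration only.
    tokens: list of equal-length lists of floats, one per image region.
    Returns (weights, attended), where weights[i][j] is the affinity of
    token i to token j and each row of weights sums to 1.
    """
    d = len(tokens[0])
    weights = []
    for q in tokens:
        # Scaled dot product between this token and every token.
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in tokens]
        # Numerically stable softmax over the scores.
        m = max(scores)
        exps = [math.exp(s - m) for s in scores]
        total = sum(exps)
        weights.append([e / total for e in exps])
    # Each output token is an affinity-weighted average of all tokens.
    attended = [
        [sum(w * v[j] for w, v in zip(row, tokens)) for j in range(d)]
        for row in weights
    ]
    return weights, attended


# Hypothetical tokens: regions 0 and 1 are similar, region 2 is different.
tokens = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]]
weights, attended = dot_product_attention(tokens)
```

In this toy setting, region 0 assigns more weight to the similar region 1 than to the dissimilar region 2, regardless of where those regions sit in the image. A proximity term such as the contextual encoding mentioned above would have to be added separately.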