Rethinking Training Strategy in Stereo Matching

In stereo matching, various learning-based approaches have shown impressive performance in overcoming traditional difficulties on multiple datasets. However, most progress is achieved on a specific dataset with a dataset-specific network design, while the effect of the training strategy on single-dataset and cross-dataset performance is often ignored. In this article, we analyze the relationship between training strategies and performance by retraining several representative state-of-the-art methods, e.g., the geometry and context network (GC-Net), the pyramid stereo matching network (PSM-Net), and the guided aggregation network (GA-Net). Surprisingly, we find that the performance of these networks on both single and cross datasets is significantly improved by pre-training and data augmentation, without any change to the network structure. Based on this finding, we improve our previous non-local context attention network (NLCA-Net) to NLCA-Net v2, train it with the proposed strategy, and concurrently rethink the training strategy of stereo matching. The quantitative experiments demonstrate that: 1) our model reaches top performance on both single and multiple datasets with the same set of parameters, and won 2nd place in the stereo task of the ECCV Robust Vision Challenge 2020 (RVC 2020); and 2) on small datasets (e.g., KITTI, ETH3D, and Middlebury), the model's generalization and robustness are significantly affected by pre-training and data augmentation, in some cases even more than by the network structure. These observations challenge the conventional wisdom that network architecture is paramount at this stage. We expect these discoveries to encourage researchers to rethink the current paradigm of ``excessive attention to the performance of a single small dataset'' in stereo matching.
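
Since pre-training and data augmentation are the central levers discussed above, a minimal sketch may help make the idea concrete. The snippet below illustrates one common stereo-pair augmentation scheme (asymmetric photometric jitter plus a paired random crop); the function name, parameters, and exact operations are illustrative assumptions, not the recipe used in this work.

```python
import numpy as np

def augment_stereo_pair(left, right, disparity, crop_hw=(256, 512), rng=None):
    """Illustrative augmentation for one stereo training sample.

    left, right: H x W x 3 float arrays in [0, 1]; disparity: H x W floats.
    Hypothetical helper -- the operations below are common choices in
    stereo training pipelines, not necessarily those used in the paper.
    """
    if rng is None:
        rng = np.random.default_rng()

    # Asymmetric photometric jitter: perturb each view independently, so
    # the network cannot rely on exact color constancy between the views.
    views = []
    for img in (left, right):
        img = img * rng.uniform(0.8, 1.2)             # brightness scale
        img = img + rng.uniform(-0.05, 0.05, size=3)  # per-channel shift
        views.append(np.clip(img, 0.0, 1.0))
    left, right = views

    # Paired random crop: the same window is cut from both views and the
    # disparity map, keeping epipolar (same-row) correspondence intact.
    h, w = disparity.shape
    ch, cw = crop_hw
    y = int(rng.integers(0, h - ch + 1))
    x = int(rng.integers(0, w - cw + 1))
    return (left[y:y + ch, x:x + cw],
            right[y:y + ch, x:x + cw],
            disparity[y:y + ch, x:x + cw])
```

The key constraint this sketch encodes is that geometric transforms (here, the crop) must be applied identically to both views and the disparity map to preserve correspondence, whereas photometric transforms may safely differ between the two views.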