In this paper, we propose an iterative feature matching framework for self-supervised depth estimation in indoor scenes. Such scenes exhibit complex ego-motion and large textureless or repetitively textured regions, so conventional methods usually rely on structure-from-motion supervision to help the photometric optimization escape local minima. However, this supervision is limited because the reconstruction is usually sparse. To address this, our framework, called IFMNet, jointly learns depths and searches for correspondences. Using the depths predicted in the previous iteration, an online-optimized grid search algorithm finds more accurate correspondences. Given these new correspondences, we compute triangulated depths and improve the depth network with adaptive bin-wise online hard example mining. Experimental results on the NYU Depth V2 and SceneNet datasets verify the effectiveness of our approach.
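
To make the triangulation step concrete, the following is a minimal sketch of how a depth value can be recovered from a single correspondence between two views. It uses standard linear (DLT) triangulation and is not necessarily the formulation used in IFMNet. The function names `project` and `triangulate_depth`, the intrinsics `K`, and the relative pose `(R, t)` are assumptions introduced for illustration only.

```python
import numpy as np


def project(K, R, t, X):
    """Project a 3D point X (view-1 camera coordinates) into a view with pose (R, t)."""
    x = K @ (R @ X + t)
    return x[:2] / x[2]


def triangulate_depth(K, R, t, uv1, uv2):
    """Linear (DLT) triangulation of one correspondence (illustrative sketch).

    K       : 3x3 camera intrinsics, assumed shared by both views.
    R, t    : rotation and translation mapping view-1 coordinates to view-2 coordinates.
    uv1, uv2: matched pixel coordinates in view 1 and view 2.
    Returns the depth (z coordinate) of the triangulated point in view 1.
    """
    P1 = K @ np.hstack([np.eye(3), np.zeros((3, 1))])  # view-1 projection matrix
    P2 = K @ np.hstack([R, t.reshape(3, 1)])           # view-2 projection matrix
    # Each view contributes two linear constraints on the homogeneous point.
    A = np.stack([
        uv1[0] * P1[2] - P1[0],
        uv1[1] * P1[2] - P1[1],
        uv2[0] * P2[2] - P2[0],
        uv2[1] * P2[2] - P2[1],
    ])
    # The solution is the right singular vector with the smallest singular value.
    _, _, Vt = np.linalg.svd(A)
    X_h = Vt[-1]
    X = X_h[:3] / X_h[3]
    return X[2]


if __name__ == "__main__":
    # Synthetic check with made-up intrinsics and a small sideways camera motion.
    K = np.array([[500.0, 0.0, 320.0], [0.0, 500.0, 240.0], [0.0, 0.0, 1.0]])
    R, t = np.eye(3), np.array([-0.1, 0.0, 0.0])
    X_true = np.array([0.2, -0.1, 2.0])
    uv1 = project(K, np.eye(3), np.zeros(3), X_true)
    uv2 = project(K, R, t, X_true)
    print(triangulate_depth(K, R, t, uv1, uv2))
```

In a pipeline of the kind described above, depths triangulated this way from refined correspondences could serve as additional supervision for the depth network; how they are weighted or filtered is left unspecified in this sketch.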