Learning Spatial Features from Audio-Visual Correspondence in Egocentric Videos