Learning semantic visual vocabularies using diffusion distance