Joint Learning of Localized Representations from Medical Images and Reports