Aligning Visual Regions and Textual Concepts for Semantic-Grounded Image Representations