Scene Text Retrieval via Joint Text Detection and Similarity Learning