Language Matters: A Weakly Supervised Vision-Language Pre-training Approach for Scene Text Detection and Spotting