Street View Text Recognition With Deep Learning for Urban Scene Understanding in Intelligent Transportation Systems

Understanding the surrounding scene is one of the fundamental tasks in intelligent transportation systems (ITS), especially in unpredictable driving conditions or in developing regions/cities without digital maps. The street view is the most common scene encountered during driving. Since streets are often lined with shops bearing signboards, scene text recognition on shop sign images in street views is of great significance and utility for urban scene understanding in ITS. To advance research in this field, (1) we build ShopSign, a large-scale scene text dataset of Chinese shop signs in street views. It contains 25,770 natural scene images and 267,049 text instances. The images in ShopSign were captured in diverse scenes, from downtown areas to developing regions, across 8 provinces and 20 cities in China, using more than 50 different mobile phones. The dataset is highly sparse and imbalanced in nature. (2) We carry out a comprehensive empirical study of the performance of state-of-the-art deep-learning-based scene text reading algorithms on ShopSign and three other Chinese scene text datasets, a comparison that has not previously been addressed in the literature. Through this comparative analysis, we demonstrate that language has a critical influence on scene text detection. Moreover, by comparing the accuracy of four scene text recognition algorithms, we show that there is still very large room for improvement in street view text recognition before it can meet the demands of real-world ITS applications.