Real-time object detection and multilingual speech synthesis

We introduce a technique that couples real-time, deep-learning-based object detection with multilingual neural text-to-speech (TTS) synthesis, generating distinct voices from a single model. In this work, we improve on the existing monolingual approach to single-model neural TTS. Built on a high-performance neural network backbone (Inception-v4), the system achieves significant improvements in audio quality for objects detected in real-world scenes. We show that a single deep learning model paired with one neural TTS system can generate speech in multiple languages, each with a unique voice, and present it in a live environment. For image detection and recognition we adopt transfer learning, retraining only the top layer of the model. This work introduces a user-friendly image-to-speech tracking system with multiple-language support, designed for easy navigation and the extraction of valuable information, especially for visually impaired users.
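As a minimal sketch of the pipeline described above, the flow is: detect an object in a frame, translate its label into the target language, and hand the text to a multilingual TTS voice. The detector and TTS backends are stubbed out here; in the actual system a transfer-learned Inception-v4 classifier and a single multilingual neural TTS model would fill those roles, and all labels, translations, and voice names below are illustrative assumptions, not taken from the paper.

```python
# Sketch of the image-to-speech pipeline (stubs only, illustrative).

# Hypothetical translations of detected object labels.
TRANSLATIONS = {
    "stop sign": {"en": "stop sign", "es": "señal de alto", "de": "Stoppschild"},
    "person": {"en": "person", "es": "persona", "de": "Person"},
}

def detect_object(frame) -> str:
    """Stub detector: in the full system, a transfer-learned Inception-v4
    model classifies the frame and returns the top predicted label."""
    return "stop sign"  # placeholder prediction

def synthesize(text: str, language: str, voice: str) -> str:
    """Stub TTS: a single multilingual neural TTS model would render
    `text` in `language` with the requested `voice`. Here we simply
    return the utterance string that would be spoken."""
    return f"[{voice}/{language}] {text}"

def image_to_speech(frame, language: str = "en", voice: str = "voice-1") -> str:
    """Detect an object, translate its label, and synthesize speech."""
    label = detect_object(frame)
    text = TRANSLATIONS.get(label, {}).get(language, label)
    return synthesize(text, language, voice)

print(image_to_speech(None, language="es"))  # [voice-1/es] señal de alto
```

The translation table stands in for the multilingual front end; swapping the stubs for a real detector and TTS engine leaves the pipeline structure unchanged.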