论文信息 - Real Time object detection and multilingual speech synthesis

Real Time object detection and multilingual speech synthesis

We introduce a technique for real time deep learning based image detection with multilingual neural text-to-speech (TTS) synthesis; to generate different voices from a single model. In this work, we show improvement to the existing single lingual approach for a single-model based neural text to speech synthesis. This model, constructed with higher performance building block of a neural network (Inception4 model), demonstrated high level significant audio signal quality improvement on the images detected in real life. We show that a single deep learning model with a single neural TTS system can generate multiple languages with unique voices and display them in real life environment. We adopted transfer learning method for the image detection and recognition purpose and retrained the top layer of the model. This work introduces user friendly image-to-speech tracking system for easy navigation and valuable information extraction especially for the visually impaired people with multiple language capabilities.

Okeke Stephen | Mangal Sain | Deepanjali Mishra

[1] Srikanth Ronanki,et al. Median-based generation of synthetic speech durations using a non-parametric approach , 2016, 2016 IEEE Spoken Language Technology Workshop (SLT).

[2] Rob Fergus,et al. Visualizing and Understanding Convolutional Networks , 2013, ECCV.

[3] Julia Hirschberg,et al. Progress in speech synthesis , 1997 .

[4] Heiga Zen,et al. WaveNet: A Generative Model for Raw Audio , 2016, SSW.

[5] Marc'Aurelio Ranzato,et al. Large Scale Distributed Deep Networks , 2012, NIPS.

[6] Andrew Zisserman,et al. Two-Stream Convolutional Networks for Action Recognition in Videos , 2014, NIPS.

[7] Alex Krizhevsky,et al. One weird trick for parallelizing convolutional neural networks , 2014, ArXiv.

[8] Ruslan Salakhutdinov,et al. Multimodal Neural Language Models , 2014, ICML.

[9] Xiang Zhang,et al. OverFeat: Integrated Recognition, Localization and Detection using Convolutional Networks , 2013, ICLR.

[10] Yoshua Bengio,et al. Learning Phrase Representations using RNN Encoder–Decoder for Statistical Machine Translation , 2014, EMNLP.