We have been developing a practical speech enhancement system to support laryngectomees. Interviews with users identified three essential requirements: the system should make use of an existing device, its appearance should be inconspicuous, and it should be easy to use. To meet these needs, we plan to build the system as a smartphone application, so that users look no different from anyone else using a phone and do not need to buy any additional device. The key concept of the proposed system is to combine lip-reading with speech synthesis. In this study, we examined a lip-reading method that recognizes words the user has registered in advance and that can be adapted to the individual user from a small amount of data. 36 viseme images were compressed into very small representations using a VAE (variational autoencoder), and the training data for the word recognition model were generated from them. A viseme is a group of phonemes that look identical on the lips. Representing words as viseme sequences encoded by the VAE allows the system to adapt to a user with a very small training set. A word recognition experiment using the VAE encoder and a CNN was performed with 20 Japanese words. The recognition accuracy was 65% for the first candidate and 100% when the first and second candidates were both counted. Lip-reading-based speech enhancement therefore appears suitable for embedding in mobile devices in terms of both usability and small-vocabulary recognition accuracy.
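The two accuracy figures reported above (first candidate only, and first or second candidate) correspond to what is commonly called top-1 and top-2 accuracy. The following is a minimal sketch of how such figures can be computed from per-word classifier scores. The function name `top_k_accuracy` and the toy scores are illustrative assumptions, not the evaluation code or data used in this study.

```python
from typing import Sequence


def top_k_accuracy(
    scores: Sequence[Sequence[float]], labels: Sequence[int], k: int
) -> float:
    """Fraction of samples whose true label is among the k highest-scoring classes."""
    if len(scores) != len(labels):
        raise ValueError("scores and labels must have the same length")
    if not labels:
        raise ValueError("at least one sample is required")
    hits = 0
    for row, label in zip(scores, labels):
        # Rank class indices from highest to lowest score.
        ranked = sorted(range(len(row)), key=lambda c: row[c], reverse=True)
        if label in ranked[:k]:
            hits += 1
    return hits / len(labels)


# Toy scores for 4 utterances over 3 hypothetical word classes
# (made-up numbers for illustration only, not results from the paper).
toy_scores = [
    [0.7, 0.2, 0.1],  # true class 0: correct as first candidate
    [0.4, 0.5, 0.1],  # true class 0: correct only as second candidate
    [0.1, 0.3, 0.6],  # true class 2: correct as first candidate
    [0.2, 0.5, 0.3],  # true class 2: correct only as second candidate
]
toy_labels = [0, 0, 2, 2]

top1 = top_k_accuracy(toy_scores, toy_labels, k=1)
top2 = top_k_accuracy(toy_scores, toy_labels, k=2)
print(f"top-1: {top1:.2f}, top-2: {top2:.2f}")
```

In an application of the kind described, a high top-2 figure matters if the interface can offer the second candidate as a quick alternative when the first is wrong. That design is an assumption here, not something the abstract states.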