Language Feature Vectors for Resource Constraint Speech Recognition

Deep Neural Networks (DNNs) are a key element of state-of-the-art speech recognition systems. Because they are data-driven, they require a significant amount of training data, and for some languages that amount of data is not available. Building systems for such resource-constrained tasks requires special techniques. One common method is to train the acoustic model on data from multiple languages, but knowledge transfer between different languages is limited. With Language Feature Vectors (LFVs), we try to mitigate these limitations by providing language information to DNNs. Similar to i-Vectors for speaker adaptation, LFVs enable DNNs to better capture and adapt to inter-language characteristics. Previous experiments have shown that providing LFVs to DNNs improves system performance. In this paper, we show that adding LFVs decreases the performance gap between mono- and multilingual systems.
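As a rough illustration of what "providing language information to DNNs" could look like, the sketch below appends one fixed LFV to every acoustic frame of an utterance, in the way i-Vectors are commonly appended for speaker adaptation. The abstract does not state how the LFVs are fed to the network, so this input-level concatenation is an assumption; the function name `append_lfv`, the 40-dimensional frames, and the 3-dimensional LFV are hypothetical and chosen only for the example.

```python
from typing import List, Sequence


def append_lfv(
    frames: Sequence[Sequence[float]], lfv: Sequence[float]
) -> List[List[float]]:
    """Return a copy of `frames` with `lfv` appended to every frame.

    Assumption: the language vector is constant over the utterance and is
    concatenated to the DNN input, as is commonly done with i-Vectors.
    """
    return [list(frame) + list(lfv) for frame in frames]


# Hypothetical sizes: 5 frames of 40 acoustic features, a 3-dimensional LFV.
frames = [[0.0] * 40 for _ in range(5)]
lfv = [0.1, 0.7, 0.2]

augmented = append_lfv(frames, lfv)
print(len(augmented), len(augmented[0]))
```

Each augmented frame would then be the input to the multilingual acoustic model, so the network sees the same language cue on every frame of the utterance.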
