Using articulatory features and inferred phonological segments in zero resource speech processing

Unsupervised discovery of subword units is an important problem in recognition and synthesis for zero-resource languages, where the phone set may not be known and speech may be the only available resource. To discover such units, we apply techniques that we recently developed for building synthetic voices for very low-resource languages without a written form. We use Articulatory Features trained on labeled speech in a higher-resource language to infer phonological segments of varying granularity. We use both the raw Articulatory Features and the Articulatory Features of the inferred units as frame-based representations of speech. We evaluate our techniques on minimal-pair ABX discrimination within and across speakers. In addition, to exploit the duration information provided by the inferred phonological units, we present evaluation results on Mel Cepstral Distortion, an objective metric of speech synthesis quality. We evaluate our techniques on multiple databases of English, and also on Tsonga and Indic languages, to which we apply the above methods cross-lingually.
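As an illustration of the synthesis metric mentioned above, the following is a minimal sketch of Mel Cepstral Distortion in its commonly used form, (10 / ln 10) * sqrt(2 * sum_d (c_d - c'_d)^2), averaged over frames. It rests on assumptions that the abstract does not state: the reference and synthesized frames are already time-aligned, the 0th (energy) coefficient is excluded, and the function name `mel_cepstral_distortion` is hypothetical. It is not necessarily the exact configuration used in the evaluation.

```python
import math


def mel_cepstral_distortion(ref_frames, syn_frames):
    """Mean Mel Cepstral Distortion (dB) between two aligned frame sequences.

    Each argument is a list of mel-cepstral vectors (lists of floats).
    Assumption: frames are already time-aligned, one-to-one.
    """
    if not ref_frames or len(ref_frames) != len(syn_frames):
        raise ValueError("frame sequences must be non-empty and equal in length")

    # Constant factor of the common MCD definition: (10 / ln 10) * sqrt(2).
    scale = 10.0 / math.log(10.0) * math.sqrt(2.0)

    total = 0.0
    for ref, syn in zip(ref_frames, syn_frames):
        if len(ref) != len(syn):
            raise ValueError("cepstral vectors must have the same dimension")
        # Skip index 0 (assumed to be the energy term).
        squared_error = sum((r - s) ** 2 for r, s in zip(ref[1:], syn[1:]))
        total += scale * math.sqrt(squared_error)

    # Average the per-frame distortion over the utterance.
    return total / len(ref_frames)


if __name__ == "__main__":
    reference = [[1.0, 0.5, -0.2], [0.9, 0.4, -0.1]]
    synthesized = [[1.1, 0.4, -0.2], [0.8, 0.4, 0.0]]
    print(mel_cepstral_distortion(reference, synthesized))
```

Lower values indicate that the synthesized cepstra are closer to the reference. In practice the two sequences usually differ in length, so an alignment step (for example dynamic time warping) would precede this computation.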
