Word Discovering in Low-Resources Languages Through Cross-Lingual Phonemes

This paper presents an approach for discovering word units in an unknown language under zero-resource conditions. The method relies only on acoustic similarity: it combines cross-lingual phoneme recognition with the identification of consistent phoneme strings. To this end, a two-phase algorithm is proposed. The first phase is an acoustic-phonetic decoding process that uses a universal set of phonemes unrelated to the target language. Its goal is to reduce the search space of similar speech segments, avoiding the quadratic search space that arises when all speech files are compared with one another. In the second phase, the segments found are further refined by means of different approaches based on Dynamic Time Warping. To include more hypotheses than those that match perfectly in terms of phonemes, an edit distance is computed so that hypotheses below a given threshold are also incorporated. Three frame representations are studied: raw acoustic features, autoencoders, and phoneme posteriorgrams. The approach has been evaluated on the corpus used in the Zero Resource Speech Challenge 2017.
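The two phases can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the `max_edits` threshold, the Euclidean frame distance, and the length normalisation are all assumptions made for illustration. Decoded segments are assumed to arrive as pairs of a phoneme sequence and a frame matrix; the edit distance selects candidate pairs, and Dynamic Time Warping then scores them on the frame representation.

```python
import numpy as np

def edit_distance(a, b):
    """Levenshtein distance between two phoneme sequences."""
    prev = list(range(len(b) + 1))
    for i, pa in enumerate(a, start=1):
        curr = [i]
        for j, pb in enumerate(b, start=1):
            cost = 0 if pa == pb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]

def candidate_pairs(segments, max_edits=1):
    """Return index pairs whose phoneme strings differ by at most max_edits.

    segments: list of (phoneme_sequence, frames) tuples (hypothetical layout).
    This toy loop compares decoded strings only; a real system would
    likely index the strings rather than enumerate all pairs.
    """
    pairs = []
    for i in range(len(segments)):
        for j in range(i + 1, len(segments)):
            if edit_distance(segments[i][0], segments[j][0]) <= max_edits:
                pairs.append((i, j))
    return pairs

def dtw_cost(x, y):
    """Length-normalised DTW cost between two (frames x dims) arrays."""
    n, m = len(x), len(y)
    # Pairwise Euclidean distances between all frames of x and y.
    dist = np.linalg.norm(x[:, None, :] - y[None, :, :], axis=2)
    acc = np.full((n + 1, m + 1), np.inf)
    acc[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            acc[i, j] = dist[i - 1, j - 1] + min(acc[i - 1, j],
                                                 acc[i, j - 1],
                                                 acc[i - 1, j - 1])
    return acc[n, m] / (n + m)
```

Under these assumptions, a pair returned by `candidate_pairs` would be kept as a match only if `dtw_cost` on its frames falls below a second, acoustically tuned threshold. The frames could be any of the three representations studied.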
