Intrinsic Spectral Analysis for Zero and High Resource Speech Recognition

The constraints of the speech production apparatus imply that our vocalizations are approximately restricted to a low-dimensional manifold embedded in a high-dimensional space. Manifold learning algorithms provide a means to recover the approximate embedding from untranscribed data and enable use of the manifold’s intrinsic distance metric to characterize acoustic similarity for downstream automatic speech applications. In this paper, we consider a previously unevaluated nonlinear out-of-sample extension for intrinsic spectral analysis (ISA), investigating its performance in both unsupervised and supervised tasks. In the zero resource regime, where the lack of transcribed resources forces us to rely solely on the phonetic salience of the acoustic features themselves, ISA provides substantial gains relative to canonical acoustic front-ends. When large amounts of transcribed speech for supervised acoustic model training are also available, we find that the data-driven intrinsic spectrogram matches the performance of and is complementary to these signal processing derived counterparts.
