Nonlinear Dimensionality Reduction Methods for Use with Automatic Speech Recognition

For nearly a century, researchers have investigated and used mathematical techniques for reducing the dimensionality of vector-valued data used to characterize categorical data, with the goal of preserving the “information,” or discriminability, of the different categories in the reduced-dimensionality data. The most established techniques are Principal Components Analysis (PCA) and Linear Discriminant Analysis (LDA) (Jolliffe, 1986; Wang & Paliwal, 2003). Both PCA and LDA are based on linear transformations, i.e., matrix multiplications. In PCA, the transformation is chosen to minimize the mean square error between the original data vectors and the data vectors that can be reconstructed from the reduced-dimensionality vectors. In LDA, the transformation is chosen to maximize the ratio of “between-class variance” to “within-class variance,” thereby reducing data variation within each class and increasing the separation between classes. Newer variants of these methods exist, such as Heteroscedastic Discriminant Analysis (HDA) (Kumar & Andreou, 1998; Saon et al., 2000). However, in all cases certain assumptions are made about the statistical properties of the original data (such as multivariate Gaussianity); even more fundamentally, the transformations are restricted to be linear.

In this chapter, a class of nonlinear transformations is presented from both a theoretical and an experimental point of view. Theoretically, the nonlinear methods have the potential to be more “efficient” than linear methods, that is, to give better representations with fewer dimensions. In addition, examples are shown from experiments with Automatic Speech Recognition (ASR) in which the nonlinear methods in fact perform better, resulting in higher ASR accuracy than is obtained with either the original speech features or linearly reduced feature sets. Two nonlinear transformation methods, along with several variations, are presented.
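To make the linear baseline concrete, the following sketch (a toy NumPy example, not taken from the chapter) computes a PCA projection by eigendecomposition of the data covariance matrix and verifies that the top components give a low-error linear reconstruction; the data dimensions and sample counts are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)
# Toy data: 200 samples of 5-D features lying near a 2-D linear subspace
X = rng.normal(size=(200, 2)) @ rng.normal(size=(2, 5)) \
    + 0.05 * rng.normal(size=(200, 5))

# PCA: project onto the top-k eigenvectors of the covariance matrix.
# Among all linear projections to k dimensions, this one minimizes
# the mean squared reconstruction error.
Xc = X - X.mean(axis=0)
cov = Xc.T @ Xc / len(Xc)
eigvals, eigvecs = np.linalg.eigh(cov)   # eigenvalues in ascending order
W = eigvecs[:, ::-1][:, :2]              # top-2 principal directions
Z = Xc @ W                               # reduced 2-D features
X_hat = Z @ W.T + X.mean(axis=0)         # linear reconstruction from Z
mse = np.mean((X - X_hat) ** 2)          # small: data is nearly rank-2
```

Because the toy data is (noisy) rank-2, almost all of the variance survives the 5-to-2 reduction, which is exactly the regime in which a linear method suffices; the chapter's argument is that real speech features do not lie on such a linear subspace.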
In the first method, referred to as nonlinear PCA (NLPCA), the goal of the nonlinear transformation is to minimize the mean square error between the original features and the features reconstructed from the reduced-dimensionality features; this method is thus patterned after PCA. In the second method, referred to as nonlinear LDA (NLDA), the goal of the nonlinear transformation is to maximize the discriminability of the categories of data; this method is thus patterned after LDA. In all cases, the dimensionality reduction is accomplished with a Neural Network (NN) that internally encodes the data with a reduced number of dimensions. The methods differ in the error criteria used to train the network, the architecture of the network, and the extent to which the reduced dimensions are “hidden” within the neural network.
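As an illustration of the bottleneck idea behind NLPCA (in the spirit of Kramer's autoassociative network), the following sketch trains a network to reproduce its own input through a narrow middle layer and then reads that layer out as the reduced-dimensionality features. It uses scikit-learn's `MLPRegressor` as a generic stand-in for the chapter's networks; the layer sizes and toy data are illustrative assumptions, not details from the chapter:

```python
import numpy as np
from sklearn.neural_network import MLPRegressor

rng = np.random.default_rng(1)
# Toy data lying near a 1-D nonlinear curve embedded in 3-D
t = rng.uniform(-1, 1, size=(300, 1))
X = np.hstack([t, t**2, np.sin(2 * t)]) + 0.02 * rng.normal(size=(300, 3))

# Autoassociative "bottleneck" network: 3 -> 8 -> 1 -> 8 -> 3.
# Training the network to reproduce its input forces a 1-D encoding
# of the data at the middle (bottleneck) layer.
ae = MLPRegressor(hidden_layer_sizes=(8, 1, 8), activation="tanh",
                  solver="lbfgs", max_iter=5000, random_state=0)
ae.fit(X, X)  # targets are the inputs themselves

def encode(X, net):
    """Forward pass up to and including the bottleneck layer."""
    a = X
    for W, b in list(zip(net.coefs_, net.intercepts_))[:2]:
        a = np.tanh(a @ W + b)
    return a

Z = encode(X, ae)                            # reduced 1-D nonlinear features
mse = np.mean((X - ae.predict(X)) ** 2)      # reconstruction error
```

A linear method would need two or three dimensions to represent this curve with comparable error; the nonlinear bottleneck captures it in one, which is the sense in which the chapter calls the nonlinear methods more “efficient.”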

[1] David G. Stork, et al., Pattern Classification, 1973.

[2] Stephen A. Zahorian, et al., A neural network based nonlinear feature transformation for speech recognition, 2008, INTERSPEECH.

[3] Kuldip K. Paliwal, et al., Feature extraction and dimensionality reduction algorithms and their applications in vowel recognition, 2003, Pattern Recognit.

[4] George Saon, et al., Maximum likelihood discriminant feature spaces, 2000, IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP).

[5] Stephen A. Zahorian, et al., Whole word phonetic displays for speech articulation training, 2006.

[6] I. K. Fodor, A Survey of Dimension Reduction Techniques, 2002.

[7] M. Kramer, Nonlinear principal component analysis using autoassociative neural networks, 1991.

[8] Stephen A. Zahorian, et al., Phone classification with segmental features and a binary-pair partitioned neural network classifier, 1997, IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP).

[9] Stephen A. Zahorian, et al., Dimensionality reduction of speech features using nonlinear principal components analysis, 2007, INTERSPEECH.

[10] Stephen A. Zahorian, et al., Signal modeling for isolated word recognition, 1999, IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP).

[11] Stephen A. Zahorian, et al., Vowel classification for computer-based visual feedback for speech training for the hearing impaired, 2002, INTERSPEECH.

[12] Daniel P. W. Ellis, et al., Tandem acoustic modeling in large-vocabulary recognition, 2001, IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP).

[13] Victor Zue, et al., Speech database development at MIT: TIMIT and beyond, 1990, Speech Commun.

[14] Stephen A. Zahorian, et al., Acoustic-phonetic transformations for improved speaker-independent isolated word recognition, 1991, IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP).

[15] Heng Tao Shen, et al., Principal Component Analysis, 2009, Encyclopedia of Biometrics.

[16] David L. Donoho, High-Dimensional Data Analysis: The Curses and Blessings of Dimensionality, 2000.

[17] N. Deshmukh, et al., Decision Tree-Based State Tying for Acoustic Modeling, 1996.

[18] Jonathan G. Fiscus, et al., DARPA TIMIT Acoustic-Phonetic Continuous Speech Corpus CD-ROM (TIMIT), 1993, NIST.

[19] Hsiao-Wuen Hon, et al., Speaker-independent phone recognition using hidden Markov models, 1989, IEEE Trans. Acoust. Speech Signal Process.

[20] Carla Teixeira Lopes, et al., TIMIT Acoustic-Phonetic Continuous Speech Corpus, 2012.

[21] Daniel P. W. Ellis, et al., Tandem connectionist feature extraction for conventional HMM systems, 2000, IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP).

[22] Stephen A. Zahorian, et al., Neural Network Based Nonlinear Discriminant Analysis for Speech Recognition, 2009.

[24] Andreas G. Andreou, et al., Heteroscedastic discriminant analysis and reduced rank HMMs for improved speech recognition, 1998, Speech Commun.

[25] Stephen A. Zahorian, et al., Dimensionality reduction methods for HMM phonetic recognition, 2010, IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).

[26] Christopher M. Bishop, et al., GTM: The Generative Topographic Mapping, 1998, Neural Computation.

[27] Shigeo Abe, Pattern Classification, 2001, Springer London.