Learning distances to improve phoneme classification

In this work we aim to learn a Mahalanobis distance that improves phoneme classification with the standard 39-dimensional MFCC features. To learn the distance and evaluate its performance, we use the simple k-nearest-neighbors (k-NN) classifier. Although this classifier performs poorly relative to state-of-the-art phoneme classifiers, it can be used to determine a distance metric that is applicable to many other, better-performing machine learning methods. We devise a novel optimization method that minimizes the error function of the k-NN classifier with respect to the covariance matrix of the Mahalanobis distance. The method is based on finite-difference stochastic approximation (FDSA) gradient estimates combined with a random perturbation term to avoid local minima. We apply our method to phoneme classification with the k-NN classifier and show that the learned distance improves performance by up to 8.19% over the standard k-NN classifier, and additionally outperforms other state-of-the-art distance learning methods by approximately 4 percentage points. We also find that the computational complexity of our method, while not optimal, is better than that of other distance learning methods. The performance improvements for individual phoneme classes are given. The learned distances are applicable to other scale-variant machine learning methods, such as support vector machines, multidimensional scaling, and maximum variance unfolding.
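The optimization idea described above can be illustrated with a minimal sketch. This is not the authors' implementation: the factorization of the metric as M = LᵀL, the leave-one-out error, the toy two-dimensional data, the step sizes `a` and `c`, the noise scale `sigma`, and all function names are assumptions chosen only to show how an FDSA gradient estimate plus a random perturbation can drive down k-NN error.

```python
import numpy as np

def knn_loo_error(L, X, y, k=3):
    """Leave-one-out k-NN error under d(a, b) = ||L (a - b)||^2 (assumed form)."""
    Z = X @ L.T                                   # map points through L
    sq = np.sum(Z ** 2, axis=1)
    D = sq[:, None] + sq[None, :] - 2.0 * (Z @ Z.T)
    np.fill_diagonal(D, np.inf)                   # a point is not its own neighbour
    votes = y[np.argsort(D, axis=1)[:, :k]]       # labels of the k nearest points
    n_classes = int(y.max()) + 1
    counts = np.stack([(votes == c).sum(axis=1) for c in range(n_classes)], axis=1)
    return float(np.mean(counts.argmax(axis=1) != y))

def fdsa_gradient(f, theta, c):
    """Central finite-difference estimate: perturb one coordinate at a time."""
    g = np.zeros_like(theta)
    for i in range(theta.size):
        e = np.zeros_like(theta)
        e.flat[i] = c
        g.flat[i] = (f(theta + e) - f(theta - e)) / (2.0 * c)
    return g

def learn_metric(X, y, k=3, n_iter=20, a=0.5, c=0.2, sigma=0.01, seed=0):
    """Hypothetical training loop: FDSA step plus Gaussian perturbation."""
    rng = np.random.default_rng(seed)
    L = np.eye(X.shape[1])                        # start from the Euclidean metric
    f = lambda M: knn_loo_error(M, X, y, k)
    best_L, best_err = L.copy(), f(L)
    for _ in range(n_iter):
        g = fdsa_gradient(f, L, c)
        # The random term is meant to help escape flat regions and local minima.
        L = L - a * g + sigma * rng.standard_normal(L.shape)
        err = f(L)
        if err < best_err:                        # keep the best matrix seen so far
            best_L, best_err = L.copy(), err
    return best_L, best_err

def make_toy_data(n=60, seed=0):
    """Two classes separated along a low-variance axis, plus a noisy axis."""
    rng = np.random.default_rng(seed)
    informative = np.concatenate([rng.normal(-0.5, 0.3, n), rng.normal(0.5, 0.3, n)])
    noisy = rng.normal(0.0, 5.0, 2 * n)
    X = np.column_stack([informative, noisy])
    y = np.concatenate([np.zeros(n, dtype=int), np.ones(n, dtype=int)])
    return X, y

if __name__ == "__main__":
    X, y = make_toy_data()
    print("Euclidean k-NN error:", knn_loo_error(np.eye(2), X, y))
    L, err = learn_metric(X, y)
    print("Learned-metric k-NN error:", err)
```

Because the k-NN error is piecewise constant in the metric parameters, the finite-difference step `c` in this sketch is deliberately large; with a very small `c` the estimated gradient would usually be zero. The cost of one FDSA estimate grows with the number of parameters (two error evaluations per entry of `L`), which matters for a full 39-dimensional MFCC metric.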
