Conventional speech recognition systems relying on exemplar-based sparse representation require a huge collection of exemplars to represent the linguistic units. Recent work demonstrates that, despite consistent improvements in automatic speech recognition performance, growing the exemplar collection beyond a certain (very large) size yields only minor gains [1]. This observation suggests the need for a better procedure to find a limited-size collection of exemplars that can be used for sparse representation. In the present study, the exemplars are neural network sub-word conditional posterior probabilities. In this context, we study the application of dictionary learning for sparse modeling. We demonstrate that the posterior exemplars live in a low-dimensional manifold that can be modeled as a union of subspaces. Furthermore, we evaluate the performance of dictionary learning for exemplar-based speech recognition to compare and contrast it with the traditional exemplar collection approach.

I. UNION OF SUBSPACES MODEL

Dictionary learning for sparse representation relies on the assumption that the data can be modeled as a union of subspaces. In this section, we provide supporting evidence that the neural network exemplars conform to this model. We perform a simple template matching experiment using dynamic time warping (DTW) on the 75-word vocabulary set of the Phonebook database [2]. The exemplars here are (deep) neural network based phone posterior probabilities [3]. Out of the 11 utterances available for each word, we keep 4 utterances as training templates and use the rest for testing. The 4 training utterances were used to create 15 combinatorial, 1- to 4-sparse templates for DTW matching by averaging after alignment:

1-sparse templates: {T_{U_1}, T_{U_2}, T_{U_3}, T_{U_4}}
2-sparse templates: {T_{U_1U_2}, T_{U_1U_3}, T_{U_1U_4}, T_{U_2U_3}, T_{U_2U_4}, T_{U_3U_4}}
3-sparse templates: {T_{U_1U_2U_3}, T_{U_2U_3U_4}, T_{U_1U_2U_4}, T_{U_1U_3U_4}}
4-sparse templates: {T_{U_1U_2U_3U_4}}    (1)

We then quantify the DTW distance of the test utterances to the newly constructed templates. The weighted symmetric Kullback-Leibler (KL) divergence is used as the distance measure, as it was shown to be an "optimal" metric for neural network exemplars [4]. A smaller distance indicates better characterization of the test utterance by the training data. This experiment is run on all test data and the results are listed in Table I. We observe that only 4.9% of the test utterances have the least characterization error using a single closest template (the DTW assumption). Moreover, only 9.7% are best characterized by the template obtained from averaging the full training set (the KL-hidden Markov model, KL-HMM, assumption [3]). On the other hand, all of the remaining 85.4% of the utterances have the least characterization error using templates obtained as a combination of a few (2 or 3) training exemplars. This observation confirms the hypothesis that a union of subspaces effectively models the neural network exemplars.
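As a rough illustration (not the exact implementation used in the experiment), the sketch below enumerates the combinatorial templates of Eq. (1) and scores test posteriorgrams against them with DTW, using a plain symmetric KL divergence as the local frame distance. The frame weighting of [4] and the DTW-based alignment prior to averaging are simplified here, and all function names are ours.

```python
import itertools
import numpy as np

def symmetric_kl(p, q, eps=1e-12):
    """Symmetric KL divergence between two posterior vectors p and q."""
    p = np.clip(p, eps, None)
    q = np.clip(q, eps, None)
    return float(np.sum((p - q) * (np.log(p) - np.log(q))))

def dtw_distance(X, Y):
    """DTW distance between posteriorgrams X (T1 x K) and Y (T2 x K),
    using the symmetric KL divergence as the local frame distance."""
    T1, T2 = len(X), len(Y)
    D = np.full((T1 + 1, T2 + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, T1 + 1):
        for j in range(1, T2 + 1):
            cost = symmetric_kl(X[i - 1], Y[j - 1])
            D[i, j] = cost + min(D[i - 1, j], D[i, j - 1], D[i - 1, j - 1])
    return D[T1, T2]

def combinatorial_templates(train_utts):
    """Enumerate the 15 one- to four-sparse templates of Eq. (1).
    For simplicity the utterances are assumed to be already aligned to a
    common length, so averaging is frame-wise; in the experiment the
    averaging is performed after DTW alignment."""
    templates = {}
    n = len(train_utts)                      # n = 4 in the Phonebook setup
    for k in range(1, n + 1):
        for combo in itertools.combinations(range(n), k):
            key = "T_U" + "U".join(str(i + 1) for i in combo)
            templates[key] = np.mean([train_utts[i] for i in combo], axis=0)
    return templates

def best_template(test_utt, templates):
    """Return the label of the template with the smallest DTW distance."""
    return min(templates, key=lambda k: dtw_distance(test_utt, templates[k]))
```

Counting how often the winning template returned by best_template is a 2- or 3-sparse combination gives the breakdown of the kind reported in Table I.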
Another experiment was conducted on the Numbers database [5], where a huge amount of training data is available. Instead of k-sparse template matching using DTW as in the Phonebook experiment, here we learn dictionaries on the order of ~1000 columns from the training data. We perform sparse recovery of the test data using these dictionaries and analyse the support size (number of nonzero coefficients) of the sparse representations. The results are illustrated in Figure 1. We observe that 31% of the test exemplars are represented by a single dictionary column, whereas 69% are characterized by a linear combination of a very small number of columns in their sparse representation. For dictionary learning and sparse recovery, the online dictionary learning algorithm [6] and a lasso solver [7] were used, respectively.

II. DICTIONARY LEARNING

Having confirmed that the union of subspaces model holds for neural network exemplars, we demonstrate experimentally that dictionary learning improves characterization of the feature space compared to a simple collection of all exemplars of the training set, while its cardinality remains far smaller than the collection size. In the isolated word recognition experiment on the Phonebook 75-word vocabulary set, a single exemplar is used as a warm start for dictionary initialization. The remaining 3 exemplars in the training set are then used to update the dictionary columns with the online dictionary learning algorithm [6]. Alternatively, the 4 training exemplars are concatenated to form a dictionary for sparse representation. A similar comparison was done for connected digit recognition on the Numbers database, where we can either learn word-specific dictionaries or directly represent each word using all training exemplars [8]. The results are listed in Table II. We can see that the dictionary learning procedure is quite effective: it benefits from the abundance of training data while keeping the dimensionality of the exemplar space small and at the same time improving performance. This observation confirms that dictionary learning is a more efficient approach to sparse representation than collecting exemplars.
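The dictionary learning and sparse recovery steps used in these experiments can be sketched as follows. This is a minimal illustration assuming scikit-learn's MiniBatchDictionaryLearning (an implementation of the online dictionary learning algorithm of [6]) together with its built-in lasso coder for sparse recovery; the posterior exemplars below are random stand-ins rather than Numbers or Phonebook data, and the hyperparameters are illustrative only.

```python
import numpy as np
from sklearn.decomposition import MiniBatchDictionaryLearning

# Stand-in posterior exemplars: each row is one frame of sub-word
# posterior probabilities (random Dirichlet samples for illustration).
rng = np.random.default_rng(0)
X_train = rng.dirichlet(np.ones(40), size=5000)   # training exemplars
X_test = rng.dirichlet(np.ones(40), size=500)     # test exemplars

# Learn a dictionary from the training exemplars (online updates, as in [6]).
dico = MiniBatchDictionaryLearning(
    n_components=1000,                 # ~1000 columns, as in the text
    alpha=0.1,                         # sparsity penalty during learning
    batch_size=256,
    transform_algorithm="lasso_lars",  # lasso-based sparse recovery
    transform_alpha=0.1,
    random_state=0,
)
dico.fit(X_train)

# Sparse recovery of the test exemplars and support-size statistics
# (number of nonzero coefficients per exemplar, cf. Figure 1).
codes = dico.transform(X_test)
support = np.count_nonzero(codes, axis=1)
print("fraction represented by a single column:", np.mean(support == 1))
print("median support size:", int(np.median(support)))
```

For the comparison of Section II, the same sparse coder can be run either with the learned dictionary or with the raw collection of training exemplars used directly as dictionary columns, which is the contrast summarized in Table II.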
REFERENCES

[1] Hong C. Leung et al., "PhoneBook: a phonetically-rich isolated-word telephone-speech database," 1995 International Conference on Acoustics, Speech, and Signal Processing, 1995.
[2] Louis ten Bosch et al., "Using sparse representations for exemplar based continuous digit recognition," 2009 17th European Signal Processing Conference, 2009.
[3] Hervé Bourlard et al., "Using KL-based acoustic models in a large vocabulary recognition task," INTERSPEECH, 2008.
[4] R. Tibshirani et al., "Least angle regression," math/0406456, 2004.
[5] Ronald A. Cole et al., "New telephone speech corpora at CSLU," EUROSPEECH, 1995.
[6] Guillermo Sapiro et al., "Online Learning for Matrix Factorization and Sparse Coding," J. Mach. Learn. Res., 2009.
[7] Hervé Bourlard et al., "Posterior features for template-based ASR," 2011 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2011.
[8] Tuomas Virtanen et al., "Exemplar-Based Sparse Representations for Noise Robust Automatic Speech Recognition," IEEE Transactions on Audio, Speech, and Language Processing, 2011.