An iterative approach to decision tree training for context-dependent speech synthesis

In speech synthesis with sparse training data, phonetic decision trees are frequently used to balance model complexity against the amount of available data. In the traditional training procedure, the decision trees are constructed only after the parameters for each phone have been optimized with the EM algorithm. This paper proposes an iterative re-optimization algorithm in which the decision tree is re-learned after every iteration of the EM algorithm. The performance of the new procedure is compared with that of the original procedure by training parameters for MFCC and F0 features using an EDHMM with data from the Boston University Radio Speech Corpus. A convergence proof is presented, and experiments demonstrate that iterative re-optimization yields statistically significant improvements in test-corpus log-likelihood.
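
To make the interleaving concrete, the sketch below alternates a Gaussian parameter re-estimation step with a greedy re-assignment of contexts to tied clusters. It is a toy stand-in, not the paper's method: 1-D Gaussians and likelihood-based reassignment replace the EDHMM observation model and phonetic decision-tree questions, and the cluster count, data, and helper names are all illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy data: observations grouped by context label (e.g. triphone contexts).
# In the paper's setting these would be MFCC/F0 frames modelled by an EDHMM;
# here each context simply has 1-D Gaussian samples (illustrative assumption).
contexts = {c: rng.normal(loc=rng.uniform(-3, 3), scale=1.0, size=50)
            for c in range(20)}

n_clusters = 4  # stand-in for the leaves of a phonetic decision tree


def estimate_params(assignment):
    """EM-style step: re-estimate a Gaussian per cluster (leaf) from its contexts."""
    params = {}
    for k in range(n_clusters):
        chunks = [contexts[c] for c, a in assignment.items() if a == k]
        data = np.concatenate(chunks) if chunks else np.zeros(1)
        params[k] = (data.mean(), max(data.var(), 1e-3))
    return params


def relearn_clustering(params):
    """Tree re-learning step: reassign each context to the leaf that maximises
    its data log-likelihood (a greedy stand-in for decision-tree construction)."""
    assignment = {}
    for c, data in contexts.items():
        def loglik(k):
            mu, var = params[k]
            return -0.5 * np.sum(np.log(2 * np.pi * var) + (data - mu) ** 2 / var)
        assignment[c] = max(range(n_clusters), key=loglik)
    return assignment


def total_loglik(assignment, params):
    """Total log-likelihood of all contexts under their assigned leaf Gaussians."""
    total = 0.0
    for c, data in contexts.items():
        mu, var = params[assignment[c]]
        total += -0.5 * np.sum(np.log(2 * np.pi * var) + (data - mu) ** 2 / var)
    return total


# Iterative re-optimization: interleave parameter estimation and re-learning of
# the tying, instead of building the tree once after parameters have converged.
assignment = {c: c % n_clusters for c in contexts}        # arbitrary initial tying
for it in range(10):
    params = estimate_params(assignment)                  # EM-style update
    assignment = relearn_clustering(params)               # re-learn the tying
    print(f"iter {it}: log-likelihood = {total_loglik(assignment, params):.1f}")
```

Because each of the two steps can only increase (or leave unchanged) the total log-likelihood in this hard-assignment toy, the loop converges, mirroring in spirit the convergence argument the abstract refers to.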