Speech processing with linear and neural network models

This dissertation investigates aspects of speech processing using linear models and single-hidden-layer neural networks. The study is divided into two parts, which focus on speech modelling and speech classification respectively.

The first part of the dissertation examines linear and nonlinear vocal tract models for synthesising high-quality speech with adjustable pitch. A source-filter framework for analysis and synthesis is used, in which the source is a representation of the glottal volume velocity waveform. Two families of linear model are considered: ARX (autoregressive with external input) and OE (output error). Their performance in estimating vocal tract transfer functions is compared on synthetic speech data, and the difference is explained in terms of the parameter estimation procedure, the frequency distribution of bias in the estimate, and the assumptions about the spectrum of the noise in the vocal tract system. The noise spectrum for ARX models is shown to be perceptually significant for speech synthesis applications because it exploits auditory masking. Methods for improving poor-quality syntheses from OE models are proposed. Nonlinear vocal tract models, implemented as feed-forward or recurrent neural networks, are then investigated. Methods for initialising networks from linear models are developed, and a modified recurrent architecture is introduced which permits initialisation from ARX models. The use of regularization, for imposing continuity between models of adjacent speech segments, and of learning-rate adaptation, for improving back-propagation training, is discussed. For synthesising real speech utterances, an audio tape demonstrates that ARX models produce the highest-quality synthetic speech and that the quality is maintained when pitch modifications are applied.

The second part of the dissertation studies the operation of recurrent neural networks in classifying patterns of correlated feature vectors, which are typical of speech classification tasks. The operation of a hidden node with a recurrent connection is explained in terms of a decision boundary which changes position in feature space. The feedback is shown to delay switching from one class to another and to smooth output decisions for sequences of feature vectors from the same class. For networks trained with constant class targets, a sequence of feature vectors from the same class tends to drive the operation of hidden nodes into saturation. It is demonstrated that saturation defines limits on the position of the decision boundary, resulting in context-sensitive and context-insensitive regions of the feature space. While saturation persists, it is shown that networks have reduced sensitivity to the order …
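To make the ARX/OE comparison concrete, the sketch below fits an ARX vocal tract model A(q)y[n] = B(q)u[n] + e[n] by linear least squares to a toy pulse-driven all-pole signal. This is a minimal illustration only, not the dissertation's implementation: the model orders, pole positions, noise level, and crude pulse-train source are all assumed for the example.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 2000
na, nb = 10, 2                         # assumed AR and input-polynomial orders

# Toy "true" vocal tract: two formant-like resonances (assumed pole positions).
poles = np.array([0.95 * np.exp(1j * 0.3), 0.90 * np.exp(1j * 1.2)])
a_true = np.poly(np.concatenate([poles, poles.conj()])).real  # [1, a1, ..., a4]

# Crude glottal source: a pulse train (assumption, stands in for the glottal waveform).
u = np.zeros(n)
u[::80] = 1.0

# Simulate speech-like output with equation noise, matching the ARX noise model.
y = np.zeros(n)
for t in range(n):
    ar = sum(-a_true[k] * y[t - k] for k in range(1, len(a_true)) if t - k >= 0)
    y[t] = ar + u[t] + 0.01 * rng.standard_normal()

# ARX regression: y[t] ≈ -a1*y[t-1] - ... - a_na*y[t-na] + b0*u[t] + b1*u[t-1]
start = na
Phi = np.array([[-y[t - k] for k in range(1, na + 1)] +
                [u[t - k] for k in range(nb)] for t in range(start, n)])
theta, *_ = np.linalg.lstsq(Phi, y[start:], rcond=None)

a_hat = np.concatenate(([1.0], theta[:na]))
b_hat = theta[na:]
print("estimated AR coefficients:", np.round(a_hat[:5], 3))
print("true AR coefficients:     ", np.round(a_true, 3))
```

The ARX criterion is linear in the parameters because the noise is assumed to enter through the equation error, so a single least-squares solve suffices; an OE model instead minimises the error of the simulated output, which requires iterative nonlinear optimisation. The different noise assumptions behind the two criteria are what the bias-distribution and auditory-masking arguments in the abstract refer to.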
