A phonetically labeled acoustic segment (PLAS) approach to speech analysis-synthesis

A phonetically labeled acoustic segment (PLAS) approach is proposed for speech analysis-synthesis. The goal is a unified framework for general speech processing built on a bidirectional, context-constrained mapping between a phonetic space and an acoustic space. The PLAS analysis module is a continuous phone (phoneme) recognizer; the PLAS synthesis module is a phonetically organized acoustic segment database. To regulate the proposed mapping in a phonetically structured manner, phone context dependency is imposed in phone modeling, recognition, and synthesis. The PLAS approach was tested successfully on a database of continuously spoken Japanese utterances recorded by a single male talker. Automatic segmentation boundaries derived from the PLAS unit models agreed well with the corresponding manual segmentation points: 95% fell within a ±20 ms interval. A 4% phoneme recognition error rate was obtained in a continuous recognition test, and natural-sounding speech was synthesized at an average bit rate of 55 b/s allocated to segmental information.
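The core idea above, a phonetically organized acoustic database indexed by context-dependent phone units, can be illustrated with a minimal sketch. This is not the paper's implementation: the triphone label format (`left-phone+right`), the backoff to context-independent phones, and all names here are illustrative assumptions, and the "acoustic segments" are opaque placeholders standing in for stored spectral parameters.

```python
# Illustrative sketch (not the paper's actual system): a phonetically
# organized segment inventory keyed by context-dependent phone units
# (triphone-style labels), with backoff to context-independent phones.
# Synthesis maps a phone string to a concatenated sequence of stored
# segments; the segment payloads are placeholders for acoustic data.

def make_unit(left, phone, right):
    """Context-dependent unit label, e.g. 's-a+k' (assumed notation)."""
    return f"{left}-{phone}+{right}"

class SegmentDatabase:
    def __init__(self):
        self.context_dependent = {}    # 'l-p+r' -> acoustic segment
        self.context_independent = {}  # 'p'     -> fallback segment

    def add(self, left, phone, right, segment):
        """Store a segment under its context-dependent label."""
        self.context_dependent[make_unit(left, phone, right)] = segment
        # Keep one context-independent exemplar per phone as a fallback.
        self.context_independent.setdefault(phone, segment)

    def lookup(self, left, phone, right):
        # Prefer the context-dependent unit; back off if it is missing.
        key = make_unit(left, phone, right)
        return self.context_dependent.get(key,
                                          self.context_independent.get(phone))

def synthesize(db, phones, sil="#"):
    """Map a phone string to the segment sequence to be concatenated."""
    padded = [sil] + list(phones) + [sil]  # pad with silence context
    segments = []
    for i in range(1, len(padded) - 1):
        seg = db.lookup(padded[i - 1], padded[i], padded[i + 1])
        if seg is not None:
            segments.append(seg)
    return segments
```

In this toy form, the analysis module would be the inverse mapping (acoustic input to a phone string via a recognizer), so only the phone labels need to be transmitted, which is what allows the very low segmental bit rate reported in the abstract.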