A schema-based model for phonemic restoration

Phonemic restoration is the perceptual synthesis of masked phonemes, which listeners hear when an appropriate replacement sound covers them, guided by linguistic context. Existing models of acoustic phoneme restoration rely only on temporal continuity; as a result they restore unvoiced phonemes poorly and voiced phonemes only to a limited extent. We present a schema-based model for phonemic restoration. The model employs a missing-data speech recognition system to decode the speech from its intact portions and activates word templates corresponding to the words containing the masked phonemes. An activated template is dynamically time warped to the noisy word and is then used to restore the speech frames corresponding to the masked phoneme, thereby synthesizing it. The model restores both voiced and unvoiced phonemes with a high degree of naturalness, and systematic testing shows that it outperforms a Kalman-filter based model.
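The core restoration step described above can be illustrated with a minimal sketch: a standard dynamic time warping alignment between an activated word template and the noisy word, in which masked frames contribute zero local cost (so the path through them is driven by the reliable frames on either side), followed by copying the aligned template frames into the masked positions. This is a simplified illustration, not the paper's implementation; the frame representation, the zero-cost handling of masked frames, and the `dtw_path`/`restore` helpers are assumptions for the example.

```python
import numpy as np

def dtw_path(template, noisy, mask):
    """Align template frames to noisy frames with dynamic time warping.

    mask[j] is True where noisy frame j is masked; masked frames get zero
    local cost, so only reliable frames constrain the warping path.
    """
    T, N = len(template), len(noisy)
    D = np.full((T + 1, N + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, T + 1):
        for j in range(1, N + 1):
            cost = 0.0 if mask[j - 1] else np.linalg.norm(template[i - 1] - noisy[j - 1])
            D[i, j] = cost + min(D[i - 1, j], D[i, j - 1], D[i - 1, j - 1])
    # Backtrack from the end of both sequences to recover the path.
    path, i, j = [], T, N
    while i > 0 and j > 0:
        path.append((i - 1, j - 1))
        step = int(np.argmin([D[i - 1, j - 1], D[i - 1, j], D[i, j - 1]]))
        if step == 0:
            i, j = i - 1, j - 1
        elif step == 1:
            i -= 1
        else:
            j -= 1
    return path[::-1]

def restore(template, noisy, mask):
    """Replace each masked noisy frame with its aligned template frame."""
    restored = noisy.copy()
    for i, j in dtw_path(template, noisy, mask):
        if mask[j]:
            restored[j] = template[i]
    return restored

# Toy 1-D "spectral" frames: frame 1 of the noisy word is masked by noise.
template = np.array([[0.0], [1.0], [2.0], [3.0]])
noisy = np.array([[0.0], [9.0], [2.0], [3.0]])
mask = [False, True, False, False]
out = restore(template, noisy, mask)  # masked frame filled from the template
```

In the toy example, the reliable frames pin the alignment to the diagonal, so the masked frame is filled with the template frame at the same position; with real spectral vectors the warping absorbs duration differences between the template and the utterance.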
