Modification of Audible and Visual Speech

Speech is one of the most common and richest ways that people communicate with one another. Our facility with this form of communication makes speech a good interface for communicating with, or via, computers. At the same time, our familiarity with speech makes it difficult to generate synthetic but natural-sounding speech and synthetic but natural-looking, lip-synced faces. One way to reduce the apparent unnaturalness of synthetic audible and visual speech is to modify natural (human-produced) speech. This approach relies on examples of natural speech and on simple models for taking those examples apart and putting them back together into new utterances. We discuss two such techniques in depth. The first, Mach1, changes the overall timing of an utterance with little loss of comprehensibility and with no change in the wording, the emphasis, or the identity of the voice. This ability to speed up (or slow down) speech makes speech a more malleable channel of communication: it gives the listener control over the amount of time she spends on a given presentation, even if that material is prerecorded. The second technique, Video Rewrite, synthesizes images of faces, lip-synced to a given utterance. This tool could be useful for reducing the data rate of video conferencing [31], as well as for providing photorealistic avatars.
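To make the time-scale idea concrete, the sketch below implements the classic synchronized overlap-add (SOLA) primitive on which time-scale modifiers such as Mach1 are built. It is an illustration under assumptions, not the Mach1 algorithm itself: Mach1 compresses different parts of an utterance at different rates based on local audio tension, whereas this sketch applies one uniform rate, and the function name time_scale and its parameters are ours, not the paper's.

    # A minimal, uniform-rate SOLA sketch (not Mach1's nonuniform scheme).
    # Assumes a mono signal `x` as a NumPy float array; `rate` > 1 speeds
    # speech up, `rate` < 1 slows it down, while pitch is preserved.
    import numpy as np

    def time_scale(x, rate, frame=1024, overlap=256, search=128):
        hop_in = int((frame - overlap) * rate)   # input advance per output frame
        out = x[:frame].astype(float).tolist()   # seed output with first frame
        pos = hop_in
        while pos - search >= 0 and pos + frame + search < len(x):
            tail = np.array(out[-overlap:])
            # Slide +/- `search` samples to find the input offset whose
            # waveform best matches the current output tail; aligning by
            # cross-correlation avoids audible phase discontinuities.
            offsets = list(range(-search, search + 1))
            scores = [np.dot(tail, x[pos + k : pos + k + overlap])
                      for k in offsets]
            k = offsets[int(np.argmax(scores))]
            seg = x[pos + k : pos + k + frame].astype(float)
            # Cross-fade the overlapping region, then append the remainder.
            ramp = np.linspace(0.0, 1.0, overlap)
            out[-overlap:] = (1.0 - ramp) * tail + ramp * seg[:overlap]
            out.extend(seg[overlap:])
            pos += hop_in
        return np.array(out)

    # Example (hypothetical 16 kHz recording): y = time_scale(x, 1.5)
    # plays back in roughly two-thirds of the original duration.

Because frames are realigned by correlation before the cross-fade, local pitch periods stay intact, which is why the voice's identity and pitch survive the change in duration.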

[1] Barry Arons, et al. The audio notebook: paper and pen interaction with structured speech, 2001, CHI.

[2] Malcolm Slaney, et al. MACH1: nonuniform time-scale modification of speech, 1998, ICASSP '98.

[3] Malcolm Slaney, et al. Baby Ears: a recognition system for affective vocalizations, 1998, ICASSP '98.

[4] Tsuhan Chen, et al. Audio-to-visual conversion for multimedia communication, 1998, IEEE Trans. Ind. Electron.

[5] Levent M. Arslan, et al. Voice conversion by codebook mapping of line spectral frequencies and excitation spectrum, 1997, EUROSPEECH.

[6] Christoph Bregler, et al. Video Rewrite: Driving Visual Speech with Audio, 1997, SIGGRAPH.

[7] Hyung Soon Kim, et al. Variable time-scale modification of speech using transient information, 1997, ICASSP '97.

[8] Christoph Bregler, et al. Video Rewrite: visual speech synthesis from video, 1997, AVSP.

[9] Malcolm Slaney, et al. Automatic audio morphing, 1996, ICASSP '96.

[10] Bryan Holloway, et al. Timbre morphing of sounds with unequal numbers of features, 1995.

[11] Timothy F. Cootes, et al. A unified approach to coding and interpreting face images, 1995, ICCV.

[12] John R. Wright, et al. Synthesis of Speaker Facial Movement to Match Selected Speech Sequences, 1994.

[13] Lance Williams, et al. Animating images with drawings, 1994, SIGGRAPH.

[14] Jan P. H. van Santen, et al. Assignment of segmental duration in text-to-speech synthesis, 1994, Comput. Speech Lang.

[15] Christian Benoît, et al. A 3-D model of the lips for visual speech synthesis, 1994, SSW.

[16] Barry Arons, et al. Interactively skimming recorded speech, 1994.

[17] Catherine Fulford. Can learning be more efficient? Using compressed speech audio tapes to enhance systematically designed text, 1993.

[18] Michael M. Cohen, et al. Modeling Coarticulation in Synthetic Visual Speech, 1993.

[19] Jan P. H. van Santen, et al. Contextual effects on vowel duration, 1992, Speech Commun.

[20] Thaddeus Beier, et al. Feature-based image metamorphosis, 1992, SIGGRAPH.

[21] P. Anandan, et al. Hierarchical Model-Based Motion Estimation, 1992, ECCV.

[22] Francine R. Chen, et al. The use of emphasis to automatically summarize a spoken discourse, 1992, ICASSP-92.

[23] John Lewis, et al. Automated lip-sync: Background and techniques, 1991, Comput. Animat. Virtual Worlds.

[24] Hiroshi Harashima, et al. A Media Conversion from Speech to Facial Image for Intelligent Man-Machine Interface, 1991, IEEE J. Sel. Areas Commun.

[25] M. Turk, et al. Eigenfaces for Recognition, 1991, Journal of Cognitive Neuroscience.

[26] Lance Williams, et al. Performance-driven facial animation, 1990, SIGGRAPH.

[27] J. L. Le Saint-Milon, et al. A real-time French text-to-speech system generating high-quality synthetic speech, 1990, ICASSP '90.

[28] Lawrence Sirovich, et al. Application of the Karhunen-Loeve Procedure for the Characterization of Human Faces, 1990, IEEE Trans. Pattern Anal. Mach. Intell.

[29] C. Mills, et al. Listening Rate and Comprehension as a Function of Preference for and Exposure to Time-Altered Speech, 1989, Perceptual and Motor Skills.

[30] Ralph R. Behnke, et al. The Effect of Time-Compressed Speech on Comprehensive, Interpretive, and Short-Term Listening, 1989.

[31] Lawrence R. Rabiner, et al. A tutorial on hidden Markov models and selected applications in speech recognition, 1989, Proc. IEEE.

[32] S. Furui. On the role of spectral transition for speech perception, 1986, The Journal of the Acoustical Society of America.

[33] E. Owens, et al. Visemes observed by hearing-impaired and normal-hearing adult viewers, 1985, Journal of Speech and Hearing Research.

[34] A. Wilgus, et al. High quality time-scale modification for speech, 1985, ICASSP '85.

[35] Edward H. Adelson, et al. A multiresolution spline with application to image mosaics, 1983, ACM Trans. Graph.

[36] C. Pollard, et al. Center for the Study of Language and Information, 2022.

[37] K. Stevens. Acoustic correlates of some phonetic categories, 1979, The Journal of the Acoustical Society of America.

[38] Frederick I. Parke, et al. Computer generated animation of faces, 1972, ACM Annual Conference.

[39] Andrew J. Viterbi, et al. Error bounds for convolutional codes and an asymptotically optimum decoding algorithm, 1967, IEEE Trans. Inf. Theory.

[40] T. Gold. Hearing, 1953, Trans. IRE Prof. Group Inf. Theory.