Interactively skimming recorded speech

Listening to a speech recording is much more difficult than visually scanning a document because of the transient and temporal nature of audio. Audio recordings capture the richness of speech, yet it is difficult to directly browse the stored information. This dissertation investigates techniques for structuring, filtering, and presenting recorded speech, allowing a user to navigate and interactively find information in the audio domain. This research makes it easier and more efficient to listen to recorded speech by using the SpeechSkimmer system. First, this dissertation describes Hyperspeech, a speech-only hypermedia system that explores issues of speech user interfaces, browsing, and the use of speech as data in an environment without a visual display. The system uses speech recognition input and synthetic speech feedback to aid in navigating through a database of digitally recorded speech. This system illustrates that managing and moving in time are crucial in speech interfaces. Hyperspeech uses manually segmented and structured speech recordings, a technique that is practical only in limited domains. Second, to overcome the limitations of Hyperspeech while retaining browsing capabilities, a variety of speech analysis and user interface techniques are explored. This research exploits properties of spontaneous speech to automatically select and present salient audio segments in a time-efficient manner. Two speech processing technologies, time compression and adaptive speech detection (to find hesitations and pauses), are reviewed in detail with a focus on techniques applicable to extracting and displaying speech information. Finally, this dissertation describes SpeechSkimmer, a user interface for interactively skimming speech recordings. SpeechSkimmer uses simple speech processing techniques to allow a user to hear recorded sounds quickly and at several levels of detail. User interaction, through a manual input device, provides continuous real-time control of the speed and detail level of the audio presentation. SpeechSkimmer incorporates time-compressed speech, pause removal, automatic emphasis detection, and non-speech audio feedback to reduce the time needed to listen. This dissertation presents a multi-level structural approach to auditory skimming, and user interface techniques for interacting with recorded speech. (Copies available exclusively from MIT Libraries, Rm. 14-0551, Cambridge, MA 02139-4307. Ph. 617-253-5668; Fax 617-253-1690.)
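
To make two of the techniques named above concrete, the following is a minimal illustrative sketch of pause removal with an adaptive energy threshold, followed by a simple sampling-style time compression. It is not SpeechSkimmer's implementation: the function names, the frame length, the 10th-percentile noise-floor estimate, the `margin` parameter, and the keep/discard interval lengths are all assumptions chosen only for illustration.

```python
def frame_energies(samples, frame_len):
    """Mean-square energy of each full, non-overlapping frame."""
    energies = []
    for start in range(0, len(samples) - frame_len + 1, frame_len):
        frame = samples[start:start + frame_len]
        energies.append(sum(x * x for x in frame) / frame_len)
    return energies


def adaptive_threshold(energies, margin=0.1):
    """Threshold placed a fraction `margin` of the way from floor to peak.

    Assumption: the noise floor is approximated by roughly the 10th
    percentile of the frame energies in this recording.
    """
    ordered = sorted(energies)
    floor = ordered[len(ordered) // 10]
    peak = ordered[-1]
    return floor + margin * (peak - floor)


def remove_pauses(samples, frame_len=10, max_pause_frames=5, margin=0.1):
    """Shorten every silent run to at most `max_pause_frames` frames."""
    energies = frame_energies(samples, frame_len)
    if not energies:
        return list(samples)
    threshold = adaptive_threshold(energies, margin)
    out = []
    silent_run = 0
    for i, energy in enumerate(energies):
        # ">=" keeps everything when all frames have equal energy.
        if energy >= threshold:
            silent_run = 0
        else:
            silent_run += 1
            if silent_run > max_pause_frames:
                continue  # drop frames beyond the retained pause
        out.extend(samples[i * frame_len:(i + 1) * frame_len])
    out.extend(samples[len(energies) * frame_len:])  # keep any partial tail
    return out


def sampling_compress(samples, keep=30, discard=10):
    """Keep `keep` samples, skip `discard` samples, and repeat.

    Segments are butted together with no cross-fade, so this sketch
    would produce audible discontinuities on real audio.
    """
    out = []
    step = keep + discard
    for start in range(0, len(samples), step):
        out.extend(samples[start:start + keep])
    return out


if __name__ == "__main__":
    loud = [1.0, -1.0] * 50                 # 100 samples of "speech"
    signal = loud + [0.0] * 400 + loud      # long pause in the middle
    shortened = remove_pauses(signal)
    faster = sampling_compress(shortened)
    print(len(signal), len(shortened), len(faster))
```

In this sketch, pause removal shortens long silences rather than deleting them entirely, so some of the pause is retained, and the sampling step then discards a fixed fraction of what remains. A practical system would smooth the segment boundaries and would need a more robust speech detector than a single energy threshold.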
