SpeechSkimmer: interactively skimming recorded speech

Skimming or browsing audio recordings is much more difficult than visually scanning a document because of the temporal nature of audio. By exploiting properties of spontaneous speech, it is possible to automatically select and present salient audio segments in a time-efficient manner. Techniques for segmenting recordings and a prototype user interface for skimming speech are described. The system developed incorporates time-compressed speech and pause removal to reduce the time needed to listen to speech recordings. This paper presents a multi-level approach to auditory skimming, along with user interface techniques for interacting with the audio and providing feedback. Several time compression algorithms and an adaptive speech detection technique are also summarized.
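As a rough illustration of the pause removal idea mentioned above, the following minimal sketch shortens long silent runs in a sampled signal. It is not the system's actual method: the function name `shorten_pauses`, the fixed amplitude threshold, the frame length, and the `max_silent_frames` parameter are all assumptions made for this example, and a fixed threshold stands in for the adaptive speech detection technique the paper summarizes.

```python
def shorten_pauses(samples, frame_len=4, threshold=0.1, max_silent_frames=1):
    """Keep at most `max_silent_frames` frames from each run of silent frames.

    Hypothetical helper: a frame counts as silent when its mean absolute
    amplitude falls below a fixed `threshold` (an assumption, not the
    adaptive detector described in the paper).
    """
    out = []
    silent_run = 0  # number of consecutive silent frames seen so far
    for start in range(0, len(samples), frame_len):
        frame = samples[start:start + frame_len]
        level = sum(abs(s) for s in frame) / len(frame)
        if level < threshold:
            silent_run += 1
            if silent_run > max_silent_frames:
                continue  # drop the excess part of a long pause
        else:
            silent_run = 0
        out.extend(frame)
    return out


# Usage: one speech frame, three silent frames, one speech frame.
signal = [0.5] * 4 + [0.0] * 12 + [0.5] * 4
shortened = shorten_pauses(signal)
```

Keeping a short residue of each pause, rather than deleting it entirely, is a design choice made here so that phrase boundaries remain audible; setting `max_silent_frames=0` removes silent frames altogether.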
