Principles of electronic speech processing with applications for people with disabilities

The first part of this paper explains the principles and the state of the art of speech processing, especially speech synthesis and recognition. A speech-based human-computer dialogue system is then discussed. The next section gives a brief overview of the available recommendations, guidelines and standards directly related to the application of speech technologies. The last part of the paper is dedicated to applications of speech technology for people with disabilities, with the main focus on blind and partially sighted people and those with hearing loss. For the blind, many multilingual (and some polyglot) text-to-speech synthesis systems exist that can convert printed and electronic documents to audio, but further research is needed before structured text, tables and, above all, graphics can be efficiently transformed into speech. For deaf persons, major challenges remain in the development of adequate communication aids: although state-of-the-art speech recognizers can transform speech into text at high speed (enabling near real-time information transfer from a hearing to a deaf person), the automatic gesture recognition needed for the reverse transfer is still at the research stage. Other applications discussed in this paper include speech-based cursor control for people with physical disabilities, the transformation of dysarthric speech into intelligible speech, voice output communication aids for those with language impairments or without speech, and accessibility options for public terminals and automated teller machines through the incorporation of speech technologies. The paper concludes with an outlook and recommendations for research areas that need further study.
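To make the speech-to-text scenario mentioned above concrete, the following is a minimal illustrative sketch of near real-time captioning, in the spirit of the hearing-to-deaf information transfer the paper describes. It is not from the paper itself: it assumes the third-party Python SpeechRecognition library, a working microphone, and Google's free web recognition backend; any state-of-the-art recognizer could be substituted.

```python
# Illustrative sketch only, not the paper's method. Assumes the
# SpeechRecognition package (pip install SpeechRecognition) and PyAudio.
import speech_recognition as sr


def live_captioning():
    """Continuously transcribe microphone speech and print the text,
    approximating near real-time captioning for a deaf listener."""
    recognizer = sr.Recognizer()
    with sr.Microphone() as source:
        # Calibrate once for ambient noise before listening.
        recognizer.adjust_for_ambient_noise(source, duration=1)
        print("Listening... (Ctrl+C to stop)")
        while True:
            # Capture short phrases so captions appear with low latency.
            audio = recognizer.listen(source, phrase_time_limit=5)
            try:
                # Google's free web API; swap in any ASR backend here.
                text = recognizer.recognize_google(audio)
                print(text)
            except sr.UnknownValueError:
                pass  # Unintelligible chunk; skip and keep listening.


if __name__ == "__main__":
    live_captioning()
```

The chunked, phrase-by-phrase loop trades a few seconds of latency for robustness; a production communication aid would instead stream audio to a recognizer that emits partial hypotheses continuously.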
