Computational Auditory Scene Analysis and Its Application to Robot Audition: Five Years Experience

We have been engaged in research on computational auditory scene analysis to achieve sophisticated human-robot and human-computer interaction by processing real-world sound signals. The objective of our research is to understand an arbitrary sound mixture, including non-speech sounds and music as well as voiced speech, captured by a robot's ears, that is, microphones embedded in the robot. We have addressed three main issues in computational auditory scene analysis: sound source localization, sound source separation, and recognition of the separated sounds, for mixtures of speech signals as well as polyphonic music signals. This paper gives an overview of our results in robot audition, in particular the integration of sound source separation and automatic speech recognition based on missing feature theory, and of our results in music information processing, in particular the drum sound equalizer
