An embedded audio-visual tracking and speech purification system on a dual-core processor platform

This paper describes the design of an embedded audio-visual tracking and speech purification system. The system performs human face tracking, voice activity detection, sound source direction estimation, and speech enhancement in real time. Estimating the sound source direction helps to initialize the human face tracking module when the target changes direction. The implementation is based on an embedded dual-core processor platform, the Texas Instruments DM6446 (DaVinci), which contains an ARM core and a DSP core. For speech signal processing, an eight-channel digital microphone array was developed, and the associated pre-processing and interfacing features are implemented on an Altera Cyclone II FPGA. All experiments were conducted in a real environment, and the results show that the system can execute all of its audition and vision functions in real time.
