Robust anchorperson detection based on audio streams using a hybrid I-vector and DNN system

Anchorperson segment detection enables efficient video content indexing for information retrieval. Anchorperson detection based on audio analysis has gained popularity because of its lower computational complexity and satisfactory performance. This paper presents a robust framework that uses a hybrid I-vector and deep neural network (DNN) system to perform anchorperson detection on the audio streams of video content. The proposed system first applies I-vector extraction to obtain speaker identity features from the audio data. A DNN classifier then uses these features to verify the claimed anchorperson identity. In addition, subspace feature normalization (SFN) is incorporated into the hybrid system for robust feature extraction, compensating for the audio mismatch caused by different recording devices. An anchorperson verification experiment was conducted to evaluate the equal error rate (EER) of the proposed hybrid system. Experimental results demonstrate that the proposed system outperforms the state-of-the-art hybrid I-vector and support vector machine (SVM) system. Moreover, integrating SFN further improved the proposed system by effectively compensating for audio mismatch in anchorperson detection tasks.
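
The equal error rate used as the evaluation metric can be illustrated with a minimal sketch. It assumes each verification trial yields a scalar score (for example, a classifier output for the claimed anchorperson) and that a trial is accepted when its score reaches a threshold. The function name `equal_error_rate` and the toy scores are hypothetical and are not taken from the paper's experiments.

```python
def equal_error_rate(target_scores, impostor_scores):
    """Approximate the EER by sweeping thresholds over the observed scores.

    target_scores:   scores of trials where the claimed identity is correct.
    impostor_scores: scores of trials where the claimed identity is wrong.
    Returns (eer, threshold) at the threshold where the false-accept rate
    and the false-reject rate are closest to each other.
    """
    if not target_scores or not impostor_scores:
        raise ValueError("both score lists must be non-empty")

    best = None
    # Candidate thresholds: every distinct observed score, in ascending order.
    for t in sorted(set(target_scores) | set(impostor_scores)):
        # False accept: an impostor trial scoring at or above the threshold.
        far = sum(s >= t for s in impostor_scores) / len(impostor_scores)
        # False reject: a genuine trial scoring below the threshold.
        frr = sum(s < t for s in target_scores) / len(target_scores)
        gap = abs(far - frr)
        if best is None or gap < best[0]:
            best = (gap, (far + frr) / 2.0, t)
    return best[1], best[2]


# Toy, made-up scores for illustration only.
eer, threshold = equal_error_rate([0.9, 0.8, 0.7, 0.4], [0.6, 0.3, 0.2, 0.1])
print(f"EER = {eer:.2f} at threshold {threshold}")
```

A lower EER means the system can reach a better balance between wrongly accepting other speakers and wrongly rejecting the anchorperson. With a finite set of trials the two error rates may never be exactly equal, so this sketch reports their average at the closest point.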
