Audiovisual speaker indexing for Web-TV automations

Abstract: This paper introduces a multimodal framework that provides Web-TV automations for live broadcasting and for managing big streaming data in general. Here, indexing refers to the spatiotemporal localization of speakers participating in a discussion panel. Multiple modalities operate in parallel to form a data-driven decision-making pipeline. The automated workflow covers active speaker detection and localization, frame selection, and the creation of a semantically annotated database. For improved performance and robustness, an information fusion model is proposed that combines different audio and visual modalities. Audio-driven Voice Activity Detection follows the Enhanced Temporal Integration methodology, applied to a standard audio feature set. To localize the dominant audio source, the time lag that maximizes the Generalized Cross-Correlation function is calculated. The visual modalities include face and mouth detection and Visual Voice Activity Detection. A Long Short-Term Memory network is trained on mouth image sequences to determine voice activity. The outputs of the audio and visual Voice Activity Detection modules, together with the Generalized Cross-Correlation result, are used to train an Adaptive Neuro-Fuzzy model, which makes the final decision. Experimental results indicate that the information fusion approach outperforms the unimodal audio and visual models.
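The localization step described in the abstract, taking the argument that maximizes the cross-correlation between microphone signals, can be illustrated with a minimal sketch. It assumes a two-microphone input and the PHAT weighting, neither of which the abstract specifies; the function name `gcc_phat_delay` and its parameters are hypothetical and are not taken from the paper.

```python
import numpy as np


def gcc_phat_delay(sig, ref, fs, max_tau=None):
    """Estimate the delay (in seconds) of `sig` relative to `ref`.

    Illustrative GCC with PHAT weighting (an assumption; the paper's
    weighting function is not stated in the abstract).
    """
    # Zero-pad so the circular correlation behaves like a linear one.
    n = len(sig) + len(ref)
    sig_f = np.fft.rfft(sig, n=n)
    ref_f = np.fft.rfft(ref, n=n)

    # Cross-power spectrum, normalized to unit magnitude (PHAT):
    # only the phase, which carries the delay, is kept.
    cross = sig_f * np.conj(ref_f)
    cross /= np.abs(cross) + 1e-12
    cc = np.fft.irfft(cross, n=n)

    # Restrict the search to physically plausible lags if requested.
    max_shift = n // 2
    if max_tau is not None:
        max_shift = max(1, min(int(fs * max_tau), max_shift))

    # Reorder so that index 0 corresponds to lag -max_shift.
    cc = np.concatenate((cc[-max_shift:], cc[:max_shift + 1]))

    # The argument that maximizes the correlation is the estimated lag.
    lag = int(np.argmax(np.abs(cc))) - max_shift
    return lag / fs
```

A positive result means `sig` arrives later than `ref`. With a known microphone spacing, the delay could then be mapped to a direction of arrival, which in turn could serve as one of the inputs to a fusion stage such as the one the abstract describes.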
