Audio-visual speaker identification with multi-view distance metric learning

Both audio and visual information can be useful for speaker identification in videos. This paper proposes an audio-visual speaker identification approach based on multi-view distance metric learning. Our metric learning scheme builds distance measures using not only the label information of the training data but also the consistency between different views. In this way, better metrics can be learned than by learning a metric for each view individually. We conduct experiments on the VidTIMIT dataset, and the empirical results demonstrate the effectiveness of our approach over a set of existing methods. We also apply our method to a multi-view digit recognition task and obtain encouraging results.
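The abstract does not state the objective function, so the following is only a minimal sketch of how a multi-view metric learning objective of this kind could be written, not the authors' formulation. It assumes one linear transform per view (a Mahalanobis metric M = L^T L), a simple pull/push supervised term driven by labels, and a consistency term that penalizes disagreement between the pairwise distances of different views. The function names, the `margin` and `lam` parameters, and the exact form of each term are all hypothetical.

```python
import numpy as np


def pairwise_sq_dists(X, L):
    """Squared distances between all sample pairs under the metric M = L^T L."""
    Z = X @ L.T                                # project samples, shape (n, k)
    diff = Z[:, None, :] - Z[None, :, :]       # shape (n, n, k)
    return np.sum(diff ** 2, axis=-1)          # shape (n, n)


def multiview_objective(views, transforms, labels, margin=1.0, lam=0.5):
    """Hypothetical objective: per-view supervised loss + cross-view consistency.

    views      : list of (n, d_v) arrays, one per view (e.g. audio, visual)
    transforms : list of (k_v, d_v) arrays, the linear map L_v for each view
    labels     : length-n sequence of class labels (e.g. speaker identities)
    """
    labels = np.asarray(labels)
    same = labels[:, None] == labels[None, :]
    off_diag = ~np.eye(len(labels), dtype=bool)

    dists = [pairwise_sq_dists(X, L) for X, L in zip(views, transforms)]

    # Supervised term (assumed form): pull same-class pairs together and
    # push different-class pairs at least `margin` apart, in every view.
    supervised = 0.0
    for D in dists:
        pull = D[same & off_diag].sum()
        push = np.maximum(0.0, margin - D[~same]).sum()
        supervised += pull + push

    # Consistency term (assumed form): pairwise distances should agree
    # across views for the same pair of samples.
    consistency = 0.0
    for a in range(len(dists)):
        for b in range(a + 1, len(dists)):
            consistency += np.sum((dists[a] - dists[b]) ** 2)

    return supervised + lam * consistency, supervised, consistency


# Usage with synthetic data standing in for audio and visual features.
rng = np.random.default_rng(0)
audio = rng.normal(size=(6, 4))
visual = rng.normal(size=(6, 3))
speakers = [0, 0, 1, 1, 2, 2]
total, sup, cons = multiview_objective(
    [audio, visual], [np.eye(4), np.eye(3)], speakers
)
print(total, sup, cons)
```

In this sketch the views may have different feature dimensions because the consistency term compares pairwise distances rather than the features themselves. Setting `lam` to zero reduces the objective to learning each view's metric independently, which is the baseline the abstract compares against.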
