Face clustering in videos: GMM-based hierarchical clustering using Spatio-Temporal data

In recent years, the growth in multimedia data generation and the availability of efficient storage have created a need for quick browsing, efficient summarization, and information retrieval techniques. Face clustering, together with other technologies such as speech recognition, can help address these needs. Applications such as video indexing, major cast detection, and video summarization benefit greatly from accurate face clustering algorithms. Since a video is a temporally ordered collection of faces, it is natural to use the temporal ordering of these faces, in conjunction with the spatial features extracted from them, to improve the clustering. This paper develops a novel clustering algorithm by modifying the highly successful hierarchical agglomerative clustering (HAC) process in two ways: it adds an effective initialization mechanism, consisting of an initial temporal clustering followed by Gaussian Mixture Model (GMM) based cluster splitting, and it introduces a temporal term, in addition to the spatial distances, during cluster combination. Experiments show that the algorithm significantly outperforms HAC while being equally flexible.
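
The following is a minimal sketch of two of the ideas named above: an initial temporal clustering and an agglomerative merge that adds a temporal term to the spatial distance. It is not the paper's formulation. The function names (`temporal_init`, `combined_distance`, `merge_clusters`), the parameters `max_gap` and `alpha`, the use of centroid distance, and the normalized frame-gap penalty are all assumptions made for illustration. The GMM-based splitting step is omitted.

```python
import math


def temporal_init(frames, max_gap=1):
    """Group detections (sorted by frame number) into runs of nearby frames.

    Assumption: detections at most `max_gap` frames apart start in the same cluster.
    """
    clusters = []
    for i, frame in enumerate(frames):
        if clusters and frame - frames[clusters[-1][-1]] <= max_gap:
            clusters[-1].append(i)
        else:
            clusters.append([i])
    return clusters


def centroid(cluster, feats):
    """Mean feature vector of the detections in a cluster."""
    dim = len(feats[0])
    return [sum(feats[i][d] for i in cluster) / len(cluster) for d in range(dim)]


def combined_distance(a, b, feats, frames, alpha):
    """Spatial centroid distance plus a weighted temporal gap (hypothetical form)."""
    spatial = math.dist(centroid(a, feats), centroid(b, feats))
    span = (frames[-1] - frames[0]) or 1  # avoid division by zero
    gap = min(abs(frames[i] - frames[j]) for i in a for j in b)
    return spatial + alpha * gap / span


def merge_clusters(clusters, feats, frames, n_clusters, alpha=0.1):
    """Repeatedly merge the closest pair until `n_clusters` clusters remain."""
    clusters = [list(c) for c in clusters]
    while len(clusters) > n_clusters:
        pairs = [(i, j) for i in range(len(clusters))
                 for j in range(i + 1, len(clusters))]
        i, j = min(pairs, key=lambda p: combined_distance(
            clusters[p[0]], clusters[p[1]], feats, frames, alpha))
        clusters[i] += clusters[j]
        del clusters[j]
    return clusters


# Toy data (made up): one face appears in frames 0-2 and 10-11, another in 5-6.
frames = [0, 1, 2, 5, 6, 10, 11]
feats = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1), (5.0, 5.0), (5.1, 5.0),
         (0.1, 0.1), (0.0, 0.0)]
initial = temporal_init(frames)
final = merge_clusters(initial, feats, frames, n_clusters=2)
```

In this toy example the temporal initialization yields three runs of consecutive detections, and the merge step then joins the two runs whose features are close even though they are separated in time. A larger `alpha` would make such merges across long gaps less likely.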
