Identity Extraction from Clusters of Multi-modal Observations

In this paper, we present a method for identity extraction from TV news broadcasts. We define an identity as a set of multi-modal observations; in our case, each observation consists of a person's face and a person's name. The method is based on agglomerative clustering of observations, and the resulting clusters represent the individual identities that appeared in the broadcasts. To evaluate the accuracy of our system, we hand-labelled approximately one year's worth of TV news broadcasts, which yielded a total of \(10\,301\) multi-modal observations and 2563 unique identities. Our method achieved a coverage of 90.69 % and a precision of 94.69 %. Given the simplicity of the proposed algorithm, these results are very satisfactory. Furthermore, the designed system is modular, and new modalities can be easily added.
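The core idea of grouping multi-modal observations into identities by agglomerative clustering can be illustrated with a minimal sketch. The toy data, the equal weighting of the two modalities, and the distance threshold below are illustrative assumptions, not the paper's actual parameters:

```python
# Hypothetical sketch of agglomerative clustering over multi-modal
# observations (face embedding + name). Weights and threshold are
# illustrative assumptions, not values from the paper.
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import pdist

# Toy observations: 4-D "face embeddings" plus associated names (assumed data).
faces = np.array([
    [0.90, 0.10, 0.0, 0.0],   # person A
    [0.88, 0.12, 0.0, 0.0],   # person A again
    [0.00, 0.00, 1.0, 0.0],   # person B
])
names = ["John Doe", "John Doe", "Jane Roe"]

# Crude name distance: 0 if the names match, 1 otherwise.
def name_dist(u, v):
    return 0.0 if names[int(u[0])] == names[int(v[0])] else 1.0

idx = np.arange(len(names)).reshape(-1, 1).astype(float)
face_d = pdist(faces, metric="cosine")      # face-modality distance
name_d = pdist(idx, metric=name_dist)       # name-modality distance
combined = 0.5 * face_d + 0.5 * name_d      # assumed equal weighting

# Average-linkage agglomerative clustering, cut at an assumed threshold;
# each resulting cluster label represents one extracted identity.
Z = linkage(combined, method="average")
labels = fcluster(Z, t=0.3, criterion="distance")
```

Because each modality only contributes one term to the combined distance, adding a new modality amounts to adding another weighted `pdist` term, which mirrors the modularity claimed in the abstract.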
