Listen, Look, and Find the One

Person search with one portrait, which aims to find a target person in arbitrary scenes given a single portrait image, is an essential yet unexplored problem in the multimedia field. Existing approaches depend predominantly on visual information and therefore fail when the person’s appearance varies because of complex environments or changes in pose, makeup, and clothing. In contrast, this article proposes an associative multimodality index for person search that combines face, body, and voice information. In the offline stage, an associative network learns the relationships among face, body, and voice information and adaptively estimates the weight of each embedding to construct an appropriate representation. These representations are used to build the multimodality index, which exploits the face and voice as long-term keys and the body appearance as a short-term connection. In the online stage, the multimodality associations in the index allow all targets to be retrieved using only the facial features of the query portrait. Furthermore, to evaluate the multimodality search framework and facilitate related research, we construct the Cast Search in Movies with Voice (CSM-V) dataset, a large-scale benchmark that contains 127K annotated voices corresponding to tracklets from 192 movies. In extensive experiments on the CSM-V dataset, the proposed multimodality person search framework outperforms state-of-the-art methods.
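
The retrieval idea described above can be illustrated with a minimal sketch. It is not the authors' implementation: the learned associative network is replaced by a softmax over fixed, hypothetical per-modality quality scores, and the names `fused_similarity`, `search`, `quality`, the `scene` field, and the 0.8 threshold are assumptions made only for this example. The face and voice act as long-term keys (compared across scenes), the body embedding acts as a short-term link (compared only within the same scene), and the query supplies a face embedding only.

```python
import math

def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb) if na and nb else 0.0

def softmax(scores):
    """Numerically stable softmax; stands in for the learned weighting."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def fused_similarity(a, b, quality):
    """Weighted similarity over the modalities two tracklets share.

    Face and voice are usable across scenes; body is only trusted when
    both tracklets come from the same scene (assumed short-term cue).
    """
    shared = []
    for m in ("face", "body", "voice"):
        if a.get(m) is None or b.get(m) is None:
            continue
        if m == "body" and a["scene"] != b["scene"]:
            continue
        shared.append(m)
    if not shared:
        return 0.0
    weights = softmax([quality[m] for m in shared])
    return sum(w * cosine(a[m], b[m]) for w, m in zip(weights, shared))

def search(query_face, tracklets, quality, threshold=0.8):
    """Seed with face matches, then propagate through the index."""
    hits = {
        tid for tid, t in tracklets.items()
        if t.get("face") is not None and cosine(query_face, t["face"]) >= threshold
    }
    frontier = list(hits)
    while frontier:
        current = tracklets[frontier.pop()]
        for tid, t in tracklets.items():
            if tid not in hits and fused_similarity(current, t, quality) >= threshold:
                hits.add(tid)
                frontier.append(tid)
    return sorted(hits)

# Hypothetical quality scores; in the paper these weights are learned.
quality = {"face": 2.0, "body": 0.0, "voice": 1.0}

# Toy 2-D embeddings, invented for illustration only.
tracklets = {
    "t1": {"scene": 1, "face": [0.98, 0.2], "body": [1.0, 0.0], "voice": [0.0, 1.0]},
    "t2": {"scene": 1, "face": None, "body": [0.95, 0.1], "voice": None},   # back view
    "t3": {"scene": 2, "face": None, "body": [0.0, 1.0], "voice": [0.1, 0.99]},  # new clothes
    "t4": {"scene": 2, "face": [0.0, 1.0], "body": [-1.0, 0.0], "voice": [1.0, 0.0]},  # other person
}

result = search([1.0, 0.0], tracklets, quality)
print(result)  # ['t1', 't2', 't3']
```

In this toy index, `t1` is found directly by its face, `t2` is reached through body appearance within the same scene, and `t3` is reached through the voice despite a change of clothing, while `t4` is never linked. A real system would use learned embeddings and weights and an approximate nearest-neighbor index rather than the exhaustive loop shown here.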
