Character Identification in TV-series via Non-local Cost Aggregation

We propose a non-local cost aggregation algorithm to recognize the identity of face and person tracks in a TV-series. In our approach, the fundamental element for identification is a track node, which is built on top of face and person tracks. Track nodes with temporal dependency are grouped into a knot. These knots then serve as the basic units in the construction of a k-knot graph for exploring the video structure. We build the minimum-distance spanning tree (MST) from the k-knot graph such that track nodes of similar appearance are adjacent to each other in MST. Non-local cost aggregation is performed on MST, which ensures information from face and person tracks is utilized as a whole to improve the identification performance. The identification task is performed by minimizing the cost of each knot, which takes into account the unique presence of a subject in a venue. Experimental results demonstrate the effectiveness of our method.

[1]  David K. Smith Network Flows: Theory, Algorithms, and Applications , 1994 .

[2]  Rainer Stiefelhagen,et al.  Semi-supervised Learning with Constraints for Person Identification in Multimedia Data , 2013, 2013 IEEE Conference on Computer Vision and Pattern Recognition.

[3]  Rainer Stiefelhagen,et al.  “Knock! Knock! Who is it?” probabilistic person identification in TV-series , 2012, 2012 IEEE Conference on Computer Vision and Pattern Recognition.

[4]  B. Taskar,et al.  Learning from ambiguously labeled images , 2009, 2009 IEEE Conference on Computer Vision and Pattern Recognition.

[5]  Rainer Stiefelhagen,et al.  Story-based Video Retrieval in TV series using Plot Synopses , 2014, ICMR.

[6]  Andrew Zisserman,et al.  Hello! My name is... Buffy'' -- Automatic Naming of Characters in TV Video , 2006, BMVC.

[7]  Makarand Tapaswi,et al.  StoryGraphs: Visualizing Character Interactions as a Timeline , 2014, 2014 IEEE Conference on Computer Vision and Pattern Recognition.

[8]  Andrew Zisserman,et al.  Person Spotting: Video Shot Retrieval for Face Sets , 2005, CIVR.

[9]  Changsheng Xu,et al.  Character-based movie summarization , 2010, ACM Multimedia.

[10]  Fei-Fei Li,et al.  Linking People in Videos with "Their" Names Using Coreference Resolution , 2014, ECCV.

[11]  Qiang Ji,et al.  Simultaneous Clustering and Tracklet Linking for Multi-face Tracking in Videos , 2013, 2013 IEEE International Conference on Computer Vision.

[12]  Cordelia Schmid,et al.  Unsupervised metric learning for face identification in TV video , 2011, 2011 International Conference on Computer Vision.

[13]  Qingxiong Yang,et al.  A non-local cost aggregation method for stereo matching , 2012, 2012 IEEE Conference on Computer Vision and Pattern Recognition.

[14]  Li Bai,et al.  Cosine Similarity Metric Learning for Face Verification , 2010, ACCV.

[15]  Julee Cobb,et al.  Hello, My Name Is… , 2016 .

[16]  Hsuan-Tien Lin,et al.  A note on Platt’s probabilistic outputs for support vector machines , 2007, Machine Learning.

[17]  Chih-Jen Lin,et al.  LIBSVM: A library for support vector machines , 2011, TIST.

[18]  Rainer Stiefelhagen,et al.  Improved weak labels using contextual cues for person identification in videos , 2015, 2015 11th IEEE International Conference and Workshops on Automatic Face and Gesture Recognition (FG).

[19]  David A. McAllester,et al.  Object Detection with Discriminatively Trained Part Based Models , 2010, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[20]  Rama Chellappa,et al.  Face Association across Unconstrained Video Frames Using Conditional Random Fields , 2012, ECCV.

[21]  Qiang Ji,et al.  Constrained Clustering and Its Application to Face Clustering in Videos , 2013, 2013 IEEE Conference on Computer Vision and Pattern Recognition.