Learning from ambiguously labeled images

In many image and video collections, we have access only to partially labeled data. For example, personal photo collections often contain several faces per image and a caption that only specifies who is in the picture, but not which name matches which face. Similarly, movie screenplays can tell us who is in a scene, but not when and where they appear on screen. We formulate the learning problem in this setting as partially supervised multiclass classification, where each instance is labeled ambiguously with more than one label. We show theoretically that effective learning is possible under reasonable assumptions even when all the data is weakly labeled. Motivated by the analysis, we propose a general convex learning formulation based on minimization of a surrogate loss appropriate for the ambiguous label setting. We apply our framework to identifying faces culled from Web news sources and to naming characters in TV series and movies. We experiment on a very large dataset consisting of 100 hours of video, and in particular achieve 6% error for character naming on 16 episodes of LOST.
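
The abstract does not spell out the surrogate loss, so the following is only an illustrative sketch of what a convex loss for ambiguous labels might look like, not necessarily the exact formulation used in the paper. It assumes per-class scores that are linear in the model parameters, rewards a high average score over the candidate label set, and penalizes positive scores on labels outside that set. The names `hinge` and `ambiguous_label_loss` are hypothetical, as is the choice of the hinge function as the margin loss.

```python
def hinge(z):
    """Convex, non-increasing margin loss: max(0, 1 - z)."""
    return max(0.0, 1.0 - z)


def ambiguous_label_loss(scores, candidates, psi=hinge):
    """Illustrative convex surrogate for one ambiguously labeled instance.

    scores:     list of per-class scores for the instance (assumed to be
                linear in the model parameters, so the loss stays convex)
    candidates: set of class indices in the ambiguous label set; the true
                label is assumed to be one of them
    psi:        any convex, non-increasing margin loss (assumption: hinge)
    """
    if not candidates:
        raise ValueError("candidate label set must be non-empty")

    # Encourage the *average* candidate score to be large; we cannot
    # single out the true label, so all candidates are treated jointly.
    candidate_mean = sum(scores[a] for a in candidates) / len(candidates)
    loss = psi(candidate_mean)

    # Push down the score of every label that is certainly wrong.
    loss += sum(psi(-s) for a, s in enumerate(scores) if a not in candidates)
    return loss


# Example: a face whose caption names persons 0 and 1, but not person 2.
example_loss = ambiguous_label_loss([2.0, 0.0, -2.0], {0, 1})
```

Because the candidate scores enter only through their mean, which is linear in the scores, and `psi` is convex, the loss is convex in the scores; this is the property that makes a formulation of this kind tractable even when no instance has a single known label.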
