Collaborative Deep Metric Learning for Video Understanding

The goal of video understanding is to develop algorithms that enable machines to understand videos at the level of human experts. Researchers have tackled various tasks, including video classification, search, and personalized recommendation. However, there is a research gap in combining these tasks in one unified learning framework. To address it, we propose a deep network that uses the audio-visual content of videos to embed them in a metric space that preserves video-to-video relationships. We then apply the trained embedding network to several of these tasks, including video classification and recommendation, and show significant improvements over state-of-the-art baselines. The proposed approach scales to deployment on large video-sharing platforms such as YouTube.
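As a rough illustration of what embedding videos in a metric space that preserves video-to-video relationships can mean, the sketch below computes a triplet-style margin loss over three embedding vectors. The choice of a triplet loss, the margin value, the L2 normalization, and all names (`triplet_loss`, `anchor`, `related`, `unrelated`) are assumptions made for illustration; the abstract does not specify the training objective.

```python
import math

def l2_normalize(v):
    """Scale a vector to unit length (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v] if norm > 0 else list(v)

def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def triplet_loss(anchor, positive, negative, margin=0.5):
    """Hinge loss that is zero once the anchor is closer to the positive
    than to the negative by at least `margin` (hypothetical value)."""
    a, p, n = (l2_normalize(v) for v in (anchor, positive, negative))
    return max(0.0, squared_distance(a, p) - squared_distance(a, n) + margin)

# Toy 2-D embeddings standing in for the outputs of an embedding network.
anchor = [1.0, 0.0]
related = [0.9, 0.1]    # a video assumed to be related to the anchor
unrelated = [0.0, 1.0]  # a video assumed to be unrelated

loss_good = triplet_loss(anchor, related, unrelated)  # relationships respected
loss_bad = triplet_loss(anchor, unrelated, related)   # relationships violated
```

Under these assumptions, a loss of zero means the embedding already places the related video closer to the anchor than the unrelated one by the required margin, while a positive loss signals that the embedding would need to be adjusted.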
