Web-scale Multimedia Search for Internet Video Content

The Internet has been witnessing an explosion of video content. According to a Cisco study, video is estimated to account for 80% of the world's Internet traffic by 2019. Video data are becoming one of the most valuable sources of information and knowledge. However, existing video search solutions are still based on text matching (text-to-text search) and can fail for the huge volume of videos that have little relevant metadata or none at all. There is an urgent need for large-scale, intelligent video search that bridges the gap between the user's information need and the video content. In this thesis, we propose an accurate, efficient and scalable search method for video content. As opposed to text matching, the proposed method relies on automatic video content understanding and allows for intelligent and flexible search paradigms over the video content, including text-to-video and text&video-to-video search.
Suppose our goal is to find videos about a birthday party. With a traditional text-to-text query, we have to search for keywords in the user-generated metadata (titles or descriptions). With a text-to-video query, however, we can look for visual clues in the video content such as "cake", "gift" and "kids", audio clues such as "birthday song" and "cheering sound", or visible text such as "happy birthday". Text-to-video queries are flexible and can be further refined with Boolean and temporal operators. After watching the retrieved videos, the user may select a few interesting ones in order to find more videos like them. This can be achieved by issuing a text&video-to-video query, which adds the selected video examples to the query. The proposed method offers a new perspective on content-based video search, ranging from finding a simple concept like "puppy" to searching for a complex incident like "a scene in an urban area where people are running away after an explosion".
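The birthday-party example can be made concrete with a small sketch. The Python below is only an illustration under stated assumptions, not the thesis's implementation: the in-memory `INDEX`, the modality names (`visual`, `audio`, `ocr`), the confidence values, and the `search` function are all hypothetical. It shows a Boolean-style text-to-video query in which some concepts are required and others only raise the ranking score.

```python
from typing import Dict, List, Sequence, Tuple

# Hypothetical detector outputs: video id -> modality -> {concept: confidence}.
# All names and numbers here are made up for illustration.
INDEX: Dict[str, Dict[str, Dict[str, float]]] = {
    "v1": {"visual": {"cake": 0.9, "kids": 0.7},
           "audio": {"birthday song": 0.8}, "ocr": {}},
    "v2": {"visual": {"cake": 0.6},
           "audio": {}, "ocr": {"happy birthday": 0.9}},
    "v3": {"visual": {"puppy": 0.95},
           "audio": {"cheering sound": 0.4}, "ocr": {}},
}

Term = Tuple[str, str]  # (modality, concept)


def score(video: Dict[str, Dict[str, float]], term: Term) -> float:
    """Confidence of one concept in one modality, or 0.0 if it was not detected."""
    modality, concept = term
    return video.get(modality, {}).get(concept, 0.0)


def search(index: Dict[str, Dict[str, Dict[str, float]]],
           must: Sequence[Term],
           should: Sequence[Term] = ()) -> List[Tuple[str, float]]:
    """Keep videos containing every `must` term (Boolean AND) and rank them
    by the summed confidence of all `must` and `should` terms."""
    results = []
    for video_id, video in index.items():
        if all(score(video, term) > 0.0 for term in must):
            total = sum(score(video, term) for term in list(must) + list(should))
            results.append((video_id, total))
    return sorted(results, key=lambda item: item[1], reverse=True)


ranked = search(
    INDEX,
    must=[("visual", "cake")],
    should=[("audio", "birthday song"), ("ocr", "happy birthday")],
)
print(ranked)  # "v3" is filtered out because it lacks the required "cake" concept
```

A linear scan is used here purely for readability; it says nothing about how a system would index concepts at the scale discussed in this thesis.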
To achieve this ambitious goal, we propose several novel methods focusing on accuracy, efficiency and scalability in the new search paradigm. First, we introduce a novel self-paced curriculum learning theory that allows for training more accurate semantic concepts. Second, we propose a novel and scalable approach to indexing semantic concepts that can significantly improve search efficiency with minimal accuracy loss. Third, we design a novel video reranking algorithm that can boost accuracy for video retrieval. Extensive experiments demonstrate that the proposed methods surpass state-of-the-art accuracy on multiple datasets. In addition, our method can efficiently scale the search up to hundreds of millions of videos: it takes only about 0.2 seconds to answer a semantic query over a collection of 100 million videos, and 1 second to process a hybrid query over 1 million videos. Based on the proposed methods, we implement E-Lamp Lite, the first large-scale semantic search engine of its kind for Internet videos. According to the National Institute of Standards and Technology (NIST), it achieved the best accuracy in TRECVID Multimedia Event Detection (MED) 2013, 2014 and 2015, the most representative task for content-based video search. To the best of our knowledge, E-Lamp Lite is the first content-based semantic search engine capable of indexing and searching a collection of 100 million videos.
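The abstract does not spell out the self-paced curriculum learning formulation, so the following Python sketch shows only the generic self-paced idea from the wider literature, as an assumption: alternate between fitting a model on the currently selected "easy" samples and re-selecting samples whose loss falls below a threshold that grows over time. The toy model (a one-dimensional mean under squared loss), the function name `self_paced_mean`, and the parameters `lam` and `growth` are hypothetical choices for illustration.

```python
from typing import List, Sequence, Tuple


def self_paced_mean(xs: Sequence[float],
                    lam: float,
                    growth: float = 1.5,
                    rounds: int = 5) -> Tuple[float, List[float]]:
    """Toy self-paced learning loop with binary sample weights.

    Model: a single number w estimating the mean of xs under squared loss.
    Each round selects the samples whose loss is below the threshold lam,
    refits w on the selected samples only, then raises lam so that harder
    samples can be admitted in later rounds.
    """
    w = sorted(xs)[len(xs) // 2]  # initialise the model at a median element
    weights = [1.0] * len(xs)
    for _ in range(rounds):
        losses = [(x - w) ** 2 for x in xs]
        # Sample-selection step: 1.0 for "easy" samples, 0.0 for "hard" ones.
        weights = [1.0 if loss < lam else 0.0 for loss in losses]
        selected = sum(weights)
        if selected > 0:
            # Model-update step: weighted least squares reduces to a weighted mean.
            w = sum(v * x for v, x in zip(weights, xs)) / selected
        lam *= growth  # gradually admit harder samples
    return w, weights


data = [1.0, 1.2, 0.8, 1.1, 0.9, 10.0]  # the last value is an outlier
estimate, final_weights = self_paced_mean(data, lam=0.5)
print(estimate, final_weights)
```

In this toy run the outlier is never selected because its loss stays far above the threshold, which is the intuition behind training on easy samples first when labels are noisy.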
