Learning Cooking Techniques from YouTube

Cooking is a human activity that involves sophisticated processes. Underlying the multitude of culinary recipes is a set of fundamental, general cooking techniques, such as cutting, braising, slicing, and sautéing. These skills are hard to learn from recipes, which provide only textual instructions for particular dishes. Although visual instructions such as videos are a more direct and intuitive way for users to learn these skills, they mainly focus on particular dishes rather than on general cooking techniques. In this paper, we explore how to leverage YouTube video collections as a source for automatically mining videos of basic cooking techniques. The proposed approach first collects a group of videos by searching YouTube, and then uses a trajectory bag-of-words model to represent human motion. It then clusters the candidate shots into groups with similar motion, and selects the most representative cluster and shots of the cooking technique to present to the user. Experiments on 22 cooking techniques demonstrate the feasibility of the proposed framework.
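The representation, clustering, and selection steps described above can be illustrated with a minimal sketch. It rests on several assumptions that the abstract does not confirm: the function names (bow_histogram, kmeans, representative_shots) are hypothetical, plain k-means stands in for whatever clustering method the approach actually uses, and "most representative" is taken to mean the largest cluster, with shots ranked by distance to its centroid. Trajectory descriptors and the codebook are assumed to be given as lists of numeric vectors.

```python
import math
import random
from collections import Counter


def bow_histogram(descriptors, codebook):
    """Quantize each trajectory descriptor to its nearest codeword and
    return an L1-normalized bag-of-words histogram for one shot."""
    counts = [0] * len(codebook)
    for d in descriptors:
        nearest = min(range(len(codebook)),
                      key=lambda k: math.dist(d, codebook[k]))
        counts[nearest] += 1
    total = sum(counts) or 1  # avoid division by zero for empty shots
    return [c / total for c in counts]


def kmeans(points, k, iters=20, seed=0):
    """Plain k-means (an assumed stand-in for the clustering step)."""
    rng = random.Random(seed)
    centers = [list(p) for p in rng.sample(points, k)]
    labels = [0] * len(points)
    for _ in range(iters):
        # Assignment step: each point goes to its nearest center.
        labels = [min(range(k), key=lambda j: math.dist(p, centers[j]))
                  for p in points]
        # Update step: each center moves to the mean of its members.
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:
                centers[j] = [sum(col) / len(members)
                              for col in zip(*members)]
    return labels, centers


def representative_shots(histograms, k=2, top_n=3):
    """Assumed selection rule: take the largest cluster and return the
    indices of the top_n shots closest to its centroid."""
    labels, centers = kmeans(histograms, k)
    best = Counter(labels).most_common(1)[0][0]
    members = [i for i, lab in enumerate(labels) if lab == best]
    members.sort(key=lambda i: math.dist(histograms[i], centers[best]))
    return members[:top_n]
```

In use, each candidate shot would be converted to a histogram with bow_histogram, and the list of histograms passed to representative_shots to obtain the indices of the shots to present. The choice of k and top_n here is illustrative only.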
