Automatic MOOC Video Classification using Transcript Features and Convolutional Neural Networks

The volume of MOOC video material has grown exponentially in recent years, so its storage and analysis need to be automated as far as possible to maintain management quality. In this work, we present a method for automatic topic classification of MOOC videos using speech transcripts and convolutional neural networks (CNNs). The method has three steps. First, speech recognition generates a transcript for each video. Second, each transcript is converted into an image using a statistical co-occurrence transformation that we designed. Third, a CNN predicts video category labels from the transcript image. We evaluate on the Khan Academy on a Stick dataset, which contains 2,545 videos, each labeled with one or two of 13 categories. Experiments show that our method is strongly competitive with other methods that are also based on transcript features and supervised learning.
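The abstract does not specify the co-occurrence transformation, so the following is only a minimal sketch of the general idea of turning a transcript into an image-like grid. It assumes a fixed vocabulary, a sliding window over the tokens, and scaling of the counts to 8-bit intensities; the function name `cooccurrence_image`, the `window` parameter, and the toy vocabulary are hypothetical and are not taken from the paper.

```python
from collections import Counter


def cooccurrence_image(tokens, vocab, window=3):
    """Map a token sequence to a len(vocab) x len(vocab) grayscale grid.

    Illustrative assumption only: this is not the transformation used in the paper.
    """
    index = {word: i for i, word in enumerate(vocab)}
    counts = Counter()
    for pos, word in enumerate(tokens):
        if word not in index:
            continue
        # Count in-vocabulary neighbours up to `window` tokens to the right.
        for other in tokens[pos + 1 : pos + 1 + window]:
            if other in index:
                i, j = index[word], index[other]
                counts[(i, j)] += 1
                counts[(j, i)] += 1  # mirror the count so the grid is symmetric
    peak = max(counts.values(), default=0)
    size = len(vocab)
    # Scale counts to 0..255 so the grid can be treated as an 8-bit image.
    return [
        [round(255 * counts[(i, j)] / peak) if peak else 0 for j in range(size)]
        for i in range(size)
    ]


transcript = "the derivative of a function is the slope of the function".split()
vocab = ["derivative", "function", "slope"]
image = cooccurrence_image(transcript, vocab)
for row in image:
    print(row)
```

A grid like this could then be passed to any image classifier, with a CNN being the choice the abstract names. Because each video carries one or two labels, the output layer would need to support more than one label per video.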
