Educational video classification by using a transcript to image transform and supervised learning

In this work, we present a method for automatic topic classification of educational videos based on a transform of their speech transcripts. The method has three steps. First, speech recognition is used to generate a transcript for each video. Second, each transcript is converted into an image using a statistical co-occurrence transformation that we designed. Third, a classifier predicts video category labels from the transcript image. We report results for two classifiers: a convolutional neural network (CNN) and a principal component analysis (PCA) model. To evaluate our method, we used the Khan Academy on a Stick dataset, which contains 2,545 videos, each labeled with one or two of 13 categories. Experiments show that our method is effective and competitive with other supervised learning-based methods.
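
To make the transcript-to-image step concrete, the following is a minimal sketch of one possible co-occurrence transform. It is not the transformation designed in this work, whose details are not given in the abstract. The function name `transcript_to_image`, the fixed vocabulary `vocab`, the sliding `window` size, and the max-normalization to 8-bit grayscale are all illustrative assumptions.

```python
import numpy as np


def transcript_to_image(transcript, vocab, window=2):
    """Hypothetical sketch: map a transcript to a V x V grayscale image,
    where V = len(vocab) and pixel (i, j) reflects how often vocabulary
    words i and j occur within `window` tokens of each other."""
    index = {word: i for i, word in enumerate(vocab)}
    # Assumption: out-of-vocabulary tokens are dropped before windowing,
    # so the window spans in-vocabulary tokens only.
    ids = [index[t] for t in transcript.lower().split() if t in index]

    counts = np.zeros((len(vocab), len(vocab)), dtype=np.float64)
    for pos, a in enumerate(ids):
        for b in ids[pos + 1 : pos + 1 + window]:
            counts[a, b] += 1
            if a != b:
                counts[b, a] += 1  # keep the matrix symmetric

    # Assumption: scale by the largest count so pixels fall in [0, 255].
    peak = counts.max()
    if peak > 0:
        counts /= peak
    return (counts * 255).astype(np.uint8)


# Toy usage with a made-up three-word vocabulary.
vocab = ["force", "mass", "energy"]
image = transcript_to_image("force equals mass times energy mass", vocab)
```

The resulting array has a fixed size regardless of transcript length, which is what lets an image classifier such as a CNN, or a PCA-based model, consume it directly.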
