Image understanding for global lifelog media cloud

We implemented a media lifelog system with a highlighting system built from image analysis, video analysis, and audio analysis modules. The image analysis module performs image classification, salient region detection, face detection, and facial expression recognition. The video analysis module performs cut detection and key frame detection, and the key frames it extracts are used as input to the image analysis module. The audio analysis module performs audio segmentation. ImageNet data is used as the training and test database. The image classification accuracy is 83%. The automatic cut detection F1 score is 0.70. The cut detection F1 score is 0.80. The audio segmentation F score is 0.53. Facial expression recognition reaches a precision of 94.8% at 0.756 sec on a mobile phone.
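For readers unfamiliar with the metric, the F1 scores reported above are the harmonic mean of precision and recall. The following is a minimal sketch of how such a score could be computed for cut detection. It is not the authors' evaluation code: the function name `cut_detection_f1`, the frame-index representation of cuts, and the `tolerance` parameter are all assumptions made for illustration.

```python
def cut_detection_f1(detected, ground_truth, tolerance=0):
    """Return (precision, recall, f1) for detected cut positions.

    `detected` and `ground_truth` are lists of frame indices (an assumed
    representation). A detection counts as correct when it lies within
    `tolerance` frames of a true cut that has not been matched yet.
    """
    unmatched = sorted(ground_truth)
    true_positives = 0
    for frame in sorted(detected):
        # Match to the earliest unmatched true cut within the tolerance.
        match = next((g for g in unmatched if abs(g - frame) <= tolerance), None)
        if match is not None:
            unmatched.remove(match)
            true_positives += 1

    precision = true_positives / len(detected) if detected else 0.0
    recall = true_positives / len(ground_truth) if ground_truth else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


# Hypothetical data: two of four detections match two of three true cuts.
precision, recall, f1 = cut_detection_f1([10, 52, 90, 130], [10, 50, 120], tolerance=2)
# precision = 0.5, recall = 2/3, f1 = 4/7
```

Each true cut is matched at most once, so several detections clustered around one cut are not all counted as correct.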