Multi-frame, multi-modal, and multi-kernel concept detection in video