The LEAR submission at Thumos 2014

We describe the submission of the INRIA LEAR team to the THUMOS workshop held in conjunction with ECCV 2014. Our system is based on Fisher vector (FV) encoding of dense trajectory features (DTF), which we also used in our 2013 submission. This year's submission additionally incorporates static-image features (SIFT, Color, and CNN) and audio features (ASR and MFCC) for the classification task. For the detection task, we combine scores from the classification task with FV-DTF features extracted from video slices. We found that these additional visual and audio features significantly improve the classification results. For localization, we found that using the classification scores as a contextual feature alongside local motion features leads to significant improvements.
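To illustrate the idea of using classification scores as a contextual feature, the following minimal sketch appends a video-level score vector to each slice-level descriptor. It is an illustration under stated assumptions, not the implementation used in the submission: the function name `add_context`, the toy dimensions, and the choice of plain concatenation are all hypothetical, and the actual system may combine the two sources differently.

```python
def add_context(slice_features, video_scores):
    """Append video-level classification scores to every slice descriptor.

    slice_features: list of per-slice descriptors (e.g. FV-DTF vectors).
    video_scores:   one score per action class for the whole video.
    Concatenation is an assumption made for illustration only.
    """
    # Every slice of a video receives the same contextual score vector.
    return [list(f) + list(video_scores) for f in slice_features]


# Toy example: 3 slices with 4-dimensional local descriptors, 2 action classes.
slices = [
    [0.1, 0.2, 0.3, 0.4],
    [0.5, 0.6, 0.7, 0.8],
    [0.9, 1.0, 1.1, 1.2],
]
scores = [2.3, -0.7]
augmented = add_context(slices, scores)
```

Each augmented descriptor keeps the local motion information in its first entries and carries the video-level context in its last entries, so a slice-level detector trained on it can use both.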