Whodunnit? Crime Drama as a Case for Natural Language Understanding

In this paper we argue that crime drama exemplified in television programs such as CSI: Crime Scene Investigation is an ideal testbed for approximating real-world natural language understanding and the complex inferences associated with it. We propose to treat crime drama as a new inference task, capitalizing on the fact that each episode poses the same basic question (i.e., who committed the crime) and naturally provides the answer when the perpetrator is revealed. We develop a new dataset based on CSI episodes, formalize perpetrator identification as a sequence labeling problem, and develop an LSTM-based model which learns from multi-modal data. Experimental results show that an incremental inference strategy is key to making accurate guesses as well as learning from representations fusing textual, visual, and acoustic input.

[1]  Ali Farhadi,et al.  Situation Recognition: Visual Semantic Role Labeling for Image Understanding , 2016, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[2]  Cordelia Schmid,et al.  Finding Actors and Actions in Movies , 2013, 2013 IEEE International Conference on Computer Vision.

[3]  John S. Boreczky,et al.  A hidden Markov model framework for video segmentation using audio and image features , 1998, Proceedings of the 1998 IEEE International Conference on Acoustics, Speech and Signal Processing, ICASSP '98 (Cat. No.98CH36181).

[4]  Rainer Stiefelhagen,et al.  Aligning plot synopses to videos for story-based retrieval , 2015, International Journal of Multimedia Information Retrieval.

[5]  Carina Silberer,et al.  Visually Grounded Meaning Representations , 2017, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[6]  Phil Blunsom,et al.  Reasoning about Entailment with Neural Attention , 2015, ICLR.

[7]  Sanja Fidler,et al.  Visual Semantic Search: Retrieving Videos via Complex Textual Queries , 2014, 2014 IEEE Conference on Computer Vision and Pattern Recognition.

[8]  Margaret Mitchell,et al.  VQA: Visual Question Answering , 2015, International Journal of Computer Vision.

[9]  Wojciech Zaremba,et al.  Recurrent Neural Network Regularization , 2014, ArXiv.

[10]  Andrew McCallum,et al.  Conditional Random Fields: Probabilistic Models for Segmenting and Labeling Sequence Data , 2001, ICML.

[11]  Subhashini Venugopalan,et al.  Translating Videos to Natural Language Using Deep Recurrent Neural Networks , 2014, NAACL.

[12]  Michael S. Bernstein,et al.  Image retrieval using scene graphs , 2015, 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[13]  Jürgen Schmidhuber,et al.  Long Short-Term Memory , 1997, Neural Computation.

[14]  Andrew Zisserman,et al.  "Who are you?" - Learning person specific classifiers from video , 2009, CVPR.

[15]  Mirella Lapata,et al.  Learning to Interpret and Describe Abstract Scenes , 2015, NAACL.

[16]  Léon Bottou,et al.  Learning Image Embeddings using Convolutional Neural Networks for Improved Multi-Modal Semantics , 2014, EMNLP.

[17]  Sergey Ioffe,et al.  Inception-v4, Inception-ResNet and the Impact of Residual Connections on Learning , 2016, AAAI.

[18]  Gang Wei,et al.  Video classification based on HMM using text and faces , 2000, 2000 10th European Signal Processing Conference.

[19]  L. R. Rabiner,et al.  A comparative study of several dynamic time-warping algorithms for connected-word recognition , 1981, The Bell System Technical Journal.

[20]  Matthew Richardson,et al.  MCTest: A Challenge Dataset for the Open-Domain Machine Comprehension of Text , 2013, EMNLP.

[21]  Elia Bruni,et al.  Multimodal Distributional Semantics , 2014, J. Artif. Intell. Res..

[22]  Goutam Saha,et al.  Design, analysis and experimental evaluation of block based transformation in MFCC computation for speaker recognition , 2012, Speech Commun..

[23]  Jian Zhang,et al.  SQuAD: 100,000+ Questions for Machine Comprehension of Text , 2016, EMNLP.

[24]  Ellen M. Voorhees,et al.  Building a question answering test collection , 2000, SIGIR '00.

[25]  Jeffrey Pennington,et al.  GloVe: Global Vectors for Word Representation , 2014, EMNLP.

[26]  Sanja Fidler,et al.  Aligning Books and Movies: Towards Story-Like Visual Explanations by Watching Movies and Reading Books , 2015, 2015 IEEE International Conference on Computer Vision (ICCV).

[27]  Ben Taskar,et al.  Movie/Script: Alignment and Parsing of Video and Text Transcription , 2008, ECCV.

[28]  Yoshua Bengio,et al.  Show, Attend and Tell: Neural Image Caption Generation with Visual Attention , 2015, ICML.

[29]  Milind R. Naphade,et al.  A probabilistic framework for semantic video indexing, filtering, and retrieval , 2001, IEEE Trans. Multim..

[30]  Mirella Lapata,et al.  Movie Script Summarization as Graph-based Scene Extraction , 2015, NAACL.

[31]  Quoc V. Le,et al.  Sequence to Sequence Learning with Neural Networks , 2014, NIPS.

[32]  Christopher Potts,et al.  A large annotated corpus for learning natural language inference , 2015, EMNLP.

[33]  Yash Goyal,et al.  Making the V in VQA Matter: Elevating the Role of Image Understanding in Visual Question Answering , 2017, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[34]  Jason Weston,et al.  Towards AI-Complete Question Answering: A Set of Prerequisite Toy Tasks , 2015, ICLR.

[35]  Samy Bengio,et al.  Show and tell: A neural image caption generator , 2014, 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[36]  Angeliki Lazaridou,et al.  Combining Language and Vision with a Multimodal Skip-gram Model , 2015, NAACL.

[37]  Stan Davis,et al.  Comparison of Parametric Representations for Monosyllabic Word Recognition in Continuously Spoken Se , 1980 .

[38]  Yi Yang,et al.  WikiQA: A Challenge Dataset for Open-Domain Question Answering , 2015, EMNLP.

[39]  Trevor Darrell,et al.  Sequence to Sequence -- Video to Text , 2015, 2015 IEEE International Conference on Computer Vision (ICCV).

[40]  Frank Keller,et al.  Image Description using Visual Dependency Representations , 2013, EMNLP.

[41]  Christopher Joseph Pal,et al.  Movie Description , 2016, International Journal of Computer Vision.

[42]  Fei-Fei Li,et al.  Deep visual-semantic alignments for generating image descriptions , 2014, 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[43]  Changsheng Xu,et al.  Character-based movie summarization , 2010, ACM Multimedia.

[44]  Jacob Cohen A Coefficient of Agreement for Nominal Scales , 1960 .

[45]  Jason Weston,et al.  The Goldilocks Principle: Reading Children's Books with Explicit Memory Representations , 2015, ICLR.

[46]  Phil Blunsom,et al.  Teaching Machines to Read and Comprehend , 2015, NIPS.

[47]  Sanja Fidler,et al.  MovieQA: Understanding Stories in Movies through Question-Answering , 2015, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[48]  Yaser Sheikh,et al.  On the use of computable features for film classification , 2005, IEEE Transactions on Circuits and Systems for Video Technology.

[49]  C.-C. Jay Kuo,et al.  Environmental sound recognition: A survey , 2013, 2013 Asia-Pacific Signal and Information Processing Association Annual Summit and Conference.