Differentiable Grammars for Videos

This paper proposes a novel algorithm which learns a formal regular grammar from real-world continuous data, such as videos. Learning latent terminals, non-terminals, and production rules directly from continuous data allows the construction of a generative model capturing sequential structures with multiple possibilities. Our model is fully differentiable, and provides easily interpretable results which are important in order to understand the learned structures. It outperforms the state-of-the-art on several challenging datasets and is more accurate for forecasting future activities in videos. We plan to open-source the code. this https URL

[1]  Xuan Liang,et al.  Assessing Beijing's PM2.5 pollution: severity, weather impact, APEC and winter heating , 2015, Proceedings of the Royal Society A: Mathematical, Physical and Engineering Sciences.

[2]  Michael S. Ryoo,et al.  Temporal Gaussian Mixture Layer for Videos , 2018, ICML.

[3]  Bohyung Han,et al.  On-Line Video Event Detection by Constraint Flow , 2014, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[4]  C. Lee Giles,et al.  Learning a class of large finite state machines with a recurrent neural network , 1995, Neural Networks.

[5]  C. Lee Giles,et al.  The Neural Network Pushdown Automaton: Model, Stack and Learning Simulations , 2017, ArXiv.

[6]  Stan Sclaroff,et al.  Learning Activity Progression in LSTMs for Activity Detection and Early Detection , 2016, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[7]  King-Sun Fu,et al.  Syntactic Pattern Recognition And Applications , 1968 .

[8]  Janet Wiles,et al.  Context-free and context-sensitive dynamics in recurrent neural networks , 2000, Connect. Sci..

[9]  Deva Ramanan,et al.  Predictive-Corrective Networks for Action Detection , 2017, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[10]  Danqi Chen,et al.  A Fast and Accurate Dependency Parser using Neural Networks , 2014, EMNLP.

[11]  Kate Saenko,et al.  R-C3D: Region Convolutional 3D Network for Temporal Activity Detection , 2017, 2017 IEEE International Conference on Computer Vision (ICCV).

[12]  Michael S. Ryoo,et al.  Human activity prediction: Early recognition of ongoing activities from streaming videos , 2011, 2011 International Conference on Computer Vision.

[13]  Ali Farhadi,et al.  Hollywood in Homes: Crowdsourcing Data Collection for Activity Understanding , 2016, ECCV.

[14]  Bernt Schiele,et al.  Time-Conditioned Action Anticipation in One Shot , 2019, 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[15]  Padhraic Smyth,et al.  Discrete recurrent neural networks for grammatical inference , 1994, IEEE Trans. Neural Networks.

[16]  John F. Kolen,et al.  Fool's Gold: Extracting Finite State Machines from Recurrent Network Dynamics , 1993, NIPS.

[17]  Irfan A. Essa,et al.  Recognizing multitasked activities from video using stochastic context-free grammar , 2002, AAAI/IAAI.

[18]  Jake K. Aggarwal,et al.  Semantic Representation and Recognition of Continued and Recursive Human Activities , 2009, International Journal of Computer Vision.

[19]  Ben Poole,et al.  Categorical Reparameterization with Gumbel-Softmax , 2016, ICLR.

[20]  Andrew Y. Ng,et al.  Parsing Natural Scenes and Natural Language with Recursive Neural Networks , 2011, ICML.

[21]  Yee Whye Teh,et al.  The Concrete Distribution: A Continuous Relaxation of Discrete Random Variables , 2016, ICLR.

[22]  Richard Evans,et al.  Learning Explanatory Rules from Noisy Data , 2017, J. Artif. Intell. Res..

[23]  Been Kim,et al.  Towards A Rigorous Science of Interpretable Machine Learning , 2017, 1702.08608.

[24]  Andrew Zisserman,et al.  Quo Vadis, Action Recognition? A New Model and the Kinetics Dataset , 2017, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[25]  Stephen J. McKenna,et al.  Combining embedded accelerometers with computer vision for recognizing food preparation activities , 2013, UbiComp.

[26]  Peter Tiňo,et al.  Finite State Machines and Recurrent Neural Networks -- Automata and Dynamical Systems Approaches , 1995 .

[27]  Benjamin Z. Yao,et al.  Unsupervised learning of event AND-OR grammar and semantics from video , 2011, 2011 International Conference on Computer Vision.

[28]  Alex Graves,et al.  Neural Turing Machines , 2014, ArXiv.

[29]  Alan Fern,et al.  Probabilistic event logic for interval-based event recognition , 2011, CVPR 2011.

[30]  Rama Chellappa,et al.  Attribute Grammar-Based Event Recognition and Anomaly Detection , 2006, 2006 Conference on Computer Vision and Pattern Recognition Workshop (CVPRW'06).

[31]  Jürgen Schmidhuber,et al.  Long Short-Term Memory , 1997, Neural Computation.

[32]  Aaron F. Bobick,et al.  Recognition of Visual Activities and Interactions by Stochastic Parsing , 2000, IEEE Trans. Pattern Anal. Mach. Intell..

[33]  Ali Farhadi,et al.  Asynchronous Temporal Fields for Action Recognition , 2016, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[34]  Li Fei-Fei,et al.  Every Moment Counts: Dense Detailed Labeling of Actions in Complex Videos , 2015, International Journal of Computer Vision.

[35]  Yoshua Bengio,et al.  Learning Phrase Representations using RNN Encoder–Decoder for Statistical Machine Translation , 2014, EMNLP.

[36]  Fan Yang,et al.  Differentiable Learning of Logical Rules for Knowledge Base Reasoning , 2017, NIPS.

[37]  Noah A. Smith,et al.  Recurrent Neural Network Grammars , 2016, NAACL.

[38]  Boyu Wang,et al.  Back to the beginning: Starting point detection for early recognition of ongoing human actions , 2018, Comput. Vis. Image Underst..

[39]  Noam Chomsky,et al.  Three models for the description of language , 1956, IRE Trans. Inf. Theory.

[40]  Deva Ramanan,et al.  Parsing Videos of Actions with Segmental Grammars , 2014, 2014 IEEE Conference on Computer Vision and Pattern Recognition.

[41]  Michael S. Ryoo,et al.  Fine-Grained Activity Recognition in Baseball Videos , 2018, 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW).

[42]  Noam Chomsky,et al.  On Certain Formal Properties of Grammars , 1959, Inf. Control..

[43]  Risto Miikkulainen,et al.  SARDSRN: A Neural Network Shift-Reduce Parser , 1999, IJCAI.