MoGlow

Data-driven modelling and synthesis of motion is an active research area with applications that include animation, games, and social robotics. This paper introduces a new class of probabilistic, generative, and controllable motion-data models based on normalising flows. Models of this kind can describe highly complex distributions, yet can be trained efficiently using exact maximum likelihood, unlike GANs or VAEs. Our proposed model is autoregressive and uses LSTMs to enable arbitrarily long time dependencies. Importantly, it is also causal, meaning that each pose in the output sequence is generated without access to poses or control inputs from future time steps; this absence of algorithmic latency is important for interactive applications with real-time motion control. The approach can in principle be applied to any type of motion since it does not make restrictive assumptions such as the motion being cyclic in nature. We evaluate the models on motion-capture datasets of human and quadruped locomotion. Objective and subjective results show that randomly sampled motion from the proposed method attains a motion quality close to recorded motion capture for both humans and animals.
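The core mechanism can be illustrated with a short sketch. A normalising flow learns an invertible map f from a pose x to a latent z, so the exact likelihood follows from the change of variables, log p(x | c) = log p_z(f(x; c)) + log |det df/dx|, where c is the conditioning (past poses and the control signal). Below is a minimal, hypothetical PyTorch sketch of one building block in the spirit described above: an affine coupling layer whose scale and shift come from an LSTM that sees only past context, making generation causal. All names here (LSTMAffineCoupling, pose_dim, cond_dim, and so on) are illustrative assumptions, not the authors' implementation.

```python
import torch
import torch.nn as nn

class LSTMAffineCoupling(nn.Module):
    """Hypothetical flow step: affine coupling with LSTM-conditioned
    scale/shift, seeing only past context (hence causal)."""

    def __init__(self, pose_dim: int, cond_dim: int, hidden: int = 256):
        super().__init__()
        self.half = pose_dim // 2
        # The LSTM consumes the untransformed half of the pose plus the
        # conditioning (past poses + current control input).
        self.lstm = nn.LSTM(self.half + cond_dim, hidden, batch_first=True)
        # Predict a log-scale and a shift for the other half of the pose.
        self.proj = nn.Linear(hidden, 2 * (pose_dim - self.half))

    def forward(self, x, cond, state=None):
        # x:    (batch, 1, pose_dim)   pose at the current time step
        # cond: (batch, 1, cond_dim)   past poses and current control input
        x_a, x_b = x[..., :self.half], x[..., self.half:]
        h, state = self.lstm(torch.cat([x_a, cond], dim=-1), state)
        log_s, t = self.proj(h).chunk(2, dim=-1)
        z_b = x_b * torch.exp(log_s) + t   # invertible affine transform
        log_det = log_s.sum(dim=-1)        # exact log |det Jacobian|
        return torch.cat([x_a, z_b], dim=-1), log_det, state

    def inverse(self, z, cond, state=None):
        # Exact inversion for sampling: recover x_b from z_b.
        z_a, z_b = z[..., :self.half], z[..., self.half:]
        h, state = self.lstm(torch.cat([z_a, cond], dim=-1), state)
        log_s, t = self.proj(h).chunk(2, dim=-1)
        x_b = (z_b - t) * torch.exp(-log_s)
        return torch.cat([z_a, x_b], dim=-1), state
```

Under this sketch, training minimises the exact negative log-likelihood, -log p_z(z) minus the sum of log_det terms across flow steps, with p_z a standard Gaussian. At synthesis time one samples z from N(0, I) and calls inverse frame by frame while carrying the LSTM state forward, so no future poses or control inputs are ever required.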
