Egocentric Basketball Motion Planning from a Single First-Person Image

We present a model that uses a single first-person image to generate an egocentric basketball motion sequence in the form of a 12D camera configuration trajectory, which encodes a player's 3D location and 3D head orientation throughout the sequence. To do this, we first introduce a future convolutional neural network (CNN) that predicts an initial sequence of 12D camera configurations, aiming to capture how real players move during a one-on-one basketball game. We also introduce a goal verifier network, which is trained to verify that a given camera configuration is consistent with the final goals of real one-on-one basketball players. Next, we propose an inverse synthesis procedure to synthesize a refined sequence of 12D camera configurations that (1) sufficiently matches the initial configurations predicted by the future CNN, while (2) maximizing the output of the goal verifier network. Finally, by following the trajectory resulting from the refined camera configuration sequence, we obtain the complete 12D motion sequence. Our model generates realistic basketball motion sequences that capture the goals of real players, outperforming standard deep learning approaches such as recurrent neural networks (RNNs), long short-term memory networks (LSTMs), and generative adversarial networks (GANs).
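The inverse synthesis step can be read as a direct optimization over the 12D configuration sequence: stay close to the future CNN's initial prediction while pushing the goal verifier's score up. The sketch below is a minimal illustration of that trade-off, assuming a PyTorch setup with stand-in networks, a hypothetical sequence length T, a trade-off weight lambda_goal, and the verifier applied to the final configuration; none of these names, values, or architectures come from the paper itself.

```python
# Minimal sketch of the inverse-synthesis refinement step (illustrative only).
# The networks, sequence length, and loss weighting are stand-ins, not the
# paper's actual architectures or hyperparameters.
import torch
import torch.nn as nn

T, D = 30, 12  # assumed sequence length; 12D camera configuration per step

# Stand-in for the future CNN's output: an initial 12D configuration sequence.
initial_sequence = torch.randn(T, D)

# Stand-in goal verifier: scores how consistent a configuration is with the
# final goals of real players (the real verifier is trained on game data).
goal_verifier = nn.Sequential(nn.Linear(D, 64), nn.ReLU(), nn.Linear(64, 1))

# The refined sequence is initialized from the prediction and optimized directly.
refined = initial_sequence.clone().requires_grad_(True)
optimizer = torch.optim.Adam([refined], lr=1e-2)
lambda_goal = 0.1  # assumed weight balancing the two objectives

for step in range(200):
    optimizer.zero_grad()
    # (1) sufficiently match the future CNN's initial prediction ...
    match_loss = (refined - initial_sequence).pow(2).mean()
    # (2) ... while maximizing the goal verifier's output (here, on the
    # final configuration of the sequence -- an assumption of this sketch).
    goal_score = goal_verifier(refined[-1]).squeeze()
    loss = match_loss - lambda_goal * goal_score
    loss.backward()
    optimizer.step()
```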
