Autonomous aerial cinematography in unstructured environments with learned artistic decision-making

Aerial cinematography is revolutionizing industries that require live and dynamic camera viewpoints, such as entertainment, sports, and security. However, safely piloting a drone while filming a moving target in the presence of obstacles is immensely taxing and often requires multiple expert human operators. Hence, there is demand for an autonomous cinematographer that can reason about both geometry and scene context in real time. Existing approaches do not address all aspects of this problem: they require high-precision motion-capture systems or GPS tags to localize targets, rely on prior maps of the environment, plan only over short time horizons, or follow only artistic guidelines specified before flight. In this work, we address the problem in its entirety and propose a complete system for real-time aerial cinematography that, for the first time, combines: (1) vision-based target estimation; (2) 3D signed-distance mapping for occlusion estimation; (3) efficient trajectory optimization for long time-horizon camera motion; and (4) learning-based artistic shot selection. We extensively evaluate our system both in simulation and in field experiments by filming dynamic targets moving through unstructured environments. Our results indicate that our system can operate reliably in the real world without restrictive assumptions. We also provide in-depth analysis and discussion of each module, with the hope that our design tradeoffs can generalize to other related applications. Videos of the complete system can be found at: this https URL.
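To make components (2) and (3) more concrete, the following is a minimal illustrative sketch of how a trajectory can be optimized against a signed-distance representation of an obstacle. It is not the implementation described in this work. It assumes a single circular obstacle in 2D with an analytic signed distance (standing in for a mapped 3D signed-distance field), a smoothness cost on consecutive waypoints, a quadratic hinge penalty inside a safety margin, and plain gradient descent. All function names, weights, and parameters (signed_distance, optimize_trajectory, margin, w_smooth, w_obs, step) are hypothetical choices made for illustration.

```python
import math


def signed_distance(p, center, radius):
    """Signed distance from point p to a circular obstacle (negative inside).

    Assumption: an analytic circle stands in for a mapped signed-distance field.
    """
    return math.hypot(p[0] - center[0], p[1] - center[1]) - radius


def optimize_trajectory(start, goal, center, radius, n=30, margin=0.5,
                        w_smooth=1.0, w_obs=10.0, step=0.05, iters=2000):
    """Gradient descent on a smoothness cost plus an obstacle penalty.

    The endpoints stay fixed; only the interior waypoints are updated.
    Clearance is evaluated at waypoints only, not along the segments between them.
    """
    # Initialize with a straight line between the fixed endpoints.
    traj = [[start[k] + (goal[k] - start[k]) * i / (n - 1) for k in range(2)]
            for i in range(n)]
    for _ in range(iters):
        grads = []
        for i in range(1, n - 1):
            p, prev, nxt = traj[i], traj[i - 1], traj[i + 1]
            # Gradient of 0.5 * sum ||p[i+1] - p[i]||^2 with respect to p[i].
            g = [w_smooth * (2 * p[k] - prev[k] - nxt[k]) for k in range(2)]
            # Penalty 0.5 * w_obs * (margin - d)^2, active only when d < margin.
            dx, dy = p[0] - center[0], p[1] - center[1]
            r = math.hypot(dx, dy)
            d = r - radius
            if d < margin and r > 1e-9:
                scale = w_obs * (margin - d) / r
                g[0] -= scale * dx
                g[1] -= scale * dy
            grads.append(g)
        # Apply all updates together so each step uses a consistent gradient.
        for i, g in enumerate(grads, start=1):
            traj[i][0] -= step * g[0]
            traj[i][1] -= step * g[1]
    return traj


if __name__ == "__main__":
    center, radius = (5.0, 0.2), 1.0
    path = optimize_trajectory((0.0, 0.0), (10.0, 0.0), center, radius)
    clearance = min(signed_distance(p, center, radius) for p in path)
    print(f"minimum waypoint clearance: {clearance:.3f}")
```

In this sketch the obstacle is placed slightly off the straight line between start and goal so that the distance gradient has a sideways component. A waypoint passing exactly through the obstacle center would be pushed only along the path direction. A real system would additionally need costs for shot quality and target occlusion, which are omitted here.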
