Scanner: Efficient Video Analysis at Scale

A growing number of visual computing applications depend on the analysis of large video collections. The challenge is that scaling applications to operate on these datasets requires efficient systems for pixel data access and parallel processing across large numbers of machines. Few programmers have the capability to operate efficiently at these scales, limiting the field's ability to explore new applications that leverage big video data. In response, we have created Scanner, a system for productive and efficient video analysis at scale. Scanner organizes video collections as tables in a data store optimized for sampling frames from compressed video, and executes pixel processing computations, expressed as dataflow graphs, on these frames. Scanner schedules video analysis applications expressed using these abstractions onto heterogeneous throughput computing hardware, such as multi-core CPUs, GPUs, and media processing ASICs, for high-throughput pixel processing. We demonstrate the productivity of Scanner by authoring a variety of video processing applications including the synthesis of stereo VR video streams from multi-camera rigs, markerless 3D human pose reconstruction from video, and data-mining big video datasets such as hundreds of feature-length films or over 70,000 hours of TV news. These applications achieve near-expert performance on a single machine and scale efficiently to hundreds of machines, enabling formerly long-running big video data analysis tasks to be carried out in minutes to hours.
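The two abstractions named above, videos stored as tables of frames and pixel processing expressed as a dataflow graph, can be illustrated with a minimal sketch. This is not Scanner's API: `VideoTable`, `Op`, and `run_graph` are hypothetical names, frames are stand-in lists of integers rather than decoded video, and the graph is a simple chain run on a single thread. It only shows the shape of the model, which is to sample rows from a table and then push each sampled frame through a sequence of operations.

```python
from dataclasses import dataclass
from typing import Any, Callable, Dict, List

# Hypothetical stand-in for a decoded frame: a flat list of pixel intensities.
Frame = List[int]


@dataclass
class VideoTable:
    """Toy table with one row per frame, addressed by row index."""
    name: str
    frames: List[Frame]

    def sample(self, stride: int) -> List[int]:
        """Return the row indices of every `stride`-th frame."""
        return list(range(0, len(self.frames), stride))


@dataclass
class Op:
    """One node of the dataflow graph: a per-frame function and its input."""
    name: str
    fn: Callable[[Any], Any]
    input: str  # name of the upstream op, or "frame" for the table column


def run_graph(table: VideoTable, rows: List[int], ops: List[Op]) -> Dict[int, Any]:
    """Evaluate ops in the listed (topological) order for each sampled row."""
    results: Dict[int, Any] = {}
    for row in rows:
        values: Dict[str, Any] = {"frame": table.frames[row]}
        for op in ops:
            values[op.name] = op.fn(values[op.input])
        results[row] = values[ops[-1].name]  # output of the final op
    return results


# Ten synthetic frames; frame i holds the values [i, i + 1, i + 2].
table = VideoTable("clip", [[i, i + 1, i + 2] for i in range(10)])
ops = [
    Op("brightness", lambda f: sum(f) / len(f), "frame"),
    Op("is_bright", lambda b: b > 5.0, "brightness"),
]
output = run_graph(table, table.sample(stride=4), ops)
print(output)  # {0: False, 4: False, 8: True}
```

In this sketch the sampling step decides which rows are touched at all, which is the part a real system would make cheap by avoiding decoding of unneeded frames, and the per-row loop is the part a real system would distribute across CPUs, GPUs, and machines.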
