Procella: Unifying serving and analytical data at YouTube

Large organizations like YouTube face exploding data volumes and growing demand for data-driven applications. Broadly, these applications fall into four categories: reporting and dashboarding, embedded statistics in pages, time-series monitoring, and ad-hoc analysis. Typically, organizations build specialized infrastructure for each of these use cases. This, however, creates silos of data and processing, and results in infrastructure that is complex, expensive, and hard to maintain. At YouTube, we solved this problem by building a new SQL query engine, Procella. Procella implements a superset of the capabilities required to address all four use cases above, with high scale and performance, in a single product. Today, Procella serves hundreds of billions of queries per day across all four workloads at YouTube and several other Google product areas.
