Pinot: Realtime OLAP for 530 Million Users

Modern users demand analytical features on fresh, real-time data. Offering these features to hundreds of millions of users is a problem encountered by many large-scale web companies. Relational databases and key-value stores can be scaled to provide point lookups for a large number of users, but they fall apart under the combination of high ingest rates and high rates of low-latency analytical queries. Online analytical databases typically rely on bulk data loads and are not built to handle nonstop operation in demanding web environments. Offline analytical systems have high throughput but neither offer low query latencies nor scale to serving tens of thousands of queries per second. We present Pinot, a single system used in production at LinkedIn that can serve tens of thousands of analytical queries per second, offers near-realtime data ingestion from streaming data sources, and handles the operational requirements of large web properties. We also provide a performance comparison with Druid, a system similar to Pinot.
