A single-node datastore for high-velocity multidimensional sensor data

Sources of multidimensional data are becoming more prevalent, partly due to the rise of the Internet of Things (IoT), and so is the need to ingest and analyze data streams at rates higher than before. Some industrial IoT applications require ingesting millions of records per second while processing queries on both recently ingested and historical data. Unfortunately, existing database systems targeting multidimensional data exhibit low per-node ingestion performance, and even if they can scale horizontally in distributed settings, they require a large number of nodes to meet such ingestion demands. For this reason, in this paper we present a single-node datastore able to ingest multidimensional sensor data at very high rates. Its design centers on a two-level indexing structure, in which the global index is an in-memory R*-tree and the local indices are serialized kd-trees. This study is confined to records with numerical indexing fields and to range queries, and covers ingestion throughput, query response time, and storage footprint. We show that the adopted design streamlines data ingestion and offers ingestion rates two orders of magnitude higher than those of a selection of open-source database systems, namely Percona Server, SQLite, and Druid. Our prototype also reports query response times comparable to or better than those of Percona Server and Druid, and compares favorably in terms of storage footprint. We believe the experience reported here is valuable to researchers and practitioners interested in building database systems for high-velocity multidimensional sensor data.
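
To make the two-level idea concrete, the following is a minimal illustrative sketch, not the prototype's implementation. It rests on several assumptions that are not stated in the abstract: incoming records are buffered and sealed into fixed-size chunks, each sealed chunk gets its own kd-tree as a local index, and the global index is a plain list of chunk bounding boxes scanned linearly, standing in for the in-memory R*-tree. The kd-trees here stay in memory rather than being serialized. All names (`TwoLevelIndex`, `chunk_size`, `build_kdtree`, `kd_range`) are hypothetical.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple

Point = Tuple[float, ...]


@dataclass
class KDNode:
    point: Point
    axis: int
    left: Optional["KDNode"] = None
    right: Optional["KDNode"] = None


def build_kdtree(points: List[Point], depth: int = 0) -> Optional[KDNode]:
    """Build a kd-tree by splitting on the median, cycling through the axes."""
    if not points:
        return None
    axis = depth % len(points[0])
    pts = sorted(points, key=lambda p: p[axis])
    mid = len(pts) // 2
    return KDNode(pts[mid], axis,
                  build_kdtree(pts[:mid], depth + 1),
                  build_kdtree(pts[mid + 1:], depth + 1))


def inside(p: Point, lo: Point, hi: Point) -> bool:
    """True if p lies in the closed box [lo, hi] on every dimension."""
    return all(l <= x <= h for l, x, h in zip(lo, p, hi))


def kd_range(node: Optional[KDNode], lo: Point, hi: Point, out: List[Point]) -> None:
    """Append every point of the subtree that lies inside [lo, hi] to out."""
    if node is None:
        return
    if inside(node.point, lo, hi):
        out.append(node.point)
    split = node.point[node.axis]
    if lo[node.axis] <= split:   # left subtree holds values <= split
        kd_range(node.left, lo, hi, out)
    if split <= hi[node.axis]:   # right subtree holds values >= split
        kd_range(node.right, lo, hi, out)


class TwoLevelIndex:
    """Global level: chunk bounding boxes (linear scan, stand-in for an R*-tree).
    Local level: one kd-tree per sealed chunk (kept in memory in this sketch)."""

    def __init__(self, chunk_size: int = 4) -> None:
        self.chunk_size = chunk_size   # assumed sealing threshold
        self.buffer: List[Point] = []  # records not yet sealed into a chunk
        self.chunks: List[Tuple[Point, Point, Optional[KDNode]]] = []

    def insert(self, p: Point) -> None:
        self.buffer.append(p)
        if len(self.buffer) >= self.chunk_size:
            dims = range(len(p))
            lo = tuple(min(q[d] for q in self.buffer) for d in dims)
            hi = tuple(max(q[d] for q in self.buffer) for d in dims)
            self.chunks.append((lo, hi, build_kdtree(self.buffer)))
            self.buffer = []

    def range_query(self, lo: Point, hi: Point) -> List[Point]:
        # Unsealed records are checked directly.
        out = [p for p in self.buffer if inside(p, lo, hi)]
        for blo, bhi, tree in self.chunks:
            # Descend into a local kd-tree only if its box overlaps the query.
            if all(bl <= h and l <= bh for bl, bh, l, h in zip(blo, bhi, lo, hi)):
                kd_range(tree, lo, hi, out)
        return out


# Usage: records are (timestamp, reading) pairs.
index = TwoLevelIndex(chunk_size=4)
for i in range(10):
    index.insert((float(i), float(i * i % 7)))
print(sorted(index.range_query((2.0, 0.0), (6.0, 3.0))))
# prints [(3.0, 2.0), (4.0, 2.0), (6.0, 1.0)]
```

In this sketch, insertion only appends to a buffer and occasionally builds a small tree, so the global structure is touched once per chunk rather than once per record. Whether the actual system batches in this way is an assumption of the example.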
