Exploiting in-network processing for big data management

Data processing systems face the task of efficiently storing and processing data at petabyte scale, with the amount set to increase in the future. To meet such a requirement, highly scalable, shared-nothing systems, e.g. Google's BigTable [6] or Facebook's Cassandra [14], are built to partition data and process it in parallel on distributed nodes in a cluster. This allows the handling of data at scale but introduces new challenges due to the distribution of data. Running queries involves a high network overhead because data has to be exchanged between cluster nodes and hence, the network becomes a critical part of the system. To avoid the network bottleneck, it is essential for distributed data processing systems (DDPS) to be aware of the network rather than treating it as a black box. We propose in-network processing as a way of achieving network-awareness to decrease bandwidth usage by custom routing, redundancy elimination, and on-path data reduction. Thereby, we can increase the query throughput of a DDPS. The challenges of an in-network processing system range from design issues, such as performance and transparency, to the integration with query optimisation and deployment in data centres. We formulate these challenges as possible research directions and provide a prototype implementation. Our preliminary results suggest that we can significantly improve query throughput in a DDPS by performing partial data reduction within the network.

[1]  Wolfgang Lehner,et al.  SAP HANA database: data management for modern business applications , 2012, SGMD.

[2]  Prashant Malik,et al.  Cassandra: a decentralized structured storage system , 2010, OPSR.

[3]  Vasileios Pappas,et al.  Improving the Scalability of Data Center Networks with Traffic-aware Virtual Machine Placement , 2010, 2010 Proceedings IEEE INFOCOM.

[4]  Nigel Ellis,et al.  Extreme scale with full SQL language support in microsoft SQL Azure , 2010, SIGMOD Conference.

[5]  Haitao Wu,et al.  BCube: a high performance, server-centric network architecture for modular data centers , 2009, SIGCOMM '09.

[6]  Pete Wyckoff,et al.  Hive - A Warehousing Solution Over a Map-Reduce Framework , 2009, Proc. VLDB Endow..

[7]  Wilson C. Hsieh,et al.  Bigtable: A Distributed Storage System for Structured Data , 2006, TOCS.

[8]  Haitao Wu,et al.  RAMCube: Exploiting Network Proximity for RAM-Based Key-Value Store , 2012, HotCloud.

[9]  Michael Stonebraker,et al.  H-store: a high-performance, distributed main memory transaction processing system , 2008, Proc. VLDB Endow..

[10]  Emin Gün Sirer,et al.  SideCar: building programmable datacenter networks without programmable switches , 2010, Hotnets-IX.

[11]  Theodore Johnson,et al.  Gigascope: a stream database for network applications , 2003, SIGMOD '03.

[12]  Anees Shaikh,et al.  CloudNaaS: a cloud networking platform for enterprise applications , 2011, SoCC.

[13]  Wolfgang Lehner,et al.  Data mining with the SAP NetWeaver BI accelerator , 2006, VLDB.

[14]  Antony I. T. Rowstron,et al.  Camdoop: Exploiting In-network Aggregation for Big Data Applications , 2012, NSDI.

[15]  Vyas Sekar,et al.  SmartRE: an architecture for coordinated network-wide redundancy elimination , 2009, SIGCOMM '09.

[16]  Mario A. Nascimento,et al.  Better tree - better fruits: using dominating set trees for MAX queries , 2008, DMSN '08.

[17]  Parag Agrawal,et al.  The case for RAMClouds: scalable high-performance storage entirely in DRAM , 2010, OPSR.

[18]  Alexander L. Wolf,et al.  NaaS: Network-as-a-Service in the Cloud , 2012, Hot-ICE.

[19]  Mendel Rosenblum,et al.  Fast crash recovery in RAMCloud , 2011, SOSP.

[20]  Amin Vahdat,et al.  A scalable, commodity data center network architecture , 2008, SIGCOMM '08.