Paper information - Exploiting in-network processing for big data management

Exploiting in-network processing for big data management

Data processing systems face the task of efficiently storing and processing data at petabyte scale, with the amount set to increase in the future. To meet such a requirement, highly scalable, shared-nothing systems, such as Google's BigTable [6] or Facebook's Cassandra [14], are built to partition data and process it in parallel on distributed nodes in a cluster. This allows the handling of data at scale but introduces new challenges due to the distribution of data. Running queries involves a high network overhead because data has to be exchanged between cluster nodes, and hence the network becomes a critical part of the system. To avoid the network bottleneck, it is essential for distributed data processing systems (DDPS) to be aware of the network rather than treating it as a black box. We propose in-network processing as a way of achieving network-awareness to decrease bandwidth usage by custom routing, redundancy elimination, and on-path data reduction. This, in turn, can increase the query throughput of a DDPS. The challenges of an in-network processing system range from design issues, such as performance and transparency, to the integration with query optimisation and deployment in data centres. We formulate these challenges as possible research directions and provide a prototype implementation. Our preliminary results suggest that we can significantly improve query throughput in a DDPS by performing partial data reduction within the network.
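
The idea of on-path data reduction mentioned in the abstract can be illustrated with a minimal sketch. This is not the paper's prototype: the function names, the sample records, and the choice of a per-key sum as the reduction are all assumptions made for illustration. The sketch compares how many records reach the final node when every worker ships its raw output against the case where a hypothetical on-path node first merges records that share a key.

```python
from collections import Counter


def partial_sum(records):
    """Combine (key, value) records into one partial sum per key."""
    totals = Counter()
    for key, value in records:
        totals[key] += value
    return sorted(totals.items())


def records_sent_without_reduction(worker_outputs):
    """Count the records shipped when every worker sends raw output."""
    return sum(len(out) for out in worker_outputs)


def aggregate_on_path(worker_outputs):
    """Merge all worker outputs at a hypothetical on-path node.

    Returns the merged records and how many records are forwarded.
    """
    merged = partial_sum(r for out in worker_outputs for r in out)
    return merged, len(merged)


# Hypothetical sample data: three workers emitting (key, value) records.
workers = [
    [("a", 1), ("b", 2), ("a", 3)],
    [("a", 4), ("c", 5)],
    [("b", 6), ("c", 7), ("c", 8)],
]

raw_count = records_sent_without_reduction(workers)
merged, reduced_count = aggregate_on_path(workers)

# The final node applies the same reduction to whatever it receives,
# so the query result is the same with or without on-path merging.
final = dict(partial_sum(merged))
print(raw_count, reduced_count, final)
```

Because a sum is associative and commutative, merging early does not change the result; it only shrinks the number of records that cross the last network hop. Reductions without these properties would need a different treatment.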

Lukas Rupprecht

[1] Wolfgang Lehner, et al. SAP HANA database: data management for modern business applications, 2012, SIGMOD Record.

[2] Prashant Malik, et al. Cassandra: a decentralized structured storage system, 2010, ACM SIGOPS Operating Systems Review.

[3] Vasileios Pappas, et al. Improving the Scalability of Data Center Networks with Traffic-aware Virtual Machine Placement, 2010, Proceedings IEEE INFOCOM.

[4] Nigel Ellis, et al. Extreme scale with full SQL language support in Microsoft SQL Azure, 2010, SIGMOD Conference.

[5] Haitao Wu, et al. BCube: a high performance, server-centric network architecture for modular data centers, 2009, SIGCOMM '09.

[6] Pete Wyckoff, et al. Hive - A Warehousing Solution Over a Map-Reduce Framework, 2009, Proc. VLDB Endow.

[7] Wilson C. Hsieh, et al. Bigtable: A Distributed Storage System for Structured Data, 2006, TOCS.

[8] Haitao Wu, et al. RAMCube: Exploiting Network Proximity for RAM-Based Key-Value Store, 2012, HotCloud.

[9] Michael Stonebraker, et al. H-store: a high-performance, distributed main memory transaction processing system, 2008, Proc. VLDB Endow.

[10] Emin Gün Sirer, et al. SideCar: building programmable datacenter networks without programmable switches, 2010, Hotnets-IX.

[11] Theodore Johnson, et al. Gigascope: a stream database for network applications, 2003, SIGMOD '03.