Paper information - Understanding query performance in Accumulo

Understanding query performance in Accumulo

Open-source, BigTable-like distributed databases provide a scalable storage solution for data-intensive applications. The simple key-value storage schema provides fast record ingest and retrieval, nearly independent of the quantity of data stored. However, real applications must support non-trivial queries that require careful key design and value indexing. We study an Apache Accumulo-based big data system designed for a network situational awareness application. The application's storage schema and data retrieval requirements are analyzed. We then characterize the corresponding Accumulo performance bottlenecks. Queries are shown to be communication-bound and server-bound in different situations. Inefficiencies in the open-source communication stack and filesystem limit network and I/O performance, respectively. Additionally, in some situations, parallel clients can contend for server-side resources. Maximizing data retrieval rates for practical queries requires effective key design, indexing, and client parallelization.

Scott M. Sawyer | B. David O'Gwynn | An Tran | Tamara Yu

[1] Jeremy Kepner, et al. Dynamic distributed dimensional data model (D4M) database and computation system. 2012 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2012.

[2] Wilson C. Hsieh, et al. Bigtable: A Distributed Storage System for Structured Data. TOCS, 2006.

[3] Michael Stonebraker, et al. A comparison of approaches to large-scale data analysis. SIGMOD Conference, 2009.

[4] Eric Anderson, et al. Efficiency matters! OPSR, 2010.

[5] Dhabaleswar K. Panda, et al. Understanding the communication characteristics in HBase: What are the fundamental bottlenecks? 2012 IEEE International Symposium on Performance Analysis of Systems & Software, 2012.

[6] Jeremy Kepner, et al. Driving big data with big compute. 2012 IEEE Conference on High Performance Extreme Computing, 2012.

[7] Lin Xiao, et al. YCSB++: benchmarking and performance debugging advanced features in scalable table stores. SoCC, 2011.