Paper information - Online maintenance of very large random samples on flash storage

Online maintenance of very large random samples on flash storage

Recent advances in flash storage have made it an attractive alternative for data storage in a wide spectrum of computing devices, such as embedded sensors, mobile phones, PDAs, laptops, and even servers. However, flash storage has many unique characteristics that make existing data management/analytics algorithms designed for magnetic disks perform poorly with flash storage. For example, while random reads can be nearly as fast as sequential reads, random writes and in-place data updates are orders of magnitude slower than sequential writes. In this paper, we consider an important fundamental problem that would seem to be particularly challenging for flash storage: efficiently maintaining a very large random sample of a data stream (e.g., of sensor readings). First, we show that previous algorithms such as reservoir sampling and geometric file are not readily adapted to flash. Second, we propose B-File, an energy-efficient abstraction for flash storage to store self-expiring items, and show how a B-File can be used to efficiently maintain a large sample in flash. Our solution is simple, has a small (RAM) memory footprint, and is designed to cope with flash constraints in order to reduce latency and energy consumption. Third, we provide techniques to maintain biased samples with a B-File and to query the large sample stored in a B-File for a subsample of an arbitrary size. Finally, we present an evaluation with flash storage that shows our techniques are several orders of magnitude faster and more energy-efficient than (flash-friendly versions of) reservoir sampling and geometric file. A key finding of our study, of potential use to many flash algorithms beyond sampling, is that “semi-random” writes (as defined in the paper) on flash cards are over two orders of magnitude faster and more energy-efficient than random writes.
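
For orientation, the reservoir sampling baseline named in the abstract can be sketched as follows. This is a minimal in-memory illustration of the classic algorithm, not the paper's B-File and not its flash-friendly variant; the function name `reservoir_sample` and its parameters are assumptions made for this example. The in-place overwrite of a randomly chosen slot is the kind of random write that the abstract describes as slow on flash.

```python
import random


def reservoir_sample(stream, k, rng=None):
    """Keep a uniform random sample of up to k items from an iterable.

    Illustrative sketch only: the sample lives in a Python list in RAM,
    whereas the paper is concerned with samples stored on flash.
    """
    rng = rng or random.Random()
    reservoir = []
    for i, item in enumerate(stream):
        if i < k:
            # The first k items fill the reservoir directly.
            reservoir.append(item)
        else:
            # Item i (0-indexed) is kept with probability k / (i + 1).
            j = rng.randint(0, i)  # randint is inclusive on both ends
            if j < k:
                # In-place update at a random position: on flash this
                # would be a random write, which the abstract notes is
                # orders of magnitude slower than a sequential write.
                reservoir[j] = item
    return reservoir


if __name__ == "__main__":
    # Sample 10 readings from a hypothetical stream of 1000 values.
    print(reservoir_sample(range(1000), 10, random.Random(0)))
```

Every replacement after the first `k` items lands at an unpredictable position in the sample, so a flash-resident reservoir would incur one random write per replacement.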

Suman Nath | Phillip B. Gibbons
