Selfie: co-locating metadata and data to enable fast virtual block devices

Virtual block devices are widely used to provide a block interface to virtual machines (VMs). A virtual block device manages an indirection mapping from the virtual address space presented to a VM to a storage image hosted on a file system or storage volume. This indirection is recorded as metadata on the image, also known as a lookup table, which for data safety must be updated immediately upon each space allocation on the image (also known as image growth). Image growth is common because VM templates for large-scale deployments and snapshots for fast VM migration are heavily used. Though each table update involves only a few bytes of data, it demands a random write of an entire block. Furthermore, data consistency requires that the correct order of metadata and data writes be enforced, usually by inserting a FLUSH command between them. These metadata operations compromise the virtual device's efficiency. In this paper we introduce Selfie, a virtual disk format that eliminates frequent metadata writes by embedding metadata into data blocks, so that the write of a data block and its associated metadata completes in one atomic block operation. This is made possible by opportunistically compressing the data in a block to make room for the metadata. Experiments show that Selfie achieves as much as a 5× performance improvement over existing mainstream virtual disks. It delivers near-raw performance and scales well for concurrent I/O workloads.
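The core idea of making room for metadata by compressing a block can be illustrated with a minimal Python sketch. It is not the Selfie on-disk format: the 4096-byte block size, the `MAGIC` marker, the 14-byte header layout, the use of zlib, and the names `pack_block` and `unpack_block` are all assumptions chosen for illustration. When a block does not compress enough to fit the header, the sketch returns `None`, standing in for a fallback to a conventional lookup-table update.

```python
import os
import struct
import zlib

BLOCK_SIZE = 4096                 # assumed block size, not taken from the paper
MAGIC = b"SLFE"                   # hypothetical marker for a self-describing block
# Hypothetical header: magic, virtual block address, compressed payload length.
HEADER = struct.Struct("<4sQH")   # 4 + 8 + 2 = 14 bytes, no padding


def pack_block(data: bytes, vaddr: int):
    """Embed the mapping metadata in the data block if compression frees enough room.

    Returns one BLOCK_SIZE-byte block holding both the metadata and the
    compressed data, or None when the data is not compressible enough.
    """
    if len(data) != BLOCK_SIZE:
        raise ValueError("data must be exactly one block")
    payload = zlib.compress(data)
    if HEADER.size + len(payload) > BLOCK_SIZE:
        return None               # caller would fall back to a separate metadata write
    block = HEADER.pack(MAGIC, vaddr, len(payload)) + payload
    return block.ljust(BLOCK_SIZE, b"\x00")   # pad so a single block write suffices


def unpack_block(block: bytes):
    """Recover (virtual address, original data) from a packed block."""
    magic, vaddr, length = HEADER.unpack_from(block)
    if magic != MAGIC:
        raise ValueError("not a self-describing block")
    payload = block[HEADER.size:HEADER.size + length]
    return vaddr, zlib.decompress(payload)


# Usage: a compressible block carries its own metadata; a random one does not fit.
text_block = (b"virtual block device " * 200)[:BLOCK_SIZE]
packed = pack_block(text_block, vaddr=42)
assert packed is not None and len(packed) == BLOCK_SIZE
assert unpack_block(packed) == (42, text_block)
assert pack_block(os.urandom(BLOCK_SIZE), vaddr=43) is None
```

Because data and metadata travel in the same block, the sketch needs no ordering between two separate writes. A real format would also need a reliable way to tell packed blocks from raw blocks whose contents happen to begin with the marker, which this sketch does not address.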
