Fast Virtual Disk (FVD) for QEMU

1. Introduction

Fast Virtual Disk (FVD) is a new image format and the corresponding block device driver for QEMU. FVD provides virtual machines with high-performance, feature-rich virtual disks. As QEMU is used by several hypervisors (including KVM) to perform I/O emulation, FVD provides direct benefits to these hypervisors.

The following announcement was sent to the qemu-devel@nongnu.org mailing list and is available in the mailing list archive.

Title: [RFC] Propose the Fast Virtual Disk (FVD) image format that outperforms QCOW2 by 249%

Dear QEMU Community Members,

Happy new year! We would like to contribute a new year gift to the community.

As the community considers next-generation image formats for QEMU, we hope we challenge ourselves hard enough to find the right solution for the long term, rather than merely a convenient solution for the short term, because an image format has long-term impact and is hard to change once released. In this spirit, we argue that QCOW2's and QED's use of a two-level lookup table as the basis for implementing all features is a fundamental obstacle to achieving high performance. Accordingly, we advocate adopting the newly developed Fast Virtual Disk (FVD) image format in the QEMU mainline. FVD achieves the performance of a RAW image running on a raw partition, while providing the rich features of compact image, copy-on-write, copy-on-read, and adaptive prefetching. FVD is extensible and can accommodate additional features. Experiments show that the throughput of FVD is 249% higher than that of QCOW2 when using the PostMark benchmark to create files.

FVD came out of work done at the IBM T.J. Watson Research Center on virtual disk related issues during the development of the IBM Cloud. Internally at IBM, FVD (a.k.a. ODS) has been widely demonstrated since June 2010. Recently, the FVD technical papers were completed and the source code was cleared for external release. Now we can finally share FVD with the community and seek your valuable feedback and contributions. All related information is available at http://researcher.ibm.com/view_project.php?id=1852 , including a high-level overview of FVD, the source code, and the technical papers.

The FVD patch also includes a fully automated testing framework that exercises QEMU block device drivers under stress loads and extreme race conditions. Currently (as of January 2011), QCOW2 cannot pass the automated test. The symptom is that QCOW2 attempts to read beyond the end of the base image. QCOW2 experts, please take a look at this "potential" bug.

Best Regards,
Chunqiang Tang
Home page: http://www.research.ibm.com/people/c/ctang

2. Code and Paper Download

The so-called "FVD-cow paper" below describes the copy-on-write, copy-on-read, and adaptive prefetching capabilities of FVD: "FVD: a High-Performance Virtual Machine Image Format for Cloud," by Chunqiang Tang, October, 2010.

The so-called "FVD-compact paper" below describes the compact image capability of FVD: "Compact Image Support in Fast Virtual Disk (FVD)," by Chunqiang Tang, November, 2010.

Download the code from https://sites.google.com/site/tangchq/qemu-fvd . Note that the download link may not be accessible from some regions of the world, in which case a proxy is needed.

3. Comparing FVD and QCOW2/QED

FVD is designed to address the limitations of QCOW2. It has the following advantages over QCOW2/QED:
  • Optimize the on-disk data layout. FVD strives to make the on-disk data layout identical (or at least as close as possible) to that of a RAW image stored on a raw partition, because all guest file systems are optimized for that layout. By contrast, QCOW2/QED may cause a severely fragmented data layout on the physical disk, which significantly increases disk seek distances and degrades disk I/O performance. For instance, when a guest OS creates or resizes a guest file system, it writes out the guest file system metadata, which are all grouped together and placed at the beginning of a QCOW2/QED image, despite the fact that the guest file system deliberately scatters its metadata's virtual block addresses across the virtual disk for better reliability and locality, e.g., co-locating inodes and file content blocks in block groups. As a result, QCOW2/QED causes long disk seeks between accessing a file's metadata in the guest file system and accessing the file's content blocks. This problem is further explained in Sections 2.3 and 3.1 of the FVD-cow paper. The results in Figure 7 of the paper show that the average disk seek distance with a QCOW2 image is 460% longer than that with an FVD image.
  • Eliminate the overhead of a host file system when it can be avoided. Despite QCOW2/QED's compact image capability, in order to support storage over-commit (a.k.a. storage thin provisioning), QCOW2/QED must run on top of a host file system (e.g., ext3), but a host file system incurs a high overhead, causes data fragmentation, and compromises data integrity (see http://lwn.net/Articles/348739/). Specifically, the results in Figure 6 of the FVD-cow paper show that a RAW image on ext3 is 50-63% slower than a RAW image on a raw partition. By contrast, FVD can get rid of the host file system and directly store an image on a logical volume while still supporting storage over-commit, by automatically extending the size of the logical volume as needed. (Note: storage over-commit means that, e.g., a 100GB physical disk can be used to host 10 VMs, each with a 20GB virtual disk. This is possible because not every VM completely fills up its 20GB virtual disk.)
  • Eliminate the overhead of a compact image when it can be avoided. A compact image stores data in such a way that the size of the compact image is smaller than the size of the virtual disk perceived by the VM. The size of a compact image automatically grows as more data are written into the image. A compact image enables, but is not mandatory for, storage over-commit. For example, a RAW image stored as a sparse file on an ext3 host file system is not a compact image but supports storage over-commit. QCOW2 mandates the use of a compact image format, whereas compact image is optional in FVD. That is, a copy-on-write FVD image can have a RAW-image-like data layout. Ironically, related to the bullet above, QCOW2/QED must run on top of a host file system in order to support storage over-commit, but when QCOW2/QED actually runs on top of a host file system, QCOW2/QED's compact image capability is no longer needed, and merely adds overhead and causes the fragmented data layout problem described above. This is because 99.9% of QCOW2 images that exist today are stored on host file systems that already support sparse files, including ext2/ext3/ext4, GFS, NTFS, FFS, LFS, ReiserFS, Reiser4, XFS, JFS, VMFS, and ZFS. Storing a RAW image on those file systems automatically yields a sparse image and supports storage over-commit. It is important to support the 0.1% of special cases that lack sparse files, but not in a way that lets the needs of 0.1% of users unconditionally hijack the needs of the other 99.9%. FVD allows a user to decide whether to enable FVD's compact image capability, depending on whether the host file system already provides it.
  • Minimize disk I/O overhead for reading on-disk metadata. FVD minimizes the size of its on-disk metadata so that the metadata can be easily cached in memory, avoiding disk I/O for reading them. For example, for a 1TB virtual disk, the size of FVD's on-disk metadata is only 6MB, whereas QCOW2/QED's on-disk metadata is 128MB. (The arithmetic behind these numbers is sketched after this list.)
  • Minimize disk I/O overhead for updating on-disk metadata. FVD includes important optimizations that reduce on-disk metadata updates without compromising data integrity. See Section 3.3 of the FVD-cow paper and Section 2.3 of the FVD-compact paper for details. Thanks to the reduced overhead of reading and updating on-disk metadata, FVD issues 45% fewer disk I/Os than QCOW2 does when using the PostMark benchmark to create files (see Section 4.1.2 of the FVD-cow paper).
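
As a quick sanity check on the metadata sizes quoted in the list above, the small C program below reproduces the arithmetic from the parameters given in Section 4: one bitmap bit per 64KB block and one lookup-table entry per 1MB chunk (the 4-byte entry size is inferred from the 4MB table figure). This is illustrative arithmetic only, not FVD source code.

    #include <stdio.h>
    #include <stdint.h>

    int main(void)
    {
        uint64_t disk_size  = 1ULL << 40;   /* 1TB virtual disk */
        uint64_t block_size = 64 * 1024;    /* bitmap granularity: 64KB */
        uint64_t chunk_size = 1024 * 1024;  /* table granularity: 1MB */

        /* One bit per block: 16M blocks -> 2MB bitmap. */
        uint64_t bitmap = disk_size / block_size / 8;
        /* One (assumed) 4-byte entry per chunk: 1M chunks -> 4MB table. */
        uint64_t table = disk_size / chunk_size * 4;

        printf("bitmap %lluMB + table %lluMB = %lluMB total\n",
               (unsigned long long)(bitmap >> 20),
               (unsigned long long)(table >> 20),
               (unsigned long long)((bitmap + table) >> 20));
        return 0;   /* prints: bitmap 2MB + table 4MB = 6MB total */
    }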

4. How FVD Works

FVD is designed to solve the problems described above and achieve two goals:
  • High performance, i.e., the performance of a RAW image on a raw partition.
  • Flexibility to support all use cases so that a clear message can be advertised to the user community, i.e., simply use FVD.
FVD provides the following main features:
  • Copy-on-write. Similar to QCOW2/QED, an FVD image can be layered on top of a base image. When the VM writes to the virtual disk, the "dirty data" are stored in the FVD image.
  • Copy-on-read. In addition to copy-on-write, when the VM reads some data from the base image, those "clean data" can optionally be saved in the FVD image so that later reads to those data can get them from the FVD image rather than the base image. This capability is useful, e.g., in a Cloud environment, where the base image is stored on network attached storage (NAS) and the FVD image is stored on direct-attached storage (DAS). Copy-on-read avoids repeatedly reading the same data from NAS. It can also be used for post-copy migration of local storage.
  • Adaptive prefetching. In addition to copy-on-read, FVD's prefetching mechanism can optionally find resource idle time to copy from NAS to DAS the parts of the image that have not yet been accessed by the VM. Prefetching is conservative: if FVD detects contention on any resource (including DAS, NAS, or the network), it pauses prefetching temporarily and resumes later when the congestion disappears. (A minimal sketch of this policy appears after this list.)
  • Compact image. FVD can store an image in a compact format and support storage over-commit.
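
The sketch below illustrates the conservative pause-and-resume policy from the adaptive prefetching bullet above. The helper names, the stubbed measurements, and the 50% threshold are assumptions made for illustration; FVD's actual heuristics are described in the FVD-cow paper.

    #include <stdbool.h>
    #include <unistd.h>

    /* Stubbed probes; a real implementation would time actual copies. */
    static double idle_baseline_mbps(void) { return 100.0; }
    static double last_copy_mbps(void)     { return  90.0; }

    /* Copy one not-yet-accessed region from NAS to DAS; returns false
     * once the whole base image has been copied. (Stubbed here.) */
    static bool copy_next_region(void) { return false; }

    static void prefetch_loop(void)
    {
        while (copy_next_region()) {
            /* Throughput well below the idle baseline suggests the VM
             * (or another tenant) is competing for DAS, NAS, or the
             * network: back off and retry later. */
            if (last_copy_mbps() < 0.5 * idle_baseline_mbps()) {
                sleep(30);
            }
        }
    }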
FVD uses two on-disk metadata structures:
  • A bitmap to implement copy-on-write. A bit in the bitmap tracks the state of a block: the bit is 0 if the block is in the base image, and 1 if the block is in the FVD image. The default block size is 64KB, the same as in QCOW2. To represent the state of a 1TB base image, FVD only needs a 2MB bitmap, which can be easily cached in memory. This bitmap also implements copy-on-read and adaptive prefetching (see the FVD-cow paper).
  • A one-level lookup table to implement compact image. An entry in the table maps the virtual disk address of a chunk to the offset in the FVD image where the chunk is stored. The default chunk size is 1MB, the same as in VirtualBox VDI (VMware VMDK and Microsoft VHD use a chunk size of 2MB). For a 1TB virtual disk, the size of the lookup table is only 4MB.
By design, the chunk size is larger than the block size in order to reduce the size of the lookup table; a smaller lookup table can be easily cached in memory. The bitmap is small because of its efficient representation. Using a smaller block size improves runtime disk I/O performance, because during copy-on-write and copy-on-read a complete block must be read from the base image and saved in the FVD image even if the VM only accesses part of that block.
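
To make the two structures concrete, here is a simplified sketch of how a guest address could be routed through them. The struct layout, helper names, and entry encoding are hypothetical illustrations, not FVD's actual code.

    #include <stdbool.h>
    #include <stdint.h>

    #define BLOCK_SIZE (64 * 1024)      /* copy-on-write granularity */
    #define CHUNK_SIZE (1024 * 1024)    /* storage allocation granularity */

    struct fvd_meta {
        uint8_t  *bitmap;  /* 1 bit per block: 0 = base image, 1 = FVD image */
        uint32_t *table;   /* 1 entry per chunk: chunk's index in the image */
    };

    /* Copy-on-write test: does this virtual address live in the FVD
     * image (bit set) or still in the base image (bit clear)? */
    static bool in_fvd_image(const struct fvd_meta *m, uint64_t vaddr)
    {
        uint64_t block = vaddr / BLOCK_SIZE;
        return m->bitmap[block / 8] & (1u << (block % 8));
    }

    /* Compact image translation: map a virtual address to its offset
     * inside the FVD image file through the one-level lookup table. */
    static uint64_t compact_offset(const struct fvd_meta *m, uint64_t vaddr)
    {
        uint64_t chunk = m->table[vaddr / CHUNK_SIZE];
        return chunk * CHUNK_SIZE + vaddr % CHUNK_SIZE;
    }

In this sketch, a read first consults the bitmap: if the bit is clear, the data comes from the base image at the same virtual address; if the bit is set, it comes from the FVD image, translated through the lookup table when the compact image feature is enabled.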

FVD has several additional features:
  • Each function (compact image, copy-on-write, copy-on-read, or prefetching) can be enabled or disabled individually. For example, unlike QCOW2/QED, a copy-on-write FVD image does not have to use the compact image format.
  • An FVD image can be stored on any media, including a host file system, a raw partition, or a logical volume.
  • When a compact FVD image is stored on a logical volume and configured properly, FVD can automatically grow the size of the logical volume when the compact image needs more storage space. This allows FVD to support storage over-commit without using a host file system, which improves both performance and data integrity.
  • FVD uses an on-disk journal to store updates to on-disk metadata, which reduces disk I/O overhead and eliminates unnecessary metadata locking.
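
The journal's appeal is that one record can carry both pieces of metadata for a write: the block's bitmap bit and its chunk's lookup-table entry, so a single sequential append commits them together and keeps them consistent. The sketch below illustrates the idea; the record layout and names are hypothetical, and the actual design is described in Section 2.3 of the FVD-compact paper.

    #include <stdint.h>
    #include <unistd.h>

    /* One hypothetical journal record: commits a block's bitmap bit
     * and its chunk's lookup-table entry in a single disk write. */
    struct fvd_journal_record {
        uint64_t block;        /* block whose bitmap bit becomes 1 */
        uint32_t chunk_entry;  /* new lookup-table entry for its chunk */
        uint32_t magic;        /* marks a complete, valid record */
    };

    static void journal_commit(int journal_fd, struct fvd_journal_record *rec)
    {
        rec->magic = 0x4656444a;  /* arbitrary marker; a real journal
                                     would also checksum the record */
        /* One append commits both metadata updates; after a crash,
         * replaying valid records rebuilds the in-memory structures. */
        if (write(journal_fd, rec, sizeof *rec) != sizeof *rec) {
            /* a real driver would fail or retry the guest write here */
        }
    }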

5. Flexibility of FVD

For brevity, we refer to FVD's two core features as FVD-cow (copy-on-write) and FVD-compact (compact image). With both features enabled, it is called FVD-cow-compact; with both disabled, FVD-basic. Also for brevity, we do not further discuss copy-on-read and adaptive prefetching, as they are closely related to FVD-cow.

Let's consider how FVD supports different use cases, from the simplest one to the most complicated one demanding both high performance and rich features.
  • A naive user simply wants to get a VM up and running. She does not need any fancy features and cares little about performance.
Solution: use the default configuration of FVD, which could be either FVD-compact or FVD-basic on a host file system.
  • The user simply wants high performance and nothing else.
Solution: store an FVD-basic image on a raw partition or a logical volume.
  • The user wants high performance and storage over-commit.
Solution: FVD-compact on a logical volume.
Alternative: store a QCOW2/QED image on a host file system. The drawbacks are 1) the performance overhead and data integrity issues introduced by a host file system, and 2) the fragmented on-disk data layout caused by both QCOW2/QED and the host file system.
  • The user wants high performance, storage over-commit, and copy-on-write (perhaps also copy-on-read and prefetching).
Solution: FVD-cow-compact on a logical volume.
  • The user wants high performance and copy-on-write (perhaps also copy-on-read and prefetching), but no need for storage over-commit.
Solution: FVD-cow on a raw partition or a logical volume.
  • The user wants high performance, storage over-commit, copy-on-write (perhaps also copy-on-read and prefetching), but the host OS does not support logical volumes.
Solution: FVD-cow on a host file system.
  • The user wants high performance, storage over-commit, copy-on-write (perhaps also copy-on-read and prefetching), but the host OS supports neither logical volumes nor sparse files.
Solution: FVD-cow-compact on a host file system that does not support sparse files.
Comment: This is a very rare scenario as almost every modern file system supports sparse files (see http://en.wikipedia.org/wiki/Comparison_of_file_systems).

6. Discussion

Next, we will answer some questions about FVD.
  • Does FVD's flexibility confuse users?
Answer: FVD is actually simple. The default configuration of FVD works out of the box and suits many inexperienced users. Even for an advanced user who wants both high performance and rich features, the decision-making process is straightforward: 1) whether to use copy-on-write, 2) whether to use a compact image, and 3) where to store the image---on a logical volume or a host file system. For 1), the user already knows the answer, depending on whether she has a base image. For 2), if the user has sufficient storage space, then not using a compact image provides better performance. For 3), a user who wants high performance will use a logical volume.
  • Does FVD-cow-compact introduce more on-disk metadata updates than QCOW2/QED do? The concern is that FVD-cow-compact has two metadata structures (the lookup table and the bitmap), and hence handling a write issued by a VM may need to update both, which doubles the overhead.
Answer: FVD-cow-compact actually introduces fewer metadata updates than QCOW2/QED do, for two reasons. First, the "journal optimization" described in Section 2.3 of the FVD-compact paper allows FVD to update a block's metadata (i.e., both the lookup table and the bitmap) in a single disk write, which not only is more efficient but also ensures their consistency. Second, the optimizations described in Section 3.3 of the FVD-cow paper eliminate the need to update the bitmap in most common cases, resulting in a metadata update frequency much lower than that of QCOW2/QED.
  • Does FVD-cow-compact consume more memory than QCOW2/QED do? The concern is that FVD-cow-compact has two metadata structures (the lookup table and the bitmap), and hence may need more memory to cache them.
Answer: FVD-cow-compact actually needs much less memory to cache metadata than QCOW2/QED do. For a 1TB virtual disk, FVD's bitmap is 2MB and FVD's lookup table is 4MB, i.e., a total of 6MB. For a 1TB virtual disk, the L2 tables of QCOW2/QED alone are 128MB.
  • Does FVD-compact use more storage space than QCOW2/QED do? The concern is that FVD-compact allocates storage space at the granularity of 1MB chunks, whereas QCOW2/QED do allocation at the granularity of 64KB.
Answer: The difference in storage space consumption might not be large. Experiments show that creating a guest ext3 file system on a 10GB FVD-compact disk uses 378MB of storage space, while doing the same on QCOW2 uses 312MB. If FVD-compact uses a chunk size of 512KB, it uses 334MB. Additionally, there are two arguments. First, using a large chunk size for storage allocation is a popular practice---VirtualBox VDI uses 1MB chunks, and Microsoft VHD and VMware VMDK for ESX both use 2MB chunks. Second, when an FVD-compact image is stored on a host file system (which is how QCOW2/QED images are stored), storage space is actually allocated by the host file system at the granularity of 4KB, regardless of the chunk size in FVD-compact. In this case, FVD-compact consumes no more storage space than QCOW2/QED do.
  • Can FVD be extended to support QCOW2's other features, including encryption, snapshot, and compression?
Answer: It is possible to add additional features to FVD. Encryption in FVD can work in a way similar to that in QCOW2, and no changes to the bitmap or the lookup table are needed. We recommend following VMware VMDK's approach to implement snapshots, i.e., starting a new CoW image for each snapshot, as opposed to QCOW2's approach of storing all snapshots in one image. As for compression, we are not certain if it is a necessary feature, although it might be doable.
  • How to make the transition from QCOW2 to FVD?
Answer: Existing images can continue to use QCOW2 or be converted to FVD using the qemu-img tool. Over time, we expect new images to adopt FVD for its superior performance, rich features, and flexibility to cover all use cases.

7. Automated Testing

FVD comes with a fully automated testing framework, "qemu-test", which exercises QEMU block device drivers under stress loads and extreme race conditions. A simulated disk allows qemu-test to fully control and randomize the timing of disk I/O activities and callbacks in order to trigger rare race conditions. It also ensures that every observed bug is precisely repeatable. These ideas follow our previous work on testing distributed systems, but the implementation for testing QEMU block device drivers is new. Currently (as of January 2011), QCOW2 cannot pass the automated test. The symptom is that QCOW2 attempts to read beyond the end of the base image. QCOW2 experts, please take a look at this "potential" bug: simply run the script "test-qcow2.sh" and the bug will show up after some time.
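
The key to making randomized schedules repeatable is to drive every timing decision of the simulated disk from one seeded pseudo-random number generator: the same seed reproduces the same interleaving, and thus the same bug. The sketch below illustrates the idea; the structure and function names are hypothetical, not qemu-test's actual code.

    #include <stdlib.h>

    struct sim_request {
        void (*cb)(void *opaque);  /* completion callback of an I/O */
        void *opaque;
    };

    /* Complete pending simulated I/Os in a random but seed-determined
     * order, exercising callback interleavings a real disk rarely hits. */
    static void sim_disk_drain(struct sim_request *pending, int n,
                               unsigned int seed)
    {
        srand(seed);                    /* same seed => same schedule */
        while (n > 0) {
            int i = rand() % n;         /* pick any pending request */
            pending[i].cb(pending[i].opaque);
            pending[i] = pending[--n];  /* swap-remove the completed one */
        }
    }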

8. Conclusion

In summary, the design of FVD takes a principled approach to achieve the following benefits:
  • Strive to make the on-disk data layout identical (or at least as close as possible) to that of a RAW image stored on a raw partition.
  • Eliminate the overhead of a host file system when it can be avoided.
  • Eliminate the overhead of a compact image when it can be avoided.
  • Minimize disk I/O overhead for reading on-disk metadata by reducing metadata size.
  • Minimize disk I/O overhead for updating on-disk metadata.
Overall, FVD is the most flexible and best-performing image format, not only for QEMU but also among the image formats supported by other hypervisors. We strongly recommend the adoption of FVD into the QEMU mainline, and encourage community contributions to FVD.