QEMU FVD
Fast Virtual Disk (FVD) for QEMU
1. Introduction
Fast Virtual Disk (FVD) is a new image format and the corresponding block device driver for QEMU. FVD provides virtual machines with high-performance, feature-rich virtual disks. As QEMU performs I/O emulation for several hypervisors (including KVM), FVD provides direct benefits to those hypervisors.
The following announcement was sent to the qemu-devel@nongnu.org mailing list and is available in the mailing list archive.
Title: [RFC] Propose the Fast Virtual Disk (FVD) image format that outperforms QCOW2 by 249%
Dear QEMU Community Members,
Happy new year! We would like to contribute a new year gift to the community.
As the community considers the next-generation image formats for QEMU, hopefully we really challenge ourselves hard enough to find the right solution for the long term, rather than just a convenient solution for the short term, because an image format has long-term impacts and is hard to change once released. In this spirit, we would like to argue that QCOW2 and QED's use of a two-level lookup table as the basis for implementing all features is a fundamental obstacle for achieving high performance. Accordingly, we advocate the newly developed Fast Virtual Disk (FVD) image format for adoption in the QEMU mainline. FVD achieves the performance of a RAW image running on a raw partition, while providing the rich features of compact image, copy-on-write, copy-on-read, and adaptive prefetching. FVD is extensible and can accommodate additional features. Experiments show that the throughput of FVD is 249% higher than that of QCOW2 when using the PostMark benchmark to create files.
FVD came out of the work done at IBM T.J. Watson Research Center, when studying virtual disk related issues during the development of the IBM Cloud. At IBM internally, FVD (a.k.a. ODS) has been widely demonstrated since June 2010. Recently, the FVD technical papers were completed and the source code was cleared for external release. Now we finally can share FVD with the community, and seek your valuable feedback and contributions. All related information is available at http://researcher.ibm.com/view_project.php?id=1852 , including a high-level overview of FVD, the source code, and the technical papers.
The FVD patch also includes a fully automated testing framework that exercises QEMU block device drivers under stress loads and extreme race conditions. Currently (as of January 2011), QCOW2 cannot pass the automated test. The symptom is that QCOW2 attempts to read beyond the end of the base image. QCOW2 experts please take a look at this "potential" bug.
Best Regards,
Chunqiang Tang
Home page: http://www.research.ibm.com/people/c/ctang
2. Code and Paper Download
The so-called "FVD-cow paper"
below describes the
copy-on-write, copy-on-read, and adaptive prefetching
capabilities of FVD: "FVD:
a
High-Performance
Virtual
Machine
Image
Format
for
Cloud,"
by Chunqiang Tang, October, 2010.The so-called "FVD-compact paper" below describes the compact image capability of FVD: "Compact Image Support in Fast Virtual Disk (FVD)," by Chunqiang Tang, November, 2010.
Download the code from https://sites.google.com/site/tangchq/qemu-fvd . Note that the download link may not be accessible from some regions of the world, and a proxy may be needed.
3. Comparing FVD and QCOW2/QED
FVD is designed to address the limitations of QCOW2. It has the following advantages over QCOW2/QED:
- Optimize the on-disk data layout. FVD strives to make the on-disk data layout identical (or at least as close as possible) to that of a RAW image stored on a raw partition, because all guest file systems are optimized for that layout. By contrast, QCOW2/QED may cause a severely fragmented data layout on the physical disk, which significantly increases disk seek distances and degrades disk I/O performance. For instance, when a guest OS creates or resizes a guest file system, it writes out the guest file system metadata, which are all grouped together and put at the beginning of a QCOW2/QED image, despite the fact that the metadata's virtual block addresses are deliberately scattered across the virtual disk for better reliability and locality, e.g., co-locating inodes and file content blocks in block groups. As a result, there is a long disk seek between accessing the metadata of a file in the guest file system and accessing the file's content blocks. This problem is further explained in Sections 2.3 and 3.1 of the FVD-cow paper. The results in Figure 7 of the paper show that the average disk seek distance with a QCOW2 image is 460% longer than that with an FVD image.
- Eliminate the overhead of a host file system when it can be avoided. Despite QCOW2/QED's compact image capability, in order to support storage over-commit (a.k.a. storage thin provisioning), QCOW2/QED must run on top of a host file system (e.g., ext3), but a host file system incurs a high overhead, causes data fragmentation, and compromises data integrity (see http://lwn.net/Articles/348739/). Specifically, the results in Figure 6 of the FVD-cow paper show that a RAW image on ext3 is 50-63% slower than a RAW image on a raw partition. By contrast, FVD can get rid of the host file system and directly store an image on a logical volume while still supporting storage over-commit, by automatically extending the size of the logical volume as needed. (Note: storage over-commit means that, e.g., a 100GB physical disk can be used to host 10 VMs, each with a 20GB virtual disk. This is possible because not every VM completely fills up its 20GB virtual disk.)
- Eliminate the overhead of a compact image when it can be avoided. A compact image stores data in such a way that the size of the compact image is smaller than the size of the virtual disk perceived by the VM. The size of a compact image automatically grows as more data are written into the image. A compact image enables, but is not mandatory for, storage over-commit. For example, a RAW image stored as a sparse file on an ext3 host file system is not a compact image but supports storage over-commit. QCOW2 mandates the use of a compact image format, whereas the compact image is optional in FVD. That is, a copy-on-write FVD image can have a RAW-image-like data layout. Ironically, related to the bullet above, QCOW2/QED must run on top of a host file system in order to support storage over-commit, but when QCOW2/QED actually runs on top of a host file system, QCOW2/QED's compact image capability is no longer needed; it merely adds overhead and causes the fragmented data layout problem described above. This is because 99.9% of the QCOW2 images in existence today are stored on host file systems that already support sparse files, including ext2/ext3/ext4, GFS, NTFS, FFS, LFS, ReiserFS, Reiser4, XFS, JFS, VMFS, and ZFS. Storing a RAW image on those file systems automatically produces a sparse image and supports storage over-commit. It is important to provide support for the 0.1% of special cases that do not support sparse files, but not in a way that lets the needs of 0.1% of the users unconditionally hijack the needs of 99.9% of the users. FVD allows a user to decide whether to enable FVD's compact image capability, depending on whether the host file system already provides it.
- Minimize disk I/O overhead for reading on-disk metadata. FVD minimizes the size of its on-disk metadata so that they can be easily cached in memory, which avoids disk I/O overhead for reading them. For example, for a 1TB virtual disk, the size of FVD's on-disk metadata is only 6MB, whereas QCOW2/QED's on-disk metadata are 128MB (see the worked arithmetic after this list).
- Minimize disk I/O overhead for updating on-disk metadata. FVD includes important optimizations to reduce on-disk metadata updates without compromising data integrity. See Section 3.3 of the FVD-cow paper and Section 2.3 of the FVD-compact paper for details. Thanks to the reduced overhead in reading and updating on-disk metadata, FVD issues 45% fewer disk I/Os than QCOW2 does when using the PostMark benchmark to create files (see Section 4.1.2 of the FVD-cow paper).
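The 6MB vs. 128MB comparison above follows from simple arithmetic over the granularities of the metadata structures. A worked sketch in C (the 4-byte lookup-table entry size is an assumption implied by the 4MB figure quoted later in this document; QCOW2/QED L2 entries are 8 bytes):

    #include <stdio.h>

    int main(void)
    {
        const unsigned long long disk  = 1ULL << 40;   /* 1TB virtual disk      */
        const unsigned long long block = 64ULL << 10;  /* 64KB bitmap blocks    */
        const unsigned long long chunk = 1ULL << 20;   /* 1MB table chunks      */

        unsigned long long bitmap = disk / block / 8;  /* 1 bit per block        */
        unsigned long long table  = disk / chunk * 4;  /* assumed 4-byte entries */
        unsigned long long l2     = disk / block * 8;  /* 8-byte L2 entry per
                                                          64KB cluster           */

        printf("FVD bitmap:       %llu MB\n", bitmap >> 20);          /* 2 MB   */
        printf("FVD lookup table: %llu MB\n", table  >> 20);          /* 4 MB   */
        printf("FVD total:        %llu MB\n", (bitmap + table) >> 20);/* 6 MB   */
        printf("QCOW2/QED L2:     %llu MB\n", l2 >> 20);              /* 128 MB */
        return 0;
    }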
4. How FVD Works
FVD is designed to solve the problems described above and achieve two goals:
- High performance, i.e., the performance of a RAW image on a raw partition.
- Flexibility to support all use cases so that a clear message can be advertised to the user community, i.e., simply use FVD.
FVD provides four features:
- Copy-on-write. Similar to QCOW2/QED, an FVD image can be layered on top of a base image. When the VM writes to the virtual disk, the "dirty data" are stored in the FVD image.
- Copy-on-read. In addition to copy-on-write, when the VM reads data from the base image, those "clean data" can optionally be saved in the FVD image so that later reads of those data are served from the FVD image rather than the base image. This capability is useful, e.g., in a Cloud environment where the base image is stored on network-attached storage (NAS) and the FVD image is stored on direct-attached storage (DAS). Copy-on-read avoids repeatedly reading the same data from NAS. It can also be used for post-copy migration of local storage.
- Adaptive prefetching. In addition to copy-on-read, FVD's prefetching mechanism can optionally use resource idle time to copy from NAS to DAS the parts of the image that have not yet been accessed by the VM. Prefetching is conservative: if FVD detects contention on any resource (including DAS, NAS, or the network), it pauses prefetching temporarily and resumes later when the congestion disappears.
- Compact image. FVD can store an image in a compact format and support storage over-commit.
Internally, FVD implements these features with two metadata structures:
- A bitmap to implement copy-on-write. A bit in the bitmap tracks the state of a block: the bit is 0 if the block is in the base image, and 1 if the block is in the FVD image. The default size of a block is 64KB, the same as in QCOW2. To represent the state of a 1TB base image, FVD needs only a 2MB bitmap, which can be easily cached in memory. This bitmap also implements copy-on-read and adaptive prefetching (see the FVD-cow paper).
- A one-level lookup table to implement the compact image. One entry in the table maps the virtual disk address of a chunk to an offset in the FVD image where the chunk is stored. The default size of a chunk is 1MB, the same as in VirtualBox VDI (VMware VMDK and Microsoft VHD use a chunk size of 2MB). For a 1TB virtual disk, the size of the lookup table is only 4MB. The sketch after this list illustrates how the bitmap and the table cooperate.
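To make the cooperation of the two structures concrete, here is a deliberately tiny, self-contained toy model in C. All names are hypothetical and the sizes are scaled down so it runs in memory; the real FVD driver is asynchronous, journals its metadata updates, and uses 64KB blocks and 1MB chunks:

    #include <stdio.h>
    #include <stdint.h>
    #include <string.h>

    enum { BLOCK = 4, CHUNK = 16, DISK = 64 };          /* toy sizes in bytes */
    enum { NBLOCKS = DISK / BLOCK, NCHUNKS = DISK / CHUNK, UNALLOC = 0xFF };

    static uint8_t base[DISK];      /* read-only base image                  */
    static uint8_t fvd[DISK];       /* data area of the FVD image            */
    static uint8_t bitmap[NBLOCKS]; /* 0 = block in base image, 1 = in FVD   */
    static uint8_t table[NCHUNKS];  /* chunk -> offset in fvd[], or UNALLOC  */
    static uint8_t next_chunk;      /* bump allocator: the compact image grows */

    static uint8_t *fvd_addr(uint64_t vaddr)
    {
        uint64_t c = vaddr / CHUNK;
        if (table[c] == UNALLOC)    /* compact image: allocate on first write */
            table[c] = next_chunk++;
        return &fvd[table[c] * CHUNK + vaddr % CHUNK];
    }

    static void write_block(uint64_t vaddr, uint8_t v)  /* copy-on-write     */
    {
        memset(fvd_addr(vaddr), v, BLOCK);
        bitmap[vaddr / BLOCK] = 1;  /* the block now lives in the FVD image  */
    }

    static uint8_t read_block(uint64_t vaddr)
    {
        /* one bit decides where the data lives; copy-on-read would also
         * copy base data into the FVD image here and set the bit */
        return bitmap[vaddr / BLOCK] ? *fvd_addr(vaddr) : base[vaddr];
    }

    int main(void)
    {
        memset(base, 7, sizeof base);
        memset(table, UNALLOC, sizeof table);
        write_block(20, 9);                      /* dirty one block          */
        printf("vaddr 20 -> %d (FVD copy)\n", read_block(20));
        printf("vaddr 0  -> %d (base image)\n", read_block(0));
        return 0;
    }

Note how the bitmap works at a finer granularity (64KB in real FVD) than the table (1MB chunks): a chunk is allocated once, while the blocks within it flip individually from base to FVD as they are written.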
FVD has several additional features:
- Each function (compact image, copy-on-write, copy-on-read, or prefetching) can be enabled or disabled individually. For example, unlike QCOW2/QED, a copy-on-write FVD image does not have to use the compact image format.
- An FVD image can be stored on any media, including a host file system, a raw partition, or a logical volume.
- When a compact FVD image is stored on a logical volume and configured properly, FVD can automatically grow the size of the logical volume when the compact image needs more storage space. This allows FVD to support storage over-commit without using a host file system, which improves both performance and data integrity.
- FVD uses an on-disk journal to store updates to on-disk metadata, which reduces disk I/O overhead and eliminates unnecessary metadata locking (see the sketch after this list).
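A minimal sketch of the journal idea, assuming a hypothetical record layout (the actual on-disk format is specified in the FVD papers). One record carries both the lookup-table update and the bitmap update for a chunk, so a single sequential append replaces two scattered in-place writes and keeps the two structures consistent:

    #include <stdint.h>
    #include <unistd.h>

    struct journal_record {
        uint64_t vaddr;         /* virtual address of the affected chunk    */
        uint32_t chunk_offset;  /* new lookup-table entry for the chunk     */
        uint64_t bitmap_bits;   /* which 64KB blocks of it are now in FVD   */
    };

    static int journal_append(int jfd, const struct journal_record *rec)
    {
        /* sequential append: no seek; a real implementation would add a
         * checksum so torn records can be detected at crash-replay time */
        if (write(jfd, rec, sizeof *rec) != (ssize_t)sizeof *rec)
            return -1;
        return fdatasync(jfd);  /* durable before the guest write completes */
    }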
5. Flexibility of FVD
For brevity, we refer to FVD's two main features as FVD-cow (copy-on-write) and FVD-compact (compact image). With both features enabled, the image is called FVD-cow-compact. With both features disabled, it is called FVD-basic. Also for brevity, we do not further discuss copy-on-read and adaptive prefetching, as they are closely related to FVD-cow.
Let's consider how FVD supports different use cases, from the simplest one to the most complicated one demanding both high performance and rich features. (A compact restatement of these choices appears in the sketch after this list.)
- A naive user simply wants to get a VM up and running. She does not need any fancy features and cares little about performance.
Solution: use a default configuration of FVD, which could be either FVD-compact or FVD-basic on a host file system.
- The user simply wants high performance and nothing else.
Solution: store an FVD-basic image on a raw partition or a logical volume.
- The user wants high performance and storage over-commit.
Solution: FVD-compact on a logical volume.
Alternative: store a QCOW2/QED image on a host file system. The drawbacks are 1) the performance overhead and data integrity issues introduced by a host file system, and 2) the fragmented on-disk data layout caused by both QCOW2/QED and the host file system.
- The user wants high performance, storage over-commit, and copy-on-write (perhaps also copy-on-read and prefetching).
Solution: FVD-cow-compact on a logical volume.
- The user wants high performance and copy-on-write (perhaps also copy-on-read and prefetching), but no need for storage over-commit.
Solution: FVD-cow on a raw partition or a logical volume.
- The user wants high performance, storage over-commit, and copy-on-write (perhaps also copy-on-read and prefetching), but the host OS does not support logical volumes.
Solution: FVD-cow on a host file system.
- The user wants high performance, storage over-commit, and copy-on-write (perhaps also copy-on-read and prefetching), but the host OS supports neither logical volumes nor sparse files.
Solution: FVD-cow-compact on a host file system that does not support sparse files.
Comment: This is a very rare scenario, as almost every modern file system supports sparse files (see http://en.wikipedia.org/wiki/Comparison_of_file_systems).
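The use cases above reduce to three independent choices. A hypothetical restatement in C (these names are illustrative and do not correspond to FVD's actual configuration options):

    #include <stdbool.h>
    #include <stdio.h>

    struct fvd_choice {
        bool copy_on_write;     /* FVD-cow                                 */
        bool compact;           /* FVD-compact                             */
        bool on_logical_volume; /* else: raw partition or host file system */
    };

    static struct fvd_choice choose(bool have_base_image, bool want_overcommit,
                                    bool host_has_lvm, bool host_has_sparse_files)
    {
        struct fvd_choice c;
        c.copy_on_write = have_base_image;   /* 1) base image? use CoW        */
        c.on_logical_volume = host_has_lvm;  /* 3) LV gives best performance  */
        /* 2) the compact format is only needed when over-commit cannot be
         * obtained for free from host sparse files; with LVM, FVD-compact
         * on an auto-extended logical volume is the preferred setup */
        c.compact = want_overcommit && (host_has_lvm || !host_has_sparse_files);
        return c;
    }

    int main(void)
    {
        /* e.g., base image + over-commit on a host with LVM
         * -> FVD-cow-compact on a logical volume */
        struct fvd_choice c = choose(true, true, true, true);
        printf("cow=%d compact=%d lv=%d\n",
               c.copy_on_write, c.compact, c.on_logical_volume);
        return 0;
    }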
6. Discussion
Next, we answer some questions about FVD.
- Does FVD's flexibility confuse users?
Answer: FVD is actually simple. The default configuration of FVD works out of the box and suits many inexperienced users. Even for an advanced user who wants both high performance and rich features, the decision-making process is straightforward: 1) whether to use copy-on-write, 2) whether to use a compact image, and 3) where to store the image: a logical volume or a host file system. For 1), the user already knows the answer, depending on whether she has a base image. For 2), if the user has sufficient storage space, then not using a compact image provides better performance. For 3), a user who wants high performance will use a logical volume.
- Does FVD-cow-compact introduce more on-disk metadata updates than QCOW2/QED do? The concern is that FVD-cow-compact has two metadata structures (the lookup table and the bitmap), and hence handling a write issued by a VM may need to update both, which doubles the overhead.
Answer: FVD-cow-compact actually introduces fewer metadata updates than QCOW2/QED do, for two reasons. First, the "journal optimization" described in Section 2.3 of the FVD-compact paper allows FVD to update a block's metadata (i.e., both the lookup table and the bitmap) in a single disk write, which not only is more efficient but also ensures their consistency. Second, the optimizations described in Section 3.3 of the FVD-cow paper eliminate the need to update the bitmap in the most common cases, resulting in a metadata update frequency much lower than that of QCOW2/QED.
- Does FVD-cow-compact consume more memory than QCOW2/QED do? The concern is that FVD-cow-compact has two metadata structures (the lookup table and the bitmap), and hence may need more memory to cache them.
Answer: FVD-cow-compact actually needs much less memory to cache its metadata than QCOW2/QED do. For a 1TB virtual disk, FVD's bitmap is 2MB and FVD's lookup table is 4MB, i.e., a total of 6MB. For a 1TB virtual disk, the L2 tables of QCOW2/QED alone are 128MB.
- Does FVD-compact use more storage space than QCOW2/QED do? The concern is that FVD-compact allocates storage space at the granularity of 1MB chunks, whereas QCOW2/QED do allocation at the granularity of 64KB.
Answer: The difference in storage space consumption might not be large. Experiments show that creating a guest ext3 file system on a 10GB FVD-compact disk uses 378MB of storage space, while doing the same on QCOW2 uses 312MB. If FVD-compact uses a chunk size of 512KB, it uses 334MB. Additionally, there are two arguments. First, using a large chunk size for storage allocation is a popular practice: VirtualBox VDI uses 1MB chunks, and Microsoft VHD and VMware VMDK for ESX both use 2MB chunks.
Second, when an FVD-compact image is stored on a host file system (which is how QCOW2/QED images are stored), storage space is actually allocated by the host file system at the granularity of 4KB, regardless of the chunk size in FVD-compact. In this case, FVD-compact consumes no more storage space than QCOW2/QED do (the sketch after this list demonstrates this host file system behavior).
- Can FVD be extended to support QCOW2's other features, including encryption, snapshot, and compression?
Answer: It is possible to add more features to FVD. Encryption in FVD can work in a way similar to that in QCOW2, and no changes to the bitmap or the lookup table are needed. We recommend following VMware VMDK's approach to implementing snapshots, i.e., starting a new CoW image for each snapshot, as opposed to QCOW2's approach of storing all snapshots in one image. As for compression, we are not certain whether it is a necessary feature, although it might be doable.
- How can the transition from QCOW2 to FVD be made?
Answer: Existing images can continue to use QCOW2 or be converted to FVD using the qemu-img tool. Over time, new images will be attracted to FVD due to its superior performance, rich features, and flexibility to cover all use cases.
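The 4KB-granularity behavior relied on in the chunk-size answer above is easy to verify on any sparse-file-capable host file system. A standalone C demo (not FVD code; run it on, e.g., ext3/ext4):

    #include <stdio.h>
    #include <fcntl.h>
    #include <unistd.h>
    #include <sys/stat.h>

    int main(void)
    {
        int fd = open("sparse.img", O_CREAT | O_RDWR | O_TRUNC, 0644);
        if (fd < 0) return 1;

        ftruncate(fd, 1024 * 1024);      /* "allocate" a 1MB chunk: still a hole */
        pwrite(fd, "x", 1, 512 * 1024);  /* dirty one byte in the middle         */

        struct stat st;
        fstat(fd, &st);
        printf("apparent size: %lld bytes\n", (long long)st.st_size);
        printf("space used:    %lld bytes\n", (long long)st.st_blocks * 512);
        close(fd);
        return 0;
    }

On ext4 this prints an apparent size of 1048576 bytes but only 4096 bytes of space actually used: the host file system, not the image format's chunk size, determines allocation granularity.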
7. Automated Testing
FVD comes with a fully automated testing framework, "qemu-test", which can exercise QEMU block device drivers under stress loads and extreme race conditions. A simulated disk allows qemu-test to fully control and randomize the timing of disk I/O activities and callbacks in order to trigger rare race conditions, and it ensures that every observed bug is precisely repeatable. These ideas follow our previous work on testing distributed systems, but qemu-test is a new implementation for testing QEMU block device drivers. Currently (as of January 2011), QCOW2 cannot pass the automated test. The symptom is that QCOW2 attempts to read beyond the end of the base image. QCOW2 experts, please take a look at this "potential" bug: simply run the script "test-qcow2.sh" and the bug will show up after some time. A sketch of the randomize-yet-replay idea appears below.
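A minimal illustration of the idea, with hypothetical names (the real framework drives QEMU's block drivers, not printf callbacks): a seeded PRNG picks which pending simulated-disk I/O completes next, so different seeds explore different interleavings while replaying the same seed reproduces a failure exactly.

    #include <stdio.h>
    #include <stdlib.h>

    #define MAX_PENDING 64

    typedef void (*io_cb)(int id);

    static struct { int id; io_cb cb; } pending[MAX_PENDING];
    static int npending;

    static void submit_io(int id, io_cb cb)   /* driver under test queues I/O */
    {
        pending[npending].id = id;
        pending[npending].cb = cb;
        npending++;
    }

    static void complete_one_random_io(void)  /* harness fires one at random  */
    {
        int i = rand() % npending;
        pending[i].cb(pending[i].id);
        pending[i] = pending[--npending];     /* completion order randomized  */
    }

    static void done(int id) { printf("request %d completed\n", id); }

    int main(int argc, char **argv)
    {
        srand(argc > 1 ? atoi(argv[1]) : 1);  /* same seed => same interleaving */
        for (int i = 0; i < 4; i++)
            submit_io(i, done);
        while (npending > 0)
            complete_one_random_io();
        return 0;
    }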
8. Conclusion
In summary, the design of FVD takes a principled approach to achieve the following benefits:
- Strive to make the on-disk data layout identical (or at least as close as possible) to that of a RAW image stored on a raw partition.
- Eliminate the overhead of a host file system when it can be avoided.
- Eliminate the overhead of a compact image when it can be avoided.
- Minimize disk I/O overhead for reading on-disk metadata by reducing metadata size.
- Minimize disk I/O overhead for updating on-disk metadata.