# Virtualized I/O with Vhost-user {#vhost_processing}

# Table of Contents {#vhost_processing_toc}

- @ref vhost_processing_intro
- @ref vhost_processing_qemu
- @ref vhost_processing_init
- @ref vhost_processing_io_path
- @ref vhost_spdk_optimizations

# Introduction {#vhost_processing_intro}

This document is intended to provide an overview of how Vhost works behind the
scenes. Code snippets used in this document might have been simplified for the
sake of readability and should not be used as an API or implementation
reference.

Reading from the
[Virtio specification](http://docs.oasis-open.org/virtio/virtio/v1.0/virtio-v1.0.html):

```
The purpose of virtio and [virtio] specification is that virtual environments
and guests should have a straightforward, efficient, standard and extensible
mechanism for virtual devices, rather than boutique per-environment or per-OS
mechanisms.
```

Virtio devices use virtqueues to transport data efficiently. A virtqueue is a set
of three different single-producer, single-consumer ring structures designed to
store generic scatter-gather I/O. Virtio is most commonly used in QEMU VMs,
where QEMU itself exposes a virtual PCI device and the guest OS communicates
with it using a specific Virtio PCI driver. With only Virtio involved, it's
always the QEMU process that handles all I/O traffic.

Vhost is a protocol for devices accessible via inter-process communication.
It uses the same virtqueue layout as Virtio to allow Vhost devices to be mapped
directly to Virtio devices. This allows a Vhost device, exposed by an SPDK
application, to be accessed directly by a guest OS inside a QEMU process with
an existing Virtio (PCI) driver. Only the configuration, I/O submission
notification, and I/O completion interruption are piped through QEMU.
See also @ref vhost_spdk_optimizations.

The initial vhost implementation is a part of the Linux kernel and uses an ioctl
interface to communicate with userspace applications. What makes it possible for
SPDK to expose a vhost device is the Vhost-user protocol.

The [Vhost-user specification](https://git.qemu.org/?p=qemu.git;a=blob_plain;f=docs/interop/vhost-user.txt;hb=HEAD)
describes the protocol as follows:

```
[Vhost-user protocol] is aiming to complement the ioctl interface used to
control the vhost implementation in the Linux kernel. It implements the control
plane needed to establish virtqueue sharing with a user space process on the
same host. It uses communication over a Unix domain socket to share file
descriptors in the ancillary data of the message.

The protocol defines 2 sides of the communication, master and slave. Master is
the application that shares its virtqueues, in our case QEMU. Slave is the
consumer of the virtqueues.

In the current implementation QEMU is the Master, and the Slave is intended to
be a software Ethernet switch running in user space, such as Snabbswitch.

Master and slave can be either a client (i.e. connecting) or server (listening)
in the socket communication.
```

SPDK vhost is a Vhost-user slave server. It exposes Unix domain sockets and
allows external applications to connect.

# QEMU {#vhost_processing_qemu}

One of the major Vhost-user use cases is networking (DPDK) or storage (SPDK)
offload in QEMU. The following diagram presents how a QEMU-based VM
communicates with an SPDK Vhost-SCSI device.

# Device initialization {#vhost_processing_init}

All initialization and management information is exchanged using Vhost-user
messages. The connection always starts with feature negotiation. Both
the Master and the Slave expose a list of their implemented features and
upon negotiation they choose a common set of those. Most of these features are
implementation-related, but some also concern e.g. multiqueue support or live
migration.

After the negotiation, the Vhost-user driver shares its memory, so that the vhost
device (SPDK) can access it directly. The memory can be fragmented into multiple
physically-discontiguous regions and the Vhost-user specification puts a limit on
their number - currently 8. The driver sends a single message for each region with
the following data:

 * file descriptor - for mmap
 * user address - for memory translations in Vhost-user messages (e.g.
   translating vring addresses)
 * guest address - for translating buffer addresses in vrings (for QEMU this
   is a physical address inside the guest)
 * user offset - positive offset for the mmap
 * size

The Master will send new memory regions after each memory change - usually
hotplug/hotremove. The previous mappings will be removed.

Drivers may also request a device config, consisting of e.g. disk geometry.
Vhost-SCSI drivers, however, don't need to implement this functionality
as they use common SCSI I/O to query the underlying disk(s).

Afterwards, the driver requests the maximum number of supported queues and
starts sending virtqueue data, which consists of:

 * unique virtqueue id
 * index of the last processed vring descriptor
 * vring addresses (from user address space)
 * call descriptor (for interrupting the driver after I/O completions)
 * kick descriptor (to listen for I/O requests - unused by SPDK)

If the multiqueue feature has been negotiated, the driver has to send a specific
*ENABLE* message for each extra queue it wants to be polled. Other queues are
polled as soon as they're initialized.
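
For illustration only, the sketch below shows one way a vhost target could keep
the received regions in a table and perform the gpa_to_vva translation used
later in this document. The structure and function names (`vhost_mem_region`,
`vhost_mem_table`, `gpa_to_vva`) are hypothetical and do not refer to actual
SPDK code; the sketch assumes each region's file descriptor has already been
mmap()ed at `mmap_addr`.

```
#include <stdint.h>
#include <stddef.h>

/* Illustrative only - not the actual SPDK structures. One entry per
 * memory region received in a Vhost-user message. */
struct vhost_mem_region {
    uint64_t guest_phys_addr; /* guest address (guest-physical) */
    uint64_t size;            /* region size */
    uint64_t user_addr;       /* driver (QEMU) virtual address */
    uintptr_t mmap_addr;      /* where this process mmap()ed the region's fd */
};

struct vhost_mem_table {
    unsigned int nregions;    /* up to 8 per the specification */
    struct vhost_mem_region region[8];
};

/* Translate a guest physical address (as found in a vring descriptor)
 * into a pointer valid in the vhost process - the gpa_to_vva step. */
static void *gpa_to_vva(struct vhost_mem_table *mem, uint64_t gpa)
{
    for (unsigned int i = 0; i < mem->nregions; i++) {
        struct vhost_mem_region *r = &mem->region[i];

        if (gpa >= r->guest_phys_addr && gpa < r->guest_phys_addr + r->size) {
            return (void *)(r->mmap_addr + (gpa - r->guest_phys_addr));
        }
    }
    return NULL; /* address not covered by any shared region */
}
```

Since a single guest buffer may cross a region boundary, a real device may also
have to split it into multiple iovecs - this is the additional iovec splitting
mentioned in the I/O path section below.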
# I/O path {#vhost_processing_io_path}

The Master sends I/O by allocating proper buffers in shared memory, filling
the request data, and putting guest addresses of those buffers into virtqueues.

A Virtio-Block request looks as follows.

```
struct virtio_blk_req {
    uint32_t type; // READ, WRITE, FLUSH (read-only)
    uint64_t offset; // offset in the disk (read-only)
    struct iovec buffers[]; // scatter-gather list (read/write)
    uint8_t status; // I/O completion status (write-only)
};
```

And a Virtio-SCSI request as follows.

```
struct virtio_scsi_req_cmd {
    struct virtio_scsi_cmd_req *req; // request data (read-only)
    struct iovec read_only_buffers[]; // scatter-gather list for WRITE I/Os
    struct virtio_scsi_cmd_resp *resp; // response data (write-only)
    struct iovec write_only_buffers[]; // scatter-gather list for READ I/Os
};
```

A virtqueue generally consists of an array of descriptors, and each I/O needs
to be converted into a chain of such descriptors. A single descriptor can be
either readable or writable, so each I/O request consists of at least two
descriptors (request + response).

```
struct virtq_desc {
    /* Address (guest-physical). */
    le64 addr;
    /* Length. */
    le32 len;

/* This marks a buffer as continuing via the next field. */
#define VIRTQ_DESC_F_NEXT   1
/* This marks a buffer as device write-only (otherwise device read-only). */
#define VIRTQ_DESC_F_WRITE  2
    /* The flags as indicated above. */
    le16 flags;
    /* Next field if flags & NEXT */
    le16 next;
};
```

Legacy Virtio implementations used the name vring alongside virtqueue, and the
name vring is still used in virtio data structures inside the code. Instead of
`struct virtq_desc`, `struct vring_desc` is much more likely to be found.

After polling such a descriptor chain, the device needs to translate and
transform it back into the original request struct. It needs to know the
request layout up-front, so each device backend (Vhost-Block/SCSI) has its own
implementation for polling virtqueues. For each descriptor, the device performs
a lookup in the Vhost-user memory region table and goes through a gpa_to_vva
translation (guest physical address to vhost virtual address). SPDK requires
the request and response data to be contained within a single memory region.
I/O buffers do not have such limitations and SPDK may automatically perform
additional iovec splitting and gpa_to_vva translations if required. After
forming the request structs, SPDK forwards such I/O to the underlying drive and
polls for the completion. Once the I/O completes, SPDK vhost fills the response
buffer with proper data and interrupts the guest by doing an eventfd_write on
the call descriptor of the proper virtqueue. There are multiple interrupt
coalescing features involved, but they are not discussed in this document.
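
As a rough, hypothetical sketch of the flow described above - not actual SPDK
code - the fragment below polls a single request off a virtqueue, translates
its descriptor chain into iovecs, and later completes it by publishing the head
index in the used ring and signaling the call descriptor. It reuses
`struct vhost_mem_table` and `gpa_to_vva()` from the earlier sketch;
`struct virtqueue`, `poll_one_request()` and `complete_request()` are made-up
names.

```
#include <stdint.h>
#include <sys/eventfd.h>
#include <sys/uio.h>

#define VIRTQ_DESC_F_NEXT 1 /* same value as in the descriptor layout above */

/* Simplified vring layout; struct vhost_mem_table and gpa_to_vva() are
 * defined in the earlier sketch. */
struct vring_desc { uint64_t addr; uint32_t len; uint16_t flags; uint16_t next; };
struct vring_avail { uint16_t flags; uint16_t idx; uint16_t ring[]; };
struct vring_used_elem { uint32_t id; uint32_t len; };
struct vring_used { uint16_t flags; uint16_t idx; struct vring_used_elem ring[]; };

struct virtqueue {
    struct vring_desc *desc;
    struct vring_avail *avail;
    struct vring_used *used;
    uint16_t size;           /* number of descriptors */
    uint16_t last_avail_idx; /* index of the last processed vring descriptor */
    int callfd;              /* call descriptor used to interrupt the driver */
};

/* Collect one I/O request from the available ring into an iovec array.
 * Returns the head descriptor index, or -1 if the queue is empty. */
static int poll_one_request(struct virtqueue *vq, struct vhost_mem_table *mem,
                            struct iovec *iovs, int max_iovs, int *iovcnt)
{
    if (vq->last_avail_idx == vq->avail->idx) {
        return -1; /* nothing new posted by the driver */
    }

    uint16_t head = vq->avail->ring[vq->last_avail_idx++ % vq->size];
    uint16_t i = head;
    int cnt = 0;

    /* Follow the descriptor chain via the NEXT flag, translating each
     * guest physical address into a local pointer. */
    while (cnt < max_iovs) {
        iovs[cnt].iov_base = gpa_to_vva(mem, vq->desc[i].addr);
        iovs[cnt].iov_len = vq->desc[i].len;
        cnt++;
        if (!(vq->desc[i].flags & VIRTQ_DESC_F_NEXT)) {
            break;
        }
        i = vq->desc[i].next;
    }

    *iovcnt = cnt;
    return head;
}

/* Complete the request: publish the head index in the used ring and
 * interrupt the guest through the call descriptor. */
static void complete_request(struct virtqueue *vq, uint16_t head, uint32_t len)
{
    vq->used->ring[vq->used->idx % vq->size] = (struct vring_used_elem){ head, len };
    vq->used->idx++; /* a real device also needs a memory barrier here */
    eventfd_write(vq->callfd, 1);
}
```

Because SPDK polls the available ring in a loop like this, the kick descriptor
from the initialization section is never waited on - which leads to the
optimizations described next.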
## SPDK optimizations {#vhost_spdk_optimizations}

Due to its poll-mode nature, SPDK vhost removes the requirement for I/O
submission notifications, drastically increasing the vhost server throughput
and decreasing the guest overhead of submitting an I/O. A couple of different
solutions exist to mitigate the I/O completion interrupt overhead (irqfd,
vDPA), but those won't be discussed in this document. For the highest
performance, a poll-mode @ref virtio driver can be used, as it suppresses all
I/O completion interrupts, allowing the I/O path to fully bypass the QEMU/KVM
overhead.
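
The submission-notification and completion-interrupt suppression described
above map onto standard flags from the Virtio specification. A minimal sketch,
reusing the hypothetical `struct virtqueue` from the earlier example (the
helper functions are illustrative, not SPDK or Virtio APIs):

```
/* Flag values defined by the Virtio specification. */
#define VIRTQ_USED_F_NO_NOTIFY     1 /* device -> driver: no need to kick me */
#define VIRTQ_AVAIL_F_NO_INTERRUPT 1 /* driver -> device: no need to interrupt me */

/* A poll-mode device can advertise that submission notifications are
 * unnecessary, since it polls the available ring anyway. */
static void device_disable_kicks(struct virtqueue *vq)
{
    vq->used->flags |= VIRTQ_USED_F_NO_NOTIFY;
}

/* Similarly, a poll-mode guest driver can ask the device not to signal the
 * call descriptor, removing the completion interrupt as well. */
static void driver_disable_interrupts(struct vring_avail *avail)
{
    avail->flags |= VIRTQ_AVAIL_F_NO_INTERRUPT;
}
```

With kicks suppressed the device learns about new I/O purely by polling the
available ring, and with interrupts suppressed the driver polls the used ring
for completions, letting the data path bypass the QEMU/KVM overhead entirely.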