# Virtualized I/O with Vhost-user {#vhost_processing}

## Table of Contents {#vhost_processing_toc}

- @ref vhost_processing_intro
- @ref vhost_processing_qemu
- @ref vhost_processing_init
- @ref vhost_processing_io_path
- @ref vhost_spdk_optimizations

## Introduction {#vhost_processing_intro}

This document is intended to provide an overview of how Vhost works behind the
scenes. Code snippets used in this document might have been simplified for the
sake of readability and should not be used as an API or implementation
reference.

Reading from the
[Virtio specification](http://docs.oasis-open.org/virtio/virtio/v1.0/virtio-v1.0.html):

> The purpose of virtio and [virtio] specification is that virtual environments
> and guests should have a straightforward, efficient, standard and extensible
> mechanism for virtual devices, rather than boutique per-environment or per-OS
> mechanisms.

Virtio devices use virtqueues to transport data efficiently. A virtqueue is a
set of three different single-producer, single-consumer ring structures
designed to store generic scatter-gather I/O. Virtio is most commonly used in
QEMU VMs, where QEMU itself exposes a virtual PCI device and the guest OS
communicates with it using a specific Virtio PCI driver. With only Virtio
involved, it is always the QEMU process that handles all I/O traffic.

Vhost is a protocol for devices accessible via inter-process communication.
It uses the same virtqueue layout as Virtio, which allows Vhost devices to be
mapped directly to Virtio devices. This allows a Vhost device, exposed by an
SPDK application, to be accessed directly by a guest OS inside a QEMU process
with an existing Virtio (PCI) driver. Only the configuration, I/O submission
notification, and I/O completion interruption are piped through QEMU.
See also @ref vhost_spdk_optimizations.

The initial vhost implementation is a part of the Linux kernel and uses an
ioctl interface to communicate with userspace applications. What makes it
possible for SPDK to expose a vhost device is the Vhost-user protocol.

The [Vhost-user specification](https://git.qemu.org/?p=qemu.git;a=blob_plain;f=docs/interop/vhost-user.txt;hb=HEAD)
describes the protocol as follows:

> [Vhost-user protocol] is aiming to complement the ioctl interface used to
> control the vhost implementation in the Linux kernel. It implements the control
> plane needed to establish virtqueue sharing with a user space process on the
> same host. It uses communication over a Unix domain socket to share file
> descriptors in the ancillary data of the message.
>
> The protocol defines 2 sides of the communication, master and slave. Master is
> the application that shares its virtqueues, in our case QEMU. Slave is the
> consumer of the virtqueues.
>
> In the current implementation QEMU is the Master, and the Slave is intended to
> be a software Ethernet switch running in user space, such as Snabbswitch.
>
> Master and slave can be either a client (i.e. connecting) or server (listening)
> in the socket communication.

SPDK vhost is a Vhost-user slave server. It exposes Unix domain sockets and
allows external applications to connect.

## QEMU {#vhost_processing_qemu}

One of the major Vhost-user use cases is networking (DPDK) or storage (SPDK)
offload in QEMU. The following diagram presents how a QEMU-based VM
communicates with an SPDK Vhost-SCSI device.
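
All of the control-plane traffic in this exchange travels over the Vhost-user
Unix domain socket, with file descriptors (for guest memory regions and for the
kick/call eventfds) passed as ancillary data, as the specification quote above
describes. The sketch below is purely illustrative and is not SPDK's
implementation: the `vhost_user_msg_hdr` struct and the `recv_vhost_user_msg`
function are hypothetical names, and only the three-field header layout
(request, flags, payload size) follows the Vhost-user specification. It shows
how a slave might receive one message together with a passed descriptor:

```c
/* Illustrative sketch only - not SPDK code or API. */
#include <stdint.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <sys/uio.h>

/* Hypothetical header following the Vhost-user message framing. */
struct vhost_user_msg_hdr {
	uint32_t request; /* message type, e.g. "set memory table" */
	uint32_t flags;
	uint32_t size;    /* payload size that follows the header */
};

static int
recv_vhost_user_msg(int conn_fd, struct vhost_user_msg_hdr *hdr, int *passed_fd)
{
	/* Union guarantees proper alignment for the control message buffer. */
	union {
		char buf[CMSG_SPACE(sizeof(int))];
		struct cmsghdr align;
	} u;
	struct iovec iov = { .iov_base = hdr, .iov_len = sizeof(*hdr) };
	struct msghdr msg = {
		.msg_iov = &iov,
		.msg_iovlen = 1,
		.msg_control = u.buf,
		.msg_controllen = sizeof(u.buf),
	};

	if (recvmsg(conn_fd, &msg, 0) < (ssize_t)sizeof(*hdr)) {
		return -1;
	}

	*passed_fd = -1;
	for (struct cmsghdr *c = CMSG_FIRSTHDR(&msg); c != NULL;
	     c = CMSG_NXTHDR(&msg, c)) {
		if (c->cmsg_level == SOL_SOCKET && c->cmsg_type == SCM_RIGHTS) {
			/* e.g. an fd later used to mmap a guest memory region */
			memcpy(passed_fd, CMSG_DATA(c), sizeof(int));
		}
	}

	/* The payload (hdr->size bytes) would be read and dispatched next. */
	return 0;
}
```

In practice the payload that follows the header (for example a memory region
description or vring addresses) is read in a second step and dispatched based
on the `request` field.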

## Device initialization {#vhost_processing_init}

All initialization and management information is exchanged using Vhost-user
messages. The connection always starts with feature negotiation. Both
the Master and the Slave expose a list of their implemented features and,
upon negotiation, they choose a common set of those. Most of these features are
implementation-related, but some also concern e.g. multiqueue support or live
migration.

After the negotiation, the Vhost-user driver shares its memory, so that the
vhost device (SPDK) can access it directly. The memory can be fragmented into
multiple physically-discontiguous regions, and the Vhost-user specification
puts a limit on their number - currently 8. The driver sends a single message
for each region with the following data:

- file descriptor - for mmap
- user address - for memory translations in Vhost-user messages (e.g.
  translating vring addresses)
- guest address - for buffer address translations in vrings (for QEMU this
  is a physical address inside the guest)
- user offset - positive offset for the mmap
- size

The Master will send new memory regions after each memory change - usually
hotplug/hotremove. The previous mappings will be removed.

Drivers may also request a device config consisting of e.g. disk geometry.
Vhost-SCSI drivers, however, don't need to implement this functionality,
as they use common SCSI I/O to query the underlying disk(s).

Afterwards, the driver requests the maximum number of supported queues and
starts sending virtqueue data, which consists of:

- unique virtqueue id
- index of the last processed vring descriptor
- vring addresses (from user address space)
- call descriptor (for interrupting the driver after I/O completions)
- kick descriptor (to listen for I/O requests - unused by SPDK)

If the multiqueue feature has been negotiated, the driver has to send a
specific *ENABLE* message for each extra queue it wants to be polled. Other
queues are polled as soon as they're initialized.

## I/O path {#vhost_processing_io_path}

The Master sends I/O by allocating proper buffers in shared memory, filling
the request data, and putting guest addresses of those buffers into virtqueues.

A Virtio-Block request looks as follows.

```c
struct virtio_blk_req {
    uint32_t type;          // READ, WRITE, FLUSH (read-only)
    uint64_t offset;        // offset in the disk (read-only)
    struct iovec buffers[]; // scatter-gather list (read/write)
    uint8_t status;         // I/O completion status (write-only)
};
```

And a Virtio-SCSI request as follows.

```c
struct virtio_scsi_req_cmd {
    struct virtio_scsi_cmd_req *req;    // request data (read-only)
    struct iovec read_only_buffers[];   // scatter-gather list for WRITE I/Os
    struct virtio_scsi_cmd_resp *resp;  // response data (write-only)
    struct iovec write_only_buffers[];  // scatter-gather list for READ I/Os
};
```

A virtqueue generally consists of an array of descriptors, and each I/O needs
to be converted into a chain of such descriptors. A single descriptor can be
either readable or writable, so each I/O request consists of at least two
descriptors (request + response).

```c
struct virtq_desc {
        /* Address (guest-physical). */
        le64 addr;
        /* Length. */
        le32 len;

/* This marks a buffer as continuing via the next field. */
#define VIRTQ_DESC_F_NEXT   1
/* This marks a buffer as device write-only (otherwise device read-only). */
#define VIRTQ_DESC_F_WRITE  2
        /* The flags as indicated above. */
        le16 flags;
        /* Next field if flags & NEXT */
        le16 next;
};
```
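
To make the descriptor chaining concrete, below is a minimal, hedged sketch of
how a driver could lay out a single Virtio-Block READ as a three-descriptor
chain: a device-readable request header, a device-writable data buffer, and a
device-writable status byte. The function name `fill_blk_read_chain`, the
typedefs, and the 16-byte header length (type, reserved, sector in the
standard virtio-blk layout) are illustrative assumptions, not code from SPDK
or the specification; the flag defines are repeated so the sketch is
self-contained.

```c
#include <stdint.h>

typedef uint16_t le16;
typedef uint32_t le32;
typedef uint64_t le64;

#define VIRTQ_DESC_F_NEXT   1
#define VIRTQ_DESC_F_WRITE  2

struct virtq_desc {
	le64 addr;
	le32 len;
	le16 flags;
	le16 next;
};

/*
 * Fill descriptors [head, head + 2] with a single virtio-blk READ:
 * header (device read-only) -> data buffer (device write-only)
 * -> status byte (device write-only). All addresses are guest-physical.
 */
static void
fill_blk_read_chain(struct virtq_desc *desc, uint16_t head,
		    le64 hdr_gpa, le64 data_gpa, le32 data_len, le64 status_gpa)
{
	desc[head] = (struct virtq_desc) {
		.addr = hdr_gpa,
		.len = 16,                              /* request header */
		.flags = VIRTQ_DESC_F_NEXT,
		.next = head + 1,
	};
	desc[head + 1] = (struct virtq_desc) {
		.addr = data_gpa,
		.len = data_len,
		.flags = VIRTQ_DESC_F_NEXT | VIRTQ_DESC_F_WRITE,
		.next = head + 2,
	};
	desc[head + 2] = (struct virtq_desc) {
		.addr = status_gpa,
		.len = 1,
		.flags = VIRTQ_DESC_F_WRITE,            /* last: no NEXT */
	};
}
```

The device only ever sees the chain: it walks the `next` links until it finds
a descriptor without VIRTQ_DESC_F_NEXT, treating WRITE-flagged buffers as the
space it is allowed to fill.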

Legacy Virtio implementations used the name vring alongside virtqueue, and the
name vring is still used in virtio data structures inside the code. Instead of
`struct virtq_desc`, the `struct vring_desc` is much more likely to be found.

After polling such a descriptor chain, the device needs to translate and
transform it back into the original request struct. It needs to know the
request layout up-front, so each device backend (Vhost-Block/SCSI) has its own
implementation for polling virtqueues. For each descriptor, the device performs
a lookup in the Vhost-user memory region table and goes through a gpa_to_vva
translation (guest physical address to vhost virtual address). SPDK enforces
that the request and response data are contained within a single memory region.
I/O buffers do not have such limitations, and SPDK may automatically perform
additional iovec splitting and gpa_to_vva translations if required. After
forming the request structs, SPDK forwards such I/O to the underlying drive and
polls for the completion. Once the I/O completes, SPDK vhost fills the response
buffer with proper data and interrupts the guest by doing an eventfd_write on
the call descriptor of the proper virtqueue. There are multiple interrupt
coalescing features involved, but they are not discussed in this document.

### SPDK optimizations {#vhost_spdk_optimizations}

Due to its poll-mode nature, SPDK vhost removes the requirement for I/O
submission notifications, drastically increasing the vhost server throughput
and decreasing the guest overhead of submitting an I/O. A couple of different
solutions exist to mitigate the I/O completion interrupt overhead (irqfd,
vDPA), but those won't be discussed in this document. For the highest
performance, a poll-mode @ref virtio can be used, as it suppresses all I/O
completion interrupts, allowing the I/O path to fully bypass the QEMU/KVM
overhead.
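
As a closing illustration of the gpa_to_vva step described in the I/O path
section, here is a hedged sketch of a region-table lookup. The struct and
field names (`vhost_mem_region`, `mmap_base`, and so on) are hypothetical and
do not reflect SPDK's or DPDK's actual data structures; the sketch only mirrors
the per-region data (guest address, size, mmapped base) listed in the
initialization section.

```c
#include <stddef.h>
#include <stdint.h>

/* Illustrative region descriptor - names are not SPDK API. */
struct vhost_mem_region {
	uint64_t guest_phys_addr;  /* guest address from the Vhost-user message */
	uint64_t size;             /* region size */
	uintptr_t mmap_base;       /* local address from mmap of the passed fd */
};

/*
 * Translate a guest-physical address into a vhost-process virtual address
 * by scanning the region table. Returns NULL if the address does not fall
 * into any shared region.
 */
static void *
gpa_to_vva(const struct vhost_mem_region *regions, size_t nregions, uint64_t gpa)
{
	for (size_t i = 0; i < nregions; i++) {
		const struct vhost_mem_region *r = &regions[i];

		if (gpa >= r->guest_phys_addr &&
		    gpa < r->guest_phys_addr + r->size) {
			return (void *)(r->mmap_base + (gpa - r->guest_phys_addr));
		}
	}
	return NULL;
}
```

A real implementation also has to handle buffers that cross region boundaries,
which is where the iovec splitting mentioned above comes in.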