# Virtualized I/O with Vhost-user {#vhost_processing}

## Table of Contents {#vhost_processing_toc}

- @ref vhost_processing_intro
- @ref vhost_processing_qemu
- @ref vhost_processing_init
- @ref vhost_processing_io_path
- @ref vhost_spdk_optimizations

## Introduction {#vhost_processing_intro}

This document is intended to provide an overview of how Vhost works behind the
scenes. Code snippets used in this document might have been simplified for the
sake of readability and should not be used as an API or implementation
reference.

Reading from the
[Virtio specification](http://docs.oasis-open.org/virtio/virtio/v1.0/virtio-v1.0.html):

> The purpose of virtio and [virtio] specification is that virtual environments
> and guests should have a straightforward, efficient, standard and extensible
> mechanism for virtual devices, rather than boutique per-environment or per-OS
> mechanisms.

Virtio devices use virtqueues to transport data efficiently. A virtqueue is a
set of three different single-producer, single-consumer ring structures
designed to store generic scatter-gather I/O. Virtio is most commonly used in
QEMU VMs, where QEMU itself exposes a virtual PCI device and the guest OS
communicates with it using a specific Virtio PCI driver. With only Virtio
involved, it is always the QEMU process that handles all I/O traffic.
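
For reference, below is a simplified sketch of the available and used rings of
the split virtqueue described in the Virtio 1.0 specification (the third
structure, the descriptor table, is shown in the I/O path section later in this
document). This is an illustrative layout, not SPDK code.

```c
#include <stdint.h>

typedef uint16_t le16; /* little-endian fields, as defined by the Virtio spec */
typedef uint32_t le32;

/* Available ring - filled by the driver (guest), consumed by the device. */
struct virtq_avail {
        le16 flags;
        le16 idx;    /* next slot the driver will fill */
        le16 ring[]; /* head indexes of available descriptor chains */
};

/* Used ring - filled by the device, consumed by the driver. */
struct virtq_used_elem {
        le32 id;  /* head index of the completed descriptor chain */
        le32 len; /* number of bytes the device wrote into the chain */
};

struct virtq_used {
        le16 flags;
        le16 idx;
        struct virtq_used_elem ring[];
};
```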

Vhost is a protocol for devices accessible via inter-process communication.
It uses the same virtqueue layout as Virtio to allow Vhost devices to be mapped
directly to Virtio devices. This allows a Vhost device, exposed by an SPDK
application, to be accessed directly by a guest OS inside a QEMU process with
an existing Virtio (PCI) driver. Only the configuration, I/O submission
notification, and I/O completion interruption are piped through QEMU.
See also @ref vhost_spdk_optimizations.

The initial vhost implementation is a part of the Linux kernel and uses an
ioctl interface to communicate with userspace applications. What makes it
possible for SPDK to expose a vhost device is the Vhost-user protocol.

The [Vhost-user specification](https://qemu-project.gitlab.io/qemu/interop/vhost-user.html)
describes the protocol as follows:

> [Vhost-user protocol] is aiming to complement the ioctl interface used to
> control the vhost implementation in the Linux kernel. It implements the control
> plane needed to establish virtqueue sharing with a user space process on the
> same host. It uses communication over a Unix domain socket to share file
> descriptors in the ancillary data of the message.
>
> The protocol defines 2 sides of the communication, front-end and back-end.
> The front-end is the application that shares its virtqueues, in our case QEMU.
> The back-end is the consumer of the virtqueues.
>
> In the current implementation QEMU is the front-end, and the back-end is
> the external process consuming the virtio queues, for example a software
> Ethernet switch running in user space, such as Snabbswitch, or a block
> device back-end processing read and write to a virtual disk.
>
> The front-end and back-end can be either a client (i.e. connecting) or
> server (listening) in the socket communication.

SPDK vhost is a Vhost-user back-end server. It exposes Unix domain sockets and
allows external applications to connect.

## QEMU {#vhost_processing_qemu}

One of the major Vhost-user use cases is networking (DPDK) or storage (SPDK)
offload in QEMU. The following diagram presents how a QEMU-based VM
communicates with an SPDK Vhost-SCSI device.

![QEMU/SPDK vhost data flow](img/qemu_vhost_data_flow.svg)

## Device initialization {#vhost_processing_init}

All initialization and management information is exchanged using Vhost-user
messages. The connection always starts with the feature negotiation. Both
the front-end and the back-end expose a list of their implemented features and
upon negotiation they choose a common set of those. Most of these features are
implementation-related, but some also concern e.g. multiqueue support or live
migration.
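
Conceptually, every Vhost-user exchange is a small message over the Unix domain
socket, and the feature negotiation reduces to intersecting the two feature
bitmasks. Below is a minimal, illustrative sketch (not the actual SPDK or QEMU
code) of the message header described in the specification and of the
negotiation itself.

```c
#include <stdint.h>

/* Every Vhost-user request/reply starts with this header, followed by an
 * optional payload (e.g. a 64-bit feature mask for GET/SET_FEATURES). */
struct vhost_user_msg_header {
        uint32_t request; /* message type */
        uint32_t flags;   /* version and reply-related bits */
        uint32_t size;    /* payload size in bytes */
};

/* The negotiated feature set is simply the intersection of what the
 * back-end advertises and what the front-end supports. */
static uint64_t
negotiate_features(uint64_t backend_features, uint64_t frontend_features)
{
        return backend_features & frontend_features;
}
```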

After the negotiation, the Vhost-user driver shares its memory, so that the vhost
device (SPDK) can access it directly. The memory can be fragmented into multiple
physically-discontiguous regions and the Vhost-user specification puts a limit on
their number - currently 8. The driver sends a single message for each region with
the following data:

- file descriptor - for mmap
- user address - for memory translations in Vhost-user messages (e.g.
  translating vring addresses)
- guest address - for buffer address translation in vrings (for QEMU this
  is a physical address inside the guest)
- user offset - positive offset for the mmap
- size

The front-end will send new memory regions after each memory change - usually
hotplug/hotremove. The previous mappings will be removed.
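
The sketch below shows roughly what a back-end could do with each region
message: mmap() the received file descriptor and keep enough bookkeeping to
translate guest addresses later. The `memory_region` struct and `map_region()`
helper are illustrative names, not the actual SPDK/DPDK types.

```c
#include <stdint.h>
#include <sys/mman.h>

/* Illustrative per-region bookkeeping; fields mirror the message data. */
struct memory_region {
        uint64_t guest_phys_addr; /* guest address */
        uint64_t user_addr;       /* front-end (QEMU) virtual address */
        uint64_t size;
        uint64_t mmap_offset;     /* user offset - positive offset for the mmap */
        void    *mmap_base;       /* where the back-end mapped the region */
};

/* Map a region using the file descriptor received in ancillary data. */
static int
map_region(struct memory_region *r, int fd)
{
        /* Map from file offset 0 and skip mmap_offset afterwards, since mmap
         * file offsets generally have to be page-aligned. */
        void *base = mmap(NULL, r->size + r->mmap_offset,
                          PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);

        if (base == MAP_FAILED) {
                return -1;
        }
        r->mmap_base = (uint8_t *)base + r->mmap_offset;
        return 0;
}
```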

Drivers may also request a device config, consisting of e.g. disk geometry.
Vhost-SCSI drivers, however, don't need to implement this functionality
as they use common SCSI I/O to query the underlying disk(s).

Afterwards, the driver requests the maximum number of supported queues and
starts sending virtqueue data, which consists of:

- unique virtqueue id
- index of the last processed vring descriptor
- vring addresses (from user address space)
- call descriptor (for interrupting the driver after I/O completions)
- kick descriptor (to listen for I/O requests - unused by SPDK)

If the multiqueue feature has been negotiated, the driver has to send a specific
*ENABLE* message for each extra queue it wants to be polled. Other queues are
polled as soon as they're initialized.
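
Put together, the per-virtqueue state that a back-end could assemble from those
messages might look like the illustrative struct below (the type and field
names are assumptions for the sake of the example, not SPDK definitions).

```c
#include <stdbool.h>
#include <stdint.h>

struct virtq_desc;  /* descriptor table - layout shown in the I/O path section */
struct virtq_avail; /* available ring */
struct virtq_used;  /* used ring */

/* Illustrative per-virtqueue state built from the Vhost-user messages. */
struct backend_virtqueue {
        uint16_t id;               /* unique virtqueue id */
        uint16_t last_avail_idx;   /* index of the last processed vring descriptor */
        struct virtq_desc  *desc;  /* vring addresses, translated from the */
        struct virtq_avail *avail; /* front-end's user address space */
        struct virtq_used  *used;
        int  callfd;               /* eventfd for interrupting the driver */
        int  kickfd;               /* eventfd for I/O requests - unused by SPDK */
        bool enabled;              /* set by the *ENABLE* message (multiqueue) */
};
```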

## I/O path {#vhost_processing_io_path}

The front-end sends I/O by allocating proper buffers in shared memory, filling
the request data, and putting guest addresses of those buffers into virtqueues.

A Virtio-Block request looks as follows.

```c
struct virtio_blk_req {
        uint32_t type; // READ, WRITE, FLUSH (read-only)
        uint64_t offset; // offset in the disk (read-only)
        struct iovec buffers[]; // scatter-gather list (read/write)
        uint8_t status; // I/O completion status (write-only)
};
```

And a Virtio-SCSI request as follows.

```c
struct virtio_scsi_req_cmd {
        struct virtio_scsi_cmd_req *req; // request data (read-only)
        struct iovec read_only_buffers[]; // scatter-gather list for WRITE I/Os
        struct virtio_scsi_cmd_resp *resp; // response data (write-only)
        struct iovec write_only_buffers[]; // scatter-gather list for READ I/Os
};
```

A virtqueue generally consists of an array of descriptors and each I/O needs
to be converted into a chain of such descriptors. A single descriptor can be
either readable or writable, so each I/O request consists of at least two
descriptors (request + response).

```c
struct virtq_desc {
        /* Address (guest-physical). */
        le64 addr;
        /* Length. */
        le32 len;

/* This marks a buffer as continuing via the next field. */
#define VIRTQ_DESC_F_NEXT   1
/* This marks a buffer as device write-only (otherwise device read-only). */
#define VIRTQ_DESC_F_WRITE     2
        /* The flags as indicated above. */
        le16 flags;
        /* Next field if flags & NEXT */
        le16 next;
};
```
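
As an illustration, a Virtio-Block write could be described with a chain of
three descriptors: a read-only request header, a read-only data buffer, and a
write-only status byte. The sketch below reuses the `struct virtq_desc` and
flag definitions above; the guest-physical addresses and buffer sizes are made
up for the example.

```c
/* Hypothetical guest-physical addresses of the three buffers. */
#define REQ_HDR_GPA 0x100000ULL /* request header (type, reserved, offset) */
#define DATA_GPA    0x200000ULL /* 4 KiB of data to be written */
#define STATUS_GPA  0x300000ULL /* 1-byte completion status */

static void
fill_blk_write_chain(struct virtq_desc *desc)
{
        /* Descriptor 0: request header, device read-only. */
        desc[0].addr  = REQ_HDR_GPA;
        desc[0].len   = 16;
        desc[0].flags = VIRTQ_DESC_F_NEXT;
        desc[0].next  = 1;

        /* Descriptor 1: data buffer, also device read-only for a WRITE. */
        desc[1].addr  = DATA_GPA;
        desc[1].len   = 4096;
        desc[1].flags = VIRTQ_DESC_F_NEXT;
        desc[1].next  = 2;

        /* Descriptor 2: status byte, device write-only - the last descriptor
         * in the chain, so VIRTQ_DESC_F_NEXT is not set. */
        desc[2].addr  = STATUS_GPA;
        desc[2].len   = 1;
        desc[2].flags = VIRTQ_DESC_F_WRITE;
        desc[2].next  = 0;
}
```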

Legacy Virtio implementations used the name vring alongside virtqueue, and the
name vring is still used in virtio data structures inside the code. Instead of
`struct virtq_desc`, `struct vring_desc` is much more likely to be found.

After polling this descriptor chain, the device needs to translate and transform
it back into the original request struct. It needs to know the request layout
up-front, so each device back-end (Vhost-Block/SCSI) has its own implementation
for polling virtqueues. For each descriptor, the device performs a lookup in
the Vhost-user memory region table and goes through a gpa_to_vva translation
(guest physical address to vhost virtual address). SPDK enforces that the request
and response data are contained within a single memory region. I/O buffers
do not have such a limitation and SPDK may automatically perform additional
iovec splitting and gpa_to_vva translations if required. After forming the request
structs, SPDK forwards such I/O to the underlying drive and polls for the
completion. Once the I/O completes, SPDK vhost fills the response buffer with
proper data and interrupts the guest by doing an eventfd_write on the call
descriptor of the proper virtqueue. There are multiple interrupt coalescing
features involved, but they are not discussed in this document.
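
A simplified sketch of the two mechanisms mentioned above follows: the
gpa_to_vva lookup over the shared memory region table and the completion
interrupt via eventfd_write(). It reuses the illustrative `memory_region`
struct from the initialization section and is not the actual SPDK
implementation.

```c
#include <stdint.h>
#include <sys/eventfd.h>

/* Translate a guest physical address into a back-end (vhost) virtual address
 * by scanning the region table received during initialization. */
static void *
gpa_to_vva(struct memory_region *regions, int nregions, uint64_t gpa)
{
        for (int i = 0; i < nregions; i++) {
                struct memory_region *r = &regions[i];

                if (gpa >= r->guest_phys_addr &&
                    gpa < r->guest_phys_addr + r->size) {
                        return (uint8_t *)r->mmap_base +
                               (gpa - r->guest_phys_addr);
                }
        }
        return NULL; /* address not covered by any shared region */
}

/* Interrupt the guest after filling the response: a single write to the
 * virtqueue's call descriptor (an eventfd) is enough. */
static void
signal_used_queue(int callfd)
{
        eventfd_write(callfd, 1);
}
```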

### SPDK optimizations {#vhost_spdk_optimizations}

Due to its poll-mode nature, SPDK vhost removes the requirement for I/O submission
notifications, drastically increasing the vhost server throughput and decreasing
the guest overhead of submitting an I/O. A couple of different solutions exist
to mitigate the I/O completion interrupt overhead (irqfd, vDPA), but those won't
be discussed in this document. For the highest performance, a poll-mode @ref virtio
can be used, as it suppresses all I/O completion interrupts, allowing the I/O
path to fully bypass the QEMU/KVM overhead.