# Virtualized I/O with Vhost-user {#vhost_processing}

## Table of Contents {#vhost_processing_toc}

- @ref vhost_processing_intro
- @ref vhost_processing_qemu
- @ref vhost_processing_init
- @ref vhost_processing_io_path
- @ref vhost_spdk_optimizations

## Introduction {#vhost_processing_intro}

This document is intended to provide an overview of how Vhost works behind the
scenes. Code snippets used in this document might have been simplified for the
sake of readability and should not be used as an API or implementation
reference.

Reading from the
[Virtio specification](http://docs.oasis-open.org/virtio/virtio/v1.0/virtio-v1.0.html):

> The purpose of virtio and [virtio] specification is that virtual environments
> and guests should have a straightforward, efficient, standard and extensible
> mechanism for virtual devices, rather than boutique per-environment or per-OS
> mechanisms.

Virtio devices use virtqueues to transport data efficiently. A virtqueue is a
set of three different single-producer, single-consumer ring structures
designed to store generic scatter-gather I/O. Virtio is most commonly used in
QEMU VMs, where QEMU itself exposes a virtual PCI device and the guest OS
communicates with it using a specific Virtio PCI driver. With only Virtio
involved, it's always the QEMU process that handles all I/O traffic.
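
For reference, a minimal sketch of the split virtqueue layout from the Virtio
1.0 specification is shown below: a descriptor table (see `struct virtq_desc`
in the I/O path section) plus the available and used rings. Plain `uintN_t`
types are used here for brevity; the specification defines all fields as
little-endian.

```c
#include <stdint.h>

/* Driver -> device: indices of descriptor chain heads the device may process. */
struct virtq_avail {
        uint16_t flags;
        uint16_t idx;       /* where the driver puts the next entry */
        uint16_t ring[];    /* indices into the descriptor table */
};

/* One completed chain: the head index and how many bytes the device wrote. */
struct virtq_used_elem {
        uint32_t id;
        uint32_t len;
};

/* Device -> driver: descriptor chains the device has finished with. */
struct virtq_used {
        uint16_t flags;
        uint16_t idx;       /* where the device puts the next entry */
        struct virtq_used_elem ring[];
};
```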

Vhost is a protocol for devices accessible via inter-process communication.
It uses the same virtqueue layout as Virtio to allow Vhost devices to be mapped
directly to Virtio devices. This allows a Vhost device, exposed by an SPDK
application, to be accessed directly by a guest OS inside a QEMU process with
an existing Virtio (PCI) driver. Only the configuration, I/O submission
notifications, and I/O completion interrupts are piped through QEMU.
See also @ref vhost_spdk_optimizations.

The initial vhost implementation is a part of the Linux kernel and uses an ioctl
interface to communicate with userspace applications. What makes it possible for
SPDK to expose a vhost device is the Vhost-user protocol.

The [Vhost-user specification](https://git.qemu.org/?p=qemu.git;a=blob_plain;f=docs/interop/vhost-user.txt;hb=HEAD)
describes the protocol as follows:

> [Vhost-user protocol] is aiming to complement the ioctl interface used to
> control the vhost implementation in the Linux kernel. It implements the control
> plane needed to establish virtqueue sharing with a user space process on the
> same host. It uses communication over a Unix domain socket to share file
> descriptors in the ancillary data of the message.
>
> The protocol defines 2 sides of the communication, master and slave. Master is
> the application that shares its virtqueues, in our case QEMU. Slave is the
> consumer of the virtqueues.
>
> In the current implementation QEMU is the Master, and the Slave is intended to
> be a software Ethernet switch running in user space, such as Snabbswitch.
>
> Master and slave can be either a client (i.e. connecting) or server (listening)
> in the socket communication.

SPDK vhost is a Vhost-user slave server. It exposes Unix domain sockets and
allows external applications to connect.
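
The control plane mentioned above is built out of messages exchanged over this
socket. A minimal sketch of their shape is shown below, assuming illustrative
field names rather than the exact layout from the specification; file
descriptors (e.g. for shared memory or eventfds) are passed as SCM_RIGHTS
ancillary data on the Unix domain socket rather than inside the payload.

```c
#include <stdint.h>

/* Simplified Vhost-user message: a small fixed header followed by a
 * request-specific payload. */
struct vhost_user_msg {
        uint32_t request;    /* message type, e.g. a SET_MEM_TABLE request */
        uint32_t flags;      /* protocol version and reply-related flags */
        uint32_t size;       /* number of payload bytes that follow */
        uint8_t  payload[];  /* request-specific data */
};
```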

## QEMU {#vhost_processing_qemu}

One of the major Vhost-user use cases is networking (DPDK) or storage (SPDK)
offload in QEMU. The following diagram presents how a QEMU-based VM
communicates with an SPDK Vhost-SCSI device.

![QEMU/SPDK vhost data flow](img/qemu_vhost_data_flow.svg)

## Device initialization {#vhost_processing_init}

All initialization and management information is exchanged using Vhost-user
messages. The connection always starts with feature negotiation. Both
the Master and the Slave expose a list of their implemented features and
upon negotiation they choose a common subset of those. Most of these features are
implementation-related, but some also concern e.g. multiqueue support or live
migration.
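
A minimal sketch of the negotiation outcome, under the assumption that each
side advertises its features as a 64-bit mask (as the Virtio and Vhost-user
specifications do), might look as follows; the function name is illustrative.

```c
#include <stdint.h>

/* Only feature bits implemented by both the Master and the Slave become
 * part of the negotiated set. */
static uint64_t negotiate_features(uint64_t master_features,
                                   uint64_t slave_features)
{
        return master_features & slave_features;
}
```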

After the negotiation, the Vhost-user driver shares its memory, so that the vhost
device (SPDK) can access it directly. The memory can be fragmented into multiple
physically discontiguous regions and the Vhost-user specification puts a limit on
their number - currently 8. The driver sends a single message for each region with
the following data:

- file descriptor - for mmap
- user address - for memory translations in Vhost-user messages (e.g.
  translating vring addresses)
- guest address - for translating buffer addresses in vrings (for QEMU this
  is a physical address inside the guest)
- user offset - positive offset for the mmap
- size

The Master will send new memory regions after each memory change - usually
hotplug/hotremove. The previous mappings will be removed.
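
A sketch of how a slave might represent and map one such region is shown
below. The struct layout mirrors the fields listed above, but the names and
the mmap handling are illustrative only.

```c
#include <stddef.h>
#include <stdint.h>
#include <sys/mman.h>

/* One shared memory region as described in the Vhost-user message. */
struct vhost_memory_region {
        uint64_t guest_phys_addr;  /* guest address */
        uint64_t memory_size;      /* size */
        uint64_t userspace_addr;   /* user address in the Master (QEMU) */
        uint64_t mmap_offset;      /* user offset - positive offset for the mmap */
};

/* Map the region into the slave's address space. The file descriptor comes
 * from the ancillary data of the same message. Returns the virtual address
 * corresponding to guest_phys_addr, or NULL on failure. */
static void *map_region(const struct vhost_memory_region *region, int fd)
{
        void *base = mmap(NULL, region->mmap_offset + region->memory_size,
                          PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);

        if (base == MAP_FAILED) {
                return NULL;
        }
        return (uint8_t *)base + region->mmap_offset;
}
```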

Drivers may also request a device config, consisting of e.g. disk geometry.
Vhost-SCSI drivers, however, don't need to implement this functionality
as they use common SCSI I/O to query the underlying disk(s).
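
For Vhost-Block, the device config is essentially the Virtio-Block
configuration space. An abridged, illustrative sketch of a few of its fields
is shown below (the full `virtio_blk_config` layout is defined in the Virtio
specification).

```c
#include <stdint.h>

/* A few fields of the Virtio-Block device configuration space (abridged). */
struct virtio_blk_config_abridged {
        uint64_t capacity;   /* device size, in 512-byte sectors */
        uint32_t seg_max;    /* maximum number of segments in a request */
        uint32_t blk_size;   /* logical block size, in bytes */
};
```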

Afterwards, the driver requests the maximum number of supported queues and
starts sending virtqueue data, which consists of:

- unique virtqueue id
- index of the last processed vring descriptor
- vring addresses (from user address space)
- call descriptor (for interrupting the driver after I/O completions)
- kick descriptor (to listen for I/O requests - unused by SPDK)

If the multiqueue feature has been negotiated, the driver has to send a specific
*ENABLE* message for each extra queue it wants to be polled. Other queues are
polled as soon as they're initialized.
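
Put together, the per-queue state a slave might keep after processing these
messages could look like the sketch below. The names are illustrative and do
not reflect SPDK's internal layout; the vring structures come from
`<linux/virtio_ring.h>`.

```c
#include <stdbool.h>
#include <stdint.h>
#include <linux/virtio_ring.h>

/* Illustrative per-virtqueue state assembled from the initialization messages. */
struct slave_virtqueue {
        uint16_t id;                /* unique virtqueue id */
        uint16_t last_used_idx;     /* index of the last processed vring descriptor */

        /* vring addresses, translated from the Master's user address space
         * into this process's address space */
        struct vring_desc *desc;
        struct vring_avail *avail;
        struct vring_used *used;

        int callfd;                 /* call descriptor - eventfd used to interrupt
                                     * the driver after I/O completions */
        int kickfd;                 /* kick descriptor - unused by SPDK, which
                                     * polls the vrings instead */
        bool enabled;               /* set by the *ENABLE* message (multiqueue) */
};
```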

## I/O path {#vhost_processing_io_path}

The Master sends I/O by allocating proper buffers in shared memory, filling
the request data, and putting guest addresses of those buffers into virtqueues.

A Virtio-Block request looks as follows.

```c
struct virtio_blk_req {
        uint32_t type; // READ, WRITE, FLUSH (read-only)
        uint64_t offset; // offset in the disk (read-only)
        struct iovec buffers[]; // scatter-gather list (read/write)
        uint8_t status; // I/O completion status (write-only)
};
```
And a Virtio-SCSI request as follows.

```c
struct virtio_scsi_req_cmd {
        struct virtio_scsi_cmd_req *req; // request data (read-only)
        struct iovec read_only_buffers[]; // scatter-gather list for WRITE I/Os
        struct virtio_scsi_cmd_resp *resp; // response data (write-only)
        struct iovec write_only_buffers[]; // scatter-gather list for READ I/Os
};
```

A virtqueue generally consists of an array of descriptors and each I/O needs
to be converted into a chain of such descriptors. A single descriptor can be
either readable or writable, so each I/O request consists of at least two
descriptors (request + response).

```c
struct virtq_desc {
        /* Address (guest-physical). */
        le64 addr;
        /* Length. */
        le32 len;

/* This marks a buffer as continuing via the next field. */
#define VIRTQ_DESC_F_NEXT   1
/* This marks a buffer as device write-only (otherwise device read-only). */
#define VIRTQ_DESC_F_WRITE     2
        /* The flags as indicated above. */
        le16 flags;
        /* Next field if flags & NEXT */
        le16 next;
};
```

Legacy Virtio implementations used the name vring alongside virtqueue, and the
name vring is still used in virtio data structures inside the code. Instead of
`struct virtq_desc`, `struct vring_desc` is much more likely to be found.
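
The sketch below, assuming the Linux `<linux/virtio_ring.h>` definitions and
an illustrative function name, shows how a device-side implementation might
walk one descriptor chain, following the `NEXT` flag until the chain ends.

```c
#include <stdint.h>
#include <linux/virtio_ring.h>

/* Walk one descriptor chain starting at the head index taken from the
 * available ring. Each descriptor describes one guest-physical buffer. */
static void walk_descriptor_chain(struct vring_desc *table, uint16_t head)
{
        uint16_t i = head;

        for (;;) {
                /* table[i].addr (guest physical) and table[i].len describe
                 * one buffer; VRING_DESC_F_WRITE tells whether the device
                 * may write to it. A real backend would translate the
                 * address and build up the request here. */
                if (!(table[i].flags & VRING_DESC_F_NEXT)) {
                        break;
                }
                i = table[i].next;
        }
}
```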

After polling this descriptor chain, the device needs to translate and transform
it back into the original request struct. It needs to know the request layout
up-front, so each device backend (Vhost-Block/SCSI) has its own implementation
for polling virtqueues. For each descriptor, the device performs a lookup in
the Vhost-user memory region table and goes through a gpa_to_vva translation
(guest physical address to vhost virtual address). SPDK requires the request
and response data to be contained within a single memory region. I/O buffers
do not have such limitations and SPDK may automatically perform additional
iovec splitting and gpa_to_vva translations if required. After forming the request
structs, SPDK forwards such I/O to the underlying drive and polls for the
completion. Once the I/O completes, SPDK vhost fills the response buffer with
proper data and interrupts the guest by doing an eventfd_write on the call
descriptor of the proper virtqueue. There are multiple interrupt coalescing
features involved, but they will not be discussed in this document.
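
A minimal sketch of the two steps named above - the gpa_to_vva translation and
the completion interrupt - is shown below. The region table entry is similar
to the illustrative one from the device initialization section, here holding
the address returned by mmap; the function names are not SPDK's.

```c
#include <stddef.h>
#include <stdint.h>
#include <sys/eventfd.h>

struct mapped_region {
        uint64_t guest_phys_addr;   /* guest address from the Vhost-user message */
        uint64_t memory_size;       /* size of the region */
        void    *mmap_base;         /* where this process mapped the region */
};

/* gpa_to_vva: translate a guest physical address found in a vring descriptor
 * into a pointer this process can dereference. */
static void *gpa_to_vva(struct mapped_region *regions, int num_regions,
                        uint64_t gpa)
{
        for (int i = 0; i < num_regions; i++) {
                struct mapped_region *r = &regions[i];

                if (gpa >= r->guest_phys_addr &&
                    gpa < r->guest_phys_addr + r->memory_size) {
                        return (uint8_t *)r->mmap_base +
                               (gpa - r->guest_phys_addr);
                }
        }
        return NULL;  /* not covered by any shared memory region */
}

/* After the response buffer has been filled, interrupt the guest by signaling
 * the call descriptor (an eventfd) of the virtqueue the request came from. */
static void interrupt_guest(int callfd)
{
        eventfd_write(callfd, 1);
}
```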

### SPDK optimizations {#vhost_spdk_optimizations}

Due to its poll-mode nature, SPDK vhost removes the requirement for I/O submission
notifications, drastically increasing the vhost server throughput and decreasing
the guest overhead of submitting an I/O. A couple of different solutions exist
to mitigate the I/O completion interrupt overhead (irqfd, vDPA), but those won't
be discussed in this document. For the highest performance, a poll-mode @ref virtio
can be used, as it suppresses all I/O completion interrupts, allowing the I/O
path to fully bypass the QEMU/KVM overhead.