# Virtualized I/O with Vhost-user {#vhost_processing}

## Table of Contents {#vhost_processing_toc}

- @ref vhost_processing_intro
- @ref vhost_processing_qemu
- @ref vhost_processing_init
- @ref vhost_processing_io_path
- @ref vhost_spdk_optimizations

## Introduction {#vhost_processing_intro}

This document is intended to provide an overview of how Vhost works behind the
scenes. Code snippets used in this document might have been simplified for the
sake of readability and should not be used as an API or implementation
reference.

Reading from the
[Virtio specification](http://docs.oasis-open.org/virtio/virtio/v1.0/virtio-v1.0.html):

> The purpose of virtio and [virtio] specification is that virtual environments
> and guests should have a straightforward, efficient, standard and extensible
> mechanism for virtual devices, rather than boutique per-environment or per-OS
> mechanisms.

Virtio devices use virtqueues to transport data efficiently. A virtqueue is a set
of three different single-producer, single-consumer ring structures designed to
store generic scatter-gather I/O. Virtio is most commonly used in QEMU VMs,
where QEMU itself exposes a virtual PCI device and the guest OS communicates
with it using a specific Virtio PCI driver. With only Virtio involved, it's
always the QEMU process that handles all I/O traffic.

Vhost is a protocol for devices accessible via inter-process communication.
It uses the same virtqueue layout as Virtio to allow Vhost devices to be mapped
directly to Virtio devices. This allows a Vhost device, exposed by an SPDK
application, to be accessed directly by a guest OS inside a QEMU process with
an existing Virtio (PCI) driver. Only the configuration, I/O submission
notifications, and I/O completion interrupts are piped through QEMU.
See also @ref vhost_spdk_optimizations.

The initial vhost implementation is a part of the Linux kernel and uses an ioctl
interface to communicate with userspace applications. What makes it possible for
SPDK to expose a vhost device is the Vhost-user protocol.

The [Vhost-user specification](https://qemu-project.gitlab.io/qemu/interop/vhost-user.html)
describes the protocol as follows:

> [Vhost-user protocol] is aiming to complement the ioctl interface used to
> control the vhost implementation in the Linux kernel. It implements the control
> plane needed to establish virtqueue sharing with a user space process on the
> same host. It uses communication over a Unix domain socket to share file
> descriptors in the ancillary data of the message.
>
> The protocol defines 2 sides of the communication, front-end and back-end.
> The front-end is the application that shares its virtqueues, in our case QEMU.
> The back-end is the consumer of the virtqueues.
>
> In the current implementation QEMU is the front-end, and the back-end is
> the external process consuming the virtio queues, for example a software
> Ethernet switch running in user space, such as Snabbswitch, or a block
> device back-end processing read and write to a virtual disk.
>
> The front-end and back-end can be either a client (i.e. connecting) or
> server (listening) in the socket communication.

SPDK vhost is a Vhost-user back-end server. It exposes Unix domain sockets and
allows external applications to connect.

## QEMU {#vhost_processing_qemu}

One of the major Vhost-user use cases is networking (DPDK) or storage (SPDK)
offload in QEMU. The following diagram presents how a QEMU-based VM
communicates with an SPDK Vhost-SCSI device.

![QEMU/SPDK vhost data flow](img/qemu_vhost_data_flow.svg)

## Device initialization {#vhost_processing_init}

All initialization and management information is exchanged using Vhost-user
messages. The connection always starts with the feature negotiation. Both
the front-end and the back-end expose a list of their implemented features and
upon negotiation they choose a common set of those. Most of these features are
implementation-related, but some also concern e.g. multiqueue support or live migration.
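
Choosing a common feature set amounts to intersecting two bitmasks. The
following is a minimal sketch, assuming features are carried as 64-bit masks.
The `EXAMPLE_F_*` names and bit positions are hypothetical placeholders and are
not the constants defined by the Virtio or Vhost-user headers.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical feature bits, for illustration only. */
#define EXAMPLE_F_MULTIQUEUE     (1ULL << 0)
#define EXAMPLE_F_LIVE_MIGRATION (1ULL << 1)
#define EXAMPLE_F_VENDOR_EXT     (1ULL << 2)

/* Return the set of features that both sides implement. */
static uint64_t
example_negotiate_features(uint64_t frontend_features, uint64_t backend_features)
{
        return frontend_features & backend_features;
}

/* Check whether a negotiated set contains every bit of the given feature mask. */
static int
example_has_feature(uint64_t negotiated, uint64_t feature)
{
        /* An empty mask would match trivially, so reject it up-front. */
        assert(feature != 0);
        return (negotiated & feature) == feature;
}
```

With a front-end offering `EXAMPLE_F_MULTIQUEUE | EXAMPLE_F_VENDOR_EXT` and a
back-end offering `EXAMPLE_F_MULTIQUEUE | EXAMPLE_F_LIVE_MIGRATION`, only
`EXAMPLE_F_MULTIQUEUE` survives the negotiation.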

After the negotiation, the Vhost-user driver shares its memory, so that the vhost
device (SPDK) can access it directly. The memory can be fragmented into multiple
physically-discontiguous regions and the Vhost-user specification puts a limit on
their number - currently 8. The driver sends a single message for each region with
the following data:

- file descriptor - for mmap
- user address - for memory translations in Vhost-user messages (e.g.
  translating vring addresses)
- guest address - for buffer address translations in vrings (for QEMU this
  is a physical address inside the guest)
- user offset - positive offset for the mmap
- size

The front-end will send new memory regions after each memory change - usually
hotplug/hotremove. The previous mappings will be removed.
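
The region fields listed above map fairly directly onto an `mmap` call. The
sketch below shows one way a back-end could map and later drop a region. The
`example_*` names and the struct layout are hypothetical, and mapping from file
offset 0 (so that the offset need not be page-aligned) is an assumption of this
sketch rather than a description of the SPDK implementation.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>
#include <sys/mman.h>
#include <unistd.h>

/* Hypothetical in-memory form of one shared memory region. */
struct example_mem_region {
        int      fd;          /* file descriptor received with the message */
        uint64_t user_addr;   /* address in the front-end's address space */
        uint64_t guest_addr;  /* guest-physical address of the region start */
        uint64_t mmap_offset; /* offset of the region within the mapping */
        uint64_t size;        /* region size in bytes */
        void    *mmap_base;   /* filled in by example_region_map() */
        void    *host_addr;   /* mmap_base + mmap_offset */
};

/* Map one region into the back-end's address space. Returns 0 on success. */
static int
example_region_map(struct example_mem_region *r)
{
        void *base;

        assert(r != NULL);
        /* Map from file offset 0; the region itself starts at mmap_offset. */
        base = mmap(NULL, (size_t)(r->size + r->mmap_offset),
                    PROT_READ | PROT_WRITE, MAP_SHARED, r->fd, 0);
        if (base == MAP_FAILED) {
                perror("mmap");
                return -1;
        }

        r->mmap_base = base;
        r->host_addr = (uint8_t *)base + r->mmap_offset;
        return 0;
}

/* Drop a mapping, e.g. when the front-end sends a new set of regions. */
static void
example_region_unmap(struct example_mem_region *r)
{
        assert(r != NULL);
        munmap(r->mmap_base, (size_t)(r->size + r->mmap_offset));
        close(r->fd);
        r->mmap_base = NULL;
        r->host_addr = NULL;
}
```

A back-end following this sketch would call `example_region_unmap()` on every
old region before mapping the regions from a new memory table.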

Drivers may also request a device config, consisting of e.g. disk geometry.
Vhost-SCSI drivers, however, don't need to implement this functionality
as they use common SCSI I/O to query the underlying disk(s).

Afterwards, the driver requests the maximum number of supported queues and
starts sending virtqueue data, which consists of:

- unique virtqueue id
- index of the last processed vring descriptor
- vring addresses (from user address space)
- call descriptor (for interrupting the driver after I/O completions)
- kick descriptor (to listen for I/O requests - unused by SPDK)

If the multiqueue feature has been negotiated, the driver has to send a specific
*ENABLE* message for each extra queue it wants to be polled. Other queues are
polled as soon as they're initialized.

## I/O path {#vhost_processing_io_path}

The front-end sends I/O by allocating proper buffers in shared memory, filling
the request data, and putting guest addresses of those buffers into virtqueues.

A Virtio-Block request looks as follows.

```c
/* Conceptual layout, simplified for readability; not a literal C definition. */
struct virtio_blk_req {
        uint32_t type; // READ, WRITE, FLUSH (read-only)
        uint64_t offset; // offset in the disk (read-only)
        struct iovec buffers[]; // scatter-gather list (read/write)
        uint8_t status; // I/O completion status (write-only)
};
```

And a Virtio-SCSI request as follows.

```c
/* Conceptual layout, simplified for readability; not a literal C definition. */
struct virtio_scsi_req_cmd {
        struct virtio_scsi_cmd_req *req; // request data (read-only)
        struct iovec read_only_buffers[]; // scatter-gather list for WRITE I/Os
        struct virtio_scsi_cmd_resp *resp; // response data (write-only)
        struct iovec write_only_buffers[]; // scatter-gather list for READ I/Os
};
```

A virtqueue generally consists of an array of descriptors and each I/O needs
to be converted into a chain of such descriptors. A single descriptor can be
either readable or writable, so each I/O request consists of at least two
descriptors (request + response).

```c
struct virtq_desc {
        /* Address (guest-physical). */
        le64 addr;
        /* Length. */
        le32 len;

/* This marks a buffer as continuing via the next field. */
#define VIRTQ_DESC_F_NEXT   1
/* This marks a buffer as device write-only (otherwise device read-only). */
#define VIRTQ_DESC_F_WRITE     2
        /* The flags as indicated above. */
        le16 flags;
        /* Next field if flags & NEXT */
        le16 next;
};
```
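
To illustrate how a chain of such descriptors is followed, the sketch below
walks a chain from its head and sums the device-readable and device-writable
bytes. It is a simplified illustration: `struct example_desc` is a host-endian
stand-in for `struct virtq_desc`, and the function and struct names are
hypothetical rather than taken from SPDK.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define VIRTQ_DESC_F_NEXT  1
#define VIRTQ_DESC_F_WRITE 2

/* Host-endian stand-in for struct virtq_desc. A real implementation has to
 * convert the little-endian fields on big-endian hosts. */
struct example_desc {
        uint64_t addr;
        uint32_t len;
        uint16_t flags;
        uint16_t next;
};

struct example_chain_info {
        uint32_t readable_bytes; /* total length of device read-only buffers */
        uint32_t writable_bytes; /* total length of device write-only buffers */
        uint16_t desc_count;     /* number of descriptors in the chain */
};

/* Walk the chain starting at head. Returns 0 on success, or -1 if the chain
 * points outside the table or loops back on itself. */
static int
example_walk_chain(const struct example_desc *table, uint16_t table_size,
                   uint16_t head, struct example_chain_info *out)
{
        uint16_t idx = head;

        assert(table != NULL && out != NULL);
        out->readable_bytes = 0;
        out->writable_bytes = 0;
        out->desc_count = 0;

        for (;;) {
                const struct example_desc *d;

                /* A valid chain never has more entries than the table. */
                if (idx >= table_size || out->desc_count >= table_size) {
                        return -1;
                }

                d = &table[idx];
                if (d->flags & VIRTQ_DESC_F_WRITE) {
                        out->writable_bytes += d->len;
                } else {
                        out->readable_bytes += d->len;
                }
                out->desc_count++;

                if (!(d->flags & VIRTQ_DESC_F_NEXT)) {
                        return 0;
                }
                idx = d->next;
            }
}
```

The bound on `desc_count` matters because the descriptor table lives in memory
shared with the guest, so the back-end cannot assume the chain is well-formed.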

Legacy Virtio implementations used the name vring alongside virtqueue, and the
name vring is still used in virtio data structures inside the code. Instead of
`struct virtq_desc`, the `struct vring_desc` is much more likely to be found.

After polling this descriptor chain, the device needs to translate and transform
it back into the original request struct. It needs to know the request layout
up-front, so each device backend (Vhost-Block/SCSI) has its own implementation
for polling virtqueues. For each descriptor, the device performs a lookup in
the Vhost-user memory region table and goes through a gpa_to_vva translation
(guest physical address to vhost virtual address). SPDK requires the request
and response data to be contained within a single memory region. I/O buffers
do not have such limitations and SPDK may automatically perform additional
iovec splitting and gpa_to_vva translations if required. After forming the request
structs, SPDK forwards such I/O to the underlying drive and polls for the
completion. Once I/O completes, SPDK vhost fills the response buffer with
proper data and interrupts the guest by doing an eventfd_write on the call
descriptor for the proper virtqueue. There are multiple interrupt coalescing
features involved, but they are not discussed in this document.
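
The gpa_to_vva lookup described above can be sketched as a linear search over
the memory region table. This is a simplified illustration under stated
assumptions: the `example_*` names and the region struct are hypothetical, and
the function rejects any range that is not fully inside one region, mirroring
the single-region requirement for request and response data.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical entry of the memory region table. */
struct example_region {
        uint64_t guest_addr; /* guest-physical address of the region start */
        uint64_t size;       /* region size in bytes */
        uint8_t *host_addr;  /* where the back-end mapped the region */
};

/* Translate the guest-physical range [gpa, gpa + len) into a back-end virtual
 * address. Returns NULL unless the whole range lies inside a single region. */
static void *
example_gpa_to_vva(const struct example_region *regions, size_t count,
                   uint64_t gpa, uint64_t len)
{
        size_t i;

        assert(regions != NULL || count == 0);
        for (i = 0; i < count; i++) {
                const struct example_region *r = &regions[i];
                uint64_t off;

                if (gpa < r->guest_addr) {
                        continue;
                }

                off = gpa - r->guest_addr;
                /* Compare against the remaining size to avoid overflow. */
                if (off < r->size && len <= r->size - off) {
                        return r->host_addr + off;
                }
        }

        return NULL;
}
```

For an I/O buffer that spans two regions, a caller following this sketch would
get `NULL` and could split the buffer at the region boundary into two iovec
entries, translating each part separately.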

### SPDK optimizations {#vhost_spdk_optimizations}

Due to its poll-mode nature, SPDK vhost removes the requirement for I/O submission
notifications, drastically increasing the vhost server throughput and decreasing
the guest overhead of submitting an I/O. A couple of different solutions exist
to mitigate the I/O completion interrupt overhead (irqfd, vDPA), but those won't
be discussed in this document. For the highest performance, a poll-mode @ref virtio
can be used, as it suppresses all I/O completion interrupts, allowing the I/O
path to fully bypass the QEMU/KVM overhead.