# Vhost processing {#vhost_processing}

# Table of Contents {#vhost_processing_toc}

- @ref vhost_processing_intro
- @ref vhost_processing_qemu
- @ref vhost_processing_init
- @ref vhost_processing_io_path

# Introduction {#vhost_processing_intro}

This document is intended to provide a high-level insight into how
Vhost works behind the scenes. It assumes you're already familiar with the
basics of virtqueues and vrings from the
[VIRTIO protocol](http://docs.oasis-open.org/virtio/virtio/v1.0/virtio-v1.0.html).
Code snippets used in this document might have been simplified for the sake
of readability and should not be used as an API or implementation reference.

vhost is a protocol for devices accessible via inter-process communication.
It uses the same virtqueue and vring layout for I/O transport as VIRTIO,
which allows direct mapping to Virtio devices. The initial vhost
implementation is part of the Linux kernel and uses an ioctl interface to
communicate with userspace applications. What makes it possible for SPDK to
expose a vhost device is the Vhost-user protocol.

The [Vhost-user specification](https://git.qemu.org/?p=qemu.git;a=blob_plain;f=docs/interop/vhost-user.txt;hb=HEAD)
describes the protocol as follows:

```
[Vhost-user protocol] is aiming to complement the ioctl interface used to
control the vhost implementation in the Linux kernel. It implements the control
plane needed to establish virtqueue sharing with a user space process on the
same host. It uses communication over a Unix domain socket to share file
descriptors in the ancillary data of the message.

The protocol defines 2 sides of the communication, master and slave. Master is
the application that shares its virtqueues, in our case QEMU. Slave is the
consumer of the virtqueues.

In the current implementation QEMU is the Master, and the Slave is intended to
be a software Ethernet switch running in user space, such as Snabbswitch.

Master and slave can be either a client (i.e. connecting) or server (listening)
in the socket communication.
```

SPDK vhost is a Vhost-user slave server. It exposes Unix domain sockets and
allows external applications to connect.
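
As an illustration of the transport (not SPDK code), here is a minimal sketch
of how a Vhost-user slave might create such a listening socket; the socket
path is an arbitrary example:

```
#include <stdio.h>
#include <string.h>
#include <unistd.h>
#include <sys/socket.h>
#include <sys/un.h>

int main(void)
{
        int fd = socket(AF_UNIX, SOCK_STREAM, 0);
        struct sockaddr_un addr = { .sun_family = AF_UNIX };

        /* Example path only; SPDK builds the path from its configuration. */
        strncpy(addr.sun_path, "/var/tmp/vhost.0", sizeof(addr.sun_path) - 1);
        unlink(addr.sun_path);

        if (fd < 0 ||
            bind(fd, (struct sockaddr *)&addr, sizeof(addr)) != 0 ||
            listen(fd, 1) != 0) {
                perror("vhost socket");
                return 1;
        }

        /* Each accepted connection carries Vhost-user messages. */
        int conn = accept(fd, NULL, NULL);
        /* ... Vhost-user message loop would go here ... */
        close(conn);
        close(fd);
        return 0;
}
```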

# QEMU {#vhost_processing_qemu}

One of the major Vhost-user use cases is networking (DPDK) or storage (SPDK)
offload in QEMU. The following diagram presents how a QEMU-based VM
communicates with an SPDK Vhost-SCSI device.

![QEMU/SPDK vhost data flow](img/qemu_vhost_data_flow.svg)

The irqfd mechanism isn't described in this document, as it is KVM/QEMU-specific.
Briefly speaking, doing an eventfd_write on the callfd descriptor will
directly interrupt the guest because of irqfd.
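
For reference, a minimal sketch of signalling a completion from the slave
side, assuming `callfd` is the eventfd received in the
VHOST_USER_SET_VRING_CALL message for the given virtqueue:

```
#include <sys/eventfd.h>

/* Wake the guest after completing I/O on a virtqueue. KVM's irqfd
 * turns this write into a guest interrupt.
 */
static void signal_guest(int callfd)
{
        eventfd_write(callfd, 1);
}
```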

# Device initialization {#vhost_processing_init}

All initialization and management information is exchanged via Vhost-user
messages. The connection always starts with feature negotiation. Both
the Master and the Slave expose a list of their implemented features. Most
of these features are implementation-related, but some concern e.g. multiqueue
support or live migration. A feature will be used only if both sides support
it.
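
A minimal sketch of this negotiation from the slave side; the feature mask is
a hypothetical example (the real bits are defined by the VIRTIO and Vhost-user
specifications):

```
#include <stdint.h>

/* Hypothetical example mask; actual feature bits come from the
 * VIRTIO and Vhost-user specifications.
 */
#define SLAVE_SUPPORTED_FEATURES ((1ULL << 0) | (1ULL << 2))

/* GET_FEATURES: report what this slave implements. */
uint64_t handle_get_features(void)
{
        return SLAVE_SUPPORTED_FEATURES;
}

/* SET_FEATURES: the master sends back the subset it also supports.
 * Only features present on both sides may be used.
 */
void handle_set_features(uint64_t negotiated)
{
        uint64_t usable = negotiated & SLAVE_SUPPORTED_FEATURES;
        /* ... enable the corresponding code paths based on `usable` ... */
        (void)usable;
}
```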

After the negotiation, the Vhost-user driver shares its memory, so that the vhost
device (SPDK) can access it directly. The memory can be fragmented into multiple
physically-discontiguous regions, although the Vhost-user specification enforces
a limit on their number (currently 8). The driver sends a single message with
the following data for each region (a sketch of such a region descriptor
follows the list):
 * file descriptor - for mmap
 * user address - for memory translations in Vhost-user messages (e.g.
   translating vring addresses)
 * guest address - for buffer address translations in vrings (for QEMU this
   is a physical address inside the guest)
 * user offset - positive offset for the mmap
 * size
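
A minimal sketch of how the slave side could record and map one such region;
the struct and helper are illustrative, not SPDK's actual types:

```
#include <stdint.h>
#include <sys/mman.h>

/* Illustrative per-region bookkeeping; fields mirror the list above. */
struct mem_region {
        int      fd;          /* file descriptor for mmap */
        uint64_t user_addr;   /* driver (QEMU) virtual address */
        uint64_t guest_addr;  /* guest-physical base address */
        uint64_t user_offset; /* positive offset into the mapping */
        uint64_t size;
        void    *mmap_base;   /* where this process mapped the region */
};

/* Map the region into the vhost process. A real implementation must
 * also account for page alignment of the offset.
 */
static int map_region(struct mem_region *r)
{
        r->mmap_base = mmap(NULL, r->size, PROT_READ | PROT_WRITE,
                            MAP_SHARED, r->fd, (off_t)r->user_offset);
        return r->mmap_base == MAP_FAILED ? -1 : 0;
}
```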

The Master will send new memory regions after each memory change - usually
hotplug/hotremove. The previous mappings will be removed.

Drivers may also request a device config, consisting of e.g. disk geometry.
Vhost-SCSI drivers, however, don't need to implement this functionality,
as they use common SCSI I/O to query the underlying disk(s).

Afterwards, the driver requests the maximum number of supported queues and
starts sending virtqueue data, which consists of the following (a sketch of
such per-queue state follows the list):
 * unique virtqueue id
 * index of the last processed vring descriptor
 * vring addresses (from user address space)
 * call descriptor (for interrupting the driver after I/O completions)
 * kick descriptor (to listen for I/O requests - unused by SPDK)
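
A minimal sketch of the per-virtqueue state a slave might assemble from these
messages; the struct is illustrative:

```
#include <stdint.h>

/* Illustrative state built up from the per-queue Vhost-user messages.
 * The vring structs are the standard VIRTIO ring layouts (virtq_desc
 * is shown later in this document).
 */
struct vq_state {
        uint16_t id;             /* unique virtqueue id */
        uint16_t last_used_idx;  /* index of the last processed descriptor */

        /* Vring addresses, translated from the driver's user address
         * space into this process's address space.
         */
        struct virtq_desc  *desc;
        struct virtq_avail *avail;
        struct virtq_used  *used;

        int callfd;  /* for interrupting the driver after I/O completions */
        int kickfd;  /* for I/O notifications - unused by SPDK, which polls */
};
```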

If the multiqueue feature has been negotiated, the driver has to send a specific
*ENABLE* message for each extra queue it wants to be polled. Other queues are
polled as soon as they're initialized.

# I/O path {#vhost_processing_io_path}

The Master sends I/O by allocating proper buffers in shared memory, filling
them with request data, and putting guest addresses of those buffers into
virtqueues.

A Virtio-Block request looks as follows.

```
struct virtio_blk_req {
        uint32_t type; // READ, WRITE, FLUSH (read-only)
        uint64_t offset; // offset in the disk (read-only)
        struct iovec buffers[]; // scatter-gather list (read/write)
        uint8_t status; // I/O completion status (write-only)
};
```

And a Virtio-SCSI request as follows.

```
struct virtio_scsi_req_cmd {
  struct virtio_scsi_cmd_req *req; // request data (read-only)
  struct iovec read_only_buffers[]; // scatter-gather list for WRITE I/Os
  struct virtio_scsi_cmd_resp *resp; // response data (write-only)
  struct iovec write_only_buffers[]; // scatter-gather list for READ I/Os
};
```

A virtqueue generally consists of an array of descriptors, and each I/O needs
to be converted into a chain of such descriptors. A descriptor can be
either readable or writable, so each I/O request must consist of at least two
descriptors (request + response).

```
struct virtq_desc {
        /* Address (guest-physical). */
        le64 addr;
        /* Length. */
        le32 len;

/* This marks a buffer as continuing via the next field. */
#define VIRTQ_DESC_F_NEXT   1
/* This marks a buffer as device write-only (otherwise device read-only). */
#define VIRTQ_DESC_F_WRITE     2
        /* The flags as indicated above. */
        le16 flags;
        /* Next field if flags & NEXT */
        le16 next;
};
```
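
A minimal sketch of walking one such descriptor chain, assuming a hypothetical
`gpa_to_vva()` translation helper like the one sketched at the end of this
document:

```
#include <stdint.h>

/* Assume a little-endian host, so the leNN types map to plain integers. */
typedef uint64_t le64;
typedef uint32_t le32;
typedef uint16_t le16;

#define VIRTQ_DESC_F_NEXT   1
#define VIRTQ_DESC_F_WRITE  2

struct virtq_desc {
        le64 addr;
        le32 len;
        le16 flags;
        le16 next;
};

void *gpa_to_vva(uint64_t gpa); /* hypothetical; sketched below */

/* Visit every buffer in the chain starting at descriptor `head`. */
static void walk_chain(struct virtq_desc *table, uint16_t head)
{
        uint16_t i = head;

        for (;;) {
                void *buf = gpa_to_vva(table[i].addr);
                int device_writable = table[i].flags & VIRTQ_DESC_F_WRITE;

                /* ... gather (buf, table[i].len, device_writable)
                 * into a request struct ...
                 */
                (void)buf; (void)device_writable;

                if (!(table[i].flags & VIRTQ_DESC_F_NEXT))
                        break;
                i = table[i].next;
        }
}
```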

After polling this descriptor chain, the device needs to translate and
transform it back into the original request struct. It needs to know the
request layout up-front, so each device backend (Vhost-Block/SCSI) has its own
implementation for polling virtqueues. For each descriptor, the device performs
a lookup in the Vhost-user memory region table and goes through a gpa_to_vva
translation (guest physical address to vhost virtual address). SPDK requires
the request and response data to be contained within a single memory region.
I/O buffers do not have such limitations, and SPDK may automatically perform
additional iovec splitting and gpa_to_vva translations if required. After
forming request structs, SPDK forwards such I/O to the underlying drive and
polls for the completion. Once I/O completes, SPDK vhost fills the response
buffer with proper data and interrupts the guest by doing an eventfd_write on
the call descriptor of the proper virtqueue. There are multiple interrupt
coalescing features involved, but they won't be discussed in this document.
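
A minimal sketch of the gpa_to_vva lookup itself, reusing the illustrative
`struct mem_region` from the initialization section (fields trimmed to those
needed here):

```
#include <stdint.h>
#include <stddef.h>

struct mem_region {
        uint64_t guest_addr;  /* guest-physical base address */
        uint64_t size;
        void    *mmap_base;   /* where this process mapped the region */
};

extern struct mem_region regions[];
extern int nregions;

/* Translate a guest-physical address into a pointer valid in the
 * vhost process, or NULL if it is outside every shared region.
 */
void *gpa_to_vva(uint64_t gpa)
{
        for (int i = 0; i < nregions; i++) {
                if (gpa >= regions[i].guest_addr &&
                    gpa < regions[i].guest_addr + regions[i].size) {
                        return (uint8_t *)regions[i].mmap_base +
                               (gpa - regions[i].guest_addr);
                }
        }
        return NULL;
}
```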
167