..  SPDX-License-Identifier: BSD-3-Clause
    Copyright(c) 2010-2016 Intel Corporation.

Vhost Library
=============

The vhost library implements a user space virtio net server allowing the user
to manipulate the virtio ring directly. In other words, it allows the user
to fetch/put packets from/to the VM virtio net device. To achieve this, a
vhost library should be able to:

* Access the guest memory:

  For QEMU, this is done by using the ``-object memory-backend-file,share=on,...``
  option, which means QEMU will create a file to serve as the guest RAM.
  The ``share=on`` option allows another process to map that file, which
  means it can access the guest RAM.

* Know all the necessary information about the vring:

  Information such as where the available ring is stored. Vhost defines some
  messages (passed through a Unix domain socket file) to tell the backend all
  the information it needs to know how to manipulate the vring.
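
The effect of ``share=on`` can be pictured with a small, QEMU-independent
sketch: two ``MAP_SHARED`` mappings of one file see each other's writes, which
is what lets a vhost backend read guest RAM. This is only an illustration using
POSIX calls; the helper name ``shared_mapping_visible`` and the temporary file
name are made up for the example and are not part of DPDK or QEMU.

.. code-block:: c

    #define _GNU_SOURCE
    #include <assert.h>
    #include <stdlib.h>
    #include <string.h>
    #include <sys/mman.h>
    #include <unistd.h>

    /*
     * Hypothetical helper: map the same file twice with MAP_SHARED and check
     * that a write through the first mapping is visible through the second.
     * Returns 1 if visible, 0 if not, -1 on error.
     */
    int
    shared_mapping_visible(size_t len)
    {
        char path[] = "/tmp/guest-ram-XXXXXX"; /* illustrative name only */
        int ret = -1;
        int fd;

        assert(len >= 4);
        fd = mkstemp(path);
        if (fd < 0)
            return -1;
        unlink(path); /* the open descriptor keeps the file alive */

        if (ftruncate(fd, (off_t)len) == 0) {
            /* "QEMU side": the process that owns the guest RAM file. */
            char *owner = mmap(NULL, len, PROT_READ | PROT_WRITE,
                               MAP_SHARED, fd, 0);
            /* "vhost side": a second, independent mapping of the file. */
            char *peer = mmap(NULL, len, PROT_READ | PROT_WRITE,
                              MAP_SHARED, fd, 0);

            if (owner != MAP_FAILED && peer != MAP_FAILED) {
                memcpy(owner, "pkt", 4);
                ret = (memcmp(peer, "pkt", 4) == 0);
            }
            if (owner != MAP_FAILED)
                munmap(owner, len);
            if (peer != MAP_FAILED)
                munmap(peer, len);
        }
        close(fd);
        return ret;
    }

With a private mapping (``MAP_PRIVATE``) on the owner side, the write would
stay in a private copy and would not reach the second mapping, which is the
situation ``share=on`` avoids.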


Vhost API Overview
------------------

The following is an overview of some key Vhost API functions:

* ``rte_vhost_driver_register(path, flags)``

  This function registers a vhost driver into the system. ``path`` specifies
  the Unix domain socket file path.

  Currently supported flags are:

  - ``RTE_VHOST_USER_CLIENT``

    DPDK vhost-user will act as the client when this flag is given. See below
    for an explanation.

  - ``RTE_VHOST_USER_NO_RECONNECT``

    When DPDK vhost-user acts as the client it will keep trying to reconnect
    to the server (QEMU) until it succeeds. This is useful in two cases:

    * When QEMU is not started yet.
    * When QEMU restarts (for example due to a guest OS reboot).

    This reconnect option is enabled by default. However, it can be turned off
    by setting this flag.

  - ``RTE_VHOST_USER_DEQUEUE_ZERO_COPY``

    Dequeue zero copy will be enabled when this flag is set. It is disabled by
    default.

    There are some facts (including limitations) you might want to know when
    setting this flag:

    * Zero copy is not good for small packets (typically for packet sizes
      below 512 bytes).

    * Zero copy is really good for the VM2VM case. For iperf between two VMs,
      the boost could be above 70% (when TSO is enabled).

    * For zero copy in the VM2NIC case, the guest Tx used vring may be starved
      if the PMD driver consumes the mbufs but does not release them in a
      timely manner.

      For example, the i40e driver has an optimization to maximize NIC
      pipelining which postpones returning transmitted mbufs until only
      tx_free_threshold free descriptors are left. The virtio Tx used ring
      will be starved if the condition
      (num_i40e_tx_desc - num_virtio_tx_desc > tx_free_threshold) is true,
      since i40e will not return the mbufs.

      A performance tip for tuning zero copy in the VM2NIC case is to adjust
      the frequency of mbuf free (i.e. adjust tx_free_threshold of the i40e
      driver) to balance consumer and producer.

    * Guest memory should be backed with huge pages to achieve better
      performance. Using a 1G page size is best.

      When dequeue zero copy is enabled, the mapping between guest physical
      addresses and host physical addresses has to be established. Using
      non-huge pages means far more page segments. To keep it simple, DPDK
      vhost does a linear search of those segments, thus the fewer the
      segments, the quicker we will get the mapping. NOTE: this may be sped
      up by using tree searching in the future.

    * Zero copy currently cannot work when using vfio-pci in IOMMU mode,
      because the IOMMU DMA mapping for guest memory is not set up. If you
      have to use the vfio-pci driver, please insert the vfio-pci kernel
      module in noiommu mode.

    * The consumer of zero copy mbufs should consume these mbufs as soon as
      possible, otherwise it may block the operations in vhost.

  - ``RTE_VHOST_USER_IOMMU_SUPPORT``

    IOMMU support will be enabled when this flag is set. It is disabled by
    default.

    Enabling this flag makes it possible to use the guest vIOMMU to protect
    vhost from accessing memory the virtio device isn't allowed to, when the
    feature is negotiated and an IOMMU device is declared.

  - ``RTE_VHOST_USER_POSTCOPY_SUPPORT``

    Postcopy live-migration support will be enabled when this flag is set.
    It is disabled by default.

    Enabling this flag should only be done when the calling application does
    not pre-fault the guest shared memory, otherwise migration would fail.

  - ``RTE_VHOST_USER_LINEARBUF_SUPPORT``

    Enabling this flag forces the vhost dequeue function to only provide
    linear pktmbufs (no multi-segmented pktmbufs).

    The vhost library by default provides a single pktmbuf for a given
    packet, but if for some reason the data doesn't fit into a single
    pktmbuf (e.g., TSO is enabled), the library will allocate additional
    pktmbufs from the same mempool and chain them together to create a
    multi-segmented pktmbuf.

    However, the vhost application needs to support the multi-segmented
    format. If the vhost application does not support that format and
    requires large buffers to be dequeued, this flag should be enabled to
    force only linear buffers (see ``RTE_VHOST_USER_EXTBUF_SUPPORT``) or drop
    the packet.

    It is disabled by default.

  - ``RTE_VHOST_USER_EXTBUF_SUPPORT``

    Enabling this flag allows the vhost dequeue function to allocate and
    attach an external buffer to a pktmbuf if the pktmbuf doesn't provide
    enough space to store all data.

    This is useful when the vhost application wants to support large packets
    but doesn't want to increase the default mempool object size nor to
    support multi-segmented mbufs (non-linear). In this case, a fresh buffer
    is allocated using ``rte_malloc()`` which gets attached to a pktmbuf
    using ``rte_pktmbuf_attach_extbuf()``.

    See ``RTE_VHOST_USER_LINEARBUF_SUPPORT`` as well to disable
    multi-segmented mbufs for applications that don't support chained mbufs.

    It is disabled by default.

* ``rte_vhost_driver_set_features(path, features)``

  This function sets the feature bits the vhost-user driver supports. The
  vhost-user driver could be vhost-user net, yet it could be something else,
  say, vhost-user SCSI.

* ``rte_vhost_driver_callback_register(path, vhost_device_ops)``

  This function registers a set of callbacks, to let DPDK applications take
  the appropriate action when some events happen. The following events are
  currently supported:

  * ``new_device(int vid)``

    This callback is invoked when a virtio device becomes ready. ``vid``
    is the vhost device ID.

  * ``destroy_device(int vid)``

    This callback is invoked when a virtio device is paused or shut down.

  * ``vring_state_changed(int vid, uint16_t queue_id, int enable)``

    This callback is invoked when a specific queue's state is changed, for
    example to enabled or disabled.

  * ``features_changed(int vid, uint64_t features)``

    This callback is invoked when the features are changed. For example,
    ``VHOST_F_LOG_ALL`` will be set/cleared at the start/end of live
    migration, respectively.

  * ``new_connection(int vid)``

    This callback is invoked on a new vhost-user socket connection. If DPDK
    acts as the server, the device should not be deleted before the
    ``destroy_connection`` callback is received.

  * ``destroy_connection(int vid)``

    This callback is invoked when the vhost-user socket connection is closed.
    It indicates that the device with ID ``vid`` is no longer in use and can
    be safely deleted.

* ``rte_vhost_driver_disable/enable_features(path, features)``

  These functions disable/enable some features. For example, they can be used
  to disable the mergeable buffers and TSO features, which are both enabled
  by default.

* ``rte_vhost_driver_start(path)``

  This function triggers the vhost-user negotiation. It should be invoked at
  the end of initializing a vhost-user driver.

* ``rte_vhost_enqueue_burst(vid, queue_id, pkts, count)``

  Transmits (enqueues) ``count`` packets from host to guest.

* ``rte_vhost_dequeue_burst(vid, queue_id, mbuf_pool, pkts, count)``

  Receives (dequeues) ``count`` packets from guest, and stores them at ``pkts``.

* ``rte_vhost_crypto_create(vid, cryptodev_id, sess_mempool, socket_id)``

  As an extension of ``new_device()``, this function adds virtio-crypto
  workload acceleration capability to the device. All crypto workload is
  processed by the DPDK cryptodev with the device ID of ``cryptodev_id``.

* ``rte_vhost_crypto_free(vid)``

  Frees the memory and vhost-user message handlers created in
  ``rte_vhost_crypto_create()``.

* ``rte_vhost_crypto_fetch_requests(vid, queue_id, ops, nb_ops)``

  Receives (dequeues) ``nb_ops`` virtio-crypto requests from guest, parses
  them to DPDK Crypto Operations, and fills the ``ops`` with parsing results.

* ``rte_vhost_crypto_finalize_requests(queue_id, ops, nb_ops)``

  After the ``ops`` are dequeued from Cryptodev, finalizes the jobs and
  notifies the guest(s).

* ``rte_vhost_crypto_set_zero_copy(vid, option)``

  Enables or disables the zero copy feature of the vhost crypto backend.
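
The driver-level calls above are typically issued in a fixed order: register
the socket, adjust features, register callbacks, then start the driver. The
sketch below shows that order only. So that it builds without DPDK, it defines
local stand-ins for the four library calls and for ``struct vhost_device_ops``;
their parameter types are assumptions to be checked against ``rte_vhost.h``,
and ``setup_vhost_port`` and the callback names are hypothetical.

.. code-block:: c

    #include <assert.h>
    #include <stdint.h>
    #include <string.h>
    #include <linux/virtio_net.h> /* VIRTIO_NET_F_MRG_RXBUF */

    /*
     * Stand-ins so the sketch builds without DPDK. They only record the
     * call order. A real application includes <rte_vhost.h> instead and
     * deletes this part; the parameter types here are assumptions.
     */
    struct vhost_device_ops {
        int (*new_device)(int vid);
        void (*destroy_device)(int vid);
    };

    char call_order[32]; /* one letter appended per stubbed call */

    int rte_vhost_driver_register(const char *path, uint64_t flags)
    {
        (void)path; (void)flags;
        strcat(call_order, "R");
        return 0;
    }

    int rte_vhost_driver_disable_features(const char *path, uint64_t features)
    {
        (void)path; (void)features;
        strcat(call_order, "F");
        return 0;
    }

    int rte_vhost_driver_callback_register(const char *path,
            const struct vhost_device_ops *ops)
    {
        (void)path; (void)ops;
        strcat(call_order, "C");
        return 0;
    }

    int rte_vhost_driver_start(const char *path)
    {
        (void)path;
        strcat(call_order, "S");
        return 0;
    }

    /* ---- application code: the part the sketch is about ---- */

    static int on_new_device(int vid) { (void)vid; return 0; }
    static void on_destroy_device(int vid) { (void)vid; }

    static const struct vhost_device_ops ops = {
        .new_device = on_new_device,
        .destroy_device = on_destroy_device,
    };

    /* Hypothetical helper: bring up one vhost-user port in server mode. */
    int setup_vhost_port(const char *path)
    {
        assert(path != NULL);
        /* flags == 0: default server mode, no optional behaviour. */
        if (rte_vhost_driver_register(path, 0) != 0)
            return -1;
        /* Optional: turn off a feature that is on by default. */
        if (rte_vhost_driver_disable_features(path,
                1ULL << VIRTIO_NET_F_MRG_RXBUF) != 0)
            return -1;
        if (rte_vhost_driver_callback_register(path, &ops) != 0)
            return -1;
        /* Start last: this triggers the vhost-user negotiation. */
        return rte_vhost_driver_start(path);
    }

In a real application the stand-ins disappear and the same four calls come
from the library; packets would then be moved with
``rte_vhost_enqueue_burst()`` and ``rte_vhost_dequeue_burst()`` once
``new_device`` has been invoked.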

Vhost-user Implementations
--------------------------

Vhost-user uses Unix domain sockets for passing messages. This means the DPDK
vhost-user implementation has two options:

* DPDK vhost-user acts as the server.

  DPDK will create a Unix domain socket server file and listen for
  connections from the frontend.

  Note, this is the default mode, and the only mode before DPDK v16.07.


* DPDK vhost-user acts as the client.

  Unlike the server mode, this mode doesn't create the socket file;
  it just tries to connect to the server (which is responsible for creating
  the file instead).

  When the DPDK vhost-user application restarts, DPDK vhost-user will try to
  connect to the server again. This is how the "reconnect" feature works.

  .. Note::
     * The "reconnect" feature requires **QEMU v2.7** (or above).

     * The vhost supported features must be exactly the same before and
       after the restart. For example, if TSO is disabled and then enabled,
       nothing will work and undefined issues might happen.
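
The client-mode behaviour can be approximated with plain Unix domain sockets
as follows. This is a simplified model and not the library's code:
``connect_once`` and ``connect_with_retry`` are hypothetical helpers, and the
bounded retry count and fixed delay are assumptions made so that the example
terminates, whereas DPDK keeps retrying until it succeeds unless
``RTE_VHOST_USER_NO_RECONNECT`` is set.

.. code-block:: c

    #define _GNU_SOURCE
    #include <assert.h>
    #include <string.h>
    #include <sys/socket.h>
    #include <sys/un.h>
    #include <time.h>
    #include <unistd.h>

    /* Try once to connect to the Unix domain socket at path; -1 on failure. */
    int
    connect_once(const char *path)
    {
        struct sockaddr_un addr;
        int fd;

        assert(path != NULL);
        if (strlen(path) >= sizeof(addr.sun_path))
            return -1;
        fd = socket(AF_UNIX, SOCK_STREAM, 0);
        if (fd < 0)
            return -1;
        memset(&addr, 0, sizeof(addr));
        addr.sun_family = AF_UNIX;
        strcpy(addr.sun_path, path);
        if (connect(fd, (struct sockaddr *)&addr, sizeof(addr)) < 0) {
            /* Typically: the server has not created the socket file yet. */
            close(fd);
            return -1;
        }
        return fd;
    }

    /*
     * Hypothetical client loop: retry up to max_tries times, sleeping
     * delay_ms milliseconds after each failed attempt. Returns a connected
     * descriptor, or -1 if every attempt failed.
     */
    int
    connect_with_retry(const char *path, int max_tries, long delay_ms)
    {
        struct timespec ts = {
            .tv_sec = delay_ms / 1000,
            .tv_nsec = (delay_ms % 1000) * 1000000L,
        };
        int i;

        for (i = 0; i < max_tries; i++) {
            int fd = connect_once(path);

            if (fd >= 0)
                return fd;
            nanosleep(&ts, NULL);
        }
        return -1;
    }

The same loop covers both cases listed for the reconnect option: the server
not being started yet, and the server going away and coming back.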

No matter which mode is used, once a connection is established, DPDK
vhost-user will start receiving and processing vhost messages from QEMU.

For messages with a file descriptor, the file descriptor can be used directly
in the vhost process as it is already installed by the Unix domain socket.

The supported vhost messages are:

* ``VHOST_SET_MEM_TABLE``
* ``VHOST_SET_VRING_KICK``
* ``VHOST_SET_VRING_CALL``
* ``VHOST_SET_LOG_FD``
* ``VHOST_SET_VRING_ERR``

For the ``VHOST_SET_MEM_TABLE`` message, QEMU will send information for each
memory region and its file descriptor in the ancillary data of the message.
The file descriptor is used to map that region.
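
How a descriptor arrives "already installed" can be seen with ``SCM_RIGHTS``
ancillary data, the standard Unix domain socket mechanism for passing file
descriptors between processes. The sketch below is generic POSIX code and not
the vhost-user wire format: ``send_fd`` and ``recv_fd`` are hypothetical
helpers, and the one-byte payload is a placeholder for a real message.

.. code-block:: c

    #define _GNU_SOURCE
    #include <assert.h>
    #include <string.h>
    #include <sys/socket.h>
    #include <sys/uio.h>
    #include <unistd.h>

    /* Control buffer sized and aligned for exactly one descriptor. */
    union fd_ctrl {
        char buf[CMSG_SPACE(sizeof(int))];
        struct cmsghdr align;
    };

    /* Send one payload byte with fd attached as SCM_RIGHTS ancillary data. */
    int
    send_fd(int sock, int fd)
    {
        char payload = 'M'; /* placeholder for a real message */
        struct iovec iov = { .iov_base = &payload, .iov_len = 1 };
        union fd_ctrl ctrl;
        struct msghdr msg;
        struct cmsghdr *cmsg;

        assert(sock >= 0 && fd >= 0);
        memset(&msg, 0, sizeof(msg));
        memset(&ctrl, 0, sizeof(ctrl));
        msg.msg_iov = &iov;
        msg.msg_iovlen = 1;
        msg.msg_control = ctrl.buf;
        msg.msg_controllen = sizeof(ctrl.buf);

        cmsg = CMSG_FIRSTHDR(&msg);
        cmsg->cmsg_level = SOL_SOCKET;
        cmsg->cmsg_type = SCM_RIGHTS;
        cmsg->cmsg_len = CMSG_LEN(sizeof(int));
        memcpy(CMSG_DATA(cmsg), &fd, sizeof(int));

        return sendmsg(sock, &msg, 0) == 1 ? 0 : -1;
    }

    /* Receive the payload; return the descriptor the kernel installed, or -1. */
    int
    recv_fd(int sock)
    {
        char payload;
        struct iovec iov = { .iov_base = &payload, .iov_len = 1 };
        union fd_ctrl ctrl;
        struct msghdr msg;
        struct cmsghdr *cmsg;
        int fd = -1;

        memset(&msg, 0, sizeof(msg));
        memset(&ctrl, 0, sizeof(ctrl));
        msg.msg_iov = &iov;
        msg.msg_iovlen = 1;
        msg.msg_control = ctrl.buf;
        msg.msg_controllen = sizeof(ctrl.buf);

        if (recvmsg(sock, &msg, 0) != 1)
            return -1;
        cmsg = CMSG_FIRSTHDR(&msg);
        if (cmsg != NULL && cmsg->cmsg_level == SOL_SOCKET &&
                cmsg->cmsg_type == SCM_RIGHTS &&
                cmsg->cmsg_len == CMSG_LEN(sizeof(int)))
            memcpy(&fd, CMSG_DATA(cmsg), sizeof(int));
        return fd;
    }

The number returned by ``recv_fd`` is a new descriptor in the receiving
process that refers to the same open file as the sender's. For a memory table
message, the receiver would then ``mmap()`` each such descriptor to reach the
corresponding region.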

``VHOST_SET_VRING_KICK`` is used as the signal to put the vhost device into
the data plane, and ``VHOST_GET_VRING_BASE`` is used as the signal to remove
the vhost device from the data plane.

When the socket connection is closed, vhost will destroy the device.

Guest memory requirement
------------------------

* Memory pre-allocation

  For non-zerocopy, guest memory pre-allocation is not a must. This can help
  save memory. If users really want the guest memory to be pre-allocated
  (e.g., for performance reasons), the ``-mem-prealloc`` option can be added
  when starting QEMU. Alternatively, all memory can be locked at the vhost
  side, which forces memory to be allocated when it is mmapped there; the
  ``--mlockall`` option in ovs-dpdk is an example.

  For zerocopy, the VM memory is forced to be pre-allocated by the vhost lib
  when mapping the guest memory; the memory also needs to be locked to
  prevent pages being swapped out to disk.

* Memory sharing

  Make sure the ``share=on`` QEMU option is given. vhost-user will not work
  with a QEMU version without shared memory mapping.
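
The pre-allocation and locking described under *Memory pre-allocation* can be
illustrated on Linux as shown below. This is a sketch of the general technique
and not the vhost library's implementation: ``map_prefaulted`` is a
hypothetical helper, and an anonymous shared mapping stands in for the guest
memory file.

.. code-block:: c

    #define _GNU_SOURCE
    #include <assert.h>
    #include <stddef.h>
    #include <sys/mman.h>

    /*
     * Map len bytes and ask the kernel to fault every page in up front
     * (MAP_POPULATE), then try to pin the pages with mlock() so they
     * cannot be swapped out. *locked reports whether pinning succeeded;
     * it can fail when RLIMIT_MEMLOCK is too low. Returns NULL if the
     * mapping itself fails.
     */
    void *
    map_prefaulted(size_t len, int *locked)
    {
        void *addr;

        assert(len > 0 && locked != NULL);
        *locked = 0;
        addr = mmap(NULL, len, PROT_READ | PROT_WRITE,
                    MAP_SHARED | MAP_ANONYMOUS | MAP_POPULATE, -1, 0);
        if (addr == MAP_FAILED)
            return NULL;
        *locked = (mlock(addr, len) == 0);
        return addr;
    }

Without ``MAP_POPULATE`` the pages would be allocated lazily on first touch,
which corresponds to the memory-saving default described for the non-zerocopy
case.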

Vhost supported vSwitch reference
---------------------------------

For more vhost details and how to support vhost in vSwitch, please refer to
the vhost example in the DPDK Sample Applications Guide.

Vhost data path acceleration (vDPA)
-----------------------------------

vDPA supports selective datapath in the vhost-user lib by enabling virtio ring
compatible devices to serve the virtio driver directly for datapath
acceleration.

``rte_vhost_driver_attach_vdpa_device`` is used to configure the vhost device
with an accelerated backend.

Vhost device capabilities are also made configurable to adapt to various
devices. Such capabilities include supported features, protocol features, and
queue number.

Finally, a set of device ops is defined for device specific operations:

* ``get_queue_num``

  Called to get the supported queue number of the device.

* ``get_features``

  Called to get the supported features of the device.

* ``get_protocol_features``

  Called to get the supported protocol features of the device.

* ``dev_conf``

  Called to configure the actual device when the virtio device becomes ready.

* ``dev_close``

  Called to close the actual device when the virtio device is stopped.

* ``set_vring_state``

  Called to change the state of the vring in the actual device when the vring
  state changes.

* ``set_features``

  Called to set the negotiated features to the device.

* ``migration_done``

  Called to allow the device to respond to RARP sending.

* ``get_vfio_group_fd``

  Called to get the VFIO group fd of the device.

* ``get_vfio_device_fd``

  Called to get the VFIO device fd of the device.

* ``get_notify_area``

  Called to get the notify area info of the queue.
377