..  SPDX-License-Identifier: BSD-3-Clause
    Copyright(c) 2010-2016 Intel Corporation.

Vhost Library
=============

The vhost library implements a user space virtio net server allowing the user
to manipulate the virtio ring directly. In other words, it allows the user
to fetch/put packets from/to the VM virtio net device. To achieve this, a
vhost library should be able to:

* Access the guest memory:

  For QEMU, this is done by using the ``-object memory-backend-file,share=on,...``
  option, which means QEMU will create a file to serve as the guest RAM.
  The ``share=on`` option allows another process to map that file, which
  means it can access the guest RAM.

* Know all the necessary information about the vring:

  Information such as where the available ring is stored. Vhost defines some
  messages (passed through a Unix domain socket file) to tell the backend all
  the information it needs to know how to manipulate the vring.
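
The effect of ``share=on`` can be illustrated without QEMU. The following is a
minimal sketch, not vhost code: it assumes a POSIX system, uses a hypothetical
temporary file under ``/tmp`` in place of the guest RAM file, and the function
name ``shared_mapping_roundtrip`` is made up for illustration. Two
``MAP_SHARED`` mappings of the same file stand in for QEMU and the vhost
backend.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE            /* for mkstemp() under strict C modes */
#endif
#include <stdlib.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

#define REGION_SIZE 4096

/*
 * Map the same file twice with MAP_SHARED and check that a write made
 * through one mapping is visible through the other.
 * Returns 1 if the write was observed, 0 if not, -1 on a system error.
 */
int shared_mapping_roundtrip(void)
{
    char path[] = "/tmp/guest_ram_XXXXXX";  /* hypothetical backing file */
    int fd = mkstemp(path);
    int seen = -1;

    if (fd < 0)
        return -1;
    unlink(path);                           /* file lives until unmapped */
    if (ftruncate(fd, REGION_SIZE) != 0) {
        close(fd);
        return -1;
    }

    /* "owner" plays the role of QEMU, "backend" the vhost process. */
    char *owner = mmap(NULL, REGION_SIZE, PROT_READ | PROT_WRITE,
                       MAP_SHARED, fd, 0);
    char *backend = mmap(NULL, REGION_SIZE, PROT_READ | PROT_WRITE,
                         MAP_SHARED, fd, 0);
    close(fd);                              /* mappings remain valid */

    if (owner != MAP_FAILED && backend != MAP_FAILED) {
        strcpy(owner, "packet");
        seen = (strcmp(backend, "packet") == 0);
    }
    if (owner != MAP_FAILED)
        munmap(owner, REGION_SIZE);
    if (backend != MAP_FAILED)
        munmap(backend, REGION_SIZE);
    return seen;
}
```

With ``MAP_PRIVATE`` instead of ``MAP_SHARED`` the second mapping would not be
guaranteed to see the write, which is why the ``share=on`` option matters.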


Vhost API Overview
------------------

The following is an overview of some key Vhost API functions:

* ``rte_vhost_driver_register(path, flags)``

  This function registers a vhost driver into the system. ``path`` specifies
  the Unix domain socket file path.

  Currently supported flags are:

  - ``RTE_VHOST_USER_CLIENT``

    DPDK vhost-user will act as the client when this flag is given. See below
    for an explanation.

  - ``RTE_VHOST_USER_NO_RECONNECT``

    When DPDK vhost-user acts as the client it will keep trying to reconnect
    to the server (QEMU) until it succeeds. This is useful in two cases:

    * When QEMU is not started yet.
    * When QEMU restarts (for example due to a guest OS reboot).

    This reconnect option is enabled by default. However, it can be turned off
    by setting this flag.

  - ``RTE_VHOST_USER_DEQUEUE_ZERO_COPY``

    Dequeue zero copy will be enabled when this flag is set. It is disabled by
    default.

    There are some facts (including limitations) you might want to know when
    setting this flag:

    * Zero copy is not good for small packets (typically for packet sizes
      below 512 bytes).

    * Zero copy is really good for the VM2VM case. For iperf between two VMs,
      the boost could be above 70% (when TSO is enabled).

    * For zero copy in the VM2NIC case, the guest Tx used vring may be starved
      if the PMD driver consumes the mbufs but does not release them in a
      timely manner.

      For example, the i40e driver has an optimization to maximize NIC
      pipelining which postpones returning transmitted mbufs until only
      ``tx_free_threshold`` free descriptors are left. The virtio Tx used ring
      will be starved if the formula
      ``(num_i40e_tx_desc - num_virtio_tx_desc > tx_free_threshold)`` is true,
      since i40e will not return the mbufs.

      A performance tip for tuning zero copy in the VM2NIC case is to adjust
      the frequency of mbuf free (i.e. adjust ``tx_free_threshold`` of the
      i40e driver) to balance consumer and producer.

    * Guest memory should be backed with huge pages to achieve better
      performance. Using 1G page size is the best.

      When dequeue zero copy is enabled, the guest physical address to host
      physical address mapping has to be established. Using non-huge pages
      means far more page segments. To keep it simple, DPDK vhost does a
      linear search of those segments, thus the fewer the segments, the
      quicker we will get the mapping. NOTE: we may speed it up by using a
      tree search in the future.

    * Zero copy currently cannot work when using vfio-pci in IOMMU mode,
      because the IOMMU DMA mapping for guest memory is not set up. If you
      have to use the vfio-pci driver, please insert the vfio-pci kernel
      module in noiommu mode.

    * The consumer of zero copy mbufs should consume these mbufs as soon as
      possible, otherwise it may block the operations in vhost.

  - ``RTE_VHOST_USER_IOMMU_SUPPORT``

    IOMMU support will be enabled when this flag is set. It is disabled by
    default.

    Enabling this flag makes it possible to use the guest vIOMMU to protect
    vhost from accessing memory the virtio device isn't allowed to, when the
    feature is negotiated and an IOMMU device is declared.

    However, this feature enables vhost-user's reply-ack protocol feature,
    whose implementation is buggy in QEMU v2.7.0-v2.9.0 when doing multiqueue.
    Enabling this flag with these QEMU versions results in QEMU being blocked
    when multiple queue pairs are declared.

  - ``RTE_VHOST_USER_POSTCOPY_SUPPORT``

    Postcopy live-migration support will be enabled when this flag is set.
    It is disabled by default.

    Enabling this flag should only be done when the calling application does
    not pre-fault the guest shared memory, otherwise migration would fail.

* ``rte_vhost_driver_set_features(path, features)``

  This function sets the feature bits the vhost-user driver supports. The
  vhost-user driver could be vhost-user net, yet it could be something else,
  say, vhost-user SCSI.

* ``rte_vhost_driver_callback_register(path, vhost_device_ops)``

  This function registers a set of callbacks, to let DPDK applications take
  the appropriate action when some events happen. The following events are
  currently supported:

  * ``new_device(int vid)``

    This callback is invoked when a virtio device becomes ready. ``vid``
    is the vhost device ID.

  * ``destroy_device(int vid)``

    This callback is invoked when a virtio device is paused or shut down.

  * ``vring_state_changed(int vid, uint16_t queue_id, int enable)``

    This callback is invoked when a specific queue's state is changed, for
    example to enabled or disabled.

  * ``features_changed(int vid, uint64_t features)``

    This callback is invoked when the features are changed. For example,
    ``VHOST_F_LOG_ALL`` will be set/cleared at the start/end of live
    migration, respectively.

  * ``new_connection(int vid)``

    This callback is invoked on a new vhost-user socket connection. If DPDK
    acts as the server, the device should not be deleted before the
    ``destroy_connection`` callback is received.

  * ``destroy_connection(int vid)``

    This callback is invoked when the vhost-user socket connection is closed.
    It indicates that the device with ID ``vid`` is no longer in use and can
    be safely deleted.

* ``rte_vhost_driver_disable/enable_features(path, features)``

  These functions disable/enable some features. For example, they can be used
  to disable the mergeable buffers and TSO features, which are both enabled by
  default.

* ``rte_vhost_driver_start(path)``

  This function triggers the vhost-user negotiation. It should be invoked at
  the end of initializing a vhost-user driver.

* ``rte_vhost_enqueue_burst(vid, queue_id, pkts, count)``

  Transmits (enqueues) ``count`` packets from host to guest.

* ``rte_vhost_dequeue_burst(vid, queue_id, mbuf_pool, pkts, count)``

  Receives (dequeues) ``count`` packets from guest, and stores them at ``pkts``.

* ``rte_vhost_crypto_create(vid, cryptodev_id, sess_mempool, socket_id)``

  As an extension of ``new_device()``, this function adds virtio-crypto
  workload acceleration capability to the device. All crypto workload is
  processed by the DPDK cryptodev with the device ID of ``cryptodev_id``.

* ``rte_vhost_crypto_free(vid)``

  Frees the memory and vhost-user message handlers created in
  ``rte_vhost_crypto_create()``.

* ``rte_vhost_crypto_fetch_requests(vid, queue_id, ops, nb_ops)``

  Receives (dequeues) ``nb_ops`` virtio-crypto requests from guest, parses
  them to DPDK Crypto Operations, and fills the ``ops`` with parsing results.

* ``rte_vhost_crypto_finalize_requests(queue_id, ops, nb_ops)``

  After the ``ops`` are dequeued from Cryptodev, finalizes the jobs and
  notifies the guest(s).

* ``rte_vhost_crypto_set_zero_copy(vid, option)``

  Enables or disables the zero copy feature of the vhost crypto backend.
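
The linear search over page segments mentioned for dequeue zero copy can be
sketched as follows. This is an illustration only, not the library's code:
the ``page_seg`` structure and the ``gpa_to_hpa_linear`` function are
hypothetical names, and the sketch assumes each segment is physically
contiguous on both the guest and the host side.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical descriptor of one physically contiguous page segment. */
struct page_seg {
    uint64_t guest_phys_addr;
    uint64_t host_phys_addr;
    uint64_t size;
};

/*
 * Translate guest physical address `gpa` for an access of `len` bytes.
 * Walks the segments one by one, so the cost grows with `nr_segs`.
 * Returns the host physical address, or 0 when no single segment
 * covers the whole range [gpa, gpa + len).
 */
uint64_t gpa_to_hpa_linear(const struct page_seg *segs, size_t nr_segs,
                           uint64_t gpa, uint64_t len)
{
    for (size_t i = 0; i < nr_segs; i++) {
        const struct page_seg *s = &segs[i];

        if (gpa < s->guest_phys_addr)
            continue;

        uint64_t off = gpa - s->guest_phys_addr;

        /* Written without additions so the checks cannot overflow. */
        if (off < s->size && len <= s->size - off)
            return s->host_phys_addr + off;
    }
    return 0;
}
```

A 4 GB guest backed by 1G pages needs at most four such segments, whereas the
same guest backed by 4K pages could need up to about a million, which is why
the page size has such an effect on the lookup time.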

Vhost-user Implementations
--------------------------

Vhost-user uses Unix domain sockets for passing messages. This means the DPDK
vhost-user implementation has two options:

* DPDK vhost-user acts as the server.

  DPDK will create a Unix domain socket server file and listen for
  connections from the frontend.

  Note, this is the default mode, and the only mode before DPDK v16.07.


* DPDK vhost-user acts as the client.

  Unlike the server mode, this mode doesn't create the socket file;
  it just tries to connect to the server (which is responsible for creating
  the file instead).

  When the DPDK vhost-user application restarts, DPDK vhost-user will try to
  connect to the server again. This is how the "reconnect" feature works.

  .. Note::
     * The "reconnect" feature requires **QEMU v2.7** (or above).

     * The vhost supported features must be exactly the same before and
       after the restart. For example, if TSO is disabled and then enabled,
       nothing will work and undefined issues might happen.
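
The difference between the two modes comes down to which side creates the
socket file. The sketch below shows both sides with plain POSIX sockets. It is
not the library's implementation: ``listen_unix`` and ``connect_with_retry``
are hypothetical helpers, and the bounded retry count is a simplification of
the "keep trying until it succeeds" behaviour described above.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE            /* for nanosleep() under strict C modes */
#endif
#include <string.h>
#include <sys/socket.h>
#include <sys/un.h>
#include <time.h>
#include <unistd.h>

/* Fill `addr` for `path`; returns -1 if the path does not fit. */
static int fill_unix_addr(struct sockaddr_un *addr, const char *path)
{
    if (strlen(path) >= sizeof(addr->sun_path))
        return -1;
    memset(addr, 0, sizeof(*addr));
    addr->sun_family = AF_UNIX;
    strcpy(addr->sun_path, path);
    return 0;
}

/* Server mode: create the socket file at `path` and listen on it. */
int listen_unix(const char *path)
{
    struct sockaddr_un addr;
    int fd;

    if (fill_unix_addr(&addr, path) != 0)
        return -1;
    fd = socket(AF_UNIX, SOCK_STREAM, 0);
    if (fd < 0)
        return -1;
    unlink(path);                       /* drop a stale socket file */
    if (bind(fd, (struct sockaddr *)&addr, sizeof(addr)) != 0 ||
        listen(fd, 1) != 0) {
        close(fd);
        return -1;
    }
    return fd;
}

/*
 * Client mode: never creates the file, only connects to it. Retries up
 * to `max_attempts` times, sleeping `delay_ms` after each failure.
 * Returns a connected fd, or -1 if every attempt failed.
 */
int connect_with_retry(const char *path, int max_attempts, long delay_ms)
{
    struct sockaddr_un addr;

    if (fill_unix_addr(&addr, path) != 0)
        return -1;
    for (int attempt = 0; attempt < max_attempts; attempt++) {
        int fd = socket(AF_UNIX, SOCK_STREAM, 0);

        if (fd < 0)
            return -1;
        if (connect(fd, (struct sockaddr *)&addr, sizeof(addr)) == 0)
            return fd;
        close(fd);                      /* server not available yet */

        struct timespec ts = { delay_ms / 1000,
                               (delay_ms % 1000) * 1000000L };
        nanosleep(&ts, NULL);
    }
    return -1;
}
```

Because the client only ever calls ``connect()``, it does not matter whether
the server was started late or has restarted in the meantime; the next attempt
simply succeeds once the socket file is being listened on again.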

No matter which mode is used, once a connection is established, DPDK
vhost-user will start receiving and processing vhost messages from QEMU.

For messages with a file descriptor, the file descriptor can be used directly
in the vhost process as it is already installed by the Unix domain socket.

The supported vhost messages are:

* ``VHOST_SET_MEM_TABLE``
* ``VHOST_SET_VRING_KICK``
* ``VHOST_SET_VRING_CALL``
* ``VHOST_SET_LOG_FD``
* ``VHOST_SET_VRING_ERR``

For the ``VHOST_SET_MEM_TABLE`` message, QEMU will send information for each
memory region and its file descriptor in the ancillary data of the message.
The file descriptor is used to map that region.
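
Passing a file descriptor in the ancillary data of a Unix domain socket
message uses the standard ``SCM_RIGHTS`` mechanism. The following sketch shows
the mechanism in isolation; it does not reproduce the vhost-user message
layout, the one-byte payload is arbitrary, and ``send_fd`` and ``recv_fd`` are
hypothetical helper names.

```c
#include <string.h>
#include <sys/socket.h>
#include <sys/uio.h>
#include <unistd.h>

/* Control buffer large enough for one fd, aligned for struct cmsghdr. */
union fd_ctrl {
    char buf[CMSG_SPACE(sizeof(int))];
    struct cmsghdr align;
};

/* Send a one-byte payload with `fd_to_send` attached. 0 on success. */
int send_fd(int sock, int fd_to_send)
{
    char payload = 'M';
    struct iovec iov = { &payload, 1 };
    union fd_ctrl ctrl;
    struct msghdr msg;
    struct cmsghdr *cmsg;

    memset(&ctrl, 0, sizeof(ctrl));
    memset(&msg, 0, sizeof(msg));
    msg.msg_iov = &iov;
    msg.msg_iovlen = 1;
    msg.msg_control = ctrl.buf;
    msg.msg_controllen = sizeof(ctrl.buf);

    cmsg = CMSG_FIRSTHDR(&msg);
    cmsg->cmsg_level = SOL_SOCKET;
    cmsg->cmsg_type = SCM_RIGHTS;       /* "the data is a set of fds" */
    cmsg->cmsg_len = CMSG_LEN(sizeof(int));
    memcpy(CMSG_DATA(cmsg), &fd_to_send, sizeof(int));

    return sendmsg(sock, &msg, 0) == 1 ? 0 : -1;
}

/* Receive one message; return the fd it carried, or -1 if none. */
int recv_fd(int sock)
{
    char payload;
    struct iovec iov = { &payload, 1 };
    union fd_ctrl ctrl;
    struct msghdr msg;
    struct cmsghdr *cmsg;
    int fd = -1;

    memset(&msg, 0, sizeof(msg));
    msg.msg_iov = &iov;
    msg.msg_iovlen = 1;
    msg.msg_control = ctrl.buf;
    msg.msg_controllen = sizeof(ctrl.buf);

    if (recvmsg(sock, &msg, 0) <= 0)
        return -1;

    for (cmsg = CMSG_FIRSTHDR(&msg); cmsg != NULL;
         cmsg = CMSG_NXTHDR(&msg, cmsg)) {
        if (cmsg->cmsg_level == SOL_SOCKET &&
            cmsg->cmsg_type == SCM_RIGHTS) {
            /* The kernel has already installed a new fd for us. */
            memcpy(&fd, CMSG_DATA(cmsg), sizeof(int));
            break;
        }
    }
    return fd;
}
```

The descriptor returned by ``recv_fd`` is valid in the receiving process as
soon as ``recvmsg()`` returns, so it can be passed straight to ``mmap()`` to
map a memory region.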

``VHOST_SET_VRING_KICK`` is used as the signal to put the vhost device into
the data plane, and ``VHOST_GET_VRING_BASE`` is used as the signal to remove
the vhost device from the data plane.

When the socket connection is closed, vhost will destroy the device.

Guest memory requirement
------------------------

* Memory pre-allocation

  For non-zerocopy, guest memory pre-allocation is not a must. This can help
  save memory. If users really want the guest memory to be pre-allocated
  (e.g., for performance reasons), we can add the option ``-mem-prealloc``
  when starting QEMU. Or, we can lock all memory at the vhost side, which will
  force memory to be allocated when it is mapped at the vhost side; the option
  ``--mlockall`` in ovs-dpdk is an example at hand.

  For zerocopy, we force the VM memory to be pre-allocated at the vhost lib
  when mapping the guest memory; and also we need to lock the memory to
  prevent pages being swapped out to disk.

* Memory sharing

  Make sure the ``share=on`` QEMU option is given. vhost-user will not work
  with a QEMU version without shared memory mapping.
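
On Linux, pre-allocating and locking a mapping can be done with
``MAP_POPULATE`` and ``mlock()``. The sketch below is one way to obtain both
properties and is not taken from the vhost library: the helper name
``map_prealloc_locked`` is hypothetical, and an anonymous mapping stands in
for the guest memory file.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE            /* for MAP_ANONYMOUS and MAP_POPULATE */
#endif
#include <stddef.h>
#include <sys/mman.h>

/*
 * Map `len` bytes with all pages faulted in up front (MAP_POPULATE is
 * Linux-specific), then try to pin them with mlock() so that they
 * cannot be swapped out.
 * `*locked` is set to 1 if mlock() succeeded and to 0 otherwise, for
 * example when RLIMIT_MEMLOCK is too low.
 * Returns the mapping, or NULL if mmap() failed.
 */
void *map_prealloc_locked(size_t len, int *locked)
{
    void *addr = mmap(NULL, len, PROT_READ | PROT_WRITE,
                      MAP_SHARED | MAP_ANONYMOUS | MAP_POPULATE, -1, 0);

    if (addr == MAP_FAILED)
        return NULL;
    *locked = (mlock(addr, len) == 0);
    return addr;
}
```

Locking is reported separately from mapping because ``mlock()`` is subject to
a per-process limit; a caller can then decide whether an unlocked mapping is
acceptable.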

Vhost supported vSwitch reference
---------------------------------

For more vhost details and how to support vhost in vSwitch, please refer to
the vhost example in the DPDK Sample Applications Guide.

Vhost data path acceleration (vDPA)
-----------------------------------

vDPA supports selective datapath in the vhost-user lib by enabling virtio ring
compatible devices to serve the virtio driver directly for datapath
acceleration.

``rte_vhost_driver_attach_vdpa_device`` is used to configure the vhost device
with an accelerated backend.

Also, vhost device capabilities are made configurable to adapt to various
devices. Such capabilities include supported features, protocol features and
queue number.

Finally, a set of device ops is defined for device specific operations:

* ``get_queue_num``

  Called to get supported queue number of the device.

* ``get_features``

  Called to get supported features of the device.

* ``get_protocol_features``

  Called to get supported protocol features of the device.

* ``dev_conf``

  Called to configure the actual device when the virtio device becomes ready.

* ``dev_close``

  Called to close the actual device when the virtio device is stopped.

* ``set_vring_state``

  Called to change the state of the vring in the actual device when vring state
  changes.

* ``set_features``

  Called to set the negotiated features to the device.

* ``migration_done``

  Called to allow the device to respond to RARP sending.

* ``get_vfio_group_fd``

  Called to get the VFIO group fd of the device.

* ``get_vfio_device_fd``

  Called to get the VFIO device fd of the device.

* ``get_notify_area``

  Called to get the notify area info of the queue.