..  SPDX-License-Identifier: BSD-3-Clause
    Copyright(c) 2010-2016 Intel Corporation.

Vhost Library
=============

The vhost library implements a user space virtio net server allowing the user
to manipulate the virtio ring directly. In other words, it allows the user
to fetch/put packets from/to the VM virtio net device. To achieve this, a
vhost library should be able to:

* Access the guest memory:

  For QEMU, this is done by using the ``-object memory-backend-file,share=on,...``
  option, which means QEMU will create a file to serve as the guest RAM.
  The ``share=on`` option allows another process to map that file, which
  means it can access the guest RAM.

* Know all the necessary information about the vring:

  Information such as where the available ring is stored. Vhost defines some
  messages (passed through a Unix domain socket file) to tell the backend all
  the information it needs to know how to manipulate the vring.


Vhost API Overview
------------------

The following is an overview of some key Vhost API functions:

* ``rte_vhost_driver_register(path, flags)``

  This function registers a vhost driver into the system. ``path`` specifies
  the Unix domain socket file path.

  Currently supported flags are:

  - ``RTE_VHOST_USER_CLIENT``

    DPDK vhost-user will act as the client when this flag is given. See below
    for an explanation.

  - ``RTE_VHOST_USER_NO_RECONNECT``

    When DPDK vhost-user acts as the client it will keep trying to reconnect
    to the server (QEMU) until it succeeds. This is useful in two cases:

    * When QEMU is not started yet.
    * When QEMU restarts (for example due to a guest OS reboot).

    This reconnect option is enabled by default. However, it can be turned off
    by setting this flag.

  - ``RTE_VHOST_USER_DEQUEUE_ZERO_COPY``

    Dequeue zero copy will be enabled when this flag is set. It is disabled by
    default.

    There are some points (including limitations) to be aware of when
    setting this flag:

    * Zero copy is not good for small packets (typically for packet sizes
      below 512 bytes).

    * Zero copy is really good for the VM2VM case. For iperf between two VMs,
      the boost could be above 70% (when TSO is enabled).

    * For zero copy in the VM2NIC case, the guest Tx used vring may be starved
      if the PMD consumes the mbufs but does not release them in a timely
      manner.

      For example, the i40e driver has an optimization to maximize the NIC
      pipeline, which postpones returning transmitted mbufs until only
      ``tx_free_threshold`` free descriptors are left. The virtio Tx used ring
      will be starved if the condition
      ``num_i40e_tx_desc - num_virtio_tx_desc > tx_free_threshold`` is true,
      since i40e will not return the mbufs.

      A performance tip for tuning zero copy in the VM2NIC case is to adjust
      the frequency of mbuf free (i.e. adjust ``tx_free_threshold`` of the
      i40e driver) to balance consumer and producer.

    * Guest memory should be backed with huge pages to achieve better
      performance. A 1G page size works best.

      When dequeue zero copy is enabled, the mapping between guest physical
      addresses and host physical addresses has to be established. Using
      non-huge pages means far more page segments. To keep it simple, DPDK
      vhost does a linear search of those segments, thus the fewer the
      segments, the quicker the mapping is found. Note: this may be sped up
      by using a tree search in the future.

    * Zero copy currently does not work when using vfio-pci in IOMMU mode,
      because the IOMMU DMA mapping for guest memory is not set up. If you
      have to use the vfio-pci driver, insert the vfio-pci kernel module in
      noiommu mode.

  - ``RTE_VHOST_USER_IOMMU_SUPPORT``

    IOMMU support will be enabled when this flag is set. It is disabled by
    default.

    Enabling this flag makes it possible to use the guest vIOMMU to protect
    vhost from accessing memory the virtio device isn't allowed to, when the
    feature is negotiated and an IOMMU device is declared.

    However, this feature enables vhost-user's reply-ack protocol feature,
    whose implementation is buggy in QEMU v2.7.0-v2.9.0 when doing multiqueue.
    Enabling this flag with these QEMU versions results in QEMU being blocked
    when multiple queue pairs are declared.

  - ``RTE_VHOST_USER_POSTCOPY_SUPPORT``

    Postcopy live-migration support will be enabled when this flag is set.
    It is disabled by default.

    Enabling this flag should only be done when the calling application does
    not pre-fault the guest shared memory, otherwise migration would fail.

* ``rte_vhost_driver_set_features(path, features)``

  This function sets the feature bits the vhost-user driver supports. The
  vhost-user driver could be vhost-user net, yet it could be something else,
  say, vhost-user SCSI.

* ``rte_vhost_driver_callback_register(path, vhost_device_ops)``

  This function registers a set of callbacks, to let DPDK applications take
  the appropriate action when some events happen.
  The following events are currently supported:

  * ``new_device(int vid)``

    This callback is invoked when a virtio device becomes ready. ``vid``
    is the vhost device ID.

  * ``destroy_device(int vid)``

    This callback is invoked when a virtio device is paused or shut down.

  * ``vring_state_changed(int vid, uint16_t queue_id, int enable)``

    This callback is invoked when a specific queue's state is changed, for
    example to enabled or disabled.

  * ``features_changed(int vid, uint64_t features)``

    This callback is invoked when the features are changed. For example,
    ``VHOST_F_LOG_ALL`` will be set/cleared at the start/end of live
    migration, respectively.

  * ``new_connection(int vid)``

    This callback is invoked on a new vhost-user socket connection. If DPDK
    acts as the server, the device should not be deleted before the
    ``destroy_connection`` callback is received.

  * ``destroy_connection(int vid)``

    This callback is invoked when the vhost-user socket connection is closed.
    It indicates that the device with id ``vid`` is no longer in use and can
    be safely deleted.

* ``rte_vhost_driver_disable/enable_features(path, features)``

  This function disables/enables some features. For example, it can be used to
  disable mergeable buffers and TSO features, both of which are enabled by
  default.

* ``rte_vhost_driver_start(path)``

  This function triggers the vhost-user negotiation. It should be invoked at
  the end of initializing a vhost-user driver.

* ``rte_vhost_enqueue_burst(vid, queue_id, pkts, count)``

  Transmits (enqueues) ``count`` packets from host to guest.

* ``rte_vhost_dequeue_burst(vid, queue_id, mbuf_pool, pkts, count)``

  Receives (dequeues) ``count`` packets from guest, and stores them at ``pkts``.

* ``rte_vhost_crypto_create(vid, cryptodev_id, sess_mempool, socket_id)``

  As an extension of ``new_device()``, this function adds virtio-crypto
  workload acceleration capability to the device. All crypto workload is
  processed by the DPDK cryptodev with the device ID of ``cryptodev_id``.

* ``rte_vhost_crypto_free(vid)``

  Frees the memory and vhost-user message handlers created in
  ``rte_vhost_crypto_create()``.

* ``rte_vhost_crypto_fetch_requests(vid, queue_id, ops, nb_ops)``

  Receives (dequeues) ``nb_ops`` virtio-crypto requests from guest, parses
  them into DPDK Crypto Operations, and fills ``ops`` with the parsing results.

* ``rte_vhost_crypto_finalize_requests(queue_id, ops, nb_ops)``

  After the ``ops`` are dequeued from Cryptodev, finalizes the jobs and
  notifies the guest(s).

* ``rte_vhost_crypto_set_zero_copy(vid, option)``

  Enables or disables the zero copy feature of the vhost crypto backend.

Vhost-user Implementations
--------------------------

Vhost-user uses Unix domain sockets for passing messages. This means the DPDK
vhost-user implementation has two options:

* DPDK vhost-user acts as the server.

  DPDK will create a Unix domain socket server file and listen for
  connections from the frontend.

  Note, this is the default mode, and the only mode before DPDK v16.07.


* DPDK vhost-user acts as the client.

  Unlike the server mode, this mode doesn't create the socket file;
  it just tries to connect to the server (which is responsible for creating
  the file instead).

  When the DPDK vhost-user application restarts, DPDK vhost-user will try to
  connect to the server again. This is how the "reconnect" feature works.

  .. Note::
     * The "reconnect" feature requires **QEMU v2.7** (or above).

     * The vhost supported features must be exactly the same before and
       after the restart. For example, if TSO is disabled and then enabled,
       nothing will work and undefined issues might happen.

No matter which mode is used, once a connection is established, DPDK
vhost-user will start receiving and processing vhost messages from QEMU.

For messages with a file descriptor, the file descriptor can be used directly
in the vhost process as it is already installed by the Unix domain socket.

The supported vhost messages are:

* ``VHOST_SET_MEM_TABLE``
* ``VHOST_SET_VRING_KICK``
* ``VHOST_SET_VRING_CALL``
* ``VHOST_SET_LOG_FD``
* ``VHOST_SET_VRING_ERR``

For the ``VHOST_SET_MEM_TABLE`` message, QEMU will send information for each
memory region and its file descriptor in the ancillary data of the message.
The file descriptor is used to map that region.
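
The descriptor passing described above can be sketched with plain POSIX calls.
This is not DPDK or QEMU code: ``send_fd``, ``recv_fd``, ``share_region_demo``
and ``REGION_SIZE`` are hypothetical names, a ``tmpfile()`` stands in for the
guest RAM file, and a ``socketpair()`` stands in for the vhost-user socket.
Error paths skip cleanup for brevity.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>
#include <sys/socket.h>
#include <sys/uio.h>
#include <unistd.h>

#define REGION_SIZE 4096  /* hypothetical size of the shared region */

/* Control buffer sized and aligned for exactly one descriptor. */
union fd_ctrl {
    char buf[CMSG_SPACE(sizeof(int))];
    struct cmsghdr align;
};

/* Point `msg` at a one-byte payload and at the control buffer. */
static void prep_msg(struct msghdr *msg, struct iovec *iov, char *payload,
                     union fd_ctrl *ctrl)
{
    memset(msg, 0, sizeof(*msg));
    memset(ctrl, 0, sizeof(*ctrl));
    iov->iov_base = payload;
    iov->iov_len = 1;
    msg->msg_iov = iov;
    msg->msg_iovlen = 1;
    msg->msg_control = ctrl->buf;
    msg->msg_controllen = sizeof(ctrl->buf);
}

/* Send `fd` as SCM_RIGHTS ancillary data. Returns 0 on success, -1 on error. */
static int send_fd(int sock, int fd)
{
    char payload = 'M';
    struct iovec iov;
    union fd_ctrl ctrl;
    struct msghdr msg;
    struct cmsghdr *cmsg;

    prep_msg(&msg, &iov, &payload, &ctrl);
    cmsg = CMSG_FIRSTHDR(&msg);
    cmsg->cmsg_level = SOL_SOCKET;
    cmsg->cmsg_type = SCM_RIGHTS;
    cmsg->cmsg_len = CMSG_LEN(sizeof(int));
    memcpy(CMSG_DATA(cmsg), &fd, sizeof(int));
    return sendmsg(sock, &msg, 0) == 1 ? 0 : -1;
}

/* Receive a descriptor sent by send_fd(). Returns the new fd or -1. */
static int recv_fd(int sock)
{
    char payload;
    struct iovec iov;
    union fd_ctrl ctrl;
    struct msghdr msg;
    struct cmsghdr *cmsg;
    int fd = -1;

    prep_msg(&msg, &iov, &payload, &ctrl);
    if (recvmsg(sock, &msg, 0) != 1)
        return -1;
    cmsg = CMSG_FIRSTHDR(&msg);
    if (cmsg == NULL || cmsg->cmsg_level != SOL_SOCKET ||
        cmsg->cmsg_type != SCM_RIGHTS)
        return -1;
    memcpy(&fd, CMSG_DATA(cmsg), sizeof(int));
    return fd;
}

/*
 * Plays both roles in one process: the "frontend" writes into a file-backed
 * region and sends its fd; the "backend" maps the received fd and copies the
 * text it finds there into `out`. Returns 0 on success, -1 on error.
 */
static int share_region_demo(char *out, size_t out_len)
{
    int sv[2], rfd;
    FILE *backing = tmpfile();
    char *front, *back;

    assert(out != NULL && out_len > 0);
    if (backing == NULL || ftruncate(fileno(backing), REGION_SIZE) != 0 ||
        socketpair(AF_UNIX, SOCK_STREAM, 0, sv) != 0)
        return -1;
    front = mmap(NULL, REGION_SIZE, PROT_READ | PROT_WRITE, MAP_SHARED,
                 fileno(backing), 0);
    if (front == MAP_FAILED)
        return -1;
    strcpy(front, "guest data");

    if (send_fd(sv[0], fileno(backing)) != 0 || (rfd = recv_fd(sv[1])) < 0)
        return -1;
    back = mmap(NULL, REGION_SIZE, PROT_READ, MAP_SHARED, rfd, 0);
    if (back == MAP_FAILED)
        return -1;
    snprintf(out, out_len, "%s", back);

    munmap(back, REGION_SIZE);
    munmap(front, REGION_SIZE);
    close(rfd); close(sv[0]); close(sv[1]);
    fclose(backing);
    return 0;
}
```

Calling ``share_region_demo(buf, sizeof(buf))`` should leave the text written by
the sending side in ``buf``. The received descriptor is a new number in the
receiving process, yet it refers to the same file, which is why it can be
passed to ``mmap()`` directly.
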

``VHOST_SET_VRING_KICK`` is used as the signal to put the vhost device into
the data plane, and ``VHOST_GET_VRING_BASE`` is used as the signal to remove
the vhost device from the data plane.

When the socket connection is closed, vhost will destroy the device.

Guest memory requirement
------------------------

* Memory pre-allocation

  For non-zerocopy, guest memory pre-allocation is not a must. This can help
  save memory. If users really want the guest memory to be pre-allocated
  (e.g., for performance reasons), the option ``-mem-prealloc`` can be added
  when starting QEMU. Alternatively, all memory can be locked at the vhost
  side, which forces memory to be allocated when it is mapped at the vhost
  side; the ``--mlockall`` option in ovs-dpdk is an example of this.

  For zerocopy, the vhost library forces the VM memory to be pre-allocated
  when mapping the guest memory; the memory also needs to be locked to
  prevent pages from being swapped out to disk.

* Memory sharing

  Make sure the ``share=on`` QEMU option is given. vhost-user will not work
  with a QEMU version without shared memory mapping.
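
The memory sharing requirement can be illustrated with two mappings of one
file. The sketch below assumes that ``share=on`` corresponds to a
``MAP_SHARED`` mapping of the backing file and that its absence corresponds to
``MAP_PRIVATE``; ``write_visible_to_peer`` and ``REGION_SIZE`` are hypothetical
names, and a ``tmpfile()`` stands in for the guest RAM file.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <stdio.h>
#include <sys/mman.h>
#include <unistd.h>

#define REGION_SIZE 4096  /* hypothetical size of the backing file */

/*
 * Map the same file twice: once with `writer_flags` (MAP_SHARED or
 * MAP_PRIVATE), standing in for the process that owns the guest RAM, and once
 * with MAP_SHARED, standing in for the vhost backend. Returns 1 if a byte
 * written through the first mapping is visible through the second, 0 if it is
 * not, and -1 on error.
 */
static int write_visible_to_peer(int writer_flags)
{
    FILE *backing = tmpfile();
    unsigned char *writer, *peer;
    int fd, visible;

    if (backing == NULL)
        return -1;
    fd = fileno(backing);
    if (ftruncate(fd, REGION_SIZE) != 0) {
        fclose(backing);
        return -1;
    }

    writer = mmap(NULL, REGION_SIZE, PROT_READ | PROT_WRITE, writer_flags,
                  fd, 0);
    peer = mmap(NULL, REGION_SIZE, PROT_READ, MAP_SHARED, fd, 0);
    if (writer == MAP_FAILED || peer == MAP_FAILED) {
        if (writer != MAP_FAILED)
            munmap(writer, REGION_SIZE);
        if (peer != MAP_FAILED)
            munmap(peer, REGION_SIZE);
        fclose(backing);
        return -1;
    }

    writer[0] = 0xAB;              /* the file starts out zero-filled */
    visible = (peer[0] == 0xAB);

    munmap(writer, REGION_SIZE);
    munmap(peer, REGION_SIZE);
    fclose(backing);
    return visible;
}
```

With ``MAP_SHARED`` the write reaches the file and the second mapping sees it.
With ``MAP_PRIVATE`` the write lands in a private copy-on-write page, so a
backend mapping the same file would never observe what the guest wrote.
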

Vhost supported vSwitch reference
---------------------------------

For more vhost details and how to support vhost in vSwitch, please refer to
the vhost example in the DPDK Sample Applications Guide.

Vhost data path acceleration (vDPA)
-----------------------------------

vDPA supports selective datapath in the vhost-user library by enabling virtio
ring compatible devices to serve the virtio driver directly for datapath
acceleration.

``rte_vhost_driver_attach_vdpa_device`` is used to configure the vhost device
with an accelerated backend.

Vhost device capabilities are also made configurable to adapt to various
devices. Such capabilities include supported features, protocol features, and
queue number.

Finally, a set of device ops is defined for device specific operations:

* ``get_queue_num``

  Called to get the supported queue number of the device.

* ``get_features``

  Called to get the supported features of the device.

* ``get_protocol_features``

  Called to get the supported protocol features of the device.

* ``dev_conf``

  Called to configure the actual device when the virtio device becomes ready.

* ``dev_close``

  Called to close the actual device when the virtio device is stopped.

* ``set_vring_state``

  Called to change the state of the vring in the actual device when the vring
  state changes.

* ``set_features``

  Called to set the negotiated features to the device.

* ``migration_done``

  Called to allow the device to respond to RARP sending.

* ``get_vfio_group_fd``

  Called to get the VFIO group fd of the device.

* ``get_vfio_device_fd``

  Called to get the VFIO device fd of the device.

* ``get_notify_area``

  Called to get the notify area info of the queue.
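
A device ops table of this kind is usually a structure of function pointers
that a generic layer calls into. The sketch below is illustrative only:
``struct demo_vdpa_ops``, the ``demo_*`` functions and all signatures are
assumptions made for this example and are not the library's definitions. Only
the op names ``get_queue_num``, ``get_features``, ``dev_conf`` and
``dev_close`` come from the list above.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical ops table: names from the list above, signatures assumed. */
struct demo_vdpa_ops {
    int (*get_queue_num)(int dev_id, uint32_t *queue_num);
    int (*get_features)(int dev_id, uint64_t *features);
    int (*dev_conf)(int vid);
    int (*dev_close)(int vid);
};

/* Toy device state: 1 while the "device" is configured, 0 otherwise. */
static int demo_configured;

static int demo_get_queue_num(int dev_id, uint32_t *queue_num)
{
    (void)dev_id;
    *queue_num = 2;        /* pretend the device has two queues */
    return 0;
}

static int demo_get_features(int dev_id, uint64_t *features)
{
    (void)dev_id;
    *features = 0x3;       /* pretend feature bits 0 and 1 are supported */
    return 0;
}

static int demo_dev_conf(int vid)
{
    (void)vid;
    demo_configured = 1;
    return 0;
}

static int demo_dev_close(int vid)
{
    (void)vid;
    demo_configured = 0;
    return 0;
}

static const struct demo_vdpa_ops demo_ops = {
    .get_queue_num = demo_get_queue_num,
    .get_features = demo_get_features,
    .dev_conf = demo_dev_conf,
    .dev_close = demo_dev_close,
};

/*
 * What a generic layer might do when the virtio device becomes ready: query
 * the device through the table, check that every wanted feature bit is
 * supported, then configure it. Returns 0 on success, -1 otherwise.
 */
static int demo_bring_up(const struct demo_vdpa_ops *ops, int dev_id, int vid,
                         uint64_t wanted)
{
    uint32_t queue_num = 0;
    uint64_t features = 0;

    if (ops == NULL || ops->get_queue_num == NULL ||
        ops->get_features == NULL || ops->dev_conf == NULL)
        return -1;
    if (ops->get_queue_num(dev_id, &queue_num) != 0 || queue_num == 0)
        return -1;
    if (ops->get_features(dev_id, &features) != 0)
        return -1;
    if ((wanted & ~features) != 0)
        return -1;         /* the device lacks a requested feature */
    return ops->dev_conf(vid);
}
```

Keeping the device specifics behind a table like this lets the calling code
stay the same for every device type; a different device only supplies a
different set of function pointers.
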