..  SPDX-License-Identifier: BSD-3-Clause
    Copyright(c) 2010-2016 Intel Corporation.

Vhost Library
=============

The vhost library implements a user space virtio net server allowing the user
to manipulate the virtio ring directly. In other words, it allows the user
to fetch/put packets from/to the VM virtio net device. To achieve this, a
vhost library should be able to:

* Access the guest memory:

  For QEMU, this is done by using the ``-object memory-backend-file,share=on,...``
  option, which means QEMU will create a file to serve as the guest RAM.
  The ``share=on`` option allows another process to map that file, which
  means it can access the guest RAM.

* Know all the necessary information about the vring:

  Information such as where the available ring is stored. Vhost defines some
  messages (passed through a Unix domain socket file) to tell the backend all
  the information it needs to know how to manipulate the vring.


Vhost API Overview
------------------

The following is an overview of some key Vhost API functions:

* ``rte_vhost_driver_register(path, flags)``

  This function registers a vhost driver into the system. ``path`` specifies
  the Unix domain socket file path.

  Currently supported flags are:

  - ``RTE_VHOST_USER_CLIENT``

    DPDK vhost-user will act as the client when this flag is given. See below
    for an explanation.

  - ``RTE_VHOST_USER_NO_RECONNECT``

    When DPDK vhost-user acts as the client it will keep trying to reconnect
    to the server (QEMU) until it succeeds. This is useful in two cases:

    * When QEMU is not started yet.
    * When QEMU restarts (for example due to a guest OS reboot).

    This reconnect option is enabled by default. However, it can be turned off
    by setting this flag.

  - ``RTE_VHOST_USER_IOMMU_SUPPORT``

    IOMMU support will be enabled when this flag is set. It is disabled by
    default.

    Enabling this flag makes it possible to use the guest vIOMMU to protect
    vhost from accessing memory the virtio device isn't allowed to, when the
    feature is negotiated and an IOMMU device is declared.

  - ``RTE_VHOST_USER_POSTCOPY_SUPPORT``

    Postcopy live-migration support will be enabled when this flag is set.
    It is disabled by default.

    Enabling this flag should only be done when the calling application does
    not pre-fault the guest shared memory, otherwise migration would fail.

  - ``RTE_VHOST_USER_LINEARBUF_SUPPORT``

    Enabling this flag forces the vhost dequeue function to only provide
    linear pktmbufs (no multi-segmented pktmbufs).

    The vhost library by default provides a single pktmbuf for a given
    packet, but if for some reason the data doesn't fit into a single
    pktmbuf (e.g., TSO is enabled), the library will allocate additional
    pktmbufs from the same mempool and chain them together to create a
    multi-segmented pktmbuf.

    However, the vhost application needs to support the multi-segmented
    format. If the vhost application does not support that format and requires
    large buffers to be dequeued, this flag should be enabled to force only
    linear buffers (see ``RTE_VHOST_USER_EXTBUF_SUPPORT``) or drop the packet.

    It is disabled by default.

  - ``RTE_VHOST_USER_EXTBUF_SUPPORT``

    Enabling this flag allows the vhost dequeue function to allocate and
    attach an external buffer to a pktmbuf if the pktmbuf doesn't provide
    enough space to store all data.

    This is useful when the vhost application wants to support large packets
    but doesn't want to increase the default mempool object size nor to
    support multi-segmented mbufs (non-linear).
    In this case, a fresh buffer is allocated using ``rte_malloc()`` and
    attached to a pktmbuf using ``rte_pktmbuf_attach_extbuf()``.

    See also ``RTE_VHOST_USER_LINEARBUF_SUPPORT``, which disables
    multi-segmented mbufs for applications that do not support chained mbufs.

    It is disabled by default.

  - ``RTE_VHOST_USER_ASYNC_COPY``

    Asynchronous data path will be enabled when this flag is set. Async
    data path allows applications to enable DMA acceleration for vhost
    queues. Vhost leverages the registered DMA channels to free the CPU from
    memory copy operations in the data path. A set of async data path APIs is
    defined for DPDK applications to make use of the async capability. Only
    packets enqueued/dequeued by async APIs are processed through the async
    data path.

    Currently this feature is only implemented on split ring enqueue data
    path.

    It is disabled by default.

  - ``RTE_VHOST_USER_NET_COMPLIANT_OL_FLAGS``

    Since v16.04, the vhost library forwards checksum and gso requests for
    packets received from a virtio driver by filling Tx offload metadata in
    the mbuf. This behavior is inconsistent with other drivers but it is left
    untouched for existing applications that might rely on it.

    This flag disables the legacy behavior and instead asks vhost to simply
    populate Rx offload metadata in the mbuf.

    It is disabled by default.

  - ``RTE_VHOST_USER_NET_STATS_ENABLE``

    Per-virtqueue statistics collection will be enabled when this flag is set.
    When enabled, the application may use ``rte_vhost_vring_stats_get_names()``
    and ``rte_vhost_vring_stats_get()`` to collect statistics, and
    ``rte_vhost_vring_stats_reset()`` to reset them.

    It is disabled by default.

* ``rte_vhost_driver_set_features(path, features)``

  This function sets the feature bits the vhost-user driver supports. The
  vhost-user driver could be vhost-user net, yet it could be something else,
  say, vhost-user SCSI.

* ``rte_vhost_driver_callback_register(path, vhost_device_ops)``

  This function registers a set of callbacks, to let DPDK applications take
  the appropriate action when some events happen. The following events are
  currently supported:

  * ``new_device(int vid)``

    This callback is invoked when a virtio device becomes ready. ``vid``
    is the vhost device ID.

  * ``destroy_device(int vid)``

    This callback is invoked when a virtio device is paused or shut down.

  * ``vring_state_changed(int vid, uint16_t queue_id, int enable)``

    This callback is invoked when a specific queue's state is changed, for
    example when it is enabled or disabled.

  * ``features_changed(int vid, uint64_t features)``

    This callback is invoked when the features are changed. For example,
    ``VHOST_F_LOG_ALL`` will be set/cleared at the start/end of live
    migration, respectively.

  * ``new_connection(int vid)``

    This callback is invoked on a new vhost-user socket connection. If DPDK
    acts as the server, the device should not be deleted before the
    ``destroy_connection`` callback is received.

  * ``destroy_connection(int vid)``

    This callback is invoked when the vhost-user socket connection is closed.
    It indicates that the device with id ``vid`` is no longer in use and can
    be safely deleted.

* ``rte_vhost_driver_disable/enable_features(path, features)``

  This function disables/enables some features. For example, it can be used to
  disable mergeable buffers and TSO features, both of which are enabled by
  default.

* ``rte_vhost_driver_start(path)``

  This function triggers the vhost-user negotiation. It should be invoked at
  the end of initializing a vhost-user driver.

* ``rte_vhost_enqueue_burst(vid, queue_id, pkts, count)``

  Transmits (enqueues) ``count`` packets from host to guest.

* ``rte_vhost_dequeue_burst(vid, queue_id, mbuf_pool, pkts, count)``

  Receives (dequeues) ``count`` packets from guest, and stores them at ``pkts``.

* ``rte_vhost_crypto_create(vid, cryptodev_id, sess_mempool, socket_id)``

  As an extension of ``new_device()``, this function adds virtio-crypto
  workload acceleration capability to the device. All crypto workload is
  processed by the DPDK cryptodev with the device ID of ``cryptodev_id``.

* ``rte_vhost_crypto_free(vid)``

  Frees the memory and vhost-user message handlers created in
  ``rte_vhost_crypto_create()``.

* ``rte_vhost_crypto_fetch_requests(vid, queue_id, ops, nb_ops)``

  Receives (dequeues) ``nb_ops`` virtio-crypto requests from guest, parses
  them to DPDK Crypto Operations, and fills the ``ops`` with parsing results.

* ``rte_vhost_crypto_finalize_requests(queue_id, ops, nb_ops)``

  After the ``ops`` are dequeued from Cryptodev, finalizes the jobs and
  notifies the guest(s).

* ``rte_vhost_crypto_set_zero_copy(vid, option)``

  Enable or disable the zero copy feature of the vhost crypto backend.

* ``rte_vhost_async_dma_configure(dma_id, vchan_id)``

  Tell vhost which DMA vChannel it is going to use. This function needs to
  be called before registering the async data path for a vring.

* ``rte_vhost_async_channel_register(vid, queue_id)``

  Register async DMA acceleration for a vhost queue after the vring is enabled.

* ``rte_vhost_async_channel_register_thread_unsafe(vid, queue_id)``

  Register async DMA acceleration for a vhost queue without performing
  any locking.

  This function is only safe to call in vhost callback functions
  (i.e., struct rte_vhost_device_ops).

* ``rte_vhost_async_channel_unregister(vid, queue_id)``

  Unregister the async DMA acceleration from a vhost queue.
  Unregistration will fail if the vhost queue has in-flight
  packets that are not completed.

  Unregistering async DMA acceleration in ``vring_state_changed()`` may
  fail, as this API tries to acquire the spinlock of the vhost
  queue. The recommended way is to unregister async copy
  devices for all vhost queues in ``destroy_device()``, when a
  virtio device is paused or shut down.

* ``rte_vhost_async_channel_unregister_thread_unsafe(vid, queue_id)``

  Unregister async DMA acceleration for a vhost queue without performing
  any locking.

  This function is only safe to call in vhost callback functions
  (i.e., struct rte_vhost_device_ops).

* ``rte_vhost_submit_enqueue_burst(vid, queue_id, pkts, count, dma_id, vchan_id)``

  Submit an enqueue request to transmit ``count`` packets from host to guest
  by async data path. Applications must not free the packets submitted for
  enqueue until the packets are completed.

* ``rte_vhost_poll_enqueue_completed(vid, queue_id, pkts, count, dma_id, vchan_id)``

  Poll enqueue completion status from async data path. Completed packets
  are returned to applications through ``pkts``.

* ``rte_vhost_async_get_inflight(vid, queue_id)``

  This function returns the number of in-flight packets for the vhost
  queue using async acceleration.

* ``rte_vhost_async_get_inflight_thread_unsafe(vid, queue_id)``

  Get the number of in-flight packets for a vhost queue without performing
  any locking. It should only be used within the vhost ops, which already
  hold the lock.

* ``rte_vhost_clear_queue_thread_unsafe(vid, queue_id, **pkts, count, dma_id, vchan_id)``

  Clear in-flight packets which are submitted to the async channel in the
  vhost async data path without performing locking on the virtqueue.
  Completed packets are returned to applications through ``pkts``.

* ``rte_vhost_clear_queue(vid, queue_id, **pkts, count, dma_id, vchan_id)``

  Clear in-flight packets which are submitted to the async channel in the
  vhost async data path. Completed packets are returned to applications
  through ``pkts``.

* ``rte_vhost_vring_call_nonblock(int vid, uint16_t vring_idx)``

  Notify the guest that used descriptors have been added to the vring. This
  function returns -EAGAIN when the vq's access lock is held by another
  thread, in which case the user should try again later.

* ``rte_vhost_vring_stats_get_names(int vid, uint16_t queue_id, struct rte_vhost_stat_name *names, unsigned int size)``

  This function returns the names of the queue statistics. It requires
  statistics collection to be enabled at registration time.

* ``rte_vhost_vring_stats_get(int vid, uint16_t queue_id, struct rte_vhost_stat *stats, unsigned int n)``

  This function returns the queue statistics. It requires statistics
  collection to be enabled at registration time.

* ``rte_vhost_vring_stats_reset(int vid, uint16_t queue_id)``

  This function resets the queue statistics. It requires statistics
  collection to be enabled at registration time.

* ``rte_vhost_async_try_dequeue_burst(vid, queue_id, mbuf_pool, pkts, count,
  nr_inflight, dma_id, vchan_id)``

  Receive ``count`` packets from guest to host in async data path,
  and store them at ``pkts``.

* ``rte_vhost_driver_get_vdpa_dev_type(path, type)``

  Get the device type of a vDPA device, such as VDPA_DEVICE_TYPE_NET or
  VDPA_DEVICE_TYPE_BLK.

* ``rte_vhost_async_dma_unconfigure(dma_id, vchan_id)``

  Clean up a DMA vChannel that is no longer in use. After this function is
  called, the specified DMA vChannel should no longer be used by the Vhost
  library.

* ``rte_vhost_notify_guest(int vid, uint16_t queue_id)``

  Inject the offloaded interrupt received by the 'guest_notify' callback,
  into the vhost device's queue.

* ``rte_vhost_driver_set_max_queue_num(const char *path, uint32_t max_queue_pairs)``

  Set the maximum number of queue pairs supported by the device.

Vhost-user Implementations
--------------------------

Vhost-user uses Unix domain sockets for passing messages.
This means the DPDK vhost-user implementation has two options:

* DPDK vhost-user acts as the server.

  DPDK will create a Unix domain socket server file and listen for
  connections from the frontend.

  Note, this is the default mode, and the only mode before DPDK v16.07.


* DPDK vhost-user acts as the client.

  Unlike the server mode, this mode doesn't create the socket file;
  it just tries to connect to the server (which is responsible for creating
  the file instead).

  When the DPDK vhost-user application restarts, DPDK vhost-user will try to
  connect to the server again. This is how the "reconnect" feature works.

  .. Note::
     * The "reconnect" feature requires **QEMU v2.7** (or above).

     * The vhost supported features must be exactly the same before and
       after the restart. For example, if TSO is disabled and then enabled,
       nothing will work and undefined issues might happen.

No matter which mode is used, once a connection is established, DPDK
vhost-user will start receiving and processing vhost messages from QEMU.

For messages with a file descriptor, the file descriptor can be used directly
in the vhost process as it is already installed by the Unix domain socket.
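
As a rough illustration of that mechanism (plain POSIX code, not taken from
the DPDK sources), the sketch below passes a file descriptor as ``SCM_RIGHTS``
ancillary data over a Unix domain socket and then maps the file on the
receiving side. The function names ``send_fd``, ``recv_fd`` and
``demo_fd_passing`` are hypothetical, and a temporary file stands in for the
guest RAM file.

```c
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <sys/uio.h>
#include <unistd.h>

/* Hypothetical helper: send one descriptor as SCM_RIGHTS ancillary data. */
int send_fd(int sock, int fd)
{
	char byte = 0; /* one-byte payload that carries the ancillary data */
	struct iovec iov = { .iov_base = &byte, .iov_len = 1 };
	union {
		char buf[CMSG_SPACE(sizeof(int))];
		struct cmsghdr align; /* forces correct alignment of buf */
	} u;
	struct msghdr msg;
	struct cmsghdr *cmsg;

	memset(&u, 0, sizeof(u));
	memset(&msg, 0, sizeof(msg));
	msg.msg_iov = &iov;
	msg.msg_iovlen = 1;
	msg.msg_control = u.buf;
	msg.msg_controllen = sizeof(u.buf);

	cmsg = CMSG_FIRSTHDR(&msg);
	cmsg->cmsg_level = SOL_SOCKET;
	cmsg->cmsg_type = SCM_RIGHTS;
	cmsg->cmsg_len = CMSG_LEN(sizeof(int));
	memcpy(CMSG_DATA(cmsg), &fd, sizeof(int));

	return sendmsg(sock, &msg, 0) == 1 ? 0 : -1;
}

/* Hypothetical helper: receive one descriptor; returns it, or -1 on error. */
int recv_fd(int sock)
{
	char byte;
	struct iovec iov = { .iov_base = &byte, .iov_len = 1 };
	union {
		char buf[CMSG_SPACE(sizeof(int))];
		struct cmsghdr align;
	} u;
	struct msghdr msg;
	struct cmsghdr *cmsg;
	int fd = -1;

	memset(&msg, 0, sizeof(msg));
	msg.msg_iov = &iov;
	msg.msg_iovlen = 1;
	msg.msg_control = u.buf;
	msg.msg_controllen = sizeof(u.buf);

	if (recvmsg(sock, &msg, 0) != 1)
		return -1;
	cmsg = CMSG_FIRSTHDR(&msg);
	if (cmsg == NULL || cmsg->cmsg_level != SOL_SOCKET ||
	    cmsg->cmsg_type != SCM_RIGHTS)
		return -1;
	/* The kernel has already installed the descriptor in this process. */
	memcpy(&fd, CMSG_DATA(cmsg), sizeof(int));
	return fd;
}

/* Pass a temporary file over a socketpair and read it back through mmap.
 * Returns 0 when the receiver sees the bytes written by the sender. */
int demo_fd_passing(void)
{
	static const char text[] = "guest memory";
	int sv[2], rfd = -1, ret = -1;
	FILE *f = NULL;
	void *map = MAP_FAILED;

	if (socketpair(AF_UNIX, SOCK_STREAM, 0, sv) < 0)
		return -1;
	f = tmpfile(); /* stands in for the shared guest RAM file */
	if (f == NULL)
		goto out;
	if (write(fileno(f), text, sizeof(text)) != (ssize_t)sizeof(text))
		goto out;
	if (send_fd(sv[0], fileno(f)) < 0)
		goto out;
	rfd = recv_fd(sv[1]);
	if (rfd < 0)
		goto out;
	map = mmap(NULL, sizeof(text), PROT_READ, MAP_SHARED, rfd, 0);
	if (map != MAP_FAILED && memcmp(map, text, sizeof(text)) == 0)
		ret = 0;
out:
	if (map != MAP_FAILED)
		munmap(map, sizeof(text));
	if (rfd >= 0)
		close(rfd);
	if (f != NULL)
		fclose(f);
	close(sv[0]);
	close(sv[1]);
	return ret;
}
```

The received descriptor number generally differs from the one that was sent,
but it refers to the same open file, which is why the receiver can map it
without knowing any path.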

The supported vhost messages are:

* ``VHOST_SET_MEM_TABLE``
* ``VHOST_SET_VRING_KICK``
* ``VHOST_SET_VRING_CALL``
* ``VHOST_SET_LOG_FD``
* ``VHOST_SET_VRING_ERR``

For the ``VHOST_SET_MEM_TABLE`` message, QEMU will send information for each
memory region and its file descriptor in the ancillary data of the message.
The file descriptor is used to map that region.

``VHOST_SET_VRING_KICK`` is used as the signal to put the vhost device into
the data plane, and ``VHOST_GET_VRING_BASE`` is used as the signal to remove
the vhost device from the data plane.

When the socket connection is closed, vhost will destroy the device.

Guest memory requirement
------------------------

* Memory pre-allocation

  For the non-async data path, guest memory pre-allocation is not required,
  and skipping it can help save memory. To pre-allocate, add the
  ``-mem-prealloc`` option when starting QEMU, or lock all memory on the vhost
  side, which forces memory to be allocated when it calls mmap
  (the ``--mlockall`` option in ovs-dpdk is one example).

  For the async data path, the vhost library forces the VM memory to be
  pre-allocated when mapping the guest memory, and the memory also needs to
  be locked to prevent pages from being swapped out to disk.

* Memory sharing

  Make sure the ``share=on`` QEMU option is given. Vhost-user will not work
  with a QEMU instance without shared memory mapping.

Vhost supported vSwitch reference
---------------------------------

For more vhost details and how to support vhost in vSwitch, please refer to
the vhost example in the DPDK Sample Applications Guide.

Vhost data path acceleration (vDPA)
-----------------------------------

vDPA supports selective datapath in vhost-user lib by enabling virtio ring
compatible devices to serve virtio driver directly for datapath acceleration.

``rte_vhost_driver_attach_vdpa_device`` is used to configure the vhost device
with an accelerated backend.

Vhost device capabilities are also made configurable to accommodate various
devices. Such capabilities include supported features, protocol features, and
queue number.

Finally, a set of device ops is defined for device specific operations:

* ``get_queue_num``

  Called to get supported queue number of the device.

* ``get_features``

  Called to get supported features of the device.

* ``get_protocol_features``

  Called to get supported protocol features of the device.

* ``dev_conf``

  Called to configure the actual device when the virtio device becomes ready.

* ``dev_close``

  Called to close the actual device when the virtio device is stopped.

* ``set_vring_state``

  Called to change the state of the vring in the actual device when vring state
  changes.

* ``set_features``

  Called to set the negotiated features to the device.

* ``migration_done``

  Called to allow the device to respond to RARP sending.

* ``get_vfio_group_fd``

  Called to get the VFIO group fd of the device.

* ``get_vfio_device_fd``

  Called to get the VFIO device fd of the device.

* ``get_notify_area``

  Called to get the notify area info of the queue.

Vhost asynchronous data path
----------------------------

Vhost asynchronous data path leverages DMA devices to offload memory
copies from the CPU and it is implemented in an asynchronous way. It
enables applications, like OVS, to save CPU cycles and hide memory copy
overhead, thus achieving higher throughput.

Vhost doesn't manage DMA devices; applications, like OVS, need to
manage and configure DMA devices. Applications need to tell vhost what
DMA devices to use in every data path function call. This design gives
applications the flexibility to dynamically use DMA channels in
different function modules, not limited to vhost.

In addition, vhost supports M:N mapping between vrings and DMA virtual
channels. Specifically, one vring can use multiple different DMA channels
and one DMA channel can be shared by multiple vrings at the same time.
The reason for enabling one vring to use multiple DMA channels is that
more than one dataplane thread may enqueue packets to the same vring,
each with its own DMA virtual channel. Besides, the number of DMA devices
is limited. For the purpose of scaling, it's necessary to
support sharing DMA channels among vrings.

* Async enqueue API usage

  In the async enqueue path, ``rte_vhost_poll_enqueue_completed()`` needs to
  be called in time to notify the guest of packets whose DMA copy has
  completed. Moreover, calling ``rte_vhost_submit_enqueue_burst()``
  repeatedly without polling for completions will cause the DMA ring to fill
  up, which will eventually result in packet loss.

* Recommended IOVA mode in async datapath

  When DMA devices are bound to the VFIO driver, VA mode is recommended.
  For PA mode, page by page mapping may exceed the IOMMU's max capability,
  so it is better to use 1G guest hugepages.

  For the UIO driver or a kernel driver, any VFIO related error messages
  can be ignored.
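
The async enqueue guidance above can be pictured with a small toy model. It
is a sketch under stated assumptions, not DPDK code: ``toy_submit``,
``toy_poll``, ``toy_run`` and ``TOY_RING_SIZE`` are hypothetical names, the
ring capacity is arbitrary, and every submitted copy is assumed to have
completed by the time the next poll happens. It only shows why submitting
without polling ends in drops once the in-flight ring is full.

```c
#include <stdint.h>

#define TOY_RING_SIZE 8 /* hypothetical capacity for in-flight copies */

/* Toy stand-in for the per-queue async state. */
struct toy_async_queue {
	uint16_t inflight; /* copies submitted but not yet polled */
};

/* Accept up to `count` packets, limited by free slots; returns accepted. */
uint16_t toy_submit(struct toy_async_queue *q, uint16_t count)
{
	uint16_t free_slots = (uint16_t)(TOY_RING_SIZE - q->inflight);
	uint16_t n = count < free_slots ? count : free_slots;

	q->inflight = (uint16_t)(q->inflight + n);
	return n;
}

/* Retire up to `count` completed packets; returns the number retired. */
uint16_t toy_poll(struct toy_async_queue *q, uint16_t count)
{
	uint16_t n = count < q->inflight ? count : q->inflight;

	q->inflight = (uint16_t)(q->inflight - n);
	return n;
}

/* Offer `burst` packets per round for `rounds` rounds.
 * Returns how many packets could not be submitted (dropped). */
uint32_t toy_run(uint16_t burst, unsigned int rounds, int poll_each_round)
{
	struct toy_async_queue q = { 0 };
	uint32_t dropped = 0;
	unsigned int i;

	for (i = 0; i < rounds; i++) {
		if (poll_each_round)
			toy_poll(&q, burst);
		dropped += (uint32_t)(burst - toy_submit(&q, burst));
	}
	return dropped;
}
```

In this model, calling ``toy_run(4, 10, 1)`` drops nothing, while
``toy_run(4, 10, 0)`` fills the ring after two rounds and drops every packet
offered afterwards. A real application would make the corresponding calls to
``rte_vhost_poll_enqueue_completed()`` and ``rte_vhost_submit_enqueue_burst()``
with its own ``dma_id`` and ``vchan_id``.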