..  BSD LICENSE
    Copyright(c) 2010-2016 Intel Corporation. All rights reserved.
    All rights reserved.

    Redistribution and use in source and binary forms, with or without
    modification, are permitted provided that the following conditions
    are met:

    * Redistributions of source code must retain the above copyright
      notice, this list of conditions and the following disclaimer.
    * Redistributions in binary form must reproduce the above copyright
      notice, this list of conditions and the following disclaimer in
      the documentation and/or other materials provided with the
      distribution.
    * Neither the name of Intel Corporation nor the names of its
      contributors may be used to endorse or promote products derived
      from this software without specific prior written permission.

    THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS
    "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT
    LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR
    A PARTICULAR PURPOSE ARE DISCLAIMED.
    IN NO EVENT SHALL THE COPYRIGHT
    OWNER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL,
    SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT
    LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE,
    DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY
    THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
    (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
    OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.

Vhost Library
=============

The vhost library implements a user space virtio net server allowing the user
to manipulate the virtio ring directly. In other words, it allows the user
to fetch/put packets from/to the VM virtio net device. To achieve this, a
vhost library should be able to:

* Access the guest memory:

  For QEMU, this is done by using the ``-object memory-backend-file,share=on,...``
  option, which means QEMU will create a file to serve as the guest RAM.
  The ``share=on`` option allows another process to map that file, which
  means it can access the guest RAM.

* Know all the necessary information about the vring:

  Information such as where the available ring is stored. Vhost defines some
  messages to tell the backend all the information it needs to know how to
  manipulate the vring.
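
The guest memory point above rests on ordinary shared file mappings. The
following is a minimal sketch of that mechanism only, not of the DPDK code:
``create_backing_file`` and ``map_shared_file`` are hypothetical helpers
written for this illustration, and a small temporary file stands in for the
file QEMU would create for guest RAM.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <fcntl.h>
#include <stddef.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

/* Hypothetical helper: create (or truncate) a file of `size` bytes that
 * stands in for the guest RAM backing file. Returns 0 on success. */
static int create_backing_file(const char *path, size_t size)
{
    int fd = open(path, O_RDWR | O_CREAT | O_TRUNC, 0600);
    if (fd < 0)
        return -1;

    int rc = ftruncate(fd, (off_t)size);
    close(fd);
    return rc;
}

/* Hypothetical helper: map `size` bytes of `path` with MAP_SHARED, so that
 * stores made through one mapping are visible through every other shared
 * mapping of the same file. Returns NULL on failure. */
static void *map_shared_file(const char *path, size_t size)
{
    assert(path != NULL && size > 0);

    int fd = open(path, O_RDWR);
    if (fd < 0)
        return NULL;

    void *addr = mmap(NULL, size, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    close(fd); /* the mapping stays valid after the descriptor is closed */

    return addr == MAP_FAILED ? NULL : addr;
}
```

Two calls to ``map_shared_file`` on the same path, whether from one process or
two, yield views of the same pages. This is the property that lets a vhost
backend read and write the rings and buffers that the guest sees.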

Currently, there are two ways to pass these messages and as a result there are
two Vhost implementations in DPDK: *vhost-cuse* (where the character devices
are in user space) and *vhost-user*.

Vhost-cuse creates a user space character device and hooks a function to its
ioctl call, so that all ioctl commands that are sent from the frontend (QEMU)
will be captured and handled.

Vhost-user creates a Unix domain socket file through which messages are
passed.

.. Note::

   Since DPDK v2.2, the majority of the development effort has gone into
   enhancing vhost-user, such as multiple queue, live migration, and
   reconnect. Thus, it is strongly advised to use vhost-user instead of
   vhost-cuse.


Vhost API Overview
------------------

The following is an overview of the Vhost API functions:

* ``rte_vhost_driver_register(path, flags)``

  This function registers a vhost driver into the system. For vhost-cuse, a
  ``/dev/path`` character device file will be created. For vhost-user server
  mode, a Unix domain socket file ``path`` will be created.

  Currently supported flags are (these are valid for vhost-user only):

  - ``RTE_VHOST_USER_CLIENT``

    DPDK vhost-user will act as the client when this flag is given. See below
    for an explanation.

  - ``RTE_VHOST_USER_NO_RECONNECT``

    When DPDK vhost-user acts as the client it will keep trying to reconnect
    to the server (QEMU) until it succeeds. This is useful in two cases:

    * When QEMU is not started yet.
    * When QEMU restarts (for example due to a guest OS reboot).

    This reconnect option is enabled by default. However, it can be turned off
    by setting this flag.

  - ``RTE_VHOST_USER_DEQUEUE_ZERO_COPY``

    Dequeue zero copy will be enabled when this flag is set. It is disabled by
    default.

    There are some facts (including limitations) you might want to know before
    setting this flag:

    * zero copy is not good for small packets (typically for packet size below
      512).

    * zero copy is really good for VM2VM case. For iperf between two VMs, the
      boost could be above 70% (when TSO is enabled).

    * for VM2NIC case, the ``nb_tx_desc`` has to be small enough: <= 64 if virtio
      indirect feature is not enabled and <= 128 if it is enabled.

      This is because when dequeue zero copy is enabled, guest Tx used vring will
      be updated only when corresponding mbuf is freed. Thus, the ``nb_tx_desc``
      has to be small enough so that the PMD driver will run out of available
      Tx descriptors and free mbufs in a timely manner. Otherwise, guest Tx
      vring would be starved.

    * Guest memory should be backed by huge pages to achieve better
      performance. Using 1G page size is the best.

      When dequeue zero copy is enabled, the guest phys address and host phys
      address mapping has to be established. Using non-huge pages means far
      more page segments. To make it simple, DPDK vhost does a linear search
      of those segments, thus the fewer the segments, the quicker we will get
      the mapping. NOTE: we may speed it up by using tree searching in future.

* ``rte_vhost_driver_session_start()``

  This function starts the vhost session loop to handle vhost messages. It
  starts an infinite loop, therefore it should be called in a dedicated
  thread.

* ``rte_vhost_driver_callback_register(virtio_net_device_ops)``

  This function registers a set of callbacks, to let DPDK applications take
  the appropriate action when some events happen.
  The following events are currently supported:

  * ``new_device(int vid)``

    This callback is invoked when a virtio net device becomes ready. ``vid``
    is the virtio net device ID.

  * ``destroy_device(int vid)``

    This callback is invoked when a virtio net device shuts down (or when the
    vhost connection is broken).

  * ``vring_state_changed(int vid, uint16_t queue_id, int enable)``

    This callback is invoked when a specific queue's state is changed, for
    example to enabled or disabled.

* ``rte_vhost_enqueue_burst(vid, queue_id, pkts, count)``

  Transmits (enqueues) ``count`` packets from host to guest.

* ``rte_vhost_dequeue_burst(vid, queue_id, mbuf_pool, pkts, count)``

  Receives (dequeues) ``count`` packets from guest, and stores them at ``pkts``.

* ``rte_vhost_feature_disable/rte_vhost_feature_enable(feature_mask)``

  This function disables/enables some features. For example, it can be used to
  disable mergeable buffers and TSO features, both of which are enabled by
  default.
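
The dequeue zero copy notes above explain why fewer, larger memory segments
make the guest-to-host physical address lookup faster. The sketch below
illustrates a linear search of that kind. It is an assumption-laden
illustration rather than the DPDK implementation: ``struct mem_segment`` and
``translate_gpa`` are hypothetical names, and the rule that a range must sit
inside a single segment is a simplification chosen for this example.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical record: one physically contiguous chunk of guest memory. */
struct mem_segment {
    uint64_t guest_phys_addr; /* start of the chunk as seen by the guest */
    uint64_t host_phys_addr;  /* start of the same chunk on the host */
    uint64_t size;            /* length of the chunk in bytes */
};

/* Translate the guest physical range [gpa, gpa + len) to a host physical
 * address by scanning the segments in order. Returns 0 when the range is
 * not fully contained in a single segment. The cost grows linearly with
 * the number of segments, which is why huge pages help. */
static uint64_t translate_gpa(const struct mem_segment *segs, size_t nr_segs,
                              uint64_t gpa, uint64_t len)
{
    assert(segs != NULL || nr_segs == 0);

    for (size_t i = 0; i < nr_segs; i++) {
        const struct mem_segment *s = &segs[i];

        /* Written to avoid overflow: compare offsets, not end addresses. */
        if (gpa >= s->guest_phys_addr && len <= s->size &&
            gpa - s->guest_phys_addr <= s->size - len)
            return s->host_phys_addr + (gpa - s->guest_phys_addr);
    }
    return 0;
}
```

With 1G pages a few gigabytes of guest RAM need only a handful of entries,
while 4K pages for the same guest would need on the order of a million, and
every lookup walks that list.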


Vhost Implementations
---------------------

Vhost-cuse implementation
~~~~~~~~~~~~~~~~~~~~~~~~~

When vSwitch registers the vhost driver, it will register a cuse device driver
into the system and create a character device file. This cuse driver will
receive vhost open/release/IOCTL messages from the QEMU simulator.

When the open call is received, the vhost driver will create a vhost device
for the virtio device in the guest.

When the ``VHOST_SET_MEM_TABLE`` ioctl is received, vhost searches the memory
region to find the starting user space virtual address that maps the memory of
the guest virtual machine. Through this virtual address and the QEMU pid,
vhost can find the file QEMU uses to map the guest memory. Vhost maps this
file into its address space; in this way vhost can fully access the guest
physical memory, which means vhost can access the shared virtio ring and the
guest physical address specified in the entry of the ring.

The guest virtual machine tells the vhost whether the virtio device is ready
for processing or is de-activated through the ``VHOST_NET_SET_BACKEND``
message. The registered callback from vSwitch will be called.

When the release call is made, vhost will destroy the device.

Vhost-user implementation
~~~~~~~~~~~~~~~~~~~~~~~~~

Vhost-user uses Unix domain sockets for passing messages. This means the DPDK
vhost-user implementation has two options:

* DPDK vhost-user acts as the server.

  DPDK will create a Unix domain socket server file and listen for
  connections from the frontend.

  Note, this is the default mode, and the only mode before DPDK v16.07.


* DPDK vhost-user acts as the client.

  Unlike the server mode, this mode doesn't create the socket file;
  it just tries to connect to the server (which is responsible for creating
  the file instead).

  When the DPDK vhost-user application restarts, DPDK vhost-user will try to
  connect to the server again. This is how the "reconnect" feature works.

  .. Note::

     * The "reconnect" feature requires **QEMU v2.7** (or above).

     * The vhost supported features must be exactly the same before and
       after the restart. For example, if TSO is disabled and then enabled,
       nothing will work and undefined issues might happen.

No matter which mode is used, once a connection is established, DPDK
vhost-user will start receiving and processing vhost messages from QEMU.
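
The difference between the two modes can be sketched with plain POSIX sockets.
This is an illustration of the socket roles only, not the library's code:
``listen_unix`` and ``connect_with_retry`` are hypothetical helpers, and the
bounded retry count is an assumption made so the example terminates (the text
above describes the library as retrying until it succeeds).

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/un.h>
#include <time.h>
#include <unistd.h>

/* Fill a sockaddr_un for `path`; returns -1 if the path does not fit. */
static int make_addr(const char *path, struct sockaddr_un *addr)
{
    assert(path != NULL && addr != NULL);
    if (strlen(path) >= sizeof(addr->sun_path))
        return -1;
    memset(addr, 0, sizeof(*addr));
    addr->sun_family = AF_UNIX;
    strcpy(addr->sun_path, path);
    return 0;
}

/* Server role: create the socket file at `path` and listen on it.
 * Returns the listening descriptor, or -1 on failure. */
static int listen_unix(const char *path)
{
    struct sockaddr_un addr;
    if (make_addr(path, &addr) < 0)
        return -1;

    int fd = socket(AF_UNIX, SOCK_STREAM, 0);
    if (fd < 0)
        return -1;

    unlink(path); /* remove a stale socket file left by an earlier run */
    if (bind(fd, (struct sockaddr *)&addr, sizeof(addr)) < 0 ||
        listen(fd, 1) < 0) {
        close(fd);
        return -1;
    }
    return fd;
}

/* Client role: never creates the file. Tries connect() up to `attempts`
 * times, sleeping `delay_ms` between tries. Returns the connected
 * descriptor, or -1 if every attempt failed. */
static int connect_with_retry(const char *path, int attempts, long delay_ms)
{
    struct sockaddr_un addr;
    if (make_addr(path, &addr) < 0)
        return -1;

    for (int i = 0; i < attempts; i++) {
        int fd = socket(AF_UNIX, SOCK_STREAM, 0);
        if (fd < 0)
            return -1;
        if (connect(fd, (struct sockaddr *)&addr, sizeof(addr)) == 0)
            return fd;
        close(fd);

        struct timespec ts = { delay_ms / 1000, (delay_ms % 1000) * 1000000L };
        nanosleep(&ts, NULL);
    }
    return -1;
}
```

Because the client keeps retrying instead of failing on the first refused
connection, it does not matter whether the peer that owns the socket file
starts before or after it, which is the situation the reconnect feature
addresses.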

For messages with a file descriptor, the file descriptor can be used directly
in the vhost process as it is already installed by the Unix domain socket.

The supported vhost messages are:

* ``VHOST_SET_MEM_TABLE``
* ``VHOST_SET_VRING_KICK``
* ``VHOST_SET_VRING_CALL``
* ``VHOST_SET_LOG_FD``
* ``VHOST_SET_VRING_ERR``

For the ``VHOST_SET_MEM_TABLE`` message, QEMU will send information for each
memory region and its file descriptor in the ancillary data of the message.
The file descriptor is used to map that region.

There is no ``VHOST_NET_SET_BACKEND`` message as in vhost-cuse to signal
whether the virtio device is ready or stopped. Instead,
``VHOST_SET_VRING_KICK`` is used as the signal to put the vhost device into
the data plane, and ``VHOST_GET_VRING_BASE`` is used as the signal to remove
the vhost device from the data plane.

When the socket connection is closed, vhost will destroy the device.

Vhost supported vSwitch reference
---------------------------------

For more vhost details and how to support vhost in vSwitch, please refer to
the vhost example in the DPDK Sample Applications Guide.
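
As a supplement to the vhost-user message handling described earlier, the
sketch below shows how a file descriptor travels in the ancillary data of a
Unix domain socket message using ``SCM_RIGHTS``. It demonstrates the general
POSIX mechanism only. ``send_fd`` and ``recv_fd`` are hypothetical helpers,
and the one-byte payload is a placeholder, not the vhost-user message format.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/uio.h>
#include <unistd.h>

/* Properly aligned buffer with room for exactly one descriptor. */
union fd_ctrl {
    char buf[CMSG_SPACE(sizeof(int))];
    struct cmsghdr align;
};

/* Send a one-byte placeholder payload and attach `fd_to_send` as
 * SCM_RIGHTS ancillary data. Returns 0 on success, -1 on failure. */
static int send_fd(int sock, int fd_to_send)
{
    assert(sock >= 0 && fd_to_send >= 0);

    char payload = 'F';
    struct iovec iov = { .iov_base = &payload, .iov_len = 1 };
    union fd_ctrl ctrl;
    struct msghdr msg;

    memset(&ctrl, 0, sizeof(ctrl));
    memset(&msg, 0, sizeof(msg));
    msg.msg_iov = &iov;
    msg.msg_iovlen = 1;
    msg.msg_control = ctrl.buf;
    msg.msg_controllen = sizeof(ctrl.buf);

    struct cmsghdr *cmsg = CMSG_FIRSTHDR(&msg);
    cmsg->cmsg_level = SOL_SOCKET;
    cmsg->cmsg_type = SCM_RIGHTS;
    cmsg->cmsg_len = CMSG_LEN(sizeof(int));
    memcpy(CMSG_DATA(cmsg), &fd_to_send, sizeof(int));

    return sendmsg(sock, &msg, 0) == 1 ? 0 : -1;
}

/* Receive one message and return the descriptor that the kernel installed
 * in this process, or -1 if none arrived. */
static int recv_fd(int sock)
{
    char payload;
    struct iovec iov = { .iov_base = &payload, .iov_len = 1 };
    union fd_ctrl ctrl;
    struct msghdr msg;

    memset(&ctrl, 0, sizeof(ctrl));
    memset(&msg, 0, sizeof(msg));
    msg.msg_iov = &iov;
    msg.msg_iovlen = 1;
    msg.msg_control = ctrl.buf;
    msg.msg_controllen = sizeof(ctrl.buf);

    if (recvmsg(sock, &msg, 0) <= 0)
        return -1;

    struct cmsghdr *cmsg = CMSG_FIRSTHDR(&msg);
    if (cmsg == NULL || cmsg->cmsg_level != SOL_SOCKET ||
        cmsg->cmsg_type != SCM_RIGHTS ||
        cmsg->cmsg_len != CMSG_LEN(sizeof(int)))
        return -1;

    int fd;
    memcpy(&fd, CMSG_DATA(cmsg), sizeof(int));
    return fd;
}
```

The descriptor returned by ``recv_fd`` is a new entry in the receiver's
descriptor table that refers to the same open file as the sender's. The
receiver can pass it straight to ``mmap()`` or ``read()`` without opening
anything by path, which is the sense in which the descriptor is "already
installed".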