..  BSD LICENSE
    Copyright(c) 2010-2016 Intel Corporation. All rights reserved.
    All rights reserved.

    Redistribution and use in source and binary forms, with or without
    modification, are permitted provided that the following conditions
    are met:

    * Redistributions of source code must retain the above copyright
    notice, this list of conditions and the following disclaimer.
    * Redistributions in binary form must reproduce the above copyright
    notice, this list of conditions and the following disclaimer in
    the documentation and/or other materials provided with the
    distribution.
    * Neither the name of Intel Corporation nor the names of its
    contributors may be used to endorse or promote products derived
    from this software without specific prior written permission.

    THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS
    "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT
    LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR
    A PARTICULAR PURPOSE ARE DISCLAIMED.
    IN NO EVENT SHALL THE COPYRIGHT
    OWNER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL,
    SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT
    LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE,
    DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY
    THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
    (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
    OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.

Vhost Library
=============

The vhost library implements a user space virtio net server allowing the user
to manipulate the virtio ring directly. In other words, it allows the user
to fetch/put packets from/to the VM virtio net device. To achieve this, a
vhost library should be able to:

* Access the guest memory:

  For QEMU, this is done by using the ``-object memory-backend-file,share=on,...``
  option, which means QEMU will create a file to serve as the guest RAM.
  The ``share=on`` option allows another process to map that file, which
  means it can access the guest RAM.

* Know all the necessary information about the vring:

  Information such as where the available ring is stored. Vhost defines some
  messages to tell the backend all the information it needs to know how to
  manipulate the vring.
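
The shared-file idea behind the first requirement above can be illustrated with
a minimal sketch. It is not DPDK or QEMU code: ``map_shared_file`` and
``shared_mapping_demo`` are hypothetical helpers, and a temporary file under
``/tmp`` stands in for the file that backs the guest RAM. The sketch only shows
that two independent ``MAP_SHARED`` mappings of one file observe each other's
writes.

.. code-block:: c

    #include <assert.h>
    #include <fcntl.h>
    #include <stdlib.h>
    #include <string.h>
    #include <sys/mman.h>
    #include <unistd.h>

    /* Hypothetical helper: map `len` bytes of the file at `path` as shared
     * memory. Returns MAP_FAILED on error. */
    static void *map_shared_file(const char *path, size_t len)
    {
        assert(path != NULL && len > 0);

        int fd = open(path, O_RDWR);
        if (fd < 0)
            return MAP_FAILED;

        void *addr = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
        close(fd); /* the mapping stays valid after the descriptor is closed */
        return addr;
    }

    /* Two mappings of the same file: one plays the frontend (QEMU), the
     * other plays the backend (vhost). Returns 0 if the backend sees the
     * data written by the frontend, -1 otherwise. */
    static int shared_mapping_demo(void)
    {
        char path[] = "/tmp/guest-ram-XXXXXX"; /* assumed scratch location */
        size_t len = 4096;
        int rc = -1;

        int fd = mkstemp(path);
        if (fd < 0)
            return -1;
        if (ftruncate(fd, (off_t)len) != 0) {
            close(fd);
            unlink(path);
            return -1;
        }
        close(fd);

        char *frontend = map_shared_file(path, len);
        char *backend = map_shared_file(path, len);
        if (frontend != MAP_FAILED && backend != MAP_FAILED) {
            strcpy(frontend, "descriptor");
            rc = strcmp(backend, "descriptor") == 0 ? 0 : -1;
        }

        if (frontend != MAP_FAILED)
            munmap(frontend, len);
        if (backend != MAP_FAILED)
            munmap(backend, len);
        unlink(path);
        return rc;
    }

Calling ``shared_mapping_demo()`` from a test program is expected to return 0
on a system where ``/tmp`` is writable. In a real deployment the two mappings
live in different processes, which is what ``share=on`` permits.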

Currently, there are two ways to pass these messages and as a result there are
two Vhost implementations in DPDK: *vhost-cuse* (where the character devices
are in user space) and *vhost-user*.

Vhost-cuse creates a user space character device and hooks its ioctl handler,
so that all ioctl commands that are sent from the frontend (QEMU) will be
captured and handled.

Vhost-user creates a Unix domain socket file through which messages are
passed.

.. Note::

   Since DPDK v2.2, the majority of the development effort has gone into
   enhancing vhost-user, such as multiple queue, live migration, and
   reconnect. Thus, it is strongly advised to use vhost-user instead of
   vhost-cuse.


Vhost API Overview
------------------

The following is an overview of the Vhost API functions:

* ``rte_vhost_driver_register(path, flags)``

  This function registers a vhost driver into the system. For vhost-cuse, a
  ``/dev/path`` character device file will be created. For vhost-user server
  mode, a Unix domain socket file ``path`` will be created.

  Currently two flags are supported (these are valid for vhost-user only):

  - ``RTE_VHOST_USER_CLIENT``

    DPDK vhost-user will act as the client when this flag is given. See below
    for an explanation.

  - ``RTE_VHOST_USER_NO_RECONNECT``

    When DPDK vhost-user acts as the client it will keep trying to reconnect
    to the server (QEMU) until it succeeds. This is useful in two cases:

    * When QEMU is not started yet.
    * When QEMU restarts (for example due to a guest OS reboot).

    This reconnect option is enabled by default. However, it can be turned off
    by setting this flag.

* ``rte_vhost_driver_session_start()``

  This function starts the vhost session loop to handle vhost messages. It
  starts an infinite loop, therefore it should be called in a dedicated
  thread.

* ``rte_vhost_driver_callback_register(virtio_net_device_ops)``

  This function registers a set of callbacks, to let DPDK applications take
  the appropriate action when some events happen. The following events are
  currently supported:

  * ``new_device(int vid)``

    This callback is invoked when a virtio net device becomes ready. ``vid``
    is the virtio net device ID.

  * ``destroy_device(int vid)``

    This callback is invoked when a virtio net device shuts down (or when the
    vhost connection is broken).

  * ``vring_state_changed(int vid, uint16_t queue_id, int enable)``

    This callback is invoked when a specific queue's state is changed, for
    example to enabled or disabled.

* ``rte_vhost_enqueue_burst(vid, queue_id, pkts, count)``

  Transmits (enqueues) ``count`` packets from host to guest.

* ``rte_vhost_dequeue_burst(vid, queue_id, mbuf_pool, pkts, count)``

  Receives (dequeues) ``count`` packets from guest, and stores them at ``pkts``.

* ``rte_vhost_feature_disable/rte_vhost_feature_enable(feature_mask)``

  These functions disable/enable some features. For example, they can be used
  to disable mergeable buffers and TSO features, both of which are enabled by
  default.


Vhost Implementations
---------------------

Vhost-cuse implementation
~~~~~~~~~~~~~~~~~~~~~~~~~

When vSwitch registers the vhost driver, it will register a cuse device driver
into the system and create a character device file. This cuse driver will
receive vhost open/release/IOCTL messages from the QEMU emulator.

When the open call is received, the vhost driver will create a vhost device
for the virtio device in the guest.

When the ``VHOST_SET_MEM_TABLE`` ioctl is received, vhost searches the memory
region to find the starting user space virtual address that maps the memory of
the guest virtual machine. Through this virtual address and the QEMU pid,
vhost can find the file QEMU uses to map the guest memory. Vhost maps this
file into its address space. In this way vhost can fully access the guest
physical memory, which means vhost can access the shared virtio ring and the
guest physical addresses specified in the entries of the ring.

The guest virtual machine tells the vhost whether the virtio device is ready
for processing or is de-activated through the ``VHOST_NET_SET_BACKEND``
message. The registered callback from vSwitch will be called.

When the release call is made, vhost will destroy the device.

Vhost-user implementation
~~~~~~~~~~~~~~~~~~~~~~~~~

Vhost-user uses Unix domain sockets for passing messages. This means the DPDK
vhost-user implementation has two options:

* DPDK vhost-user acts as the server.

  DPDK will create a Unix domain socket server file and listen for
  connections from the frontend.

  Note, this is the default mode, and the only mode before DPDK v16.07.


* DPDK vhost-user acts as the client.

  Unlike the server mode, this mode doesn't create the socket file;
  it just tries to connect to the server (which is responsible for creating
  the file instead).

  When the DPDK vhost-user application restarts, DPDK vhost-user will try to
  connect to the server again. This is how the "reconnect" feature works.

  Note: the "reconnect" feature requires **QEMU v2.7** (or above).

No matter which mode is used, once a connection is established, DPDK
vhost-user will start receiving and processing vhost messages from QEMU.

For messages with a file descriptor, the file descriptor can be used directly
in the vhost process as it is already installed by the Unix domain socket.

The supported vhost messages are:

* ``VHOST_SET_MEM_TABLE``
* ``VHOST_SET_VRING_KICK``
* ``VHOST_SET_VRING_CALL``
* ``VHOST_SET_LOG_FD``
* ``VHOST_SET_VRING_ERR``

For the ``VHOST_SET_MEM_TABLE`` message, QEMU will send information for each
memory region and its file descriptor in the ancillary data of the message.
The file descriptor is used to map that region.
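
As a rough illustration of how a file descriptor travels in the ancillary data
of a Unix domain socket message and is then used to map a region, consider the
sketch below. It is not the DPDK implementation and does not use the vhost
message layout: ``send_fd``, ``recv_fd`` and ``fd_passing_demo`` are
hypothetical names, a one-byte payload replaces the real message body, and an
unlinked temporary file stands in for a guest memory region.

.. code-block:: c

    #include <assert.h>
    #include <stdlib.h>
    #include <string.h>
    #include <sys/mman.h>
    #include <sys/socket.h>
    #include <sys/uio.h>
    #include <unistd.h>

    /* Control buffer sized and aligned for exactly one descriptor. */
    union fd_ctl {
        char buf[CMSG_SPACE(sizeof(int))];
        struct cmsghdr align;
    };

    /* Send a one-byte payload with `fd` attached as SCM_RIGHTS data. */
    static int send_fd(int sock, int fd)
    {
        char payload = 'M';
        struct iovec iov = { .iov_base = &payload, .iov_len = 1 };
        union fd_ctl ctl;
        struct msghdr msg;

        memset(&ctl, 0, sizeof(ctl));
        memset(&msg, 0, sizeof(msg));
        msg.msg_iov = &iov;
        msg.msg_iovlen = 1;
        msg.msg_control = ctl.buf;
        msg.msg_controllen = sizeof(ctl.buf);

        struct cmsghdr *cmsg = CMSG_FIRSTHDR(&msg);
        assert(cmsg != NULL);
        cmsg->cmsg_level = SOL_SOCKET;
        cmsg->cmsg_type = SCM_RIGHTS;
        cmsg->cmsg_len = CMSG_LEN(sizeof(int));
        memcpy(CMSG_DATA(cmsg), &fd, sizeof(int));

        return sendmsg(sock, &msg, 0) == 1 ? 0 : -1;
    }

    /* Receive one message; return the descriptor it carries, or -1. */
    static int recv_fd(int sock)
    {
        char payload;
        struct iovec iov = { .iov_base = &payload, .iov_len = 1 };
        union fd_ctl ctl;
        struct msghdr msg;
        int fd = -1;

        memset(&ctl, 0, sizeof(ctl));
        memset(&msg, 0, sizeof(msg));
        msg.msg_iov = &iov;
        msg.msg_iovlen = 1;
        msg.msg_control = ctl.buf;
        msg.msg_controllen = sizeof(ctl.buf);

        if (recvmsg(sock, &msg, 0) <= 0)
            return -1;

        for (struct cmsghdr *c = CMSG_FIRSTHDR(&msg); c != NULL;
             c = CMSG_NXTHDR(&msg, c)) {
            if (c->cmsg_level == SOL_SOCKET && c->cmsg_type == SCM_RIGHTS) {
                memcpy(&fd, CMSG_DATA(c), sizeof(int));
                break;
            }
        }
        return fd;
    }

    /* Pass a region's descriptor over a socket pair, then map it on the
     * receiving side. Returns 0 if the mapped contents match, else -1. */
    static int fd_passing_demo(void)
    {
        int sv[2];
        char path[] = "/tmp/mem-region-XXXXXX"; /* assumed scratch location */
        size_t len = 4096;
        int rc = -1;

        if (socketpair(AF_UNIX, SOCK_STREAM, 0, sv) != 0)
            return -1;
        int region_fd = mkstemp(path);
        if (region_fd < 0) {
            close(sv[0]);
            close(sv[1]);
            return -1;
        }
        unlink(path); /* the file lives on through its open descriptors */

        if (ftruncate(region_fd, (off_t)len) == 0 &&
            write(region_fd, "ring", 5) == 5 &&
            send_fd(sv[0], region_fd) == 0) {
            int got = recv_fd(sv[1]);
            if (got >= 0) {
                char *region = mmap(NULL, len, PROT_READ | PROT_WRITE,
                                    MAP_SHARED, got, 0);
                if (region != MAP_FAILED) {
                    rc = strcmp(region, "ring") == 0 ? 0 : -1;
                    munmap(region, len);
                }
                close(got);
            }
        }
        close(region_fd);
        close(sv[0]);
        close(sv[1]);
        return rc;
    }

The receiver never opens the file by name. The kernel installs a new
descriptor in the receiving process when the message is delivered, which is
why the descriptor can be passed straight to ``mmap``.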

There is no ``VHOST_NET_SET_BACKEND`` message as in vhost-cuse to signal
whether the virtio device is ready or stopped. Instead,
``VHOST_SET_VRING_KICK`` is used as the signal to put the vhost device into
the data plane, and ``VHOST_GET_VRING_BASE`` is used as the signal to remove
the vhost device from the data plane.

When the socket connection is closed, vhost will destroy the device.

Vhost supported vSwitch reference
---------------------------------

For more vhost details and how to support vhost in vSwitch, please refer to
the vhost example in the DPDK Sample Applications Guide.