..  BSD LICENSE
    Copyright(c) 2010-2016 Intel Corporation. All rights reserved.
    All rights reserved.

    Redistribution and use in source and binary forms, with or without
    modification, are permitted provided that the following conditions
    are met:

    * Redistributions of source code must retain the above copyright
    notice, this list of conditions and the following disclaimer.
    * Redistributions in binary form must reproduce the above copyright
    notice, this list of conditions and the following disclaimer in
    the documentation and/or other materials provided with the
    distribution.
    * Neither the name of Intel Corporation nor the names of its
    contributors may be used to endorse or promote products derived
    from this software without specific prior written permission.

    THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS
    "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT
    LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR
    A PARTICULAR PURPOSE ARE DISCLAIMED.
    IN NO EVENT SHALL THE COPYRIGHT
    OWNER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL,
    SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT
    LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE,
    DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY
    THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
    (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
    OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.

Vhost Library
=============

The vhost library implements a user space virtio net server allowing the user
to manipulate the virtio ring directly. In other words, it allows the user
to fetch/put packets from/to the VM virtio net device. To achieve this, a
vhost library should be able to:

* Access the guest memory:

  For QEMU, this is done by using the ``-object memory-backend-file,share=on,...``
  option, which means QEMU will create a file to serve as the guest RAM.
  The ``share=on`` option allows another process to map that file, which
  means it can access the guest RAM.

* Know all the necessary information about the vring:

  Information such as where the available ring is stored.
  Vhost defines some
  messages (passed through a Unix domain socket file) to tell the backend all
  the information it needs to know how to manipulate the vring.


Vhost API Overview
------------------

The following is an overview of some key Vhost API functions:

* ``rte_vhost_driver_register(path, flags)``

  This function registers a vhost driver into the system. ``path`` specifies
  the Unix domain socket file path.

  Currently supported flags are:

  - ``RTE_VHOST_USER_CLIENT``

    DPDK vhost-user will act as the client when this flag is given. See below
    for an explanation.

  - ``RTE_VHOST_USER_NO_RECONNECT``

    When DPDK vhost-user acts as the client it will keep trying to reconnect
    to the server (QEMU) until it succeeds. This is useful in two cases:

    * When QEMU is not started yet.
    * When QEMU restarts (for example due to a guest OS reboot).

    This reconnect option is enabled by default. However, it can be turned off
    by setting this flag.

  - ``RTE_VHOST_USER_DEQUEUE_ZERO_COPY``

    Dequeue zero copy will be enabled when this flag is set. It is disabled by
    default.

    There are some facts (including limitations) you might want to know when
    setting this flag:

    * Zero copy is not good for small packets (typically for packet sizes
      below 512).

    * Zero copy is really good for the VM2VM case. For iperf between two VMs,
      the boost could be above 70% (when TSO is enabled).

    * For the VM2NIC case, ``nb_tx_desc`` has to be small enough: <= 64 if the
      virtio indirect feature is not enabled and <= 128 if it is enabled.

      This is because when dequeue zero copy is enabled, the guest Tx used
      vring will be updated only when the corresponding mbuf is freed. Thus,
      ``nb_tx_desc`` has to be small enough so that the PMD driver will run
      out of available Tx descriptors and free mbufs in a timely manner.
      Otherwise, the guest Tx vring would be starved.

    * Guest memory should be backed with huge pages to achieve better
      performance. Using 1G page size is the best.

      When dequeue zero copy is enabled, the guest phys address and host phys
      address mapping has to be established. Using non-huge pages means far
      more page segments. To keep it simple, DPDK vhost does a linear search
      of those segments, thus the fewer the segments, the quicker we will get
      the mapping. NOTE: we may speed it up by using tree searching in future.

* ``rte_vhost_driver_set_features(path, features)``

  This function sets the feature bits the vhost-user driver supports. The
  vhost-user driver could be vhost-user net, yet it could be something else,
  say, vhost-user SCSI.

* ``rte_vhost_driver_callback_register(path, vhost_device_ops)``

  This function registers a set of callbacks, to let DPDK applications take
  the appropriate action when some events happen. The following events are
  currently supported:

  * ``new_device(int vid)``

    This callback is invoked when a virtio device becomes ready. ``vid``
    is the vhost device ID.

  * ``destroy_device(int vid)``

    This callback is invoked when a virtio device is paused or shut down.

  * ``vring_state_changed(int vid, uint16_t queue_id, int enable)``

    This callback is invoked when a specific queue's state is changed, for
    example to enabled or disabled.

  * ``features_changed(int vid, uint64_t features)``

    This callback is invoked when the features are changed. For example,
    ``VHOST_F_LOG_ALL`` will be set/cleared at the start/end of live
    migration, respectively.

  * ``new_connection(int vid)``

    This callback is invoked on a new vhost-user socket connection. If DPDK
    acts as the server, the device should not be deleted before the
    ``destroy_connection`` callback is received.

  * ``destroy_connection(int vid)``

    This callback is invoked when the vhost-user socket connection is closed.
    It indicates that the device with id ``vid`` is no longer in use and can
    be safely deleted.

* ``rte_vhost_driver_disable/enable_features(path, features)``

  These functions disable/enable some features. For example, they can be used
  to disable the mergeable buffers and TSO features, both of which are enabled
  by default.

* ``rte_vhost_driver_start(path)``

  This function triggers the vhost-user negotiation. It should be invoked at
  the end of initializing a vhost-user driver.

* ``rte_vhost_enqueue_burst(vid, queue_id, pkts, count)``

  Transmits (enqueues) ``count`` packets from host to guest.

* ``rte_vhost_dequeue_burst(vid, queue_id, mbuf_pool, pkts, count)``

  Receives (dequeues) ``count`` packets from guest, and stores them at ``pkts``.
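
The linear segment search mentioned for ``RTE_VHOST_USER_DEQUEUE_ZERO_COPY``
can be pictured with the following minimal C sketch. It is not the DPDK
implementation: ``struct guest_page`` and ``gpa_to_hpa()`` are hypothetical
names chosen for illustration, and the sketch assumes that a request must fit
entirely inside one segment to be translated.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical descriptor of one physically contiguous guest page segment. */
struct guest_page {
    uint64_t guest_phys_addr; /* segment start in guest physical space */
    uint64_t host_phys_addr;  /* same segment's start in host physical space */
    uint64_t size;            /* segment length in bytes */
};

/*
 * Translate a guest physical address to a host physical address by scanning
 * the segments one by one. Returns 0 when [gpa, gpa + len) does not lie
 * completely inside a single segment (assumption of this sketch).
 */
static uint64_t
gpa_to_hpa(const struct guest_page *pages, size_t nr_pages,
           uint64_t gpa, uint64_t len)
{
    assert(pages != NULL || nr_pages == 0);

    for (size_t i = 0; i < nr_pages; i++) {
        const struct guest_page *p = &pages[i];

        /* Written so that neither gpa + len nor start + size can overflow. */
        if (gpa >= p->guest_phys_addr &&
            len <= p->size &&
            gpa - p->guest_phys_addr <= p->size - len)
            return p->host_phys_addr + (gpa - p->guest_phys_addr);
    }
    return 0;
}
```

Because the loop visits every segment in turn, the lookup cost grows with the
number of segments, which is why fewer, larger pages make the mapping quicker
to find.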

Vhost-user Implementations
--------------------------

Vhost-user uses Unix domain sockets for passing messages. This means the DPDK
vhost-user implementation has two options:

* DPDK vhost-user acts as the server.

  DPDK will create a Unix domain socket server file and listen for
  connections from the frontend.

  Note, this is the default mode, and the only mode before DPDK v16.07.


* DPDK vhost-user acts as the client.

  Unlike the server mode, this mode doesn't create the socket file;
  it just tries to connect to the server (which is responsible for creating
  the file instead).

  When the DPDK vhost-user application restarts, DPDK vhost-user will try to
  connect to the server again. This is how the "reconnect" feature works.

  .. Note::
     * The "reconnect" feature requires **QEMU v2.7** (or above).

     * The vhost supported features must be exactly the same before and
       after the restart. For example, if TSO is disabled and then enabled,
       nothing will work and undefined issues might happen.

No matter which mode is used, once a connection is established, DPDK
vhost-user will start receiving and processing vhost messages from QEMU.
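
Because the messages travel over a Unix domain socket, a sender can attach
file descriptors to them as ancillary data. The following is a minimal,
self-contained C sketch of that general POSIX mechanism (``SCM_RIGHTS``). It
is not DPDK's code: ``send_fd()``, ``recv_fd()`` and ``fd_passing_demo()``
are hypothetical helpers, and a pipe stands in for the file that a vhost
backend would receive and map.

```c
#include <assert.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <sys/uio.h>
#include <unistd.h>

/* Control buffer sized and aligned for exactly one file descriptor. */
union fd_ctrl {
    char buf[CMSG_SPACE(sizeof(int))];
    struct cmsghdr align;
};

/* Send a one-byte message carrying `fd` as SCM_RIGHTS ancillary data. */
static int
send_fd(int sock, int fd)
{
    char payload = 'F';
    struct iovec iov = { .iov_base = &payload, .iov_len = 1 };
    union fd_ctrl ctrl;
    struct msghdr msg;
    struct cmsghdr *cmsg;

    assert(fd >= 0);
    memset(&ctrl, 0, sizeof(ctrl));
    memset(&msg, 0, sizeof(msg));
    msg.msg_iov = &iov;
    msg.msg_iovlen = 1;
    msg.msg_control = ctrl.buf;
    msg.msg_controllen = sizeof(ctrl.buf);

    cmsg = CMSG_FIRSTHDR(&msg);
    cmsg->cmsg_level = SOL_SOCKET;
    cmsg->cmsg_type = SCM_RIGHTS;
    cmsg->cmsg_len = CMSG_LEN(sizeof(int));
    memcpy(CMSG_DATA(cmsg), &fd, sizeof(int));

    return sendmsg(sock, &msg, 0) == 1 ? 0 : -1;
}

/* Receive one message; return the descriptor the kernel installed, or -1. */
static int
recv_fd(int sock)
{
    char payload;
    struct iovec iov = { .iov_base = &payload, .iov_len = 1 };
    union fd_ctrl ctrl;
    struct msghdr msg;
    struct cmsghdr *cmsg;
    int fd = -1;

    memset(&ctrl, 0, sizeof(ctrl));
    memset(&msg, 0, sizeof(msg));
    msg.msg_iov = &iov;
    msg.msg_iovlen = 1;
    msg.msg_control = ctrl.buf;
    msg.msg_controllen = sizeof(ctrl.buf);

    if (recvmsg(sock, &msg, 0) <= 0)
        return -1;

    for (cmsg = CMSG_FIRSTHDR(&msg); cmsg != NULL;
         cmsg = CMSG_NXTHDR(&msg, cmsg)) {
        if (cmsg->cmsg_level == SOL_SOCKET &&
            cmsg->cmsg_type == SCM_RIGHTS) {
            memcpy(&fd, CMSG_DATA(cmsg), sizeof(int));
            break;
        }
    }
    return fd;
}

/* Round trip: pass a pipe's read end over a socketpair, then read from it. */
static int
fd_passing_demo(void)
{
    int sv[2], p[2], got = -1, ok = 0;
    char buf[2] = { 0, 0 };

    if (socketpair(AF_UNIX, SOCK_STREAM, 0, sv) < 0)
        return -1;
    if (pipe(p) < 0) {
        close(sv[0]);
        close(sv[1]);
        return -1;
    }
    if (write(p[1], "ok", 2) == 2 && send_fd(sv[0], p[0]) == 0) {
        got = recv_fd(sv[1]);
        /* The received number is new, yet refers to the same open pipe. */
        ok = got >= 0 && got != p[0] &&
             read(got, buf, 2) == 2 && memcmp(buf, "ok", 2) == 0;
    }
    if (got >= 0)
        close(got);
    close(p[0]);
    close(p[1]);
    close(sv[0]);
    close(sv[1]);
    return ok ? 0 : -1;
}
```

Calling ``fd_passing_demo()`` returns 0 when the descriptor arrives and is
usable. The receiver never opens anything itself: the kernel installs the
descriptor during ``recvmsg()``, so it can be used right away.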

For messages with a file descriptor, the file descriptor can be used directly
in the vhost process as it is already installed by the Unix domain socket.

The supported vhost messages are:

* ``VHOST_SET_MEM_TABLE``
* ``VHOST_SET_VRING_KICK``
* ``VHOST_SET_VRING_CALL``
* ``VHOST_SET_LOG_FD``
* ``VHOST_SET_VRING_ERR``

For the ``VHOST_SET_MEM_TABLE`` message, QEMU will send information for each
memory region and its file descriptor in the ancillary data of the message.
The file descriptor is used to map that region.

``VHOST_SET_VRING_KICK`` is used as the signal to put the vhost device into
the data plane, and ``VHOST_GET_VRING_BASE`` is used as the signal to remove
the vhost device from the data plane.

When the socket connection is closed, vhost will destroy the device.

Vhost supported vSwitch reference
---------------------------------

For more vhost details and how to support vhost in vSwitch, please refer to
the vhost example in the DPDK Sample Applications Guide.