..  BSD LICENSE
    Copyright(c) 2010-2016 Intel Corporation. All rights reserved.
    All rights reserved.

    Redistribution and use in source and binary forms, with or without
    modification, are permitted provided that the following conditions
    are met:

    * Redistributions of source code must retain the above copyright
    notice, this list of conditions and the following disclaimer.
    * Redistributions in binary form must reproduce the above copyright
    notice, this list of conditions and the following disclaimer in
    the documentation and/or other materials provided with the
    distribution.
    * Neither the name of Intel Corporation nor the names of its
    contributors may be used to endorse or promote products derived
    from this software without specific prior written permission.

    THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS
    "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT
    LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR
    A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT
    OWNER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL,
    SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT
    LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE,
    DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY
    THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
    (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
    OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.

Vhost Library
=============

The vhost library implements a user space virtio net server allowing the user
to manipulate the virtio ring directly. In other words, it allows the user
to fetch/put packets from/to the VM virtio net device. To achieve this, a
vhost library should be able to:

* Access the guest memory:

  For QEMU, this is done by using the ``-object memory-backend-file,share=on,...``
  option, which means QEMU will create a file to serve as the guest RAM.
  The ``share=on`` option allows another process to map that file, which
  means it can access the guest RAM.

* Know all the necessary information about the vring:

  Information such as where the available ring is stored. Vhost defines some
  messages (passed through a Unix domain socket file) to tell the backend all
  the information it needs to know how to manipulate the vring.

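The effect of a shared, file-backed mapping can be illustrated with plain
POSIX calls. The following is a minimal sketch, not code taken from QEMU or
DPDK: the helper name ``map_shared_file`` and its ``create`` parameter are
hypothetical. One call stands in for the process that creates the backing file
and a second call stands in for the backend that maps the same file.

```c
#include <assert.h>
#include <fcntl.h>
#include <stddef.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

/* Hypothetical helper: map `len` bytes of the file at `path` as shared memory.
 * When `create` is non-zero the file is created and sized first (the role the
 * VM process plays); otherwise it is only opened (the role of the backend).
 * Returns the mapping, or NULL on failure. */
static void *map_shared_file(const char *path, size_t len, int create)
{
    assert(path != NULL && len > 0);

    int fd = open(path, create ? (O_RDWR | O_CREAT) : O_RDWR, 0600);
    if (fd < 0)
        return NULL;

    if (create && ftruncate(fd, (off_t)len) != 0) {
        close(fd);
        return NULL;
    }

    /* MAP_SHARED makes stores visible to every other mapping of the file. */
    void *addr = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    close(fd); /* the mapping stays valid after the descriptor is closed */

    return addr == MAP_FAILED ? NULL : addr;
}
```

A byte stored through the mapping returned by the first call can be read
through the mapping returned by the second call, which is the property the
backend relies on to reach the guest RAM.
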

Vhost API Overview
------------------

The following is an overview of the Vhost API functions:

* ``rte_vhost_driver_register(path, flags)``

  This function registers a vhost driver into the system. ``path`` specifies
  the Unix domain socket file path.

  Currently supported flags are:

  - ``RTE_VHOST_USER_CLIENT``

    DPDK vhost-user will act as the client when this flag is given. See below
    for an explanation.

  - ``RTE_VHOST_USER_NO_RECONNECT``

    When DPDK vhost-user acts as the client it will keep trying to reconnect
    to the server (QEMU) until it succeeds. This is useful in two cases:

    * When QEMU is not started yet.
    * When QEMU restarts (for example due to a guest OS reboot).

    This reconnect option is enabled by default. However, it can be turned off
    by setting this flag.

  - ``RTE_VHOST_USER_DEQUEUE_ZERO_COPY``

    Dequeue zero copy will be enabled when this flag is set. It is disabled by
    default.

    Some facts (including limitations) to be aware of when setting this flag:

    * Zero copy is not good for small packets (typically for packet sizes
      below 512 bytes).

    * Zero copy is really good for the VM2VM case. For iperf between two VMs,
      the boost could be above 70% (when TSO is enabled).

    * For the VM2NIC case, ``nb_tx_desc`` has to be small enough: <= 64 if the
      virtio indirect feature is not enabled and <= 128 if it is enabled.

      This is because when dequeue zero copy is enabled, the guest Tx used
      vring will be updated only when the corresponding mbuf is freed. Thus,
      ``nb_tx_desc`` has to be small enough so that the PMD will run out of
      available Tx descriptors and free mbufs in a timely manner. Otherwise,
      the guest Tx vring would be starved.

    * Guest memory should be backed with huge pages to achieve better
      performance. Using 1G page size is the best.

      When dequeue zero copy is enabled, the mapping between guest physical
      addresses and host physical addresses has to be established. Using
      non-huge pages means far more page segments. To keep it simple, DPDK
      vhost does a linear search of those segments, thus the fewer the
      segments, the quicker the mapping is found. NOTE: this may be sped up
      by using a tree search in the future.

* ``rte_vhost_driver_session_start()``

  This function starts the vhost session loop to handle vhost messages. It
  starts an infinite loop, therefore it should be called in a dedicated
  thread.

* ``rte_vhost_driver_callback_register(virtio_net_device_ops)``

  This function registers a set of callbacks, to let DPDK applications take
  the appropriate action when some events happen. The following events are
  currently supported:

  * ``new_device(int vid)``

    This callback is invoked when a virtio net device becomes ready. ``vid``
    is the virtio net device ID.

  * ``destroy_device(int vid)``

    This callback is invoked when a virtio net device shuts down (or when the
    vhost connection is broken).

  * ``vring_state_changed(int vid, uint16_t queue_id, int enable)``

    This callback is invoked when a specific queue's state is changed, for
    example to enabled or disabled.

* ``rte_vhost_enqueue_burst(vid, queue_id, pkts, count)``

  Transmits (enqueues) ``count`` packets from host to guest.

* ``rte_vhost_dequeue_burst(vid, queue_id, mbuf_pool, pkts, count)``

  Receives (dequeues) ``count`` packets from guest, and stores them at ``pkts``.

* ``rte_vhost_feature_disable/rte_vhost_feature_enable(feature_mask)``

  These functions disable/enable some features. For example, they can be used
  to disable mergeable buffers and TSO features, both of which are enabled by
  default.

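The linear search over memory segments described for
``RTE_VHOST_USER_DEQUEUE_ZERO_COPY`` can be sketched as follows. This is an
illustration only: ``struct mem_segment`` and ``gpa_to_host`` are hypothetical
names, not the data structures or functions used inside DPDK vhost. The sketch
assumes each segment is a contiguous guest physical range with a known host
address for its first byte.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical description of one contiguous guest memory segment. */
struct mem_segment {
    uint64_t guest_phys_addr; /* start of the segment in guest physical space */
    uint64_t host_addr;       /* host address corresponding to that start */
    uint64_t size;            /* length of the segment in bytes */
};

/* Linear search: return the host address for `gpa`, or 0 if no segment
 * covers it. The cost grows with the number of segments. */
static uint64_t gpa_to_host(const struct mem_segment *segs, size_t nr_segs,
                            uint64_t gpa)
{
    assert(segs != NULL || nr_segs == 0);

    for (size_t i = 0; i < nr_segs; i++) {
        const struct mem_segment *s = &segs[i];

        /* `gpa - start < size` cannot overflow once `gpa >= start` holds. */
        if (gpa >= s->guest_phys_addr && gpa - s->guest_phys_addr < s->size)
            return s->host_addr + (gpa - s->guest_phys_addr);
    }
    return 0;
}
```

Each lookup walks the array from the start, so a table with a handful of large
segments is searched much faster than one with many small segments.
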

Vhost-user Implementations
--------------------------

Vhost-user uses Unix domain sockets for passing messages. This means the DPDK
vhost-user implementation has two options:

* DPDK vhost-user acts as the server.

  DPDK will create a Unix domain socket server file and listen for
  connections from the frontend.

  Note, this is the default mode, and the only mode before DPDK v16.07.


* DPDK vhost-user acts as the client.

  Unlike the server mode, this mode doesn't create the socket file;
  it just tries to connect to the server (which is responsible for creating
  the file instead).

  When the DPDK vhost-user application restarts, DPDK vhost-user will try to
  connect to the server again. This is how the "reconnect" feature works.

  .. Note::
     * The "reconnect" feature requires **QEMU v2.7** (or above).

     * The vhost supported features must be exactly the same before and
       after the restart. For example, if TSO is disabled and then enabled,
       nothing will work and undefined issues might happen.

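The client-side behaviour of retrying a connection to a Unix domain socket can
be sketched with plain POSIX calls. This is not the DPDK implementation: the
helper ``connect_with_retry`` and its parameters are hypothetical, and the
sketch gives up after a bounded number of attempts, whereas the reconnect
feature described above keeps trying until it succeeds.

```c
#include <assert.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/un.h>
#include <time.h>
#include <unistd.h>

/* Hypothetical helper: connect to the Unix domain socket at `path`, retrying
 * up to `max_attempts` times with `delay_ms` milliseconds between attempts.
 * Returns a connected descriptor, or -1 if every attempt failed. */
static int connect_with_retry(const char *path, int max_attempts,
                              unsigned int delay_ms)
{
    struct sockaddr_un addr;

    assert(path != NULL && max_attempts > 0);
    if (strlen(path) >= sizeof(addr.sun_path))
        return -1;

    memset(&addr, 0, sizeof(addr));
    addr.sun_family = AF_UNIX;
    strcpy(addr.sun_path, path); /* length checked above */

    for (int attempt = 0; attempt < max_attempts; attempt++) {
        int fd = socket(AF_UNIX, SOCK_STREAM, 0);
        if (fd < 0)
            return -1;

        if (connect(fd, (struct sockaddr *)&addr, sizeof(addr)) == 0)
            return fd;

        /* Server not listening (yet): drop this socket and try again later. */
        close(fd);
        if (attempt + 1 < max_attempts) {
            struct timespec ts = {
                .tv_sec = delay_ms / 1000,
                .tv_nsec = (long)(delay_ms % 1000) * 1000000L,
            };
            nanosleep(&ts, NULL);
        }
    }
    return -1;
}
```

The client never creates the socket file itself. It only succeeds once some
server has bound and started listening on ``path``.
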
No matter which mode is used, once a connection is established, DPDK
vhost-user will start receiving and processing vhost messages from QEMU.

For messages with a file descriptor, the file descriptor can be used directly
in the vhost process as it is already installed by the Unix domain socket.

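Passing a file descriptor over a Unix domain socket is done with ancillary
data of type ``SCM_RIGHTS``. The sketch below shows the general POSIX
mechanism only. The helpers ``send_fd`` and ``recv_fd`` and the
``union fd_ctrl`` buffer are hypothetical and are not part of the vhost
library or of the vhost-user message format.

```c
#include <assert.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/uio.h>

/* Control buffer large enough for one descriptor, aligned for cmsghdr. */
union fd_ctrl {
    struct cmsghdr align;
    char buf[CMSG_SPACE(sizeof(int))];
};

/* Send `fd` over the connected Unix domain socket `sock` as ancillary data.
 * Returns 0 on success, -1 on failure. */
static int send_fd(int sock, int fd)
{
    char byte = 0; /* one byte of ordinary data carries the ancillary data */
    struct iovec iov = { .iov_base = &byte, .iov_len = 1 };
    union fd_ctrl ctrl;
    struct msghdr msg;

    assert(sock >= 0 && fd >= 0);
    memset(&ctrl, 0, sizeof(ctrl));
    memset(&msg, 0, sizeof(msg));
    msg.msg_iov = &iov;
    msg.msg_iovlen = 1;
    msg.msg_control = ctrl.buf;
    msg.msg_controllen = sizeof(ctrl.buf);

    struct cmsghdr *cmsg = CMSG_FIRSTHDR(&msg);
    cmsg->cmsg_level = SOL_SOCKET;
    cmsg->cmsg_type = SCM_RIGHTS;
    cmsg->cmsg_len = CMSG_LEN(sizeof(int));
    memcpy(CMSG_DATA(cmsg), &fd, sizeof(int));

    return sendmsg(sock, &msg, 0) == 1 ? 0 : -1;
}

/* Receive one descriptor from `sock`. The kernel installs it in the receiving
 * process, so the returned value is usable immediately. Returns -1 on failure. */
static int recv_fd(int sock)
{
    char byte;
    struct iovec iov = { .iov_base = &byte, .iov_len = 1 };
    union fd_ctrl ctrl;
    struct msghdr msg;
    int fd = -1;

    assert(sock >= 0);
    memset(&ctrl, 0, sizeof(ctrl));
    memset(&msg, 0, sizeof(msg));
    msg.msg_iov = &iov;
    msg.msg_iovlen = 1;
    msg.msg_control = ctrl.buf;
    msg.msg_controllen = sizeof(ctrl.buf);

    if (recvmsg(sock, &msg, 0) <= 0)
        return -1;

    struct cmsghdr *cmsg = CMSG_FIRSTHDR(&msg);
    if (cmsg == NULL || cmsg->cmsg_level != SOL_SOCKET ||
        cmsg->cmsg_type != SCM_RIGHTS ||
        cmsg->cmsg_len != CMSG_LEN(sizeof(int)))
        return -1;

    memcpy(&fd, CMSG_DATA(cmsg), sizeof(int));
    return fd;
}
```

The number returned by ``recv_fd`` generally differs from the one given to
``send_fd``, but both refer to the same open file, so the receiver needs no
further setup before using it.
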
The supported vhost messages are:

* ``VHOST_SET_MEM_TABLE``
* ``VHOST_SET_VRING_KICK``
* ``VHOST_SET_VRING_CALL``
* ``VHOST_SET_LOG_FD``
* ``VHOST_SET_VRING_ERR``

For the ``VHOST_SET_MEM_TABLE`` message, QEMU will send information for each
memory region and its file descriptor in the ancillary data of the message.
The file descriptor is used to map that region.

``VHOST_SET_VRING_KICK`` is used as the signal to put the vhost device into
the data plane, and ``VHOST_GET_VRING_BASE`` is used as the signal to remove
the vhost device from the data plane.

When the socket connection is closed, vhost will destroy the device.

Vhost supported vSwitch reference
---------------------------------

For more vhost details and how to support vhost in vSwitch, please refer to
the vhost example in the DPDK Sample Applications Guide.