..  SPDX-License-Identifier: BSD-3-Clause
    Copyright(c) 2018 Intel Corporation.

NTB Rawdev Driver
=================

The ``ntb`` rawdev driver provides a non-transparent bridge between two
separate hosts so that they can communicate with each other. Many
use cases can benefit from this, such as fault tolerance and visual
acceleration.

This PMD allows two hosts to handshake for device start and stop, to allocate
memory for the peer to access, and to read/write memory allocated by the peer.
The PMD also allows the use of doorbell registers to notify the peer and
scratchpad registers to share information.

BIOS setting on Intel Xeon
--------------------------

The Intel Non-transparent Bridge needs special BIOS settings. The reference for
Skylake is https://www.intel.com/content/dam/support/us/en/documents/server-products/Intel_Xeon_Processor_Scalable_Family_BIOS_User_Guide.pdf

- Set the needed PCIe port as NTB to NTB mode on both hosts.
- Enable NTB bars and set bar size of bar 23 and bar 45 as 12-29 (4K-512M)
  on both hosts (for Ice Lake, bar size can be set as 12-51, namely 4K-128PB).
  Note that bar size on both hosts should be the same.
- Disable split bars for both hosts.
- Set crosslink control override as DSD/USP on one host, USD/DSP on
  the other host.
- Disable PCIe PLL SSC (Spread Spectrum Clocking) for both hosts. This
  is a hardware requirement.


Device Setup
------------

The Intel NTB devices need to be bound to a DPDK-supported kernel driver
before use, such as igb_uio or vfio. The ``dpdk-devbind.py`` script can be
used to show device status and to bind devices to a suitable kernel driver.
They will appear under the category of "Misc (rawdev) devices".

Prerequisites
-------------

The NTB PMD needs the kernel PCI driver to support write combining (WC) to get
better performance. The difference is more than a factor of 10.
There are two ways to enable WC:

- Insert igb_uio with the ``wc_activate=1`` flag if using the igb_uio driver.

.. code-block:: console

   insmod igb_uio.ko wc_activate=1

- Enable WC for the NTB device's Bar 2 and Bar 4 (mapped memory) manually.
  The reference is https://www.kernel.org/doc/html/latest/x86/mtrr.html
  Get the bar base addresses using ``lspci -vvv -s ae:00.0 | grep Region``.

.. code-block:: console

   # lspci -vvv -s ae:00.0 | grep Region
   Region 0: Memory at 39bfe0000000 (64-bit, prefetchable) [size=64K]
   Region 2: Memory at 39bfa0000000 (64-bit, prefetchable) [size=512M]
   Region 4: Memory at 39bfc0000000 (64-bit, prefetchable) [size=512M]

Use the following commands to enable WC.

.. code-block:: console

   echo "base=0x39bfa0000000 size=0x20000000 type=write-combining" >> /proc/mtrr
   echo "base=0x39bfc0000000 size=0x20000000 type=write-combining" >> /proc/mtrr

And the results:

.. code-block:: console

   # cat /proc/mtrr
   reg00: base=0x000000000 (    0MB), size= 2048MB, count=1: write-back
   reg01: base=0x07f000000 ( 2032MB), size=   16MB, count=1: uncachable
   reg02: base=0x39bfa0000000 (60553728MB), size=  512MB, count=1: write-combining
   reg03: base=0x39bfc0000000 (60554240MB), size=  512MB, count=1: write-combining

To disable WC for these regions, use the following commands.

.. code-block:: console

   echo "disable=2" >> /proc/mtrr
   echo "disable=3" >> /proc/mtrr

Ring Layout
-----------

Since reads and writes to the remote system's memory go through the PCI bus,
a remote read is much more expensive than a remote write. Thus, enqueue and
dequeue operations based on the NTB ring should avoid remote reads.
The ring layout for NTB is as follows:

- Ring Format::

   desc_ring:

      0               16                                              64
      +---------------------------------------------------------------+
      |                        buffer address                         |
      +---------------+-----------------------------------------------+
      | buffer length |                      resv                     |
      +---------------+-----------------------------------------------+

   used_ring:

      0               16              32
      +---------------+---------------+
      | packet length |     flags     |
      +---------------+---------------+

- Ring Layout::

      +------------------------+   +------------------------+
      | used_ring              |   | desc_ring              |
      | +---+                  |   | +---+                  |
      | |   |                  |   | |   |                  |
      | +---+      +--------+  |   | +---+                  |
      | |   | ---> | buffer | <+---+-|   |                  |
      | +---+      +--------+  |   | +---+                  |
      | |   |                  |   | |   |                  |
      | +---+                  |   | +---+                  |
      |  ...                   |   |  ...                   |
      |                        |   |                        |
      |            +---------+ |   |            +---------+ |
      |            | tx_tail | |   |            | rx_tail | |
      | System A   +---------+ |   | System B   +---------+ |
      +------------------------+   +------------------------+
                    <---------traffic---------

- Enqueue and Dequeue

  Based on this ring layout, enqueue reads rx_tail to find out how many
  buffers are free, then writes used_ring and tx_tail to tell the peer which
  buffers are filled with data.
  Dequeue reads tx_tail to find out how many packets have arrived, then
  writes desc_ring and rx_tail to tell the peer about the newly allocated
  buffers.
  In this way, only remote writes happen and remote reads are avoided,
  which gives better performance.

Limitation
----------

- This PMD only supports Intel Skylake and Ice Lake platforms.
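
Ring Model Sketch
-----------------

The enqueue and dequeue flow described under Ring Layout can be illustrated
with a small in-process model. This is only a sketch written for this guide,
not the driver's implementation: every name in it (``Region``,
``RemoteWriter``, ``Receiver``, ``Transmitter``, ``RING_SIZE``, ``BUF_LEN``)
is hypothetical, and plain Python objects stand in for PCI-mapped memory.
It assumes System A is the receiver and System B is the transmitter, as in
the diagram. The peer's memory is reachable only through a write-only
wrapper, so any accidental remote read would fail.

.. code-block:: python

   RING_SIZE = 4   # hypothetical number of ring entries
   BUF_LEN = 64    # hypothetical receive buffer size in bytes


   class Region:
       """Ring entries plus a tail counter, resident in one host's memory."""

       def __init__(self):
           self.ring = [None] * RING_SIZE
           self.tail = 0


   class RemoteWriter:
       """Write-only window onto the peer's Region (it has no read methods)."""

       def __init__(self, region):
           self._region = region

       def write_entry(self, index, value):
           self._region.ring[index % RING_SIZE] = value

       def write_tail(self, value):
           self._region.tail = value


   class Receiver:
       """System A: owns used_ring, tx_tail and the packet buffers."""

       def __init__(self, local, peer):
           self.local, self.peer = local, peer
           self.bufs = [None] * RING_SIZE  # local shadow of posted buffers
           self.posted = 0                 # local copy of rx_tail
           self.consumed = 0

       def post_buffers(self, count):
           for _ in range(count):
               buf = bytearray(BUF_LEN)
               self.bufs[self.posted % RING_SIZE] = buf
               # The bytearray reference stands in for the buffer address.
               self.peer.write_entry(self.posted, buf)  # remote write: desc_ring
               self.posted += 1
           self.peer.write_tail(self.posted)            # remote write: rx_tail

       def dequeue(self):
           packets = []
           while self.consumed < self.local.tail:       # local read: tx_tail
               slot = self.consumed % RING_SIZE
               length, _flags = self.local.ring[slot]   # local read: used_ring
               packets.append(bytes(self.bufs[slot][:length]))
               self.consumed += 1
           self.post_buffers(len(packets))              # refill what was consumed
           return packets


   class Transmitter:
       """System B: owns desc_ring and rx_tail."""

       def __init__(self, local, peer):
           self.local, self.peer = local, peer
           self.sent = 0                                # local copy of tx_tail

       def enqueue(self, payloads):
           if any(len(data) > BUF_LEN for data in payloads):
               raise ValueError("payload larger than buffer")
           free = self.local.tail - self.sent           # local read: rx_tail
           accepted = 0
           for data in payloads[:free]:
               buf = self.local.ring[self.sent % RING_SIZE]  # local read: desc_ring
               buf[:len(data)] = data                   # remote write: buffer
               self.peer.write_entry(self.sent, (len(data), 0))  # used_ring
               self.sent += 1
               accepted += 1
           self.peer.write_tail(self.sent)              # remote write: tx_tail
           return accepted


   a_mem, b_mem = Region(), Region()
   receiver = Receiver(local=a_mem, peer=RemoteWriter(b_mem))
   transmitter = Transmitter(local=b_mem, peer=RemoteWriter(a_mem))

   receiver.post_buffers(RING_SIZE)
   sent = transmitter.enqueue([b"ping", b"pong"])
   received = receiver.dequeue()
   print(sent, received)

Each side keeps its own local copy of the tail it publishes (``sent`` and
``posted``), so neither side ever has to read a counter back from the peer.
In the model, the transmitter accepts at most as many packets as there are
posted buffers, and the receiver reposts one buffer for every packet it
dequeues.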