..  SPDX-License-Identifier: BSD-3-Clause
    Copyright(c) 2018 Intel Corporation.

.. include:: <isonum.txt>

NTB Rawdev Driver
=================

The ``ntb`` rawdev driver provides a non-transparent bridge between two
separate hosts so that they can communicate with each other. Many use
cases can benefit from this, such as fault tolerance and visual
acceleration.

This PMD allows two hosts to handshake for device start and stop, to allocate
memory for the peer to access, and to read and write memory allocated by the peer.
The PMD also allows using doorbell registers to notify the peer and
scratchpad registers to share some information.

BIOS setting on Intel Xeon
--------------------------

Intel Non-transparent Bridge (NTB) needs special BIOS settings on both systems.
Note that for 4th Generation Intel\ |reg| Xeon\ |reg| Scalable Processors,
option ``Port Subsystem Mode`` should be changed from ``Gen5`` to ``Gen4 Only``,
then reboot.

- Set ``Non-Transparent Bridge PCIe Port Definition`` for needed PCIe ports
  as ``NTB to NTB`` mode, on both hosts.
- Set ``Enable NTB BARs`` as ``Enabled``, on both hosts.
- Set ``Enable SPLIT BARs`` as ``Disabled``, on both hosts.
- Set ``Imbar1 Size``, ``Imbar2 Size``, ``Embar1 Size`` and ``Embar2 Size``,
  as 12-29 (i.e., 4K-512M) for 2nd Generation Intel\ |reg| Xeon\ |reg| Scalable Processors;
  as 12-51 (i.e., 4K-128PB) for 3rd and 4th Generation Intel\ |reg| Xeon\ |reg| Scalable Processors.
  Note that those BAR sizes should be the same on both hosts.
- Set ``Crosslink Control override`` as ``DSD/USP`` on one host,
  ``USD/DSP`` on another host.
- Set ``PCIe PLL SSC (Spread Spectrum Clocking)`` as ``Disabled``, on both hosts.
  This is a hardware requirement when using Re-timer Cards.

Device Setup
------------

The Intel NTB devices must be bound to a DPDK-supported kernel driver,
such as igb_uio or vfio, before they can be used.
The ``dpdk-devbind.py`` script can be used to
show device status and to bind devices to a suitable kernel driver. They will
appear under the category of "Misc (rawdev) devices".

Prerequisites
-------------

The NTB PMD needs the kernel PCI driver to support write combining (WC) to get
better performance. The difference is more than 10 times.
There are two ways to enable WC.

- Insert igb_uio with the ``wc_activate=1`` flag if the igb_uio driver is used.

.. code-block:: console

   insmod igb_uio.ko wc_activate=1

- Enable WC for the NTB device's BAR 2 and BAR 4 (mapped memory) manually.
  The reference is https://www.kernel.org/doc/html/latest/x86/mtrr.html
  Get the BAR base addresses using ``lspci -vvv -s ae:00.0 | grep Region``.

.. code-block:: console

   # lspci -vvv -s ae:00.0 | grep Region
   Region 0: Memory at 39bfe0000000 (64-bit, prefetchable) [size=64K]
   Region 2: Memory at 39bfa0000000 (64-bit, prefetchable) [size=512M]
   Region 4: Memory at 39bfc0000000 (64-bit, prefetchable) [size=512M]

Use the following commands to enable WC.

.. code-block:: console

   echo "base=0x39bfa0000000 size=0x20000000 type=write-combining" >> /proc/mtrr
   echo "base=0x39bfc0000000 size=0x20000000 type=write-combining" >> /proc/mtrr

The result is:

.. code-block:: console

   # cat /proc/mtrr
   reg00: base=0x000000000 (    0MB), size= 2048MB, count=1: write-back
   reg01: base=0x07f000000 ( 2032MB), size=   16MB, count=1: uncachable
   reg02: base=0x39bfa0000000 (60553728MB), size=  512MB, count=1: write-combining
   reg03: base=0x39bfc0000000 (60554240MB), size=  512MB, count=1: write-combining

To disable WC for these regions, use the following commands.

.. code-block:: console

   echo "disable=2" >> /proc/mtrr
   echo "disable=3" >> /proc/mtrr

Ring Layout
-----------

Since reads and writes to the remote system's memory go through the PCI bus,
a remote read is much more expensive than a remote write.
Thus, enqueue and dequeue
operations on the ntb ring should avoid remote reads. The ring layout for ntb
is as follows:

- Ring Format::

   desc_ring:

      0               16                                             64
      +---------------------------------------------------------------+
      |                        buffer address                         |
      +---------------+-----------------------------------------------+
      | buffer length |                     resv                      |
      +---------------+-----------------------------------------------+

   used_ring:

      0               16              32
      +---------------+---------------+
      | packet length |     flags     |
      +---------------+---------------+

- Ring Layout::

      +------------------------+   +------------------------+
      | used_ring              |   | desc_ring              |
      | +---+                  |   | +---+                  |
      | |   |                  |   | |   |                  |
      | +---+      +--------+  |   | +---+                  |
      | |   | ---> | buffer | <+---+-|   |                  |
      | +---+      +--------+  |   | +---+                  |
      | |   |                  |   | |   |                  |
      | +---+                  |   | +---+                  |
      |  ...                   |   |  ...                   |
      |                        |   |                        |
      |            +---------+ |   |            +---------+ |
      |            | tx_tail | |   |            | rx_tail | |
      | System A   +---------+ |   | System B   +---------+ |
      +------------------------+   +------------------------+
                    <---------traffic---------

- Enqueue and Dequeue
  Based on this ring layout, enqueue reads rx_tail to get the number of free
  buffers and writes used_ring and tx_tail to tell the peer which buffers
  are filled with data.
  Dequeue reads tx_tail to get the number of packets that have arrived, and
  writes desc_ring and rx_tail to tell the peer about the newly allocated
  buffers.
  In this way, only remote writes happen and remote reads are avoided,
  which gives better performance.

Limitation
----------

This PMD is only supported on Intel Xeon Platforms:

- 4th Generation Intel® Xeon® Scalable Processors.
- 3rd Generation Intel® Xeon® Scalable Processors.
- 2nd Generation Intel® Xeon® Scalable Processors.
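
Ring Accounting Sketch
----------------------

The enqueue and dequeue rules described under Ring Layout can be illustrated
with a small single-process model. This is only a sketch under stated
assumptions, not the driver's code: ``RING_SIZE``, the struct names and the
functions ``link_peers``, ``receiver_refill``, ``sender_enqueue`` and
``receiver_dequeue`` are hypothetical, and a plain pointer stands in for the
memory that a real setup would reach through the mapped BARs. The point it
shows is that every read targets local memory and every cross-host access is
a write.

.. code-block:: c

   #include <assert.h>
   #include <stdint.h>
   #include <string.h>

   #define RING_SIZE 8u                 /* hypothetical; must be a power of two */
   #define RING_MASK (RING_SIZE - 1u)

   struct desc { uint64_t addr; uint16_t len; };   /* desc_ring entry */
   struct used { uint16_t len; uint16_t flags; };  /* used_ring entry */

   /* Resides on the receiving host; written remotely by the sender. */
   struct rx_side_mem { struct used used_ring[RING_SIZE]; uint16_t tx_tail; };
   /* Resides on the sending host; written remotely by the receiver. */
   struct tx_side_mem { struct desc desc_ring[RING_SIZE]; uint16_t rx_tail; };

   struct sender {
       struct tx_side_mem local;   /* read locally */
       struct rx_side_mem *peer;   /* write only */
       uint16_t tx_tail;           /* private copy of the published tail */
   };

   struct receiver {
       struct rx_side_mem local;   /* read locally */
       struct tx_side_mem *peer;   /* write only */
       uint16_t rx_tail;           /* next slot to post a buffer into */
       uint16_t rx_head;           /* next slot to consume a packet from */
   };

   /* Zero both sides and point each one at the other's resident memory. */
   static void link_peers(struct sender *s, struct receiver *r)
   {
       assert((RING_SIZE & RING_MASK) == 0);
       memset(s, 0, sizeof(*s));
       memset(r, 0, sizeof(*r));
       s->peer = &r->local;
       r->peer = &s->local;
   }

   /* Receiver posts up to n empty buffers: writes desc_ring, then rx_tail. */
   static unsigned receiver_refill(struct receiver *r, unsigned n)
   {
       unsigned posted = 0;

       /* One slot stays empty so a full ring differs from an empty one. */
       while (posted < n && ((r->rx_tail + 1u) & RING_MASK) != r->rx_head) {
           /* Placeholder address and length, not a real DMA mapping. */
           r->peer->desc_ring[r->rx_tail].addr =
               (uint64_t)0x1000 * (r->rx_tail + 1u);
           r->peer->desc_ring[r->rx_tail].len = 2048;
           r->rx_tail = (uint16_t)((r->rx_tail + 1u) & RING_MASK);
           posted++;
       }
       r->peer->rx_tail = r->rx_tail;              /* remote write */
       return posted;
   }

   /* Sender enqueues one packet; returns 0, or -1 if no buffer fits. */
   static int sender_enqueue(struct sender *s, uint16_t len)
   {
       /* Local read of rx_tail gives the number of free buffers. */
       uint16_t free_bufs =
           (uint16_t)((s->local.rx_tail - s->tx_tail) & RING_MASK);

       if (free_bufs == 0 || len > s->local.desc_ring[s->tx_tail].len)
           return -1;
       /* A real sender would copy the payload to desc_ring[].addr here. */
       s->peer->used_ring[s->tx_tail].len = len;   /* remote write */
       s->peer->used_ring[s->tx_tail].flags = 0;   /* remote write */
       s->tx_tail = (uint16_t)((s->tx_tail + 1u) & RING_MASK);
       s->peer->tx_tail = s->tx_tail;              /* remote write */
       return 0;
   }

   /* Receiver dequeues one packet; returns its length, or -1 if none. */
   static int receiver_dequeue(struct receiver *r)
   {
       /* Local read of tx_tail gives the number of arrived packets. */
       uint16_t arrived =
           (uint16_t)((r->local.tx_tail - r->rx_head) & RING_MASK);
       int len;

       if (arrived == 0)
           return -1;
       len = r->local.used_ring[r->rx_head].len;
       r->rx_head = (uint16_t)((r->rx_head + 1u) & RING_MASK);
       (void)receiver_refill(r, 1);                /* replace the used buffer */
       return len;
   }

In this model the sender cannot enqueue until the receiver has called
``receiver_refill()`` at least once, because ``rx_tail`` and ``tx_tail`` start
out equal. Keeping one slot empty is a choice made for this sketch to keep the
index arithmetic simple.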