..  SPDX-License-Identifier: BSD-3-Clause
    Copyright(c) 2018 Intel Corporation.

.. include:: <isonum.txt>

NTB Rawdev Driver
=================

The ``ntb`` rawdev driver provides a non-transparent bridge between two
separate hosts so that they can communicate with each other. Many use
cases can benefit from this, such as fault tolerance and visual
acceleration.

This PMD allows two hosts to handshake for device start and stop, to
allocate memory for the peer to access, and to read and write the memory
allocated by the peer. The PMD also allows using doorbell registers to
notify the peer and scratchpad registers to share some information.

BIOS setting on Intel Xeon
--------------------------

Intel Non-transparent Bridge (NTB) needs special BIOS settings on both systems.
Note that for 4th Generation Intel\ |reg| Xeon\ |reg| Scalable Processors,
the ``Port Subsystem Mode`` option should be changed from ``Gen5`` to ``Gen4 Only``,
followed by a reboot.

- Set ``Non-Transparent Bridge PCIe Port Definition`` for the needed PCIe ports
  to ``NTB to NTB`` mode, on both hosts.
- Set ``Enable NTB BARs`` to ``Enabled``, on both hosts.
- Set ``Enable SPLIT BARs`` to ``Disabled``, on both hosts.
- Set ``Imbar1 Size``, ``Imbar2 Size``, ``Embar1 Size`` and ``Embar2 Size``
  to 12-29 (i.e., 4K-512M) for 2nd Generation Intel\ |reg| Xeon\ |reg| Scalable Processors,
  or to 12-51 (i.e., 4K-128PB) for 3rd and 4th Generation Intel\ |reg| Xeon\ |reg| Scalable Processors.
  The BAR sizes must be the same on both hosts.
- Set ``Crosslink Control override`` to ``DSD/USP`` on one host and
  ``USD/DSP`` on the other host.
- Set ``PCIe PLL SSC (Spread Spectrum Clocking)`` to ``Disabled``, on both hosts.
  This is a hardware requirement when using Re-timer Cards.

Device Setup
------------

The Intel NTB devices need to be bound to a DPDK-supported kernel driver,
such as ``igb_uio`` or ``vfio-pci``, before use. The ``dpdk-devbind.py``
script can be used to show device status and to bind devices to a suitable
kernel driver. They appear under the category "Misc (rawdev) devices".

Prerequisites
-------------

The NTB PMD needs the kernel PCI driver to support write combining (WC) to
get better performance. The difference can be more than 10 times.
There are two ways to enable WC.

- Insert ``igb_uio`` with the ``wc_activate=1`` flag if the ``igb_uio``
  driver is used.

.. code-block:: console

   insmod igb_uio.ko wc_activate=1

- Enable WC manually for BAR 2 and BAR 4 (the mapped memory) of the NTB device.
  The reference is https://www.kernel.org/doc/html/latest/x86/mtrr.html
  Get the BAR base addresses using ``lspci -vvv -s ae:00.0 | grep Region``.

.. code-block:: console

   # lspci -vvv -s ae:00.0 | grep Region
   Region 0: Memory at 39bfe0000000 (64-bit, prefetchable) [size=64K]
   Region 2: Memory at 39bfa0000000 (64-bit, prefetchable) [size=512M]
   Region 4: Memory at 39bfc0000000 (64-bit, prefetchable) [size=512M]

Use the following commands to enable WC.

.. code-block:: console

   echo "base=0x39bfa0000000 size=0x20000000 type=write-combining" >> /proc/mtrr
   echo "base=0x39bfc0000000 size=0x20000000 type=write-combining" >> /proc/mtrr

The result looks like this:

.. code-block:: console

   # cat /proc/mtrr
   reg00: base=0x000000000 (    0MB), size= 2048MB, count=1: write-back
   reg01: base=0x07f000000 ( 2032MB), size=   16MB, count=1: uncachable
   reg02: base=0x39bfa0000000 (60553728MB), size=  512MB, count=1: write-combining
   reg03: base=0x39bfc0000000 (60554240MB), size=  512MB, count=1: write-combining

To disable WC for these regions, use the following commands.

.. code-block:: console

   echo "disable=2" >> /proc/mtrr
   echo "disable=3" >> /proc/mtrr

Ring Layout
-----------

Since reads and writes to the remote system's memory go through the PCI bus,
a remote read is much more expensive than a remote write. Thus, enqueue and
dequeue on the ntb ring should avoid remote reads. The ring layout for ntb is
as follows:

- Ring Format::

   desc_ring:

      0               16                                              64
      +---------------------------------------------------------------+
      |                        buffer address                         |
      +---------------+-----------------------------------------------+
      | buffer length |                     resv                      |
      +---------------+-----------------------------------------------+

   used_ring:

      0               16              32
      +---------------+---------------+
      | packet length |     flags     |
      +---------------+---------------+

- Ring Layout::

   +------------------------+   +------------------------+
   | used_ring              |   | desc_ring              |
   | +---+                  |   | +---+                  |
   | |   |                  |   | |   |                  |
   | +---+      +--------+  |   | +---+                  |
   | |   | ---> | buffer | <+---+-|   |                  |
   | +---+      +--------+  |   | +---+                  |
   | |   |                  |   | |   |                  |
   | +---+                  |   | +---+                  |
   |  ...                   |   |  ...                   |
   |                        |   |                        |
   |            +---------+ |   |            +---------+ |
   |            | tx_tail | |   |            | rx_tail | |
   | System A   +---------+ |   | System B   +---------+ |
   +------------------------+   +------------------------+
                    <---------traffic---------

- Enqueue and Dequeue

  Based on this ring layout, enqueue reads ``rx_tail`` to learn how many
  buffers are free, and writes ``used_ring`` and ``tx_tail`` to tell the
  peer which buffers are filled with data.
  Dequeue reads ``tx_tail`` to learn how many packets have arrived, and
  writes ``desc_ring`` and ``rx_tail`` to tell the peer about the newly
  allocated buffers.
  In this way, only remote writes happen and remote reads are avoided,
  which gives better performance.

Limitation
----------

This PMD is only supported on Intel Xeon Platforms:

- 4th Generation Intel® Xeon® Scalable Processors.
- 3rd Generation Intel® Xeon® Scalable Processors.
- 2nd Generation Intel® Xeon® Scalable Processors.
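
Example: Generating the WC Commands
-----------------------------------

The manual WC setup described under Prerequisites requires converting the
BAR size reported by ``lspci`` (for example ``512M``) into the hexadecimal
byte count expected by ``/proc/mtrr`` (``0x20000000``). The following is a
minimal sketch of that conversion, not part of DPDK. The function names
``size_to_hex``, ``mtrr_cmd`` and ``mtrr_cmds_from_lspci`` are hypothetical,
and the sketch assumes that sizes carry a ``K``, ``M`` or ``G`` suffix and
that BAR 2 and BAR 4 are the mapped-memory BARs. It only prints the
commands and never writes to ``/proc/mtrr`` itself.

.. code-block:: shell

   # Hypothetical helpers (not part of DPDK): derive /proc/mtrr commands
   # from the "Region" lines printed by lspci.

   # Convert a size such as 512M into a hexadecimal byte count.
   size_to_hex() {
       num=${1%[KMG]}        # numeric part, e.g. 512
       unit=${1#"$num"}      # suffix, e.g. M (empty if none)
       case "$unit" in
           K) shift_bits=10 ;;
           M) shift_bits=20 ;;
           G) shift_bits=30 ;;
           *) shift_bits=0 ;;
       esac
       printf '0x%x\n' $(( $num << $shift_bits ))
   }

   # Print (do not execute) the command enabling WC for one BAR.
   # $1: base address in hex without 0x prefix, $2: size such as 512M
   mtrr_cmd() {
       printf 'echo "base=0x%s size=%s type=write-combining" >> /proc/mtrr\n' \
           "$1" "$(size_to_hex "$2")"
   }

   # Read lspci "Region" lines on stdin; assumes BAR 2 and BAR 4 are
   # the mapped-memory BARs, as in the Prerequisites section.
   mtrr_cmds_from_lspci() {
       sed -n 's/.*Region [24]: Memory at \([0-9a-f]*\) .*\[size=\([0-9]*[KMG]\)].*/\1 \2/p' |
       while read -r base size; do
           mtrr_cmd "$base" "$size"
       done
   }

Piping the output of ``lspci -vvv -s ae:00.0`` into ``mtrr_cmds_from_lspci``
would print one ``echo`` command per matching region. The printed commands
can be reviewed and then run as root.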