..  SPDX-License-Identifier: BSD-3-Clause
    Copyright(c) 2018 Intel Corporation.

.. include:: <isonum.txt>

NTB Rawdev Driver
=================

The ``ntb`` rawdev driver provides a non-transparent bridge between two
separate hosts so that they can communicate with each other. Many
use cases can benefit from this, such as fault tolerance and visual
acceleration.

This PMD allows two hosts to handshake for device start and stop, to
allocate memory for the peer to access, and to read and write memory
allocated by the peer. The PMD also allows using doorbell registers to
notify the peer and scratchpad registers to share information.

BIOS setting on Intel Xeon
--------------------------

Intel Non-transparent Bridge (NTB) needs special BIOS settings on both systems.
Note that for 4th Generation Intel\ |reg| Xeon\ |reg| Scalable Processors,
the ``Port Subsystem Mode`` option should be changed from ``Gen5`` to ``Gen4 Only``,
followed by a reboot.

- Set ``Non-Transparent Bridge PCIe Port Definition`` for the needed PCIe ports
  to ``NTB to NTB`` mode, on both hosts.
- Set ``Enable NTB BARs`` to ``Enabled``, on both hosts.
- Set ``Enable SPLIT BARs`` to ``Disabled``, on both hosts.
- Set ``Imbar1 Size``, ``Imbar2 Size``, ``Embar1 Size`` and ``Embar2 Size``
  to 12-29 (i.e., 4K-512M) for 2nd Generation Intel\ |reg| Xeon\ |reg| Scalable Processors,
  or to 12-51 (i.e., 4K-128PB) for 3rd and 4th Generation Intel\ |reg| Xeon\ |reg| Scalable Processors.
  Note that these BAR sizes must be the same on both hosts.
- Set ``Crosslink Control override`` to ``DSD/USP`` on one host and
  ``USD/DSP`` on the other host.
- Set ``PCIe PLL SSC (Spread Spectrum Clocking)`` to ``Disabled``, on both hosts.
  This is a hardware requirement when using Re-timer Cards.

Device Setup
------------

The Intel NTB devices need to be bound to a DPDK-supported kernel driver
before use, such as igb_uio or vfio. The ``dpdk-devbind.py`` script can be used to
show device status and to bind devices to a suitable kernel driver. They will
appear under the category of "Misc (rawdev) devices".

Prerequisites
-------------

The NTB PMD needs the kernel PCI driver to support write combining (WC) to get
better performance. With WC, performance improves by more than 10 times.
There are two ways to enable WC.

- Insert igb_uio with the ``wc_activate=1`` flag if the igb_uio driver is used.

.. code-block:: console

  insmod igb_uio.ko wc_activate=1

- Enable WC for the NTB device's BAR 2 and BAR 4 (mapped memory) manually.
  The reference is https://www.kernel.org/doc/html/latest/x86/mtrr.html
  Get the BAR base addresses using ``lspci -vvv -s ae:00.0 | grep Region``.

.. code-block:: console

  # lspci -vvv -s ae:00.0 | grep Region
  Region 0: Memory at 39bfe0000000 (64-bit, prefetchable) [size=64K]
  Region 2: Memory at 39bfa0000000 (64-bit, prefetchable) [size=512M]
  Region 4: Memory at 39bfc0000000 (64-bit, prefetchable) [size=512M]

Use the following commands to enable WC.

.. code-block:: console

  echo "base=0x39bfa0000000 size=0x20000000 type=write-combining" >> /proc/mtrr
  echo "base=0x39bfc0000000 size=0x20000000 type=write-combining" >> /proc/mtrr

And the results:

.. code-block:: console

  # cat /proc/mtrr
  reg00: base=0x000000000 (    0MB), size= 2048MB, count=1: write-back
  reg01: base=0x07f000000 ( 2032MB), size=   16MB, count=1: uncachable
  reg02: base=0x39bfa0000000 (60553728MB), size=  512MB, count=1: write-combining
  reg03: base=0x39bfc0000000 (60554240MB), size=  512MB, count=1: write-combining

To disable WC for these regions, use the following commands.

.. code-block:: console

     echo "disable=2" >> /proc/mtrr
     echo "disable=3" >> /proc/mtrr

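The ``base`` and ``size`` values used when enabling WC manually can be derived
from the ``lspci`` output instead of being copied by hand. The following Python
sketch shows one possible way to do this. The function name ``mtrr_entries``
and the regular expression are illustrative assumptions, not part of DPDK, and
the sketch only handles the ``K``, ``M`` and ``G`` size suffixes.

.. code-block:: python

  import re

  # Matches lines such as:
  #   Region 2: Memory at 39bfa0000000 (64-bit, prefetchable) [size=512M]
  REGION_RE = re.compile(
      r"Region (\d+): Memory at ([0-9a-f]+) .*\[size=(\d+)([KMG])\]")
  UNITS = {"K": 1 << 10, "M": 1 << 20, "G": 1 << 30}


  def mtrr_entries(lspci_output, bars=(2, 4)):
      """Return one /proc/mtrr write-combining line per selected BAR."""
      entries = []
      for line in lspci_output.splitlines():
          match = REGION_RE.search(line)
          if match and int(match.group(1)) in bars:
              base = int(match.group(2), 16)
              size = int(match.group(3)) * UNITS[match.group(4)]
              entries.append(
                  f"base={base:#x} size={size:#x} type=write-combining")
      return entries


  # Sample input taken from the lspci output shown in this section.
  sample = "\n".join([
      "Region 0: Memory at 39bfe0000000 (64-bit, prefetchable) [size=64K]",
      "Region 2: Memory at 39bfa0000000 (64-bit, prefetchable) [size=512M]",
      "Region 4: Memory at 39bfc0000000 (64-bit, prefetchable) [size=512M]",
  ])
  for entry in mtrr_entries(sample):
      print(entry)

For the sample input, the two printed lines are the same strings that are
written to ``/proc/mtrr`` by the ``echo`` commands in this section.
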
Ring Layout
-----------

Since reading and writing the remote system's memory goes through the PCI bus,
a remote read is much more expensive than a remote write. Thus, enqueue and
dequeue operations on the ntb ring should avoid remote reads. The ring layout
for ntb is as follows:

- Ring Format::

   desc_ring:

      0               16                                              64
      +---------------------------------------------------------------+
      |                        buffer address                         |
      +---------------+-----------------------------------------------+
      | buffer length |                      resv                     |
      +---------------+-----------------------------------------------+

   used_ring:

      0               16              32
      +---------------+---------------+
      | packet length |     flags     |
      +---------------+---------------+

- Ring Layout::

      +------------------------+   +------------------------+
      | used_ring              |   | desc_ring              |
      | +---+                  |   | +---+                  |
      | |   |                  |   | |   |                  |
      | +---+      +--------+  |   | +---+                  |
      | |   | ---> | buffer | <+---+-|   |                  |
      | +---+      +--------+  |   | +---+                  |
      | |   |                  |   | |   |                  |
      | +---+                  |   | +---+                  |
      |  ...                   |   |  ...                   |
      |                        |   |                        |
      |            +---------+ |   |            +---------+ |
      |            | tx_tail | |   |            | rx_tail | |
      | System A   +---------+ |   | System B   +---------+ |
      +------------------------+   +------------------------+
                    <---------traffic---------

- Enqueue and Dequeue

  Based on this ring layout, enqueue reads rx_tail to get the number of free
  buffers, and writes used_ring and tx_tail to tell the peer which buffers
  are filled with data.
  Dequeue reads tx_tail to get the number of packets that have arrived, and
  writes desc_ring and rx_tail to tell the peer about the newly allocated
  buffers.
  In this way, only remote writes happen and remote reads are avoided,
  which gives better performance.

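The enqueue and dequeue scheme described above can be illustrated with a small
Python model. This is only a sketch under stated assumptions and not the driver
code: the class names, ``RING_SIZE``, the buffer addresses and the little-endian
``struct`` formats are hypothetical, tails are kept as ever-growing counters
instead of fixed-width fields, and remote memory is modelled as a plain object.
System A is the receiver and System B is the transmitter, as in the ring layout
diagram. Each side reads only its ``local`` memory and writes only to ``peer``.

.. code-block:: python

  import struct

  RING_SIZE = 4                   # hypothetical; small to keep the example short
  DESC = struct.Struct("<QH6x")   # desc_ring entry: buffer address, length, resv
  USED = struct.Struct("<HH")     # used_ring entry: packet length, flags


  class RxMemory:
      """Memory on System A (receiver); System B only writes to it."""

      def __init__(self):
          self.used_ring = [None] * RING_SIZE
          self.buffers = {}       # buffer address -> payload
          self.tx_tail = 0


  class TxMemory:
      """Memory on System B (transmitter); System A only writes to it."""

      def __init__(self):
          self.desc_ring = [None] * RING_SIZE
          self.rx_tail = 0


  class Receiver:
      def __init__(self, local, peer, buf_len=2048):
          self.local, self.peer, self.buf_len = local, peer, buf_len
          self.rx_tail = 0        # private copy, so rx_tail is never read back
          self.last_used = 0      # next used_ring slot to consume
          self.posted = [None] * RING_SIZE  # address posted in each slot
          self.next_addr = 0x1000           # made-up buffer addresses
          self.refill()

      def refill(self):
          """Post fresh buffers: remote writes to desc_ring and rx_tail."""
          while self.rx_tail - self.last_used < RING_SIZE:
              slot = self.rx_tail % RING_SIZE
              self.posted[slot] = self.next_addr
              self.peer.desc_ring[slot] = DESC.pack(self.next_addr, self.buf_len)
              self.next_addr += self.buf_len
              self.rx_tail += 1
          self.peer.rx_tail = self.rx_tail

      def dequeue(self):
          """Collect arrived packets: local reads of tx_tail and used_ring."""
          packets = []
          while self.last_used < self.local.tx_tail:
              slot = self.last_used % RING_SIZE
              length, _flags = USED.unpack(self.local.used_ring[slot])
              payload = self.local.buffers.pop(self.posted[slot])
              packets.append(payload[:length])
              self.last_used += 1
          self.refill()
          return packets


  class Transmitter:
      def __init__(self, local, peer):
          self.local, self.peer = local, peer
          self.tx_tail = 0        # private copy, so tx_tail is never read back

      def enqueue(self, packets):
          """Send as many packets as there are free buffers; return the count."""
          free = self.local.rx_tail - self.tx_tail  # local read
          sent = 0
          for payload in packets[:free]:
              slot = self.tx_tail % RING_SIZE
              addr, buf_len = DESC.unpack(self.local.desc_ring[slot])
              if len(payload) > buf_len:
                  break
              self.peer.buffers[addr] = payload                       # remote write
              self.peer.used_ring[slot] = USED.pack(len(payload), 0)  # remote write
              self.tx_tail += 1
              sent += 1
          self.peer.tx_tail = self.tx_tail                            # remote write
          return sent


  a_mem, b_mem = RxMemory(), TxMemory()
  rx = Receiver(local=a_mem, peer=b_mem)
  tx = Transmitter(local=b_mem, peer=a_mem)
  sent = tx.enqueue([b"p0", b"p1", b"p2", b"p3", b"p4"])
  received = rx.dequeue()

With four posted buffers, only the first four packets are accepted by
``enqueue``. The fifth can be sent after ``dequeue`` has posted new buffers.
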
Limitation
----------

This PMD is only supported on Intel Xeon Platforms:

- 4th Generation Intel® Xeon® Scalable Processors.
- 3rd Generation Intel® Xeon® Scalable Processors.
- 2nd Generation Intel® Xeon® Scalable Processors.