..  SPDX-License-Identifier: BSD-3-Clause
    Copyright(c) 2018 Intel Corporation.

NTB Rawdev Driver
=================

The ``ntb`` rawdev driver provides a non-transparent bridge between two
separate hosts so that they can communicate with each other. Many use
cases can benefit from this, such as fault tolerance and visual
acceleration.

This PMD allows two hosts to handshake for device start and stop, to
allocate memory for the peer to access, and to read/write the memory
allocated by the peer. The PMD also allows the use of doorbell registers
to notify the peer and of scratchpad registers to share information.

BIOS setting on Intel Xeon
--------------------------

Intel Non-transparent Bridge needs special BIOS settings. The reference for
Skylake is https://www.intel.com/content/dam/support/us/en/documents/server-products/Intel_Xeon_Processor_Scalable_Family_BIOS_User_Guide.pdf

- Set the needed PCIe port as NTB to NTB mode on both hosts.
- Enable NTB bars and set bar size of bar 23 and bar 45 as 12-29 (4K-512M)
  on both hosts (for Ice Lake, bar size can be set as 12-51, namely 4K-128PB).
  Note that bar size on both hosts should be the same.
- Disable split bars for both hosts.
- Set crosslink control override as DSD/USP on one host, USD/DSP on
  the other host.
- Disable PCIe PLL SSC (Spread Spectrum Clocking) for both hosts. This
  is a hardware requirement.

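The ``bar size`` values in the list above read as power-of-two exponents:
12 corresponds to 4K and 29 to 512M. The following is a minimal C sketch of
that conversion, assuming this interpretation. ``sketch_bar_size_bytes`` is
a hypothetical helper for illustration only; it is not part of DPDK or of
any BIOS interface.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper: convert a BIOS "bar size" exponent into bytes. */
static uint64_t sketch_bar_size_bytes(unsigned int exponent)
{
    assert(exponent < 64);          /* the shift must stay within 64 bits */
    return UINT64_C(1) << exponent; /* size = 2^exponent bytes */
}
```

For example, ``sketch_bar_size_bytes(29)`` yields 536870912 bytes, which
matches the 512M regions used later in this guide.
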
Device Setup
------------

The Intel NTB devices need to be bound to a DPDK-supported kernel driver,
such as igb_uio or vfio, before use. The ``dpdk-devbind.py`` script can be
used to show device status and to bind devices to a suitable kernel driver.
NTB devices appear under the category of "Misc (rawdev) devices".

Prerequisites
-------------

NTB PMD needs the kernel PCI driver to support write combining (WC) to get
better performance. The performance difference is more than 10 times.
There are two ways to enable WC.

- Insert igb_uio with the ``wc_activate=1`` flag if the igb_uio driver is used.

.. code-block:: console

  insmod igb_uio.ko wc_activate=1

- Enable WC for the NTB device's Bar 2 and Bar 4 (mapped memory) manually.
  The reference is https://www.kernel.org/doc/html/latest/x86/mtrr.html
  Get the bar base address using ``lspci -vvv -s ae:00.0 | grep Region``.

.. code-block:: console

  # lspci -vvv -s ae:00.0 | grep Region
  Region 0: Memory at 39bfe0000000 (64-bit, prefetchable) [size=64K]
  Region 2: Memory at 39bfa0000000 (64-bit, prefetchable) [size=512M]
  Region 4: Memory at 39bfc0000000 (64-bit, prefetchable) [size=512M]

Use the following commands to enable WC.

.. code-block:: console

  echo "base=0x39bfa0000000 size=0x20000000 type=write-combining" >> /proc/mtrr
  echo "base=0x39bfc0000000 size=0x20000000 type=write-combining" >> /proc/mtrr

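The ``size`` value in these commands is the 512M region size reported by
``lspci``, written in hexadecimal (512 MiB = 0x20000000). The following C
sketch builds the same request string from a base address and a size.
``sketch_format_mtrr_wc`` is a hypothetical helper name used only for
illustration; the string still has to be written to ``/proc/mtrr`` as shown
above.

```c
#include <inttypes.h>
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

/*
 * Hypothetical helper: format a write-combining request for /proc/mtrr.
 * Returns the snprintf() result, i.e. the length the full string needs.
 */
static int sketch_format_mtrr_wc(char *buf, size_t buf_len,
                                 uint64_t base, uint64_t size)
{
    return snprintf(buf, buf_len,
                    "base=0x%" PRIx64 " size=0x%" PRIx64
                    " type=write-combining",
                    base, size);
}
```

Calling it with base ``0x39bfa0000000`` and size ``512 << 20`` reproduces
the first ``echo`` argument used above.
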
The resulting MTRR entries:

.. code-block:: console

  # cat /proc/mtrr
  reg00: base=0x000000000 (    0MB), size= 2048MB, count=1: write-back
  reg01: base=0x07f000000 ( 2032MB), size=   16MB, count=1: uncachable
  reg02: base=0x39bfa0000000 (60553728MB), size=  512MB, count=1: write-combining
  reg03: base=0x39bfc0000000 (60554240MB), size=  512MB, count=1: write-combining

To disable WC for these regions, use the following commands.

.. code-block:: console

  echo "disable=2" >> /proc/mtrr
  echo "disable=3" >> /proc/mtrr

Ring Layout
-----------

Since reads and writes to the remote system's memory go through the PCI
bus, a remote read is much more expensive than a remote write. Thus, enqueue
and dequeue on the ntb ring should avoid remote reads. The ring layout for
ntb is as follows:

- Ring Format::

   desc_ring:

      0               16                                              64
      +---------------------------------------------------------------+
      |                        buffer address                         |
      +---------------+-----------------------------------------------+
      | buffer length |                      resv                     |
      +---------------+-----------------------------------------------+

   used_ring:

      0               16              32
      +---------------+---------------+
      | packet length |     flags     |
      +---------------+---------------+

- Ring Layout::

      +------------------------+   +------------------------+
      | used_ring              |   | desc_ring              |
      | +---+                  |   | +---+                  |
      | |   |                  |   | |   |                  |
      | +---+      +--------+  |   | +---+                  |
      | |   | ---> | buffer | <+---+-|   |                  |
      | +---+      +--------+  |   | +---+                  |
      | |   |                  |   | |   |                  |
      | +---+                  |   | +---+                  |
      |  ...                   |   |  ...                   |
      |                        |   |                        |
      |            +---------+ |   |            +---------+ |
      |            | tx_tail | |   |            | rx_tail | |
      | System A   +---------+ |   | System B   +---------+ |
      +------------------------+   +------------------------+
                    <---------traffic---------

- Enqueue and Dequeue

  Based on this ring layout, enqueue reads rx_tail to get the number of free
  buffers, and writes used_ring and tx_tail to tell the peer which buffers
  are filled with data.
  Dequeue reads tx_tail to get the number of packets that have arrived, and
  writes desc_ring and rx_tail to tell the peer about the newly allocated
  buffers.
  In this way, only remote writes happen and remote reads are avoided,
  which gives better performance.

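The ring format and the enqueue/dequeue rules above can be sketched in C as
a single-process simulation. This is only an illustration under stated
assumptions, not the driver's implementation. All ``sketch_*`` names, the
ring size, the buffer size, and the use of free-running 16-bit counters for
``tx_tail`` and ``rx_tail`` are hypothetical. Memory barriers and the
address translation through the NTB bars are omitted. Each side only reads
its own ``local`` memory and only writes through its ``peer`` pointer.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define SKETCH_RING_SIZE 4u   /* hypothetical ring size (power of two) */
#define SKETCH_BUF_SIZE  64u  /* hypothetical buffer size in bytes */

/* desc_ring entry: 64-bit buffer address, 16-bit length, 48 reserved bits. */
struct sketch_desc {
    uint64_t addr;
    uint16_t len;
    uint16_t resv[3];
};

/* used_ring entry: 16-bit packet length, 16-bit flags. */
struct sketch_used {
    uint16_t len;
    uint16_t flags;
};

/* Resides on the receiver (System A); the sender only writes to it. */
struct sketch_rx_mem {
    struct sketch_used used_ring[SKETCH_RING_SIZE];
    uint16_t tx_tail;              /* free-running count of filled buffers */
};

/* Resides on the sender (System B); the receiver only writes to it. */
struct sketch_tx_mem {
    struct sketch_desc desc_ring[SKETCH_RING_SIZE];
    uint16_t rx_tail;              /* free-running count of posted buffers */
};

struct sketch_sender {
    struct sketch_tx_mem local;    /* read locally */
    struct sketch_rx_mem *peer;    /* written remotely, never read */
    uint16_t produced;             /* private copy of tx_tail */
};

struct sketch_receiver {
    struct sketch_rx_mem local;    /* read locally */
    struct sketch_tx_mem *peer;    /* written remotely, never read */
    void *bufs[SKETCH_RING_SIZE];  /* private record of posted buffers */
    uint16_t posted;               /* private copy of rx_tail */
    uint16_t consumed;             /* packets already dequeued */
};

/* Receiver: post an empty buffer by writing desc_ring and rx_tail remotely. */
static void sketch_post_buffer(struct sketch_receiver *rx, void *buf)
{
    uint16_t idx = rx->posted % SKETCH_RING_SIZE;

    assert((uint16_t)(rx->posted - rx->consumed) < SKETCH_RING_SIZE);
    rx->bufs[idx] = buf;
    rx->peer->desc_ring[idx].addr = (uint64_t)(uintptr_t)buf;
    rx->peer->desc_ring[idx].len = SKETCH_BUF_SIZE;
    rx->peer->rx_tail = ++rx->posted;
}

/* Sender: returns 0 on success, -1 if no buffer is free or len is too big. */
static int sketch_enqueue(struct sketch_sender *tx, const void *data,
                          uint16_t len)
{
    uint16_t idx = tx->produced % SKETCH_RING_SIZE;
    const struct sketch_desc *d = &tx->local.desc_ring[idx];

    if ((uint16_t)(tx->local.rx_tail - tx->produced) == 0 || len > d->len)
        return -1;
    memcpy((void *)(uintptr_t)d->addr, data, len); /* payload to the peer */
    tx->peer->used_ring[idx].len = len;
    tx->peer->used_ring[idx].flags = 0;
    tx->peer->tx_tail = ++tx->produced;            /* publish last */
    return 0;
}

/* Receiver: copy one packet out, re-post its buffer; returns length or -1. */
static int sketch_dequeue(struct sketch_receiver *rx, void *out, uint16_t cap)
{
    uint16_t idx = rx->consumed % SKETCH_RING_SIZE;
    uint16_t len;

    if ((uint16_t)(rx->local.tx_tail - rx->consumed) == 0)
        return -1;
    len = rx->local.used_ring[idx].len;
    if (len > cap)
        return -1;
    memcpy(out, rx->bufs[idx], len);
    rx->consumed++;
    sketch_post_buffer(rx, rx->bufs[idx]);
    return len;
}
```

To use the sketch, zero both structures, point ``tx.peer`` at ``rx.local``
and ``rx.peer`` at ``tx.local``, and post buffers with
``sketch_post_buffer`` before the first ``sketch_enqueue`` call. Re-posting
the same buffer in ``sketch_dequeue`` keeps the example short; a real
receiver would normally post a newly allocated buffer instead.
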
Limitation
----------

- This PMD only supports Intel Skylake and Ice Lake platforms.
