..  SPDX-License-Identifier: BSD-3-Clause
    Copyright(c) 2018 Intel Corporation.

.. include:: <isonum.txt>

NTB Rawdev Driver
=================

The ``ntb`` rawdev driver provides a non-transparent bridge (NTB) between two
separate hosts so that they can communicate with each other. Many use cases
can benefit from this, such as fault tolerance and visual acceleration.

This PMD allows two hosts to handshake for device start and stop, to allocate
memory for the peer to access, and to read and write memory allocated by the
peer. The PMD also allows the use of doorbell registers to notify the peer and
of scratchpad registers to share some information.

BIOS setting on Intel Xeon
--------------------------

Intel Non-transparent Bridge (NTB) needs special BIOS settings on both systems.
Note that for 4th Generation Intel\ |reg| Xeon\ |reg| Scalable Processors,
the ``Port Subsystem Mode`` option should be changed from ``Gen5`` to
``Gen4 Only``, followed by a reboot.

- Set ``Non-Transparent Bridge PCIe Port Definition`` to ``NTB to NTB`` mode
  for the required PCIe ports, on both hosts.
- Set ``Enable NTB BARs`` to ``Enabled``, on both hosts.
- Set ``Enable SPLIT BARs`` to ``Disabled``, on both hosts.
- Set ``Imbar1 Size``, ``Imbar2 Size``, ``Embar1 Size`` and ``Embar2 Size``
  to 12-29 (i.e., 4K-512M) for 2nd Generation Intel\ |reg| Xeon\ |reg| Scalable Processors;
  to 12-51 (i.e., 4K-128PB) for 3rd and 4th Generation Intel\ |reg| Xeon\ |reg| Scalable Processors.
  Note that the BAR sizes should be the same on both hosts.
- Set ``Crosslink Control override`` to ``DSD/USP`` on one host and
  ``USD/DSP`` on the other host.
- Set ``PCIe PLL SSC (Spread Spectrum Clocking)`` to ``Disabled``, on both hosts.
  This is a hardware requirement when using Re-timer Cards.

Device Setup
------------

Intel NTB devices must be bound to a DPDK-supported kernel driver, such as
``igb_uio`` or ``vfio-pci``, before use. The ``dpdk-devbind.py`` script can be
used to show device status and to bind devices to a suitable kernel driver.
They will appear under the category of "Misc (rawdev) devices".

Prerequisites
-------------

The NTB PMD needs the kernel PCI driver to support write combining (WC) to get
better performance. The performance difference is more than 10 times.
There are two ways to enable WC.

- If the ``igb_uio`` driver is used, insert it with the ``wc_activate=1`` flag.

.. code-block:: console

  insmod igb_uio.ko wc_activate=1

- Enable WC manually for BAR 2 and BAR 4 (mapped memory) of the NTB device.
  See https://www.kernel.org/doc/html/latest/x86/mtrr.html for reference.
  Get the BAR base addresses using ``lspci -vvv -s ae:00.0 | grep Region``.

.. code-block:: console

  # lspci -vvv -s ae:00.0 | grep Region
  Region 0: Memory at 39bfe0000000 (64-bit, prefetchable) [size=64K]
  Region 2: Memory at 39bfa0000000 (64-bit, prefetchable) [size=512M]
  Region 4: Memory at 39bfc0000000 (64-bit, prefetchable) [size=512M]

Use the following commands to enable WC.

.. code-block:: console

  echo "base=0x39bfa0000000 size=0x20000000 type=write-combining" >> /proc/mtrr
  echo "base=0x39bfc0000000 size=0x20000000 type=write-combining" >> /proc/mtrr

The resulting MTRR entries:

.. code-block:: console

  # cat /proc/mtrr
  reg00: base=0x000000000 (    0MB), size= 2048MB, count=1: write-back
  reg01: base=0x07f000000 ( 2032MB), size=   16MB, count=1: uncachable
  reg02: base=0x39bfa0000000 (60553728MB), size=  512MB, count=1: write-combining
  reg03: base=0x39bfc0000000 (60554240MB), size=  512MB, count=1: write-combining
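
The ``base`` and ``size`` values written to ``/proc/mtrr`` above follow directly
from the ``lspci`` output: the base is the BAR address and the size is the BAR
size in bytes, in hexadecimal. The following Python sketch illustrates the
conversion. The function name ``mtrr_lines`` and the regular expression are
hypothetical helpers written for this example, not part of DPDK, and the sketch
assumes the ``lspci`` line format shown above.

.. code-block:: python

  import re

  # Matches lines such as:
  #   Region 2: Memory at 39bfa0000000 (64-bit, prefetchable) [size=512M]
  REGION_RE = re.compile(
      r"Region (\d+): Memory at ([0-9a-f]+) .*\[size=(\d+)([KMG])\]"
  )
  UNIT = {"K": 1 << 10, "M": 1 << 20, "G": 1 << 30}


  def mtrr_lines(lspci_output, bars=(2, 4)):
      """Return one /proc/mtrr write-combining entry per selected BAR."""
      lines = []
      for match in REGION_RE.finditer(lspci_output):
          bar, base, size, unit = match.groups()
          if int(bar) in bars:
              size_bytes = int(size) * UNIT[unit]
              lines.append(
                  f"base=0x{base} size=0x{size_bytes:x} type=write-combining"
              )
      return lines


  # Sample input copied from the lspci output above.
  sample = """\
  Region 0: Memory at 39bfe0000000 (64-bit, prefetchable) [size=64K]
  Region 2: Memory at 39bfa0000000 (64-bit, prefetchable) [size=512M]
  Region 4: Memory at 39bfc0000000 (64-bit, prefetchable) [size=512M]
  """
  for line in mtrr_lines(sample):
      print(line)

For the sample input, the two printed strings are the ones passed to ``echo``
in the enable step, since 512M is ``0x20000000`` bytes.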

To disable WC for these regions, use the following commands. The numbers are
the register indexes shown in ``/proc/mtrr`` (``reg02`` and ``reg03`` above).

.. code-block:: console

  echo "disable=2" >> /proc/mtrr
  echo "disable=3" >> /proc/mtrr
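
The register indexes depend on the system, so they should be read from
``/proc/mtrr`` rather than assumed. As an illustration only, the Python sketch
below picks out the write-combining registers that match a given set of BAR
base addresses. The name ``disable_lines`` and the regular expression are
hypothetical and assume the ``/proc/mtrr`` line format shown above.

.. code-block:: python

  import re

  # Matches lines such as:
  #   reg02: base=0x39bfa0000000 (60553728MB), size=  512MB, count=1: write-combining
  MTRR_RE = re.compile(r"reg(\d+): base=0x([0-9a-f]+) .*: (.+)$", re.MULTILINE)


  def disable_lines(proc_mtrr, bases):
      """Return a "disable=N" string for each WC register whose base is in bases."""
      lines = []
      for reg, base, mem_type in MTRR_RE.findall(proc_mtrr):
          if mem_type.strip() == "write-combining" and int(base, 16) in bases:
              lines.append(f"disable={int(reg)}")
      return lines


  # Sample input copied from the /proc/mtrr output above.
  mtrr_sample = """\
  reg00: base=0x000000000 (    0MB), size= 2048MB, count=1: write-back
  reg01: base=0x07f000000 ( 2032MB), size=   16MB, count=1: uncachable
  reg02: base=0x39bfa0000000 (60553728MB), size=  512MB, count=1: write-combining
  reg03: base=0x39bfc0000000 (60554240MB), size=  512MB, count=1: write-combining
  """
  for line in disable_lines(mtrr_sample, {0x39BFA0000000, 0x39BFC0000000}):
      print(line)

Each returned string is meant to be written to ``/proc/mtrr`` in the same way
as in the commands above.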

Ring Layout
-----------

Since reads and writes to the remote system's memory go through the PCI bus,
a remote read is much more expensive than a remote write. Enqueue and dequeue
on the NTB ring should therefore avoid remote reads. The ring layout for NTB
is as follows:

- Ring Format::

   desc_ring:

      0               16                                              64
      +---------------------------------------------------------------+
      |                        buffer address                         |
      +---------------+-----------------------------------------------+
      | buffer length |                      resv                     |
      +---------------+-----------------------------------------------+

   used_ring:

      0               16              32
      +---------------+---------------+
      | packet length |     flags     |
      +---------------+---------------+

- Ring Layout::

      +------------------------+   +------------------------+
      | used_ring              |   | desc_ring              |
      | +---+                  |   | +---+                  |
      | |   |                  |   | |   |                  |
      | +---+      +--------+  |   | +---+                  |
      | |   | ---> | buffer | <+---+-|   |                  |
      | +---+      +--------+  |   | +---+                  |
      | |   |                  |   | |   |                  |
      | +---+                  |   | +---+                  |
      |  ...                   |   |  ...                   |
      |                        |   |                        |
      |            +---------+ |   |            +---------+ |
      |            | tx_tail | |   |            | rx_tail | |
      | System A   +---------+ |   | System B   +---------+ |
      +------------------------+   +------------------------+
                    <---------traffic---------

- Enqueue and Dequeue:

  Based on this ring layout, enqueue reads ``rx_tail`` to find out how many
  buffers are free, then writes ``used_ring`` and ``tx_tail`` to tell the
  peer which buffers are filled with data.
  Dequeue reads ``tx_tail`` to find out how many packets have arrived, then
  writes ``desc_ring`` and ``rx_tail`` to tell the peer about the newly
  allocated buffers.
  In this way, only remote writes happen and remote reads are avoided,
  which gives better performance.
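
The enqueue and dequeue steps described above can be modelled with a small
Python simulation. This is only a sketch of the idea, not the driver's
implementation: the class names ``Receiver`` and ``Sender``, the constants
``RING_SIZE`` and ``BUF_LEN``, the buffer addresses and the zero ``flags``
value are all assumptions made for the example, and plain Python objects stand
in for memory that the real hardware maps through the NTB BARs. Every read is
from the object's own fields, and every access to the peer is a write.

.. code-block:: python

  RING_SIZE = 4    # hypothetical ring size, chosen only for this sketch
  BUF_LEN = 2048   # hypothetical buffer length


  class Receiver:
      """System A: owns used_ring, tx_tail and the packet buffers."""

      def __init__(self):
          self.used_ring = [None] * RING_SIZE  # written remotely by the sender
          self.tx_tail = 0                     # written remotely by the sender
          self.memory = {}                     # buffer address -> bytearray
          self.slot_addr = [None] * RING_SIZE  # local copy of posted addresses
          self.posted = 0                      # buffers handed to the sender
          self.consumed = 0                    # packets already dequeued

      def post_buffer(self, sender):
          """Allocate a buffer and publish it using remote writes only."""
          slot = self.posted % RING_SIZE
          addr = 0x1000 + self.posted * BUF_LEN   # made-up address
          self.memory[addr] = bytearray(BUF_LEN)
          self.slot_addr[slot] = addr
          sender.desc_ring[slot] = (addr, BUF_LEN)   # remote write
          self.posted += 1
          sender.rx_tail = self.posted               # remote write

      def dequeue(self, sender):
          """Collect arrived packets, reading only local memory."""
          packets = []
          while self.consumed != self.tx_tail:           # local read
              slot = self.consumed % RING_SIZE
              length, _flags = self.used_ring[slot]      # local read
              buf = self.memory.pop(self.slot_addr[slot])
              packets.append(bytes(buf[:length]))
              self.consumed += 1
              self.post_buffer(sender)                   # refill the slot
          return packets


  class Sender:
      """System B: owns desc_ring and rx_tail."""

      def __init__(self):
          self.desc_ring = [None] * RING_SIZE  # written remotely by the receiver
          self.rx_tail = 0                     # written remotely by the receiver
          self.sent = 0                        # packets enqueued so far

      def enqueue(self, receiver, packets):
          """Send as many packets as there are free buffers."""
          free = self.rx_tail - self.sent                # local read
          accepted = packets[:free]
          for data in accepted:
              slot = self.sent % RING_SIZE
              addr, buf_len = self.desc_ring[slot]       # local read
              if len(data) > buf_len:
                  raise ValueError("packet larger than buffer")
              receiver.memory[addr][:len(data)] = data   # remote write
              receiver.used_ring[slot] = (len(data), 0)  # remote write
              self.sent += 1
          receiver.tx_tail = self.sent                   # remote write
          return len(accepted)


  a, b = Receiver(), Sender()
  for _ in range(RING_SIZE):
      a.post_buffer(b)
  sent = b.enqueue(a, [b"ping", b"pong", b"x", b"y", b"z"])
  received = a.dequeue(b)
  print(sent, received)

With four posted buffers, only the first four of the five packets are accepted.
The fifth can be enqueued after ``dequeue`` has refilled the ring.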

Limitation
----------

This PMD is supported only on the following Intel Xeon platforms:

- 4th Generation Intel® Xeon® Scalable Processors.
- 3rd Generation Intel® Xeon® Scalable Processors.
- 2nd Generation Intel® Xeon® Scalable Processors.