xref: /dpdk/doc/guides/prog_guide/toeplitz_hash_lib.rst (revision daa02b5cddbb8e11b31d41e2bf7bb1ae64dcae2f)
1..  SPDX-License-Identifier: BSD-3-Clause
2    Copyright(c) 2021 Intel Corporation.
3
4Toeplitz Hash Library
5=====================
6
7DPDK provides a Toeplitz Hash Library
8to calculate the Toeplitz hash function and to use its properties.
9The Toeplitz hash function is commonly used in a wide range of NICs
10to calculate the RSS hash sum to spread the traffic among the queues.
11
12.. _figure_rss_queue_assign:
13
14.. figure:: img/rss_queue_assign.*
15
16   RSS queue assignment example
17
18
19Toeplitz hash function API
20--------------------------
21
22There are two functions that provide calculation of the Toeplitz hash sum:
23
24* ``rte_softrss()``
25* ``rte_softrss_be()``
26
27Both of these functions take the parameters:
28
29* A pointer to the tuple, containing fields extracted from the packet.
30* A length of this tuple counted in double words.
31* A pointer to the RSS hash key corresponding to the one installed on the NIC.
32
33Both functions expect the tuple to be in "host" byte order
34and a multiple of 4 bytes in length.
35The ``rte_softrss()`` function expects the ``rss_key``
36to be exactly the same as the one installed on the NIC.
37The ``rte_softrss_be`` function is a faster implementation,
38but it expects ``rss_key`` to be converted to the host byte order.
39
40
41Predictable RSS
42---------------
43
44In some use cases it is useful to have a way to find partial collisions of the
45Toeplitz hash function. In figure :numref:`figure_rss_queue_assign` only a few
46of the least significant bits (LSB) of the hash value are used to indicate an
47entry in the RSS Redirection Table (ReTa) and thus the index of the queue. So,
48in this case it would be useful to find another tuple whose hash has the same
49LSB's as the hash from the original tuple.
50
51For example:
52
53- In the case of SNAT (Source Network Address Translation) it is possible to
54  find a special source port number on translation so that the hash of
55  returning packets, of the given connection, will have desired LSB's.
56- In the case of MPLS (Multiprotocol Label Switching), if the MPLS tag is used
57  in the hash calculation, the Label Switching router can allocate a special
58  MPLS tag to bind an LSP (Label Switching Path) to a given queue. This method
59  can be used with the allocation of IPSec SPI, VXLan VNI, etc., to bind the
60  tunnel to the desired queue.
61- In the case of a TCP stack, a special source port could be chosen for
62  outgoing connections, such that the response packets will be assigned to the
63  desired queue.
64
65This functionality is provided by the API shown below.
66The API consists of 3 parts:
67
68* Create the thash context.
69
70* Create the thash helper, associated with a context.
71
72* Use the helper run time to calculate the adjustable bits of the tuple to
73  ensure a collision.
74
75
76Thash context
77~~~~~~~~~~~~~
78
79The function ``rte_thash_init_ctx()`` initializes the context struct
80associated with a particular NIC or a set of NICs. It expects:
81
82* The log2 value of the size of the RSS redirection table for the
83  corresponding NIC. It reflects the number of least significant bits of the
84  hash value to produce a collision for.
85
86* A predefined RSS hash key. This is optional, if ``NULL`` then a random key
87  will be initialized.
88
89* The length of the RSS hash key. This value is usually hardware/driver
90  specific and can be found in the NIC datasheet.
91
92* Optional flags, as shown below.
93
94Supported flags:
95
96* ``RTE_THASH_IGNORE_PERIOD_OVERFLOW`` - By default, and for security reasons,
97  the library prohibits generating a repeatable sequence in the hash key. This
98  flag disables such checking. The flag is mainly used for testing in the lab
99  to generate an RSS hash key with a uniform hash distribution, if the input
100  traffic also has a uniform distribution.
101
102* ``RTE_THASH_MINIMAL_SEQ`` - By default, the library generates a special bit
103  sequence in the hash key for all the bits of the subtuple. However, the
104  collision generation task requires only the ``log2(RETA_SZ)`` bits in the
105  subtuple. This flag forces the minimum bit sequence in the hash key to be
106  generated for the required ``log2(RETA_SZ)`` least significant bits of the
107  subtuple. The flag can be used in the case of a relatively large number of
108  helpers that may overlap with their corresponding bit sequences of RSS hash
109  keys.
110
111
112Thash helper
113~~~~~~~~~~~~
114
115The function ``rte_thash_add_helper()`` initializes the helper struct
116associated with a given context and a part of a target tuple of interest which
117could be altered to produce a hash collision. On success it writes a specially
118calculated bit sequence into the RSS hash key which is stored in the context
119and calculates a table with values to be XORed with a subtuple.
120
121It expects:
122
123* A pointer to the Thash context to be associated with.
124
125* A length of the subtuple to be modified. The length is counted in bits.
126
127* An offset of the subtuple to be modified from the beginning of the tuple. It
128  is also counted in bits.
129
130.. note::
131
132   Adding a helper changes the key stored in the corresponding context. So the
133   updated RSS hash key must be uploaded into the NIC after creating all the
134   required helpers.
135
136
137Calculation of the complementary bits to adjust the subtuple
138~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
139
140The ``rte_thash_get_complement()`` function returns a special bit sequence
141with length ``N = log2(rss_reta_sz)`` (for the ``rss_reta_sz`` provided at
142context initialization) to be xored with N least significant bits of the
143subtuple.
144
145It expects:
146
147* A corresponding helper created for a given subtuple of the tuple.
148
149* A hash value of the tuple we want to alter.
150
151* The desired LSB's of the hash value the user expects to have.
152
153After the returned bit sequence has been XORed with the subtuple, the resulted
154LSB's of the new hash value, calculated from the altered tuple, will be the
155same as in ``desired_hash``.
156
157
158Adjust tuple API
159~~~~~~~~~~~~~~~~~
160
161The ``rte_thash_get_complement()`` function is a user-friendly wrapper around
162a number of other functions. It alters a passed tuple to meet the above
163mentioned requirements around the desired hash LSB's.
164
165It expects:
166
167* A Thash context and helper.
168
169* A pointer to the tuple to be changed.
170
171* The length of the tuple.
172
173* A callback function and its userdata to check the tuple after it has been
174  changed.
175
176* The number of attempts to change the tuple. Basically, it makes sense if
177  there is a callback and a limit on the number of attempts to change the
178  tuple, if the callback function returns an error.
179
180
181Use case example
182----------------
183
184There could be a number of different use cases, such as NAT, TCP stack, MPLS
185tag allocation, etc. In the following we will consider a SNAT application.
186
187Packets of a single bidirectional flow belonging to different directions can
188end up being assigned to different queues and thus processed by different
189lcores, as shown in :numref:`figure_predictable_snat_1`:
190
191.. _figure_predictable_snat_1:
192
193.. figure:: img/predictable_snat_1.*
194
195   Bidirectional flow packets distribution in general
196
197That leads to a situation where the same packet flow can be shared between two
198cores. Such a situation is not ideal from a performance perspective and
199requires extra synchronization efforts that might lead to various performance
200penalties, for example:
201
202* The connections table is global so locking/RCU on the flow insertion/removal
203  is required.
204
205* Connection metadata must be protected to avoid race conditions.
206
207* More cache pressure if a single connection metadata is kept in different
208  L1/L2 caches of a different CPU core.
209
210* Cache pressure/less cache locality on packet handover to the different cores.
211
212We can avoid all these penalties if it can be guaranteed that packets
213belonging to one bidirectional flow will be assigned to the same queue, as
214shown in :numref:`figure_predictable_snat_2`:
215
216.. _figure_predictable_snat_2:
217
218.. figure:: img/predictable_snat_2.*
219
220   Bidirectional flow packets distribution with predictable RSS
221
222
223To achieve this in a SNAT scenario it is possible to choose a source port not
224randomly, but using the predictable RSS library to produce a partial hash
225collision. This is shown in the code below.
226
227.. code-block:: c
228
229   int key_len = 40; /* The default Niantic RSS key length. */
230
231   /** The default Niantic RSS reta size = 2^7 entries, LSBs of hash value are
232    *  used as an indexes in RSS ReTa. */
233   int reta_sz = 7;
234   int ret;
235   struct rte_thash_ctx *ctx;
236
237   uint8_t initial_key[key_len] = {0}; /* Default empty key. */
238
239   /* Create and initialize a new thash context. */
240   ctx = rte_thash_init_ctx("SNAT", key_len, reta_sz, initial_key, 0);
241
242   /** Add a helper and specify the variable tuple part and its length. In the
243    *  SNAT case we want to choose a new source port on SNAT translation in a
244    *  way that the reverse tuple will have the same LSBs as the original
245    *  direction tuple so that the selected source port will be the
246    *  destination port on reply.
247    */
248   ret = rte_thash_add_helper(ctx, "snat", sizeof(uint16_t) * 8,
249                              offsetof(union rte_thash_tuple, v4.dport) * 8);
250
251   if (ret != 0)
252       return ret;
253
254   /* Get handler of the required helper. */
255   struct rte_thash_subtuple_helper *h = rte_thash_get_helper(ctx, "snat");
256
257   /** After calling rte_thash_add_helper() the initial_key passed on ctx
258    *  creation has been changed so we get the new one.
259    */
260   uint8_t *new_key = rte_thash_get_key(ctx);
261
262   union rte_thash_tuple tuple, rev_tuple;
263
264   /* A complete tuple from the packet. */
265   complete_tuple(mbuf, &tuple);
266
267   /* Calculate the RSS hash or get it from mbuf->hash.rss. */
268   uint32_t orig_hash = rte_softrss((uint32_t *)&tuple, RTE_THASH_V4_L4_LEN, new_key);
269
270   /** Complete the reverse tuple by translating the SRC address and swapping
271    *  src and dst addresses and ports.
272    */
273   get_rev_tuple(&rev_tuple, &tuple, new_ip);
274
275   /* Calculate the expected rss hash for the reverse tuple. */
276   uint32_t rev_hash = rte_softrss((uint32_t *)&rev_tuple, RTE_THASH_V4_L4_LEN, new_key);
277
278   /* Get the adjustment bits for the src port to get a new port. */
279   uint32_t adj = rte_thash_get_compliment(h, rev_hash, orig_hash);
280
281   /* Adjust the source port bits. */
282   uint16_t new_sport = tuple.v4.sport ^ adj;
283
284   /* Make an actual packet translation. */
285   do_snat(mbuf, new_ip, new_sport);
286