xref: /dpdk/doc/guides/prog_guide/toeplitz_hash_lib.rst (revision 31d7c069478dd94f40c7e4360b2a0953313ffe0e)
1..  SPDX-License-Identifier: BSD-3-Clause
2    Copyright(c) 2021 Intel Corporation.
3
4Toeplitz Hash Library
5=====================
6
7DPDK provides a Toeplitz Hash Library
8to calculate the Toeplitz hash function and to use its properties.
9The Toeplitz hash function is commonly used in a wide range of NICs
10to calculate the RSS hash sum to spread the traffic among the queues.
11
12.. _figure_rss_queue_assign:
13
14.. figure:: img/rss_queue_assign.*
15
16   RSS queue assignment example
17
18
19Toeplitz hash function API
20--------------------------
21
22There are four functions that provide calculation of the Toeplitz hash sum:
23
24* ``rte_softrss()``
25* ``rte_softrss_be()``
26* ``rte_thash_gfni()``
27* ``rte_thash_gfni_bulk()``
28
29First two functions are scalar implementation and take the parameters:
30
31* A pointer to the tuple, containing fields extracted from the packet.
32* A length of this tuple counted in double words.
33* A pointer to the RSS hash key corresponding to the one installed on the NIC.
34
35Both of above mentioned _softrss_ functions expect the tuple to be in
36"host" byte order and a multiple of 4 bytes in length.
37The ``rte_softrss()`` function expects the ``rss_key``
38to be exactly the same as the one installed on the NIC.
39The ``rte_softrss_be`` function is a faster implementation,
40but it expects ``rss_key`` to be converted to the host byte order.
41
42The last two functions are vectorized implementations using
43Galois Fields New Instructions. Could be used if ``rte_thash_gfni_supported`` is true.
44They expect the tuple to be in network byte order.
45
46``rte_thash_gfni()`` calculates the hash value for a single tuple, and
47``rte_thash_gfni_bulk()`` bulk implementation of the rte_thash_gfni().
48
49``rte_thash_gfni()`` takes the parameters:
50
51* A pointer to the matrices derived from the RSS hash key using ``rte_thash_complete_matrix()``.
52* A pointer to the tuple.
53* A length of the tuple in bytes.
54
55``rte_thash_gfni_bulk()`` takes the parameters:
56
57* A pointer to the matrices derived from the RSS hash key using ``rte_thash_complete_matrix()``.
58* A length of the longest tuple in bytes.
59* Array of the pointers on data to be hashed.
60* Array of ``uint32_t`` where to put calculated Toeplitz hash values
61* Number of tuples in a bulk.
62
63``rte_thash_complete_matrix()`` is a function that calculates matrices required by
64GFNI implementations from the RSS hash key. It takes the parameters:
65
66* A pointer to the memory where the matrices will be written.
67* A pointer to the RSS hash key.
68* Length of the RSS hash key in bytes.
69
70
71Predictable RSS
72---------------
73
74In some use cases it is useful to have a way to find partial collisions of the
75Toeplitz hash function. In figure :numref:`figure_rss_queue_assign` only a few
76of the least significant bits (LSB) of the hash value are used to indicate an
77entry in the RSS Redirection Table (ReTa) and thus the index of the queue. So,
78in this case it would be useful to find another tuple whose hash has the same
79LSB's as the hash from the original tuple.
80
81For example:
82
83- In the case of SNAT (Source Network Address Translation) it is possible to
84  find a special source port number on translation so that the hash of
85  returning packets, of the given connection, will have desired LSB's.
86- In the case of MPLS (Multiprotocol Label Switching), if the MPLS tag is used
87  in the hash calculation, the Label Switching router can allocate a special
88  MPLS tag to bind an LSP (Label Switching Path) to a given queue. This method
89  can be used with the allocation of IPSec SPI, VXLan VNI, etc., to bind the
90  tunnel to the desired queue.
91- In the case of a TCP stack, a special source port could be chosen for
92  outgoing connections, such that the response packets will be assigned to the
93  desired queue.
94
95This functionality is provided by the API shown below.
96The API consists of 3 parts:
97
98* Create the thash context.
99
100* Create the thash helper, associated with a context.
101
102* Use the helper run time to calculate the adjustable bits of the tuple to
103  ensure a collision.
104
105
106Thash context
107~~~~~~~~~~~~~
108
109The function ``rte_thash_init_ctx()`` initializes the context struct
110associated with a particular NIC or a set of NICs. It expects:
111
112* The log2 value of the size of the RSS redirection table for the
113  corresponding NIC. It reflects the number of least significant bits of the
114  hash value to produce a collision for.
115
116* A predefined RSS hash key. This is optional, if ``NULL`` then a random key
117  will be initialized.
118
119* The length of the RSS hash key. This value is usually hardware/driver
120  specific and can be found in the NIC datasheet.
121
122* Optional flags, as shown below.
123
124Supported flags:
125
126* ``RTE_THASH_IGNORE_PERIOD_OVERFLOW`` - By default, and for security reasons,
127  the library prohibits generating a repeatable sequence in the hash key. This
128  flag disables such checking. The flag is mainly used for testing in the lab
129  to generate an RSS hash key with a uniform hash distribution, if the input
130  traffic also has a uniform distribution.
131
132* ``RTE_THASH_MINIMAL_SEQ`` - By default, the library generates a special bit
133  sequence in the hash key for all the bits of the subtuple. However, the
134  collision generation task requires only the ``log2(RETA_SZ)`` bits in the
135  subtuple. This flag forces the minimum bit sequence in the hash key to be
136  generated for the required ``log2(RETA_SZ)`` least significant bits of the
137  subtuple. The flag can be used in the case of a relatively large number of
138  helpers that may overlap with their corresponding bit sequences of RSS hash
139  keys.
140
141
142Thash helper
143~~~~~~~~~~~~
144
145The function ``rte_thash_add_helper()`` initializes the helper struct
146associated with a given context and a part of a target tuple of interest which
147could be altered to produce a hash collision. On success it writes a specially
148calculated bit sequence into the RSS hash key which is stored in the context
149and calculates a table with values to be XORed with a subtuple.
150
151It expects:
152
153* A pointer to the Thash context to be associated with.
154
155* A length of the subtuple to be modified. The length is counted in bits.
156
157* An offset of the subtuple to be modified from the beginning of the tuple. It
158  is also counted in bits.
159
160.. note::
161
162   Adding a helper changes the key stored in the corresponding context. So the
163   updated RSS hash key must be uploaded into the NIC after creating all the
164   required helpers.
165
166
167Calculation of the complementary bits to adjust the subtuple
168~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
169
170The ``rte_thash_get_complement()`` function returns a special bit sequence
171with length ``N = log2(rss_reta_sz)`` (for the ``rss_reta_sz`` provided at
172context initialization) to be xored with N least significant bits of the
173subtuple.
174
175It expects:
176
177* A corresponding helper created for a given subtuple of the tuple.
178
179* A hash value of the tuple we want to alter.
180
181* The desired LSB's of the hash value the user expects to have.
182
183After the returned bit sequence has been XORed with the subtuple, the resulted
184LSB's of the new hash value, calculated from the altered tuple, will be the
185same as in ``desired_hash``.
186
187
188Adjust tuple API
189~~~~~~~~~~~~~~~~~
190
191The ``rte_thash_get_complement()`` function is a user-friendly wrapper around
192a number of other functions. It alters a passed tuple to meet the above
193mentioned requirements around the desired hash LSB's.
194
195It expects:
196
197* A Thash context and helper.
198
199* A pointer to the tuple to be changed.
200
201* The length of the tuple.
202
203* A callback function and its userdata to check the tuple after it has been
204  changed.
205
206* The number of attempts to change the tuple. Basically, it makes sense if
207  there is a callback and a limit on the number of attempts to change the
208  tuple, if the callback function returns an error.
209
210
211Use case example
212----------------
213
214There could be a number of different use cases, such as NAT, TCP stack, MPLS
215tag allocation, etc. In the following we will consider a SNAT application.
216
217Packets of a single bidirectional flow belonging to different directions can
218end up being assigned to different queues and thus processed by different
219lcores, as shown in :numref:`figure_predictable_snat_1`:
220
221.. _figure_predictable_snat_1:
222
223.. figure:: img/predictable_snat_1.*
224
225   Bidirectional flow packets distribution in general
226
227That leads to a situation where the same packet flow can be shared between two
228cores. Such a situation is not ideal from a performance perspective and
229requires extra synchronization efforts that might lead to various performance
230penalties, for example:
231
232* The connections table is global so locking/RCU on the flow insertion/removal
233  is required.
234
235* Connection metadata must be protected to avoid race conditions.
236
237* More cache pressure if a single connection metadata is kept in different
238  L1/L2 caches of a different CPU core.
239
240* Cache pressure/less cache locality on packet handover to the different cores.
241
242We can avoid all these penalties if it can be guaranteed that packets
243belonging to one bidirectional flow will be assigned to the same queue, as
244shown in :numref:`figure_predictable_snat_2`:
245
246.. _figure_predictable_snat_2:
247
248.. figure:: img/predictable_snat_2.*
249
250   Bidirectional flow packets distribution with predictable RSS
251
252
253To achieve this in a SNAT scenario it is possible to choose a source port not
254randomly, but using the predictable RSS library to produce a partial hash
255collision. This is shown in the code below.
256
257.. code-block:: c
258
259   int key_len = 40; /* The default Niantic RSS key length. */
260
261   /** The default Niantic RSS reta size = 2^7 entries, LSBs of hash value are
262    *  used as an indexes in RSS ReTa. */
263   int reta_sz = 7;
264   int ret;
265   struct rte_thash_ctx *ctx;
266
267   uint8_t initial_key[key_len] = {0}; /* Default empty key. */
268
269   /* Create and initialize a new thash context. */
270   ctx = rte_thash_init_ctx("SNAT", key_len, reta_sz, initial_key, 0);
271
272   /** Add a helper and specify the variable tuple part and its length. In the
273    *  SNAT case we want to choose a new source port on SNAT translation in a
274    *  way that the reverse tuple will have the same LSBs as the original
275    *  direction tuple so that the selected source port will be the
276    *  destination port on reply.
277    */
278   ret = rte_thash_add_helper(ctx, "snat", sizeof(uint16_t) * 8,
279                              offsetof(union rte_thash_tuple, v4.dport) * 8);
280
281   if (ret != 0)
282       return ret;
283
284   /* Get handler of the required helper. */
285   struct rte_thash_subtuple_helper *h = rte_thash_get_helper(ctx, "snat");
286
287   /** After calling rte_thash_add_helper() the initial_key passed on ctx
288    *  creation has been changed so we get the new one.
289    */
290   uint8_t *new_key = rte_thash_get_key(ctx);
291
292   union rte_thash_tuple tuple, rev_tuple;
293
294   /* A complete tuple from the packet. */
295   complete_tuple(mbuf, &tuple);
296
297   /* Calculate the RSS hash or get it from mbuf->hash.rss. */
298   uint32_t orig_hash = rte_softrss((uint32_t *)&tuple, RTE_THASH_V4_L4_LEN, new_key);
299
300   /** Complete the reverse tuple by translating the SRC address and swapping
301    *  src and dst addresses and ports.
302    */
303   get_rev_tuple(&rev_tuple, &tuple, new_ip);
304
305   /* Calculate the expected rss hash for the reverse tuple. */
306   uint32_t rev_hash = rte_softrss((uint32_t *)&rev_tuple, RTE_THASH_V4_L4_LEN, new_key);
307
308   /* Get the adjustment bits for the src port to get a new port. */
309   uint32_t adj = rte_thash_get_compliment(h, rev_hash, orig_hash);
310
311   /* Adjust the source port bits. */
312   uint16_t new_sport = tuple.v4.sport ^ adj;
313
314   /* Make an actual packet translation. */
315   do_snat(mbuf, new_ip, new_sport);
316