1.. SPDX-License-Identifier: BSD-3-Clause 2 Copyright(c) 2021 Intel Corporation. 3 4Toeplitz Hash Library 5===================== 6 7DPDK provides a Toeplitz Hash Library 8to calculate the Toeplitz hash function and to use its properties. 9The Toeplitz hash function is commonly used in a wide range of NICs 10to calculate the RSS hash sum to spread the traffic among the queues. 11 12.. _figure_rss_queue_assign: 13 14.. figure:: img/rss_queue_assign.* 15 16 RSS queue assignment example 17 18 19Toeplitz hash function API 20-------------------------- 21 22There are two functions that provide calculation of the Toeplitz hash sum: 23 24* ``rte_softrss()`` 25* ``rte_softrss_be()`` 26 27Both of these functions take the parameters: 28 29* A pointer to the tuple, containing fields extracted from the packet. 30* A length of this tuple counted in double words. 31* A pointer to the RSS hash key corresponding to the one installed on the NIC. 32 33Both functions expect the tuple to be in "host" byte order 34and a multiple of 4 bytes in length. 35The ``rte_softrss()`` function expects the ``rss_key`` 36to be exactly the same as the one installed on the NIC. 37The ``rte_softrss_be`` function is a faster implementation, 38but it expects ``rss_key`` to be converted to the host byte order. 39 40 41Predictable RSS 42--------------- 43 44In some use cases it is useful to have a way to find partial collisions of the 45Toeplitz hash function. In figure :numref:`figure_rss_queue_assign` only a few 46of the least significant bits (LSB) of the hash value are used to indicate an 47entry in the RSS Redirection Table (ReTa) and thus the index of the queue. So, 48in this case it would be useful to find another tuple whose hash has the same 49LSB's as the hash from the original tuple. 50 51For example: 52 53- In the case of SNAT (Source Network Address Translation) it is possible to 54 find a special source port number on translation so that the hash of 55 returning packets, of the given connection, will have desired LSB's. 56- In the case of MPLS (Multiprotocol Label Switching), if the MPLS tag is used 57 in the hash calculation, the Label Switching router can allocate a special 58 MPLS tag to bind an LSP (Label Switching Path) to a given queue. This method 59 can be used with the allocation of IPSec SPI, VXLan VNI, etc., to bind the 60 tunnel to the desired queue. 61- In the case of a TCP stack, a special source port could be chosen for 62 outgoing connections, such that the response packets will be assigned to the 63 desired queue. 64 65This functionality is provided by the API shown below. 66The API consists of 3 parts: 67 68* Create the thash context. 69 70* Create the thash helper, associated with a context. 71 72* Use the helper run time to calculate the adjustable bits of the tuple to 73 ensure a collision. 74 75 76Thash context 77~~~~~~~~~~~~~ 78 79The function ``rte_thash_init_ctx()`` initializes the context struct 80associated with a particular NIC or a set of NICs. It expects: 81 82* The log2 value of the size of the RSS redirection table for the 83 corresponding NIC. It reflects the number of least significant bits of the 84 hash value to produce a collision for. 85 86* A predefined RSS hash key. This is optional, if ``NULL`` then a random key 87 will be initialized. 88 89* The length of the RSS hash key. This value is usually hardware/driver 90 specific and can be found in the NIC datasheet. 91 92* Optional flags, as shown below. 93 94Supported flags: 95 96* ``RTE_THASH_IGNORE_PERIOD_OVERFLOW`` - By default, and for security reasons, 97 the library prohibits generating a repeatable sequence in the hash key. This 98 flag disables such checking. The flag is mainly used for testing in the lab 99 to generate an RSS hash key with a uniform hash distribution, if the input 100 traffic also has a uniform distribution. 101 102* ``RTE_THASH_MINIMAL_SEQ`` - By default, the library generates a special bit 103 sequence in the hash key for all the bits of the subtuple. However, the 104 collision generation task requires only the ``log2(RETA_SZ)`` bits in the 105 subtuple. This flag forces the minimum bit sequence in the hash key to be 106 generated for the required ``log2(RETA_SZ)`` least significant bits of the 107 subtuple. The flag can be used in the case of a relatively large number of 108 helpers that may overlap with their corresponding bit sequences of RSS hash 109 keys. 110 111 112Thash helper 113~~~~~~~~~~~~ 114 115The function ``rte_thash_add_helper()`` initializes the helper struct 116associated with a given context and a part of a target tuple of interest which 117could be altered to produce a hash collision. On success it writes a specially 118calculated bit sequence into the RSS hash key which is stored in the context 119and calculates a table with values to be XORed with a subtuple. 120 121It expects: 122 123* A pointer to the Thash context to be associated with. 124 125* A length of the subtuple to be modified. The length is counted in bits. 126 127* An offset of the subtuple to be modified from the beginning of the tuple. It 128 is also counted in bits. 129 130.. note:: 131 132 Adding a helper changes the key stored in the corresponding context. So the 133 updated RSS hash key must be uploaded into the NIC after creating all the 134 required helpers. 135 136 137Calculation of the complementary bits to adjust the subtuple 138~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ 139 140The ``rte_thash_get_complement()`` function returns a special bit sequence 141with length ``N = log2(rss_reta_sz)`` (for the ``rss_reta_sz`` provided at 142context initialization) to be xored with N least significant bits of the 143subtuple. 144 145It expects: 146 147* A corresponding helper created for a given subtuple of the tuple. 148 149* A hash value of the tuple we want to alter. 150 151* The desired LSB's of the hash value the user expects to have. 152 153After the returned bit sequence has been XORed with the subtuple, the resulted 154LSB's of the new hash value, calculated from the altered tuple, will be the 155same as in ``desired_hash``. 156 157 158Adjust tuple API 159~~~~~~~~~~~~~~~~~ 160 161The ``rte_thash_get_complement()`` function is a user-friendly wrapper around 162a number of other functions. It alters a passed tuple to meet the above 163mentioned requirements around the desired hash LSB's. 164 165It expects: 166 167* A Thash context and helper. 168 169* A pointer to the tuple to be changed. 170 171* The length of the tuple. 172 173* A callback function and its userdata to check the tuple after it has been 174 changed. 175 176* The number of attempts to change the tuple. Basically, it makes sense if 177 there is a callback and a limit on the number of attempts to change the 178 tuple, if the callback function returns an error. 179 180 181Use case example 182---------------- 183 184There could be a number of different use cases, such as NAT, TCP stack, MPLS 185tag allocation, etc. In the following we will consider a SNAT application. 186 187Packets of a single bidirectional flow belonging to different directions can 188end up being assigned to different queues and thus processed by different 189lcores, as shown in :numref:`figure_predictable_snat_1`: 190 191.. _figure_predictable_snat_1: 192 193.. figure:: img/predictable_snat_1.* 194 195 Bidirectional flow packets distribution in general 196 197That leads to a situation where the same packet flow can be shared between two 198cores. Such a situation is not ideal from a performance perspective and 199requires extra synchronization efforts that might lead to various performance 200penalties, for example: 201 202* The connections table is global so locking/RCU on the flow insertion/removal 203 is required. 204 205* Connection metadata must be protected to avoid race conditions. 206 207* More cache pressure if a single connection metadata is kept in different 208 L1/L2 caches of a different CPU core. 209 210* Cache pressure/less cache locality on packet handover to the different cores. 211 212We can avoid all these penalties if it can be guaranteed that packets 213belonging to one bidirectional flow will be assigned to the same queue, as 214shown in :numref:`figure_predictable_snat_2`: 215 216.. _figure_predictable_snat_2: 217 218.. figure:: img/predictable_snat_2.* 219 220 Bidirectional flow packets distribution with predictable RSS 221 222 223To achieve this in a SNAT scenario it is possible to choose a source port not 224randomly, but using the predictable RSS library to produce a partial hash 225collision. This is shown in the code below. 226 227.. code-block:: c 228 229 int key_len = 40; /* The default Niantic RSS key length. */ 230 231 /** The default Niantic RSS reta size = 2^7 entries, LSBs of hash value are 232 * used as an indexes in RSS ReTa. */ 233 int reta_sz = 7; 234 int ret; 235 struct rte_thash_ctx *ctx; 236 237 uint8_t initial_key[key_len] = {0}; /* Default empty key. */ 238 239 /* Create and initialize a new thash context. */ 240 ctx = rte_thash_init_ctx("SNAT", key_len, reta_sz, initial_key, 0); 241 242 /** Add a helper and specify the variable tuple part and its length. In the 243 * SNAT case we want to choose a new source port on SNAT translation in a 244 * way that the reverse tuple will have the same LSBs as the original 245 * direction tuple so that the selected source port will be the 246 * destination port on reply. 247 */ 248 ret = rte_thash_add_helper(ctx, "snat", sizeof(uint16_t) * 8, 249 offsetof(union rte_thash_tuple, v4.dport) * 8); 250 251 if (ret != 0) 252 return ret; 253 254 /* Get handler of the required helper. */ 255 struct rte_thash_subtuple_helper *h = rte_thash_get_helper(ctx, "snat"); 256 257 /** After calling rte_thash_add_helper() the initial_key passed on ctx 258 * creation has been changed so we get the new one. 259 */ 260 uint8_t *new_key = rte_thash_get_key(ctx); 261 262 union rte_thash_tuple tuple, rev_tuple; 263 264 /* A complete tuple from the packet. */ 265 complete_tuple(mbuf, &tuple); 266 267 /* Calculate the RSS hash or get it from mbuf->hash.rss. */ 268 uint32_t orig_hash = rte_softrss((uint32_t *)&tuple, RTE_THASH_V4_L4_LEN, new_key); 269 270 /** Complete the reverse tuple by translating the SRC address and swapping 271 * src and dst addresses and ports. 272 */ 273 get_rev_tuple(&rev_tuple, &tuple, new_ip); 274 275 /* Calculate the expected rss hash for the reverse tuple. */ 276 uint32_t rev_hash = rte_softrss((uint32_t *)&rev_tuple, RTE_THASH_V4_L4_LEN, new_key); 277 278 /* Get the adjustment bits for the src port to get a new port. */ 279 uint32_t adj = rte_thash_get_compliment(h, rev_hash, orig_hash); 280 281 /* Adjust the source port bits. */ 282 uint16_t new_sport = tuple.v4.sport ^ adj; 283 284 /* Make an actual packet translation. */ 285 do_snat(mbuf, new_ip, new_sport); 286