1.. SPDX-License-Identifier: BSD-3-Clause 2 Copyright(c) 2021 Intel Corporation. 3 4Toeplitz Hash Library 5===================== 6 7DPDK provides a Toeplitz Hash Library 8to calculate the Toeplitz hash function and to use its properties. 9The Toeplitz hash function is commonly used in a wide range of NICs 10to calculate the RSS hash sum to spread the traffic among the queues. 11 12.. _figure_rss_queue_assign: 13 14.. figure:: img/rss_queue_assign.* 15 16 RSS queue assignment example 17 18 19Toeplitz hash function API 20-------------------------- 21 22There are four functions that provide calculation of the Toeplitz hash sum: 23 24* ``rte_softrss()`` 25* ``rte_softrss_be()`` 26* ``rte_thash_gfni()`` 27* ``rte_thash_gfni_bulk()`` 28 29First two functions are scalar implementation and take the parameters: 30 31* A pointer to the tuple, containing fields extracted from the packet. 32* A length of this tuple counted in double words. 33* A pointer to the RSS hash key corresponding to the one installed on the NIC. 34 35Both of above mentioned _softrss_ functions expect the tuple to be in 36"host" byte order and a multiple of 4 bytes in length. 37The ``rte_softrss()`` function expects the ``rss_key`` 38to be exactly the same as the one installed on the NIC. 39The ``rte_softrss_be`` function is a faster implementation, 40but it expects ``rss_key`` to be converted to the host byte order. 41 42The last two functions are vectorized implementations using 43Galois Fields New Instructions. Could be used if ``rte_thash_gfni_supported`` is true. 44They expect the tuple to be in network byte order. 45 46``rte_thash_gfni()`` calculates the hash value for a single tuple, and 47``rte_thash_gfni_bulk()`` bulk implementation of the rte_thash_gfni(). 48 49``rte_thash_gfni()`` takes the parameters: 50 51* A pointer to the matrices derived from the RSS hash key using ``rte_thash_complete_matrix()``. 52* A pointer to the tuple. 53* A length of the tuple in bytes. 54 55``rte_thash_gfni_bulk()`` takes the parameters: 56 57* A pointer to the matrices derived from the RSS hash key using ``rte_thash_complete_matrix()``. 58* A length of the longest tuple in bytes. 59* Array of the pointers on data to be hashed. 60* Array of ``uint32_t`` where to put calculated Toeplitz hash values 61* Number of tuples in a bulk. 62 63``rte_thash_complete_matrix()`` is a function that calculates matrices required by 64GFNI implementations from the RSS hash key. It takes the parameters: 65 66* A pointer to the memory where the matrices will be written. 67* A pointer to the RSS hash key. 68* Length of the RSS hash key in bytes. 69 70 71Predictable RSS 72--------------- 73 74In some use cases it is useful to have a way to find partial collisions of the 75Toeplitz hash function. In figure :numref:`figure_rss_queue_assign` only a few 76of the least significant bits (LSB) of the hash value are used to indicate an 77entry in the RSS Redirection Table (ReTa) and thus the index of the queue. So, 78in this case it would be useful to find another tuple whose hash has the same 79LSB's as the hash from the original tuple. 80 81For example: 82 83- In the case of SNAT (Source Network Address Translation) it is possible to 84 find a special source port number on translation so that the hash of 85 returning packets, of the given connection, will have desired LSB's. 86- In the case of MPLS (Multiprotocol Label Switching), if the MPLS tag is used 87 in the hash calculation, the Label Switching router can allocate a special 88 MPLS tag to bind an LSP (Label Switching Path) to a given queue. This method 89 can be used with the allocation of IPSec SPI, VXLan VNI, etc., to bind the 90 tunnel to the desired queue. 91- In the case of a TCP stack, a special source port could be chosen for 92 outgoing connections, such that the response packets will be assigned to the 93 desired queue. 94 95This functionality is provided by the API shown below. 96The API consists of 3 parts: 97 98* Create the thash context. 99 100* Create the thash helper, associated with a context. 101 102* Use the helper run time to calculate the adjustable bits of the tuple to 103 ensure a collision. 104 105 106Thash context 107~~~~~~~~~~~~~ 108 109The function ``rte_thash_init_ctx()`` initializes the context struct 110associated with a particular NIC or a set of NICs. It expects: 111 112* The log2 value of the size of the RSS redirection table for the 113 corresponding NIC. It reflects the number of least significant bits of the 114 hash value to produce a collision for. 115 116* A predefined RSS hash key. This is optional, if ``NULL`` then a random key 117 will be initialized. 118 119* The length of the RSS hash key. This value is usually hardware/driver 120 specific and can be found in the NIC datasheet. 121 122* Optional flags, as shown below. 123 124Supported flags: 125 126* ``RTE_THASH_IGNORE_PERIOD_OVERFLOW`` - By default, and for security reasons, 127 the library prohibits generating a repeatable sequence in the hash key. This 128 flag disables such checking. The flag is mainly used for testing in the lab 129 to generate an RSS hash key with a uniform hash distribution, if the input 130 traffic also has a uniform distribution. 131 132* ``RTE_THASH_MINIMAL_SEQ`` - By default, the library generates a special bit 133 sequence in the hash key for all the bits of the subtuple. However, the 134 collision generation task requires only the ``log2(RETA_SZ)`` bits in the 135 subtuple. This flag forces the minimum bit sequence in the hash key to be 136 generated for the required ``log2(RETA_SZ)`` least significant bits of the 137 subtuple. The flag can be used in the case of a relatively large number of 138 helpers that may overlap with their corresponding bit sequences of RSS hash 139 keys. 140 141 142Thash helper 143~~~~~~~~~~~~ 144 145The function ``rte_thash_add_helper()`` initializes the helper struct 146associated with a given context and a part of a target tuple of interest which 147could be altered to produce a hash collision. On success it writes a specially 148calculated bit sequence into the RSS hash key which is stored in the context 149and calculates a table with values to be XORed with a subtuple. 150 151It expects: 152 153* A pointer to the Thash context to be associated with. 154 155* A length of the subtuple to be modified. The length is counted in bits. 156 157* An offset of the subtuple to be modified from the beginning of the tuple. It 158 is also counted in bits. 159 160.. note:: 161 162 Adding a helper changes the key stored in the corresponding context. So the 163 updated RSS hash key must be uploaded into the NIC after creating all the 164 required helpers. 165 166 167Calculation of the complementary bits to adjust the subtuple 168~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ 169 170The ``rte_thash_get_complement()`` function returns a special bit sequence 171with length ``N = log2(rss_reta_sz)`` (for the ``rss_reta_sz`` provided at 172context initialization) to be xored with N least significant bits of the 173subtuple. 174 175It expects: 176 177* A corresponding helper created for a given subtuple of the tuple. 178 179* A hash value of the tuple we want to alter. 180 181* The desired LSB's of the hash value the user expects to have. 182 183After the returned bit sequence has been XORed with the subtuple, the resulted 184LSB's of the new hash value, calculated from the altered tuple, will be the 185same as in ``desired_hash``. 186 187 188Adjust tuple API 189~~~~~~~~~~~~~~~~~ 190 191The ``rte_thash_get_complement()`` function is a user-friendly wrapper around 192a number of other functions. It alters a passed tuple to meet the above 193mentioned requirements around the desired hash LSB's. 194 195It expects: 196 197* A Thash context and helper. 198 199* A pointer to the tuple to be changed. 200 201* The length of the tuple. 202 203* A callback function and its userdata to check the tuple after it has been 204 changed. 205 206* The number of attempts to change the tuple. Basically, it makes sense if 207 there is a callback and a limit on the number of attempts to change the 208 tuple, if the callback function returns an error. 209 210 211Use case example 212---------------- 213 214There could be a number of different use cases, such as NAT, TCP stack, MPLS 215tag allocation, etc. In the following we will consider a SNAT application. 216 217Packets of a single bidirectional flow belonging to different directions can 218end up being assigned to different queues and thus processed by different 219lcores, as shown in :numref:`figure_predictable_snat_1`: 220 221.. _figure_predictable_snat_1: 222 223.. figure:: img/predictable_snat_1.* 224 225 Bidirectional flow packets distribution in general 226 227That leads to a situation where the same packet flow can be shared between two 228cores. Such a situation is not ideal from a performance perspective and 229requires extra synchronization efforts that might lead to various performance 230penalties, for example: 231 232* The connections table is global so locking/RCU on the flow insertion/removal 233 is required. 234 235* Connection metadata must be protected to avoid race conditions. 236 237* More cache pressure if a single connection metadata is kept in different 238 L1/L2 caches of a different CPU core. 239 240* Cache pressure/less cache locality on packet handover to the different cores. 241 242We can avoid all these penalties if it can be guaranteed that packets 243belonging to one bidirectional flow will be assigned to the same queue, as 244shown in :numref:`figure_predictable_snat_2`: 245 246.. _figure_predictable_snat_2: 247 248.. figure:: img/predictable_snat_2.* 249 250 Bidirectional flow packets distribution with predictable RSS 251 252 253To achieve this in a SNAT scenario it is possible to choose a source port not 254randomly, but using the predictable RSS library to produce a partial hash 255collision. This is shown in the code below. 256 257.. code-block:: c 258 259 int key_len = 40; /* The default Niantic RSS key length. */ 260 261 /** The default Niantic RSS reta size = 2^7 entries, LSBs of hash value are 262 * used as an indexes in RSS ReTa. */ 263 int reta_sz = 7; 264 int ret; 265 struct rte_thash_ctx *ctx; 266 267 uint8_t initial_key[key_len] = {0}; /* Default empty key. */ 268 269 /* Create and initialize a new thash context. */ 270 ctx = rte_thash_init_ctx("SNAT", key_len, reta_sz, initial_key, 0); 271 272 /** Add a helper and specify the variable tuple part and its length. In the 273 * SNAT case we want to choose a new source port on SNAT translation in a 274 * way that the reverse tuple will have the same LSBs as the original 275 * direction tuple so that the selected source port will be the 276 * destination port on reply. 277 */ 278 ret = rte_thash_add_helper(ctx, "snat", sizeof(uint16_t) * 8, 279 offsetof(union rte_thash_tuple, v4.dport) * 8); 280 281 if (ret != 0) 282 return ret; 283 284 /* Get handler of the required helper. */ 285 struct rte_thash_subtuple_helper *h = rte_thash_get_helper(ctx, "snat"); 286 287 /** After calling rte_thash_add_helper() the initial_key passed on ctx 288 * creation has been changed so we get the new one. 289 */ 290 uint8_t *new_key = rte_thash_get_key(ctx); 291 292 union rte_thash_tuple tuple, rev_tuple; 293 294 /* A complete tuple from the packet. */ 295 complete_tuple(mbuf, &tuple); 296 297 /* Calculate the RSS hash or get it from mbuf->hash.rss. */ 298 uint32_t orig_hash = rte_softrss((uint32_t *)&tuple, RTE_THASH_V4_L4_LEN, new_key); 299 300 /** Complete the reverse tuple by translating the SRC address and swapping 301 * src and dst addresses and ports. 302 */ 303 get_rev_tuple(&rev_tuple, &tuple, new_ip); 304 305 /* Calculate the expected rss hash for the reverse tuple. */ 306 uint32_t rev_hash = rte_softrss((uint32_t *)&rev_tuple, RTE_THASH_V4_L4_LEN, new_key); 307 308 /* Get the adjustment bits for the src port to get a new port. */ 309 uint32_t adj = rte_thash_get_compliment(h, rev_hash, orig_hash); 310 311 /* Adjust the source port bits. */ 312 uint16_t new_sport = tuple.v4.sport ^ adj; 313 314 /* Make an actual packet translation. */ 315 do_snat(mbuf, new_ip, new_sport); 316