..  SPDX-License-Identifier: BSD-3-Clause
    Copyright(c) 2016-2017 Intel Corporation.

Elastic Flow Distributor (EFD) Library
======================================

Introduction
------------

In data centers today, clustering and scheduling of distributed workloads
is a very common task. Many workloads require a deterministic
partitioning of a flat key space among a cluster of machines. When a
packet enters the cluster, the ingress node will direct the packet to
its handling node. For example, data centers with disaggregated storage
use storage metadata tables to forward I/O requests to the correct back end
storage cluster, stateful packet inspection will match incoming
flows against signatures in flow tables to send incoming packets to their
intended deep packet inspection (DPI) devices, and so on.

EFD is a distributor library that uses perfect hashing to determine a
target/value for a given incoming flow key. It has the following
advantages: first, because it uses perfect hashing it does not store the
key itself and hence lookup performance is not dependent on the key
size. Second, the target/value can be any arbitrary value, hence the
system designer and/or operator can better optimize service rates and
inter-cluster network traffic locating. Third, since the storage
requirement is much smaller than a hash-based flow table (i.e. a better
fit for the CPU cache), EFD can scale to millions of flow keys. Finally,
with the current optimized library implementation, performance is fully
scalable with any number of CPU cores.

Flow Based Distribution
-----------------------

Computation Based Schemes
~~~~~~~~~~~~~~~~~~~~~~~~~

Flow distribution and/or load balancing can be simply done using a
stateless computation, for instance using round-robin or a simple
computation based on the flow key as an input.
For example, a hash
function can be used to direct a certain flow to a target based on
the flow key (e.g. ``h(key) mod n``), where h(key) is the hash value of the
flow key and n is the number of possible targets.

.. _figure_efd1:

.. figure:: img/efd_i1.*

   Load Balancing Using Front End Node

In this scheme (:numref:`figure_efd1`), the front end server/distributor/load balancer
extracts the flow key from the input packet and applies a computation to determine where
this flow should be directed. Intuitively, this scheme is very simple
and requires no state to be kept at the front end node, and hence,
storage requirements are minimal.

.. _figure_efd2:

.. figure:: img/efd_i2.*

   Consistent Hashing

A widely used flow distributor that belongs to the same category of
computation-based schemes is ``consistent hashing``, shown in :numref:`figure_efd2`.
Target destinations (shown in red) are hashed into the same space as the flow
keys (shown in blue), and keys are mapped to the nearest target in a clockwise
fashion. Dynamically adding and removing targets with consistent hashing
requires only K/n keys to be remapped on average, where K is the number of
keys, and n is the number of targets. In contrast, in a traditional hash-based
scheme, a change in the number of targets causes nearly all keys to be
remapped.

Although computation-based schemes are simple and have very small
storage requirements, they suffer from the drawback that the system
designer/operator cannot fully control which target is assigned to a specific
key, as this is dictated by the hash function.
Deterministically co-locating keys (for example, to minimize
inter-server traffic or to optimize for network traffic conditions,
target load, etc.) is simply not possible.
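
Both computation-based schemes above can be sketched in a few lines of plain C.
This is a minimal illustration only: the ``toy_hash`` mixer is a made-up
stand-in for a real hash function such as CRC, and the ring/id arrays are
hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Toy 32-bit mixer standing in for a real hash function such as CRC. */
static uint32_t toy_hash(uint32_t key)
{
    key ^= key >> 16;
    key *= 0x45d9f3bu;
    key ^= key >> 16;
    return key;
}

/* Simple computation-based scheme: target = h(key) mod n. */
static unsigned int mod_n_target(uint32_t key, unsigned int n)
{
    return toy_hash(key) % n;
}

/*
 * Consistent hashing: target points live in the same 32-bit space as the
 * key hashes; a key maps to the nearest target point clockwise.
 * ring[] holds the sorted target points, ids[] the target id at each point.
 */
static unsigned int consistent_target(uint32_t key, const uint32_t *ring,
                                      const unsigned int *ids, size_t n)
{
    uint32_t h = toy_hash(key);
    size_t i;

    for (i = 0; i < n; i++)
        if (ring[i] >= h)
            return ids[i];
    return ids[0]; /* wrapped past the last point: first target on the ring */
}
```

A linear scan over the sorted ring is enough for illustration; a production
implementation would use a binary search over the target points.
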

Flow-Table Based Schemes
~~~~~~~~~~~~~~~~~~~~~~~~

When using a flow-table based scheme to handle flow distribution/load
balancing, in contrast with computation-based schemes, the system designer
has the flexibility of assigning a given flow to any given
target. The flow table (e.g. DPDK RTE Hash Library) will simply store
both the flow key and the target value.

.. _figure_efd3:

.. figure:: img/efd_i3.*

   Table Based Flow Distribution

As shown in :numref:`figure_efd3`, when doing a lookup, the flow table
is indexed with the hash of the flow key, and the keys stored at this index
(more than one is possible, because of hash collisions) and their corresponding
values are retrieved. The retrieved key(s) is matched with the input flow key,
and if there is a match the value (target id) is returned.

The drawback of using a hash table for flow distribution/load balancing
is the storage requirement, since the flow table needs to store keys,
signatures and target values. This prevents the scheme from scaling to
millions of flow keys. Large tables will usually not fit in
the CPU cache, and hence, lookup performance is degraded because of
the latency of accessing main memory.

EFD Based Scheme
~~~~~~~~~~~~~~~~

EFD combines the advantages of both flow-table based and computation-based
schemes. It doesn't require the large storage necessary for
flow-table based schemes (because EFD doesn't store the key, as explained
below), and it supports any arbitrary value for any given key.

.. _figure_efd4:

.. figure:: img/efd_i4.*

   Searching for Perfect Hash Function

The basic idea of EFD is that when a given key is to be inserted, a family of
hash functions is searched until the correct hash function that maps the
input key to the correct value is found, as shown in :numref:`figure_efd4`.
However, rather than explicitly storing all keys and their associated values,
EFD stores only the indices of the hash functions that map keys to values, and
thereby consumes much less space than conventional flow-based tables.
The lookup operation is very simple, similar to a computation-based
scheme: given an input key, the lookup operation is reduced to hashing
that key with the correct hash function.

.. _figure_efd5:

.. figure:: img/efd_i5.*

   Divide and Conquer for Millions of Keys

Intuitively, finding a hash function that maps each of a large number
(millions) of input keys to the correct output value is effectively
impossible. As a result, EFD, as shown in :numref:`figure_efd5`,
breaks the problem into smaller pieces (divide and conquer).
EFD divides the entire input key set into many small groups.
Each group consists of approximately 20-28 keys (a configurable parameter
for the library); then, for each small group, a brute force search is performed
to find a hash function that produces the correct output for each key in the group.

It should be mentioned that, since the online lookup table for EFD
doesn't store the key itself, the size of the EFD table is independent
of the key size. Hence, EFD lookup performance is almost
constant irrespective of the length of the key, which is a highly
desirable feature, especially for longer keys.

In summary, EFD is a set separation data structure that supports millions of
keys. It is used to distribute a given key to an intended target. By itself
EFD is not a FIB data structure with an exact match of the input flow key.

.. _Efd_example:

Example of EFD Library Usage
----------------------------

EFD can be used along the data path of many network functions and middleboxes.
As previously mentioned, it can be used as an index table for
<key,value> pairs, meta-data for objects, a flow-level load balancer, etc.
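
In code terms, the lookup side just described (a group id derived from the key,
then the group's chosen hash function applied to the key) can be modeled with a
toy sketch in plain C. This is not the DPDK API: ``mix``,
``group_hash_index`` and the table sizes are illustrative assumptions.

```c
#include <assert.h>
#include <stdint.h>

#define NUM_GROUPS  4 /* toy sizes; the real library uses many groups */
#define NUM_TARGETS 8

/* Toy seeded mixer standing in for a family of hash functions. */
static uint32_t mix(uint32_t key, uint32_t seed)
{
    uint32_t h = key * 0x9e3779b1u + seed;

    h ^= h >> 15;
    h *= 0x85ebca6bu;
    h ^= h >> 13;
    return h;
}

/*
 * Per-group hash-function index, found at insert time by brute force.
 * This is the only state the online table needs: no keys are stored.
 * The values here are arbitrary placeholders for the illustration.
 */
static const uint32_t group_hash_index[NUM_GROUPS] = {3, 17, 5, 42};

static unsigned int efd_style_lookup(uint32_t key)
{
    /* 1. Map the key to a group (the library uses CRC bits for this). */
    uint32_t group = mix(key, 0) % NUM_GROUPS;

    /* 2. Hash the key with the group's chosen hash function. */
    return mix(key, group_hash_index[group]) % NUM_TARGETS;
}
```

Note that the lookup touches only the small per-group index array, which is
why the online table stays cache resident regardless of key length.
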
:numref:`figure_efd6` shows an example of using EFD as a flow-level load
balancer, where flows are received at a front end server before being forwarded
to the target back end server for processing. The system designer would
deterministically co-locate flows together in order to minimize cross-server
interaction.
(For example, flows requesting certain webpage objects are co-located
together, to minimize forwarding of common objects across servers.)

.. _figure_efd6:

.. figure:: img/efd_i6.*

   EFD as a Flow-Level Load Balancer

As shown in :numref:`figure_efd6`, the front end server will have an EFD table that
stores, for each group, the perfect hash index that satisfies the
correct output. Because the table size is small and fits in cache (since
keys are not stored), it sustains a large number of flows (N*X, where N
is the maximum number of flows served by each back end server of the X
possible targets).

With an input flow key, the group id is computed (for example, using the
last few bits of a CRC hash) and then the EFD table is indexed with the
group id to retrieve the corresponding hash index to use. Once the index
is retrieved, the key is hashed using this hash function and the result
will be the intended correct target where this flow is supposed to be
processed.

It should be noted that, as a result of EFD not matching the exact key but
rather distributing flows to a target back end node based on the
perfect hash index, a key that has not been inserted before
will still be distributed to a valid target. Hence, a local table which stores
the flows served at each node is used and is
exact matched with the input key to rule out new, never-seen-before
flows.

.. _Efd_api:

Library API Overview
--------------------

The EFD library API is created with very similar semantics to a
hash-index or a flow table.
The application creates an EFD table for a
given maximum number of flows, a function is called to insert a flow key
with a specific target value, and another function is used to retrieve
target values for a given individual flow key or a bulk of keys.

EFD Table Create
~~~~~~~~~~~~~~~~

The function ``rte_efd_create()`` is used to create and return a pointer
to an EFD table that is sized to hold up to num_flows keys.
The online version of the EFD table (the one that does
not store the keys and is used for lookups) will be allocated and
created in the last level cache (LLC) of the socket defined by the
online_socket_bitmask, while the offline EFD table (the one that
stores the keys and is used for key inserts and for computing the
perfect hashing) is allocated and created in the LLC of the socket
defined by offline_socket_bitmask. It should be noted that, for the
highest performance, the socket id should match that where the thread is
running, i.e. the online EFD lookup table should be created on the same
socket as where the lookup thread is running.

EFD Insert and Update
~~~~~~~~~~~~~~~~~~~~~

The EFD function to insert a key or update a key to a new value is
``rte_efd_update()``. This function will update an existing key to
a new value (target) if the key has already been inserted
before, or will insert the <key,value> pair if this key has not been inserted
before. It will return 0 upon success. It will return
``EFD_UPDATE_WARN_GROUP_FULL (1)`` if the operation is insert, and the
last available space in the key's group was just used. It will return
``EFD_UPDATE_FAILED (2)`` when the insertion or update has failed (either it
failed to find a suitable perfect hash or the group was full). The function
will return ``EFD_UPDATE_NO_CHANGE (3)`` if there is no change to the EFD
table (i.e. the same value already exists).

.. Note::

   This function is not multi-thread safe and should only be called
   from one thread.

EFD Lookup
~~~~~~~~~~

To look up a certain key in an EFD table, the function ``rte_efd_lookup()``
is used to return the value associated with a single key.
As previously mentioned, if the key has been inserted, the correct value
inserted is returned; if the key has not been inserted before,
a 'random' value (based on hashing of the key) is returned.
For better performance and to decrease the overhead of
function calls per key, it is always recommended to use a bulk lookup
function (simultaneous lookup of multiple keys) instead of a single key
lookup function. ``rte_efd_lookup_bulk()`` is the bulk lookup function
that looks up num_keys simultaneously stored in the key_list, and the
corresponding return values will be returned in the value_list.

.. Note::

   This function is multi-thread safe, but there should not be other threads
   writing in the EFD table, unless locks are used.

EFD Delete
~~~~~~~~~~

To delete a certain key in an EFD table, the function
``rte_efd_delete()`` can be used. The function returns zero upon success
when the key has been found and deleted. socket_id is the parameter to
use to look up the existing value, which is ideally the caller's socket id.
The previous value associated with this key will be returned
in the prev_value argument.

.. Note::

   This function is not multi-thread safe and should only be called
   from one thread.

.. _Efd_internals:

Library Internals
-----------------

This section provides a brief high-level overview
of the library internals to accompany the RFC. The intent of this
section is to explain to readers the high-level implementation of
insert, lookup and group rebalancing in the EFD library.
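
The per-group insert and lookup mechanics covered in this section can be
sketched in plain C. This is a simplified model, not the library's code:
``mix`` is a made-up stand-in for the seeded hash functions, while the 16-bit
lookup table and the ``h(key, seed1) + index * h(key, seed2)`` form follow the
description in this section.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Toy seeded hash standing in for the library's hash-function family. */
static uint32_t mix(uint32_t key, uint32_t seed)
{
    uint32_t h = key * 0x9e3779b1u + seed;

    h ^= h >> 15;
    h *= 0x85ebca6bu;
    h ^= h >> 13;
    return h;
}

/* Bit position of a key in the 16-entry lookup table for a given index. */
static unsigned int bit_pos(uint32_t key, uint32_t index)
{
    return (mix(key, 1) + index * mix(key, 2)) & 15u;
}

/*
 * Insert side: brute-force search for a hash index such that no two keys
 * with different target bits land on the same lookup-table position.
 * On success the 16-bit lookup table is written to *table_out and the
 * index is returned; -1 means no index worked for this group.
 */
static int find_hash_index(const uint32_t *keys, const uint8_t *bits,
                           size_t n, uint16_t *table_out)
{
    uint32_t index;

    for (index = 0; index < 65536u; index++) {
        uint16_t table = 0, used = 0;
        int ok = 1;
        size_t i;

        for (i = 0; i < n && ok; i++) {
            uint16_t m = (uint16_t)(1u << bit_pos(keys[i], index));

            if ((used & m) && (((table & m) != 0) != (bits[i] != 0)))
                ok = 0; /* conflicting target bits at this position */
            used |= m;
            if (bits[i])
                table |= m;
        }
        if (ok) {
            *table_out = table;
            return (int)index;
        }
    }
    return -1;
}

/* Lookup side: recompute the position and read the stored table bit. */
static int lookup_bit(uint32_t key, uint32_t index, uint16_t table)
{
    return (table >> bit_pos(key, index)) & 1;
}
```

The real library repeats this search once per bit of the target value and
stores only the winning index per group, which is what keeps the online
table so small.
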

Insert Function Internals
~~~~~~~~~~~~~~~~~~~~~~~~~

As previously mentioned, EFD divides the whole set of keys into
groups of a manageable size (e.g. 28 keys) and then searches for the
perfect hash that satisfies the intended target value for each key. EFD
stores two versions of the <key,value> table:

- Offline Version (in memory): Only used for the insertion/update
  operation, which is less frequent than the lookup operation. In the
  offline version the exact keys for each group are stored. When a new
  key is added, the hash function is updated so that it satisfies the
  value for the new key together with all the old keys already inserted
  in this group.

- Online Version (in cache): Used for the frequent lookup operation. In
  the online version, as previously mentioned, the keys are not stored
  but rather only the hash index for each group.

.. _figure_efd7:

.. figure:: img/efd_i7.*

   Group Assignment

:numref:`figure_efd7` depicts the group assignment for 7 flow keys as an example.
Given a flow key, a hash function (in our implementation CRC hash) is
used to get the group id. As shown in the figure, the groups can be
unbalanced. (We highlight group rebalancing further below.)

.. _figure_efd8:

.. figure:: img/efd_i8.*

   Perfect Hash Search - Assigned Keys & Target Value

Focusing on one group that has four keys, :numref:`figure_efd8` depicts the search
algorithm to find the perfect hash function. Assuming that the target
value bit for the keys is as shown in the figure, then the online EFD
table will store a 16-bit hash index and a 16-bit lookup table per group
per value bit.

.. _figure_efd9:

.. figure:: img/efd_i9.*

   Perfect Hash Search - Satisfy Target Values

For a given keyX, a hash function ``(h(keyX, seed1) + index * h(keyX, seed2))``
is used to point to a certain bit index in the 16-bit lookup_table value,
as shown in :numref:`figure_efd9`.
The insert function will brute force search all possible values of the
hash index until a non-conflicting lookup_table is found.

.. _figure_efd10:

.. figure:: img/efd_i10.*

   Finding Hash Index for Conflict Free lookup_table

For example, since both key3 and key7 have a target bit value of 1, it
is okay if the hash functions of both keys point to the same bit in the
lookup table. A conflict will occur if a hash index is used that maps
both key4 and key7 to the same index in the lookup_table,
as shown in :numref:`figure_efd10`, since their target value bits are not the same.
Once a hash index is found that produces a lookup_table with no
contradictions, this index is stored for this group. This procedure is
repeated for each bit of the target value.

Lookup Function Internals
~~~~~~~~~~~~~~~~~~~~~~~~~

The design principle of EFD is that lookups are much more frequent than
inserts, and hence, EFD's design optimizes for the lookups, which are
faster and much simpler than the slower insert procedure (inserts are
slow because of the perfect hash search, as previously discussed).

.. _figure_efd11:

.. figure:: img/efd_i11.*

   EFD Lookup Operation

:numref:`figure_efd11` depicts the lookup operation for EFD. Given an input key,
the group id is computed (using CRC hash) and then the hash index for this
group is retrieved from the EFD table. Using the retrieved hash index,
the hash function ``h(key, seed1) + index * h(key, seed2)`` is used, which will
result in an index into the lookup_table; the bit corresponding to this
index will be the target value bit.
This procedure is repeated for each
bit of the target value.

Group Rebalancing Function Internals
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

When discussing EFD inserts and lookups, the discussion is simplified by
assuming that a group id is simply the result of a hash function. However,
since hashing in general is not perfect and will not always produce a
uniform output, this simplified assumption will lead to unbalanced
groups, i.e., some groups will have more keys than others.
Typically, and to minimize insert time with an increasing number of keys,
it is preferable that all groups have a balanced number of keys, so
the brute force search for the perfect hash terminates with a valid hash
index. In order to achieve this target, groups are rebalanced during
runtime inserts, and keys are moved from a busy group to a less
crowded group as more keys are inserted.

.. _figure_efd12:

.. figure:: img/efd_i12.*

   Runtime Group Rebalancing

:numref:`figure_efd12` depicts the high-level idea of group rebalancing. Given an
input key, the hash result is split into two parts: a chunk id and an 8-bit
bin id. A chunk contains 64 different groups and 256 bins (i.e. any
given bin can map to 4 distinct groups). When a key is inserted, the
bin id is computed, for example in :numref:`figure_efd12` bin_id=2,
and since each bin can be mapped to one of four different groups (2 bits of
storage), the four possible mappings are evaluated and the one that results in
a balanced key distribution across these four groups is selected; the mapping
result is stored in these two bits.


.. _Efd_references:

References
----------

1- EFD is based on collaborative research work between Intel and
Carnegie Mellon University (CMU). Interested readers can refer to the paper
"Scaling Up Clustered Network Appliances with ScaleBricks" by Dong Zhou et al.
at SIGCOMM 2015 (`http://conferences.sigcomm.org/sigcomm/2015/pdf/papers/p241.pdf`)
for more information.