1.. BSD LICENSE 2 Copyright(c) 2016-2017 Intel Corporation. All rights reserved. 3 All rights reserved. 4 5 Redistribution and use in source and binary forms, with or without 6 modification, are permitted provided that the following conditions 7 are met: 8 9 * Redistributions of source code must retain the above copyright 10 notice, this list of conditions and the following disclaimer. 11 * Redistributions in binary form must reproduce the above copyright 12 notice, this list of conditions and the following disclaimer in 13 the documentation and/or other materials provided with the 14 distribution. 15 * Neither the name of Intel Corporation nor the names of its 16 contributors may be used to endorse or promote products derived 17 from this software without specific prior written permission. 18 19 THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS 20 "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT 21 LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR 22 A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT 23 OWNER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, 24 SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT 25 LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, 26 DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY 27 THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT 28 (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE 29 OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. 30 31.. _Efd_Library: 32 33Elastic Flow Distributor Library 34================================ 35 36Introduction 37------------ 38 39In Data Centers today, clustering and scheduling of distributed workloads 40is a very common task. Many workloads require a deterministic 41partitioning of a flat key space among a cluster of machines. When a 42packet enters the cluster, the ingress node will direct the packet to 43its handling node. For example, data-centers with disaggregated storage 44use storage metadata tables to forward I/O requests to the correct back end 45storage cluster, stateful packet inspection will use match incoming 46flows to signatures in flow tables to send incoming packets to their 47intended deep packet inspection (DPI) devices, and so on. 48 49EFD is a distributor library that uses perfect hashing to determine a 50target/value for a given incoming flow key. It has the following 51advantages: first, because it uses perfect hashing it does not store the 52key itself and hence lookup performance is not dependent on the key 53size. Second, the target/value can be any arbitrary value hence the 54system designer and/or operator can better optimize service rates and 55inter-cluster network traffic locating. Third, since the storage 56requirement is much smaller than a hash-based flow table (i.e. better 57fit for CPU cache), EFD can scale to millions of flow keys. Finally, 58with the current optimized library implementation, performance is fully 59scalable with any number of CPU cores. 60 61Flow Based Distribution 62----------------------- 63 64Computation Based Schemes 65~~~~~~~~~~~~~~~~~~~~~~~~~ 66 67Flow distribution and/or load balancing can be simply done using a 68stateless computation, for instance using round-robin or a simple 69computation based on the flow key as an input. For example, a hash 70function can be used to direct a certain flow to a target based on 71the flow key (e.g. ``h(key) mod n``) where h(key) is the hash value of the 72flow key and n is the number of possible targets. 73 74.. _figure_efd1: 75 76.. figure:: img/efd_i1.* 77 78 Load Balancing Using Front End Node 79 80In this scheme (:numref:`figure_efd1`), the front end server/distributor/load balancer 81extracts the flow key from the input packet and applies a computation to determine where 82this flow should be directed. Intuitively, this scheme is very simple 83and requires no state to be kept at the front end node, and hence, 84storage requirements are minimum. 85 86.. _figure_efd2: 87 88.. figure:: img/efd_i2.* 89 90 Consistent Hashing 91 92A widely used flow distributor that belongs to the same category of 93computation-based schemes is ``consistent hashing``, shown in :numref:`figure_efd2`. 94Target destinations (shown in red) are hashed into the same space as the flow 95keys (shown in blue), and keys are mapped to the nearest target in a clockwise 96fashion. Dynamically adding and removing targets with consistent hashing 97requires only K/n keys to be remapped on average, where K is the number of 98keys, and n is the number of targets. In contrast, in a traditional hash-based 99scheme, a change in the number of targets causes nearly all keys to be 100remapped. 101 102Although computation-based schemes are simple and need very little 103storage requirement, they suffer from the drawback that the system 104designer/operator can’t fully control the target to assign a specific 105key, as this is dictated by the hash function. 106Deterministically co-locating of keys together (for example, to minimize 107inter-server traffic or to optimize for network traffic conditions, 108target load, etc.) is simply not possible. 109 110Flow-Table Based Schemes 111~~~~~~~~~~~~~~~~~~~~~~~~ 112 113When using a Flow-Table based scheme to handle flow distribution/load 114balancing, in contrast with computation-based schemes, the system designer 115has the flexibility of assigning a given flow to any given 116target. The flow table (e.g. DPDK RTE Hash Library) will simply store 117both the flow key and the target value. 118 119.. _figure_efd3: 120 121.. figure:: img/efd_i3.* 122 123 Table Based Flow Distribution 124 125As shown in :numref:`figure_efd3`, when doing a lookup, the flow-table 126is indexed with the hash of the flow key and the keys (more than one is possible, 127because of hash collision) stored in this index and corresponding values 128are retrieved. The retrieved key(s) is matched with the input flow key 129and if there is a match the value (target id) is returned. 130 131The drawback of using a hash table for flow distribution/load balancing 132is the storage requirement, since the flow table need to store keys, 133signatures and target values. This doesn't allow this scheme to scale to 134millions of flow keys. Large tables will usually not fit in 135the CPU cache, and hence, the lookup performance is degraded because of 136the latency to access the main memory. 137 138EFD Based Scheme 139~~~~~~~~~~~~~~~~ 140 141EFD combines the advantages of both flow-table based and computation-based 142schemes. It doesn't require the large storage necessary for 143flow-table based schemes (because EFD doesn't store the key as explained 144below), and it supports any arbitrary value for any given key. 145 146.. _figure_efd4: 147 148.. figure:: img/efd_i4.* 149 150 Searching for Perfect Hash Function 151 152The basic idea of EFD is when a given key is to be inserted, a family of 153hash functions is searched until the correct hash function that maps the 154input key to the correct value is found, as shown in :numref:`figure_efd4`. 155However, rather than explicitly storing all keys and their associated values, 156EFD stores only indices of hash functions that map keys to values, and 157thereby consumes much less space than conventional flow-based tables. 158The lookup operation is very simple, similar to a computational-based 159scheme: given an input key the lookup operation is reduced to hashing 160that key with the correct hash function. 161 162.. _figure_efd5: 163 164.. figure:: img/efd_i5.* 165 166 Divide and Conquer for Millions of Keys 167 168Intuitively, finding a hash function that maps each of a large number 169(millions) of input keys to the correct output value is effectively 170impossible, as a result EFD, as shown in :numref:`figure_efd5`, 171breaks the problem into smaller pieces (divide and conquer). 172EFD divides the entire input key set into many small groups. 173Each group consists of approximately 20-28 keys (a configurable parameter 174for the library), then, for each small group, a brute force search to find 175a hash function that produces the correct outputs for each key in the group. 176 177It should be mentioned that, since the online lookup table for EFD 178doesn't store the key itself, the size of the EFD table is independent 179of the key size and hence EFD lookup performance which is almost 180constant irrespective of the length of the key which is a highly 181desirable feature especially for longer keys. 182 183In summary, EFD is a set separation data structure that supports millions of 184keys. It is used to distribute a given key to an intended target. By itself 185EFD is not a FIB data structure with an exact match the input flow key. 186 187.. _Efd_example: 188 189Example of EFD Library Usage 190---------------------------- 191 192EFD can be used along the data path of many network functions and middleboxes. 193As previously mentioned, it can used as an index table for 194<key,value> pairs, meta-data for objects, a flow-level load balancer, etc. 195:numref:`figure_efd6` shows an example of using EFD as a flow-level load 196balancer, where flows are received at a front end server before being forwarded 197to the target back end server for processing. The system designer would 198deterministically co-locate flows together in order to minimize cross-server 199interaction. 200(For example, flows requesting certain webpage objects are co-located 201together, to minimize forwarding of common objects across servers). 202 203.. _figure_efd6: 204 205.. figure:: img/efd_i6.* 206 207 EFD as a Flow-Level Load Balancer 208 209As shown in :numref:`figure_efd6`, the front end server will have an EFD table that 210stores for each group what is the perfect hash index that satisfies the 211correct output. Because the table size is small and fits in cache (since 212keys are not stored), it sustains a large number of flows (N*X, where N 213is the maximum number of flows served by each back end server of the X 214possible targets). 215 216With an input flow key, the group id is computed (for example, using 217last few bits of CRC hash) and then the EFD table is indexed with the 218group id to retrieve the corresponding hash index to use. Once the index 219is retrieved the key is hashed using this hash function and the result 220will be the intended correct target where this flow is supposed to be 221processed. 222 223It should be noted that as a result of EFD not matching the exact key but 224rather distributing the flows to a target back end node based on the 225perfect hash index, a key that has not been inserted before 226will be distributed to a valid target. Hence, a local table which stores 227the flows served at each node is used and is 228exact matched with the input key to rule out new never seen before 229flows. 230 231.. _Efd_api: 232 233Library API Overview 234-------------------- 235 236The EFD library API is created with a very similar semantics of a 237hash-index or a flow table. The application creates an EFD table for a 238given maximum number of flows, a function is called to insert a flow key 239with a specific target value, and another function is used to retrieve 240target values for a given individual flow key or a bulk of keys. 241 242EFD Table Create 243~~~~~~~~~~~~~~~~ 244 245The function ``rte_efd_create()`` is used to create and return a pointer 246to an EFD table that is sized to hold up to num_flows key. 247The online version of the EFD table (the one that does 248not store the keys and is used for lookups) will be allocated and 249created in the last level cache (LLC) of the socket defined by the 250online_socket_bitmask, while the offline EFD table (the one that 251stores the keys and is used for key inserts and for computing the 252perfect hashing) is allocated and created in the LLC of the socket 253defined by offline_socket_bitmask. It should be noted, that for 254highest performance the socket id should match that where the thread is 255running, i.e. the online EFD lookup table should be created on the same 256socket as where the lookup thread is running. 257 258EFD Insert and Update 259~~~~~~~~~~~~~~~~~~~~~ 260 261The EFD function to insert a key or update a key to a new value is 262``rte_efd_update()``. This function will update an existing key to 263a new value (target) if the key has already been inserted 264before, or will insert the <key,value> pair if this key has not been inserted 265before. It will return 0 upon success. It will return 266``EFD_UPDATE_WARN_GROUP_FULL (1)`` if the operation is insert, and the 267last available space in the key's group was just used. It will return 268``EFD_UPDATE_FAILED (2)`` when the insertion or update has failed (either it 269failed to find a suitable perfect hash or the group was full). The function 270will return ``EFD_UPDATE_NO_CHANGE (3)`` if there is no change to the EFD 271table (i.e, same value already exists). 272 273.. Note:: 274 275 This function is not multi-thread safe and should only be called 276 from one thread. 277 278EFD Lookup 279~~~~~~~~~~ 280 281To lookup a certain key in an EFD table, the function ``rte_efd_lookup()`` 282is used to return the value associated with single key. 283As previously mentioned, if the key has been inserted, the correct value 284inserted is returned, if the key has not been inserted before, 285a ‘random’ value (based on hashing of the key) is returned. 286For better performance and to decrease the overhead of 287function calls per key, it is always recommended to use a bulk lookup 288function (simultaneous lookup of multiple keys) instead of a single key 289lookup function. ``rte_efd_lookup_bulk()`` is the bulk lookup function, 290that looks up num_keys simultaneously stored in the key_list and the 291corresponding return values will be returned in the value_list. 292 293.. Note:: 294 295 This function is multi-thread safe, but there should not be other threads 296 writing in the EFD table, unless locks are used. 297 298EFD Delete 299~~~~~~~~~~ 300 301To delete a certain key in an EFD table, the function 302``rte_efd_delete()`` can be used. The function returns zero upon success 303when the key has been found and deleted. Socket_id is the parameter to 304use to lookup the existing value, which is ideally the caller's socket id. 305The previous value associated with this key will be returned 306in the prev_value argument. 307 308.. Note:: 309 310 This function is not multi-thread safe and should only be called 311 from one thread. 312 313.. _Efd_internals: 314 315Library Internals 316----------------- 317 318This section provides the brief high-level idea and an overview 319of the library internals to accompany the RFC. The intent of this 320section is to explain to readers the high-level implementation of 321insert, lookup and group rebalancing in the EFD library. 322 323Insert Function Internals 324~~~~~~~~~~~~~~~~~~~~~~~~~ 325 326As previously mentioned the EFD divides the whole set of keys into 327groups of a manageable size (e.g. 28 keys) and then searches for the 328perfect hash that satisfies the intended target value for each key. EFD 329stores two version of the <key,value> table: 330 331- Offline Version (in memory): Only used for the insertion/update 332 operation, which is less frequent than the lookup operation. In the 333 offline version the exact keys for each group is stored. When a new 334 key is added, the hash function is updated that will satisfy the 335 value for the new key together with the all old keys already inserted 336 in this group. 337 338- Online Version (in cache): Used for the frequent lookup operation. In 339 the online version, as previously mentioned, the keys are not stored 340 but rather only the hash index for each group. 341 342.. _figure_efd7: 343 344.. figure:: img/efd_i7.* 345 346 Group Assignment 347 348:numref:`figure_efd7` depicts the group assignment for 7 flow keys as an example. 349Given a flow key, a hash function (in our implementation CRC hash) is 350used to get the group id. As shown in the figure, the groups can be 351unbalanced. (We highlight group rebalancing further below). 352 353.. _figure_efd8: 354 355.. figure:: img/efd_i8.* 356 357 Perfect Hash Search - Assigned Keys & Target Value 358 359Focusing on one group that has four keys, :numref:`figure_efd8` depicts the search 360algorithm to find the perfect hash function. Assuming that the target 361value bit for the keys is as shown in the figure, then the online EFD 362table will store a 16 bit hash index and 16 bit lookup table per group 363per value bit. 364 365.. _figure_efd9: 366 367.. figure:: img/efd_i9.* 368 369 Perfect Hash Search - Satisfy Target Values 370 371For a given keyX, a hash function ``(h(keyX, seed1) + index * h(keyX, seed2))`` 372is used to point to certain bit index in the 16bit lookup_table value, 373as shown in :numref:`figure_efd9`. 374The insert function will brute force search for all possible values for the 375hash index until a non conflicting lookup_table is found. 376 377.. _figure_efd10: 378 379.. figure:: img/efd_i10.* 380 381 Finding Hash Index for Conflict Free lookup_table 382 383For example, since both key3 and key7 have a target bit value of 1, it 384is okay if the hash function of both keys point to the same bit in the 385lookup table. A conflict will occur if a hash index is used that maps 386both Key4 and Key7 to the same index in the lookup_table, 387as shown in :numref:`figure_efd10`, since their target value bit are not the same. 388Once a hash index is found that produces a lookup_table with no 389contradictions, this index is stored for this group. This procedure is 390repeated for each bit of target value. 391 392Lookup Function Internals 393~~~~~~~~~~~~~~~~~~~~~~~~~ 394 395The design principle of EFD is that lookups are much more frequent than 396inserts, and hence, EFD's design optimizes for the lookups which are 397faster and much simpler than the slower insert procedure (inserts are 398slow, because of perfect hash search as previously discussed). 399 400.. _figure_efd11: 401 402.. figure:: img/efd_i11.* 403 404 EFD Lookup Operation 405 406:numref:`figure_efd11` depicts the lookup operation for EFD. Given an input key, 407the group id is computed (using CRC hash) and then the hash index for this 408group is retrieved from the EFD table. Using the retrieved hash index, 409the hash function ``h(key, seed1) + index *h(key, seed2)`` is used which will 410result in an index in the lookup_table, the bit corresponding to this 411index will be the target value bit. This procedure is repeated for each 412bit of the target value. 413 414Group Rebalancing Function Internals 415~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ 416 417When discussing EFD inserts and lookups, the discussion is simplified by 418assuming that a group id is simply a result of hash function. However, 419since hashing in general is not perfect and will not always produce a 420uniform output, this simplified assumption will lead to unbalanced 421groups, i.e., some group will have more keys than other groups. 422Typically, and to minimize insert time with an increasing number of keys, 423it is preferable that all groups will have a balanced number of keys, so 424the brute force search for the perfect hash terminates with a valid hash 425index. In order to achieve this target, groups are rebalanced during 426runtime inserts, and keys are moved around from a busy group to a less 427crowded group as the more keys are inserted. 428 429.. _figure_efd12: 430 431.. figure:: img/efd_i12.* 432 433 Runtime Group Rebalancing 434 435:numref:`figure_efd12` depicts the high level idea of group rebalancing, given an 436input key the hash result is split into two parts a chunk id and 8-bit 437bin id. A chunk contains 64 different groups and 256 bins (i.e. for any 438given bin it can map to 4 distinct groups). When a key is inserted, the 439bin id is computed, for example in :numref:`figure_efd12` bin_id=2, 440and since each bin can be mapped to one of four different groups (2 bit storage), 441the four possible mappings are evaluated and the one that will result in a 442balanced key distribution across these four is selected the mapping result 443is stored in these two bits. 444 445 446.. _Efd_references: 447 448References 449----------- 450 4511- EFD is based on collaborative research work between Intel and 452Carnegie Mellon University (CMU), interested readers can refer to the paper 453“Scaling Up Clustered Network Appliances with ScaleBricks;” Dong Zhou et al. 454at SIGCOMM 2015 (`http://conferences.sigcomm.org/sigcomm/2015/pdf/papers/p241.pdf`) 455for more information. 456