..  BSD LICENSE
    Copyright(c) 2017 Intel Corporation. All rights reserved.
    All rights reserved.

    Redistribution and use in source and binary forms, with or without
    modification, are permitted provided that the following conditions
    are met:

    * Redistributions of source code must retain the above copyright
      notice, this list of conditions and the following disclaimer.
    * Redistributions in binary form must reproduce the above copyright
      notice, this list of conditions and the following disclaimer in
      the documentation and/or other materials provided with the
      distribution.
    * Neither the name of Intel Corporation nor the names of its
      contributors may be used to endorse or promote products derived
      from this software without specific prior written permission.

    THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS
    "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT
    LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR
    A PARTICULAR PURPOSE ARE DISCLAIMED.
    IN NO EVENT SHALL THE COPYRIGHT
    OWNER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL,
    SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT
    LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE,
    DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY
    THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
    (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
    OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.

Generic Receive Offload Library
===============================

Generic Receive Offload (GRO) is a widely used software-based offloading
technique that reduces per-packet processing overhead. By reassembling
small packets into larger ones, GRO lets applications process fewer,
larger packets, which reduces the number of packets to be processed.
To benefit DPDK-based applications such as Open vSwitch, DPDK provides
its own GRO implementation. In DPDK, GRO is implemented as a standalone
library, and applications explicitly call the GRO library to reassemble
packets.

Overview
--------

The GRO library defines multiple GRO types, one per packet type. Each
GRO type is in charge of processing one kind of packet. For example,
TCP/IPv4 GRO processes TCP/IPv4 packets.

Each GRO type has a reassembly function, which defines its own algorithm
and table structure to reassemble packets.
Input packets are assigned to the corresponding GRO function according
to ``MBUF->packet_type``.

The GRO library doesn't check whether input packets have correct
checksums and doesn't recalculate checksums for merged packets. When IP
fragmentation is possible (i.e., DF==0), the GRO library assumes that the
packets are complete (i.e., MF==0 && frag_off==0). Additionally, it
complies with RFC 6864 when processing the IPv4 ID field.

Currently, the GRO library supports TCP/IPv4 packets and VxLAN packets
that contain an outer IPv4 header and an inner TCP/IPv4 packet.

Two Sets of API
---------------

For different usage scenarios, the GRO library provides two sets of API.
One is the lightweight mode API, which enables applications to merge a
small number of packets rapidly; the other is the heavyweight mode API,
which gives applications fine-grained control and supports merging a
large number of packets.

Lightweight Mode API
~~~~~~~~~~~~~~~~~~~~

The lightweight mode has only one function, ``rte_gro_reassemble_burst()``,
which processes N packets at a time. Merging packets with the lightweight
mode API is simple: calling ``rte_gro_reassemble_burst()`` is enough. The
GROed packets are returned to the application as soon as the function
finishes.

In ``rte_gro_reassemble_burst()``, the table structures of the different
GRO types are allocated on the stack.
This design simplifies applications'
operations. However, limited by the stack size, the maximum number of
packets that ``rte_gro_reassemble_burst()`` can process in one invocation
must be less than or equal to ``RTE_GRO_MAX_BURST_ITEM_NUM``.

Heavyweight Mode API
~~~~~~~~~~~~~~~~~~~~

Compared with the lightweight mode, using the heavyweight mode API is
relatively complex. First, applications need to create a GRO context
with ``rte_gro_ctx_create()``, which allocates the table structures on the
heap and stores their pointers in the GRO context. Second, applications
use ``rte_gro_reassemble()`` to merge packets. If input packets have
invalid parameters, ``rte_gro_reassemble()`` returns them to the
application. For example, packets of unsupported GRO types and TCP SYN
packets are returned. Otherwise, the input packets are either merged with
existing packets in the tables or inserted into the tables. Finally,
applications use ``rte_gro_timeout_flush()`` to flush packets from the
tables when they want to get the GROed packets.

Note that update and lookup operations on the GRO context are not thread
safe. If different processes or threads want to access the same context
object simultaneously, an external synchronization mechanism must be
used.

Reassembly Algorithm
--------------------

The reassembly algorithm is used for reassembling packets.
In the GRO
library, different GRO types can use different algorithms. This section
introduces the algorithm used by TCP/IPv4 GRO and VxLAN GRO.

Challenges
~~~~~~~~~~

The reassembly algorithm determines the efficiency of GRO. There are two
challenges in the algorithm design:

- a high-cost algorithm or implementation would cause packet drops in a
  high-speed network.

- packet reordering makes it hard to merge packets. For example, Linux
  GRO fails to merge packets when it encounters packet reordering.

These two challenges require the algorithm to be:

- lightweight enough to scale with fast networking speeds

- capable of handling packet reordering

In DPDK GRO, we use a key-based algorithm to address the two challenges.

Key-based Reassembly Algorithm
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

:numref:`figure_gro-key-algorithm` illustrates the procedure of the
key-based algorithm. Packets are classified into "flows" by some header
fields (we call them the "key"). To process an input packet, the algorithm
first searches for a matching "flow" (i.e., one with the same key value),
then checks all packets in the "flow" and tries to find a "neighbor" for
it. If a "neighbor" is found, the two packets are merged. If no
"neighbor" is found, the packet is stored in its "flow".
If no matching "flow" is found, a new "flow" is inserted and the packet
is stored in it.

.. note::
        When packets in the same "flow" can't be merged, the cause is
        always packet reordering.

The key-based algorithm has two characteristics:

- classifying packets into "flows" to accelerate packet aggregation is
  simple (addresses challenge 1).

- storing out-of-order packets makes it possible to merge them later
  (addresses challenge 2).

.. _figure_gro-key-algorithm:

.. figure:: img/gro-key-algorithm.*
   :align: center

   Key-based Reassembly Algorithm


TCP/IPv4 GRO
------------

The table structure used by TCP/IPv4 GRO contains two arrays: a flow array
and an item array. The flow array keeps flow information, and the item
array keeps packet information.

Header fields used to define a TCP/IPv4 flow include:

- source and destination: Ethernet and IP address, TCP port

- TCP acknowledge number

TCP/IPv4 packets whose FIN, SYN, RST, URG, PSH, ECE or CWR bit is set
won't be processed.

Header fields deciding if two packets are neighbors include:

- TCP sequence number

- IPv4 ID.
  For packets whose DF bit is 0, the IPv4 ID should increase by 1 from
  one packet to the next.

VxLAN GRO
---------

The table structure used by VxLAN GRO, which is in charge of processing
VxLAN packets with an outer IPv4 header and an inner TCP/IPv4 packet, is
similar to that of TCP/IPv4 GRO. The difference is that the header fields
used to define a VxLAN flow include:

- outer source and destination: Ethernet and IP address, UDP port

- VxLAN header (VNI and flag)

- inner source and destination: Ethernet and IP address, TCP port

Header fields deciding if packets are neighbors include:

- outer IPv4 ID. For packets whose DF bit in the outer IPv4 header is 0,
  the outer IPv4 ID should increase by 1 from one packet to the next.

- inner TCP sequence number

- inner IPv4 ID. For packets whose DF bit in the inner IPv4 header is 0,
  the inner IPv4 ID should increase by 1 from one packet to the next.

.. note::
        We comply with RFC 6864 when processing the IPv4 ID field.
        Specifically, we check IPv4 ID fields for packets whose DF bit
        is 0 and ignore IPv4 ID fields for packets whose DF bit is 1.
        Additionally, packets that have different DF bit values can't
        be merged.
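
The neighbor rules for TCP/IPv4 packets, together with the IPv4 ID
handling described in the note, can be sketched in C as follows. This is
a minimal illustration under stated assumptions, not the library's
implementation: ``struct pkt_info``, ``enum neighbor`` and
``check_neighbor()`` are hypothetical names that do not exist in DPDK, and
each stored item is assumed to describe a single, not yet merged packet.

```c
#include <stdint.h>

/* Hypothetical, simplified view of a packet; not a DPDK structure. */
struct pkt_info {
    uint32_t tcp_seq;     /* TCP sequence number of the first payload byte */
    uint16_t payload_len; /* TCP payload length in bytes */
    uint16_t ip_id;       /* IPv4 ID field */
    uint8_t  df;          /* 1 if the IPv4 DF bit is set, otherwise 0 */
};

/* Hypothetical result codes for the neighbor check. */
enum neighbor {
    PREPEND_BEFORE = -1,  /* input packet directly precedes the stored one */
    NOT_NEIGHBOR   = 0,
    APPEND_AFTER   = 1    /* input packet directly follows the stored one */
};

/* Decide whether input packet `in` is a neighbor of stored packet `item`. */
static enum neighbor
check_neighbor(const struct pkt_info *item, const struct pkt_info *in)
{
    /* Packets with different DF bit values can't be merged. */
    if (item->df != in->df)
        return NOT_NEIGHBOR;

    /* DF == 1: the IPv4 ID is ignored. DF == 0: it must increase by 1. */
    int ignore_id = (in->df == 1);

    /* `in` starts exactly where the payload of `item` ends. */
    if (in->tcp_seq == item->tcp_seq + item->payload_len &&
        (ignore_id || in->ip_id == (uint16_t)(item->ip_id + 1)))
        return APPEND_AFTER;

    /* `in` ends exactly where the payload of `item` starts. */
    if (in->tcp_seq + in->payload_len == item->tcp_seq &&
        (ignore_id || (uint16_t)(in->ip_id + 1) == item->ip_id))
        return PREPEND_BEFORE;

    return NOT_NEIGHBOR;
}
```

The casts to ``uint16_t`` let the IPv4 ID comparison wrap around from
0xFFFF to 0, and the 32-bit sequence arithmetic wraps in the same way. For
VxLAN GRO, the same IPv4 ID comparison would be applied to both the outer
and the inner IPv4 header.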