..  BSD LICENSE
    Copyright(c) 2017 Intel Corporation. All rights reserved.
    All rights reserved.

    Redistribution and use in source and binary forms, with or without
    modification, are permitted provided that the following conditions
    are met:

    * Redistributions of source code must retain the above copyright
      notice, this list of conditions and the following disclaimer.
    * Redistributions in binary form must reproduce the above copyright
      notice, this list of conditions and the following disclaimer in
      the documentation and/or other materials provided with the
      distribution.
    * Neither the name of Intel Corporation nor the names of its
      contributors may be used to endorse or promote products derived
      from this software without specific prior written permission.

    THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS
    "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT
    LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR
    A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT
    OWNER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL,
    SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT
    LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE,
    DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY
    THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
    (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
    OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.

Generic Receive Offload Library
===============================

Generic Receive Offload (GRO) is a widely used software-based offloading
technique that reduces per-packet processing overhead. By reassembling
small packets into larger ones, GRO lets applications process fewer,
larger packets. To benefit DPDK-based applications such as Open vSwitch,
DPDK provides its own GRO implementation as a standalone library.
Applications explicitly call the GRO library to reassemble packets.

Overview
--------

The GRO library defines multiple GRO types, one per packet type. Each GRO
type is in charge of processing one kind of packet. For example, TCP/IPv4
GRO processes TCP/IPv4 packets.

Each GRO type has a reassembly function, which defines its own algorithm
and table structure to reassemble packets.
Input packets are assigned to the corresponding GRO functions according
to ``MBUF->packet_type``.

The GRO library doesn't check whether input packets have correct
checksums and doesn't recalculate checksums for merged packets. When IP
fragmentation is possible (i.e., DF==0), the GRO library assumes the
packets are complete (i.e., MF==0 && frag_off==0). Additionally, it
complies with RFC 6864 when processing the IPv4 ID field.

Currently, the GRO library provides GRO support for TCP/IPv4 packets.

Two Sets of API
---------------

For different usage scenarios, the GRO library provides two sets of API.
One is the lightweight mode API, which enables applications to merge a
small number of packets rapidly; the other is the heavyweight mode API,
which gives applications fine-grained control and supports merging a
large number of packets.

Lightweight Mode API
~~~~~~~~~~~~~~~~~~~~

The lightweight mode has only one function, ``rte_gro_reassemble_burst()``,
which processes N packets at a time. Using the lightweight mode API to
merge packets is simple: calling ``rte_gro_reassemble_burst()`` is
enough. The GROed packets are returned to the application as soon as the
function finishes.

In ``rte_gro_reassemble_burst()``, the table structures of the different
GRO types are allocated on the stack. This design simplifies
applications' operations.
However, limited by the stack size, the maximum number of packets that
``rte_gro_reassemble_burst()`` can process in one invocation should be
less than or equal to ``RTE_GRO_MAX_BURST_ITEM_NUM``.

Heavyweight Mode API
~~~~~~~~~~~~~~~~~~~~

Compared with the lightweight mode, using the heavyweight mode API is
relatively complex. First, applications need to create a GRO context
with ``rte_gro_ctx_create()``, which allocates table structures on the
heap and stores their pointers in the GRO context. Second, applications
use ``rte_gro_reassemble()`` to merge packets. If input packets have
invalid parameters, ``rte_gro_reassemble()`` returns them to the
application. For example, packets of unsupported GRO types or TCP SYN
packets are returned. Otherwise, the input packets are either merged
with existing packets in the tables or inserted into the tables.
Finally, applications use ``rte_gro_timeout_flush()`` to flush packets
from the tables when they want to get the GROed packets.

Note that update and lookup operations on the GRO context are not thread
safe. If different processes or threads want to access the same context
object simultaneously, an external synchronization mechanism must be
used.

Reassembly Algorithm
--------------------

The reassembly algorithm is used for reassembling packets. In the GRO
library, different GRO types can use different algorithms.
This section introduces the algorithm used by TCP/IPv4 GRO.

Challenges
~~~~~~~~~~

The reassembly algorithm determines the efficiency of GRO. There are two
challenges in the algorithm design:

- a high-cost algorithm or implementation would cause packet drops in a
  high-speed network.

- packet reordering makes it hard to merge packets. For example, Linux
  GRO fails to merge packets when it encounters packet reordering.

These two challenges require the algorithm to be:

- lightweight enough to scale with fast networking speeds

- capable of handling packet reordering

In DPDK GRO, we use a key-based algorithm to address the two challenges.

Key-based Reassembly Algorithm
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

:numref:`figure_gro-key-algorithm` illustrates the procedure of the
key-based algorithm. Packets are classified into "flows" by some header
fields (we call them the "key"). To process an input packet, the
algorithm first searches for a matching "flow" (i.e., one with the same
key value), then checks all packets in the "flow" and tries to find a
"neighbor" for the input packet. If a "neighbor" is found, the two
packets are merged. If no "neighbor" is found, the packet is stored in
its "flow". If no matching "flow" is found, a new "flow" is inserted and
the packet is stored in that "flow".
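
The procedure above can be sketched in a few lines of standalone C. This
is only an illustration under simplifying assumptions, not the library's
implementation: the ``pkt``, ``item``, ``flow`` and ``table`` structures,
the table sizes and the ``process_pkt()`` helper are all hypothetical, the
key is reduced to a single integer, and a packet is described only by its
sequence number and payload length.

.. code-block:: c

    #include <assert.h>
    #include <stddef.h>
    #include <stdint.h>

    #define MAX_FLOWS 4      /* hypothetical table sizes */
    #define MAX_ITEMS 8
    #define NO_ITEM   (-1)

    struct pkt {             /* simplified stand-in for a packet */
        uint32_t key;        /* stands in for the flow-defining fields */
        uint32_t seq;        /* sequence number of the first payload byte */
        uint32_t len;        /* payload length in bytes */
    };

    struct item {
        struct pkt p;
        int next;            /* next item of the same flow, or NO_ITEM */
    };

    struct flow {
        uint32_t key;
        int first;           /* head of this flow's item list */
    };

    struct table {
        struct flow flows[MAX_FLOWS];
        struct item items[MAX_ITEMS];
        int nb_flows;
        int nb_items;
    };

    /* Returns 1 if merged, 0 if stored, -1 if the table is full. */
    static int
    process_pkt(struct table *t, const struct pkt *p)
    {
        int f, i;

        assert(t != NULL && p != NULL);

        /* Search for a flow with the same key. */
        for (f = 0; f < t->nb_flows; f++)
            if (t->flows[f].key == p->key)
                break;

        if (f < t->nb_flows) {
            /* Flow found: look for a neighbor among its packets. */
            for (i = t->flows[f].first; i != NO_ITEM; i = t->items[i].next) {
                struct pkt *q = &t->items[i].p;

                if (q->seq + q->len == p->seq) {   /* p follows q */
                    q->len += p->len;
                    return 1;
                }
                if (p->seq + p->len == q->seq) {   /* p precedes q */
                    q->seq = p->seq;
                    q->len += p->len;
                    return 1;
                }
            }
        } else {
            /* No matching flow: insert a new one. */
            if (t->nb_flows == MAX_FLOWS || t->nb_items == MAX_ITEMS)
                return -1;
            t->flows[f].key = p->key;
            t->flows[f].first = NO_ITEM;
            t->nb_flows++;
        }

        /* No neighbor: store the packet in its flow. */
        if (t->nb_items == MAX_ITEMS)
            return -1;
        i = t->nb_items++;
        t->items[i].p = *p;
        t->items[i].next = t->flows[f].first;
        t->flows[f].first = i;
        return 0;
    }

In this sketch an out-of-order packet is simply stored, so a packet that
arrives later can still merge with it. The sketch only merges the input
packet with one stored packet and never merges two stored packets with
each other.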

.. note::
        If packets in the same "flow" can't be merged, the cause is
        always packet reordering.

The key-based algorithm has two characteristics:

- classifying packets into "flows" to accelerate packet aggregation is
  simple (addresses challenge 1).

- storing out-of-order packets makes it possible to merge them later
  (addresses challenge 2).

.. _figure_gro-key-algorithm:

.. figure:: img/gro-key-algorithm.*
   :align: center

   Key-based Reassembly Algorithm

TCP/IPv4 GRO
------------

The table structure used by TCP/IPv4 GRO contains two arrays: a flow
array and an item array. The flow array keeps flow information, and the
item array keeps packet information.

The header fields used to define a TCP/IPv4 flow include:

- source and destination: Ethernet and IP addresses, TCP ports

- TCP acknowledgment number

TCP/IPv4 packets whose FIN, SYN, RST, URG, PSH, ECE or CWR bit is set
won't be processed.

The header fields that decide whether two packets are neighbors include:

- TCP sequence number

- IPv4 ID. For packets whose DF bit is 0, the IPv4 ID should increase by
  1 from one packet to the next.

.. note::
        We comply with RFC 6864 when processing the IPv4 ID field.
        Specifically, we check IPv4 ID fields for packets whose DF bit
        is 0 and ignore IPv4 ID fields for packets whose DF bit is 1.
        Additionally, packets that have different DF bit values can't
        be merged.
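
The neighbor rules above can be expressed as a small predicate. The
following standalone C sketch is an illustration only: the ``seg``
structure and the ``can_append()`` helper are hypothetical, not part of
the GRO library, and the sketch only covers the case where two single,
not yet merged packets of the same flow are compared.

.. code-block:: c

    #include <assert.h>
    #include <stdbool.h>
    #include <stdint.h>

    struct seg {                 /* hypothetical summary of one packet */
        uint32_t seq;            /* TCP sequence number */
        uint16_t payload_len;    /* TCP payload length in bytes */
        uint16_t ip_id;          /* IPv4 ID in host byte order */
        bool     df;             /* IPv4 DF bit */
    };

    /* Returns true if b can be appended directly after a. */
    static bool
    can_append(const struct seg *a, const struct seg *b)
    {
        assert(a != NULL && b != NULL);

        /* Packets with different DF bit values can't be merged. */
        if (a->df != b->df)
            return false;

        /* b must start where a's payload ends (wraps modulo 2^32). */
        if ((uint32_t)(a->seq + a->payload_len) != b->seq)
            return false;

        /* The IPv4 ID is checked only when DF is 0 (wraps modulo 2^16). */
        if (!a->df && (uint16_t)(a->ip_id + 1) != b->ip_id)
            return false;

        return true;
    }

For example, with DF set to 0, a packet with IPv4 ID 0xFFFF can be
followed by one with IPv4 ID 0 as long as the sequence numbers are
contiguous; with DF set to 1, the IPv4 ID values are not compared at all.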