..  BSD LICENSE
    Copyright(c) 2017 Intel Corporation. All rights reserved.
    All rights reserved.

    Redistribution and use in source and binary forms, with or without
    modification, are permitted provided that the following conditions
    are met:

    * Redistributions of source code must retain the above copyright
    notice, this list of conditions and the following disclaimer.
    * Redistributions in binary form must reproduce the above copyright
    notice, this list of conditions and the following disclaimer in
    the documentation and/or other materials provided with the
    distribution.
    * Neither the name of Intel Corporation nor the names of its
    contributors may be used to endorse or promote products derived
    from this software without specific prior written permission.

    THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS
    "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT
    LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR
    A PARTICULAR PURPOSE ARE DISCLAIMED.
    IN NO EVENT SHALL THE COPYRIGHT
    OWNER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL,
    SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT
    LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE,
    DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY
    THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
    (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
    OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.

Generic Receive Offload Library
===============================

Generic Receive Offload (GRO) is a widely used software-based offloading
technique that reduces per-packet processing overhead. By reassembling
small packets into larger ones, GRO lets applications process fewer,
larger packets, which reduces the number of packets to be processed.
To benefit DPDK-based applications such as Open vSwitch, DPDK provides
its own GRO implementation. In DPDK, GRO is implemented as a standalone
library, and applications call the GRO library explicitly to reassemble
packets.

Overview
--------

In the GRO library, GRO types are defined by packet type. Each GRO type
processes one kind of packet. For example, TCP/IPv4 GRO processes
TCP/IPv4 packets.

Each GRO type has a reassembly function, which defines its own algorithm
and table structure to reassemble packets.
We assign input packets to the
corresponding GRO functions by MBUF->packet_type.

The GRO library doesn't check whether input packets have correct
checksums, and it doesn't re-calculate checksums for merged packets.
When IP fragmentation is possible (i.e., DF==0), the GRO library assumes
that the packets are complete (i.e., MF==0 && frag_off==0). Additionally,
it requires the IPv4 ID to increase by one from packet to packet.

Currently, the GRO library supports TCP/IPv4 packets.

Two Sets of API
---------------

For different usage scenarios, the GRO library provides two sets of API.
One is the lightweight mode API, which lets applications merge a small
number of packets rapidly; the other is the heavyweight mode API, which
gives applications fine-grained control and supports merging a large
number of packets.

Lightweight Mode API
~~~~~~~~~~~~~~~~~~~~

The lightweight mode has only one function, ``rte_gro_reassemble_burst()``,
which processes N packets at a time. To merge packets with the lightweight
mode API, calling ``rte_gro_reassemble_burst()`` is enough. The GROed
packets are returned to the application as soon as the function finishes.

In ``rte_gro_reassemble_burst()``, table structures of different GRO
types are allocated on the stack. This design simplifies applications'
operations.
However, because the stack size is limited,
``rte_gro_reassemble_burst()`` can process at most
``RTE_GRO_MAX_BURST_ITEM_NUM`` packets in one invocation.

Heavyweight Mode API
~~~~~~~~~~~~~~~~~~~~

Compared with the lightweight mode, using the heavyweight mode API is
relatively complex. Firstly, applications need to create a GRO context
with ``rte_gro_ctx_create()``. ``rte_gro_ctx_create()`` allocates table
structures on the heap and stores their pointers in the GRO context.
Secondly, applications use ``rte_gro_reassemble()`` to merge packets.
If input packets have invalid parameters, ``rte_gro_reassemble()``
returns them to applications. For example, packets of unsupported GRO
types or TCP SYN packets are returned. Otherwise, the input packets are
either merged with the existing packets in the tables or inserted into
the tables. Finally, applications use ``rte_gro_timeout_flush()`` to
flush packets from the tables when they want to get the GROed packets.

Note that update/lookup operations on the GRO context are not thread
safe. If different processes or threads want to access the same context
object simultaneously, an external synchronization mechanism must be
used.

Reassembly Algorithm
--------------------

The reassembly algorithm is used for reassembling packets. In the GRO
library, different GRO types can use different algorithms.
This section introduces the algorithm used by TCP/IPv4 GRO.

Challenges
~~~~~~~~~~

The reassembly algorithm determines the efficiency of GRO. There are two
challenges in the algorithm design:

- a high cost algorithm/implementation would cause packet dropping in a
  high speed network.

- packet reordering makes it hard to merge packets. For example, Linux
  GRO fails to merge packets when it encounters packet reordering.

These two challenges require our algorithm to be:

- lightweight enough to scale to fast networking speeds

- capable of handling packet reordering

In DPDK GRO, we use a key-based algorithm to address the two challenges.

Key-based Reassembly Algorithm
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

:numref:`figure_gro-key-algorithm` illustrates the procedure of the
key-based algorithm. Packets are classified into "flows" by some header
fields (which we call the "key"). To process an input packet, the
algorithm first searches for a matching "flow" (i.e., one with the same
key value) for the packet, then checks all packets in the "flow" and
tries to find a "neighbor" for it. If it finds a "neighbor", it merges
the two packets together. If it can't find a "neighbor", it stores the
packet in its "flow".
If it can't find a matching "flow", it inserts a new "flow" and stores
the packet in that "flow".

.. note::
        Packets in the same "flow" that can't be merged are always the
        result of packet reordering.

The key-based algorithm has two characteristics:

- classifying packets into "flows" is a simple way to accelerate packet
  aggregation (addresses challenge 1).

- storing out-of-order packets makes it possible to merge them later
  (addresses challenge 2).

.. _figure_gro-key-algorithm:

.. figure:: img/gro-key-algorithm.*
   :align: center

   Key-based Reassembly Algorithm

TCP/IPv4 GRO
------------

The table structure used by TCP/IPv4 GRO contains two arrays: a flow
array and an item array. The flow array keeps flow information, and the
item array keeps packet information.

Header fields used to define a TCP/IPv4 flow include:

- source and destination: Ethernet and IP address, TCP port

- TCP acknowledgment number

TCP/IPv4 packets whose FIN, SYN, RST, URG, PSH, ECE or CWR bit is set
won't be processed.

Header fields that decide whether two packets are neighbors include:

- TCP sequence number

- IPv4 ID.
  The IPv4 ID of the later packet should be one greater than that of
  the earlier packet.
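
The flow key and the neighbor rules listed above can be sketched in plain
C as follows. This is a minimal illustration under stated assumptions, not
the library's implementation: the names ``flow_key``, ``pkt_item``,
``same_flow()`` and ``check_neighbor()`` are hypothetical, the
``payload_len`` field is an assumption needed to compare sequence numbers,
and the rule for a packet that precedes a stored one is assumed to mirror
the rule for a packet that follows it.

```c
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical flow key: the header fields listed for a TCP/IPv4 flow. */
struct flow_key {
    uint8_t  eth_src[6];
    uint8_t  eth_dst[6];
    uint32_t ip_src;
    uint32_t ip_dst;
    uint16_t tcp_src_port;
    uint16_t tcp_dst_port;
    uint32_t tcp_ack;
};

/* Hypothetical per-packet record, as might be kept in the item array. */
struct pkt_item {
    uint32_t seq;          /* TCP sequence number of the first payload byte */
    uint16_t ip_id;        /* IPv4 ID */
    uint16_t payload_len;  /* TCP payload length in bytes (assumed field) */
};

/* Two packets belong to the same "flow" when every key field matches. */
static bool
same_flow(const struct flow_key *a, const struct flow_key *b)
{
    return memcmp(a->eth_src, b->eth_src, sizeof(a->eth_src)) == 0 &&
           memcmp(a->eth_dst, b->eth_dst, sizeof(a->eth_dst)) == 0 &&
           a->ip_src == b->ip_src && a->ip_dst == b->ip_dst &&
           a->tcp_src_port == b->tcp_src_port &&
           a->tcp_dst_port == b->tcp_dst_port &&
           a->tcp_ack == b->tcp_ack;
}

/*
 * Returns 1 if pkt directly follows item (pkt can be appended),
 * -1 if pkt directly precedes item (pkt can be prepended),
 * and 0 if the two packets are not neighbors.
 * Sequence numbers wrap modulo 2^32 and IPv4 IDs wrap modulo 2^16.
 */
static int
check_neighbor(const struct pkt_item *item, const struct pkt_item *pkt)
{
    if (pkt->seq == item->seq + item->payload_len &&
        pkt->ip_id == (uint16_t)(item->ip_id + 1))
        return 1;
    if (item->seq == pkt->seq + pkt->payload_len &&
        item->ip_id == (uint16_t)(pkt->ip_id + 1))
        return -1;
    return 0;
}
```

For example, under these assumptions a stored item with sequence number
1000, payload length 1460 and IPv4 ID 7 can be extended by an input packet
with sequence number 2460 and IPv4 ID 8. A packet of the same "flow" that
matches neither rule would be stored in the "flow" so that it can be merged
later.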