..  SPDX-License-Identifier: BSD-3-Clause
    Copyright(c) 2010-2014 Intel Corporation.

Packet (Mbuf) Library
=====================

The Packet (MBuf) library provides the ability to allocate and free buffers (mbufs)
that may be used by the DPDK application to store message buffers.
The message buffers are stored in a mempool, using the :doc:`mempool_lib`.

A rte_mbuf struct generally carries network packet buffers, but it can actually
be any data (control data, events, ...).
The rte_mbuf header structure is kept as small as possible and currently uses
just two cache lines, with the most frequently used fields being on the first
of the two cache lines.

Design of Packet Buffers
------------------------

For the storage of the packet data (including protocol headers), two approaches were considered:

#.  Embed metadata within a single memory buffer: the structure followed by a fixed-size area for the packet data.

#.  Use separate memory buffers for the metadata structure and for the packet data.

The advantage of the first method is that it only needs one operation to allocate/free the whole memory representation of a packet.
On the other hand, the second method is more flexible and allows
the complete separation of the allocation of metadata structures from the allocation of packet data buffers.

The first method was chosen for the DPDK.
The metadata contains control information such as message type, length,
offset to the start of the data and a pointer for additional mbuf structures allowing buffer chaining.

Message buffers that are used to carry network packets can handle buffer chaining
where multiple buffers are required to hold the complete packet.
This is the case for jumbo frames that are composed of many mbufs linked together through their next field.
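
As a rough illustration of chaining, the sketch below walks a chain through its next pointers and sums the per-segment lengths.
The ``toy_seg`` structure and the ``toy_chain_*`` helpers are hypothetical stand-ins written for this example; they are not the rte_mbuf structure or any DPDK function.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, simplified stand-in for an mbuf: only the fields needed
 * to illustrate chaining are modelled here. */
struct toy_seg {
    struct toy_seg *next; /* next segment in the chain, NULL at the end */
    uint16_t data_len;    /* bytes of packet data held in this segment */
};

/* Sum the per-segment data lengths by following the next pointers. */
uint32_t toy_chain_len(const struct toy_seg *m)
{
    uint32_t total = 0;

    for (; m != NULL; m = m->next)
        total += m->data_len;
    return total;
}

/* Count the segments in the chain. */
unsigned int toy_chain_segs(const struct toy_seg *m)
{
    unsigned int n = 0;

    for (; m != NULL; m = m->next)
        n++;
    return n;
}

/* Build a three-segment chain holding a 9000-byte jumbo frame and
 * report its total length and segment count. */
uint32_t toy_chain_demo(unsigned int *nb_segs)
{
    struct toy_seg s3 = { NULL, 808 };
    struct toy_seg s2 = { &s3, 4096 };
    struct toy_seg s1 = { &s2, 4096 };

    assert(nb_segs != NULL);
    *nb_segs = toy_chain_segs(&s1);
    return toy_chain_len(&s1);
}
```

The segment sizes are arbitrary values chosen so that three segments add up to a 9000-byte frame.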

For a newly allocated mbuf, the area at which the data begins in the message buffer is
RTE_PKTMBUF_HEADROOM bytes after the beginning of the buffer, which is cache aligned.
Message buffers may be used to carry control information, packets, events,
and so on between different entities in the system.
Message buffers may also use their buffer pointers to point to other message buffer data sections or other structures.

:numref:`figure_mbuf1` and :numref:`figure_mbuf2` show some of these scenarios.

.. _figure_mbuf1:

.. figure:: img/mbuf1.*

   An mbuf with One Segment


.. _figure_mbuf2:

.. figure:: img/mbuf2.*

   An mbuf with Three Segments


The Buffer Manager implements a fairly standard set of buffer access functions to manipulate network packets.

Buffers Stored in Memory Pools
------------------------------

The Buffer Manager uses the :doc:`mempool_lib` to allocate buffers.
Therefore, it ensures that the packet header is interleaved optimally across the channels and ranks for L3 processing.
An mbuf contains a field indicating the pool that it originated from.
When calling rte_pktmbuf_free(m), the mbuf returns to its original pool.

Constructors
------------

Packet mbuf constructors are provided by the API.
The rte_pktmbuf_init() function initializes some fields in the mbuf structure that
are not modified by the user once created (mbuf type, origin pool, buffer start address, and so on).
This function is given as a callback function to the rte_mempool_create() function at pool creation time.

Allocating and Freeing mbufs
----------------------------

Allocating a new mbuf requires the user to specify the mempool from which the mbuf should be taken.
A newly allocated mbuf contains one segment, with a length of 0.
The offset to data is initialized to have some bytes of headroom in the buffer (RTE_PKTMBUF_HEADROOM).

Freeing an mbuf means returning it to its original mempool.
The content of an mbuf is not modified when it is stored in a pool (as a free mbuf).
Fields initialized by the constructor do not need to be re-initialized at mbuf allocation.

When freeing a packet mbuf that contains several segments, all of them are freed and returned to their original mempool.
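
The allocation and free rules above can be modelled with a small free-list pool.
This is a minimal sketch under stated assumptions: ``toy_pool``, ``toy_buf`` and the helper functions are hypothetical names, the pool is a plain array-backed stack rather than the mempool library, and the value 128 for ``TOY_HEADROOM`` is only an illustrative stand-in for RTE_PKTMBUF_HEADROOM.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define TOY_HEADROOM 128 /* illustrative stand-in for RTE_PKTMBUF_HEADROOM */
#define TOY_POOL_SIZE 4

struct toy_pool;

/* Hypothetical, simplified buffer header. */
struct toy_buf {
    struct toy_buf *next;  /* next segment of the packet, or NULL */
    struct toy_pool *pool; /* origin pool, set once by the constructor */
    uint16_t data_off;     /* offset of the data from the buffer start */
    uint16_t data_len;     /* bytes of data in this segment */
    uint16_t nb_segs;      /* segments in the chain (first buffer only) */
};

struct toy_pool {
    struct toy_buf objs[TOY_POOL_SIZE];
    struct toy_buf *free_list[TOY_POOL_SIZE];
    unsigned int nb_free;
};

/* Pool creation: run the "constructor" once per object. */
void toy_pool_init(struct toy_pool *p)
{
    assert(p != NULL);
    p->nb_free = 0;
    for (unsigned int i = 0; i < TOY_POOL_SIZE; i++) {
        p->objs[i].pool = p; /* never changes afterwards */
        p->free_list[p->nb_free++] = &p->objs[i];
    }
}

/* Allocation resets only the per-packet fields: one segment, length 0,
 * data starting after the headroom. */
struct toy_buf *toy_alloc(struct toy_pool *p)
{
    struct toy_buf *m;

    if (p->nb_free == 0)
        return NULL;
    m = p->free_list[--p->nb_free];
    m->next = NULL;
    m->data_off = TOY_HEADROOM;
    m->data_len = 0;
    m->nb_segs = 1;
    return m;
}

/* Freeing a packet returns every segment of the chain to its own pool. */
void toy_free(struct toy_buf *m)
{
    while (m != NULL) {
        struct toy_buf *next = m->next;
        struct toy_pool *p = m->pool;

        p->free_list[p->nb_free++] = m;
        m = next;
    }
}
```

Note that ``toy_free()`` leaves the buffer contents untouched, so the constructor-time ``pool`` field is still valid on the next allocation.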

Manipulating mbufs
------------------

This library provides some functions for manipulating the data in a packet mbuf. For instance:

    *  Get data length

    *  Get a pointer to the start of data

    *  Prepend data before the existing data

    *  Append data after the existing data

    *  Remove data at the beginning of the buffer (rte_pktmbuf_adj())

    *  Remove data at the end of the buffer (rte_pktmbuf_trim())

Refer to the *DPDK API Reference* for details.
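
The operations listed above all come down to adjusting a data offset and a data length inside a fixed buffer.
The following single-segment model is a sketch only: ``toy_pkt`` and its helpers are hypothetical, the buffer and headroom sizes are arbitrary, and the return conventions are assumptions made for this example, so the *DPDK API Reference* remains the authority for the real functions.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define TOY_BUF_LEN 2048 /* arbitrary buffer size for this sketch */
#define TOY_HEADROOM 128 /* illustrative stand-in for RTE_PKTMBUF_HEADROOM */

/* Hypothetical single-segment packet buffer. */
struct toy_pkt {
    uint8_t buf[TOY_BUF_LEN];
    uint16_t data_off; /* offset of the first data byte in buf */
    uint16_t data_len; /* number of data bytes */
};

void toy_pkt_reset(struct toy_pkt *p)
{
    assert(p != NULL);
    p->data_off = TOY_HEADROOM;
    p->data_len = 0;
}

/* Pointer to the start of data. */
uint8_t *toy_pkt_data(struct toy_pkt *p)
{
    return p->buf + p->data_off;
}

/* Prepend: consume headroom. Returns the new data start, or NULL. */
uint8_t *toy_pkt_prepend(struct toy_pkt *p, uint16_t len)
{
    if (len > p->data_off)
        return NULL;
    p->data_off = (uint16_t)(p->data_off - len);
    p->data_len = (uint16_t)(p->data_len + len);
    return toy_pkt_data(p);
}

/* Append: consume tailroom. Returns the start of the added area, or NULL. */
uint8_t *toy_pkt_append(struct toy_pkt *p, uint16_t len)
{
    uint16_t tailroom = (uint16_t)(TOY_BUF_LEN - p->data_off - p->data_len);
    uint8_t *tail = toy_pkt_data(p) + p->data_len;

    if (len > tailroom)
        return NULL;
    p->data_len = (uint16_t)(p->data_len + len);
    return tail;
}

/* Remove len bytes at the beginning. Returns the new data start, or NULL. */
uint8_t *toy_pkt_adj(struct toy_pkt *p, uint16_t len)
{
    if (len > p->data_len)
        return NULL;
    p->data_off = (uint16_t)(p->data_off + len);
    p->data_len = (uint16_t)(p->data_len - len);
    return toy_pkt_data(p);
}

/* Remove len bytes at the end. Returns 0 on success, -1 on error. */
int toy_pkt_trim(struct toy_pkt *p, uint16_t len)
{
    if (len > p->data_len)
        return -1;
    p->data_len = (uint16_t)(p->data_len - len);
    return 0;
}
```

A typical sequence is to append a payload, prepend a header into the headroom, and later strip that header again with the "adj" operation, none of which copies packet data.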

.. _mbuf_meta:

Meta Information
----------------

Some information is retrieved by the network driver and stored in an mbuf to make processing easier.
For instance, the VLAN, the RSS hash result
and a flag indicating that the checksum was computed by hardware.

An mbuf also contains the input port (where it comes from), and the number of segment mbufs in the chain.

For chained buffers, only the first mbuf of the chain stores this meta information.

For instance, this is the case on RX side for the IEEE1588 packet
timestamp mechanism, the VLAN tagging and the IP checksum computation.

On TX side, it is also possible for an application to delegate some
processing to the hardware if it supports it. For instance, the
RTE_MBUF_F_TX_IP_CKSUM flag allows offloading the computation of the IPv4
checksum.

The following examples explain how to configure different TX offloads on
a VXLAN-encapsulated TCP packet:
``out_eth/out_ip/out_udp/vxlan/in_eth/in_ip/in_tcp/payload``

- calculate checksum of out_ip::

    mb->l2_len = len(out_eth)
    mb->l3_len = len(out_ip)
    mb->ol_flags |= RTE_MBUF_F_TX_IPV4 | RTE_MBUF_F_TX_IP_CKSUM
    set out_ip checksum to 0 in the packet

  This is supported on hardware advertising RTE_ETH_TX_OFFLOAD_IPV4_CKSUM.

- calculate checksum of out_ip and out_udp::

    mb->l2_len = len(out_eth)
    mb->l3_len = len(out_ip)
    mb->ol_flags |= RTE_MBUF_F_TX_IPV4 | RTE_MBUF_F_TX_IP_CKSUM | RTE_MBUF_F_TX_UDP_CKSUM
    set out_ip checksum to 0 in the packet
    set out_udp checksum to pseudo header using rte_ipv4_phdr_cksum()

  This is supported on hardware advertising RTE_ETH_TX_OFFLOAD_IPV4_CKSUM
  and RTE_ETH_TX_OFFLOAD_UDP_CKSUM.

- calculate checksum of in_ip::

    mb->l2_len = len(out_eth + out_ip + out_udp + vxlan + in_eth)
    mb->l3_len = len(in_ip)
    mb->ol_flags |= RTE_MBUF_F_TX_IPV4 | RTE_MBUF_F_TX_IP_CKSUM
    set in_ip checksum to 0 in the packet

  This is similar to the first case, but l2_len is different. It is supported
  on hardware advertising RTE_ETH_TX_OFFLOAD_IPV4_CKSUM.
  Note that it can only work if outer L4 checksum is 0.

- calculate checksum of in_ip and in_tcp::

    mb->l2_len = len(out_eth + out_ip + out_udp + vxlan + in_eth)
    mb->l3_len = len(in_ip)
    mb->ol_flags |= RTE_MBUF_F_TX_IPV4 | RTE_MBUF_F_TX_IP_CKSUM | RTE_MBUF_F_TX_TCP_CKSUM
    set in_ip checksum to 0 in the packet
    set in_tcp checksum to pseudo header using rte_ipv4_phdr_cksum()

  This is similar to the second case, but l2_len is different. It is supported
  on hardware advertising RTE_ETH_TX_OFFLOAD_IPV4_CKSUM and
  RTE_ETH_TX_OFFLOAD_TCP_CKSUM.
  Note that it can only work if outer L4 checksum is 0.

- segment inner TCP::

    mb->l2_len = len(out_eth + out_ip + out_udp + vxlan + in_eth)
    mb->l3_len = len(in_ip)
    mb->l4_len = len(in_tcp)
    mb->ol_flags |= RTE_MBUF_F_TX_IPV4 | RTE_MBUF_F_TX_IP_CKSUM | RTE_MBUF_F_TX_TCP_CKSUM |
      RTE_MBUF_F_TX_TCP_SEG
    set in_ip checksum to 0 in the packet
    set in_tcp checksum to pseudo header without including the IP
      payload length using rte_ipv4_phdr_cksum()

  This is supported on hardware advertising RTE_ETH_TX_OFFLOAD_TCP_TSO.
  Note that it can only work if outer L4 checksum is 0.

- calculate checksum of out_ip, in_ip, in_tcp::

    mb->outer_l2_len = len(out_eth)
    mb->outer_l3_len = len(out_ip)
    mb->l2_len = len(out_udp + vxlan + in_eth)
    mb->l3_len = len(in_ip)
    mb->ol_flags |= RTE_MBUF_F_TX_OUTER_IPV4 | RTE_MBUF_F_TX_OUTER_IP_CKSUM |
      RTE_MBUF_F_TX_IP_CKSUM | RTE_MBUF_F_TX_TCP_CKSUM
    set out_ip checksum to 0 in the packet
    set in_ip checksum to 0 in the packet
    set in_tcp checksum to pseudo header using rte_ipv4_phdr_cksum()

  This is supported on hardware advertising RTE_ETH_TX_OFFLOAD_IPV4_CKSUM,
  RTE_ETH_TX_OFFLOAD_TCP_CKSUM and RTE_ETH_TX_OFFLOAD_OUTER_IPV4_CKSUM.

The list of flags and their precise meaning is described in the mbuf API
documentation (rte_mbuf.h). Also refer to the testpmd source code
(specifically the csumonly.c file) for details.
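
The length fields in the cases above are plain sums of header sizes.
The sketch below works them out for the VXLAN layout, assuming untagged Ethernet (14 bytes), IPv4 without options (20 bytes), UDP (8 bytes), VXLAN (8 bytes) and TCP without options (20 bytes).
The ``toy_tx_lens`` structure and the helper functions are hypothetical and only mirror the field names used above.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Header sizes in bytes. Assumptions: untagged Ethernet, IPv4 without
 * options, TCP without options. */
enum {
    TOY_ETH_LEN = 14,
    TOY_IPV4_LEN = 20,
    TOY_UDP_LEN = 8,
    TOY_VXLAN_LEN = 8,
    TOY_TCP_LEN = 20
};

/* Hypothetical container mirroring the mbuf length fields named above. */
struct toy_tx_lens {
    uint16_t outer_l2_len;
    uint16_t outer_l3_len;
    uint16_t l2_len;
    uint16_t l3_len;
    uint16_t l4_len;
};

/* Inner-only offload: every header in front of in_ip is counted in l2_len. */
struct toy_tx_lens toy_lens_inner_only(void)
{
    struct toy_tx_lens l = { 0, 0, 0, 0, 0 };

    l.l2_len = TOY_ETH_LEN + TOY_IPV4_LEN + TOY_UDP_LEN +
               TOY_VXLAN_LEN + TOY_ETH_LEN;
    l.l3_len = TOY_IPV4_LEN;
    l.l4_len = TOY_TCP_LEN;
    return l;
}

/* Outer and inner offload: the outer Ethernet and IP headers get their
 * own fields, so l2_len only covers out_udp + vxlan + in_eth. */
struct toy_tx_lens toy_lens_outer_and_inner(void)
{
    struct toy_tx_lens l = { 0, 0, 0, 0, 0 };

    l.outer_l2_len = TOY_ETH_LEN;
    l.outer_l3_len = TOY_IPV4_LEN;
    l.l2_len = TOY_UDP_LEN + TOY_VXLAN_LEN + TOY_ETH_LEN;
    l.l3_len = TOY_IPV4_LEN;
    return l;
}

/* Offset of the inner IP header implied by a set of length fields. */
unsigned int toy_inner_ip_offset(const struct toy_tx_lens *l)
{
    assert(l != NULL);
    return (unsigned int)l->outer_l2_len + l->outer_l3_len + l->l2_len;
}
```

Both descriptions place the inner IP header at the same offset in the packet; they differ only in how the bytes in front of it are split between the outer and inner length fields.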

Dynamic fields and flags
~~~~~~~~~~~~~~~~~~~~~~~~

The size of the mbuf is constrained and limited,
while the amount of metadata to save for each packet is virtually unlimited.
The most basic networking information already has its place
in the existing mbuf fields and flags.

If new features need to be added, the new fields and flags should fit
in the "dynamic space", by registering some room in the mbuf structure:

dynamic field
   named area in the mbuf structure,
   with a given size (at least 1 byte) and alignment constraint.

dynamic flag
   named bit in the mbuf structure,
   stored in the field ``ol_flags``.

The dynamic fields and flags are managed with the functions ``rte_mbuf_dyn*``.

It is not possible to unregister fields or flags.
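
To make the idea of "registering some room" concrete, the sketch below models a registry that hands out named, aligned offsets inside a small fixed-size area.
Everything in it is hypothetical: the ``toy_dyn_*`` names, the 32-byte area and the simple bump-allocation policy are assumptions made for this example, and the placement strategy of the real ``rte_mbuf_dyn*`` functions may differ.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define TOY_DYN_AREA_SIZE 32 /* hypothetical size of the dynamic space */
#define TOY_DYN_MAX_FIELDS 8
#define TOY_DYN_NAME_LEN 32

/* One registered field: a named area with a size and an alignment. */
struct toy_dynfield {
    char name[TOY_DYN_NAME_LEN];
    size_t size;
    size_t align; /* power of two */
    size_t offset;
};

/* Zero-initialize before first use. There is no unregister operation. */
struct toy_dyn_registry {
    struct toy_dynfield fields[TOY_DYN_MAX_FIELDS];
    unsigned int nb_fields;
    size_t next_free; /* first byte not yet handed out */
};

/* Register a field and return its offset in the dynamic area, or -1.
 * Registering an existing name with the same size and alignment
 * returns the offset that was assigned the first time. */
int toy_dynfield_register(struct toy_dyn_registry *r, const char *name,
                          size_t size, size_t align)
{
    struct toy_dynfield *f;
    size_t off;

    assert(r != NULL && name != NULL);

    if (size == 0 || size > TOY_DYN_AREA_SIZE)
        return -1;
    if (align == 0 || align > TOY_DYN_AREA_SIZE || (align & (align - 1)) != 0)
        return -1;
    if (strlen(name) >= TOY_DYN_NAME_LEN)
        return -1;

    for (unsigned int i = 0; i < r->nb_fields; i++) {
        f = &r->fields[i];
        if (strcmp(f->name, name) == 0)
            return (f->size == size && f->align == align) ?
                   (int)f->offset : -1;
    }

    /* Round the next free byte up to the requested alignment. */
    off = (r->next_free + align - 1) & ~(align - 1);
    if (r->nb_fields == TOY_DYN_MAX_FIELDS || off + size > TOY_DYN_AREA_SIZE)
        return -1;

    f = &r->fields[r->nb_fields++];
    strcpy(f->name, name);
    f->size = size;
    f->align = align;
    f->offset = off;
    r->next_free = off + size;
    return (int)off;
}
```

Because entries are never removed, an offset obtained once stays valid for the lifetime of the registry, which is why a feature can look it up at initialization and cache it.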

.. _direct_indirect_buffer:

Direct and Indirect Buffers
---------------------------

A direct buffer is a buffer that is completely separate and self-contained.
An indirect buffer behaves like a direct buffer but for the fact that the buffer pointer and
data offset in it refer to data in another direct buffer.
This is useful in situations where packets need to be duplicated or fragmented,
since indirect buffers provide the means to reuse the same packet data across multiple buffers.

A buffer becomes indirect when it is "attached" to a direct buffer using the rte_pktmbuf_attach() function.
Each buffer has a reference counter field and whenever an indirect buffer is attached to the direct buffer,
the reference counter on the direct buffer is incremented.
Similarly, whenever the indirect buffer is detached, the reference counter on the direct buffer is decremented.
If the resulting reference counter is equal to 0, the direct buffer is freed since it is no longer in use.

There are a few things to remember when dealing with indirect buffers.
First of all, an indirect buffer is never attached to another indirect buffer.
Attempting to attach buffer A to indirect buffer B that is attached to C makes rte_pktmbuf_attach() automatically attach A to C, effectively cloning B.
Secondly, for a buffer to become indirect, its reference counter must be equal to 1,
that is, it must not be already referenced by another indirect buffer.
Finally, it is not possible to reattach an indirect buffer to the direct buffer (unless it is detached first).

While the attach/detach operations can be invoked directly using the rte_pktmbuf_attach() and rte_pktmbuf_detach() functions,
it is recommended to use the higher-level rte_pktmbuf_clone() function,
which takes care of the correct initialization of an indirect buffer and can clone buffers with multiple segments.
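
The reference-counting rules above can be sketched with a toy model.
The ``toy_ref`` structure and the ``toy_attach()`` and ``toy_detach()`` helpers are hypothetical and greatly simplified; they only illustrate the counter updates and the "attach to the underlying direct buffer" rule, not the behaviour of the real functions in every detail.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, simplified buffer with a reference counter. */
struct toy_ref {
    struct toy_ref *direct; /* NULL if direct, else the buffer whose data is shared */
    uint8_t *data;          /* where the packet data lives */
    uint16_t refcnt;        /* 1 when only the buffer itself uses its data */
    uint8_t storage[64];    /* this buffer's own data area */
};

void toy_ref_init(struct toy_ref *m)
{
    assert(m != NULL);
    m->direct = NULL;
    m->data = m->storage;
    m->refcnt = 1;
}

/* Attach mi to m. If m is itself indirect, mi is attached to the direct
 * buffer behind m, so an indirect buffer never points at another
 * indirect buffer. */
void toy_attach(struct toy_ref *mi, struct toy_ref *m)
{
    struct toy_ref *md = (m->direct != NULL) ? m->direct : m;

    /* mi must be direct and not referenced by anyone else. */
    assert(mi->direct == NULL && mi->refcnt == 1);

    md->refcnt++;
    mi->direct = md;
    mi->data = md->data;
}

/* Detach mi and return the remaining reference count of the direct
 * buffer; a count of 0 would mean that buffer can be freed. */
uint16_t toy_detach(struct toy_ref *mi)
{
    struct toy_ref *md = mi->direct;

    assert(md != NULL);
    mi->direct = NULL;
    mi->data = mi->storage;
    md->refcnt--;
    return md->refcnt;
}
```

In this model, attaching A to B while B is attached to C leaves both A and B pointing at the data of C, with the counter of C at 3.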

Since indirect buffers are not supposed to actually hold any data,
the memory pool for indirect buffers should be configured to indicate the reduced memory consumption.
Examples of the initialization of a memory pool for indirect buffers (as well as use case examples for indirect buffers)
can be found in several of the sample applications, for example, the IPv4 Multicast sample application.

Debug
-----

In debug mode, the functions of the mbuf library perform sanity checks before any operation
(for example, checks for buffer corruption or a bad type).

Use Cases
---------

All networking applications should use mbufs to transport network packets.
276