.. SPDX-License-Identifier: BSD-3-Clause
   Copyright (c) 2021 NVIDIA Corporation & Affiliates

General-Purpose Graphics Processing Unit Library
================================================

When mixing networking activity with task processing on a GPU device,
the CPU may need to communicate with the device
to manage memory, synchronize operations, exchange information, and so on.

By means of the generic GPU interface provided by this library,
it is possible to allocate a chunk of GPU memory and use it
to create a DPDK mempool with external mbufs having the payload
in GPU memory, enabling any network interface card
which supports this feature (like Mellanox NICs)
to directly transmit and receive packets using GPU memory.

Additionally, this library provides a number of functions
to facilitate communication between the CPU and the GPU.

Providing a wrapper for GPU-specific libraries
(e.g. CUDA Toolkit or OpenCL) is out of scope for this library,
thus it is not possible to launch workloads
on the device or create GPU-specific objects
(e.g. a CUDA Driver context or CUDA streams in case of NVIDIA GPUs).


Features
--------

This library provides a number of features:

- Interoperability with device-specific libraries through generic handlers.
- Allocation and freeing of memory on the device.
- Registration of CPU memory to make it visible from the device.
- Communication between the CPU and the device.

All CPU - GPU communication is implemented
using CPU memory visible from the GPU.


API Overview
------------

Child Device
~~~~~~~~~~~~

By default, the DPDK PCIe module detects and registers physical GPU devices
in the system.
With the gpudev library it is also possible to add non-physical devices
through a ``uint64_t`` generic handler (e.g. a CUDA Driver context)
that will be registered internally by the driver as an additional device (child)
connected to a physical device (parent).
Each device (parent or child) is represented by an ID
required to indicate which device a given operation should be executed on.

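For instance, a CUDA application may register its current CUDA Driver context
as a child of the physical GPU device 0
(a minimal pseudo-code sketch, error handling omitted):

.. code-block:: c

   CUcontext cuda_ctx;
   int16_t parent_id = 0; /* ID of a physical GPU detected by DPDK */
   int16_t child_id;

   /* Retrieve the current CUDA Driver context. */
   cuCtxGetCurrent(&cuda_ctx);

   /* Register the context as a child device of the physical GPU. */
   child_id = rte_gpu_add_child(parent_id, (uint64_t)(uintptr_t)cuda_ctx, 0);

   /* From now on, child_id can be used as device ID in gpudev calls. */
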
Memory Allocation
~~~~~~~~~~~~~~~~~

gpudev can allocate a memory area on a given GPU device
and return a pointer to that memory.
Later, it is also possible to free that memory with gpudev.
GPU memory allocated outside of the gpudev library
(e.g. with a GPU-specific library) cannot be freed by the gpudev library.

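As a minimal pseudo-code sketch (error handling omitted),
allocating and later freeing 1 MB of GPU memory may look like:

.. code-block:: c

   int16_t dev_id = 0;
   size_t buf_len = 1 << 20;
   void *gpu_buf;

   /* Allocate 1 MB on the GPU; the last parameter is the requested alignment (0 = default). */
   gpu_buf = rte_gpu_mem_alloc(dev_id, buf_len, 0);

   /* ... use the memory, e.g. as external buffer for a mempool ... */

   /* Only memory allocated through gpudev can be freed through gpudev. */
   rte_gpu_mem_free(dev_id, gpu_buf);
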
Memory Registration
~~~~~~~~~~~~~~~~~~~

gpudev can register a CPU memory area to make it visible from a GPU device.
Later, it is also possible to unregister that memory with gpudev.
CPU memory registered outside of the gpudev library
(e.g. with a GPU-specific library) cannot be unregistered by the gpudev library.

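A minimal pseudo-code sketch (error handling omitted):

.. code-block:: c

   int16_t dev_id = 0;
   size_t buf_len = 1 << 20;

   /* CPU memory allocated outside of gpudev, e.g. with rte_zmalloc(). */
   void *cpu_buf = rte_zmalloc(NULL, buf_len, 0);

   /* Make the CPU memory area visible from the GPU device. */
   rte_gpu_mem_register(dev_id, buf_len, cpu_buf);

   /* ... a task running on the GPU can now access cpu_buf ... */

   rte_gpu_mem_unregister(dev_id, cpu_buf);
   rte_free(cpu_buf);
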
Memory Barrier
~~~~~~~~~~~~~~

Some GPU drivers may need, under certain conditions,
to enforce the coherency of writes by external devices (e.g. a NIC receiving packets)
into the GPU memory.
gpudev abstracts and exposes this capability.

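This capability is exposed through ``rte_gpu_wmb()``.
For example, after a NIC has written packets into GPU memory
and before notifying the task waiting on the GPU,
the CPU may enforce the barrier (pseudo-code sketch):

.. code-block:: c

   /* Ensure that writes to GPU memory made by other devices
    * (e.g. a NIC receiving packets) are visible to the GPU task. */
   rte_gpu_wmb(dev_id);
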
Communication Flag
~~~~~~~~~~~~~~~~~~

Consider an application with a GPU task
that waits to receive a signal from the CPU
before moving forward with its execution.
The communication flag allocates a GPU-visible ``uint32_t`` flag in CPU memory
that can be used by the CPU to communicate with a GPU task.

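The typical CPU-side life cycle of the flag can be sketched as follows
(pseudo-code, error handling omitted):

.. code-block:: c

   struct rte_gpu_comm_flag quit_flag;

   /* Allocate a GPU-visible uint32_t flag in CPU memory. */
   rte_gpu_comm_create_flag(dev_id, &quit_flag, RTE_GPU_COMM_FLAG_CPU);
   rte_gpu_comm_set_flag(&quit_flag, 0);

   /* ... launch the GPU task, which polls quit_flag.ptr ... */

   /* Tell the GPU task to exit, then release the flag. */
   rte_gpu_comm_set_flag(&quit_flag, 1);
   rte_gpu_comm_destroy_flag(&quit_flag);
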
Communication list
~~~~~~~~~~~~~~~~~~

By default, DPDK pulls free mbufs from a mempool to receive packets.
Best practice, especially in a multithreaded application,
is not to make any assumption on which mbufs will be used
to receive the next bursts of packets.
Consider an application with a GPU memory mempool
attached to a receive queue and some task waiting on the GPU
to receive a new burst of packets to be processed:
the CPU needs to communicate
the list of mbuf payload addresses where the received packets have been stored.
The ``rte_gpu_comm_*()`` functions are responsible for creating a list of packets
that can be populated with received mbuf payload addresses
and communicated to the task running on the GPU.


CUDA Example
------------

The pseudo-code below illustrates how to use
the functions of this library in a CUDA application.

.. code-block:: c

   //////////////////////////////////////////////////////////////////////////
   ///// gpudev library + CUDA functions
   //////////////////////////////////////////////////////////////////////////
   #define GPU_PAGE_SHIFT 16
   #define GPU_PAGE_SIZE (1UL << GPU_PAGE_SHIFT)

   int main()
   {
       struct rte_gpu_comm_flag quit_flag;
       struct rte_gpu_comm_list *comm_list;
       int nb_rx = 0;
       struct rte_mbuf *rx_mbufs[max_rx_mbufs];
       cudaStream_t cstream;
       struct rte_mempool *mpool_payload, *mpool_header;
       struct rte_pktmbuf_extmem ext_mem;
       int16_t dev_id;
       int16_t port_id = 0;

       /* Initialize CUDA objects (cstream, context, etc.). */
       /* Use gpudev library to register a new CUDA context if any. */

       /* Let's assume the application wants to use the default context of the GPU device 0. */
       dev_id = 0;

       /* Create an external memory mempool using memory allocated on the GPU. */
       ext_mem.elt_size = mbufs_headroom_size;
       ext_mem.buf_len = RTE_ALIGN_CEIL(mbufs_num * ext_mem.elt_size, GPU_PAGE_SIZE);
       ext_mem.buf_iova = RTE_BAD_IOVA;
       ext_mem.buf_ptr = rte_gpu_mem_alloc(dev_id, ext_mem.buf_len, 0);
       rte_extmem_register(ext_mem.buf_ptr, ext_mem.buf_len, NULL, ext_mem.buf_iova, GPU_PAGE_SIZE);
       rte_dev_dma_map(rte_eth_devices[port_id].device,
               ext_mem.buf_ptr, ext_mem.buf_iova, ext_mem.buf_len);
       mpool_payload = rte_pktmbuf_pool_create_extbuf("gpu_mempool", mbufs_num,
                                                      0, 0, ext_mem.elt_size,
                                                      rte_socket_id(), &ext_mem, 1);

       /*
        * Create CPU - device communication flag.
        * With this flag, the CPU can tell the CUDA kernel to exit from the main loop.
        */
       rte_gpu_comm_create_flag(dev_id, &quit_flag, RTE_GPU_COMM_FLAG_CPU);
       rte_gpu_comm_set_flag(&quit_flag, 0);

       /*
        * Create CPU - device communication list.
        * Each entry of this list will be populated by the CPU
        * with a new set of received mbufs that the CUDA kernel has to process.
        */
       comm_list = rte_gpu_comm_create_list(dev_id, num_entries);

       /* A very simple CUDA kernel with just 1 CUDA block and RTE_GPU_COMM_LIST_PKTS_MAX CUDA threads. */
       cuda_kernel_packet_processing<<<1, RTE_GPU_COMM_LIST_PKTS_MAX, 0, cstream>>>(quit_flag.ptr, comm_list, num_entries, ...);

       /*
        * For simplicity, the CPU here receives only 2 bursts of mbufs.
        * In a real application, network activity and device processing should overlap.
        */
       nb_rx = rte_eth_rx_burst(port_id, queue_id, &rx_mbufs[0], max_rx_mbufs);
       rte_gpu_comm_populate_list_pkts(&comm_list[0], rx_mbufs, nb_rx);
       nb_rx = rte_eth_rx_burst(port_id, queue_id, &rx_mbufs[0], max_rx_mbufs);
       rte_gpu_comm_populate_list_pkts(&comm_list[1], rx_mbufs, nb_rx);

       /*
        * The CPU waits for the completion of the packets' processing on the CUDA kernel
        * and then does a cleanup of the received mbufs.
        */
       while (rte_gpu_comm_cleanup_list(&comm_list[0]));
       while (rte_gpu_comm_cleanup_list(&comm_list[1]));

       /* CPU notifies the CUDA kernel that it has to terminate. */
       rte_gpu_comm_set_flag(&quit_flag, 1);

       /* gpudev objects cleanup/destruction */
       rte_gpu_mem_free(dev_id, ext_mem.buf_ptr);

       return 0;
   }

   //////////////////////////////////////////////////////////////////////////
   ///// CUDA kernel
   //////////////////////////////////////////////////////////////////////////

   __global__ void cuda_kernel_packet_processing(uint32_t *quit_flag_ptr, struct rte_gpu_comm_list *comm_list, int comm_list_entries)
   {
       int comm_list_index = 0;
       struct rte_gpu_comm_pkt *pkt_list = NULL;

       /* Do some pre-processing operations. */

       /* GPU kernel keeps checking this flag to know if it has to quit or wait for more packets. */
       while (*quit_flag_ptr == 0) {
           if (comm_list[comm_list_index].status != RTE_GPU_COMM_LIST_READY)
               continue;

           pkt_list = comm_list[comm_list_index].pkt_list;
           if (threadIdx.x < comm_list[comm_list_index].num_pkts)
           {
               /* Each CUDA thread processes a different packet. */
               packet_processing(pkt_list[threadIdx.x].addr, pkt_list[threadIdx.x].size, ..);
           }
           __threadfence();
           __syncthreads();

           /* Notify the CPU that this entry has been processed. */
           if (threadIdx.x == 0)
               comm_list[comm_list_index].status = RTE_GPU_COMM_LIST_DONE;
           __threadfence_system();

           /* Wait for new packets on the next communication list entry. */
           comm_list_index = (comm_list_index + 1) % comm_list_entries;
       }

       /* Do some post-processing operations. */
   }
224