xref: /dpdk/doc/guides/prog_guide/gpudev.rst (revision 41dd9a6bc2d9c6e20e139ad713cc9d172572dd43)
1.. SPDX-License-Identifier: BSD-3-Clause
2   Copyright (c) 2021 NVIDIA Corporation & Affiliates
3
4General-Purpose Graphics Processing Unit (GPU) Library
5======================================================
6
7When mixing networking activity with task processing on a GPU device,
8there may be the need to put in communication the CPU with the device
9in order to manage the memory, synchronize operations, exchange info, etc..
10
11By means of the generic GPU interface provided by this library,
12it is possible to allocate a chunk of GPU memory and use it
13to create a DPDK mempool with external mbufs having the payload
14on the GPU memory, enabling any network interface card
15supporting this feature
16to directly transmit and receive packets using GPU memory.
17
18Additionally, this library provides a number of functions
19to enhance the dialog between CPU and GPU.
20
21Out of scope of this library is to provide a wrapper for GPU specific libraries
22(e.g. CUDA Toolkit or OpenCL), thus it is not possible to launch workload
23on the device or create GPU specific objects
24(e.g. CUDA Driver context or CUDA Streams in case of NVIDIA GPUs).
25
26This library is optional in DPDK
27and can be disabled with ``-Ddisable_libs=gpudev``.
28
29
30Features
31--------
32
33This library provides a number of features:
34
35- Interoperability with device-specific library through generic handlers.
36- Allocate and free memory on the device.
37- Register CPU memory to make it visible from the device.
38- Communication between the CPU and the device.
39
40The whole CPU - GPU communication is implemented
41using CPU memory visible from the GPU.
42
43
44API Overview
45------------
46
47Child Device
48~~~~~~~~~~~~
49
50By default, DPDK PCIe module detects and registers physical GPU devices
51in the system.
52With the gpudev library is also possible to add additional non-physical devices
53through an ``uint64_t`` generic handler (e.g. CUDA Driver context)
54that will be registered internally by the driver as an additional device (child)
55connected to a physical device (parent).
56Each device (parent or child) is represented through a ID
57required to indicate which device a given operation should be executed on.
58
59Memory Allocation
60~~~~~~~~~~~~~~~~~
61
62gpudev can allocate on an input given GPU device a memory area
63returning the pointer to that memory.
64Later, it's also possible to free that memory with gpudev.
65GPU memory allocated outside of the gpudev library
66(e.g. with GPU-specific library) cannot be freed by the gpudev library.
67
68Memory Registration
69~~~~~~~~~~~~~~~~~~~
70
71gpudev can register a CPU memory area to make it visible from a GPU device.
72Later, it's also possible to unregister that memory with gpudev.
73CPU memory registered outside of the gpudev library
74(e.g. with GPU specific library) cannot be unregistered by the gpudev library.
75
76CPU mapping
77~~~~~~~~~~~
78
79gpudev can map into the CPU address space a GPU memory address allocated with gpudev.
80gpudev returns a pointer the CPU can use to access (ready or write) GPU memory.
81Later, it's also possible to unmap that memory with gpudev.
82GPU memory CPU mapped outside of the gpudev library (e.g. with GPU specific library)
83cannot be unmapped by the gpudev library.
84
85Memory Barrier
86~~~~~~~~~~~~~~
87
88Some GPU drivers may need, under certain conditions,
89to enforce the coherency of external devices writes (e.g. NIC receiving packets)
90into the GPU memory.
91gpudev abstracts and exposes this capability.
92
93Communication Flag
94~~~~~~~~~~~~~~~~~~
95
96Considering an application with some GPU task
97that's waiting to receive a signal from the CPU
98to move forward with the execution.
99The communication flag allocates a CPU memory GPU-visible ``uint32_t`` flag
100that can be used by the CPU to communicate with a GPU task.
101
102Communication list
103~~~~~~~~~~~~~~~~~~
104
105By default, DPDK pulls free mbufs from a mempool to receive packets.
106Best practice, especially in a multithreaded application,
107is to no make any assumption on which mbufs will be used
108to receive the next bursts of packets.
109Considering an application with a GPU memory mempool
110attached to a receive queue having some task waiting on the GPU
111to receive a new burst of packets to be processed,
112there is the need to communicate from the CPU
113the list of mbuf payload addresses where received packet have been stored.
114The ``rte_gpu_comm_*()`` functions are responsible to create a list of packets
115that can be populated with receive mbuf payload addresses
116and communicated to the task running on the GPU.
117
118
119CUDA Example
120------------
121
122In the example below, there is a pseudo-code to give an example
123about how to use functions in this library in case of a CUDA application.
124
125.. code-block:: c
126
127   //////////////////////////////////////////////////////////////////////////
128   ///// gpudev library + CUDA functions
129   //////////////////////////////////////////////////////////////////////////
130   #define GPU_PAGE_SHIFT 16
131   #define GPU_PAGE_SIZE (1UL << GPU_PAGE_SHIFT)
132
133   int main()
134   {
135       struct rte_gpu_flag quit_flag;
136       struct rte_gpu_comm_list *comm_list;
137       int nb_rx = 0;
138       int comm_list_entry = 0;
139       struct rte_mbuf *rx_mbufs[max_rx_mbufs];
140       cudaStream_t cstream;
141       struct rte_mempool *mpool_payload, *mpool_header;
142       struct rte_pktmbuf_extmem ext_mem;
143       int16_t dev_id;
144       int16_t port_id = 0;
145
146       /* Initialize CUDA objects (cstream, context, etc..). */
147       /* Use gpudev library to register a new CUDA context if any. */
148
149       /* Let's assume the application wants to use the default context of the GPU device 0. */
150       dev_id = 0;
151
152       /* Create an external memory mempool using memory allocated on the GPU. */
153       ext_mem.elt_size = mbufs_headroom_size;
154       ext_mem.buf_len = RTE_ALIGN_CEIL(mbufs_num * ext_mem.elt_size, GPU_PAGE_SIZE);
155       ext_mem.buf_iova = RTE_BAD_IOVA;
156       ext_mem.buf_ptr = rte_gpu_mem_alloc(dev_id, ext_mem.buf_len, 0);
157       rte_extmem_register(ext_mem.buf_ptr, ext_mem.buf_len, NULL, ext_mem.buf_iova, GPU_PAGE_SIZE);
158       rte_dev_dma_map(rte_eth_devices[port_id].device,
159               ext_mem.buf_ptr, ext_mem.buf_iova, ext_mem.buf_len);
160       mpool_payload = rte_pktmbuf_pool_create_extbuf("gpu_mempool", mbufs_num,
161                                                      0, 0, ext_mem.elt_size,
162                                                      rte_socket_id(), &ext_mem, 1);
163
164       /*
165        * Create CPU - device communication flag.
166        * With this flag, the CPU can tell to the CUDA kernel to exit from the main loop.
167        */
168       rte_gpu_comm_create_flag(dev_id, &quit_flag, RTE_GPU_COMM_FLAG_CPU);
169       rte_gpu_comm_set_flag(&quit_flag , 0);
170
171       /*
172        * Create CPU - device communication list.
173        * Each entry of this list will be populated by the CPU
174        * with a new set of received mbufs that the CUDA kernel has to process.
175        */
176       comm_list = rte_gpu_comm_create_list(dev_id, num_entries);
177
178       /* A very simple CUDA kernel with just 1 CUDA block and RTE_GPU_COMM_LIST_PKTS_MAX CUDA threads. */
179       cuda_kernel_packet_processing<<<1, RTE_GPU_COMM_LIST_PKTS_MAX, 0, cstream>>>(quit_flag->ptr, comm_list, num_entries, ...);
180
181       /*
182        * For simplicity, the CPU here receives only 2 bursts of mbufs.
183        * In a real application, network activity and device processing should overlap.
184        */
185       nb_rx = rte_eth_rx_burst(port_id, queue_id, &(rx_mbufs[0]), max_rx_mbufs);
186       rte_gpu_comm_populate_list_pkts(comm_list[0], rx_mbufs, nb_rx);
187       nb_rx = rte_eth_rx_burst(port_id, queue_id, &(rx_mbufs[0]), max_rx_mbufs);
188       rte_gpu_comm_populate_list_pkts(comm_list[1], rx_mbufs, nb_rx);
189
190       /*
191        * CPU waits for the completion of the packets' processing on the CUDA kernel
192        * and then it does a cleanup of the received mbufs.
193        */
194       while (rte_gpu_comm_cleanup_list(comm_list[0]));
195       while (rte_gpu_comm_cleanup_list(comm_list[1]));
196
197       /* CPU notifies the CUDA kernel that it has to terminate. */
198       rte_gpu_comm_set_flag(&quit_flag, 1);
199
200       /* gpudev objects cleanup/destruction */
201       rte_gpu_mem_free(dev_id, ext_mem.buf_len);
202
203       return 0;
204   }
205
206   //////////////////////////////////////////////////////////////////////////
207   ///// CUDA kernel
208   //////////////////////////////////////////////////////////////////////////
209
210   void cuda_kernel(uint32_t * quit_flag_ptr, struct rte_gpu_comm_list *comm_list, int comm_list_entries)
211   {
212       int comm_list_index = 0;
213       struct rte_gpu_comm_pkt *pkt_list = NULL;
214
215       /* Do some pre-processing operations. */
216
217       /* GPU kernel keeps checking this flag to know if it has to quit or wait for more packets. */
218       while (*quit_flag_ptr == 0) {
219           if (comm_list[comm_list_index]->status_d[0] != RTE_GPU_COMM_LIST_READY)
220               continue;
221
222           if (threadIdx.x < comm_list[comm_list_index]->num_pkts)
223           {
224               /* Each CUDA thread processes a different packet. */
225               packet_processing(comm_list[comm_list_index]->addr, comm_list[comm_list_index]->size, ..);
226           }
227           __threadfence();
228           __syncthreads();
229
230           /* Wait for new packets on the next communication list entry. */
231           comm_list_index = (comm_list_index+1) % comm_list_entries;
232       }
233
234       /* Do some post-processing operations. */
235   }
236