1.. SPDX-License-Identifier: BSD-3-Clause 2 Copyright (c) 2021 NVIDIA Corporation & Affiliates 3 4General-Purpose Graphics Processing Unit Library 5================================================ 6 7When mixing networking activity with task processing on a GPU device, 8there may be the need to put in communication the CPU with the device 9in order to manage the memory, synchronize operations, exchange info, etc.. 10 11By means of the generic GPU interface provided by this library, 12it is possible to allocate a chunk of GPU memory and use it 13to create a DPDK mempool with external mbufs having the payload 14on the GPU memory, enabling any network interface card 15(which support this feature like Mellanox NIC) 16to directly transmit and receive packets using GPU memory. 17 18Additionally, this library provides a number of functions 19to enhance the dialog between CPU and GPU. 20 21Out of scope of this library is to provide a wrapper for GPU specific libraries 22(e.g. CUDA Toolkit or OpenCL), thus it is not possible to launch workload 23on the device or create GPU specific objects 24(e.g. CUDA Driver context or CUDA Streams in case of NVIDIA GPUs). 25 26This library is optional in DPDK 27and can be disabled with ``-Ddisable_libs=gpudev``. 28 29 30Features 31-------- 32 33This library provides a number of features: 34 35- Interoperability with device-specific library through generic handlers. 36- Allocate and free memory on the device. 37- Register CPU memory to make it visible from the device. 38- Communication between the CPU and the device. 39 40The whole CPU - GPU communication is implemented 41using CPU memory visible from the GPU. 42 43 44API Overview 45------------ 46 47Child Device 48~~~~~~~~~~~~ 49 50By default, DPDK PCIe module detects and registers physical GPU devices 51in the system. 52With the gpudev library is also possible to add additional non-physical devices 53through an ``uint64_t`` generic handler (e.g. CUDA Driver context) 54that will be registered internally by the driver as an additional device (child) 55connected to a physical device (parent). 56Each device (parent or child) is represented through a ID 57required to indicate which device a given operation should be executed on. 58 59Memory Allocation 60~~~~~~~~~~~~~~~~~ 61 62gpudev can allocate on an input given GPU device a memory area 63returning the pointer to that memory. 64Later, it's also possible to free that memory with gpudev. 65GPU memory allocated outside of the gpudev library 66(e.g. with GPU-specific library) cannot be freed by the gpudev library. 67 68Memory Registration 69~~~~~~~~~~~~~~~~~~~ 70 71gpudev can register a CPU memory area to make it visible from a GPU device. 72Later, it's also possible to unregister that memory with gpudev. 73CPU memory registered outside of the gpudev library 74(e.g. with GPU specific library) cannot be unregistered by the gpudev library. 75 76Memory Barrier 77~~~~~~~~~~~~~~ 78 79Some GPU drivers may need, under certain conditions, 80to enforce the coherency of external devices writes (e.g. NIC receiving packets) 81into the GPU memory. 82gpudev abstracts and exposes this capability. 83 84Communication Flag 85~~~~~~~~~~~~~~~~~~ 86 87Considering an application with some GPU task 88that's waiting to receive a signal from the CPU 89to move forward with the execution. 90The communication flag allocates a CPU memory GPU-visible ``uint32_t`` flag 91that can be used by the CPU to communicate with a GPU task. 92 93Communication list 94~~~~~~~~~~~~~~~~~~ 95 96By default, DPDK pulls free mbufs from a mempool to receive packets. 97Best practice, especially in a multithreaded application, 98is to no make any assumption on which mbufs will be used 99to receive the next bursts of packets. 100Considering an application with a GPU memory mempool 101attached to a receive queue having some task waiting on the GPU 102to receive a new burst of packets to be processed, 103there is the need to communicate from the CPU 104the list of mbuf payload addresses where received packet have been stored. 105The ``rte_gpu_comm_*()`` functions are responsible to create a list of packets 106that can be populated with receive mbuf payload addresses 107and communicated to the task running on the GPU. 108 109 110CUDA Example 111------------ 112 113In the example below, there is a pseudo-code to give an example 114about how to use functions in this library in case of a CUDA application. 115 116.. code-block:: c 117 118 ////////////////////////////////////////////////////////////////////////// 119 ///// gpudev library + CUDA functions 120 ////////////////////////////////////////////////////////////////////////// 121 #define GPU_PAGE_SHIFT 16 122 #define GPU_PAGE_SIZE (1UL << GPU_PAGE_SHIFT) 123 124 int main() 125 { 126 struct rte_gpu_flag quit_flag; 127 struct rte_gpu_comm_list *comm_list; 128 int nb_rx = 0; 129 int comm_list_entry = 0; 130 struct rte_mbuf *rx_mbufs[max_rx_mbufs]; 131 cudaStream_t cstream; 132 struct rte_mempool *mpool_payload, *mpool_header; 133 struct rte_pktmbuf_extmem ext_mem; 134 int16_t dev_id; 135 int16_t port_id = 0; 136 137 /* Initialize CUDA objects (cstream, context, etc..). */ 138 /* Use gpudev library to register a new CUDA context if any. */ 139 140 /* Let's assume the application wants to use the default context of the GPU device 0. */ 141 dev_id = 0; 142 143 /* Create an external memory mempool using memory allocated on the GPU. */ 144 ext_mem.elt_size = mbufs_headroom_size; 145 ext_mem.buf_len = RTE_ALIGN_CEIL(mbufs_num * ext_mem.elt_size, GPU_PAGE_SIZE); 146 ext_mem.buf_iova = RTE_BAD_IOVA; 147 ext_mem.buf_ptr = rte_gpu_mem_alloc(dev_id, ext_mem.buf_len, 0); 148 rte_extmem_register(ext_mem.buf_ptr, ext_mem.buf_len, NULL, ext_mem.buf_iova, GPU_PAGE_SIZE); 149 rte_dev_dma_map(rte_eth_devices[port_id].device, 150 ext_mem.buf_ptr, ext_mem.buf_iova, ext_mem.buf_len); 151 mpool_payload = rte_pktmbuf_pool_create_extbuf("gpu_mempool", mbufs_num, 152 0, 0, ext_mem.elt_size, 153 rte_socket_id(), &ext_mem, 1); 154 155 /* 156 * Create CPU - device communication flag. 157 * With this flag, the CPU can tell to the CUDA kernel to exit from the main loop. 158 */ 159 rte_gpu_comm_create_flag(dev_id, &quit_flag, RTE_GPU_COMM_FLAG_CPU); 160 rte_gpu_comm_set_flag(&quit_flag , 0); 161 162 /* 163 * Create CPU - device communication list. 164 * Each entry of this list will be populated by the CPU 165 * with a new set of received mbufs that the CUDA kernel has to process. 166 */ 167 comm_list = rte_gpu_comm_create_list(dev_id, num_entries); 168 169 /* A very simple CUDA kernel with just 1 CUDA block and RTE_GPU_COMM_LIST_PKTS_MAX CUDA threads. */ 170 cuda_kernel_packet_processing<<<1, RTE_GPU_COMM_LIST_PKTS_MAX, 0, cstream>>>(quit_flag->ptr, comm_list, num_entries, ...); 171 172 /* 173 * For simplicity, the CPU here receives only 2 bursts of mbufs. 174 * In a real application, network activity and device processing should overlap. 175 */ 176 nb_rx = rte_eth_rx_burst(port_id, queue_id, &(rx_mbufs[0]), max_rx_mbufs); 177 rte_gpu_comm_populate_list_pkts(comm_list[0], rx_mbufs, nb_rx); 178 nb_rx = rte_eth_rx_burst(port_id, queue_id, &(rx_mbufs[0]), max_rx_mbufs); 179 rte_gpu_comm_populate_list_pkts(comm_list[1], rx_mbufs, nb_rx); 180 181 /* 182 * CPU waits for the completion of the packets' processing on the CUDA kernel 183 * and then it does a cleanup of the received mbufs. 184 */ 185 while (rte_gpu_comm_cleanup_list(comm_list[0])); 186 while (rte_gpu_comm_cleanup_list(comm_list[1])); 187 188 /* CPU notifies the CUDA kernel that it has to terminate. */ 189 rte_gpu_comm_set_flag(&quit_flag, 1); 190 191 /* gpudev objects cleanup/destruction */ 192 rte_gpu_mem_free(dev_id, ext_mem.buf_len); 193 194 return 0; 195 } 196 197 ////////////////////////////////////////////////////////////////////////// 198 ///// CUDA kernel 199 ////////////////////////////////////////////////////////////////////////// 200 201 void cuda_kernel(uint32_t * quit_flag_ptr, struct rte_gpu_comm_list *comm_list, int comm_list_entries) 202 { 203 int comm_list_index = 0; 204 struct rte_gpu_comm_pkt *pkt_list = NULL; 205 206 /* Do some pre-processing operations. */ 207 208 /* GPU kernel keeps checking this flag to know if it has to quit or wait for more packets. */ 209 while (*quit_flag_ptr == 0) { 210 if (comm_list[comm_list_index]->status != RTE_GPU_COMM_LIST_READY) 211 continue; 212 213 if (threadIdx.x < comm_list[comm_list_index]->num_pkts) 214 { 215 /* Each CUDA thread processes a different packet. */ 216 packet_processing(comm_list[comm_list_index]->addr, comm_list[comm_list_index]->size, ..); 217 } 218 __threadfence(); 219 __syncthreads(); 220 221 /* Wait for new packets on the next communication list entry. */ 222 comm_list_index = (comm_list_index+1) % comm_list_entries; 223 } 224 225 /* Do some post-processing operations. */ 226 } 227