..  SPDX-License-Identifier: BSD-3-Clause
    Copyright(c) 2018 Intel Corporation.

Debug & Troubleshoot guide
============================

DPDK applications can be designed to have simple or complex pipeline processing
stages making use of single or multiple threads. Applications can also use poll
mode hardware devices, which help in offloading CPU cycles. It is common to
find solutions designed with

* single or multiple primary processes

* single primary and single secondary

* single primary and multiple secondaries

In all the above cases, it is tedious to isolate, debug, and understand various
behaviors which occur randomly or periodically. The goal of this guide is to
consolidate a few commonly seen issues for reference, and then to isolate and
identify the root cause through step-by-step debugging at various stages.

.. note::

   It is difficult to cover all possible issues in a single attempt. With
   feedback and suggestions from the community, more cases can be covered.


Application Overview
----------------------

By making use of the application model below as a reference, we can discuss
multiple causes of issues in this guide. Let us assume the sample makes use of
a single primary process, with various processing stages running on multiple
cores. The application may also make use of Poll Mode Drivers and libraries
such as service cores, mempool, mbuf, eventdev, cryptodev, QoS, and ethdev.

The overview of an application modeled using PMD is shown in
:numref:`dtg_sample_app_model`.

.. _dtg_sample_app_model:

.. figure:: img/dtg_sample_app_model.*

   Overview of pipeline stages of an application


Bottleneck Analysis
----------------------

Factors such as the platform, scale, and target use case drive the design
decisions. These distinct preferences lead to many possible combinations built
from DPDK PMDs and libraries. Components that are usually held constant, such
as the compiler, library build mode, and optimization flags, also affect the
application.


Is there a mismatch in packet (received < desired) rate?
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

RX port and associated core :numref:`dtg_rx_rate`.

.. _dtg_rx_rate:

.. figure:: img/dtg_rx_rate.*

   RX packet rate compared against received rate.

#. Is the configuration for the RX setup correct?

   * Identify if the port speed and duplex match the desired values with
     ``rte_eth_link_get`` (see the sketch at the end of this section).

   * If drops do not occur for traffic to the port's unique MAC address but
     do for other traffic, check promiscuous mode with
     ``rte_eth_promiscuous_get``.

#. Is the drop isolated to certain NICs only?

   * Make use of ``rte_eth_stats_get`` to identify the cause of the drops.

   * If there are mbuf drops, check whether the RX descriptor count
     ``nb_desc`` is sufficient for the application load.

   * If ``rte_eth_stats_get`` shows drops on specific RX queues, ensure the
     RX lcore threads have enough cycles for ``rte_eth_rx_burst`` on the port
     queue pair.

   * If traffic is redirected to a specific port queue pair, ensure the RX
     lcore thread serving it gets enough cycles.

   * Check the RSS configuration with ``rte_eth_dev_rss_hash_conf_get`` if
     the spread is uneven and causing drops.

   * If PMD stats are not updating, there might be an offload or
     configuration which is dropping the incoming traffic.

#. Are drops still seen?

   * If there are multiple port queue pairs, it might be the RX thread, RX
     distributor, or event RX adapter not having enough cycles.

   * If drops are seen for the RX adapter or RX distributor, try using
     ``rte_prefetch_non_temporal``, which informs the core that the mbuf in
     the cache is only needed temporarily.
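
The checks above can be collected into a small helper inside the application.
The following is only a minimal sketch, assuming ``port_id`` refers to a
configured and started port; the printed messages are illustrative.

.. code-block:: c

   #include <inttypes.h>
   #include <stdio.h>

   #include <rte_ethdev.h>

   /* Minimal sketch: report link state, promiscuous mode and drop counters
    * for one port, to narrow down a "received < desired" packet rate. */
   static void
   rx_rate_sanity_check(uint16_t port_id)
   {
           struct rte_eth_link link;
           struct rte_eth_stats stats;

           /* Negotiated speed and link state of the port. */
           rte_eth_link_get(port_id, &link);
           printf("port %u: link %s, speed %u Mbps\n", port_id,
                  link.link_status ? "up" : "down", link.link_speed);

           /* Promiscuous mode matters when traffic is not sent to our MAC. */
           printf("port %u: promiscuous %s\n", port_id,
                  rte_eth_promiscuous_get(port_id) == 1 ? "on" : "off");

           /* imissed: packets dropped because RX queues were not drained in
            * time; rx_nombuf: mbuf allocation failures (nb_desc or mempool
            * too small for the offered load). */
           if (rte_eth_stats_get(port_id, &stats) == 0)
                   printf("port %u: ierrors %" PRIu64 ", imissed %" PRIu64
                          ", rx_nombuf %" PRIu64 "\n", port_id,
                          stats.ierrors, stats.imissed, stats.rx_nombuf);
   }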


Are there packet drops at receive or transmit?
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

RX-TX port and associated cores :numref:`dtg_rx_tx_drop`.

.. _dtg_rx_tx_drop:

.. figure:: img/dtg_rx_tx_drop.*

   RX-TX drops

#. At RX

   * Identify if there are multiple RX queues configured for the port via
     ``nb_rx_queues`` using ``rte_eth_dev_info_get``.

   * If ``rte_eth_stats_get`` shows drops in ``q_errors``, check whether the
     RX thread is configured to fetch packets from that port queue pair.

   * If ``rte_eth_stats_get`` shows drops in ``rx_nombuf``, check whether the
     RX thread has enough cycles to consume the packets from the queue.

#. At TX

   * If the TX rate is falling behind the application fill rate, identify
     whether there are enough descriptors with ``rte_eth_dev_info_get`` for
     TX.

   * Check that ``nb_pkts`` in ``rte_eth_tx_burst`` is used to send multiple
     packets per call.

   * Check whether ``rte_eth_tx_burst`` invokes the vector function call for
     the PMD.

   * If ``oerrors`` are incrementing, TX packet validations are failing.
     Check whether there are queue-specific offload failures.

   * If the drops occur for large packets, check the MTU and multi-segment
     support configured for the NIC.


Are there object drops at the producer point for the ring library?
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Producer point for the ring :numref:`dtg_producer_ring`.

.. _dtg_producer_ring:

.. figure:: img/dtg_producer_ring.*

   Producer point for Rings

#. Performance issue isolation at the producer

   * Use ``rte_ring_dump`` to validate whether the single producer flag
     ``RING_F_SP_ENQ`` is set for all single-producer rings.

   * There should be a sufficient ``rte_ring_free_count`` at any point in
     time.

   * Extreme stalls in the dequeue stage of the pipeline will cause
     ``rte_ring_full`` to be true.


Are there object drops at the consumer point for the ring library?
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Consumer point for the ring :numref:`dtg_consumer_ring`.

.. _dtg_consumer_ring:

.. figure:: img/dtg_consumer_ring.*

   Consumer point for Rings

#. Performance issue isolation at the consumer

   * Use ``rte_ring_dump`` to validate whether the single consumer flag
     ``RING_F_SC_DEQ`` is set for all single-consumer rings (see the sketch
     at the end of this section).

   * If the desired dequeue burst falls behind the actual dequeue, the
     enqueue stage is not filling up the ring as required.

   * An extreme stall in the enqueue stage will lead to ``rte_ring_empty``
     being true.
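
A quick way to apply these producer and consumer checks at runtime is sketched
below; the helper name and messages are illustrative, and the ring is assumed
to have been created earlier with ``rte_ring_create``.

.. code-block:: c

   #include <stdio.h>

   #include <rte_ring.h>

   /* Minimal sketch: report occupancy of one pipeline ring and flag the
    * conditions described above (stalled consumer or starved producer). */
   static void
   ring_health_check(const struct rte_ring *r)
   {
           printf("ring %s: %u entries used, %u free\n", r->name,
                  rte_ring_count(r), rte_ring_free_count(r));

           /* A ring that stays full points at a stalled dequeue stage. */
           if (rte_ring_full(r))
                   printf("ring %s: full, enqueues are failing\n", r->name);

           /* A ring that stays empty points at a stalled enqueue stage. */
           if (rte_ring_empty(r))
                   printf("ring %s: empty, consumer is starved\n", r->name);

           /* The dump includes the flags (RING_F_SP_ENQ/RING_F_SC_DEQ),
            * size and producer/consumer head-tail indexes. */
           rte_ring_dump(stdout, r);
   }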


Is there a variance in packet or object processing rate in the pipeline?
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Memory objects close to NUMA :numref:`dtg_mempool`.

.. _dtg_mempool:

.. figure:: img/dtg_mempool.*

   Memory objects have to be close to the device per NUMA.

#. A stall in the processing pipeline can be attributed to MBUF release
   delays. These can be narrowed down to

   * Heavy processing cycles at a single stage or at multiple processing
     stages.

   * Cache misses spread across the pipeline due to the increased number of
     stages.

   * The CPU thread responsible for TX not being able to keep up with the
     burst of traffic.

   * Extra cycles to linearize multi-segment buffers and software offloads
     like checksum, TSO, and VLAN strip.

   * Packet buffer copies in the fast path, which also cause stalls in MBUF
     release if not done selectively.

   * Application logic setting ``rte_mbuf_refcnt_set`` to a higher than
     desired value and frequently using ``rte_pktmbuf_prefree_seg``, so the
     MBUF is not released back to the mempool.

#. Lower performance between the pipeline processing stages can be due to

   * The NUMA node for packets or objects from the NIC, mempool, and ring
     not being the same.

   * Insufficient objects in the pool for a specific socket. Use
     ``rte_mempool_avail_count`` or ``rte_mempool_in_use_count`` to monitor
     when the drops occur.

   * Missing prefetches; try prefetching the content in the processing
     pipeline logic to minimize the stalls.

#. Performance issues can be due to special cases

   * Check whether the MBUF is contiguous with ``rte_pktmbuf_is_contiguous``,
     as certain offloads require it.

   * Use ``rte_mempool_cache_create`` for user threads that require access to
     mempool objects.

   * If the variance is absent for larger huge pages, then try
     ``rte_mem_lock_page`` on the objects, packets, and lookup tables to
     isolate the issue.


Is there a variance in cryptodev performance?
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Crypto device and PMD :numref:`dtg_crypto`.

.. _dtg_crypto:

.. figure:: img/dtg_crypto.*

   CRYPTO and interaction with PMD device.

#. Performance issue isolation for enqueue

   * Ensure the cryptodev, its resources, and the enqueue thread are running
     on cores of the same NUMA node.

   * Isolate the cause of errors reported in ``enqueue_err_count`` using
     ``rte_cryptodev_stats_get``.

   * Parallelize the enqueue thread across multiple queue pairs.

#. Performance issue isolation for dequeue

   * Ensure the cryptodev, its resources, and the dequeue thread are running
     on cores of the same NUMA node.

   * Isolate the cause of errors reported in ``dequeue_err_count`` using
     ``rte_cryptodev_stats_get``.

   * Parallelize the dequeue thread across multiple queue pairs.

#. Performance issue isolation for crypto operations

   * If cryptodev software-assist is in use, ensure the library is built with
     the right (SIMD) flags, or check whether the device reports the CPU ISA
     in ``feature_flags`` (AVX|SSE|NEON) using ``rte_cryptodev_info_get``.

   * If cryptodev hardware-assist is in use, ensure both firmware and drivers
     are up to date.

#. Configuration issue isolation

   * Identify cryptodev instances with ``rte_cryptodev_count`` and
     ``rte_cryptodev_info_get``, as in the sketch at the end of this section.
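
The enumeration, NUMA placement, and error-counter checks above can be
combined as in the minimal sketch below; ``worker_socket`` is assumed to be
the NUMA node of the lcore doing the enqueue and dequeue.

.. code-block:: c

   #include <inttypes.h>
   #include <stdio.h>

   #include <rte_cryptodev.h>

   /* Minimal sketch: report NUMA placement, feature flags and error
    * counters for every cryptodev instance. */
   static void
   cryptodev_health_check(int worker_socket)
   {
           uint8_t dev_id;

           for (dev_id = 0; dev_id < rte_cryptodev_count(); dev_id++) {
                   struct rte_cryptodev_info info;
                   struct rte_cryptodev_stats stats;

                   rte_cryptodev_info_get(dev_id, &info);
                   printf("cryptodev %u (%s): socket %d, flags 0x%" PRIx64 "\n",
                          dev_id, info.driver_name,
                          rte_cryptodev_socket_id(dev_id), info.feature_flags);

                   /* A device on a remote socket adds cross-NUMA cycles to
                    * every enqueue and dequeue. */
                   if (rte_cryptodev_socket_id(dev_id) != worker_socket)
                           printf("cryptodev %u: not on worker socket %d\n",
                                  dev_id, worker_socket);

                   /* Non-zero error counters isolate enqueue or dequeue
                    * failures as described above. */
                   if (rte_cryptodev_stats_get(dev_id, &stats) == 0)
                           printf("cryptodev %u: enqueue_err %" PRIu64
                                  ", dequeue_err %" PRIu64 "\n", dev_id,
                                  stats.enqueue_err_count,
                                  stats.dequeue_err_count);
           }
   }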


Is user function performance not as expected?
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Custom worker function :numref:`dtg_distributor_worker`.

.. _dtg_distributor_worker:

.. figure:: img/dtg_distributor_worker.*

   Custom worker function performance drops.

#. Performance issue isolation

   * Functions running on CPU cores without context switches are the best
     performing scenarios. Identify the lcore with ``rte_lcore_id`` and the
     lcore to CPU mapping with ``rte_lcore_index``.

   * Use ``rte_thread_get_affinity`` to isolate functions running on the same
     CPU core.

#. Configuration issue isolation

   * Identify the core role using ``rte_eal_lcore_role`` (RTE, OFF, SERVICE,
     or NON_EAL) and check that the performance-critical functions are mapped
     to run on the intended cores.

   * For high-performance execution logic, ensure it runs on the correct NUMA
     node and worker core.

   * Analyze the run logic with ``rte_dump_stack`` and ``rte_memdump`` for
     more insight.

   * Make use of objdump to ensure the generated opcodes match the desired
     state.


Are the execution cycles for dynamic service functions not frequent enough?
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Service functions on service cores :numref:`dtg_service`.

.. _dtg_service:

.. figure:: img/dtg_service.*

   Functions running on service cores

#. Performance issue isolation

   * For services configured for parallel execution,
     ``rte_service_lcore_count`` should be equal to
     ``rte_service_lcore_count_services``.

   * A service intended to run in parallel on all cores should return
     ``RTE_SERVICE_CAP_MT_SAFE`` for ``rte_service_probe_capability``, and
     ``rte_service_map_lcore_get`` should return a unique lcore mapping.

   * If the execution cycles of a dynamic service are infrequent and several
     services share an lcore, check that the overall execution of all
     services fits within that lcore's cycle budget.

#. Configuration issue isolation

   * Check whether the service is running with ``rte_service_runstate_get``.

   * Generic debug via ``rte_service_dump``.


Is there a bottleneck in the performance of eventdev?
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

#. Check the generic configuration

   * Ensure the event devices created are on the right NUMA node using
     ``rte_event_dev_count`` and ``rte_event_dev_socket_id``.

   * Check the event stages to see whether events are looped back into the
     same queue.

   * If the failure is in the event enqueue stage, check the queue depth with
     ``rte_event_dev_info_get``.

#. If there are performance drops in the enqueue stage

   * Use ``rte_event_dev_dump`` to dump the eventdev information.

   * Periodically check the stats for queues and ports to identify
     starvation.

   * Check the in-flight events for the desired queue for enqueue and
     dequeue.


Is there a variance in the traffic manager?
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Traffic Manager on the TX interface :numref:`dtg_qos_tx`.

.. _dtg_qos_tx:

.. figure:: img/dtg_qos_tx.*

   Traffic Manager just before TX.

#. Identify whether the cause of a variance from the expected behavior is
   insufficient CPU cycles. Use ``rte_tm_capabilities_get`` to fetch the
   features for hierarchies, WRED, and priority schedulers which can be
   offloaded to hardware.

#. Undesired flow drops can be narrowed down to WRED, priority, and rate
   limiters.

#. Isolate the flow in which the undesired drops occur. Use
   ``rte_tm_get_number_of_leaf_nodes`` and the flow table to pin down the
   leaf where the drops occur.

#. Check the stats using ``rte_tm_node_stats_update`` and
   ``rte_tm_node_stats_read`` for drops at the hierarchy, scheduler, and WRED
   configuration levels, as in the sketch below.
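
A minimal sketch of such a check is shown below; ``leaf_node_id`` is a
hypothetical node identifier taken from the application's flow table, and the
capability fields printed are only examples.

.. code-block:: c

   #include <inttypes.h>
   #include <stdio.h>

   #include <rte_tm.h>

   /* Minimal sketch: query traffic manager capabilities and read the
    * counters of one leaf node to locate undesired drops. */
   static void
   tm_drop_check(uint16_t port_id, uint32_t leaf_node_id)
   {
           struct rte_tm_capabilities cap;
           struct rte_tm_node_stats stats;
           struct rte_tm_error error;
           uint64_t stats_mask = 0;
           uint32_t n_leaf = 0;

           /* Hierarchy depth and node count supported by the hardware. */
           if (rte_tm_capabilities_get(port_id, &cap, &error) == 0)
                   printf("port %u: max %u levels, %u nodes\n", port_id,
                          cap.n_levels_max, cap.n_nodes_max);

           /* Leaf count helps map flows to scheduler queues. */
           if (rte_tm_get_number_of_leaf_nodes(port_id, &n_leaf, &error) == 0)
                   printf("port %u: %u leaf nodes\n", port_id, n_leaf);

           /* Per-node counters; stats_mask reports which fields are valid. */
           if (rte_tm_node_stats_read(port_id, leaf_node_id, &stats,
                                      &stats_mask, 0, &error) == 0)
                   printf("node %u: %" PRIu64 " pkts, %" PRIu64 " bytes\n",
                          leaf_node_id, stats.n_pkts, stats.n_bytes);
   }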


Is the packet in an unexpected format?
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Packet capture before and after processing :numref:`dtg_pdump`.

.. _dtg_pdump:

.. figure:: img/dtg_pdump.*

   Capture points of Traffic at RX-TX.

#. To isolate possible packet corruption in the processing pipeline,
   carefully staged packet captures have to be implemented.

   * First, isolate at NIC entry and exit.

     Use pdump in the primary process to allow a secondary process to access
     the port-queue pair. The packets get copied over in the RX|TX callback
     by the secondary process using ring buffers.

   * Second, isolate at pipeline entry and exit.

     Using hooks or callbacks, capture the packets in the middle of the
     pipeline stage and copy them, so they can be shared with the secondary
     debug process via user-defined custom rings.

.. note::

   Use a similar analysis for object and metadata corruption.


Does the issue still persist?
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

The issue can be further narrowed down to the following causes.

#. If there is vendor or application specific metadata, check for errors due
   to metadata error flags. Dumping the private metadata in the objects can
   give insight into details for debugging.

#. If multiple processes are used for either data or configuration, check for
   possible errors in the secondary process where the configuration fails,
   and for possible data corruption in the data plane.

#. Random drops at RX or TX when another application is started are an
   indication of a noisy neighbor. Try using the cache allocation technique
   to minimize the effect between applications.


How to develop custom code to debug?
----------------------------------------

#. For an application that runs as the primary process only, debug
   functionality is added in the same process. It can be invoked by a timer
   callback, a service core, or a signal handler, as in the sketch below.

#. For an application that runs as multiple processes, add the debug
   functionality in a standalone secondary process.
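
As an illustration of the primary-process approach, the sketch below registers
a signal handler that dumps port counters and mempool state on demand. The
handler name and the use of ``SIGUSR1`` are only examples, and printing from a
signal handler is tolerated here only because this is debug-only code.

.. code-block:: c

   #include <inttypes.h>
   #include <signal.h>
   #include <stdio.h>

   #include <rte_ethdev.h>
   #include <rte_mempool.h>

   /* Illustrative sketch: dump per-port counters and all mempool state
    * when the operator sends SIGUSR1 to the primary process. */
   static void
   debug_dump(int signum)
   {
           uint16_t port_id;
           struct rte_eth_stats stats;

           (void)signum;

           RTE_ETH_FOREACH_DEV(port_id) {
                   if (rte_eth_stats_get(port_id, &stats) != 0)
                           continue;
                   printf("port %u: ipackets %" PRIu64 ", opackets %" PRIu64
                          ", imissed %" PRIu64 ", oerrors %" PRIu64 "\n",
                          port_id, stats.ipackets, stats.opackets,
                          stats.imissed, stats.oerrors);
           }

           /* Occupancy and statistics of every mempool in the system. */
           rte_mempool_list_dump(stdout);
   }

   /* Call once from the primary process after rte_eal_init(). */
   static void
   debug_dump_register(void)
   {
           signal(SIGUSR1, debug_dump);
   }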