..  SPDX-License-Identifier: BSD-3-Clause
    Copyright(c) 2018 Intel Corporation.

Debug & Troubleshoot guide
==========================

DPDK applications can be designed to have simple or complex pipeline processing
stages, making use of single or multiple threads. Applications can also use
poll mode hardware devices, which helps in offloading CPU cycles. It is common
to find solutions designed with

* single or multiple primary processes

* single primary and single secondary

* single primary and multiple secondaries

In all the above cases, it is tedious to isolate, debug, and understand various
behaviors which occur randomly or periodically. The goal of this guide is to
consolidate a few commonly seen issues for reference, and then to identify the
root cause through step-by-step debugging at various stages.

.. note::

   It is difficult to cover all possible issues in a single attempt. With
   feedback and suggestions from the community, more cases can be covered.


Application Overview
--------------------

Using the application model below as a reference, we can discuss multiple
causes of issues in this guide. Let us assume the sample makes use of a single
primary process, with various processing stages running on multiple cores. The
application may also make use of Poll Mode Drivers and libraries such as
service cores, mempool, mbuf, eventdev, cryptodev, QoS, and ethdev.

The overview of an application modeled using PMD is shown in
:numref:`dtg_sample_app_model`.

.. _dtg_sample_app_model:

.. figure:: img/dtg_sample_app_model.*

   Overview of pipeline stage of an application


Bottleneck Analysis
-------------------

Factors that drive the design decisions could be the platform, the scale
factor, and the target. These distinct preferences lead to multiple
combinations that are built using PMDs and libraries of DPDK. While the
compiler, library mode, and optimization flags are held constant, they still
affect the application.


Is there a mismatch in packet (received < desired) rate?
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

RX port and associated core :numref:`dtg_rx_rate`.

.. _dtg_rx_rate:

.. figure:: img/dtg_rx_rate.*

   RX packet rate compared against received rate.

#. Is the RX configuration set up correctly? (A minimal check is sketched
   after this list.)

   * Identify if port speed and duplex match the desired values with
     ``rte_eth_link_get``.

   * Check that ``DEV_RX_OFFLOAD_JUMBO_FRAME`` is set with
     ``rte_eth_dev_info_get``.

   * Check promiscuous mode if the drops do not occur for a unique MAC address
     with ``rte_eth_promiscuous_get``.

#. Is the drop isolated to certain NICs only?

   * Make use of ``rte_eth_dev_stats`` to identify the cause of the drops.

   * If there are mbuf drops, check whether ``nb_desc`` for the RX descriptor
     ring is sufficient for the application.

   * If ``rte_eth_dev_stats`` shows drops on specific RX queues, ensure the RX
     lcore threads have enough cycles for ``rte_eth_rx_burst`` on the port
     queue pair.

   * If packets are redirected to a specific port queue pair, ensure the RX
     lcore threads get enough cycles.

   * Check the RSS configuration with ``rte_eth_dev_rss_hash_conf_get`` if the
     spread is uneven and causing drops.

   * If PMD stats are not updating, then there might be an offload or
     configuration which is dropping the incoming traffic.

#. Are drops still seen?

   * If there are multiple port queue pairs, it might be the RX thread, RX
     distributor, or event RX adapter not having enough cycles.

   * If drops are seen for the RX adapter or RX distributor, try using
     ``rte_prefetch_non_temporal``, which informs the core that the mbuf in
     the cache is only temporary.
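
A minimal sketch of these checks, assuming a single initialized port
``port_id``; the helper name ``dump_rx_state`` is hypothetical, and
``rte_eth_stats_get`` is used to read the statistics counters mentioned above.

.. code-block:: c

   #include <stdio.h>
   #include <inttypes.h>
   #include <rte_ethdev.h>

   /* Hypothetical helper: print RX-side state for one initialized port. */
   static void
   dump_rx_state(uint16_t port_id)
   {
           struct rte_eth_link link;
           struct rte_eth_stats stats;

           /* Speed, duplex and link status as reported by the PMD. */
           rte_eth_link_get(port_id, &link);
           printf("port %u: speed %u Mbps, %s-duplex, link %s\n",
                  port_id, link.link_speed,
                  link.link_duplex == ETH_LINK_FULL_DUPLEX ? "full" : "half",
                  link.link_status ? "up" : "down");

           /* Promiscuous mode matters when only unknown MACs are dropped. */
           printf("port %u: promiscuous %d\n",
                  port_id, rte_eth_promiscuous_get(port_id));

           /* imissed/rx_nombuf growing faster than ipackets points at the
            * RX descriptor ring or the RX lcore budget, not at the wire. */
           if (rte_eth_stats_get(port_id, &stats) == 0)
                   printf("port %u: ipackets %" PRIu64 " imissed %" PRIu64
                          " ierrors %" PRIu64 " rx_nombuf %" PRIu64 "\n",
                          port_id, stats.ipackets, stats.imissed,
                          stats.ierrors, stats.rx_nombuf);
   }

Running such a dump periodically while traffic is flowing shows whether the
missing packets are dropped by the NIC (``imissed``), rejected as errors
(``ierrors``), or lost because mbufs could not be allocated (``rx_nombuf``).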


Are there packet drops at receive or transmit?
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

RX-TX port and associated cores :numref:`dtg_rx_tx_drop`.

.. _dtg_rx_tx_drop:

.. figure:: img/dtg_rx_tx_drop.*

   RX-TX drops

#. At RX

   * Identify if there are multiple RX queues configured for the port by
     checking ``nb_rx_queues`` using ``rte_eth_dev_info_get``.

   * If ``rte_eth_dev_stats`` shows drops in ``q_errors``, check whether the
     RX thread is configured to fetch packets from that port queue pair.

   * If ``rte_eth_dev_stats`` shows drops in ``rx_nombuf``, check whether the
     RX thread has enough cycles to consume the packets from the queue.

#. At TX

   * If the TX rate is falling behind the application fill rate, identify
     whether there are enough descriptors with ``rte_eth_dev_info_get`` for TX.

   * Check that ``nb_pkts`` in ``rte_eth_tx_burst`` is used to send multiple
     packets at a time.

   * Check whether ``rte_eth_tx_burst`` invokes the vector function call for
     the PMD.

   * If ``oerrors`` is incrementing, TX packet validations are failing. Check
     whether there are queue-specific offload failures.

   * If the drops occur for large packets, check the MTU and multi-segment
     support configured for the NIC.


Are there object drops at the producer point for the ring library?
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Producer point for ring :numref:`dtg_producer_ring`.

.. _dtg_producer_ring:

.. figure:: img/dtg_producer_ring.*

   Producer point for Rings

#. Performance issue isolation at producer

   * Use ``rte_ring_dump`` to validate that the single producer flag
     ``RING_F_SP_ENQ`` is set where expected.

   * There should be a sufficient ``rte_ring_free_count`` at any point in
     time.

   * Extreme stalls in the dequeue stage of the pipeline will cause
     ``rte_ring_full`` to be true.


Are there object drops at the consumer point for the ring library?
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Consumer point for ring :numref:`dtg_consumer_ring`.

.. _dtg_consumer_ring:

.. figure:: img/dtg_consumer_ring.*

   Consumer point for Rings

#. Performance issue isolation at consumer (a minimal ring check is sketched
   after this list)

   * Use ``rte_ring_dump`` to validate that the single consumer flag
     ``RING_F_SC_DEQ`` is set where expected.

   * If the desired burst dequeue falls behind the actual dequeue, the enqueue
     stage is not filling up the ring as required.

   * Extreme stalls in the enqueue stage will lead to ``rte_ring_empty`` being
     true.
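
The producer and consumer checks above can be combined into a small helper,
sketched below under the assumption that the application keeps a handle to the
``rte_ring`` used between two pipeline stages; the helper name
``check_pipeline_ring`` is hypothetical.

.. code-block:: c

   #include <stdio.h>
   #include <rte_ring.h>

   /* Hypothetical helper: report whether a pipeline ring is the bottleneck. */
   static void
   check_pipeline_ring(const struct rte_ring *r)
   {
           /* Dumps the ring name, size, SP/SC flags and current indexes. */
           rte_ring_dump(stdout, r);

           /*
            * A ring that stays full points at the consumer (dequeue) stage;
            * a ring that stays empty points at the producer (enqueue) stage.
            */
           if (rte_ring_full(r))
                   printf("%s: full, consumer stage is stalling\n", r->name);
           else if (rte_ring_empty(r))
                   printf("%s: empty, producer stage is not keeping up\n",
                          r->name);
           else
                   printf("%s: %u free entries\n", r->name,
                          rte_ring_free_count(r));
   }

Calling such a helper periodically, for example from a service core, shows
whether the enqueue or the dequeue side of each ring is the one falling behind.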


Is there a variance in packet or object processing rate in the pipeline?
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Memory objects close to NUMA :numref:`dtg_mempool`.

.. _dtg_mempool:

.. figure:: img/dtg_mempool.*

   Memory objects have to be close to the device per NUMA.

#. Stalls in the processing pipeline can be attributed to MBUF release delays.
   These can be narrowed down to

   * Heavy processing cycles at single or multiple processing stages.

   * The cache is spread thin due to the increased number of stages in the
     pipeline.

   * The CPU thread responsible for TX is not able to keep up with the burst
     of traffic.

   * Extra cycles to linearize multi-segment buffers and software offloads
     like checksum, TSO, and VLAN strip.

   * Packet buffer copies in the fast path also result in stalls in MBUF
     release if not done selectively.

   * Application logic sets ``rte_pktmbuf_refcnt_set`` to a higher value than
     desired and frequently uses ``rte_pktmbuf_prefree_seg``, so MBUFs are not
     released back to the mempool.

#. Lower performance between the pipeline processing stages can be narrowed
   down to

   * The NUMA instance for packets or objects from the NIC, mempool, and ring
     should be the same.

   * Drops on a specific socket are due to insufficient objects in the pool.
     Use ``rte_mempool_get_count`` or ``rte_mempool_avail_count`` to monitor
     when the drops occur.

   * Try prefetching the content in the processing pipeline logic to minimize
     the stalls.

#. Performance issues can be due to special cases

   * Check if the MBUF is contiguous with ``rte_pktmbuf_is_contiguous``, as
     certain offloads require it.

   * Use ``rte_mempool_cache_create`` for user threads that require access to
     mempool objects.

   * If the variance is absent for larger huge pages, then try
     ``rte_mem_lock_page`` on the objects, packets, and lookup tables to
     isolate the issue.


Is there a variance in cryptodev performance?
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Crypto device and PMD :numref:`dtg_crypto`.

.. _dtg_crypto:

.. figure:: img/dtg_crypto.*

   CRYPTO and interaction with PMD device.

#. Performance issue isolation for enqueue

   * Ensure the cryptodev, its resources, and the enqueue stage are running on
     cores of the same NUMA node.

   * Isolate the cause of errors reported in the error counts using
     ``rte_cryptodev_stats`` (a minimal check is sketched after this list).

   * Parallelize the enqueue threads across multiple queue pairs.

#. Performance issue isolation for dequeue

   * Ensure the cryptodev, its resources, and the dequeue stage are running on
     cores of the same NUMA node.

   * Isolate the cause of errors reported in the error counts using
     ``rte_cryptodev_stats``.

   * Parallelize the dequeue threads across multiple queue pairs.

#. Performance issue isolation for crypto operations

   * If cryptodev software-assist is in use, ensure the library is built with
     the right (SIMD) flags, or check whether the queue pair uses the CPU ISA
     via ``feature_flags`` (AVX|SSE|NEON) from ``rte_cryptodev_info_get``.

   * If cryptodev hardware-assist is in use, ensure both firmware and drivers
     are up to date.

#. Configuration issue isolation

   * Identify cryptodev instances with ``rte_cryptodev_count`` and
     ``rte_cryptodev_info_get``.
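
A minimal sketch of the error-count check above, assuming the statistics have
not been reset elsewhere; the helper name ``check_cryptodev_errors`` is
hypothetical.

.. code-block:: c

   #include <stdio.h>
   #include <inttypes.h>
   #include <rte_cryptodev.h>

   /* Hypothetical helper: flag cryptodevs with enqueue/dequeue errors. */
   static void
   check_cryptodev_errors(void)
   {
           struct rte_cryptodev_info info;
           struct rte_cryptodev_stats stats;
           uint8_t dev_id;

           for (dev_id = 0; dev_id < rte_cryptodev_count(); dev_id++) {
                   rte_cryptodev_info_get(dev_id, &info);
                   if (rte_cryptodev_stats_get(dev_id, &stats) != 0)
                           continue;

                   /* Non-zero error counters point at malformed crypto ops
                    * or full queue pairs. */
                   if (stats.enqueue_err_count != 0 ||
                                   stats.dequeue_err_count != 0)
                           printf("cryptodev %u (%s): enq_err %" PRIu64
                                  " deq_err %" PRIu64 "\n",
                                  dev_id, info.driver_name,
                                  stats.enqueue_err_count,
                                  stats.dequeue_err_count);
           }
   }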


Is user function performance not as expected?
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Custom worker function :numref:`dtg_distributor_worker`.

.. _dtg_distributor_worker:

.. figure:: img/dtg_distributor_worker.*

   Custom worker function performance drops.

#. Performance issue isolation

   * Functions running on CPU cores without context switches are the best
     performing scenarios. Identify the lcore with ``rte_lcore_id`` and the
     lcore-to-CPU mapping with ``rte_lcore_index``.

   * Use ``rte_thread_get_affinity`` to isolate functions running on the same
     CPU core.

#. Configuration issue isolation

   * Identify the core role using ``rte_eal_lcore_role`` (RTE, OFF, or
     SERVICE). Check that performance-critical functions are mapped to run on
     the intended cores.

   * For high-performance execution logic, ensure it runs on the correct NUMA
     node and on a non-master core.

   * Analyze the run logic with ``rte_dump_stack``, ``rte_dump_registers``,
     and ``rte_memdump`` for more insights.

   * Make use of objdump to ensure that the opcodes match the desired state.


Are the execution cycles for dynamic service functions not frequent?
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Service functions on service cores :numref:`dtg_service`.

.. _dtg_service:

.. figure:: img/dtg_service.*

   Functions running on service cores

#. Performance issue isolation

   * For services configured for parallel execution,
     ``rte_service_lcore_count`` should be equal to
     ``rte_service_lcore_count_services``.

   * For a service to run in parallel on all cores,
     ``rte_service_probe_capability`` should report
     ``RTE_SERVICE_CAP_MT_SAFE`` and ``rte_service_map_lcore_get`` should
     return a unique lcore.

   * If the execution cycles for dynamic service functions are not frequent
     and services share an lcore, the overall execution should fit the cycle
     budget.

#. Configuration issue isolation

   * Check if the service is running with ``rte_service_runstate_get``.

   * Generic debug via ``rte_service_dump``.


Is there a bottleneck in the performance of eventdev?
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

#. Check the generic configuration

   * Ensure the event devices are created on the right NUMA node, using
     ``rte_event_dev_count`` and ``rte_event_dev_socket_id`` (see the sketch
     after this list).

   * Check the event stages to see if the events are looped back into the same
     queue.

   * If the failure is in the enqueue stage for events, check the queue depth
     with ``rte_event_dev_info_get``.

#. If there are performance drops in the enqueue stage

   * Use ``rte_event_dev_dump`` to dump the eventdev information.

   * Periodically check the stats for queues and ports to identify starvation.

   * Check the in-flight events for the desired queue for enqueue and dequeue.
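
A minimal sketch of the NUMA check above, assuming it is called from an lcore
of the socket that runs the event pipeline; the helper name
``check_eventdevs`` is hypothetical.

.. code-block:: c

   #include <stdio.h>
   #include <rte_eventdev.h>
   #include <rte_lcore.h>

   /* Hypothetical helper: verify eventdev NUMA placement and dump state. */
   static void
   check_eventdevs(void)
   {
           uint8_t dev_id;

           for (dev_id = 0; dev_id < rte_event_dev_count(); dev_id++) {
                   /* An event device on a remote socket adds cross-NUMA
                    * latency to every enqueue and dequeue in the pipeline. */
                   if (rte_event_dev_socket_id(dev_id) != (int)rte_socket_id())
                           printf("eventdev %u is not on socket %u\n",
                                  dev_id, rte_socket_id());

                   /* Per-port and per-queue counters help spot starvation. */
                   rte_event_dev_dump(dev_id, stdout);
           }
   }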


Is there a variance in traffic manager?
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Traffic Manager on TX interface :numref:`dtg_qos_tx`.

.. _dtg_qos_tx:

.. figure:: img/dtg_qos_tx.*

   Traffic Manager just before TX.

#. Identify whether the cause of a variance from the expected behavior is
   insufficient CPU cycles. Use ``rte_tm_capabilities_get`` to fetch the
   features for hierarchies, WRED, and priority schedulers that can be
   offloaded to hardware.

#. Undesired flow drops can be narrowed down to WRED, priority, and rate
   limiters.

#. Isolate the flow in which the undesired drops occur. Use
   ``rte_tm_get_number_of_leaf_nodes`` and the flow table to pin down the leaf
   where the drops occur.

#. Check the stats using ``rte_tm_node_stats_update`` and
   ``rte_tm_node_stats_read`` for drops in the hierarchy, scheduler, and WRED
   configurations.


Is the packet in the unexpected format?
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Packet capture before and after processing :numref:`dtg_pdump`.

.. _dtg_pdump:

.. figure:: img/dtg_pdump.*

   Capture points of Traffic at RX-TX.

#. To isolate possible packet corruption in the processing pipeline, carefully
   staged packet captures have to be implemented.

   * First, isolate at NIC entry and exit.

     Use pdump in the primary process to allow a secondary process to access
     the port queue pair. The packets get copied over in the RX|TX callback by
     the secondary process using ring buffers.

   * Second, isolate at pipeline entry and exit.

     Using hooks or callbacks, capture the packets in the middle of the
     pipeline stages; they can be shared with the secondary debug process via
     user-defined custom rings.

.. note::

   Use a similar analysis for object and metadata corruption.


Does the issue still persist?
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

The issue can be further narrowed down to the following causes.

#. If there is vendor or application specific metadata, check for errors due
   to metadata error flags. Dumping the private metadata in the objects can
   give insight into details for debugging.

#. If multiple processes are used for either data or configuration, check for
   possible errors in the secondary process where the configuration fails, and
   for possible data corruption in the data plane.

#. Random drops in RX or TX when other applications are started indicate the
   effect of a noisy neighbor. Try using cache allocation technology to
   minimize the effect between applications.


How to develop custom code to debug?
------------------------------------

#. For an application that runs as the primary process only, the debug
   functionality is added in the same process. It can be invoked by a timer
   call-back, a service core, or a signal handler; a minimal timer-based
   sketch follows.

#. For an application that runs as multiple processes, add the debug
   functionality in a standalone secondary process.
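
A minimal sketch of the first approach, assuming the ports are already
started; the helper names ``debug_dump`` and ``debug_dump_start`` and the
one-second period are arbitrary. The same body could equally be triggered from
a signal handler or a service core.

.. code-block:: c

   #include <stdio.h>
   #include <inttypes.h>
   #include <rte_common.h>
   #include <rte_alarm.h>
   #include <rte_ethdev.h>

   #define DEBUG_PERIOD_US (1000 * 1000) /* dump once per second */

   /* Hypothetical timer callback: dump basic stats of every probed port,
    * then re-arm itself so the dump repeats periodically. */
   static void
   debug_dump(void *arg __rte_unused)
   {
           struct rte_eth_stats stats;
           uint16_t port_id;

           RTE_ETH_FOREACH_DEV(port_id) {
                   if (rte_eth_stats_get(port_id, &stats) == 0)
                           printf("port %u: ipackets %" PRIu64
                                  " imissed %" PRIu64 " oerrors %" PRIu64 "\n",
                                  port_id, stats.ipackets, stats.imissed,
                                  stats.oerrors);
           }

           rte_eal_alarm_set(DEBUG_PERIOD_US, debug_dump, NULL);
   }

   /* Call once from the primary process after rte_eal_init() and port setup. */
   static void
   debug_dump_start(void)
   {
           rte_eal_alarm_set(DEBUG_PERIOD_US, debug_dump, NULL);
   }

The alarm callback executes in the EAL interrupt thread, so the dump does not
steal cycles from the worker lcores; heavier debug logic is better placed in a
standalone secondary process as noted above.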