..  SPDX-License-Identifier: BSD-3-Clause
    Copyright(c) 2018 Intel Corporation.

Debug & Troubleshoot guide
==========================

DPDK applications can be designed to have simple or complex pipeline processing
stages making use of single or multiple threads. Applications can also use poll
mode hardware devices, which helps to offload CPU cycles. It is common to find
solutions designed with

* single or multiple primary processes

* single primary and single secondary

* single primary and multiple secondaries

In all the above cases, it is tedious to isolate, debug, and understand various
behaviors which occur randomly or periodically. The goal of this guide is to
consolidate a few commonly seen issues for reference, and then to isolate and
identify the root cause through step-by-step debugging at various stages.

.. note::

   It is difficult to cover all possible issues in a single attempt. With
   feedback and suggestions from the community, more cases can be covered.


Application Overview
--------------------

By using the application model below as a reference, we can discuss multiple
causes of issues in this guide. Let us assume the sample makes use of a single
primary process, with various processing stages running on multiple cores. The
application may also make use of Poll Mode Drivers and libraries like service
cores, mempool, mbuf, eventdev, cryptodev, QoS, and ethdev.

The overview of an application modeled using PMD is shown in
:numref:`dtg_sample_app_model`.

.. _dtg_sample_app_model:

.. figure:: img/dtg_sample_app_model.*

   Overview of pipeline stages of an application


Bottleneck Analysis
-------------------

Factors that drive the design decisions include the platform, the scale factor,
and the performance target. These distinct preferences lead to multiple
combinations built using PMDs and DPDK libraries. The compiler, library build
mode, and optimization flags, even when kept constant, also affect the
application behavior.


Is there a mismatch in the packet (received < desired) rate?
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

RX port and associated core :numref:`dtg_rx_rate`.

.. _dtg_rx_rate:

.. figure:: img/dtg_rx_rate.*

   RX packet rate compared against received rate.

#. Is the RX configuration set up correctly?

   * Identify whether the port speed and duplex match the desired values with
     ``rte_eth_link_get``.

   * Check that ``DEV_RX_OFFLOAD_JUMBO_FRAME`` is set with
     ``rte_eth_dev_info_get``.

   * If drops occur only for packets not matching the port's unique MAC
     address, check whether promiscuous mode is enabled with
     ``rte_eth_promiscuous_get``.

#. Are the drops isolated to certain NICs only?

   * Make use of ``rte_eth_stats_get`` to identify the cause of the drops.

   * If there are mbuf drops, check whether ``nb_desc`` for the RX descriptor
     ring is sufficient for the application.

   * If ``rte_eth_stats_get`` shows drops on specific RX queues, ensure the RX
     lcore threads have enough cycles for ``rte_eth_rx_burst`` on the port
     queue pair.

   * If packets are redirected to a specific port queue pair, ensure the RX
     lcore thread handling it gets enough cycles.

   * If the RSS spread is not even and is causing drops, check the RSS
     configuration with ``rte_eth_dev_rss_hash_conf_get``.

   * If PMD stats are not updating, there might be an offload or configuration
     setting which is dropping the incoming traffic.

#. Are drops still seen?

   * If there are multiple port queue pairs, it might be the RX thread, RX
     distributor, or event RX adapter not having enough cycles.

   * If drops are seen for the RX adapter or RX distributor, try using
     ``rte_prefetch_non_temporal``, which informs the core that the mbuf in
     the cache is only needed temporarily.
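
As a quick first check for the points above, the link state and the basic
statistics of a port can be queried programmatically. The following is a
minimal, illustrative sketch; the helper name and the way the port id is
obtained are assumptions made for this example, not part of any DPDK API.

.. code-block:: c

   #include <stdio.h>
   #include <inttypes.h>
   #include <rte_ethdev.h>

   /* Hypothetical helper: print link state and basic drop counters for the
    * port under investigation. */
   static void
   dump_port_health(uint16_t port_id)
   {
       struct rte_eth_link link;
       struct rte_eth_stats stats;

       /* Speed and duplex should match the expected values. */
       rte_eth_link_get(port_id, &link);
       printf("port %u: link %s, speed %u Mbps, %s-duplex\n",
              port_id, link.link_status ? "up" : "down",
              link.link_speed,
              link.link_duplex == ETH_LINK_FULL_DUPLEX ? "full" : "half");

       /* imissed, ierrors and rx_nombuf point at different drop causes. */
       if (rte_eth_stats_get(port_id, &stats) == 0)
           printf("port %u: ipackets=%" PRIu64 " imissed=%" PRIu64
                  " ierrors=%" PRIu64 " rx_nombuf=%" PRIu64 "\n",
                  port_id, stats.ipackets, stats.imissed,
                  stats.ierrors, stats.rx_nombuf);
   }

Calling such a helper periodically, for example from a timer callback, makes it
easy to spot whether drops show up in ``imissed`` (packets dropped by the
hardware because the RX queues were full) or in ``rx_nombuf`` (mbuf allocation
failures).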


Are there packet drops at receive or transmit?
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

RX-TX port and associated cores :numref:`dtg_rx_tx_drop`.

.. _dtg_rx_tx_drop:

.. figure:: img/dtg_rx_tx_drop.*

   RX-TX drops

#. At RX

   * Identify whether multiple RX queues are configured for the port by
     checking ``nb_rx_queues`` using ``rte_eth_dev_info_get``.

   * If ``rte_eth_stats_get`` shows drops in ``q_errors``, check whether the
     RX thread is configured to fetch packets from that port queue pair.

   * If ``rte_eth_stats_get`` shows drops in ``rx_nombuf``, check whether the
     RX thread has enough cycles to consume the packets from the queue.

#. At TX

   * If the TX rate is falling behind the application fill rate, identify
     whether there are enough descriptors with ``rte_eth_dev_info_get`` for TX.

   * Check that ``nb_pkts`` in ``rte_eth_tx_burst`` is used to send multiple
     packets at a time.

   * Check whether ``rte_eth_tx_burst`` invokes the vector function call for
     the PMD.

   * If ``oerrors`` keeps incrementing, TX packet validations are failing.
     Check whether there are queue-specific offload failures.

   * If the drops occur for large size packets, check the MTU and
     multi-segment support configured for the NIC.


Are there object drops at the producer point for the ring library?
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Producer point for ring :numref:`dtg_producer_ring`.

.. _dtg_producer_ring:

.. figure:: img/dtg_producer_ring.*

   Producer point for Rings

#. Performance issue isolation at the producer

   * Use ``rte_ring_dump`` to validate that the single producer flag
     ``RING_F_SP_ENQ`` is set where intended.

   * There should be a sufficient ``rte_ring_free_count`` at any point in
     time.

   * Extreme stalls in the dequeue stage of the pipeline will cause
     ``rte_ring_full`` to be true.


Are there object drops at the consumer point for the ring library?
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Consumer point for ring :numref:`dtg_consumer_ring`.

.. _dtg_consumer_ring:

.. figure:: img/dtg_consumer_ring.*

   Consumer point for Rings

#. Performance issue isolation at the consumer

   * Use ``rte_ring_dump`` to validate that the single consumer flag
     ``RING_F_SC_DEQ`` is set where intended.

   * If the desired burst dequeue falls behind the actual dequeue, the enqueue
     stage is not filling up the ring as required.

   * Extreme stalls in the enqueue stage will lead to ``rte_ring_empty`` being
     true.
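
When a ring is suspected as the drop point, its occupancy can be sampled at
runtime from a monitoring or debug thread. The sketch below is a minimal
illustration; the ring pointer is assumed to come from the application's own
``rte_ring_create`` call, and the helper name is made up for this example.

.. code-block:: c

   #include <stdio.h>
   #include <rte_ring.h>

   /* Illustrative pressure check on an application ring. */
   static void
   check_ring_pressure(struct rte_ring *stage_ring)
   {
       unsigned int used = rte_ring_count(stage_ring);
       unsigned int avail = rte_ring_free_count(stage_ring);

       /* A ring that stays full points at a stalled consumer stage,
        * a ring that stays empty points at a starved producer stage. */
       if (rte_ring_full(stage_ring))
           printf("%s: full (%u used) - consumer stage is stalling\n",
                  stage_ring->name, used);
       else if (rte_ring_empty(stage_ring))
           printf("%s: empty (%u free) - producer stage is starved\n",
                  stage_ring->name, avail);

       /* Flags such as RING_F_SP_ENQ/RING_F_SC_DEQ can be cross-checked
        * with the full dump. */
       rte_ring_dump(stdout, stage_ring);
   }

Sampling these counters over time shows whether the enqueue or the dequeue side
of the pipeline is falling behind.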


Is there a variance in packet or object processing rate in the pipeline?
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Memory objects close to NUMA :numref:`dtg_mempool`.

.. _dtg_mempool:

.. figure:: img/dtg_mempool.*

   Memory objects have to be close to the device per NUMA.

#. A stall in the processing pipeline can be attributed to mbuf release
   delays. These can be narrowed down to

   * Heavy processing cycles at single or multiple processing stages.

   * Cache usage spread out due to the increased number of stages in the
     pipeline.

   * The CPU thread responsible for TX not being able to keep up with the
     burst of traffic.

   * Extra cycles to linearize multi-segment buffers and software offloads
     like checksum, TSO, and VLAN strip.

   * Packet buffer copies in the fast path, which also result in stalls in
     mbuf release if not done selectively.

   * Application logic setting ``rte_pktmbuf_refcnt_set`` higher than the
     desired value and frequently using ``rte_pktmbuf_prefree_seg`` without
     releasing the mbuf back to the mempool.

#. Lower performance between the pipeline processing stages can be due to

   * The NUMA instance for packets or objects from the NIC, mempool, and ring
     not being the same.

   * Drops on a specific socket caused by insufficient objects in the pool.
     Use ``rte_mempool_get_count`` or ``rte_mempool_avail_count`` to monitor
     when the drops occur.

   * Missing prefetches: try prefetching the content in the processing
     pipeline logic to minimize the stalls.

#. Performance issues can be due to special cases

   * Check whether the mbuf is contiguous with ``rte_pktmbuf_is_contiguous``,
     as certain offloads require this.

   * Use ``rte_mempool_cache_create`` for user threads that require access to
     mempool objects.

   * If the variance is absent for larger huge pages, then try
     ``rte_mem_lock_page`` on the objects, packets, and lookup tables to
     isolate the issue.


Is there a variance in cryptodev performance?
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Crypto device and PMD :numref:`dtg_crypto`.

.. _dtg_crypto:

.. figure:: img/dtg_crypto.*

   CRYPTO and interaction with PMD device.

#. Performance issue isolation for enqueue

   * Ensure the cryptodev, its resources, and the enqueue thread are running
     on the same NUMA node.

   * Isolate the cause of errors reported in ``err_count`` using
     ``rte_cryptodev_stats_get``.

   * Parallelize the enqueue thread across multiple queue pairs.

#. Performance issue isolation for dequeue

   * Ensure the cryptodev, its resources, and the dequeue thread are running
     on the same NUMA node.

   * Isolate the cause of errors reported in ``err_count`` using
     ``rte_cryptodev_stats_get``.

   * Parallelize the dequeue thread across multiple queue pairs.

#. Performance issue isolation for the crypto operation

   * If the cryptodev software-assist is in use, ensure the library is built
     with the right (SIMD) flags, or check whether the queue pair uses the CPU
     ISA feature flags AVX|SSE|NEON using ``rte_cryptodev_info_get``.

   * If the cryptodev hardware-assist is in use, ensure both firmware and
     drivers are up to date.

#. Configuration issue isolation

   * Identify cryptodev instances with ``rte_cryptodev_count`` and
     ``rte_cryptodev_info_get``.
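
To separate enqueue-side from dequeue-side problems, the per-device statistics
can be scanned at runtime. The following is a minimal sketch assuming only the
standard cryptodev statistics API; the helper name is made up for illustration.

.. code-block:: c

   #include <stdio.h>
   #include <inttypes.h>
   #include <rte_cryptodev.h>

   /* Illustrative scan of cryptodev statistics to separate enqueue-side
    * errors from dequeue-side errors. */
   static void
   check_cryptodev_errors(void)
   {
       uint8_t nb_devs = rte_cryptodev_count();
       uint8_t dev_id;

       for (dev_id = 0; dev_id < nb_devs; dev_id++) {
           struct rte_cryptodev_stats stats;

           if (rte_cryptodev_stats_get(dev_id, &stats) != 0)
               continue;

           /* Non-zero error counters tell which side of the queue pair
            * needs further isolation. */
           printf("cryptodev %u: enq=%" PRIu64 " enq_err=%" PRIu64
                  " deq=%" PRIu64 " deq_err=%" PRIu64 "\n",
                  dev_id, stats.enqueued_count, stats.enqueue_err_count,
                  stats.dequeued_count, stats.dequeue_err_count);
       }
   }

A growing ``enqueue_err_count`` points at the submission path, while a growing
``dequeue_err_count`` points at the completion path.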


Is the user function performance not as expected?
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Custom worker function :numref:`dtg_distributor_worker`.

.. _dtg_distributor_worker:

.. figure:: img/dtg_distributor_worker.*

   Custom worker function performance drops.

#. Performance issue isolation

   * Functions running on CPU cores without context switches are the best
     performing scenario. Identify the lcore with ``rte_lcore_id`` and the
     lcore to CPU mapping with ``rte_lcore_index``.

   * Use ``rte_thread_get_affinity`` to isolate functions running on the same
     CPU core.

#. Configuration issue isolation

   * Identify the core role using ``rte_eal_lcore_role`` to identify RTE,
     OFF, SERVICE and NON_EAL cores. Check that the performance-critical
     functions are mapped to run on the intended cores.

   * For high-performance execution logic, ensure it runs on the correct NUMA
     node and worker core.

   * Analyze the run logic with ``rte_dump_stack`` and ``rte_memdump`` for
     more insights.

   * Make use of objdump to ensure the generated opcodes match the desired
     state.


Are the execution cycles for dynamic service functions not frequent enough?
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Service functions on service cores :numref:`dtg_service`.

.. _dtg_service:

.. figure:: img/dtg_service.*

   Functions running on service cores

#. Performance issue isolation

   * For services configured for parallel execution,
     ``rte_service_lcore_count`` should be equal to
     ``rte_service_lcore_count_services``.

   * A service meant to run in parallel on all cores should return
     ``RTE_SERVICE_CAP_MT_SAFE`` for ``rte_service_probe_capability``, and
     ``rte_service_map_lcore_get`` should return a unique lcore mapping.

   * If the execution cycles for a service are not frequent enough, check
     how many services are mapped to the same lcore.

   * If services share an lcore, the overall execution should fit within that
     lcore's cycle budget.

#. Configuration issue isolation

   * Check whether the service is running with ``rte_service_runstate_get``.

   * Use ``rte_service_dump`` for generic debug information.
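
To confirm that a service is registered, mapped, and actually running, the
service API can be queried at runtime. The sketch below is a minimal,
illustrative walk over all registered services; the helper name is an
assumption and it is not tied to any particular service implementation.

.. code-block:: c

   #include <stdio.h>
   #include <rte_service.h>

   /* Illustrative walk over registered services to verify run state,
    * MT-safety, and the number of service lcores. */
   static void
   check_service_setup(void)
   {
       uint32_t nb_services = rte_service_get_count();
       uint32_t id;

       printf("service lcores: %d\n", rte_service_lcore_count());

       for (id = 0; id < nb_services; id++) {
           const char *name = rte_service_get_name(id);
           int running = rte_service_runstate_get(id);
           int mt_safe = rte_service_probe_capability(id,
                               RTE_SERVICE_CAP_MT_SAFE);

           printf("service %u (%s): runstate=%d mt_safe=%d\n",
                  id, name ? name : "unknown", running, mt_safe);

           /* Per-service details, including statistics when statistics
            * collection is enabled. */
           rte_service_dump(stdout, id);
       }
   }

If a service reports a run state of zero even though it was mapped, revisit the
lcore mapping and the service start calls.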


Is there a bottleneck in the performance of eventdev?
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

#. Check the generic configuration

   * Ensure the event devices are created on the right NUMA node using
     ``rte_event_dev_count`` and ``rte_event_dev_socket_id``.

   * Check the event stages to see whether events are looped back into the
     same queue.

   * If the failure is at the enqueue stage for events, check the queue depth
     with ``rte_event_dev_info_get``.

#. If there are performance drops in the enqueue stage

   * Use ``rte_event_dev_dump`` to dump the eventdev information.

   * Periodically check the stats for queues and ports to identify starvation.

   * Check the in-flight events for the desired queue for enqueue and dequeue.


Is there a variance in the traffic manager?
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Traffic Manager on the TX interface :numref:`dtg_qos_tx`.

.. _dtg_qos_tx:

.. figure:: img/dtg_qos_tx.*

   Traffic Manager just before TX.

#. Identify whether the cause of a variance from the expected behavior is
   insufficient CPU cycles. Use ``rte_tm_capabilities_get`` to fetch the
   features for hierarchies, WRED, and priority schedulers that can be
   offloaded to hardware.

#. Undesired flow drops can be narrowed down to WRED, priority, and rate
   limiters.

#. Isolate the flow in which the undesired drops occur. Use
   ``rte_tm_get_number_of_leaf_nodes`` and the flow table to pin down the
   leaf node where the drops occur.

#. Check the stats using ``rte_tm_stats_update`` and
   ``rte_tm_node_stats_read`` for drops in the hierarchy, scheduler, and WRED
   configurations.


Is the packet in an unexpected format?
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Packet capture before and after processing :numref:`dtg_pdump`.

.. _dtg_pdump:

.. figure:: img/dtg_pdump.*

   Capture points of traffic at RX-TX.

#. To isolate possible packet corruption in the processing pipeline,
   carefully staged packet captures have to be implemented.

   * First, isolate at the NIC entry and exit.

     Use pdump in the primary process to allow a secondary process to access
     the port queue pair. The packets are copied over in the RX|TX callbacks
     and shared with the secondary process using ring buffers.

   * Second, isolate at the pipeline entry and exit.

     Using hooks or callbacks, capture the packets in the middle of the
     pipeline stage and copy them, so that they can be shared with the
     secondary debug process via user-defined custom rings.

.. note::

   Use a similar analysis for object and metadata corruption.


Does the issue still persist?
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

The issue can be further narrowed down to the following causes.

#. If there is vendor or application specific metadata, check for errors due
   to metadata error flags. Dumping the private metadata in the objects can
   give insight into details for debugging.

#. If multiple processes are used for either data or configuration, check for
   possible errors in the secondary process where the configuration fails and
   for possible data corruption in the data plane.

#. Random drops in RX or TX when other applications are opened is an
   indication of a noisy neighbor. Try using cache allocation techniques to
   minimize the effect between applications.


How to develop custom code to debug?
------------------------------------

#. For an application that runs as the primary process only, debug
   functionality is added in the same process. It can be invoked by a timer
   callback, a service core, or a signal handler, as sketched below.

#. For an application that runs as multiple processes, place the debug
   functionality in a standalone secondary process.
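
As an illustration of the first point, a primary process can install a signal
handler that only sets a flag and let its poll loop emit the debug dump. This
is a minimal sketch under assumptions: the use of ``SIGUSR1``, the helper
names, and the single monitored port are all choices made for this example.

.. code-block:: c

   #include <signal.h>
   #include <stdio.h>
   #include <inttypes.h>
   #include <rte_ethdev.h>
   #include <rte_debug.h>

   /* The handler only sets a flag; the heavy work is done from the poll
    * loop, which keeps the handler async-signal-safe. */
   static volatile sig_atomic_t dump_requested;

   static void
   debug_signal_handler(int signum)
   {
       (void)signum;
       dump_requested = 1;
   }

   /* Call this from the RX/TX poll loop of the primary process. */
   static void
   poll_loop_debug_hook(uint16_t port_id)
   {
       struct rte_eth_stats stats;

       if (!dump_requested)
           return;
       dump_requested = 0;

       if (rte_eth_stats_get(port_id, &stats) == 0)
           printf("port %u: ipackets=%" PRIu64 " opackets=%" PRIu64
                  " imissed=%" PRIu64 " oerrors=%" PRIu64 "\n",
                  port_id, stats.ipackets, stats.opackets,
                  stats.imissed, stats.oerrors);

       /* Optionally record where this lcore currently is. */
       rte_dump_stack();
   }

   /* In main(), after rte_eal_init():
    *     signal(SIGUSR1, debug_signal_handler);
    */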