..  SPDX-License-Identifier: BSD-3-Clause
    Copyright(c) 2018 Intel Corporation.

Debug & Troubleshoot guide
==========================

DPDK applications can be designed to have simple or complex pipeline processing
stages making use of single or multiple threads. Applications can also use poll
mode drivers for hardware devices, which helps in offloading CPU cycles. It is
common to find solutions designed with

* single or multiple primary processes

* single primary and single secondary

* single primary and multiple secondaries

In all the above cases, it is tedious to isolate, debug, and understand various
behaviors which occur randomly or periodically. The goal of this guide is to
consolidate a few commonly seen issues for reference, then to isolate and
identify the root cause through step by step debug at various stages.

.. note::

   It is difficult to cover all possible issues in a single attempt. With
   feedback and suggestions from the community, more cases can be covered.


Application Overview
--------------------

By making use of the application model as a reference, we can discuss multiple
causes of issues in the guide. Let us assume the sample makes use of a single
primary process, with various processing stages running on multiple cores. The
application may also make use of Poll Mode Drivers and libraries like service
cores, mempool, mbuf, eventdev, cryptodev, QoS, and ethdev.

The overview of an application modeled using PMD is shown in
:numref:`dtg_sample_app_model`.

.. _dtg_sample_app_model:

.. figure:: img/dtg_sample_app_model.*

   Overview of pipeline stages of an application


Bottleneck Analysis
-------------------

A couple of factors that lead to design decisions could be the platform, scale
factor, and target. These distinct preferences lead to multiple combinations
built using PMDs and libraries of DPDK. While the compiler, library mode, and
optimization flags are held constant, they affect the application too.


Is there a mismatch in the packet (received < desired) rate?
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

RX port and associated core :numref:`dtg_rx_rate`.

.. _dtg_rx_rate:

.. figure:: img/dtg_rx_rate.*

   RX packet rate compared against received rate.

#. Is the configuration for RX set up correctly?

   * Identify if the port speed and duplex match the desired values with
     ``rte_eth_link_get`` (see the sketch after this list).

   * Check if ``DEV_RX_OFFLOAD_JUMBO_FRAME`` is set with
     ``rte_eth_dev_info_get``.

   * Check promiscuous mode with ``rte_eth_promiscuous_get`` if the drops do
     not occur for a unique MAC address.

#. Is the drop isolated to certain NICs only?

   * Make use of ``rte_eth_dev_stats`` to identify the cause of the drops.

   * If there are mbuf drops, check whether ``nb_desc`` for the RX
     descriptors is sufficient for the application.

   * If ``rte_eth_dev_stats`` shows drops on specific RX queues, ensure the
     RX lcore threads have enough cycles for ``rte_eth_rx_burst`` on the port
     queue pair.

   * If packets are redirected to a specific port queue pair, ensure the RX
     lcore threads get enough cycles.

   * Check the RSS configuration with ``rte_eth_dev_rss_hash_conf_get`` if
     the spread is uneven and causing drops.

   * If the PMD stats are not updating, there might be an offload or
     configuration which is dropping the incoming traffic.

#. Are drops still seen?

   * If there are multiple port queue pairs, it might be the RX thread, RX
     distributor, or event RX adapter not having enough cycles.

   * If drops are seen for the RX adapter or RX distributor, try using
     ``rte_prefetch_non_temporal``, which informs the core that the mbuf in
     the cache is temporary.

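As a quick first check, the link parameters and basic counters can be read in
a few lines. The following is a minimal sketch, assuming ``port_id`` refers to
a configured and started ethdev port; the helper name is illustrative.

.. code-block:: c

   #include <stdio.h>
   #include <inttypes.h>
   #include <rte_ethdev.h>

   /* Report negotiated link parameters and basic RX drop counters. */
   static void
   check_rx_health(uint16_t port_id)
   {
           struct rte_eth_link link;
           struct rte_eth_stats stats;

           /* Verify negotiated speed and duplex match the desired values. */
           rte_eth_link_get(port_id, &link);
           printf("port %u: %u Mbps, %s duplex, link %s\n", port_id,
                  link.link_speed,
                  link.link_duplex == ETH_LINK_FULL_DUPLEX ? "full" : "half",
                  link.link_status ? "up" : "down");

           /* imissed hints at RX descriptor exhaustion, rx_nombuf at mbuf
            * pool exhaustion, ierrors at offload or hardware drops. */
           if (rte_eth_stats_get(port_id, &stats) == 0)
                   printf("imissed %" PRIu64 ", ierrors %" PRIu64
                          ", rx_nombuf %" PRIu64 "\n", stats.imissed,
                          stats.ierrors, stats.rx_nombuf);
   }
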
Are there packet drops at receive or transmit?
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

RX-TX port and associated cores :numref:`dtg_rx_tx_drop`.

.. _dtg_rx_tx_drop:

.. figure:: img/dtg_rx_tx_drop.*

   RX-TX drops

#. At RX

   * Identify if there are multiple RX queues configured for the port by
     checking ``nb_rx_queues`` using ``rte_eth_dev_info_get``.

   * If ``rte_eth_dev_stats`` shows drops in ``q_errors``, check whether the
     RX thread is configured to fetch packets from the port queue pair.

   * If ``rte_eth_dev_stats`` shows drops in ``rx_nombuf``, check whether the
     RX thread has enough cycles to consume the packets from the queue.

#. At TX

   * If the TX rate is falling behind the application fill rate, identify if
     there are enough descriptors with ``rte_eth_dev_info_get`` for TX.

   * Check that ``nb_pkt`` in ``rte_eth_tx_burst`` is done for multiple
     packets (see the sketch after this list).

   * Check that ``rte_eth_tx_burst`` invokes the vector function call for
     the PMD.

   * If ``oerrors`` are getting incremented, TX packet validations are
     failing. Check whether there are queue specific offload failures.

   * If the drops occur for large size packets, check the MTU and
     multi-segment support configured for the NIC.

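For the TX checks, the pattern below illustrates accounting for a partial
burst. It is a minimal sketch; the helper name is illustrative, and freeing
unsent packets is one policy, retrying them is another.

.. code-block:: c

   #include <rte_ethdev.h>
   #include <rte_mbuf.h>

   /* Send a burst and account for packets the PMD did not accept. */
   static inline uint16_t
   send_burst(uint16_t port_id, uint16_t queue_id,
              struct rte_mbuf **pkts, uint16_t nb_pkts)
   {
           uint16_t sent = rte_eth_tx_burst(port_id, queue_id, pkts, nb_pkts);
           uint16_t dropped = nb_pkts - sent;

           /* sent < nb_pkts means the TX ring is short of free descriptors:
            * either increase nb_tx_desc or give the TX core more cycles.
            * Here the unsent packets are dropped. */
           while (sent < nb_pkts)
                   rte_pktmbuf_free(pkts[sent++]);
           return dropped;
   }
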
Are there object drops at the producer point for the ring library?
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Producer point for ring :numref:`dtg_producer_ring`.

.. _dtg_producer_ring:

.. figure:: img/dtg_producer_ring.*

   Producer point for Rings

#. Performance issue isolation at the producer

   * Use ``rte_ring_dump`` to validate that the single producer flag
     ``RING_F_SP_ENQ`` is set (see the sketch after this list).

   * There should be a sufficient ``rte_ring_free_count`` at any point in
     time.

   * Extreme stalls in the dequeue stage of the pipeline will cause
     ``rte_ring_full`` to be true.

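A producer-side health check can be as small as the sketch below; it assumes
``r`` is a ring created by the application, and the helper name is
illustrative.

.. code-block:: c

   #include <stdio.h>
   #include <rte_ring.h>

   /* Dump ring configuration and watch the free count at the producer. */
   static void
   check_ring_producer(struct rte_ring *r)
   {
           /* Shows flags (RING_F_SP_ENQ), size, and head/tail indexes. */
           rte_ring_dump(stdout, r);

           /* A free count trending to zero means the dequeue stage of the
            * pipeline is stalling. */
           printf("free entries: %u, full: %d\n",
                  rte_ring_free_count(r), rte_ring_full(r));
   }
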
Are there object drops at the consumer point for the ring library?
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Consumer point for ring :numref:`dtg_consumer_ring`.

.. _dtg_consumer_ring:

.. figure:: img/dtg_consumer_ring.*

   Consumer point for Rings

#. Performance issue isolation at the consumer

   * Use ``rte_ring_dump`` to validate that the single consumer flag
     ``RING_F_SC_DEQ`` is set.

   * If the actual dequeue falls behind the desired burst dequeue, the
     enqueue stage is not filling up the ring as required (see the sketch
     after this list).

   * An extreme stall in the enqueue stage will lead to ``rte_ring_empty``
     being true.

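On the consumer side, comparing the requested burst with the delivered burst
isolates a starved ring. A minimal sketch, with ``burst`` as the desired
dequeue size:

.. code-block:: c

   #include <rte_ring.h>

   /* Dequeue up to 'burst' objects; a return value consistently below
    * 'burst' points at the enqueue stage not filling the ring. */
   static unsigned int
   drain_ring(struct rte_ring *r, void **objs, unsigned int burst)
   {
           if (rte_ring_empty(r))
                   return 0; /* extreme stall at the enqueue stage */

           return rte_ring_dequeue_burst(r, objs, burst, NULL);
   }
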
Is there a variance in packet or object processing rate in the pipeline?
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Memory objects close to NUMA :numref:`dtg_mempool`.

.. _dtg_mempool:

.. figure:: img/dtg_mempool.*

   Memory objects have to be close to the device per NUMA.

#. Stalls in the processing pipeline can be attributed to MBUF release
   delays. These can be narrowed down to

   * Heavy processing cycles at single or multiple processing stages.

   * The cache is spread due to the increased stages in the pipeline.

   * The CPU thread responsible for TX is not able to keep up with the burst
     of traffic.

   * Extra cycles to linearize multi-segment buffers and software offloads
     like checksum, TSO, and VLAN strip.

   * Packet buffer copies in the fast path also result in stalls in MBUF
     release if not done selectively.

   * Application logic sets ``rte_pktmbuf_refcnt_set`` to a higher than
     desired value, frequently uses ``rte_pktmbuf_prefree_seg``, and does
     not release the MBUF back to the mempool.

#. Lower performance between the pipeline processing stages can be due to

   * The NUMA instance for packets or objects from the NIC, mempool, and
     ring should be the same.

   * Drops on a specific socket are due to insufficient objects in the pool.
     Use ``rte_mempool_get_count`` or ``rte_mempool_avail_count`` to monitor
     when the drops occur (see the sketch after this list).

   * Try prefetching the content in the processing pipeline logic to
     minimize the stalls.

#. Performance issues can be due to special cases

   * Check if the MBUF is contiguous with ``rte_pktmbuf_is_contiguous`` as
     certain offloads require the same.

   * Use ``rte_mempool_cache_create`` for user threads that require access
     to mempool objects.

   * If the variance is absent for larger huge pages, then try
     ``rte_mem_lock_page`` on the objects, packets, and lookup tables to
     isolate the issue.

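Pool exhaustion and NUMA mismatch can be monitored together. A minimal
sketch, assuming ``mp`` was created with ``rte_pktmbuf_pool_create`` and the
helper is called from the core that uses the pool:

.. code-block:: c

   #include <stdio.h>
   #include <rte_mempool.h>
   #include <rte_lcore.h>

   /* Watch for pool exhaustion and NUMA mismatch with the calling core. */
   static void
   check_pool(const struct rte_mempool *mp)
   {
           printf("%s: %u available, %u in use, pool socket %d, "
                  "core socket %u\n", mp->name,
                  rte_mempool_avail_count(mp), rte_mempool_in_use_count(mp),
                  mp->socket_id, rte_socket_id());
   }
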
Is there a variance in cryptodev performance?
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Crypto device and PMD :numref:`dtg_crypto`.

.. _dtg_crypto:

.. figure:: img/dtg_crypto.*

   CRYPTO and interaction with PMD device.

#. Performance issue isolation for enqueue

   * Ensure the cryptodev, resources and enqueue are running on the same
     NUMA node.

   * Isolate the cause of errors for ``err_count`` using
     ``rte_cryptodev_stats`` (see the sketch after this list).

   * Parallelize the enqueue thread across multiple queue pairs.

#. Performance issue isolation for dequeue

   * Ensure the cryptodev, resources and dequeue are running on the same
     NUMA node.

   * Isolate the cause of errors for ``err_count`` using
     ``rte_cryptodev_stats``.

   * Parallelize the dequeue thread across multiple queue pairs.

#. Performance issue isolation for crypto operations

   * If cryptodev software-assist is in use, ensure the library is built
     with the right (SIMD) flags, or check if the queue pair is using the
     CPU ISA for ``feature_flags`` AVX|SSE|NEON using
     ``rte_cryptodev_info_get``.

   * If cryptodev hardware-assist is in use, ensure both firmware and
     drivers are up to date.

#. Configuration issue isolation

   * Identify cryptodev instances with ``rte_cryptodev_count`` and
     ``rte_cryptodev_info_get``.

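The ``err_count`` checks for both enqueue and dequeue isolation map to the
cryptodev stats API. A minimal sketch for device ``dev_id``:

.. code-block:: c

   #include <stdio.h>
   #include <inttypes.h>
   #include <rte_cryptodev.h>

   /* Report enqueue/dequeue counts and their error counters. */
   static void
   check_crypto_errors(uint8_t dev_id)
   {
           struct rte_cryptodev_stats stats;

           if (rte_cryptodev_stats_get(dev_id, &stats) == 0)
                   printf("enq %" PRIu64 " (err %" PRIu64 "), "
                          "deq %" PRIu64 " (err %" PRIu64 ")\n",
                          stats.enqueued_count, stats.enqueue_err_count,
                          stats.dequeued_count, stats.dequeue_err_count);
   }
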
Is the user function performance not as expected?
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Custom worker function :numref:`dtg_distributor_worker`.

.. _dtg_distributor_worker:

.. figure:: img/dtg_distributor_worker.*

   Custom worker function performance drops.

#. Performance issue isolation

   * Functions perform best when running on CPU cores without context
     switches. Identify the lcore with ``rte_lcore_id`` and the lcore index
     mapping with the CPU using ``rte_lcore_index``.

   * Use ``rte_thread_get_affinity`` to isolate functions running on the
     same CPU core (see the sketch after this list).

#. Configuration issue isolation

   * Identify the core role using ``rte_eal_lcore_role`` to identify RTE,
     OFF and SERVICE. Check that performance functions are mapped to run on
     the correct cores.

   * For high-performance execution logic, ensure running it on the correct
     NUMA node and a non-master core.

   * Analyze the run logic with ``rte_dump_stack``, ``rte_dump_registers``
     and ``rte_memdump`` for more insights.

   * Make use of objdump to ensure the opcode is matching the desired state.

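A worker can report its own placement with the lcore API. A minimal sketch,
assuming a Linux build where ``rte_cpuset_t`` maps to ``cpu_set_t`` so that
``CPU_COUNT`` is available:

.. code-block:: c

   #include <stdio.h>
   #include <rte_lcore.h>

   /* Called from inside a worker: report lcore, socket and affinity. */
   static void
   check_worker_placement(void)
   {
           rte_cpuset_t cpuset;
           unsigned int lcore = rte_lcore_id();

           printf("lcore %u (index %d) on socket %u\n", lcore,
                  rte_lcore_index(lcore), rte_lcore_to_socket_id(lcore));

           /* More than one CPU in the set allows context switches between
            * functions sharing the same core. */
           rte_thread_get_affinity(&cpuset);
           printf("CPUs in affinity set: %d\n", CPU_COUNT(&cpuset));
   }
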
Are the execution cycles for dynamic service functions not frequent?
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Service functions on service cores :numref:`dtg_service`.

.. _dtg_service:

.. figure:: img/dtg_service.*

   Functions running on service cores

#. Performance issue isolation

   * For services configured for parallel execution,
     ``rte_service_lcore_count`` should be equal to
     ``rte_service_lcore_count_services``.

   * A service to run in parallel on all cores should return
     ``RTE_SERVICE_CAP_MT_SAFE`` for ``rte_service_probe_capability``, and
     ``rte_service_map_lcore_get`` should return unique lcores.

   * If the execution cycles for a service are not frequent enough, check
     how many services are mapped to the same lcore.

   * If services share an lcore, the overall execution should fit the cycle
     budget (see the sketch after this list).

#. Configuration issue isolation

   * Check if the service is running with ``rte_service_runstate_get``.

   * Generic debug via ``rte_service_dump``.

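The run state and lcore mapping checks above can be scripted as below. A
minimal sketch, with ``id`` as a valid service id and an illustrative helper
name:

.. code-block:: c

   #include <stdio.h>
   #include <rte_common.h>
   #include <rte_lcore.h>
   #include <rte_service.h>

   /* Report service lcores, their load, and one service's run state. */
   static void
   check_service(uint32_t id)
   {
           uint32_t lcores[RTE_MAX_LCORE];
           int32_t n = rte_service_lcore_list(lcores, RTE_DIM(lcores));
           int32_t i;

           for (i = 0; i < n; i++)
                   printf("service lcore %u runs %d service(s)\n", lcores[i],
                          rte_service_lcore_count_services(lcores[i]));

           printf("service %u runstate: %d\n", id,
                  rte_service_runstate_get(id));
           rte_service_dump(stdout, id);
   }
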
Is there a bottleneck in the performance of eventdev?
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

#. Check the generic configuration

   * Ensure the event devices created are on the right NUMA node using
     ``rte_event_dev_count`` and ``rte_event_dev_socket_id``.

   * For the event stages, check if the events are looped back into the
     same queue.

   * If the failure is on the enqueue stage for events, check the queue
     depth with ``rte_event_dev_info_get``.

#. If there are performance drops in the enqueue stage

   * Use ``rte_event_dev_dump`` to dump the eventdev information (see the
     sketch after this list).

   * Periodically check the stats for queues and ports to identify
     starvation.

   * Check the in-flight events for the desired queue for enqueue and
     dequeue.

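For eventdev, placement and a full state dump come from two calls. A minimal
sketch for device ``dev_id``:

.. code-block:: c

   #include <stdio.h>
   #include <rte_eventdev.h>

   /* Report NUMA placement and dump ports/queues of an event device. */
   static void
   check_eventdev(uint8_t dev_id)
   {
           printf("%d event device(s); dev %u on socket %d\n",
                  rte_event_dev_count(), dev_id,
                  rte_event_dev_socket_id(dev_id));

           /* Includes per-port and per-queue state useful for spotting
            * starvation and in-flight credit exhaustion. */
           rte_event_dev_dump(dev_id, stdout);
   }
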
Is there a variance in the traffic manager?
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Traffic Manager on the TX interface :numref:`dtg_qos_tx`.

.. _dtg_qos_tx:

.. figure:: img/dtg_qos_tx.*

   Traffic Manager just before TX.

#. Identify whether the cause for a variance from expected behavior is
   insufficient CPU cycles. Use ``rte_tm_capabilities_get`` to fetch features
   for hierarchies, WRED and priority schedulers to be offloaded to hardware.

#. Undesired flow drops can be narrowed down to WRED, priority, and rate
   limiters.

#. Isolate the flow in which the undesired drops occur. Use
   ``rte_tm_get_number_of_leaf_nodes`` and the flow table to pin down the
   leaf where the drops occur.

#. Check the stats using ``rte_tm_node_stats_update`` and
   ``rte_tm_node_stats_read`` for drops for the hierarchy, schedulers and
   WRED configurations, as in the sketch below.

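Reading the per-node stats for a suspect leaf might look like the sketch
below; ``node_id`` is assumed to be a leaf discovered via the flow table, and
the helper name is illustrative.

.. code-block:: c

   #include <stdio.h>
   #include <inttypes.h>
   #include <rte_tm.h>

   /* Read (without clearing) the stats of one leaf node. */
   static void
   check_tm_leaf(uint16_t port_id, uint32_t node_id)
   {
           struct rte_tm_node_stats stats;
           struct rte_tm_error error;
           uint64_t mask;

           if (rte_tm_node_stats_read(port_id, node_id, &stats, &mask,
                                      0 /* do not clear */, &error) == 0)
                   printf("node %u: %" PRIu64 " pkts, %" PRIu64 " bytes\n",
                          node_id, stats.n_pkts, stats.n_bytes);
   }
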
Is the packet in an unexpected format?
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Packet capture before and after processing :numref:`dtg_pdump`.

.. _dtg_pdump:

.. figure:: img/dtg_pdump.*

   Capture points of traffic at RX-TX.

#. To isolate possible packet corruption in the processing pipeline,
   carefully staged packet capture points are to be implemented.

   * First, isolate at NIC entry and exit.

     Use pdump in the primary process to allow a secondary process to access
     the port queue pair. The packets get copied over in the RX|TX callback
     by the secondary process using ring buffers (see the sketch at the end
     of this section).

   * Second, isolate at pipeline entry and exit.

     Using hooks or callbacks, capture the packets in the middle of the
     pipeline stage to copy the packets, which can be shared to the
     secondary debug process via user-defined custom rings.

.. note::

   Use a similar analysis for object and metadata corruption.

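The RX|TX callback approach can be sketched as below: a callback bumps the
reference count of each packet and queues it onto a debug ring for another
process to drain. This is a minimal, single-segment sketch; ``debug_ring`` is
an assumed, application-created ring, and the in-tree pdump library automates
this pattern.

.. code-block:: c

   #include <rte_ethdev.h>
   #include <rte_mbuf.h>
   #include <rte_ring.h>

   static struct rte_ring *debug_ring; /* created by the application */

   /* RX callback: keep a reference to each packet on the debug ring. */
   static uint16_t
   capture_cb(uint16_t port, uint16_t queue, struct rte_mbuf **pkts,
              uint16_t nb_pkts, uint16_t max_pkts, void *user_param)
   {
           uint16_t i;

           for (i = 0; i < nb_pkts; i++) {
                   rte_mbuf_refcnt_update(pkts[i], 1);
                   if (rte_ring_enqueue(debug_ring, pkts[i]) != 0)
                           rte_pktmbuf_free(pkts[i]); /* ring full */
           }
           return nb_pkts;
   }

The callback is registered per port queue pair with
``rte_eth_add_rx_callback(port_id, queue_id, capture_cb, NULL)``.
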
Does the issue still persist?
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

The issue can be further narrowed down to the following causes.

#. If there is vendor or application specific metadata, check for errors due
   to the metadata error flags. Dumping the private metadata in the objects
   can give insight into details for debugging.

#. If there are multiple processes for either data or configuration, check
   for possible errors in the secondary process where the configuration
   fails, and for possible data corruption in the data plane.

#. Random drops in RX or TX when other applications are started indicate the
   effect of a noisy neighbor. Try using the cache allocation technique to
   minimize the effect between applications.

How to develop custom code to debug?
------------------------------------

#. For an application that runs as the primary process only, debug
   functionality is added in the same process. This can be invoked by a
   timer call-back, a service core, or a signal handler, as in the sketch
   below.

#. For an application that runs as multiple processes, debug functionality
   can be added in a standalone secondary process.

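As an illustration of the signal handler option, the sketch below dumps the
stack on ``SIGUSR1``. The function names are illustrative; production code
would only set a flag in the handler and do the dump from a safe context,
such as a timer callback or a service core.

.. code-block:: c

   #include <signal.h>
   #include <stdio.h>
   #include <rte_debug.h>

   /* Dump the calling thread's stack when SIGUSR1 arrives. */
   static void
   debug_signal_handler(int sig)
   {
           printf("signal %d: dumping state\n", sig);
           rte_dump_stack();
   }

   static int
   register_debug_handler(void)
   {
           return signal(SIGUSR1, debug_signal_handler) == SIG_ERR ? -1 : 0;
   }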
461