..  SPDX-License-Identifier: BSD-3-Clause
    Copyright(c) 2018 Intel Corporation.

Debug & Troubleshoot guide
==========================

DPDK applications can be designed to have simple or complex pipeline processing
stages making use of single or multiple threads. Applications can use poll mode
drivers on hardware devices, which also helps in offloading CPU cycles. It is
common to find solutions designed with

* single or multiple primary processes

* single primary and single secondary

* single primary and multiple secondaries

In all the above cases, it is tedious to isolate, debug, and understand various
behaviors which occur randomly or periodically. The goal of this guide is to
consolidate a few commonly seen issues for reference, and then to isolate and
identify the root cause through step-by-step debugging at various stages.

.. note::

   It is difficult to cover all possible issues in a single attempt. With
   feedback and suggestions from the community, more cases can be covered.


Application Overview
--------------------

Using the application model below as a reference, we can discuss multiple
causes of issues in this guide. Let us assume the sample makes use of a single
primary process, with various processing stages running on multiple cores. The
application may also make use of Poll Mode Drivers and libraries like service
cores, mempool, mbuf, eventdev, cryptodev, QoS, and ethdev.

The overview of an application modeled using PMD is shown in
:numref:`dtg_sample_app_model`.

.. _dtg_sample_app_model:

.. figure:: img/dtg_sample_app_model.*

   Overview of the pipeline stages of an application


Bottleneck Analysis
-------------------

Factors that drive the design decisions include the platform, the scale
factor, and the target. These distinct preferences lead to multiple
combinations built using PMDs and DPDK libraries. The compiler, library mode,
and optimization flags are assumed to be constant here, although they affect
the application too.


Is there a mismatch in the packet (received < desired) rate?
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

RX port and associated core :numref:`dtg_rx_rate`. A code sketch of the basic
checks follows the checklist below.

.. _dtg_rx_rate:

.. figure:: img/dtg_rx_rate.*

   RX packet rate compared against received rate.

#. Is the RX configuration set up correctly?

   * Identify if the port speed and duplex match the desired values with
     ``rte_eth_link_get``.

   * If drops do not occur for the port's own MAC address but do for other
     destinations, check the promiscuous mode with ``rte_eth_promiscuous_get``.

#. Are the drops isolated to certain NICs only?

   * Make use of ``rte_eth_stats_get`` to identify the cause of the drops.

   * If there are mbuf drops, check whether ``nb_desc`` for the RX descriptor
     ring is sufficient for the application.

   * If ``rte_eth_stats_get`` shows drops on specific RX queues, ensure the RX
     lcore threads have enough cycles for ``rte_eth_rx_burst`` on the
     port-queue pair.

   * If packets are redirected to a specific port-queue pair, ensure the RX
     lcore thread gets enough cycles.

   * Check the RSS configuration with ``rte_eth_dev_rss_hash_conf_get`` if the
     spread is uneven and causing drops.

   * If the PMD stats are not updating, then an offload or configuration might
     be dropping the incoming traffic.

#. Are drops still seen?

   * If there are multiple port-queue pairs, the RX thread, RX distributor, or
     event RX adapter might not have enough cycles.

   * If drops are seen for the RX adapter or RX distributor, try using
     ``rte_prefetch_non_temporal``, which hints to the core that the mbuf in
     the cache is temporary.
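
The first checks above can be scripted. A minimal sketch, assuming a single
already started ``port_id`` and with error handling trimmed for brevity:

.. code-block:: c

   #include <inttypes.h>
   #include <stdio.h>
   #include <rte_ethdev.h>

   static void
   check_rx_setup(uint16_t port_id)
   {
       struct rte_eth_link link;
       struct rte_eth_rss_conf rss_conf = { .rss_key = NULL };

       /* Do the negotiated speed and duplex match the expectation? */
       rte_eth_link_get(port_id, &link);
       printf("port %u: link %s, %u Mbps, %s duplex\n", port_id,
              link.link_status ? "up" : "down", link.link_speed,
              link.link_duplex == ETH_LINK_FULL_DUPLEX ? "full" : "half");

       /* Drops only for foreign MAC addresses? Check promiscuous mode. */
       printf("port %u: promiscuous %d\n",
              port_id, rte_eth_promiscuous_get(port_id));

       /* Uneven spread across queues? Inspect the active RSS hash types. */
       if (rte_eth_dev_rss_hash_conf_get(port_id, &rss_conf) == 0)
           printf("port %u: rss_hf 0x%" PRIx64 "\n", port_id, rss_conf.rss_hf);
   }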


Are there packet drops at receive or transmit?
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

RX-TX port and associated cores :numref:`dtg_rx_tx_drop`. A sketch of polling
the relevant counters follows the checklist below.

.. _dtg_rx_tx_drop:

.. figure:: img/dtg_rx_tx_drop.*

   RX-TX drops

#. At RX

   * Identify if there are multiple RX queues configured for the port by
     checking ``nb_rx_queues`` using ``rte_eth_dev_info_get``.

   * If ``rte_eth_stats_get`` shows drops in ``q_errors``, check whether the
     RX thread is configured to fetch packets from the port-queue pair.

   * If ``rte_eth_stats_get`` shows drops in ``rx_nombuf``, check whether the
     RX thread has enough cycles to consume the packets from the queue.

#. At TX

   * If the TX rate is falling behind the application fill rate, identify
     whether there are enough TX descriptors with ``rte_eth_dev_info_get``.

   * Check that ``nb_pkts`` passed to ``rte_eth_tx_burst`` covers multiple
     packets.

   * Check whether ``rte_eth_tx_burst`` invokes the vector function of the PMD.

   * If ``oerrors`` is incrementing, TX packet validation is failing. Check
     whether there are queue-specific offload failures.

   * If the drops occur for large packets, check the MTU and multi-segment
     support configured for the NIC.
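
The counters referenced above can be polled from a debug loop. A minimal
sketch, assuming ``port_id`` is already configured and started and
``nb_rx_queues`` is known from ``rte_eth_dev_info_get``:

.. code-block:: c

   #include <inttypes.h>
   #include <stdio.h>
   #include <rte_ethdev.h>

   static void
   dump_drop_stats(uint16_t port_id, uint16_t nb_rx_queues)
   {
       struct rte_eth_stats stats;
       uint16_t q;

       if (rte_eth_stats_get(port_id, &stats) != 0)
           return;

       /* imissed: dropped by the NIC; rx_nombuf: mbuf allocation failures;
        * oerrors: packets that failed TX validation or transmission.
        */
       printf("port %u: imissed %" PRIu64 " rx_nombuf %" PRIu64
              " oerrors %" PRIu64 "\n",
              port_id, stats.imissed, stats.rx_nombuf, stats.oerrors);

       /* Per-queue error counters point at the starved queue. */
       for (q = 0; q < nb_rx_queues && q < RTE_ETHDEV_QUEUE_STAT_CNTRS; q++)
           printf("  rxq %u: q_errors %" PRIu64 "\n", q, stats.q_errors[q]);
   }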


Are there object drops at the producer point for the ring library?
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Producer point for the ring :numref:`dtg_producer_ring`. A sketch of a
producer-side health check follows the checklist below.

.. _dtg_producer_ring:

.. figure:: img/dtg_producer_ring.*

   Producer point for Rings

#. Performance issue isolation at the producer

   * Use ``rte_ring_dump`` to validate that the single producer flag
     ``RING_F_SP_ENQ`` is set only where a single producer is intended.

   * There should be a sufficient ``rte_ring_free_count`` at any point in time.

   * Extreme stalls in the dequeue stage of the pipeline will cause
     ``rte_ring_full`` to be true.
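
A minimal producer-side sketch, assuming ``r`` is a ring created elsewhere by
the application:

.. code-block:: c

   #include <stdio.h>
   #include <rte_ring.h>

   static void
   check_ring_producer(const struct rte_ring *r)
   {
       /* Dumps flags (RING_F_SP_ENQ/RING_F_SC_DEQ), size and indexes. */
       rte_ring_dump(stdout, r);

       /* A ring that is (nearly) always full points to a stalled consumer. */
       printf("%s: free %u, full %d\n",
              r->name, rte_ring_free_count(r), rte_ring_full(r));
   }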


Are there object drops at the consumer point for the ring library?
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Consumer point for the ring :numref:`dtg_consumer_ring`. A matching
consumer-side sketch follows the checklist below.

.. _dtg_consumer_ring:

.. figure:: img/dtg_consumer_ring.*

   Consumer point for Rings

#. Performance issue isolation at the consumer

   * Use ``rte_ring_dump`` to validate that the single consumer flag
     ``RING_F_SC_DEQ`` is set only where a single consumer is intended.

   * If the desired burst dequeue falls behind the actual dequeue, the enqueue
     stage is not filling up the ring as required.

   * Extreme stalls in the enqueue stage will cause ``rte_ring_empty`` to be
     true.
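
A minimal consumer-side counterpart of the previous sketch, again assuming
``r`` is an application-created ring:

.. code-block:: c

   #include <stdio.h>
   #include <rte_ring.h>

   static void
   check_ring_consumer(const struct rte_ring *r)
   {
       /* A ring that is (nearly) always empty points to a lagging producer,
        * provided the consumer burst size is reasonable.
        */
       printf("%s: used %u, empty %d\n",
              r->name, rte_ring_count(r), rte_ring_empty(r));
   }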


Is there a variance in packet or object processing rate in the pipeline?
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Memory objects close to NUMA :numref:`dtg_mempool`. A sketch of the mempool
checks follows the lists below.

.. _dtg_mempool:

.. figure:: img/dtg_mempool.*

   Memory objects have to be close to the device per NUMA node.

#. Stalls in the processing pipeline can be attributed to MBUF release delays.
   These can be narrowed down to

   * Heavy processing cycles at a single or at multiple processing stages.

   * The cache working set is spread due to the increased number of stages in
     the pipeline.

   * The CPU thread responsible for TX is not able to keep up with the burst
     of traffic.

   * Extra cycles spent to linearize multi-segment buffers and for software
     offloads like checksum, TSO, and VLAN strip.

   * Packet buffer copies in the fast path also result in stalls in MBUF
     release if not done selectively.

   * Application logic sets ``rte_pktmbuf_refcnt_set`` to higher than the
     desired value, frequently uses ``rte_pktmbuf_prefree_seg``, and does not
     release the MBUF back to the mempool.

#. Lower performance between the pipeline processing stages can be due to

   * The NUMA node for packets or objects from the NIC, mempool, and ring
     should be the same.

   * Drops on a specific socket are due to insufficient objects in the pool.
     Use ``rte_mempool_get_count`` or ``rte_mempool_avail_count`` to monitor
     when drops occur.

   * Try prefetching the content in the processing pipeline logic to minimize
     the stalls.

#. Performance issues can be due to special cases

   * Check whether the MBUF is contiguous with ``rte_pktmbuf_is_contiguous``,
     as certain offloads require it.

   * Use ``rte_mempool_cache_create`` for user threads that require access to
     mempool objects.

   * If the variance is absent for larger huge pages, then try
     ``rte_mem_lock_page`` on the objects, packets, and lookup tables to
     isolate the issue.
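
A minimal sketch of the NUMA and pool-exhaustion checks, assuming ``mp`` is
the RX mempool used by ``port_id``:

.. code-block:: c

   #include <stdio.h>
   #include <rte_ethdev.h>
   #include <rte_mempool.h>

   static void
   check_mempool(uint16_t port_id, const struct rte_mempool *mp)
   {
       int dev_socket = rte_eth_dev_socket_id(port_id);

       /* NIC, mempool, and ring should sit on the same NUMA node. */
       if (dev_socket >= 0 && mp->socket_id != dev_socket)
           printf("WARN: pool %s on socket %d, port %u on socket %d\n",
                  mp->name, mp->socket_id, port_id, dev_socket);

       /* A pool trending towards zero available objects indicates delayed
        * MBUF release somewhere in the pipeline.
        */
       printf("pool %s: avail %u, in use %u\n", mp->name,
              rte_mempool_avail_count(mp), rte_mempool_in_use_count(mp));
   }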


Is there a variance in cryptodev performance?
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Crypto device and PMD :numref:`dtg_crypto`. A sketch of the basic cryptodev
checks follows the checklist below.

.. _dtg_crypto:

.. figure:: img/dtg_crypto.*

   CRYPTO and interaction with PMD device.

#. Performance issue isolation for enqueue

   * Ensure the cryptodev, resources, and enqueue are running on the same
     NUMA node cores.

   * Isolate the cause of errors reported in ``enqueue_err_count`` using
     ``rte_cryptodev_stats``.

   * Parallelize the enqueue thread across multiple queue pairs.

#. Performance issue isolation for dequeue

   * Ensure the cryptodev, resources, and dequeue are running on the same
     NUMA node cores.

   * Isolate the cause of errors reported in ``dequeue_err_count`` using
     ``rte_cryptodev_stats``.

   * Parallelize the dequeue thread across multiple queue pairs.

#. Performance issue isolation for crypto operations

   * If cryptodev software-assist is in use, ensure the library is built with
     the right (SIMD) flags, or check whether the queue pair uses the CPU ISA
     (AVX|SSE|NEON) via ``feature_flags`` from ``rte_cryptodev_info_get``.

   * If cryptodev hardware-assist is in use, ensure both the firmware and the
     drivers are up to date.

#. Configuration issue isolation

   * Identify cryptodev instances with ``rte_cryptodev_count`` and
     ``rte_cryptodev_info_get``.
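
A minimal sketch that walks all crypto devices and reports their NUMA
placement, feature flags, and error counters:

.. code-block:: c

   #include <inttypes.h>
   #include <stdio.h>
   #include <rte_cryptodev.h>

   static void
   check_cryptodevs(void)
   {
       uint8_t dev_id, nb_devs = rte_cryptodev_count();
       struct rte_cryptodev_stats stats;
       struct rte_cryptodev_info info;

       for (dev_id = 0; dev_id < nb_devs; dev_id++) {
           rte_cryptodev_info_get(dev_id, &info);
           printf("cryptodev %u: driver %s, socket %d, flags 0x%" PRIx64 "\n",
                  dev_id, info.driver_name,
                  rte_cryptodev_socket_id(dev_id), info.feature_flags);

           /* Non-zero error counters isolate enqueue vs dequeue issues. */
           if (rte_cryptodev_stats_get(dev_id, &stats) == 0)
               printf("  enqueue_err %" PRIu64 " dequeue_err %" PRIu64 "\n",
                      stats.enqueue_err_count, stats.dequeue_err_count);
       }
   }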


Is the performance of user functions not as expected?
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Custom worker function :numref:`dtg_distributor_worker`. A placement-reporting
sketch follows the checklist below.

.. _dtg_distributor_worker:

.. figure:: img/dtg_distributor_worker.*

   Custom worker function performance drops.

#. Performance issue isolation

   * Functions running on CPU cores without context switches are the best
     performing scenarios. Identify the lcore with ``rte_lcore_id`` and the
     lcore-to-CPU mapping with ``rte_lcore_index``.

   * Use ``rte_thread_get_affinity`` to isolate functions running on the same
     CPU core.

#. Configuration issue isolation

   * Identify the core role using ``rte_eal_lcore_role`` to distinguish RTE,
     OFF, SERVICE, and NON_EAL cores. Check that performance-critical
     functions are mapped to run on the intended cores.

   * For high-performance execution logic, ensure it is running on the correct
     NUMA node and worker core.

   * Analyze the run logic with ``rte_dump_stack`` and ``rte_memdump`` for
     more insights.

   * Make use of objdump to ensure the opcodes match the desired state.
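
A minimal sketch that a worker function can call to report where it actually
runs:

.. code-block:: c

   #include <stdio.h>
   #include <rte_eal.h>
   #include <rte_lcore.h>

   static void
   report_worker_placement(void)
   {
       unsigned int lcore = rte_lcore_id();

       /* lcore id, its index, its EAL role and the NUMA socket it sits on. */
       printf("lcore %u: index %d, role %d, socket %u\n",
              lcore, rte_lcore_index(lcore),
              (int)rte_eal_lcore_role(lcore), rte_lcore_to_socket_id(lcore));
   }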


Are the execution cycles for dynamic service functions not frequent?
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Service functions on service cores :numref:`dtg_service`. A sketch of the
service checks follows the checklist below.

.. _dtg_service:

.. figure:: img/dtg_service.*

   Functions running on service cores

#. Performance issue isolation

   * For services configured for parallel execution,
     ``rte_service_lcore_count`` should be equal to
     ``rte_service_lcore_count_services``.

   * A service intended to run in parallel on all cores should report
     ``RTE_SERVICE_CAP_MT_SAFE`` via ``rte_service_probe_capability``, and
     ``rte_service_map_lcore_get`` should return a unique lcore mapping.

   * Check whether the execution cycles for the dynamic service functions are
     frequent enough.

   * If services share an lcore, the overall execution should fit within the
     cycle budget.

#. Configuration issue isolation

   * Check whether the service is running with ``rte_service_runstate_get``.

   * Generic debug via ``rte_service_dump``.
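
A minimal sketch, assuming ``service_id`` has been obtained elsewhere, for
example via ``rte_service_get_by_name``:

.. code-block:: c

   #include <stdint.h>
   #include <stdio.h>
   #include <rte_service.h>

   static void
   check_service(uint32_t service_id)
   {
       printf("service %u: runstate %d, MT safe %d, service lcores %d\n",
              service_id,
              rte_service_runstate_get(service_id),
              rte_service_probe_capability(service_id, RTE_SERVICE_CAP_MT_SAFE),
              (int)rte_service_lcore_count());

       /* Passing UINT32_MAX dumps the state of all registered services. */
       rte_service_dump(stdout, UINT32_MAX);
   }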


Is there a bottleneck in the performance of eventdev?
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

A sketch of the generic eventdev checks follows the lists below.

#. Check the generic configuration

   * Ensure the event devices created are on the right NUMA node, using
     ``rte_event_dev_count`` and ``rte_event_dev_socket_id``.

   * Check the event stages, in case the events are looped back into the same
     queue.

   * If the failure is in the enqueue stage for events, check the queue depth
     with ``rte_event_dev_info_get``.

#. If there are performance drops in the enqueue stage

   * Use ``rte_event_dev_dump`` to dump the eventdev information.

   * Periodically check the stats for queues and ports to identify starvation.

   * Check the in-flight events for the desired queue for enqueue and dequeue.
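
A minimal sketch of the generic checks, walking every event device:

.. code-block:: c

   #include <stdio.h>
   #include <rte_eventdev.h>

   static void
   check_eventdevs(void)
   {
       uint8_t dev_id, nb_devs = rte_event_dev_count();
       struct rte_event_dev_info info;

       for (dev_id = 0; dev_id < nb_devs; dev_id++) {
           if (rte_event_dev_info_get(dev_id, &info) != 0)
               continue;

           /* NUMA placement and the limits that cap queue and port setup. */
           printf("eventdev %u: driver %s, socket %d, max queues %u, "
                  "max ports %u\n",
                  dev_id, info.driver_name, rte_event_dev_socket_id(dev_id),
                  (unsigned int)info.max_event_queues,
                  (unsigned int)info.max_event_ports);

           /* Driver-specific dump of queues, ports and in-flight events. */
           rte_event_dev_dump(dev_id, stdout);
       }
   }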


Is there a variance in the traffic manager?
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Traffic Manager on the TX interface :numref:`dtg_qos_tx`. A capability-query
sketch follows the list below.

.. _dtg_qos_tx:

.. figure:: img/dtg_qos_tx.*

   Traffic Manager just before TX.

#. Identify whether the cause of the variance from the expected behavior is
   insufficient CPU cycles. Use ``rte_tm_capabilities_get`` to fetch the
   features for hierarchies, WRED, and priority schedulers that can be
   offloaded to hardware.

#. Undesired flow drops can be narrowed down to WRED, priority, and rate
   limiters.

#. Isolate the flow in which the undesired drops occur. Use
   ``rte_tm_get_number_of_leaf_nodes`` and the flow table to pin down the leaf
   where the drops occur.

#. Check the stats using ``rte_tm_node_stats_update`` and
   ``rte_tm_node_stats_read`` for drops in the hierarchy, scheduler, and WRED
   configurations.
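
A minimal sketch of the capability query, assuming the port has a TM-capable
PMD:

.. code-block:: c

   #include <stdio.h>
   #include <rte_tm.h>

   static void
   check_tm(uint16_t port_id)
   {
       struct rte_tm_capabilities cap;
       struct rte_tm_error error;
       uint32_t n_leaf = 0;

       /* Which parts of the hierarchy can the hardware offload? */
       if (rte_tm_capabilities_get(port_id, &cap, &error) == 0)
           printf("port %u TM: max nodes %u, max levels %u\n",
                  port_id, cap.n_nodes_max, cap.n_levels_max);

       /* Leaf count helps map flows to the leaf nodes where drops occur. */
       if (rte_tm_get_number_of_leaf_nodes(port_id, &n_leaf, &error) == 0)
           printf("port %u TM: %u leaf nodes\n", port_id, n_leaf);
   }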


Is the packet in an unexpected format?
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Packet capture before and after processing :numref:`dtg_pdump`.

.. _dtg_pdump:

.. figure:: img/dtg_pdump.*

   Capture points of traffic at RX-TX.

#. To isolate possible packet corruption in the processing pipeline, carefully
   staged packet capture has to be implemented.

   * First, isolate at NIC entry and exit.

     Use pdump in the primary process to allow a secondary process to access
     the port-queue pair. The packets get copied over in the RX|TX callback by
     the secondary process using ring buffers.

   * Second, isolate at pipeline entry and exit.

     Using hooks or callbacks, capture the packets in the middle of the
     pipeline stage and copy them, so they can be shared with the secondary
     debug process via user-defined custom rings. A callback sketch is shown
     below.
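
A minimal sketch of such a capture callback, assuming the application creates
a debug ring ``dbg_ring`` and registers the callback with
``rte_eth_add_rx_callback(port_id, queue_id, capture_cb, dbg_ring)``:

.. code-block:: c

   #include <rte_common.h>
   #include <rte_ethdev.h>
   #include <rte_mbuf.h>
   #include <rte_ring.h>

   static uint16_t
   capture_cb(uint16_t port, uint16_t queue, struct rte_mbuf *pkts[],
              uint16_t nb_pkts, uint16_t max_pkts, void *user_param)
   {
       struct rte_ring *dbg_ring = user_param;
       uint16_t i;

       for (i = 0; i < nb_pkts; i++) {
           /* Keep the mbuf alive until the debug consumer dumps and frees it. */
           rte_pktmbuf_refcnt_update(pkts[i], 1);
           if (rte_ring_enqueue(dbg_ring, pkts[i]) != 0)
               rte_pktmbuf_refcnt_update(pkts[i], -1);
       }

       RTE_SET_USED(port);
       RTE_SET_USED(queue);
       RTE_SET_USED(max_pkts);
       return nb_pkts;
   }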

.. note::

   Use a similar analysis for object and metadata corruption.


Does the issue still persist?
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

The issue can be further narrowed down to the following causes.

#. If there is vendor- or application-specific metadata, check for errors due
   to metadata error flags. Dumping the private metadata in the objects can
   give insight into the details for debugging.

#. If multiple processes are used for either data or configuration, check for
   possible errors in the secondary process where the configuration fails and
   for possible data corruption in the data plane.

#. Random drops in RX or TX when other applications are started indicate the
   effect of a noisy neighbor. Try using the cache allocation technique to
   minimize the effect between applications.


How to develop custom code to debug?
------------------------------------

#. For an application that runs as the primary process only, the debug
   functionality is added within the same process. It can be invoked by a
   timer callback, a service core, or a signal handler, as sketched below.

#. For an application that runs as multiple processes, add the debug
   functionality in a standalone secondary process.
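
A minimal sketch of the first option, assuming the primary process installs a
signal handler after ``rte_eal_init`` (for example ``signal(SIGUSR1,
debug_dump);``) so state can be dumped on demand with ``kill -USR1 <pid>``:

.. code-block:: c

   #include <stdio.h>
   #include <rte_mempool.h>
   #include <rte_ring.h>

   /* Debug-only hook: these dump helpers are not async-signal-safe, so in
    * production prefer deferring the dump to a service core or timer callback.
    */
   static void
   debug_dump(int sig)
   {
       (void)sig;

       /* Dump every mempool and ring known to this process. */
       rte_mempool_list_dump(stdout);
       rte_ring_list_dump(stdout);
   }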
460