1d67e6ecfSHonnappa Nagarahalli.. SPDX-License-Identifier: BSD-3-Clause 2d67e6ecfSHonnappa Nagarahalli Copyright(c) 2019 Arm Limited. 3d67e6ecfSHonnappa Nagarahalli 4*41dd9a6bSDavid YoungRead-Copy-Update (RCU) Library 5*41dd9a6bSDavid Young============================== 6d67e6ecfSHonnappa Nagarahalli 7d67e6ecfSHonnappa NagarahalliLockless data structures provide scalability and determinism. 8d67e6ecfSHonnappa NagarahalliThey enable use cases where locking may not be allowed 9d67e6ecfSHonnappa Nagarahalli(for example real-time applications). 10d67e6ecfSHonnappa Nagarahalli 11d67e6ecfSHonnappa NagarahalliIn the following sections, the term "memory" refers to memory allocated 12d67e6ecfSHonnappa Nagarahalliby typical APIs like malloc() or anything that is representative of 13d67e6ecfSHonnappa Nagarahallimemory, for example an index of a free element array. 14d67e6ecfSHonnappa Nagarahalli 15d67e6ecfSHonnappa NagarahalliSince these data structures are lockless, the writers and readers 16d67e6ecfSHonnappa Nagarahalliare accessing the data structures concurrently. Hence, while removing 17d67e6ecfSHonnappa Nagarahallian element from a data structure, the writers cannot return the memory 18d67e6ecfSHonnappa Nagarahallito the allocator, without knowing that the readers are not 19d67e6ecfSHonnappa Nagarahallireferencing that element/memory anymore. Hence, it is required to 20d67e6ecfSHonnappa Nagarahalliseparate the operation of removing an element into two steps: 21d67e6ecfSHonnappa Nagarahalli 22d67e6ecfSHonnappa Nagarahalli#. Delete: in this step, the writer removes the reference to the element from 23d67e6ecfSHonnappa Nagarahalli the data structure but does not return the associated memory to the 24d67e6ecfSHonnappa Nagarahalli allocator. This will ensure that new readers will not get a reference to 25d67e6ecfSHonnappa Nagarahalli the removed element. Removing the reference is an atomic operation. 26d67e6ecfSHonnappa Nagarahalli 27d67e6ecfSHonnappa Nagarahalli#. Free (Reclaim): in this step, the writer returns the memory to the 28d67e6ecfSHonnappa Nagarahalli memory allocator only after knowing that all the readers have stopped 29d67e6ecfSHonnappa Nagarahalli referencing the deleted element. 30d67e6ecfSHonnappa Nagarahalli 31d67e6ecfSHonnappa NagarahalliThis library helps the writer determine when it is safe to free the 32d67e6ecfSHonnappa Nagarahallimemory by making use of thread Quiescent State (QS). 33d67e6ecfSHonnappa Nagarahalli 34d67e6ecfSHonnappa NagarahalliWhat is Quiescent State 35d67e6ecfSHonnappa Nagarahalli----------------------- 36d67e6ecfSHonnappa Nagarahalli 37d67e6ecfSHonnappa NagarahalliQuiescent State can be defined as "any point in the thread execution where the 384831115fSHonnappa Nagarahallithread does not hold a reference to shared memory". It is the responsibility of 394831115fSHonnappa Nagarahallithe application to determine its quiescent state. 40d67e6ecfSHonnappa Nagarahalli 41d67e6ecfSHonnappa NagarahalliLet us consider the following diagram: 42d67e6ecfSHonnappa Nagarahalli 43d67e6ecfSHonnappa Nagarahalli.. _figure_quiescent_state: 44d67e6ecfSHonnappa Nagarahalli 45d67e6ecfSHonnappa Nagarahalli.. figure:: img/rcu_general_info.* 46d67e6ecfSHonnappa Nagarahalli 47d67e6ecfSHonnappa Nagarahalli Phases in the Quiescent State model. 48d67e6ecfSHonnappa Nagarahalli 49d67e6ecfSHonnappa Nagarahalli 50d67e6ecfSHonnappa NagarahalliAs shown in :numref:`figure_quiescent_state`, reader thread 1 accesses data 51d67e6ecfSHonnappa Nagarahallistructures D1 and D2. When it is accessing D1, if the writer has to remove an 52d67e6ecfSHonnappa Nagarahallielement from D1, the writer cannot free the memory associated with that 53d67e6ecfSHonnappa Nagarahallielement immediately. The writer can return the memory to the allocator only 54d67e6ecfSHonnappa Nagarahalliafter the reader stops referencing D1. In other words, reader thread RT1 has 55d67e6ecfSHonnappa Nagarahallito enter a quiescent state. 56d67e6ecfSHonnappa Nagarahalli 57d67e6ecfSHonnappa NagarahalliSimilarly, since reader thread 2 is also accessing D1, the writer has to 58d67e6ecfSHonnappa Nagarahalliwait till thread 2 enters quiescent state as well. 59d67e6ecfSHonnappa Nagarahalli 60d67e6ecfSHonnappa NagarahalliHowever, the writer does not need to wait for reader thread 3 to enter 61d67e6ecfSHonnappa Nagarahalliquiescent state. Reader thread 3 was not accessing D1 when the delete 6207d311ccSPrateek Agarwaloperation happened. So, reader thread 3 will not have a reference to the 63d67e6ecfSHonnappa Nagarahallideleted entry. 64d67e6ecfSHonnappa Nagarahalli 65d67e6ecfSHonnappa NagarahalliIt can be noted that, the critical sections for D2 is a quiescent state 66d67e6ecfSHonnappa Nagarahallifor D1. i.e. for a given data structure Dx, any point in the thread execution 67d67e6ecfSHonnappa Nagarahallithat does not reference Dx is a quiescent state. 68d67e6ecfSHonnappa Nagarahalli 69d67e6ecfSHonnappa NagarahalliSince memory is not freed immediately, there might be a need for 70d67e6ecfSHonnappa Nagarahalliprovisioning of additional memory, depending on the application requirements. 71d67e6ecfSHonnappa Nagarahalli 72d67e6ecfSHonnappa NagarahalliFactors affecting the RCU mechanism 73d67e6ecfSHonnappa Nagarahalli----------------------------------- 74d67e6ecfSHonnappa Nagarahalli 75d67e6ecfSHonnappa NagarahalliIt is important to make sure that this library keeps the overhead of 76d67e6ecfSHonnappa Nagarahalliidentifying the end of grace period and subsequent freeing of memory, 774831115fSHonnappa Nagarahallito a minimum. The following paras explain how grace period and critical 78d67e6ecfSHonnappa Nagarahallisection affect this overhead. 79d67e6ecfSHonnappa Nagarahalli 80d67e6ecfSHonnappa NagarahalliThe writer has to poll the readers to identify the end of grace period. 81d67e6ecfSHonnappa NagarahalliPolling introduces memory accesses and wastes CPU cycles. The memory 82d67e6ecfSHonnappa Nagarahalliis not available for reuse during the grace period. Longer grace periods 83d67e6ecfSHonnappa Nagarahalliexasperate these conditions. 84d67e6ecfSHonnappa Nagarahalli 85d67e6ecfSHonnappa NagarahalliThe length of the critical section and the number of reader threads 86d67e6ecfSHonnappa Nagarahalliis proportional to the duration of the grace period. Keeping the critical 87d67e6ecfSHonnappa Nagarahallisections smaller will keep the grace period smaller. However, keeping the 88d67e6ecfSHonnappa Nagarahallicritical sections smaller requires additional CPU cycles (due to additional 89d67e6ecfSHonnappa Nagarahallireporting) in the readers. 90d67e6ecfSHonnappa Nagarahalli 91d67e6ecfSHonnappa NagarahalliHence, we need the characteristics of a small grace period and large critical 924831115fSHonnappa Nagarahallisection. This library addresses these characteristics by allowing the writer 934831115fSHonnappa Nagarahallito do other work without having to block until the readers report their 944831115fSHonnappa Nagarahalliquiescent state. 95d67e6ecfSHonnappa Nagarahalli 96d67e6ecfSHonnappa NagarahalliRCU in DPDK 97d67e6ecfSHonnappa Nagarahalli----------- 98d67e6ecfSHonnappa Nagarahalli 994831115fSHonnappa NagarahalliFor DPDK applications, the beginning and end of a ``while(1)`` loop (where no 100d67e6ecfSHonnappa Nagarahallireferences to shared data structures are kept) act as perfect quiescent 101d67e6ecfSHonnappa Nagarahallistates. This will combine all the shared data structure accesses into a 102d67e6ecfSHonnappa Nagarahallisingle, large critical section which helps keep the overhead on the 103d67e6ecfSHonnappa Nagarahallireader side to a minimum. 104d67e6ecfSHonnappa Nagarahalli 105d67e6ecfSHonnappa NagarahalliDPDK supports a pipeline model of packet processing and service cores. 106d67e6ecfSHonnappa NagarahalliIn these use cases, a given data structure may not be used by all the 1074831115fSHonnappa Nagarahalliworkers in the application. The writer has to wait only for the workers that 1084831115fSHonnappa Nagarahalliuse the data structure to report their quiescent state. To provide the required 1094831115fSHonnappa Nagarahalliflexibility, this library has a concept of a QS variable. If required, the 1104831115fSHonnappa Nagarahalliapplication can create one QS variable per data structure to help it track the 1114831115fSHonnappa Nagarahalliend of grace period for each data structure. This helps keep the length of grace 112d67e6ecfSHonnappa Nagarahalliperiod to a minimum. 113d67e6ecfSHonnappa Nagarahalli 114d67e6ecfSHonnappa NagarahalliHow to use this library 115d67e6ecfSHonnappa Nagarahalli----------------------- 116d67e6ecfSHonnappa Nagarahalli 117d67e6ecfSHonnappa NagarahalliThe application must allocate memory and initialize a QS variable. 118d67e6ecfSHonnappa Nagarahalli 119d67e6ecfSHonnappa NagarahalliApplications can call ``rte_rcu_qsbr_get_memsize()`` to calculate the size 120d67e6ecfSHonnappa Nagarahalliof memory to allocate. This API takes a maximum number of reader threads, 121f50d9aadSHonnappa Nagarahalliusing this variable, as a parameter. 122d67e6ecfSHonnappa Nagarahalli 123d67e6ecfSHonnappa NagarahalliFurther, the application can initialize a QS variable using the API 124d67e6ecfSHonnappa Nagarahalli``rte_rcu_qsbr_init()``. 125d67e6ecfSHonnappa Nagarahalli 126d67e6ecfSHonnappa NagarahalliEach reader thread is assumed to have a unique thread ID. Currently, the 127d67e6ecfSHonnappa Nagarahallimanagement of the thread ID (for example allocation/free) is left to the 128d67e6ecfSHonnappa Nagarahalliapplication. The thread ID should be in the range of 0 to 129d67e6ecfSHonnappa Nagarahallimaximum number of threads provided while creating the QS variable. 130d67e6ecfSHonnappa NagarahalliThe application could also use ``lcore_id`` as the thread ID where applicable. 131d67e6ecfSHonnappa Nagarahalli 132d67e6ecfSHonnappa NagarahalliThe ``rte_rcu_qsbr_thread_register()`` API will register a reader thread 133d67e6ecfSHonnappa Nagarahallito report its quiescent state. This can be called from a reader thread. 134d67e6ecfSHonnappa NagarahalliA control plane thread can also call this on behalf of a reader thread. 135d67e6ecfSHonnappa NagarahalliThe reader thread must call ``rte_rcu_qsbr_thread_online()`` API to start 136d67e6ecfSHonnappa Nagarahallireporting its quiescent state. 137d67e6ecfSHonnappa Nagarahalli 138d67e6ecfSHonnappa NagarahalliSome of the use cases might require the reader threads to make blocking API 139d67e6ecfSHonnappa Nagarahallicalls (for example while using eventdev APIs). The writer thread should not 140d67e6ecfSHonnappa Nagarahalliwait for such reader threads to enter quiescent state. The reader thread must 141d67e6ecfSHonnappa Nagarahallicall ``rte_rcu_qsbr_thread_offline()`` API, before calling blocking APIs. It 142d67e6ecfSHonnappa Nagarahallican call ``rte_rcu_qsbr_thread_online()`` API once the blocking API call 143d67e6ecfSHonnappa Nagarahallireturns. 144d67e6ecfSHonnappa Nagarahalli 145d67e6ecfSHonnappa NagarahalliThe writer thread can trigger the reader threads to report their quiescent 146d67e6ecfSHonnappa Nagarahallistate by calling the API ``rte_rcu_qsbr_start()``. It is possible for multiple 147d67e6ecfSHonnappa Nagarahalliwriter threads to query the quiescent state status simultaneously. Hence, 148d67e6ecfSHonnappa Nagarahalli``rte_rcu_qsbr_start()`` returns a token to each caller. 149d67e6ecfSHonnappa Nagarahalli 150d67e6ecfSHonnappa NagarahalliThe writer thread must call ``rte_rcu_qsbr_check()`` API with the token to 151d67e6ecfSHonnappa Nagarahalliget the current quiescent state status. Option to block till all the reader 152d67e6ecfSHonnappa Nagarahallithreads enter the quiescent state is provided. If this API indicates that 153d67e6ecfSHonnappa Nagarahalliall the reader threads have entered the quiescent state, the application 154d67e6ecfSHonnappa Nagarahallican free the deleted entry. 155d67e6ecfSHonnappa Nagarahalli 156d67e6ecfSHonnappa NagarahalliThe APIs ``rte_rcu_qsbr_start()`` and ``rte_rcu_qsbr_check()`` are lock free. 157d67e6ecfSHonnappa NagarahalliHence, they can be called concurrently from multiple writers even while 158d67e6ecfSHonnappa Nagarahallirunning as worker threads. 159d67e6ecfSHonnappa Nagarahalli 160d67e6ecfSHonnappa NagarahalliThe separation of triggering the reporting from querying the status provides 161d67e6ecfSHonnappa Nagarahallithe writer threads flexibility to do useful work instead of blocking for the 162d67e6ecfSHonnappa Nagarahallireader threads to enter the quiescent state or go offline. This reduces the 1630e9d3de6SHonnappa Nagarahallimemory accesses due to continuous polling for the status. But, since the 1640e9d3de6SHonnappa Nagarahalliresource is freed at a later time, the token and the reference to the deleted 1650e9d3de6SHonnappa Nagarahalliresource need to be stored for later queries. 166d67e6ecfSHonnappa Nagarahalli 167d67e6ecfSHonnappa NagarahalliThe ``rte_rcu_qsbr_synchronize()`` API combines the functionality of 168d67e6ecfSHonnappa Nagarahalli``rte_rcu_qsbr_start()`` and blocking ``rte_rcu_qsbr_check()`` into a single 169d67e6ecfSHonnappa NagarahalliAPI. This API triggers the reader threads to report their quiescent state and 170d67e6ecfSHonnappa Nagarahallipolls till all the readers enter the quiescent state or go offline. This API 171d67e6ecfSHonnappa Nagarahallidoes not allow the writer to do useful work while waiting and introduces 1720e9d3de6SHonnappa Nagarahalliadditional memory accesses due to continuous polling. However, the application 1730e9d3de6SHonnappa Nagarahallidoes not have to store the token or the reference to the deleted resource. The 1740e9d3de6SHonnappa Nagarahalliresource can be freed immediately after ``rte_rcu_qsbr_synchronize()`` API 1750e9d3de6SHonnappa Nagarahallireturns. 176d67e6ecfSHonnappa Nagarahalli 177d67e6ecfSHonnappa NagarahalliThe reader thread must call ``rte_rcu_qsbr_thread_offline()`` and 178d67e6ecfSHonnappa Nagarahalli``rte_rcu_qsbr_thread_unregister()`` APIs to remove itself from reporting its 179d67e6ecfSHonnappa Nagarahalliquiescent state. The ``rte_rcu_qsbr_check()`` API will not wait for this reader 180d67e6ecfSHonnappa Nagarahallithread to report the quiescent state status anymore. 181d67e6ecfSHonnappa Nagarahalli 182d67e6ecfSHonnappa NagarahalliThe reader threads should call ``rte_rcu_qsbr_quiescent()`` API to indicate that 183d67e6ecfSHonnappa Nagarahallithey entered a quiescent state. This API checks if a writer has triggered a 184d67e6ecfSHonnappa Nagarahalliquiescent state query and update the state accordingly. 185d67e6ecfSHonnappa Nagarahalli 186d67e6ecfSHonnappa NagarahalliThe ``rte_rcu_qsbr_lock()`` and ``rte_rcu_qsbr_unlock()`` are empty functions. 18789c67ae2SCiara PowerHowever, these APIs can aid in debugging issues. One can mark the access to 18889c67ae2SCiara Powershared data structures on the reader side using these APIs. The 18989c67ae2SCiara Power``rte_rcu_qsbr_quiescent()`` will check if all the locks are unlocked. 1901d301a8aSRuifeng Wang 1911d301a8aSRuifeng WangResource reclamation framework for DPDK 1921d301a8aSRuifeng Wang--------------------------------------- 1931d301a8aSRuifeng Wang 1941d301a8aSRuifeng WangLock-free algorithms place additional burden of resource reclamation on 1951d301a8aSRuifeng Wangthe application. When a writer deletes an entry from a data structure, the writer: 1961d301a8aSRuifeng Wang 1971d301a8aSRuifeng Wang#. Has to start the grace period 1981d301a8aSRuifeng Wang#. Has to store a reference to the deleted resources in a FIFO 1991d301a8aSRuifeng Wang#. Should check if the readers have completed a grace period and free the resources. 2001d301a8aSRuifeng Wang 2011d301a8aSRuifeng WangThere are several APIs provided to help with this process. The writer 2021d301a8aSRuifeng Wangcan create a FIFO to store the references to deleted resources using ``rte_rcu_qsbr_dq_create()``. 2031d301a8aSRuifeng WangThe resources can be enqueued to this FIFO using ``rte_rcu_qsbr_dq_enqueue()``. 2041d301a8aSRuifeng WangIf the FIFO is full, ``rte_rcu_qsbr_dq_enqueue`` will reclaim the resources before enqueuing. It will also reclaim resources on regular basis to keep the FIFO from growing too large. If the writer runs out of resources, the writer can call ``rte_rcu_qsbr_dq_reclaim`` API to reclaim resources. ``rte_rcu_qsbr_dq_delete`` is provided to reclaim any remaining resources and free the FIFO while shutting down. 2051d301a8aSRuifeng Wang 2061d301a8aSRuifeng WangHowever, if this resource reclamation process were to be integrated in lock-free data structure libraries, it 2071d301a8aSRuifeng Wanghides this complexity from the application and makes it easier for the application to adopt lock-free algorithms. The following paragraphs discuss how the reclamation process can be integrated in DPDK libraries. 2081d301a8aSRuifeng Wang 2091d301a8aSRuifeng WangIn any DPDK application, the resource reclamation process using QSBR can be split into 4 parts: 2101d301a8aSRuifeng Wang 2111d301a8aSRuifeng Wang#. Initialization 2121d301a8aSRuifeng Wang#. Quiescent State Reporting 2131d301a8aSRuifeng Wang#. Reclaiming Resources 2141d301a8aSRuifeng Wang#. Shutdown 2151d301a8aSRuifeng Wang 2161d301a8aSRuifeng WangThe design proposed here assigns different parts of this process to client libraries and applications. The term 'client library' refers to lock-free data structure libraries such at rte_hash, rte_lpm etc. in DPDK or similar libraries outside of DPDK. The term 'application' refers to the packet processing application that makes use of DPDK such as L3 Forwarding example application, OVS, VPP etc.. 2171d301a8aSRuifeng Wang 2181d301a8aSRuifeng WangThe application has to handle 'Initialization' and 'Quiescent State Reporting'. So, 2191d301a8aSRuifeng Wang 2201d301a8aSRuifeng Wang* the application has to create the RCU variable and register the reader threads to report their quiescent state. 2211d301a8aSRuifeng Wang* the application has to register the same RCU variable with the client library. 2221d301a8aSRuifeng Wang* reader threads in the application have to report the quiescent state. This allows for the application to control the length of the critical section/how frequently the application wants to report the quiescent state. 2231d301a8aSRuifeng Wang 2241d301a8aSRuifeng WangThe client library will handle 'Reclaiming Resources' part of the process. The 2251d301a8aSRuifeng Wangclient libraries will make use of the writer thread context to execute the memory 2261d301a8aSRuifeng Wangreclamation algorithm. So, 2271d301a8aSRuifeng Wang 2281d301a8aSRuifeng Wang* client library should provide an API to register a RCU variable that it will use. It should call ``rte_rcu_qsbr_dq_create()`` to create the FIFO to store the references to deleted entries. 2291d301a8aSRuifeng Wang* client library should use ``rte_rcu_qsbr_dq_enqueue`` to enqueue the deleted resources on the FIFO and start the grace period. 2301d301a8aSRuifeng Wang* if the library runs out of resources while adding entries, it should call ``rte_rcu_qsbr_dq_reclaim`` to reclaim the resources and try the resource allocation again. 2311d301a8aSRuifeng Wang 2321d301a8aSRuifeng WangThe 'Shutdown' process needs to be shared between the application and the 2331d301a8aSRuifeng Wangclient library. 2341d301a8aSRuifeng Wang 2351d301a8aSRuifeng Wang* the application should make sure that the reader threads are not using the shared data structure, unregister the reader threads from the QSBR variable before calling the client library's shutdown function. 2361d301a8aSRuifeng Wang 2371d301a8aSRuifeng Wang* client library should call ``rte_rcu_qsbr_dq_delete`` to reclaim any remaining resources and free the FIFO. 2381d301a8aSRuifeng Wang 2391d301a8aSRuifeng WangIntegrating the resource reclamation with client libraries removes the burden from 2401d301a8aSRuifeng Wangthe application and makes it easy to use lock-free algorithms. 2411d301a8aSRuifeng Wang 2421d301a8aSRuifeng WangThis design has several advantages over currently known methods. 2431d301a8aSRuifeng Wang 2441d301a8aSRuifeng Wang#. Application does not need a dedicated thread to reclaim resources. Memory 2451d301a8aSRuifeng Wang reclamation happens as part of the writer thread with little impact on 2461d301a8aSRuifeng Wang performance. 2471d301a8aSRuifeng Wang#. The client library has better control over the resources. For example: the client 2481d301a8aSRuifeng Wang library can attempt to reclaim when it has run out of resources. 249