1.. SPDX-License-Identifier: BSD-3-Clause 2 Copyright(c) 2019 Arm Limited. 3 4Read-Copy-Update (RCU) Library 5============================== 6 7Lockless data structures provide scalability and determinism. 8They enable use cases where locking may not be allowed 9(for example real-time applications). 10 11In the following sections, the term "memory" refers to memory allocated 12by typical APIs like malloc() or anything that is representative of 13memory, for example an index of a free element array. 14 15Since these data structures are lockless, the writers and readers 16are accessing the data structures concurrently. Hence, while removing 17an element from a data structure, the writers cannot return the memory 18to the allocator, without knowing that the readers are not 19referencing that element/memory anymore. Hence, it is required to 20separate the operation of removing an element into two steps: 21 22#. Delete: in this step, the writer removes the reference to the element from 23 the data structure but does not return the associated memory to the 24 allocator. This will ensure that new readers will not get a reference to 25 the removed element. Removing the reference is an atomic operation. 26 27#. Free (Reclaim): in this step, the writer returns the memory to the 28 memory allocator only after knowing that all the readers have stopped 29 referencing the deleted element. 30 31This library helps the writer determine when it is safe to free the 32memory by making use of thread Quiescent State (QS). 33 34What is Quiescent State 35----------------------- 36 37Quiescent State can be defined as "any point in the thread execution where the 38thread does not hold a reference to shared memory". It is the responsibility of 39the application to determine its quiescent state. 40 41Let us consider the following diagram: 42 43.. _figure_quiescent_state: 44 45.. figure:: img/rcu_general_info.* 46 47 Phases in the Quiescent State model. 48 49 50As shown in :numref:`figure_quiescent_state`, reader thread 1 accesses data 51structures D1 and D2. When it is accessing D1, if the writer has to remove an 52element from D1, the writer cannot free the memory associated with that 53element immediately. The writer can return the memory to the allocator only 54after the reader stops referencing D1. In other words, reader thread RT1 has 55to enter a quiescent state. 56 57Similarly, since reader thread 2 is also accessing D1, the writer has to 58wait till thread 2 enters quiescent state as well. 59 60However, the writer does not need to wait for reader thread 3 to enter 61quiescent state. Reader thread 3 was not accessing D1 when the delete 62operation happened. So, reader thread 3 will not have a reference to the 63deleted entry. 64 65It can be noted that, the critical sections for D2 is a quiescent state 66for D1. i.e. for a given data structure Dx, any point in the thread execution 67that does not reference Dx is a quiescent state. 68 69Since memory is not freed immediately, there might be a need for 70provisioning of additional memory, depending on the application requirements. 71 72Factors affecting the RCU mechanism 73----------------------------------- 74 75It is important to make sure that this library keeps the overhead of 76identifying the end of grace period and subsequent freeing of memory, 77to a minimum. The following paras explain how grace period and critical 78section affect this overhead. 79 80The writer has to poll the readers to identify the end of grace period. 81Polling introduces memory accesses and wastes CPU cycles. The memory 82is not available for reuse during the grace period. Longer grace periods 83exasperate these conditions. 84 85The length of the critical section and the number of reader threads 86is proportional to the duration of the grace period. Keeping the critical 87sections smaller will keep the grace period smaller. However, keeping the 88critical sections smaller requires additional CPU cycles (due to additional 89reporting) in the readers. 90 91Hence, we need the characteristics of a small grace period and large critical 92section. This library addresses these characteristics by allowing the writer 93to do other work without having to block until the readers report their 94quiescent state. 95 96RCU in DPDK 97----------- 98 99For DPDK applications, the beginning and end of a ``while(1)`` loop (where no 100references to shared data structures are kept) act as perfect quiescent 101states. This will combine all the shared data structure accesses into a 102single, large critical section which helps keep the overhead on the 103reader side to a minimum. 104 105DPDK supports a pipeline model of packet processing and service cores. 106In these use cases, a given data structure may not be used by all the 107workers in the application. The writer has to wait only for the workers that 108use the data structure to report their quiescent state. To provide the required 109flexibility, this library has a concept of a QS variable. If required, the 110application can create one QS variable per data structure to help it track the 111end of grace period for each data structure. This helps keep the length of grace 112period to a minimum. 113 114How to use this library 115----------------------- 116 117The application must allocate memory and initialize a QS variable. 118 119Applications can call ``rte_rcu_qsbr_get_memsize()`` to calculate the size 120of memory to allocate. This API takes a maximum number of reader threads, 121using this variable, as a parameter. 122 123Further, the application can initialize a QS variable using the API 124``rte_rcu_qsbr_init()``. 125 126Each reader thread is assumed to have a unique thread ID. Currently, the 127management of the thread ID (for example allocation/free) is left to the 128application. The thread ID should be in the range of 0 to 129maximum number of threads provided while creating the QS variable. 130The application could also use ``lcore_id`` as the thread ID where applicable. 131 132The ``rte_rcu_qsbr_thread_register()`` API will register a reader thread 133to report its quiescent state. This can be called from a reader thread. 134A control plane thread can also call this on behalf of a reader thread. 135The reader thread must call ``rte_rcu_qsbr_thread_online()`` API to start 136reporting its quiescent state. 137 138Some of the use cases might require the reader threads to make blocking API 139calls (for example while using eventdev APIs). The writer thread should not 140wait for such reader threads to enter quiescent state. The reader thread must 141call ``rte_rcu_qsbr_thread_offline()`` API, before calling blocking APIs. It 142can call ``rte_rcu_qsbr_thread_online()`` API once the blocking API call 143returns. 144 145The writer thread can trigger the reader threads to report their quiescent 146state by calling the API ``rte_rcu_qsbr_start()``. It is possible for multiple 147writer threads to query the quiescent state status simultaneously. Hence, 148``rte_rcu_qsbr_start()`` returns a token to each caller. 149 150The writer thread must call ``rte_rcu_qsbr_check()`` API with the token to 151get the current quiescent state status. Option to block till all the reader 152threads enter the quiescent state is provided. If this API indicates that 153all the reader threads have entered the quiescent state, the application 154can free the deleted entry. 155 156The APIs ``rte_rcu_qsbr_start()`` and ``rte_rcu_qsbr_check()`` are lock free. 157Hence, they can be called concurrently from multiple writers even while 158running as worker threads. 159 160The separation of triggering the reporting from querying the status provides 161the writer threads flexibility to do useful work instead of blocking for the 162reader threads to enter the quiescent state or go offline. This reduces the 163memory accesses due to continuous polling for the status. But, since the 164resource is freed at a later time, the token and the reference to the deleted 165resource need to be stored for later queries. 166 167The ``rte_rcu_qsbr_synchronize()`` API combines the functionality of 168``rte_rcu_qsbr_start()`` and blocking ``rte_rcu_qsbr_check()`` into a single 169API. This API triggers the reader threads to report their quiescent state and 170polls till all the readers enter the quiescent state or go offline. This API 171does not allow the writer to do useful work while waiting and introduces 172additional memory accesses due to continuous polling. However, the application 173does not have to store the token or the reference to the deleted resource. The 174resource can be freed immediately after ``rte_rcu_qsbr_synchronize()`` API 175returns. 176 177The reader thread must call ``rte_rcu_qsbr_thread_offline()`` and 178``rte_rcu_qsbr_thread_unregister()`` APIs to remove itself from reporting its 179quiescent state. The ``rte_rcu_qsbr_check()`` API will not wait for this reader 180thread to report the quiescent state status anymore. 181 182The reader threads should call ``rte_rcu_qsbr_quiescent()`` API to indicate that 183they entered a quiescent state. This API checks if a writer has triggered a 184quiescent state query and update the state accordingly. 185 186The ``rte_rcu_qsbr_lock()`` and ``rte_rcu_qsbr_unlock()`` are empty functions. 187However, these APIs can aid in debugging issues. One can mark the access to 188shared data structures on the reader side using these APIs. The 189``rte_rcu_qsbr_quiescent()`` will check if all the locks are unlocked. 190 191Resource reclamation framework for DPDK 192--------------------------------------- 193 194Lock-free algorithms place additional burden of resource reclamation on 195the application. When a writer deletes an entry from a data structure, the writer: 196 197#. Has to start the grace period 198#. Has to store a reference to the deleted resources in a FIFO 199#. Should check if the readers have completed a grace period and free the resources. 200 201There are several APIs provided to help with this process. The writer 202can create a FIFO to store the references to deleted resources using ``rte_rcu_qsbr_dq_create()``. 203The resources can be enqueued to this FIFO using ``rte_rcu_qsbr_dq_enqueue()``. 204If the FIFO is full, ``rte_rcu_qsbr_dq_enqueue`` will reclaim the resources before enqueuing. It will also reclaim resources on regular basis to keep the FIFO from growing too large. If the writer runs out of resources, the writer can call ``rte_rcu_qsbr_dq_reclaim`` API to reclaim resources. ``rte_rcu_qsbr_dq_delete`` is provided to reclaim any remaining resources and free the FIFO while shutting down. 205 206However, if this resource reclamation process were to be integrated in lock-free data structure libraries, it 207hides this complexity from the application and makes it easier for the application to adopt lock-free algorithms. The following paragraphs discuss how the reclamation process can be integrated in DPDK libraries. 208 209In any DPDK application, the resource reclamation process using QSBR can be split into 4 parts: 210 211#. Initialization 212#. Quiescent State Reporting 213#. Reclaiming Resources 214#. Shutdown 215 216The design proposed here assigns different parts of this process to client libraries and applications. The term 'client library' refers to lock-free data structure libraries such at rte_hash, rte_lpm etc. in DPDK or similar libraries outside of DPDK. The term 'application' refers to the packet processing application that makes use of DPDK such as L3 Forwarding example application, OVS, VPP etc.. 217 218The application has to handle 'Initialization' and 'Quiescent State Reporting'. So, 219 220* the application has to create the RCU variable and register the reader threads to report their quiescent state. 221* the application has to register the same RCU variable with the client library. 222* reader threads in the application have to report the quiescent state. This allows for the application to control the length of the critical section/how frequently the application wants to report the quiescent state. 223 224The client library will handle 'Reclaiming Resources' part of the process. The 225client libraries will make use of the writer thread context to execute the memory 226reclamation algorithm. So, 227 228* client library should provide an API to register a RCU variable that it will use. It should call ``rte_rcu_qsbr_dq_create()`` to create the FIFO to store the references to deleted entries. 229* client library should use ``rte_rcu_qsbr_dq_enqueue`` to enqueue the deleted resources on the FIFO and start the grace period. 230* if the library runs out of resources while adding entries, it should call ``rte_rcu_qsbr_dq_reclaim`` to reclaim the resources and try the resource allocation again. 231 232The 'Shutdown' process needs to be shared between the application and the 233client library. 234 235* the application should make sure that the reader threads are not using the shared data structure, unregister the reader threads from the QSBR variable before calling the client library's shutdown function. 236 237* client library should call ``rte_rcu_qsbr_dq_delete`` to reclaim any remaining resources and free the FIFO. 238 239Integrating the resource reclamation with client libraries removes the burden from 240the application and makes it easy to use lock-free algorithms. 241 242This design has several advantages over currently known methods. 243 244#. Application does not need a dedicated thread to reclaim resources. Memory 245 reclamation happens as part of the writer thread with little impact on 246 performance. 247#. The client library has better control over the resources. For example: the client 248 library can attempt to reclaim when it has run out of resources. 249