1<?xml version="1.0" encoding="ISO-8859-1"?> 2<!DOCTYPE html 3 PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" 4 "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd"> 5 6<html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en" lang="en"> 7<head> 8 <meta name="AUTHOR" content="Stefan Olsson <stefan@xapa.se>" /> 9 <meta name="KEYWORDS" content="c++, libstdc++, g++, allocator, memory" /> 10 <meta name="DESCRIPTION" content="Allocators and allocation" /> 11 <meta name="GENERATOR" content="emacs and ten fingers" /> 12 <title>A fixed-size, multi-thread optimized allocator</title> 13<link rel="StyleSheet" href="../lib3styles.css" type="text/css" /> 14<link rel="Start" href="../documentation.html" type="text/html" 15 title="GNU C++ Standard Library" /> 16<link rel="Bookmark" href="howto.html" type="text/html" title="Extensions" /> 17<link rel="Copyright" href="../17_intro/license.html" type="text/html" /> 18</head> 19<body> 20 21<h1 class="centered"><a name="top">A fixed-size, multi-thread optimized allocator</a></h1> 22 23<p class="fineprint"><em> 24 The latest version of this document is always available at 25 <a href="http://gcc.gnu.org/onlinedocs/libstdc++/ext/mt_allocator.html"> 26 http://gcc.gnu.org/onlinedocs/libstdc++/ext/mt_allocator.html</a>. 27</em></p> 28 29<p><em> 30 To the <a href="http://gcc.gnu.org/libstdc++/">libstdc++-v3 homepage</a>. 31</em></p> 32 33<!-- ####################################################### --> 34<hr /> 35<h3 class="left"> 36 <a name="intro">Introduction</a> 37</h3> 38 39<p> The mt allocator [hereinafter referred to simply as "the 40allocator"] is a fixed size (power of two) allocator that was 41initially developed specifically to suit the needs of multi threaded 42applications [hereinafter referred to as an MT application]. Over time 43the allocator has evolved and been improved in many ways, in 44particular it now also does a good job in single threaded applications 45[hereinafter referred to as a ST application]. (Note: In this 46document, when referring to single threaded applications this also 47includes applications that are compiled with gcc without thread 48support enabled. This is accomplished using ifdef's on 49__GTHREADS). This allocator is tunable, very flexible, and capable of 50high-performance. 51</p> 52 53<p> 54The aim of this document is to describe - from a application point of 55view - the "inner workings" of the allocator. 56</p> 57 58<h3 class="left"> 59 <a name="design">Design Overview</a> 60</h3> 61 62<p> There are three general components to the allocator: a datum 63describing the characteristics of the memory pool, a policy class 64containing this pool that links instantiation types to common or 65individual pools, and a class inheriting from the policy class that is 66the actual allocator. 67</p> 68 69<p>The datum describing pools characteristics is 70 <pre> 71 template<bool _Thread> 72 class __pool 73 </pre> 74This class is parametrized on thread support, and is explicitly 75specialized for both multiple threads (with <code>bool==true</code>) 76and single threads (via <code>bool==false</code>.) It is possible to 77use a custom pool datum instead of the default class that is provided. 78</p> 79 80<p> There are two distinct policy classes, each of which can be used 81with either type of underlying pool datum. 
<h3 class="left">
  <a name="tune">Tunable parameters</a>
</h3>

<p>Certain allocation parameters can be modified, or tuned. There
exists a nested <code>struct __pool_base::_Tune</code> that contains all
these parameters, which include settings for
</p>
  <ul>
    <li>Alignment</li>
    <li>Maximum bytes before calling <code>::operator new</code> directly</li>
    <li>Minimum bytes</li>
    <li>Size of underlying global allocations</li>
    <li>Maximum number of supported threads</li>
    <li>Migration of deallocations to the global free list</li>
    <li>Shunt for global <code>new</code> and <code>delete</code></li>
  </ul>
<p>Adjusting parameters for a given instance of an allocator can only
happen before any allocations take place, when the allocator itself is
initialized. For instance:
</p>
<pre>
#include &lt;ext/mt_allocator.h&gt;

struct pod
{
  int i;
  int j;
};

int main()
{
  typedef pod value_type;
  typedef __gnu_cxx::__mt_alloc&lt;value_type&gt; allocator_type;
  typedef __gnu_cxx::__pool_base::_Tune tune_type;

  tune_type t_default;
  tune_type t_opt(16, 5120, 32, 5120, 20, 10, false);
  tune_type t_single(16, 5120, 32, 5120, 1, 10, false);

  tune_type t;
  t = allocator_type::_M_get_options();
  allocator_type::_M_set_options(t_opt);
  t = allocator_type::_M_get_options();

  allocator_type a;
  allocator_type::pointer p1 = a.allocate(128);
  allocator_type::pointer p2 = a.allocate(5128);

  a.deallocate(p1, 128);
  a.deallocate(p2, 5128);

  return 0;
}
</pre>
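<p>
For reference, the seven arguments passed to the <code>_Tune</code>
constructor in the example above are assumed here to correspond, in
order, to the tunables listed earlier; the definition in
<code>&lt;ext/mt_allocator.h&gt;</code> is authoritative.
</p>
<pre>
// A commented reading of the t_opt line above (assumed argument order):
tune_type t_opt(16,     // alignment of blocks, in bytes
                5120,   // maximum bytes: larger requests bypass the pool
                        //   and go directly to ::operator new
                32,     // minimum bytes
                5120,   // size of the underlying global (chunk) allocations
                20,     // maximum number of supported threads
                10,     // headroom, in percent, governing migration of
                        //   deallocated blocks to the global free list
                false); // if true, shunt everything to global new/delete
</pre>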
<h3 class="left">
  <a name="init">Initialization</a>
</h3>

<p>
The static variables (pointers to freelists, tuning parameters, etc.)
are initialized as above, or are set to the global defaults.
</p>

<p>
The very first allocate() call will always call the
_S_initialize_once() function. In order to make sure that this
function is called exactly once we make use of a __gthread_once call
in MT applications and check a static bool (_S_init) in ST
applications.
</p>

<p>
The _S_initialize() function then proceeds as follows.
</p>

<p>
If the GLIBCXX_FORCE_NEW environment variable is set, the bool
_S_force_new is set to true and the function returns. This causes
subsequent calls to allocate() to return memory obtained directly from
a new() call, and deallocate() to simply do a delete() call.
</p>

<p>
If the GLIBCXX_FORCE_NEW environment variable is not set, both ST and
MT applications first calculate the number of bins needed. A bin holds
blocks of a specific power-of-two size in bytes; by default the
allocator deals with requests of up to 128 bytes (or whatever the value
of _S_max_bytes is when _S_init() is called). This means that there
will be bins of the following sizes (in bytes): 1, 2, 4, 8, 16, 32, 64,
128.
</p>

<p>
Next the _S_binmap array is created. All requests are rounded up to the
next "large enough" bin: a request for 29 bytes, for example, will
cause a block from the "32 byte bin" to be returned to the application.
The purpose of _S_binmap is to speed up the process of finding out
which bin to use, so the value of _S_binmap[ 29 ] is initialized to 5
(bin 5 = 32 bytes).
</p>
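<p>
As a small illustration of this mapping (a sketch of the idea only, not
the library's internal code), a binmap along these lines translates a
request size into its bin index:
</p>
<pre>
#include &lt;cstddef&gt;

int main()
{
  const std::size_t max_bytes = 128;   // cf. the _S_max_bytes default
  std::size_t binmap[max_bytes + 1];

  // Bin n holds blocks of 2^n bytes: 1, 2, 4, ..., 128.
  std::size_t bin = 0;
  std::size_t bin_size = 1;
  for (std::size_t bytes = 1; bytes &lt;= max_bytes; ++bytes)
    {
      if (bytes > bin_size)
        {
          ++bin;
          bin_size *= 2;
        }
      binmap[bytes] = bin;
    }

  // A request for 29 bytes is served from bin 5 (32-byte blocks).
  return binmap[29] == 5 ? 0 : 1;
}
</pre>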
<p>
Next the _S_bin array is created. This array consists of bin_records,
one for each of the bins calculated earlier; if _S_max_bytes = 128
there will be 8 entries. Each bin_record is then initialized:
</p>
<ul>
  <li>bin_record->first = an array of pointers to block_records, with
  one entry per supported thread (in a ST application there is only 1
  thread, in a MT application there are _S_max_threads). This holds the
  pointer to the first free block for each thread in this bin. For
  example, to find the first free block of size 32 for thread number 3
  we would look at _S_bin[ 5 ].first[ 3 ]. These block_record pointers
  are then set to their initial values, i.e. _S_bin[ n ].first[ t ] =
  NULL.
  </li>
</ul>

<p>
Additionally a MT application will create a list of free thread ids.
The pointer to the first entry is stored in _S_thread_freelist_first.
The reason for this approach is that the __gthread_self() call will not
return a value that corresponds to the maximum number of threads
allowed, but rather a process id number or something else. So what we
do is create a list of thread_records. This list is _S_max_threads long
and each entry holds a size_t thread_id which is initialized to 1, 2,
3, 4, 5 and so on up to _S_max_threads. Each time a thread calls
allocate() or deallocate() we call _S_get_thread_id(), which looks at
the value of _S_thread_key, a thread local storage pointer. If this is
NULL we know that this is a newly created thread, so we pop the first
entry from the list and save the pointer to this record in the
_S_thread_key variable. The next time we get the pointer to the
thread_record back and use thread_record->thread_id as identification.
In other words, the first thread that calls allocate() will get the
first record in this list and thus be thread number 1, and will then
find the pointer to its first free 32-byte block in
_S_bin[ 5 ].first[ 1 ].
</p>

<p>
When we create the _S_thread_key we also register a destructor
(_S_thread_key_destr), which means that when the thread dies its
thread_record is returned to the front of this list and the thread id
can then be reused if a new thread is created. The list is protected by
a mutex (_S_thread_freelist_mutex) which is only locked when records
are removed from or added to the list.
</p>

<p>
The free and used counters of each bin_record are then initialized:
</p>
<ul>
  <li>bin_record->free = an array of size_t, one entry per thread. This
  keeps track of the number of blocks on a specific thread's freelist
  in each bin. For example, if a thread has 12 32-byte blocks on its
  freelist and allocates one of these, this counter is decreased to 11.
  </li>

  <li>bin_record->used = an array of size_t, one entry per thread. This
  keeps track of the number of blocks of this size currently in use by
  this thread. For example, if a thread has made 678 requests (and no
  deallocations...) for 32-byte blocks, this counter will read 678.
  </li>
</ul>
<p>
These arrays are set to their initial values, i.e.
_S_bin[ n ].free[ t ] = 0.
</p>

<p>
Finally, the mutex of each bin_record is initialized. The
bin_record->mutex is used to protect the global freelist. This concept
of a global freelist is explained in more detail in the section "A
multi threaded example", but basically this mutex is locked whenever a
block of memory is retrieved from or returned to the global freelist
for this specific bin. This only occurs when a number of blocks are
grabbed from the global list to a thread specific list or when a thread
decides to return some blocks to the global freelist.
</p>
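<p>
Putting the above together, the per-bin bookkeeping looks roughly like
the following sketch. These are not the library's actual definitions
(those live in <code>&lt;ext/mt_allocator.h&gt;</code> and differ in
detail); the member names simply mirror the description above.
</p>
<pre>
#include &lt;cstddef&gt;

// Rough sketch of the bookkeeping described above.
struct block_record
{
  block_record* next;      // next free block on the freelist
  std::size_t   thread_id; // owning thread (only used in MT applications)
  // The block's data area follows these fields in memory.
};

struct bin_record
{
  block_record** first; // first[t]: first free block for thread t
  std::size_t*   free;  // free[t]:  blocks on thread t's freelist
  std::size_t*   used;  // used[t]:  blocks of this size in use by thread t
  // In MT applications a mutex also guards this bin's global freelist.
};

int main()
{
  // With bins of 1..128 bytes, the first free 32-byte block owned by
  // thread 3 would be found via bin 5, mirroring _S_bin[ 5 ].first[ 3 ].
  return 0;
}
</pre>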
<p> Notes about deallocation. This allocator does not explicitly
release memory. Because of this, memory debugging programs like
valgrind or purify may notice leaks: sorry about this
inconvenience. Operating systems will reclaim allocated memory at
program termination anyway. If sidestepping this kind of noise is
desired, there are three options: use an allocator, like
<code>new_allocator</code>, that releases memory while debugging, use
GLIBCXX_FORCE_NEW to bypass the allocator's internal pools, or use a
custom pool datum that releases resources on destruction.</p>

<p>On systems with the function <code>__cxa_atexit</code>, the
allocator can be forced to free all memory allocated before program
termination with the member function
<code>__pool_type::_M_destroy</code>. However, because this member
function relies on the precise and exactly-conforming ordering of
static destructors, including those of a static local
<code>__pool</code> object, it should not be used, ever, on systems
that don't have the necessary underlying support. In addition, in
practice, forcing deallocation can be tricky, as it requires the
<code>__pool</code> object to be fully constructed before the object
that uses it is fully constructed. For most (but not all) STL
containers, this works, as an instance of the allocator is constructed
as part of a container's constructor. However, this assumption is
implementation-specific, and subject to change. For an example of a
pool that frees memory, see the following
  <a href="http://gcc.gnu.org/cgi-bin/cvsweb.cgi/gcc/libstdc%2b%2b-v3/testsuite/ext/mt_allocator/deallocate_local-6.cc">
  example.</a>
</p>

<h3 class="left">
  <a name="st_example">A single threaded example (and a primer for the multi threaded example!)</a>
</h3>

<p>
Let's start by describing how the data on a freelist is laid out in memory.
These are the first two blocks in the freelist for thread id 3 in bin 3 (8 bytes):
</p>
<pre>
+----------------+
| next* ---------|--+  (_S_bin[ 3 ].first[ 3 ] points here)
|                |  |
|                |  |
|                |  |
+----------------+  |
| thread_id = 3  |  |
|                |  |
|                |  |
|                |  |
+----------------+  |
| DATA           |  |  (A pointer to here is what is returned to
|                |  |   the application when needed)
|                |  |
|                |  |
|                |  |
|                |  |
|                |  |
|                |  |
+----------------+  |
+----------------+  |
| next*          |&lt;-+  (If next == NULL it's the last one on the list)
|                |
|                |
|                |
+----------------+
| thread_id = 3  |
|                |
|                |
|                |
+----------------+
| DATA           |
|                |
|                |
|                |
|                |
|                |
|                |
|                |
+----------------+
</pre>

<p>
With this in mind we simplify things a bit for a while and say that there is
only one thread (a ST application). In this case all operations are made to
what is referred to as the global pool - thread id 0 (no thread may be
assigned this id since thread ids span from 1 to _S_max_threads in a MT
application).
</p>
<p>
When the application requests memory (calling allocate()) we first look at the
requested size and if this is > _S_max_bytes we call new() directly and return.
</p>
<p>
If the requested size is within limits we start by finding out from which
bin we should serve this request by looking in _S_binmap.
</p>
<p>
A quick look at _S_bin[ bin ].first[ 0 ] tells us if there are any blocks of
this size on the freelist (0). If this is not NULL - fine, just remove the
block that _S_bin[ bin ].first[ 0 ] points to from the list,
update _S_bin[ bin ].first[ 0 ] and return a pointer to that block's data.
</p>
<p>
If the freelist is empty (the pointer is NULL) we must get memory from the
system and build a freelist within this memory. All requests for new memory
are made in chunks of _S_chunk_size. Knowing the size of a block_record and
the number of bytes that this bin stores, we then calculate how many blocks we
can create within this chunk, build the list, remove the first block, update
the pointer (_S_bin[ bin ].first[ 0 ]) and return a pointer to that block's
data.
</p>

<p>
Deallocation is equally simple; the pointer is cast back to a block_record
pointer, we look up which bin to use based on the size, add the block to the
front of the global freelist and update the pointer as needed
(_S_bin[ bin ].first[ 0 ]).
</p>

<p>
The decision to add deallocated blocks to the front of the freelist was made
after a set of performance measurements that showed that this is roughly 10%
faster than maintaining a set of "last pointers" as well.
</p>
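<p>
The following stand-alone sketch condenses the single threaded logic
above for one bin only: pop from the freelist when possible, otherwise
carve a new chunk into blocks, and push blocks back on the front on
deallocation. It is an illustration of the idea under simplified
assumptions, not the library's implementation.
</p>
<pre>
#include &lt;cstddef&gt;
#include &lt;new&gt;

namespace sketch
{
  struct block_record { block_record* next; };

  const std::size_t block_bytes = 32;    // payload size of this bin
  const std::size_t chunk_size  = 4096;  // cf. _S_chunk_size
  block_record* first = 0;               // cf. _S_bin[ bin ].first[ 0 ]

  void* allocate()
  {
    if (!first)
      {
        // Freelist empty: grab a chunk and carve it into blocks.
        char* chunk = static_cast&lt;char*&gt;(::operator new(chunk_size));
        const std::size_t record_bytes = sizeof(block_record) + block_bytes;
        const std::size_t blocks = chunk_size / record_bytes;
        for (std::size_t i = 0; i &lt; blocks; ++i)
          {
            block_record* b
              = reinterpret_cast&lt;block_record*&gt;(chunk + i * record_bytes);
            b->next = first;
            first = b;
          }
        // Note: the chunk itself is never returned to the system,
        // mirroring the allocator's own behaviour.
      }
    // Pop the first block and hand its data area to the caller.
    block_record* b = first;
    first = b->next;
    return reinterpret_cast&lt;char*&gt;(b) + sizeof(block_record);
  }

  void deallocate(void* p)
  {
    // Step back to the block_record and push it on the front of the list.
    block_record* b
      = reinterpret_cast&lt;block_record*&gt;(static_cast&lt;char*&gt;(p)
                                        - sizeof(block_record));
    b->next = first;
    first = b;
  }
}

int main()
{
  void* p = sketch::allocate();
  void* q = sketch::allocate();
  sketch::deallocate(p);
  sketch::deallocate(q);
  return 0;
}
</pre>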
<h3 class="left">
  <a name="mt_example">A multi threaded example</a>
</h3>

<p>
In the ST example we never used the thread_id variable present in each block.
Let's start by explaining the purpose of this in a MT application.
</p>

<p>
The concept of "ownership" was introduced since many MT applications
allocate and deallocate memory to shared containers from different
threads (such as a cache shared amongst all threads). This introduces
a problem if the allocator only returns memory to the current thread's
freelist (i.e., there might be one thread doing all the allocation and
thus obtaining ever more memory from the system, and another thread
that is getting a longer and longer freelist - this would in the end
consume all available memory).
</p>

<p>
Each time a block is moved from the global list (where ownership is
irrelevant) to a thread's freelist (or when a new freelist is built
from a chunk directly onto a thread's freelist, or when a deallocation
occurs on a block which was not allocated by the same thread id as the
one doing the deallocation) the thread id is set to the current one.
</p>

<p>
What's the use? Well, when a deallocation occurs we can now look at
the thread id, find out if the block was allocated by another thread,
and decrease the used counter of that thread instead, thus keeping the
free and used counters correct. And keeping the free and used counters
correct is very important, since the relationship between these two
variables decides if memory should be returned to the global pool or
not when a deallocation occurs.
</p>

<p>
When the application requests memory (calling allocate()) we first
look at the requested size and if this is > _S_max_bytes we call new()
directly and return.
</p>

<p>
If the requested size is within limits we start by finding out from which
bin we should serve this request by looking in _S_binmap.
</p>

<p>
A call to _S_get_thread_id() returns the thread id for the calling thread
(and if no value has been set in _S_thread_key, a new id is assigned and
returned).
</p>

<p>
A quick look at _S_bin[ bin ].first[ thread_id ] tells us if there are
any blocks of this size on the current thread's freelist. If this is
not NULL - fine, just remove the block that _S_bin[ bin ].first[
thread_id ] points to from the list, update _S_bin[ bin ].first[
thread_id ], update the free and used counters and return a pointer to
that block's data.
</p>

<p>
If the freelist is empty (the pointer is NULL) we start by looking at
the global freelist (0). If there are blocks available on the global
freelist we lock this bin's mutex and move up to block_count blocks
(the number of blocks of this bin's size that will fit into a
_S_chunk_size) - or as many as there are until the end of the list,
whichever comes first - to the current thread's freelist, and at the
same time change the thread_id ownership and update the counters and
pointers. When the bin's mutex has been unlocked, we remove the block
that _S_bin[ bin ].first[ thread_id ] points to from the list, update
_S_bin[ bin ].first[ thread_id ], update the free and used counters,
and return a pointer to that block's data.
</p>

<p>
The reason that the number of blocks moved to the current thread's
freelist is limited to block_count is to minimize the chance that a
subsequent deallocate() call will return the excess blocks to the
global freelist (based on the _S_freelist_headroom calculation, see
below).
</p>

<p>
However, if there isn't any memory in the global pool we need to get
memory from the system - this is done in exactly the same way as in a
single threaded application, with one major difference: the list built
in the newly allocated memory (of _S_chunk_size size) is added to the
current thread's freelist instead of to the global one.
</p>

<p>
The basic process of a deallocation call is simple: always add the
block to the front of the current thread's freelist and update the
counters and pointers (as described earlier, with the specific check of
ownership that causes the used counter of the thread that originally
allocated the block to be decreased instead of the current thread's
counter).
</p>

<p>
This is where the free and used counters come into play. Each time a
deallocate() call is made, the length of the current thread's freelist
is compared to the amount of memory in use by this thread.
</p>

<p>
Let's go back to the example of an application that has one thread
that does all the allocations and one that deallocates. Say both these
threads use 516 32-byte blocks that were allocated during thread
creation, for example. Their used counters will both say 516 at this
point. The allocation thread now grabs 1000 32-byte blocks and puts
them in a shared container. The used counter for this thread is now
1516.
</p>

<p>
The deallocation thread now deallocates 500 of these blocks. For each
deallocation made, the used counter of the allocating thread is
decreased and the freelist of the deallocation thread gets longer and
longer. But the calculation made in deallocate() will limit the length
of the freelist in the deallocation thread to _S_freelist_headroom %
of its used counter. In this case, once the freelist exceeds 52 blocks
(516/10, given that _S_freelist_headroom is at its default value of
10%), blocks will be returned to the global pool where the allocating
thread may pick them up and reuse them.
</p>

<p>
In order to reduce lock contention (since this requires this bin's
mutex to be locked) this operation is also made in chunks of blocks
(just like when chunks of blocks are moved from the global freelist to
a thread's freelist, as mentioned above). The "formula" used can
probably be improved to further reduce the risk of blocks being
"bounced back and forth" between freelists.
</p>
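<p>
Expressed as code, the headroom rule described above amounts to a check
of this shape on each deallocation. This is a sketch of the idea with
made-up function and parameter names, not the library's exact code:
</p>
<pre>
#include &lt;cstddef&gt;

// Decide whether blocks should spill from the current thread's
// freelist back to the global freelist for this bin.
bool should_spill_to_global(std::size_t free_blocks,  // this thread's freelist length
                            std::size_t used_blocks,  // blocks in use by this thread
                            std::size_t headroom)     // _S_freelist_headroom, e.g. 10
{
  // A headroom of 10 means the local freelist may hold at most
  // roughly 10% of the blocks this thread currently has in use.
  return free_blocks > used_blocks / headroom;
}

int main()
{
  // Mirrors the worked example in the text: with 516 blocks in use and
  // the default 10% headroom, a freelist longer than 51 blocks spills.
  return should_spill_to_global(60, 516, 10) ? 0 : 1;
}
</pre>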
<hr />
<p>Return <a href="#top">to the top of the page</a> or
   <a href="http://gcc.gnu.org/libstdc++/">to the libstdc++ homepage</a>.
</p>


<!-- ####################################################### -->

<hr />
<p class="fineprint"><em>
See <a href="../17_intro/license.html">license.html</a> for copying conditions.
Comments and suggestions are welcome, and may be sent to
<a href="mailto:libstdc++@gcc.gnu.org">the libstdc++ mailing list</a>.
</em></p>


</body>
</html>