1<!doctype html public "-//W3C//DTD HTML 4.01 Transitional//EN" 2 "http://www.w3.org/TR/html4/loose.dtd"> 3 4<html> 5 6<head> 7 8<title>Postfix Queue Scheduler</title> 9 10<meta http-equiv="Content-Type" content="text/html; charset=utf-8"> 11 12</head> 13 14<body> 15 16<h1><img src="postfix-logo.jpg" width="203" height="98" ALT="">Postfix 17Queue Scheduler</h1> 18 19<hr> 20 21<h2> Disclaimer </h2> 22 23<p> Many of the <i>transport</i>-specific configuration parameters 24discussed in this document will not show up in "postconf" command 25output before Postfix version 2.9. This limitation applies to many 26parameters whose name is a combination of a master.cf service name 27such as "relay" and a built-in suffix such as 28"_destination_concurrency_limit". </p> 29 30<h2> Overview </h2> 31 32<p> The queue manager is by far the most complex part of the Postfix 33mail system. It schedules delivery of new mail, retries failed 34deliveries at specific times, and removes mail from the queue after 35the last delivery attempt. There are two major classes of mechanisms 36that control the operation of the queue manager. </p> 37 38<p> Topics covered by this document: </p> 39 40<ul> 41 42<li> <a href="#concurrency"> Concurrency scheduling</a>, concerned 43with the number of concurrent deliveries to a specific destination, 44including decisions on when to suspend deliveries after persistent 45failures. 46 47<li> <a href="#jobs"> Preemptive scheduling</a>, concerned with 48the selection of email messages and recipients for a given destination. 49 50<li> <a href="#credits"> Credits</a>, something this document would not be 51complete without. 52 53</ul> 54 55<!-- 56 57<p> Once started, the qmgr(8) process runs until "postfix reload" 58or "postfix stop". As a persistent process, the queue manager has 59to meet strict requirements with respect to code correctness and 60robustness. Unlike non-persistent daemon processes, the queue manager 61cannot benefit from Postfix's process rejuvenation mechanism that 62limit the impact from resource leaks and other coding errors 63(translation: replacing a process after a short time covers up bugs 64before they can become a problem). </p> 65 66--> 67 68<h2> <a name="concurrency"> Concurrency scheduling </a> </h2> 69 70<p> The following sections document the Postfix 2.5 concurrency 71scheduler, after a discussion of the limitations of the earlier 72concurrency scheduler. This is followed by results of medium-concurrency 73experiments, and a discussion of trade-offs between performance and 74robustness. 
</p> 75 76<p> The material is organized as follows: </p> 77 78<ul> 79 80<li> <a href="#concurrency_drawbacks"> Drawbacks of the existing 81concurrency scheduler </a> 82 83<li> <a href="#concurrency_summary_2_5"> Summary of the Postfix 2.5 84concurrency feedback algorithm </a> 85 86<li> <a href="#dead_summary_2_5"> Summary of the Postfix 2.5 "dead 87destination" detection algorithm </a> 88 89<li> <a href="#pseudo_code_2_5"> Pseudocode for the Postfix 2.5 90concurrency scheduler </a> 91 92<li> <a href="#concurrency_results"> Results for delivery to 93concurrency limited servers </a> 94 95<li> <a href="#concurrency_discussion"> Discussion of concurrency 96limited server results </a> 97 98<li> <a href="#concurrency_limitations"> Limitations of less-than-1 99per delivery feedback </a> 100 101<li> <a href="#concurrency_config"> Concurrency configuration 102parameters </a> 103 104</ul> 105 106<h3> <a name="concurrency_drawbacks"> Drawbacks of the existing 107concurrency scheduler </a> </h3> 108 109<p> From the start, Postfix has used a simple but robust algorithm 110where the per-destination delivery concurrency is decremented by 1 111after delivery failed due to connection or handshake failure, and 112incremented by 1 otherwise. Of course the concurrency is never 113allowed to exceed the maximum per-destination concurrency limit. 114And when a destination's concurrency level drops to zero, the 115destination is declared "dead" and delivery is suspended. </p> 116 117<p> Drawbacks of +/-1 concurrency feedback per delivery are: <p> 118 119<ul> 120 121<li> <p> Overshoot due to exponential delivery concurrency growth 122with each pseudo-cohort(*). This can be an issue with high-concurrency 123channels. For example, with the default initial concurrency of 5, 124concurrency would proceed over time as (5-10-20). </p> 125 126<li> <p> Throttling down to zero concurrency after a single 127pseudo-cohort(*) failure. This was especially an issue with 128low-concurrency channels where a single failure could be sufficient 129to mark a destination as "dead", causing the suspension of further 130deliveries to the affected destination. </p> 131 132</ul> 133 134<p> (*) A pseudo-cohort is a number of delivery requests equal to 135a destination's delivery concurrency. </p> 136 137<p> The revised concurrency scheduler has a highly modular structure. 138It uses separate mechanisms for per-destination concurrency control 139and for "dead destination" detection. The concurrency control in 140turn is built from two separate mechanisms: it supports less-than-1 141feedback per delivery to allow for more gradual concurrency 142adjustments, and it uses feedback hysteresis to suppress concurrency 143oscillations. And instead of waiting for delivery concurrency to 144throttle down to zero, a destination is declared "dead" after a 145configurable number of pseudo-cohorts reports connection or handshake 146failure. </p> 147 148<h3> <a name="concurrency_summary_2_5"> Summary of the Postfix 2.5 149concurrency feedback algorithm </a> </h3> 150 151<p> We want to increment a destination's delivery concurrency when 152some (not necessarily consecutive) number of deliveries complete 153without connection or handshake failure. This is implemented with 154positive feedback g(N) where N is the destination's delivery 155concurrency. With g(N)=1 feedback per delivery, concurrency increases 156by 1 after each positive feedback event; this gives us the old 157scheduler's exponential growth in time. 
With g(N)=1/N feedback per 158delivery, concurrency increases by 1 after an entire pseudo-cohort 159N of positive feedback reports; this gives us linear growth in time. 160Less-than-1 feedback per delivery and integer truncation naturally 161give us hysteresis, so that transitions to larger concurrency happen 162every 1/g(N) positive feedback events. </p> 163 164<p> We want to decrement a destination's delivery concurrency when 165some (not necessarily consecutive) number of deliveries complete 166after connection or handshake failure. This is implemented with 167negative feedback f(N) where N is the destination's delivery 168concurrency. With f(N)=1 feedback per delivery, concurrency decreases 169by 1 after each negative feedback event; this gives us the old 170scheduler's behavior where concurrency is throttled down dramatically 171after a single pseudo-cohort failure. With f(N)=1/N feedback per 172delivery, concurrency backs off more gently. Again, less-than-1 173feedback per delivery and integer truncation naturally give us 174hysteresis, so that transitions to lower concurrency happen every 1751/f(N) negative feedback events. </p> 176 177<p> However, with negative feedback we introduce a subtle twist. 178We "reverse" the negative hysteresis cycle so that the transition 179to lower concurrency happens at the <b>beginning</b> of a sequence 180of 1/f(N) negative feedback events. Otherwise, a correction for 181overload would be made too late. This makes the choice of f(N) 182relatively unimportant, as borne out by measurements later in this 183document. </p> 184 185<p> In summary, the main ingredients for the Postfix 2.5 concurrency 186feedback algorithm are a) the option of less-than-1 positive feedback 187per delivery to avoid overwhelming servers, b) the option of 188less-than-1 negative feedback per delivery to avoid giving up too 189fast, c) feedback hysteresis to avoid rapid oscillation, and d) a 190"reverse" hysteresis cycle for negative feedback, so that it can 191correct for overload quickly. </p> 192 193<h3> <a name="dead_summary_2_5"> Summary of the Postfix 2.5 "dead destination" detection algorithm </a> </h3> 194 195<p> We want to suspend deliveries to a specific destination after 196some number of deliveries suffers connection or handshake failure. 197The old scheduler declares a destination "dead" when negative (-1) 198feedback throttles the delivery concurrency down to zero. With 199less-than-1 feedback per delivery, this throttling down would 200obviously take too long. We therefore have to separate "dead 201destination" detection from concurrency feedback. This is implemented 202by introducing the concept of pseudo-cohort failure. The Postfix 2032.5 concurrency scheduler declares a destination "dead" after a 204configurable number of pseudo-cohorts suffers from connection or 205handshake failures. The old scheduler corresponds to the special 206case where the pseudo-cohort failure limit is equal to 1. </p> 207 208<h3> <a name="pseudo_code_2_5"> Pseudocode for the Postfix 2.5 concurrency scheduler </a> </h3> 209 210<p> The pseudo code shows how the ideas behind new concurrency 211scheduler are implemented as of November 2007. The actual code can 212be found in the module qmgr/qmgr_queue.c. 
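</p>

<p> As a rough illustration of the feedback and "dead destination" rules
summarized above, they can be sketched in Python as follows. This is a
hedged sketch only, with invented names and simplifications; it is not
the actual qmgr code, whose pseudocode follows below. </p>

<pre>
# Illustrative sketch of the Postfix 2.5 feedback idea; not the real implementation.
class Destination:
    def __init__(self, init_concurrency, max_concurrency, cohort_failure_limit):
        self.concurrency = init_concurrency
        self.max_concurrency = max_concurrency
        self.cohort_failure_limit = cohort_failure_limit
        self.success = 0.0        # accumulated positive feedback
        self.failure = 0.0        # accumulated negative feedback
        self.fail_cohorts = 0.0   # failed pseudo-cohorts since the last success
        self.dead = False         # "dead destination": deliveries suspended

    def good_delivery(self, g):
        # g = positive feedback per delivery, e.g. 1.0/N, 1.0/sqrt(N) or 1.0
        self.fail_cohorts = 0.0
        self.success += g
        while self.success >= 1.0:            # hysteresis step of 1
            self.success -= 1.0
            self.failure = 0.0
            self.concurrency = min(self.concurrency + 1, self.max_concurrency)

    def failed_delivery(self, f):
        # f = negative feedback per delivery with connection/handshake failure
        self.fail_cohorts += 1.0 / self.concurrency
        if self.fail_cohorts > self.cohort_failure_limit:
            self.dead = True                  # suspend deliveries to this destination
            return
        # "reverse" hysteresis: step down at the START of a 1/f sequence
        self.failure -= f
        while self.failure < 0.0:
            self.failure += 1.0
            self.success = 0.0
            self.concurrency = max(self.concurrency - 1, 1)
</pre>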
<pre>
Types:
    Each destination has one set of the following variables
    int concurrency
    double success
    double failure
    double fail_cohorts

Feedback functions:
    N is concurrency; x, y are arbitrary numbers in [0..1] inclusive
    positive feedback: g(N) = x/N | x/sqrt(N) | x
    negative feedback: f(N) = y/N | y/sqrt(N) | y

Initialization:
    concurrency = initial_concurrency
    success = 0
    failure = 0
    fail_cohorts = 0

After success:
    fail_cohorts = 0
    Be prepared for feedback > hysteresis, or rounding error
    success += g(concurrency)
    while (success >= 1)                        Hysteresis 1
        concurrency += 1                        Hysteresis 1
        failure = 0
        success -= 1                            Hysteresis 1
    Be prepared for overshoot
    if (concurrency > concurrency limit)
        concurrency = concurrency limit

Safety:
    Don't apply positive feedback unless
        concurrency < busy_refcount + init_dest_concurrency
    otherwise negative feedback effect could be delayed

After failure:
    if (concurrency > 0)
        fail_cohorts += 1.0 / concurrency
        if (fail_cohorts > cohort_failure_limit)
            concurrency = 0
    if (concurrency > 0)
        Be prepared for feedback > hysteresis, rounding errors
        failure -= f(concurrency)
        while (failure < 0)
            concurrency -= 1                    Hysteresis 1
            failure += 1                        Hysteresis 1
            success = 0
        Be prepared for overshoot
        if (concurrency < 1)
            concurrency = 1
</pre>

<h3> <a name="concurrency_results"> Results for delivery to concurrency-limited servers </a> </h3>

<p> Discussions about the concurrency scheduler redesign started
early 2004, when the primary goal was to find alternatives that did
not exhibit exponential growth or rapid concurrency throttling. No
code was implemented until late 2007, when the primary concern had
shifted towards better handling of server concurrency limits. For
this reason we measure how well the new scheduler does this
job. The table below compares mail delivery performance of the old
+/-1 feedback per delivery with several less-than-1 feedback
functions, for different limited-concurrency server scenarios.
Measurements were done with a FreeBSD 6.2 client and with FreeBSD
6.2 and various Linux servers. </p>

<p> Server configuration: </p>

<ul> <li> The mail flow was slowed down with 1 second latency per
recipient ("smtpd_client_restrictions = sleep 1"). The purpose was
to make results less dependent on hardware details, by avoiding
slow-downs by queue file I/O, logging I/O, and network I/O.

<li> Concurrency was limited by the server process limit
("default_process_limit = 5" and "smtpd_client_event_limit_exceptions
= static:all"). Postfix was stopped and started after changing the
process limit, because the same number is also used as the backlog
argument to the listen(2) system call, and "postfix reload" does
not re-issue this call.

<li> Mail was discarded with "local_recipient_maps = static:all" and
"local_transport = discard". The discard action in access maps or
header/body checks could not be used as it fails to update the
in_flow_delay counters.

</ul>

<p> Client configuration: </p>

<ul>

<li> Queue file overhead was minimized by sending one message to a
virtual alias that expanded into 2000 different remote recipients.
All recipients were accounted for according to the maillog file.
309The virtual_alias_expansion_limit setting was increased to avoid 310complaints from the cleanup(8) server. 311 312<li> The number of deliveries was maximized with 313"smtp_destination_recipient_limit = 2". A smaller limit would cause 314Postfix to schedule the concurrency per recipient instead of domain, 315which is not what we want. 316 317<li> Maximum concurrency was limited with 318"smtp_destination_concurrency_limit = 20", and 319initial_destination_concurrency was set to the same value. 320 321<li> The positive and negative concurrency feedback hysteresis was 3221. Concurrency was incremented by 1 at the END of 1/feedback steps 323of positive feedback, and was decremented by 1 at the START of 3241/feedback steps of negative feedback. 325 326<li> The SMTP client used the default 30s SMTP connect timeout and 327300s SMTP greeting timeout. 328 329</ul> 330 331<h4> Impact of the 30s SMTP connect timeout </h4> 332 333<p> The first results are for a FreeBSD 6.2 server, where our 334artificially low listen(2) backlog results in a very short kernel 335queue for established connections. The table shows that all deferred 336deliveries failed due to a 30s connection timeout, and none failed 337due to a server greeting timeout. This measurement simulates what 338happens when the server's connection queue is completely full under 339load, and the TCP engine drops new connections. </p> 340 341<blockquote> 342 343<table> 344 345<tr> <th>client<br> limit</th> <th>server<br> limit</th> <th>feedback<br> 346style</th> <th>connection<br> caching</th> <th>percentage<br> 347deferred</th> <th colspan="2">client concurrency<br> average/stddev</th> 348<th colspan=2>timed-out in<br> connect/greeting </th> </tr> 349 350<tr> <td align="center" colspan="9"> <hr> </td> </tr> 351 352<tr><td align="center">20</td> <td align="center">5</td> <td 353align="center">1/N</td> <td align="center">no</td> <td 354align="center">9.9</td> <td align="center">19.4</td> <td 355align="center">0.49</td> <td align="center">198</td> <td 356align="center">-</td> </tr> 357 358<tr><td align="center">20</td> <td align="center">5</td> <td 359align="center">1/N</td> <td align="center">yes</td> <td 360align="center">10.3</td> <td align="center">19.4</td> <td 361align="center">0.49</td> <td align="center">206</td> <td 362align="center">-</td> </tr> 363 364<tr><td align="center">20</td> <td align="center">5</td> <td 365align="center">1/sqrt(N)</td> <td align="center">no</td> 366<td align="center">10.4</td> <td align="center">19.6</td> <td 367align="center">0.59</td> <td align="center">208</td> <td 368align="center">-</td> </tr> 369 370<tr><td align="center">20</td> <td align="center">5</td> <td 371align="center">1/sqrt(N)</td> <td align="center">yes</td> 372<td align="center">10.6</td> <td align="center">19.6</td> <td 373align="center">0.61</td> <td align="center">212</td> <td 374align="center">-</td> </tr> 375 376<tr><td align="center">20</td> <td align="center">5</td> <td 377align="center">1</td> <td align="center">no</td> <td 378align="center">10.1</td> <td align="center">19.5</td> <td 379align="center">1.29</td> <td align="center">202</td> <td 380align="center">-</td> </tr> 381 382<tr><td align="center">20</td> <td align="center">5</td> <td 383align="center">1</td> <td align="center">yes</td> <td 384align="center">10.8</td> <td align="center">19.3</td> <td 385align="center">1.57</td> <td align="center">216</td> <td 386align="center">-</td> </tr> 387 388<tr> <td align="center" colspan="9"> <hr> </td> </tr> 389 390</table> 391 392<p> A 
busy server with a completely full connection queue. N is 393the client delivery concurrency. Failed deliveries time out after 39430s without completing the TCP handshake. See text for a discussion 395of results. </p> 396 397</blockquote> 398 399<h4> Impact of the 300s SMTP greeting timeout </h4> 400 401<p> The next table shows results for a Fedora Core 8 server (results 402for RedHat 7.3 are identical). In this case, the artificially small 403listen(2) backlog argument does not impact our measurement. The 404table shows that practically all deferred deliveries fail after the 405300s SMTP greeting timeout. As these timeouts were 10x longer than 406with the first measurement, we increased the recipient count (and 407thus the running time) by a factor of 10 to keep the results 408comparable. The deferred mail percentages are a factor 10 lower 409than with the first measurement, because the 1s per-recipient delay 410was 1/300th of the greeting timeout instead of 1/30th of the 411connection timeout. </p> 412 413<blockquote> 414 415<table> 416 417<tr> <th>client<br> limit</th> <th>server<br> limit</th> <th>feedback<br> 418style</th> <th>connection<br> caching</th> <th>percentage<br> 419deferred</th> <th colspan="2">client concurrency<br> average/stddev</th> 420<th colspan=2>timed-out in<br> connect/greeting </th> </tr> 421 422<tr> <td align="center" colspan="9"> <hr> </td> </tr> 423 424<tr> <td align="center">20</td> <td align="center">5</td> <td 425align="center">1/N</td> <td align="center">no</td> <td 426align="center">1.16</td> <td align="center">19.8</td> <td 427align="center">0.37</td> <td align="center">-</td> <td 428align="center">230</td> </tr> 429 430<tr> <td align="center">20</td> <td align="center">5</td> <td 431align="center">1/N</td> <td align="center">yes</td> <td 432align="center">1.36</td> <td align="center">19.8</td> <td 433align="center">0.36</td> <td align="center">-</td> <td 434align="center">272</td> </tr> 435 436<tr> <td align="center">20</td> <td align="center">5</td> <td 437align="center">1/sqrt(N)</td> <td align="center">no</td> 438<td align="center">1.21</td> <td align="center">19.9</td> <td 439align="center">0.23</td> <td align="center">4</td> <td 440align="center">238</td> </tr> 441 442<tr> <td align="center">20</td> <td align="center">5</td> <td 443align="center">1/sqrt(N)</td> <td align="center">yes</td> 444<td align="center">1.36</td> <td align="center">20.0</td> <td 445align="center">0.23</td> <td align="center">-</td> <td 446align="center">272</td> </tr> 447 448<tr> <td align="center">20</td> <td align="center">5</td> <td 449align="center">1</td> <td align="center">no</td> <td 450align="center">1.18</td> <td align="center">20.0</td> <td 451align="center">0.16</td> <td align="center">-</td> <td 452align="center">236</td> </tr> 453 454<tr> <td align="center">20</td> <td align="center">5</td> <td 455align="center">1</td> <td align="center">yes</td> <td 456align="center">1.39</td> <td align="center">20.0</td> <td 457align="center">0.16</td> <td align="center">-</td> <td 458align="center">278</td> </tr> 459 460<tr> <td align="center" colspan="9"> <hr> </td> </tr> 461 462</table> 463 464<p> A busy server with a non-full connection queue. N is the client 465delivery concurrency. Failed deliveries complete at the TCP level, 466but time out after 300s while waiting for the SMTP greeting. See 467text for a discussion of results. 
</p> 468 469</blockquote> 470 471<h4> Impact of active server concurrency limiter </h4> 472 473<p> The final concurrency-limited result shows what happens when 474SMTP connections don't time out, but are rejected immediately with 475the Postfix server's smtpd_client_connection_count_limit feature 476(the server replies with a 421 status and disconnects immediately). 477Similar results can be expected with concurrency limiting features 478built into other MTAs or firewalls. For this measurement we specified 479a server concurrency limit and a client initial destination concurrency 480of 5, and a server process limit of 10; all other conditions were 481the same as with the first measurement. The same result would be 482obtained with a FreeBSD or Linux server, because the "pushing back" 483is done entirely by the receiving side. </p> 484 485<blockquote> 486 487<table> 488 489<tr> <th>client<br> limit</th> <th>server<br> limit</th> <th>feedback<br> 490style</th> <th>connection<br> caching</th> <th>percentage<br> 491deferred</th> <th colspan="2">client concurrency<br> average/stddev</th> 492<th>theoretical<br>defer rate</th> </tr> 493 494<tr> <td align="center" colspan="9"> <hr> </td> </tr> 495 496<tr> <td align="center">20</td> <td align="center">5</td> <td 497align="center">1/N</td> <td align="center">no</td> <td 498align="center">16.5</td> <td align="center">5.17</td> <td 499align="center">0.38</td> <td align="center">1/6</td> </tr> 500 501<tr> <td align="center">20</td> <td align="center">5</td> <td 502align="center">1/N</td> <td align="center">yes</td> <td 503align="center">16.5</td> <td align="center">5.17</td> <td 504align="center">0.38</td> <td align="center">1/6</td> </tr> 505 506<tr> <td align="center">20</td> <td align="center">5</td> <td 507align="center">1/sqrt(N)</td> <td align="center">no</td> 508<td align="center">24.5</td> <td align="center">5.28</td> <td 509align="center">0.45</td> <td align="center">1/4</td> </tr> 510 511<tr> <td align="center">20</td> <td align="center">5</td> <td 512align="center">1/sqrt(N)</td> <td align="center">yes</td> 513<td align="center">24.3</td> <td align="center">5.28</td> <td 514align="center">0.46</td> <td align="center">1/4</td> </tr> 515 516<tr> <td align="center">20</td> <td align="center">5</td> <td 517align="center">1</td> <td align="center">no</td> <td 518align="center">49.7</td> <td align="center">5.63</td> <td 519align="center">0.67</td> <td align="center">1/2</td> </tr> 520 521<tr> <td align="center">20</td> <td align="center">5</td> <td 522align="center">1</td> <td align="center">yes</td> <td 523align="center">49.7</td> <td align="center">5.68</td> <td 524align="center">0.70</td> <td align="center">1/2</td> </tr> 525 526<tr> <td align="center" colspan="9"> <hr> </td> </tr> 527 528</table> 529 530<p> A server with active per-client concurrency limiter that replies 531with 421 and disconnects. N is the client delivery concurrency. 532The theoretical defer rate is 1/(1+roundup(1/feedback)). This is 533always 1/2 with the fixed +/-1 feedback per delivery; with the 534concurrency-dependent feedback variants, the defer rate decreases 535with increasing concurrency. See text for a discussion of results. 536</p> 537 538</blockquote> 539 540<h3> <a name="concurrency_discussion"> Discussion of concurrency-limited server results </a> </h3> 541 542<p> All results in the previous sections are based on the first 543delivery runs only; they do not include any second etc. delivery 544attempts. 
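</p>

<p> As a quick cross-check of the last table's final column, the
theoretical defer rate formula can be evaluated for the three feedback
styles at the server's concurrency limit of 5 (a small illustrative
snippet, not part of Postfix): </p>

<pre>
import math

def defer_rate(feedback):
    # theoretical defer rate = 1 / (1 + roundup(1 / feedback))
    return 1.0 / (1 + math.ceil(1.0 / feedback))

N = 5                                    # server concurrency limit
print(defer_rate(1.0 / N))               # 1/N feedback       -> 1/6, about 16.7%
print(defer_rate(1.0 / math.sqrt(N)))    # 1/sqrt(N) feedback -> 1/4 = 25%
print(defer_rate(1.0))                   # +/-1 feedback      -> 1/2 = 50%
</pre>

<p>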
It's also worth noting that the measurements look at 545steady-state behavior only. They don't show what happens when the 546client starts sending at a much higher or lower concurrency. 547</p> 548 549<p> The first two examples show that the effect of feedback 550is negligible when concurrency is limited due to congestion. This 551is because the initial concurrency is already at the client's 552concurrency maximum, and because there is 10-100 times more positive 553than negative feedback. Under these conditions, it is no surprise 554that the contribution from SMTP connection caching is also negligible. 555</p> 556 557<p> In the last example, the old +/-1 feedback per delivery will 558defer 50% of the mail when confronted with an active (anvil-style) 559server concurrency limit, where the server hangs up immediately 560with a 421 status (a TCP-level RST would have the same result). 561Less aggressive feedback mechanisms fare better than more aggressive 562ones. Concurrency-dependent feedback fares even better at higher 563concurrencies than shown here, but has limitations as discussed in 564the next section. </p> 565 566<h3> <a name="concurrency_limitations"> Limitations of less-than-1 per delivery feedback </a> </h3> 567 568<p> Less-than-1 feedback is of interest primarily when sending large 569amounts of mail to destinations with active concurrency limiters 570(servers that reply with 421, or firewalls that send RST). When 571sending small amounts of mail per destination, less-than-1 per-delivery 572feedback won't have a noticeable effect on the per-destination 573concurrency, because the number of deliveries to the same destination 574is too small. You might just as well use zero per-delivery feedback 575and stay with the initial per-destination concurrency. And when 576mail deliveries fail due to congestion instead of active concurrency 577limiters, the measurements above show that per-delivery feedback 578has no effect. With large amounts of mail you might just as well 579use zero per-delivery feedback and start with the maximal per-destination 580concurrency. </p> 581 582<p> The scheduler with less-than-1 concurrency 583feedback per delivery solves a problem with servers that have active 584concurrency limiters. This works only because feedback is handled 585in a peculiar manner: positive feedback will increment the concurrency 586by 1 at the <b>end</b> of a sequence of events of length 1/feedback, 587while negative feedback will decrement concurrency by 1 at the 588<b>beginning</b> of such a sequence. This is how Postfix adjusts 589quickly for overshoot without causing lots of mail to be deferred. 590Without this difference in feedback treatment, less-than-1 feedback 591per delivery would defer 50% of the mail, and would be no better 592in this respect than the old +/-1 feedback per delivery. </p> 593 594<p> Unfortunately, the same feature that corrects quickly for 595concurrency overshoot also makes the scheduler more sensitive for 596noisy negative feedback. The reason is that one lonely negative 597feedback event has the same effect as a complete sequence of length 5981/feedback: in both cases delivery concurrency is dropped by 1 599immediately. As a worst-case scenario, consider multiple servers 600behind a load balancer on a single IP address, and no backup MX 601address. 
When 1 out of K servers fails to complete the SMTP handshake 602or drops the connection, a scheduler with 1/N (N = concurrency) 603feedback stops increasing its concurrency once it reaches a concurrency 604level of about K, even though the good servers behind the load 605balancer are perfectly capable of handling more traffic. </p> 606 607<p> This noise problem gets worse as the amount of positive feedback 608per delivery gets smaller. A compromise is to use fixed less-than-1 609positive feedback values instead of concurrency-dependent positive 610feedback. For example, to tolerate 1 of 4 bad servers in the above 611load balancer scenario, use positive feedback of 1/4 per "good" 612delivery (no connect or handshake error), and use an equal or smaller 613amount of negative feedback per "bad" delivery. The downside of 614using concurrency-independent feedback is that some of the old +/-1 615feedback problems will return at large concurrencies. Sites that 616must deliver mail at non-trivial per-destination concurrencies will 617require special configuration. </p> 618 619<h3> <a name="concurrency_config"> Concurrency configuration parameters </a> </h3> 620 621<p> The Postfix 2.5 concurrency scheduler is controlled with the 622following configuration parameters, where "<i>transport</i>_foo" 623provides a transport-specific parameter override. All parameter 624default settings are compatible with earlier Postfix versions. </p> 625 626<blockquote> 627 628<table border="0"> 629 630<tr> <th> Parameter name </th> <th> Postfix version </th> <th> 631Description </th> </tr> 632 633<tr> <td colspan="3"> <hr> </td> </tr> 634 635<tr> <td> initial_destination_concurrency<br> 636<i>transport</i>_initial_destination_concurrency </td> <td 637align="center"> all<br> 2.5 </td> <td> Initial per-destination 638delivery concurrency </td> </tr> 639 640<tr> <td> default_destination_concurrency_limit<br> 641<i>transport</i>_destination_concurrency_limit </td> <td align="center"> 642all<br> all </td> <td> Maximum per-destination delivery concurrency 643</td> </tr> 644 645<tr> <td> default_destination_concurrency_positive_feedback<br> 646<i>transport</i>_destination_concurrency_positive_feedback </td> 647<td align="center"> 2.5<br> 2.5 </td> <td> Per-destination positive 648feedback amount, per delivery that does not fail with connection 649or handshake failure </td> </tr> 650 651<tr> <td> default_destination_concurrency_negative_feedback<br> 652<i>transport</i>_destination_concurrency_negative_feedback </td> 653<td align="center"> 2.5<br> 2.5 </td> <td> Per-destination negative 654feedback amount, per delivery that fails with connection or handshake 655failure </td> </tr> 656 657<tr> <td> default_destination_concurrency_failed_cohort_limit<br> 658<i>transport</i>_destination_concurrency_failed_cohort_limit </td> 659<td align="center"> 2.5<br> 2.5 </td> <td> Number of failed 660pseudo-cohorts after which a destination is declared "dead" and 661delivery is suspended </td> </tr> 662 663<tr> <td> destination_concurrency_feedback_debug</td> <td align="center"> 6642.5 </td> <td> Enable verbose logging of concurrency scheduler 665activity </td> </tr> 666 667<tr> <td colspan="3"> <hr> </td> </tr> 668 669</table> 670 671</blockquote> 672 673<h2> <a name="jobs"> Preemptive scheduling </a> </h2> 674 675<p> 676 677The following sections describe the new queue manager and its 678preemptive scheduler algorithm. 
Note that the document was originally
written to describe the changes between the new queue manager (in
this text referred to as <tt>nqmgr</tt>, the name it was known by
before it became the default queue manager) and the old queue manager
(referred to as <tt>oqmgr</tt>). This is why it refers to <tt>oqmgr</tt>
every so often.

</p>

<p>

This document is divided into sections as follows:

</p>

<ul>

<li> <a href="#nqmgr_structures"> The structures used by
nqmgr </a>

<li> <a href="#nqmgr_pickup"> What happens when nqmgr picks
up the message </a> - how it is assigned to transports, jobs, peers,
entries

<li> <a href="#nqmgr_selection"> How the entry selection
works </a>

<li> <a href="#nqmgr_preemption"> How the preemption
works </a> - what messages may be preempted and how, and what messages
are chosen to preempt them

<li> <a href="#nqmgr_concurrency"> How destination concurrency
limits affect the scheduling algorithm </a>

<li> <a href="#nqmgr_memory"> Dealing with memory resource
limits </a>

</ul>

<h3> <a name="nqmgr_structures"> The structures used by
nqmgr </a> </h3>

<p>

Let's start by recapitulating the structures and terms used when
referring to the queue manager and how it operates. Many of these are
partially described elsewhere, but it is nice to have a coherent
overview in one place:

</p>

<ul>

<li> <p> Each message structure represents one mail message which
Postfix is to deliver. The message recipients specify to what
destinations the message is to be delivered and what transports are
going to be used for the delivery. </p>

<li> <p> Each recipient entry groups a batch of recipients of one
message which are all going to be delivered to the same destination
(and over the same transport). </p>

<li> <p> Each transport structure groups everything that is to
be delivered by the delivery agents dedicated to that transport.
Each transport maintains a set of queues (describing the destinations
it shall talk to) and jobs (referencing the messages it shall
deliver). </p>

<li> <p> Each transport queue (not to be confused with the on-disk
"active" queue or "incoming" queue) groups everything that is to be
delivered to a given destination (aka nexthop) by its transport. Each
queue belongs to one transport, so each destination may be referred
to by several queues, one for each transport. Each queue maintains
a list of all recipient entries (batches of message recipients)
which shall be delivered to the given destination (the todo list), and
a list of recipient entries already being delivered by the delivery
agents (the busy list). </p>

<li> <p> Each queue corresponds to multiple peer structures. Each
peer structure is like the queue structure, belonging to one transport
and referencing one destination. The difference is that it lists
only the recipient entries which all originate from the same message,
unlike the queue structure, whose entries may originate from various
messages. For messages with few recipients, there is usually just
one recipient entry for each destination, resulting in one recipient
entry per peer.
But for large mailing list messages the recipients
may need to be split into multiple recipient entries, in which case
the peer structure may list many entries for a single destination.
</p>

<li> <p> Each transport job groups everything it takes to deliver
one message via its transport. Each job represents one message
within the context of the transport. The job belongs to one transport
and message, so each message may have multiple jobs, one for each
transport. The job groups all the peer structures, which describe
the destinations the job's message has to be delivered to. </p>

</ul>

<p>

The first four structures are common to both <tt>nqmgr</tt> and
<tt>oqmgr</tt>; the latter two were introduced by <tt>nqmgr</tt>.

</p>

<p>

These terms are used extensively in the text below; feel free to
return to the descriptions above whenever you feel you have lost
track of what is what.

</p>

<h3> <a name="nqmgr_pickup"> What happens when nqmgr picks
up the message </a> </h3>

<p>

Whenever <tt>nqmgr</tt> moves a queue file into the "active" queue,
the following happens: It reads all necessary information from the
queue file as <tt>oqmgr</tt> does, and also reads as many recipients
as possible - more on that later; for now, let's just pretend it
always reads all recipients.

</p>

<p>

Then it resolves the recipients as <tt>oqmgr</tt> does, which
means obtaining an (address, nexthop, transport) triple for each
recipient. For each triple, it finds the transport; if it does not
exist yet, it instantiates it (unless it's dead). Within the
transport, it finds the destination queue for the given nexthop; if it
does not exist yet, it instantiates it (unless it's dead). The
triple is then bound to the given destination queue. This happens in
qmgr_resolve() and is basically the same as in <tt>oqmgr</tt>.

</p>

<p>

Then for each triple which was bound to some queue (and thus
transport), the program finds the job which represents the message
within that transport's context; if it does not exist yet, it
instantiates it. Within the job, it finds the peer which represents
the bound destination queue within this job's context; if it does
not exist yet, it instantiates it. Finally, it stores the address
from the resolved triple in the recipient entry which is appended
to both the queue entry list and the peer entry list. The addresses
for the same nexthop are batched in the entries up to the
<i>transport</i>_destination_recipient_limit for that transport.
This happens in qmgr_message_assign(), and apart from the fact
that it operates with job and peer structures, it is basically the
same as in <tt>oqmgr</tt>.

</p>

<p>

When the job is instantiated, it is enqueued on the transport's job
list based on the time its message was picked up by <tt>nqmgr</tt>.
For the first batch of recipients this means it is appended to the end
of the job list, but the ordering of the job list by the enqueue
time is important, as we will see shortly.

</p>

<p>

[Now you should have a pretty good idea of what the state of the
<tt>nqmgr</tt> is after a couple of messages have been picked up, and
of what the relation is between all those job, peer, queue and entry
structures.]

</p>
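<p>

To make these relations concrete, here is a rough sketch of the
structures in Python. It is an illustration only, with simplified and
invented field names; the real C structures in the qmgr source differ
in detail.

</p>

<blockquote>
<pre>
from dataclasses import dataclass, field

@dataclass
class Queue:        # per-transport destination: todo/busy recipient entries
    destination: str
    todo: list = field(default_factory=list)
    busy: list = field(default_factory=list)

@dataclass
class Peer:         # one destination as seen from one job (one message)
    queue: Queue
    entries: list = field(default_factory=list)

@dataclass
class Job:          # one message within the context of one transport
    message: "Message"
    peers: list = field(default_factory=list)

@dataclass
class Transport:    # everything handled by one type of delivery agent
    name: str
    queues: dict = field(default_factory=dict)   # nexthop -> Queue
    jobs: list = field(default_factory=list)     # ordered by message pickup time

@dataclass
class Message:      # one queued mail message
    queue_id: str
    jobs: list = field(default_factory=list)     # one Job per transport used

@dataclass
class Entry:        # a batch of recipients of one message for one destination
    recipients: list
    message: Message
    queue: Queue    # appended to the queue's todo list
    peer: Peer      # and to the owning job's peer entry list
</pre>
</blockquote>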
<h3> <a name="nqmgr_selection"> How the entry selection
works </a> </h3>

<p>

Having prepared all the structures mentioned above, the task of
the <tt>nqmgr</tt>'s scheduler is to choose the recipient entries
one at a time and pass them to the delivery agent for the corresponding
transport. Now how does this work?

</p>

<p>

The first approximation of the new scheduling algorithm is like this:

</p>

<blockquote>
<pre>
foreach transport (round-robin-by-transport)
do
    if transport busy continue
    if transport process limit reached continue
    foreach transport's job (in the order of the transport's job list)
    do
        foreach job's peer (round-robin-by-destination)
            if peer->queue->concurrency < peer->queue->window
                return next peer entry.
        done
    done
done
</pre>
</blockquote>

<p>

Now what is the "order of the transport's job list"? As we know
already, the job list is by default kept in the order the message
was picked up by the <tt>nqmgr</tt>. So by default we get the
top-level round-robin over transports, and within each transport we get
FIFO message delivery. The round-robin of the peers by
destination is perhaps of little importance in most real-life cases
(unless the <i>transport</i>_destination_recipient_limit is reached,
in one job there is only one peer structure for each destination),
but theoretically it makes sure that even within single jobs,
destinations are treated fairly.

</p>

<p>

[By now you should have a feeling you really know how the scheduler
works, except for the preemption, under ideal conditions - that is,
no recipient resource limits and no destination concurrency problems.]

</p>

<h3> <a name="nqmgr_preemption"> How the preemption
works </a> </h3>

<p>

As you might perhaps expect by now, the transport's job list does
not remain sorted by the job's message enqueue time all the time.
The coolest thing about <tt>nqmgr</tt> is not the simple FIFO
delivery, but that it is able to slip mail with few recipients
past the mailing-list bulk mail. This is what the job preemption
is about - shuffling the jobs on the transport's job list to get
the best message delivery rates. Now how is it achieved?

</p>

<p>

First I have to tell you that there are in fact two job lists in
each transport. One is the scheduler's job list, which the scheduler
is free to play with, while the other one keeps the jobs always
listed in the order of the enqueue time and is used for the recipient
pool management we will discuss later. For now, we will deal with
the scheduler's job list only.

</p>

<p>

So, we have the job list, which is first ordered by the time the
jobs' messages were enqueued, oldest messages first, the most recently
picked one at the end. For now, let's assume that there are no
destination concurrency problems. Without preemption, we pick some
entry of the first (oldest) job on the queue, assign it to a delivery
agent, pick another one from the same job, assign it again, and so
on, until all the entries are used and the job is delivered. We
would then move on to the next job, and so on.
Now how do we 950manage to sneak in some entries from the recently added jobs when 951the first job on the job list belongs to a message going to the 952mailing-list and has thousands of recipient entries? 953 954</p> 955 956<p> 957 958The <tt>nqmgr</tt>'s answer is that we can artificially "inflate" 959the delivery time of that first job by some constant for free - it 960is basically the same trick you might remember as "accumulation of 961potential" from the amortized complexity lessons. For example, 962instead of delivering the entries of the first job on the job list 963every time a delivery agent becomes available, we can do it only 964every second time. If you view the moments the delivery agent becomes 965available on a timeline as "delivery slots", then instead of using 966every delivery slot for the first job, we can use only every other 967slot, and still the overall delivery efficiency of the first job 968remains the same. So the delivery <tt>11112222</tt> becomes 969<tt>1.1.1.1.2.2.2.2</tt> (1 and 2 are the imaginary job numbers, . 970denotes the free slot). Now what do we do with free slots? 971 972</p> 973 974<p> 975 976As you might have guessed, we will use them for sneaking the mail 977with little recipients in. For example, if we have one four-recipient 978mail followed by four one recipients mail, the delivery sequence 979(that is, the sequence in which the jobs are assigned to the 980delivery slots) might look like this: <tt>12131415</tt>. Hmm, fine 981for sneaking in the single recipient mail, but how do we sneak in 982the mail with more than one recipient? Say if we have one four-recipient 983mail followed by two two-recipient mails? 984 985</p> 986 987<p> 988 989The simple answer would be to use delivery sequence <tt>12121313</tt>. 990But the problem is that this does not scale well. Imagine you have 991mail with a thousand recipients followed by mail with a hundred recipients. 992It is tempting to suggest the delivery sequence like <tt>121212....</tt>, 993but alas! Imagine there arrives another mail with say ten recipients. 994But there are no free slots anymore, so it can't slip by, not even 995if it had only one recipient. It will be stuck until the 996hundred-recipient mail is delivered, which really sucks. 997 998</p> 999 1000<p> 1001 1002So, it becomes obvious that while inflating the message to get 1003free slots is a great idea, one has to be really careful of how the 1004free slots are assigned, otherwise one might corner himself. So, 1005how does <tt>nqmgr</tt> really use the free slots? 1006 1007</p> 1008 1009<p> 1010 1011The key idea is that one does not have to generate the free slots 1012in a uniform way. The delivery sequence <tt>111...1</tt> is no 1013worse than <tt>1.1.1.1</tt>, in fact, it is even better as some 1014entries are in the first case selected earlier than in the second 1015case, and none is selected later! So it is possible to first 1016"accumulate" the free delivery slots and then use them all at once. 1017It is even possible to accumulate some, then use them, then accumulate 1018some more and use them again, as in <tt>11..1.1</tt> . 1019 1020</p> 1021 1022<p> 1023 1024Let's get back to the one hundred recipient example. We now know 1025that we could first accumulate one hundred free slots, and only 1026after then to preempt the first job and sneak the one hundred 1027recipient mail in. 
Applying the algorithm recursively, we see that the
hundred-recipient job can accumulate ten free delivery slots, and
then we could preempt it and sneak in the ten-recipient mail...
Wait wait wait! Could we? Aren't we overinflating the original
one-thousand-recipient mail?

</p>

<p>

Well, although it looks that way at first glance, another trick will
allow us to answer "no, we are not!". If we had said that we will
inflate the delivery time twice at maximum, and then we consider
every other slot as a free slot, then we would overinflate in case
of the recursive preemption. BUT! The trick is that if we use only
every n-th slot as a free slot for n>2, there is always some worst
inflation factor which we can guarantee not to be breached, even
if we apply the algorithm recursively. To be precise, if for every
k>1 normally used slots we accumulate one free delivery slot, then
the inflation factor is not worse than k/(k-1) no matter how many
recursive preemptions happen. And it's not worse than (k+1)/k if
only non-recursive preemption happens. Now, having got through the
theory and the related math, let's see how <tt>nqmgr</tt> implements
this.

</p>

<p>

Each job has a so-called "available delivery slot" counter. Each
transport has a <i>transport</i>_delivery_slot_cost parameter, which
defaults to the default_delivery_slot_cost parameter, which is set to 5
by default. This is the k from the paragraph above. Each time k
entries of the job are selected for delivery, this counter is
incremented by one. Once there are some slots accumulated, a job which
requires no more than that number of slots to be fully delivered
can preempt this job.

</p>

<p>

[Well, the truth is, the counter is incremented every time an entry
is selected, and it is divided by k when it is used.
But for understanding, it's good enough to use
the above approximation of the truth.]

</p>

<p>

OK, so now we know the conditions which must be satisfied so that one
job can preempt another one. But what job gets preempted, how do
we choose what job preempts it if there are several valid candidates,
and when exactly does all this happen?

</p>

<p>

The answer for the first part is simple. The job whose entry was
selected the last time is the so-called current job. Normally, it is
the first job on the scheduler's job list, but destination concurrency
limits may change this as we will see later. It is always only the
current job which may get preempted.

</p>

<p>

Now for the second part. The current job has a certain number of
recipient entries, and as such may accumulate at most a certain number
of available delivery slots. It might have already accumulated some,
and perhaps even already used some when it was preempted before
(remember a job can be preempted several times). In either case,
we know how many are accumulated and how many are left to deliver,
so we know how many it may yet accumulate at maximum. Every other
job which may be delivered using less than that number of slots is a
valid candidate for preemption. How do we choose among them?

</p>

<p>

The answer is - the one with maximum enqueue_time/recipient_entry_count.
1112That is, the older the job is, the more we should try to deliver 1113it in order to get best message delivery rates. These rates are of 1114course subject to how many recipients the message has, therefore 1115the division by the recipient (entry) count. No one shall be surprised 1116that a message with n recipients takes n times longer to deliver than 1117a message with one recipient. 1118 1119</p> 1120 1121<p> 1122 1123Now let's recap the previous two paragraphs. Isn't it too complicated? 1124Why don't the candidates come only among the jobs which can be 1125delivered within the number of slots the current job already 1126accumulated? Why do we need to estimate how much it has yet to 1127accumulate? If you found out the answer, congratulate yourself. If 1128we did it this simple way, we would always choose the candidate 1129with the fewest recipient entries. If there were enough single recipient 1130mails coming in, they would always slip by the bulk mail as soon 1131as possible, and the two or more recipients mail would never get 1132a chance, no matter how long they have been sitting around in the 1133job list. 1134 1135</p> 1136 1137<p> 1138 1139This candidate selection has an interesting implication - that when 1140we choose the best candidate for preemption (this is done in 1141qmgr_choose_candidate()), it may happen that we may not use it for 1142preemption immediately. This leads to an answer to the last part 1143of the original question - when does the preemption happen? 1144 1145</p> 1146 1147<p> 1148 1149The preemption attempt happens every time next transport's recipient 1150entry is to be chosen for delivery. To avoid needless overhead, the 1151preemption is not attempted if the current job could never accumulate 1152more than <i>transport</i>_minimum_delivery_slots (defaults to 1153default_minimum_delivery_slots which defaults to 3). If there are 1154already enough accumulated slots to preempt the current job by the 1155chosen best candidate, it is done immediately. This basically means 1156that the candidate is moved in front of the current job on the 1157scheduler's job list and decreasing the accumulated slot counter 1158by the amount used by the candidate. If there are not enough slots... 1159well, I could say that nothing happens and the another preemption 1160is attempted the next time. But that's not the complete truth. 1161 1162</p> 1163 1164<p> 1165 1166The truth is that it turns out that it is not really necessary to 1167wait until the jobs counter accumulates all the delivery slots in 1168advance. Say we have ten-recipient mail followed by two two-recipient 1169mails. If the preemption happened when enough delivery slots accumulate 1170(assuming slot cost 2), the delivery sequence becomes 1171<tt>11112211113311</tt>. Now what would we get if we would wait 1172only for 50% of the necessary slots to accumulate and we promise 1173we would wait for the remaining 50% later, after we get back 1174to the preempted job? If we use such a slot loan, the delivery sequence 1175becomes <tt>11221111331111</tt>. As we can see, it makes it not 1176considerably worse for the delivery of the ten-recipient mail, but 1177it allows the small messages to be delivered sooner. 
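</p>

<p>

The following rough Python sketch reproduces the two delivery
sequences above from the slot-cost and discount rules. It is an
illustration with invented names and heavy simplifications (one
transport, entries delivered one at a time, a preempting job delivered
in one go); it is not the qmgr implementation.

</p>

<blockquote>
<pre>
def delivery_sequence(jobs, slot_cost=2, discount_pct=0):
    # jobs: list of (job_id, recipient_count) in arrival (enqueue) order
    jobs = [[jid, n] for jid, n in jobs]
    sequence = []
    slots = 0.0                  # delivery slots accumulated by the current job
    while jobs:
        current = jobs[0]
        # can some job further down the list preempt the current job?
        preemptor = None
        for cand in jobs[1:]:
            needed = cand[1] * (100 - discount_pct) / 100.0
            if slots >= needed:
                preemptor = cand
                break
        if preemptor:
            # deliver the preempting job, paying with the accumulated slots;
            # a negative balance is the "slot loan" repaid later
            slots -= preemptor[1]
            sequence.extend([preemptor[0]] * preemptor[1])
            jobs.remove(preemptor)
            continue
        # otherwise deliver one entry of the current job; every slot_cost
        # selected entries earn one free delivery slot
        sequence.append(current[0])
        current[1] -= 1
        slots += 1.0 / slot_cost
        if current[1] == 0:
            jobs.pop(0)
            slots = 0.0
    return "".join(str(j) for j in sequence)

print(delivery_sequence([(1, 10), (2, 2), (3, 2)]))                   # 11112211113311
print(delivery_sequence([(1, 10), (2, 2), (3, 2)], discount_pct=50))  # 11221111331111
</pre>
</blockquote>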
<p>

The concept of these slot loans is where the
<i>transport</i>_delivery_slot_discount and
<i>transport</i>_delivery_slot_loan come from (they default to
default_delivery_slot_discount and default_delivery_slot_loan, whose
values are by default 50 and 3, respectively). The discount (resp.
loan) specifies what percentage (resp. how many slots) one "gets
in advance", when the number of slots required to deliver the best
candidate is compared with the number of slots the current job has
accumulated so far.

</p>

<p>

And that pretty much concludes this chapter.

</p>

<p>

[Now you should have a feeling that you pretty much understand the
scheduler and the preemption, or at least that you will have
after you read the last chapter a couple more times. You shall clearly
see the job list and the preemption happening at its head, in ideal
delivery conditions. The feeling of understanding shall last until
you start wondering what happens if some of the jobs are blocked,
which you might eventually figure out correctly from what has been
said already. But I would be surprised if your mental image of the
scheduler's functionality is not completely shattered once you
start wondering how it works when not all recipients may be read
in-core. More on that later.]

</p>

<h3> <a name="nqmgr_concurrency"> How destination concurrency
limits affect the scheduling algorithm </a> </h3>

<p>

The <tt>nqmgr</tt> uses the same algorithm for destination concurrency
control as <tt>oqmgr</tt>. Now what happens when the destination
limits are reached and no more entries for that destination may be
selected by the scheduler?

</p>

<p>

From the user's point of view it is all simple. If some of the peers
of a job can't be selected, those peers are simply skipped by the
entry selection algorithm (the pseudo-code described before) and
only the selectable ones are used. If none of the peers may be
selected, the job is declared a "blocker job". Blocker jobs are
skipped by the entry selection algorithm and they are also excluded
from the candidates for preemption of the current job. Thus the scheduler
effectively behaves as if the blocker jobs didn't exist on the job
list at all. As soon as at least one of the peers of a blocker job
becomes unblocked (that is, the delivery agent handling the delivery
of the recipient entry for the given destination successfully finishes),
the job's blocker status is removed and the job again participates
in all further scheduler actions normally.

</p>

<p>

So the summary is that users don't really have to be concerned
about the interaction of the destination limits and the scheduling
algorithm. It works well on its own and there are no knobs they
would need to use to control it.

</p>

<p>

From a programmer's point of view, the blocker jobs complicate the
scheduler quite a lot. Without them, the jobs on the job list would
normally be delivered in strict FIFO order. If the current job is
preempted, the job preempting it is completely delivered unless it
is preempted itself. Without blockers, the current job is thus
always either the first job on the job list, or the top of the stack
of jobs preempting the first job on the job list.
</p>

<p>

The visualization of the job list and the preemption stack without
blockers would be like this:

</p>

<blockquote>
<pre>
first job->    1--2--3--5--6--8--...    <- job list
on job list    |
               4                        <- preemption stack
               |
current job->  7
</pre>
</blockquote>

<p>

In the example above we see that job 1 was preempted by job 4 and
then job 4 was preempted by job 7. After job 7 is completed, the
remaining entries of job 4 are selected, and once they are all
selected, job 1 continues.

</p>

<p>

As we see, it's all very clean and straightforward. Now how does
this change because of blockers?

</p>

<p>

The answer is: a lot. Any job may become a blocker job at any time,
and also become a normal job again at any time. This has several
important implications:

</p>

<ol>

<li> <p>

The jobs may be completed in arbitrary order. For example, in the
example above, if the current job 7 becomes blocked, the next job
4 may complete before job 7 becomes unblocked again. Or if both
7 and 4 are blocked, then 1 is completed, then 7 becomes unblocked
and is completed, then 2 is completed and only after that 4 becomes
unblocked and is completed... You get the idea.

</p>

<p>

[Interesting side note: even when jobs are delivered out of order,
from a single destination's point of view the jobs are still delivered
in the expected order (that is, FIFO unless there was some preemption
involved). This is because whenever a destination queue becomes
unblocked (the destination limit allows selection of more recipient
entries for that destination), all jobs which have peers for that
destination are unblocked at once.]

</p>

<li> <p>

The idea of the preemption stack at the head of the job list is
gone. That is, it must be possible to preempt any job on the job
list. For example, if jobs 7, 4, 1 and 2 in the example above
all become blocked, job 3 becomes the current job. And of course
we do not want the preemption to be affected by whether there
are blocked jobs or not. Therefore, if it turns out that job
3 might be preempted by job 6, the implementation shall make it
possible.

</p>

<li> <p>

The idea of the linear preemption stack itself is gone. It's no
longer true that one job is always preempted by only one job at one
time (that is, directly preempted, not counting the recursively
nested jobs). For example, in the example above, job 1 is directly
preempted by only job 4, and job 4 by job 7. Now assume job 7 becomes
blocked, and job 4 is being delivered. If it accumulates enough
delivery slots, it is natural that it might be preempted for example
by job 8. Now job 4 is preempted by both job 7 AND job 8 at the
same time.

</p>

</ol>

<p>

Now combine points 2) and 3) with point 1) again and you realize
that the relations on the once linear job list have become pretty
complicated. If we extend the point 3) example: jobs 7 and 8 preempt
job 4, now job 8 becomes blocked too, then job 4 completes. Tricky,
huh?
<p>

If I illustrate the relations after the above mentioned examples
(excluding those in point 1), the situation would look like this:

</p>

<blockquote>
<pre>
                          v- parent

adoptive parent ->  1--2--3--5--...      <- "stack" level 0
                    |     |
parent gone ->      ?     6              <- "stack" level 1
                   / \
children ->       7   8   ^- child       <- "stack" level 2

                    ^- siblings
</pre>
</blockquote>

<p>

Now how does <tt>nqmgr</tt> deal with all these complicated relations?

</p>

<p>

Well, it maintains them all as described, but fortunately all these
relations are necessary only for the purpose of proper counting of
available delivery slots. For the purpose of ordering the jobs for
entry selection, the original rule still applies: "the job preempting
the current job is moved in front of the current job on the job
list". So for entry selection purposes, the job relations remain
as simple as this:

</p>

<blockquote>
<pre>
7--8--1--2--6--3--5--..  <- scheduler's job list order
</pre>
</blockquote>

<p>

The job list order and the preemption parent/child/sibling relations
are maintained separately. And because the selection works only
with the job list, you can happily forget about those complicated
relations unless you want to study the <tt>nqmgr</tt> sources. In
that case the text above might provide some helpful introduction
to the problem domain. Otherwise I suggest you just forget about
all this and stick with the user's point of view: the blocker jobs
are simply ignored.

</p>

<p>

[By now, you should have a feeling that there are more things going
on under the hood than you ever wanted to know. You decide that
forgetting about this chapter is the best you can do for the sake
of your mind's health and you basically stick with the idea of how the
scheduler works in ideal conditions, when there are no blockers,
which is good enough.]

</p>

<h3> <a name="<tt>nqmgr</tt>_memory"> Dealing with memory resource
limits </a> </h3>

<p>

When discussing the <tt>nqmgr</tt> scheduler, we have so far assumed
that all recipients of all messages in the "active" queue are completely
read into memory. This is simply not true. There is an upper
bound on the amount of memory the <tt>nqmgr</tt> may use, and
therefore it must impose some limits on the information it may store
in memory at any given time.

</p>

<p>

First of all, not all messages may be read in-core at once. At any
time, at most qmgr_message_active_limit messages may be read in-core.
When read into memory, the messages are picked from the
"incoming" and "deferred" queues and moved to the "active" queue
(incoming having priority), so if more than qmgr_message_active_limit
messages are waiting to enter the "active" queue, the rest will have
to wait until (some of) the messages in the "active" queue
are completely delivered (or deferred).

</p>

<p>

Even with the limited number of in-core messages, there is another
limit which must be imposed in order to avoid memory exhaustion.
Each message may contain a huge number of recipients (tens or hundreds
of thousands are not uncommon), so if <tt>nqmgr</tt> were to read all
recipients of all messages in the "active" queue, it could easily run
out of memory. Therefore there must be some upper bound on the
number of message recipients which are read into memory at the
same time.

</p>
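<p>

The first of these limits - the message limit - can be pictured with
a small sketch (Python, purely for illustration - this is not the
qmgr(8) source, and the names and values are invented; the recipient
limit is dealt with in the rest of this chapter):

</p>

<blockquote>
<pre>
# Conceptual sketch only -- not the actual qmgr(8) code.
QMGR_MESSAGE_ACTIVE_LIMIT = 1000   # hypothetical value, for illustration

active_queue = []

def admit_messages(incoming, deferred):
    # Messages move from the "incoming" and "deferred" queues into the
    # "active" queue only while the message limit is not exceeded;
    # "incoming" has priority over "deferred".  The rest wait until
    # messages in the "active" queue are delivered or deferred.
    for source in (incoming, deferred):
        while source and len(active_queue) < QMGR_MESSAGE_ACTIVE_LIMIT:
            active_queue.append(source.pop(0))
</pre>
</blockquote>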
<p>

Before discussing how exactly <tt>nqmgr</tt> implements the recipient
limits, let's see how the mere existence of the limits themselves
affects the <tt>nqmgr</tt> and its scheduler.

</p>

<p>

The message limit is straightforward - it just limits the size of the
lookahead the <tt>nqmgr</tt>'s scheduler has when choosing which
message can preempt the current one. Messages not in the "active" queue
are simply not considered at all.

</p>

<p>

The recipient limit complicates things more. First of all, the
message reading code must support reading the recipients in batches,
which among other things means accessing the queue file several
times and continuing where the last recipient batch ended. This is
invoked by the scheduler whenever the current job has space for more
recipients, subject to the transport's refill_limit and refill_delay
parameters. It is also done whenever all
in-core recipients of the message have been dealt with (which may also
mean they were deferred) but there are still more in the queue file.

</p>

<p>

The second complication is that with some recipients left unread
in the queue file, the scheduler can't operate with exact counts
of recipient entries. With unread recipients, it is not clear how
many recipient entries there will be, as they are subject to
per-destination grouping. It is not even clear to what transports
(and thus jobs) the recipients will be assigned. And with messages
coming from the "deferred" queue, it is not even clear how many unread
recipients are still to be delivered. This all means that the
scheduler must use only estimates of how many recipient entries
there will be. Fortunately, it is possible to estimate the minimum
and maximum correctly, so the scheduler can always err on the safe
side. Obviously, the better the estimates, the better the results, so
it is best when we are able to read all recipients in-core and turn
the estimates into exact counts, or at least try to read as many
as possible to make the estimates as accurate as possible.

</p>

<p>

The third complication is that it is no longer true that the scheduler
is done with a job once all of its in-core recipients are delivered.
It is possible that the job will be revived later, when another
batch of recipients is read in core. It is also possible that some
jobs will be created for the first time long after the first batch
of recipients was read in core. The <tt>nqmgr</tt> code must be
ready to handle all such situations.

</p>

<p>

And finally, the fourth complication is that the <tt>nqmgr</tt>
code must somehow impose the recipient limit itself. Now how does
it achieve that?

</p>

<p>

Perhaps the easiest solution would be to say that each message may
have at most X recipients stored in-core, but such a solution would
be poor for several reasons.
With reasonable qmgr_message_active_limit
values, the X would have to be quite low to maintain a reasonable
memory footprint. And with a low X lots of things would not work well.
The <tt>nqmgr</tt> would have trouble using the
<i>transport</i>_destination_recipient_limit efficiently. The
scheduler's preemption would be suboptimal as the recipient count
estimates would be inaccurate. The message queue file would have
to be accessed many times to read in more recipients again and
again.

</p>

<p>

Therefore it seems reasonable to have a solution which does not use
a limit imposed on a per-message basis, but which maintains a pool
of available recipient slots which can be shared among all messages
in the most efficient manner. And as we do not want separate
transports to compete for resources where that can be avoided, it
seems appropriate to maintain such a recipient pool for each transport
separately. This is the general idea; now how does it work in
practice?

</p>

<p>

First we have to solve a little chicken-and-egg problem. If we want
to use the per-transport recipient pools, we first need to know to
what transport(s) the message is assigned. But we will find that
out only after we first read in the recipients. So it is obvious
that we first have to read in some recipients, use them to find out
to what transports the message is to be assigned, and only after
that can we use the per-transport recipient pools.

</p>

<p>

Now how many recipients shall we read the first time? This is
what the qmgr_message_recipient_minimum and qmgr_message_recipient_limit
values control. The qmgr_message_recipient_minimum value specifies
how many recipients of each message we will read the first time,
no matter what. It is necessary to read at least one recipient
before we can assign the message to a transport and create the first
job. However, reading only qmgr_message_recipient_minimum recipients
even if there are only a few messages with few recipients in-core would
be wasteful. Therefore, if there are fewer than qmgr_message_recipient_limit
recipients in-core so far, the first batch of recipients may be
larger than qmgr_message_recipient_minimum - as large as is required
to reach the qmgr_message_recipient_limit.

</p>

<p>

Once the first batch of recipients has been read in core and the message
jobs have been created, the size of the subsequent recipient batches (if
any - of course it's best when all recipients are read in one batch)
is based solely on the position of the message jobs on their
corresponding transports' job lists. Each transport has a pool of
<i>transport</i>_recipient_limit recipient slots which it can
distribute among its jobs (how this is done is described later).
The subsequent recipient batch may be as large as the sum of all
recipient slots of all jobs of the message permits (plus the
qmgr_message_recipient_minimum amount which always applies).

</p>

<p>

For example, if a message has three jobs, the first with 1 recipient
still in-core and 4 recipient slots, the second with 5 recipients in-core
and 5 recipient slots, and the third with 2 recipients in-core and 0
recipient slots, it has 1+5+2=8 recipients in-core and 4+5+0=9 job
recipient slots in total. This means that we could immediately
read 2+qmgr_message_recipient_minimum more recipients of that message
in core.

</p>
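<p>

The batch sizing just described can be expressed roughly as follows
(a Python sketch for illustration only - this is not the qmgr(8)
source, and the function names are invented):

</p>

<blockquote>
<pre>
# Illustrative sketch only -- not the actual qmgr(8) code.
def first_batch_limit(recipients_in_core_total,
                      qmgr_message_recipient_minimum,
                      qmgr_message_recipient_limit):
    # Always read at least the minimum; while the global in-core count
    # is still below qmgr_message_recipient_limit, the first batch may
    # be as large as is needed to reach that limit.
    top_up = qmgr_message_recipient_limit - recipients_in_core_total
    return max(qmgr_message_recipient_minimum, top_up)

def next_batch_limit(available_job_slots, qmgr_message_recipient_minimum):
    # Later batches are bounded by what the recipient slots of the
    # message's jobs permit, plus the per-message minimum which
    # always applies.
    return available_job_slots + qmgr_message_recipient_minimum

# In the three-job example above, the jobs' slots permit reading 2 more
# recipients, so the next batch may contain
# next_batch_limit(2, qmgr_message_recipient_minimum) recipients.
</pre>
</blockquote>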
<p>

The above example illustrates several things which might be worth
mentioning explicitly: first, note that although the per-transport
slots are assigned to particular jobs, we can't guarantee that, once
the next batch of recipients is read in core, the corresponding
numbers of recipients will be assigned to those jobs. The jobs lend
their slots to the message as a whole, so it is possible that some
jobs end up sponsoring other jobs of their message. For example,
if in the example above the 2 newly read recipients were assigned
to the second job, the first job would have sponsored the second job
with 2 slots. The second notable thing is the third job, which has more
recipients in-core than it has slots. Apart from the sponsoring by
another job which we just saw, this can also be a result of the first
recipient batch, which is sponsored from the global recipient pool of
qmgr_message_recipient_limit recipients. It can also be sponsored from
the message recipient pool of qmgr_message_recipient_minimum recipients.

</p>

<p>

Now how does each transport distribute the recipient slots among
its jobs? The strategy is quite simple. As most scheduler activity
happens at the head of the job list, it is our intention to make
sure that the scheduler has the best estimates of the recipient
counts for those jobs. As we mentioned above, this means that we
want to try to make sure that the messages of those jobs have all
recipients read in-core. Therefore the transport distributes the
slots "along" the job list from start to end. In this case the job
list sorted by message enqueue time is used, because it doesn't
change over time as the scheduler's job list does.

</p>

<p>

More specifically, each time a job is created and appended to the
job list, it gets all unused recipient slots from its transport's
pool. It keeps them until all recipients of its message are read.
When this happens, all unused recipient slots are transferred to
the next job (which is now in fact the first such job) on the job
list which still has some recipients unread, or eventually back to
the transport pool if there is no such job. Such a transfer then also
happens whenever a recipient entry of that job is delivered.

</p>

<p>

There is also a scenario where a job is not appended to the end of
the job list (for example, it was created as a result of a second or
later recipient batch). Then it works exactly as above, except that
if it was put in front of the first unread job (that is, the job
of a message which still has some unread recipients in the queue file),
that job is first forced to return all of its unused recipient slots
to the transport pool.

</p>

<p>

The algorithm just described leads to the following state: The first
unread job on the job list always gets all the remaining recipient
slots of that transport (if there are any). The jobs queued before
this job are completely read (that is, all recipients of their
message were already read in core) and have at most as many slots
as they still have recipients in-core (the maximum is there because
of the sponsoring mentioned before). The jobs after this job get
nothing from the transport recipient pool, unless they got something
before and the first unread job was later created and enqueued in
front of them - in such a case, they too have at most as many
slots as they have recipients in-core.

</p>
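<p>

The slot hand-off just described might be sketched roughly as follows
(Python, purely for illustration - this is not the qmgr(8) source;
the data structures and names are invented and simplified):

</p>

<blockquote>
<pre>
# Illustrative sketch only -- not the actual qmgr(8) code.
from dataclasses import dataclass, field

@dataclass
class Job:
    slots: int = 0                      # unused recipient slots held by this job
    has_unread_recipients: bool = True

@dataclass
class Transport:
    pool: int                           # starts at transport_recipient_limit
    job_list: list = field(default_factory=list)   # sorted by enqueue time

def first_unread_job(transport):
    # The first job (in enqueue order) whose message still has
    # unread recipients in its queue file.
    for job in transport.job_list:
        if job.has_unread_recipients:
            return job
    return None

def add_job(transport, job, before_first_unread=False):
    if before_first_unread:
        # A job enqueued in front of the first unread job forces that
        # job to return its unused slots to the pool first.
        blocked = first_unread_job(transport)
        if blocked is not None:
            transport.pool += blocked.slots
            blocked.slots = 0
    # The new job takes all unused slots from its transport's pool.
    job.slots += transport.pool
    transport.pool = 0
    transport.job_list.append(job)      # list position kept simple here

def release_slots(transport, count):
    # Called when all recipients of a job's message have been read
    # (for all of its unused slots), and afterwards whenever one of
    # its recipient entries is delivered: the freed slots go to the
    # first unread job, or back to the pool if there is none.
    receiver = first_unread_job(transport)
    if receiver is not None:
        receiver.slots += count
    else:
        transport.pool += count
</pre>
</blockquote>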
<p>

Things work fine in such a state for most of the time, because the
current job is either completely read in-core or has as many recipient
slots as there are available, but there is one situation which we still
have to take care of specially. Imagine that the current job is preempted
by some unread job from the job list and there are no more recipient
slots available, so this new current job could read only batches
of qmgr_message_recipient_minimum recipients at a time. This would
really degrade performance. For this reason, each transport has an
extra pool of <i>transport</i>_extra_recipient_limit recipient
slots, dedicated exactly to this situation. Each time an unread
job preempts the current job, it gets half of the remaining recipient
slots from the normal pool and this extra pool.

</p>

<p>

And that's it. It sure does sound pretty complicated, but fortunately
most people don't really have to care exactly how it works as long
as it works. Perhaps the only important things to know for most
people are the following upper bound formulas:

</p>

<p>

Each transport has at most

</p>

<blockquote>
<pre>
max(
qmgr_message_recipient_minimum * qmgr_message_active_limit
+ *_recipient_limit + *_extra_recipient_limit,
qmgr_message_recipient_limit
)
</pre>
</blockquote>

<p>

recipients in core.

</p>

<p>

The total number of recipients in core is

</p>

<blockquote>
<pre>
max(
qmgr_message_recipient_minimum * qmgr_message_active_limit
+ sum( *_recipient_limit + *_extra_recipient_limit ),
qmgr_message_recipient_limit
)
</pre>
</blockquote>

<p>

where the sum is over all used transports.

</p>
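<p>

For a quick back-of-the-envelope check, these bounds can be computed
as follows (a Python sketch with made-up parameter values, purely for
illustration - see postconf(5) for the actual parameters and their
defaults):

</p>

<blockquote>
<pre>
# Hypothetical values, for illustration only.
qmgr_message_recipient_minimum = 100
qmgr_message_active_limit      = 1000
qmgr_message_recipient_limit   = 20000
recipient_limit       = {"smtp": 10000, "local": 10000}  # *_recipient_limit
extra_recipient_limit = {"smtp": 1000,  "local": 1000}   # *_extra_recipient_limit

# Per-transport upper bound on in-core recipients:
per_transport = {
    t: max(qmgr_message_recipient_minimum * qmgr_message_active_limit
           + recipient_limit[t] + extra_recipient_limit[t],
           qmgr_message_recipient_limit)
    for t in recipient_limit
}

# Total upper bound on in-core recipients, summed over all used transports:
total = max(qmgr_message_recipient_minimum * qmgr_message_active_limit
            + sum(recipient_limit[t] + extra_recipient_limit[t]
                  for t in recipient_limit),
            qmgr_message_recipient_limit)
</pre>
</blockquote>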
<p>

And this terribly complicated chapter concludes the documentation
of the <tt>nqmgr</tt> scheduler.

</p>

<p>

[By now you should theoretically know the <tt>nqmgr</tt> scheduler
inside out. In practice, you still hope that you will never have
to really understand the last chapter or two completely, and
fortunately most people really won't. Understanding how the scheduler
works in ideal conditions is more than good enough for the vast majority
of users.]

</p>

<h2> <a name="credits"> Credits </a> </h2>

<ul>

<li> Wietse Venema designed and implemented the initial queue manager
with per-domain FIFO scheduling, and per-delivery +/-1 concurrency
feedback.

<li> Patrik Rak designed and implemented preemption where mail with
fewer recipients can slip past mail with more recipients in a
controlled manner, and wrote up its documentation.

<li> Wietse Venema initiated a discussion with Patrik Rak and Victor
Duchovni on alternatives for the +/-1 feedback scheduler's aggressive
behavior. This is when K/N feedback was reviewed (N = concurrency).
The discussion ended without a good solution for both negative
feedback and dead site detection.

<li> Victor Duchovni resumed work on concurrency feedback in the
context of concurrency-limited servers.

<li> Wietse Venema then re-designed the concurrency scheduler in
terms of the simplest possible concepts: less-than-1 concurrency
feedback per delivery, forward and reverse concurrency feedback
hysteresis, and pseudo-cohort failure. At the same time, concurrency
feedback was separated from dead site detection.

<li> These simplifications, and their modular implementation, helped
to develop further insights into the different roles that positive
and negative concurrency feedback play, and helped to identify some
worst-case scenarios.

</ul>

</body>

</html>