1<!doctype html public "-//W3C//DTD HTML 4.01 Transitional//EN" 2 "http://www.w3.org/TR/html4/loose.dtd"> 3 4<html> 5 6<head> 7 8<title>Postfix Queue Scheduler</title> 9 10<meta http-equiv="Content-Type" content="text/html; charset=utf-8"> 11<link rel='stylesheet' type='text/css' href='postfix-doc.css'> 12 13</head> 14 15<body> 16 17<h1><img src="postfix-logo.jpg" width="203" height="98" ALT="">Postfix 18Queue Scheduler</h1> 19 20<hr> 21 22<h2> Disclaimer </h2> 23 24<p> Many of the <i>transport</i>-specific configuration parameters 25discussed in this document will not show up in "postconf" command 26output before Postfix version 2.9. This limitation applies to many 27parameters whose name is a combination of a master.cf service name 28such as "relay" and a built-in suffix such as 29"_destination_concurrency_limit". </p> 30 31<h2> Overview </h2> 32 33<p> The queue manager is by far the most complex part of the Postfix 34mail system. It schedules delivery of new mail, retries failed 35deliveries at specific times, and removes mail from the queue after 36the last delivery attempt. There are two major classes of mechanisms 37that control the operation of the queue manager. </p> 38 39<p> Topics covered by this document: </p> 40 41<ul> 42 43<li> <a href="#concurrency"> Concurrency scheduling</a>, concerned 44with the number of concurrent deliveries to a specific destination, 45including decisions on when to suspend deliveries after persistent 46failures. 47 48<li> <a href="#jobs"> Preemptive scheduling</a>, concerned with 49the selection of email messages and recipients for a given destination. 50 51<li> <a href="#credits"> Credits</a>, something this document would not be 52complete without. 53 54</ul> 55 56<!-- 57 58<p> Once started, the qmgr(8) process runs until "postfix reload" 59or "postfix stop". As a persistent process, the queue manager has 60to meet strict requirements with respect to code correctness and 61robustness. Unlike non-persistent daemon processes, the queue manager 62cannot benefit from Postfix's process rejuvenation mechanism that 63limit the impact from resource leaks and other coding errors 64(translation: replacing a process after a short time covers up bugs 65before they can become a problem). </p> 66 67--> 68 69<h2> <a name="concurrency"> Concurrency scheduling </a> </h2> 70 71<p> The following sections document the Postfix 2.5 concurrency 72scheduler, after a discussion of the limitations of the earlier 73concurrency scheduler. This is followed by results of medium-concurrency 74experiments, and a discussion of trade-offs between performance and 75robustness. 
</p> 76 77<p> The material is organized as follows: </p> 78 79<ul> 80 81<li> <a href="#concurrency_drawbacks"> Drawbacks of the existing 82concurrency scheduler </a> 83 84<li> <a href="#concurrency_summary_2_5"> Summary of the Postfix 2.5 85concurrency feedback algorithm </a> 86 87<li> <a href="#dead_summary_2_5"> Summary of the Postfix 2.5 "dead 88destination" detection algorithm </a> 89 90<li> <a href="#pseudo_code_2_5"> Pseudocode for the Postfix 2.5 91concurrency scheduler </a> 92 93<li> <a href="#concurrency_results"> Results for delivery to 94concurrency limited servers </a> 95 96<li> <a href="#concurrency_discussion"> Discussion of concurrency 97limited server results </a> 98 99<li> <a href="#concurrency_limitations"> Limitations of less-than-1 100per delivery feedback </a> 101 102<li> <a href="#concurrency_config"> Concurrency configuration 103parameters </a> 104 105</ul> 106 107<h3> <a name="concurrency_drawbacks"> Drawbacks of the existing 108concurrency scheduler </a> </h3> 109 110<p> From the start, Postfix has used a simple but robust algorithm 111where the per-destination delivery concurrency is decremented by 1 112after delivery failed due to connection or handshake failure, and 113incremented by 1 otherwise. Of course the concurrency is never 114allowed to exceed the maximum per-destination concurrency limit. 115And when a destination's concurrency level drops to zero, the 116destination is declared "dead" and delivery is suspended. </p> 117 118<p> Drawbacks of +/-1 concurrency feedback per delivery are: <p> 119 120<ul> 121 122<li> <p> Overshoot due to exponential delivery concurrency growth 123with each pseudo-cohort(*). This can be an issue with high-concurrency 124channels. For example, with the default initial concurrency of 5, 125concurrency would proceed over time as (5-10-20). </p> 126 127<li> <p> Throttling down to zero concurrency after a single 128pseudo-cohort(*) failure. This was especially an issue with 129low-concurrency channels where a single failure could be sufficient 130to mark a destination as "dead", causing the suspension of further 131deliveries to the affected destination. </p> 132 133</ul> 134 135<p> (*) A pseudo-cohort is a number of delivery requests equal to 136a destination's delivery concurrency. </p> 137 138<p> The revised concurrency scheduler has a highly modular structure. 139It uses separate mechanisms for per-destination concurrency control 140and for "dead destination" detection. The concurrency control in 141turn is built from two separate mechanisms: it supports less-than-1 142feedback per delivery to allow for more gradual concurrency 143adjustments, and it uses feedback hysteresis to suppress concurrency 144oscillations. And instead of waiting for delivery concurrency to 145throttle down to zero, a destination is declared "dead" after a 146configurable number of pseudo-cohorts reports connection or handshake 147failure. </p> 148 149<h3> <a name="concurrency_summary_2_5"> Summary of the Postfix 2.5 150concurrency feedback algorithm </a> </h3> 151 152<p> We want to increment a destination's delivery concurrency when 153some (not necessarily consecutive) number of deliveries complete 154without connection or handshake failure. This is implemented with 155positive feedback g(N) where N is the destination's delivery 156concurrency. With g(N)=1 feedback per delivery, concurrency increases 157by 1 after each positive feedback event; this gives us the old 158scheduler's exponential growth in time. 
With g(N)=1/N feedback per 159delivery, concurrency increases by 1 after an entire pseudo-cohort 160N of positive feedback reports; this gives us linear growth in time. 161Less-than-1 feedback per delivery and integer truncation naturally 162give us hysteresis, so that transitions to larger concurrency happen 163every 1/g(N) positive feedback events. </p> 164 165<p> We want to decrement a destination's delivery concurrency when 166some (not necessarily consecutive) number of deliveries complete 167after connection or handshake failure. This is implemented with 168negative feedback f(N) where N is the destination's delivery 169concurrency. With f(N)=1 feedback per delivery, concurrency decreases 170by 1 after each negative feedback event; this gives us the old 171scheduler's behavior where concurrency is throttled down dramatically 172after a single pseudo-cohort failure. With f(N)=1/N feedback per 173delivery, concurrency backs off more gently. Again, less-than-1 174feedback per delivery and integer truncation naturally give us 175hysteresis, so that transitions to lower concurrency happen every 1761/f(N) negative feedback events. </p> 177 178<p> However, with negative feedback we introduce a subtle twist. 179We "reverse" the negative hysteresis cycle so that the transition 180to lower concurrency happens at the <b>beginning</b> of a sequence 181of 1/f(N) negative feedback events. Otherwise, a correction for 182overload would be made too late. This makes the choice of f(N) 183relatively unimportant, as borne out by measurements later in this 184document. </p> 185 186<p> In summary, the main ingredients for the Postfix 2.5 concurrency 187feedback algorithm are a) the option of less-than-1 positive feedback 188per delivery to avoid overwhelming servers, b) the option of 189less-than-1 negative feedback per delivery to avoid giving up too 190fast, c) feedback hysteresis to avoid rapid oscillation, and d) a 191"reverse" hysteresis cycle for negative feedback, so that it can 192correct for overload quickly. </p> 193 194<h3> <a name="dead_summary_2_5"> Summary of the Postfix 2.5 "dead destination" detection algorithm </a> </h3> 195 196<p> We want to suspend deliveries to a specific destination after 197some number of deliveries suffers connection or handshake failure. 198The old scheduler declares a destination "dead" when negative (-1) 199feedback throttles the delivery concurrency down to zero. With 200less-than-1 feedback per delivery, this throttling down would 201obviously take too long. We therefore have to separate "dead 202destination" detection from concurrency feedback. This is implemented 203by introducing the concept of pseudo-cohort failure. The Postfix 2042.5 concurrency scheduler declares a destination "dead" after a 205configurable number of pseudo-cohorts suffers from connection or 206handshake failures. The old scheduler corresponds to the special 207case where the pseudo-cohort failure limit is equal to 1. </p> 208 209<h3> <a name="pseudo_code_2_5"> Pseudocode for the Postfix 2.5 concurrency scheduler </a> </h3> 210 211<p> The pseudo code shows how the ideas behind new concurrency 212scheduler are implemented as of November 2007. The actual code can 213be found in the module qmgr/qmgr_queue.c. 
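</p>

<p> Before the pseudocode itself, here is a rough, runnable paraphrase of
the per-destination feedback bookkeeping described above. This is an
illustration only, not the qmgr implementation; the class and parameter
names are invented, and the extra safety checks of the real code are
omitted. </p>

<pre>
class Destination:
    """Toy model: positive feedback g(N), negative feedback f(N),
    hysteresis of 1, "reverse" hysteresis for negative feedback, and
    pseudo-cohort failure counting for "dead destination" detection."""

    def __init__(self, init_concurrency, max_concurrency,
                 cohort_failure_limit, pos_feedback, neg_feedback):
        self.concurrency = init_concurrency
        self.max_concurrency = max_concurrency
        self.cohort_failure_limit = cohort_failure_limit
        self.pos_feedback = pos_feedback      # g(N), e.g. lambda n: 1.0 / n
        self.neg_feedback = neg_feedback      # f(N), e.g. lambda n: 1.0 / n
        self.success = 0.0                    # accumulated positive feedback
        self.failure = 0.0                    # accumulated negative feedback
        self.fail_cohorts = 0.0               # failed pseudo-cohorts so far

    def deliver_ok(self):
        """Delivery completed without connection/handshake failure."""
        self.fail_cohorts = 0.0
        self.success += self.pos_feedback(self.concurrency)
        while self.success >= 1.0:            # step up at the END of 1/g(N) events
            self.success -= 1.0
            self.failure = 0.0
            self.concurrency = min(self.concurrency + 1, self.max_concurrency)

    def deliver_failed(self):
        """Delivery failed with a connection or handshake failure."""
        if self.concurrency <= 0:
            return                            # destination already "dead"
        self.fail_cohorts += 1.0 / self.concurrency
        if self.fail_cohorts > self.cohort_failure_limit:
            self.concurrency = 0              # declare the destination "dead"
            return
        self.failure -= self.neg_feedback(self.concurrency)
        while self.failure < 0.0:             # step down at the START of 1/f(N) events
            self.failure += 1.0
            self.success = 0.0
            self.concurrency = max(self.concurrency - 1, 1)
</pre>

<p> The scheduler's own pseudocode, of which the above is only a loose
paraphrase, follows: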
</p> 214 215<pre> 216Types: 217 Each destination has one set of the following variables 218 int concurrency 219 double success 220 double failure 221 double fail_cohorts 222 223Feedback functions: 224 N is concurrency; x, y are arbitrary numbers in [0..1] inclusive 225 positive feedback: g(N) = x/N | x/sqrt(N) | x 226 negative feedback: f(N) = y/N | y/sqrt(N) | y 227 228Initialization: 229 concurrency = initial_concurrency 230 success = 0 231 failure = 0 232 fail_cohorts = 0 233 234After success: 235 fail_cohorts = 0 236 Be prepared for feedback > hysteresis, or rounding error 237 success += g(concurrency) 238 while (success >= 1) Hysteresis 1 239 concurrency += 1 Hysteresis 1 240 failure = 0 241 success -= 1 Hysteresis 1 242 Be prepared for overshoot 243 if (concurrency > concurrency limit) 244 concurrency = concurrency limit 245 246Safety: 247 Don't apply positive feedback unless 248 concurrency < busy_refcount + init_dest_concurrency 249 otherwise negative feedback effect could be delayed 250 251After failure: 252 if (concurrency > 0) 253 fail_cohorts += 1.0 / concurrency 254 if (fail_cohorts > cohort_failure_limit) 255 concurrency = 0 256 if (concurrency > 0) 257 Be prepared for feedback > hysteresis, rounding errors 258 failure -= f(concurrency) 259 while (failure < 0) 260 concurrency -= 1 Hysteresis 1 261 failure += 1 Hysteresis 1 262 success = 0 263 Be prepared for overshoot 264 if (concurrency < 1) 265 concurrency = 1 266</pre> 267 268<h3> <a name="concurrency_results"> Results for delivery to concurrency-limited servers </a> </h3> 269 270<p> Discussions about the concurrency scheduler redesign started 271early 2004, when the primary goal was to find alternatives that did 272not exhibit exponential growth or rapid concurrency throttling. No 273code was implemented until late 2007, when the primary concern had 274shifted towards better handling of server concurrency limits. For 275this reason we measure how well the new scheduler does this 276job. The table below compares mail delivery performance of the old 277+/-1 feedback per delivery with several less-than-1 feedback 278functions, for different limited-concurrency server scenarios. 279Measurements were done with a FreeBSD 6.2 client and with FreeBSD 2806.2 and various Linux servers. </p> 281 282<p> Server configuration: </p> 283 284<ul> <li> The mail flow was slowed down with 1 second latency per 285recipient ("smtpd_client_restrictions = sleep 1"). The purpose was 286to make results less dependent on hardware details, by avoiding 287slow-downs by queue file I/O, logging I/O, and network I/O. 288 289<li> Concurrency was limited by the server process limit 290("default_process_limit = 5" and "smtpd_client_event_limit_exceptions 291= static:all"). Postfix was stopped and started after changing the 292process limit, because the same number is also used as the backlog 293argument to the listen(2) system call, and "postfix reload" does 294not re-issue this call. 295 296<li> Mail was discarded with "local_recipient_maps = static:all" and 297"local_transport = discard". The discard action in access maps or 298header/body checks 299could not be used as it fails to update the in_flow_delay counters. 300 301</ul> 302 303<p> Client configuration: </p> 304 305<ul> 306 307<li> Queue file overhead was minimized by sending one message to a 308virtual alias that expanded into 2000 different remote recipients. 309All recipients were accounted for according to the maillog file. 
310The virtual_alias_expansion_limit setting was increased to avoid 311complaints from the cleanup(8) server. 312 313<li> The number of deliveries was maximized with 314"smtp_destination_recipient_limit = 2". A smaller limit would cause 315Postfix to schedule the concurrency per recipient instead of domain, 316which is not what we want. 317 318<li> Maximum concurrency was limited with 319"smtp_destination_concurrency_limit = 20", and 320initial_destination_concurrency was set to the same value. 321 322<li> The positive and negative concurrency feedback hysteresis was 3231. Concurrency was incremented by 1 at the END of 1/feedback steps 324of positive feedback, and was decremented by 1 at the START of 3251/feedback steps of negative feedback. 326 327<li> The SMTP client used the default 30s SMTP connect timeout and 328300s SMTP greeting timeout. 329 330</ul> 331 332<h4> Impact of the 30s SMTP connect timeout </h4> 333 334<p> The first results are for a FreeBSD 6.2 server, where our 335artificially low listen(2) backlog results in a very short kernel 336queue for established connections. The table shows that all deferred 337deliveries failed due to a 30s connection timeout, and none failed 338due to a server greeting timeout. This measurement simulates what 339happens when the server's connection queue is completely full under 340load, and the TCP engine drops new connections. </p> 341 342<blockquote> 343 344<table> 345 346<tr> <th>client<br> limit</th> <th>server<br> limit</th> <th>feedback<br> 347style</th> <th>connection<br> caching</th> <th>percentage<br> 348deferred</th> <th colspan="2">client concurrency<br> average/stddev</th> 349<th colspan=2>timed-out in<br> connect/greeting </th> </tr> 350 351<tr> <td align="center" colspan="9"> <hr> </td> </tr> 352 353<tr><td align="center">20</td> <td align="center">5</td> <td 354align="center">1/N</td> <td align="center">no</td> <td 355align="center">9.9</td> <td align="center">19.4</td> <td 356align="center">0.49</td> <td align="center">198</td> <td 357align="center">-</td> </tr> 358 359<tr><td align="center">20</td> <td align="center">5</td> <td 360align="center">1/N</td> <td align="center">yes</td> <td 361align="center">10.3</td> <td align="center">19.4</td> <td 362align="center">0.49</td> <td align="center">206</td> <td 363align="center">-</td> </tr> 364 365<tr><td align="center">20</td> <td align="center">5</td> <td 366align="center">1/sqrt(N)</td> <td align="center">no</td> 367<td align="center">10.4</td> <td align="center">19.6</td> <td 368align="center">0.59</td> <td align="center">208</td> <td 369align="center">-</td> </tr> 370 371<tr><td align="center">20</td> <td align="center">5</td> <td 372align="center">1/sqrt(N)</td> <td align="center">yes</td> 373<td align="center">10.6</td> <td align="center">19.6</td> <td 374align="center">0.61</td> <td align="center">212</td> <td 375align="center">-</td> </tr> 376 377<tr><td align="center">20</td> <td align="center">5</td> <td 378align="center">1</td> <td align="center">no</td> <td 379align="center">10.1</td> <td align="center">19.5</td> <td 380align="center">1.29</td> <td align="center">202</td> <td 381align="center">-</td> </tr> 382 383<tr><td align="center">20</td> <td align="center">5</td> <td 384align="center">1</td> <td align="center">yes</td> <td 385align="center">10.8</td> <td align="center">19.3</td> <td 386align="center">1.57</td> <td align="center">216</td> <td 387align="center">-</td> </tr> 388 389<tr> <td align="center" colspan="9"> <hr> </td> </tr> 390 391</table> 392 393<p> A 
busy server with a completely full connection queue. N is 394the client delivery concurrency. Failed deliveries time out after 39530s without completing the TCP handshake. See text for a discussion 396of results. </p> 397 398</blockquote> 399 400<h4> Impact of the 300s SMTP greeting timeout </h4> 401 402<p> The next table shows results for a Fedora Core 8 server (results 403for RedHat 7.3 are identical). In this case, the artificially small 404listen(2) backlog argument does not impact our measurement. The 405table shows that practically all deferred deliveries fail after the 406300s SMTP greeting timeout. As these timeouts were 10x longer than 407with the first measurement, we increased the recipient count (and 408thus the running time) by a factor of 10 to keep the results 409comparable. The deferred mail percentages are a factor 10 lower 410than with the first measurement, because the 1s per-recipient delay 411was 1/300th of the greeting timeout instead of 1/30th of the 412connection timeout. </p> 413 414<blockquote> 415 416<table> 417 418<tr> <th>client<br> limit</th> <th>server<br> limit</th> <th>feedback<br> 419style</th> <th>connection<br> caching</th> <th>percentage<br> 420deferred</th> <th colspan="2">client concurrency<br> average/stddev</th> 421<th colspan=2>timed-out in<br> connect/greeting </th> </tr> 422 423<tr> <td align="center" colspan="9"> <hr> </td> </tr> 424 425<tr> <td align="center">20</td> <td align="center">5</td> <td 426align="center">1/N</td> <td align="center">no</td> <td 427align="center">1.16</td> <td align="center">19.8</td> <td 428align="center">0.37</td> <td align="center">-</td> <td 429align="center">230</td> </tr> 430 431<tr> <td align="center">20</td> <td align="center">5</td> <td 432align="center">1/N</td> <td align="center">yes</td> <td 433align="center">1.36</td> <td align="center">19.8</td> <td 434align="center">0.36</td> <td align="center">-</td> <td 435align="center">272</td> </tr> 436 437<tr> <td align="center">20</td> <td align="center">5</td> <td 438align="center">1/sqrt(N)</td> <td align="center">no</td> 439<td align="center">1.21</td> <td align="center">19.9</td> <td 440align="center">0.23</td> <td align="center">4</td> <td 441align="center">238</td> </tr> 442 443<tr> <td align="center">20</td> <td align="center">5</td> <td 444align="center">1/sqrt(N)</td> <td align="center">yes</td> 445<td align="center">1.36</td> <td align="center">20.0</td> <td 446align="center">0.23</td> <td align="center">-</td> <td 447align="center">272</td> </tr> 448 449<tr> <td align="center">20</td> <td align="center">5</td> <td 450align="center">1</td> <td align="center">no</td> <td 451align="center">1.18</td> <td align="center">20.0</td> <td 452align="center">0.16</td> <td align="center">-</td> <td 453align="center">236</td> </tr> 454 455<tr> <td align="center">20</td> <td align="center">5</td> <td 456align="center">1</td> <td align="center">yes</td> <td 457align="center">1.39</td> <td align="center">20.0</td> <td 458align="center">0.16</td> <td align="center">-</td> <td 459align="center">278</td> </tr> 460 461<tr> <td align="center" colspan="9"> <hr> </td> </tr> 462 463</table> 464 465<p> A busy server with a non-full connection queue. N is the client 466delivery concurrency. Failed deliveries complete at the TCP level, 467but time out after 300s while waiting for the SMTP greeting. See 468text for a discussion of results. 
</p> 469 470</blockquote> 471 472<h4> Impact of active server concurrency limiter </h4> 473 474<p> The final concurrency-limited result shows what happens when 475SMTP connections don't time out, but are rejected immediately with 476the Postfix server's smtpd_client_connection_count_limit feature 477(the server replies with a 421 status and disconnects immediately). 478Similar results can be expected with concurrency limiting features 479built into other MTAs or firewalls. For this measurement we specified 480a server concurrency limit and a client initial destination concurrency 481of 5, and a server process limit of 10; all other conditions were 482the same as with the first measurement. The same result would be 483obtained with a FreeBSD or Linux server, because the "pushing back" 484is done entirely by the receiving side. </p> 485 486<blockquote> 487 488<table> 489 490<tr> <th>client<br> limit</th> <th>server<br> limit</th> <th>feedback<br> 491style</th> <th>connection<br> caching</th> <th>percentage<br> 492deferred</th> <th colspan="2">client concurrency<br> average/stddev</th> 493<th>theoretical<br>defer rate</th> </tr> 494 495<tr> <td align="center" colspan="9"> <hr> </td> </tr> 496 497<tr> <td align="center">20</td> <td align="center">5</td> <td 498align="center">1/N</td> <td align="center">no</td> <td 499align="center">16.5</td> <td align="center">5.17</td> <td 500align="center">0.38</td> <td align="center">1/6</td> </tr> 501 502<tr> <td align="center">20</td> <td align="center">5</td> <td 503align="center">1/N</td> <td align="center">yes</td> <td 504align="center">16.5</td> <td align="center">5.17</td> <td 505align="center">0.38</td> <td align="center">1/6</td> </tr> 506 507<tr> <td align="center">20</td> <td align="center">5</td> <td 508align="center">1/sqrt(N)</td> <td align="center">no</td> 509<td align="center">24.5</td> <td align="center">5.28</td> <td 510align="center">0.45</td> <td align="center">1/4</td> </tr> 511 512<tr> <td align="center">20</td> <td align="center">5</td> <td 513align="center">1/sqrt(N)</td> <td align="center">yes</td> 514<td align="center">24.3</td> <td align="center">5.28</td> <td 515align="center">0.46</td> <td align="center">1/4</td> </tr> 516 517<tr> <td align="center">20</td> <td align="center">5</td> <td 518align="center">1</td> <td align="center">no</td> <td 519align="center">49.7</td> <td align="center">5.63</td> <td 520align="center">0.67</td> <td align="center">1/2</td> </tr> 521 522<tr> <td align="center">20</td> <td align="center">5</td> <td 523align="center">1</td> <td align="center">yes</td> <td 524align="center">49.7</td> <td align="center">5.68</td> <td 525align="center">0.70</td> <td align="center">1/2</td> </tr> 526 527<tr> <td align="center" colspan="9"> <hr> </td> </tr> 528 529</table> 530 531<p> A server with active per-client concurrency limiter that replies 532with 421 and disconnects. N is the client delivery concurrency. 533The theoretical defer rate is 1/(1+roundup(1/feedback)). This is 534always 1/2 with the fixed +/-1 feedback per delivery; with the 535concurrency-dependent feedback variants, the defer rate decreases 536with increasing concurrency. See text for a discussion of results. 537</p> 538 539</blockquote> 540 541<h3> <a name="concurrency_discussion"> Discussion of concurrency-limited server results </a> </h3> 542 543<p> All results in the previous sections are based on the first 544delivery runs only; they do not include any second etc. delivery 545attempts. 
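</p>

<p> The "theoretical defer rate" column in the last table can be reproduced
with a few lines of arithmetic. A minimal sketch (the function name is
ours, not Postfix's): </p>

<pre>
import math

def theoretical_defer_rate(feedback_per_delivery):
    # 1 / (1 + roundup(1/feedback)), as given in the table caption above
    return 1.0 / (1 + math.ceil(1.0 / feedback_per_delivery))

N = 5   # concurrency at which the server starts pushing back
print(theoretical_defer_rate(1.0))                 # +/-1 feedback:    1/2
print(theoretical_defer_rate(1.0 / N))             # 1/N at N=5:       1/6
print(theoretical_defer_rate(1.0 / math.sqrt(N)))  # 1/sqrt(N) at N=5: 1/4
</pre>

<p>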
It's also worth noting that the measurements look at 546steady-state behavior only. They don't show what happens when the 547client starts sending at a much higher or lower concurrency. 548</p> 549 550<p> The first two examples show that the effect of feedback 551is negligible when concurrency is limited due to congestion. This 552is because the initial concurrency is already at the client's 553concurrency maximum, and because there is 10-100 times more positive 554than negative feedback. Under these conditions, it is no surprise 555that the contribution from SMTP connection caching is also negligible. 556</p> 557 558<p> In the last example, the old +/-1 feedback per delivery will 559defer 50% of the mail when confronted with an active (anvil-style) 560server concurrency limit, where the server hangs up immediately 561with a 421 status (a TCP-level RST would have the same result). 562Less aggressive feedback mechanisms fare better than more aggressive 563ones. Concurrency-dependent feedback fares even better at higher 564concurrencies than shown here, but has limitations as discussed in 565the next section. </p> 566 567<h3> <a name="concurrency_limitations"> Limitations of less-than-1 per delivery feedback </a> </h3> 568 569<p> Less-than-1 feedback is of interest primarily when sending large 570amounts of mail to destinations with active concurrency limiters 571(servers that reply with 421, or firewalls that send RST). When 572sending small amounts of mail per destination, less-than-1 per-delivery 573feedback won't have a noticeable effect on the per-destination 574concurrency, because the number of deliveries to the same destination 575is too small. You might just as well use zero per-delivery feedback 576and stay with the initial per-destination concurrency. And when 577mail deliveries fail due to congestion instead of active concurrency 578limiters, the measurements above show that per-delivery feedback 579has no effect. With large amounts of mail you might just as well 580use zero per-delivery feedback and start with the maximal per-destination 581concurrency. </p> 582 583<p> The scheduler with less-than-1 concurrency 584feedback per delivery solves a problem with servers that have active 585concurrency limiters. This works only because feedback is handled 586in a peculiar manner: positive feedback will increment the concurrency 587by 1 at the <b>end</b> of a sequence of events of length 1/feedback, 588while negative feedback will decrement concurrency by 1 at the 589<b>beginning</b> of such a sequence. This is how Postfix adjusts 590quickly for overshoot without causing lots of mail to be deferred. 591Without this difference in feedback treatment, less-than-1 feedback 592per delivery would defer 50% of the mail, and would be no better 593in this respect than the old +/-1 feedback per delivery. </p> 594 595<p> Unfortunately, the same feature that corrects quickly for 596concurrency overshoot also makes the scheduler more sensitive for 597noisy negative feedback. The reason is that one lonely negative 598feedback event has the same effect as a complete sequence of length 5991/feedback: in both cases delivery concurrency is dropped by 1 600immediately. As a worst-case scenario, consider multiple servers 601behind a load balancer on a single IP address, and no backup MX 602address. 
When 1 out of K servers fails to complete the SMTP handshake 603or drops the connection, a scheduler with 1/N (N = concurrency) 604feedback stops increasing its concurrency once it reaches a concurrency 605level of about K, even though the good servers behind the load 606balancer are perfectly capable of handling more traffic. </p> 607 608<p> This noise problem gets worse as the amount of positive feedback 609per delivery gets smaller. A compromise is to use fixed less-than-1 610positive feedback values instead of concurrency-dependent positive 611feedback. For example, to tolerate 1 of 4 bad servers in the above 612load balancer scenario, use positive feedback of 1/4 per "good" 613delivery (no connect or handshake error), and use an equal or smaller 614amount of negative feedback per "bad" delivery. The downside of 615using concurrency-independent feedback is that some of the old +/-1 616feedback problems will return at large concurrencies. Sites that 617must deliver mail at non-trivial per-destination concurrencies will 618require special configuration. </p> 619 620<h3> <a name="concurrency_config"> Concurrency configuration parameters </a> </h3> 621 622<p> The Postfix 2.5 concurrency scheduler is controlled with the 623following configuration parameters, where "<i>transport</i>_foo" 624provides a transport-specific parameter override. All parameter 625default settings are compatible with earlier Postfix versions. </p> 626 627<blockquote> 628 629<table border="0"> 630 631<tr> <th> Parameter name </th> <th> Postfix version </th> <th> 632Description </th> </tr> 633 634<tr> <td colspan="3"> <hr> </td> </tr> 635 636<tr> <td> initial_destination_concurrency<br> 637<i>transport</i>_initial_destination_concurrency </td> <td 638align="center"> all<br> 2.5 </td> <td> Initial per-destination 639delivery concurrency </td> </tr> 640 641<tr> <td> default_destination_concurrency_limit<br> 642<i>transport</i>_destination_concurrency_limit </td> <td align="center"> 643all<br> all </td> <td> Maximum per-destination delivery concurrency 644</td> </tr> 645 646<tr> <td> default_destination_concurrency_positive_feedback<br> 647<i>transport</i>_destination_concurrency_positive_feedback </td> 648<td align="center"> 2.5<br> 2.5 </td> <td> Per-destination positive 649feedback amount, per delivery that does not fail with connection 650or handshake failure </td> </tr> 651 652<tr> <td> default_destination_concurrency_negative_feedback<br> 653<i>transport</i>_destination_concurrency_negative_feedback </td> 654<td align="center"> 2.5<br> 2.5 </td> <td> Per-destination negative 655feedback amount, per delivery that fails with connection or handshake 656failure </td> </tr> 657 658<tr> <td> default_destination_concurrency_failed_cohort_limit<br> 659<i>transport</i>_destination_concurrency_failed_cohort_limit </td> 660<td align="center"> 2.5<br> 2.5 </td> <td> Number of failed 661pseudo-cohorts after which a destination is declared "dead" and 662delivery is suspended </td> </tr> 663 664<tr> <td> destination_concurrency_feedback_debug</td> <td align="center"> 6652.5 </td> <td> Enable verbose logging of concurrency scheduler 666activity </td> </tr> 667 668<tr> <td colspan="3"> <hr> </td> </tr> 669 670</table> 671 672</blockquote> 673 674<h2> <a name="jobs"> Preemptive scheduling </a> </h2> 675 676<p> 677 678The following sections describe the new queue manager and its 679preemptive scheduler algorithm. 
Note that the document was originally written to describe the changes
between the new queue manager (in this text referred to as <tt>nqmgr</tt>,
the name it was known by before it became the default queue manager)
and the old queue manager (referred to as <tt>oqmgr</tt>). This is why
it refers to <tt>oqmgr</tt> every so often.

</p>

<p>

This document is divided into sections as follows:

</p>

<ul>

<li> <a href="#<tt>nqmgr</tt>_structures"> The structures used by
nqmgr </a>

<li> <a href="#<tt>nqmgr</tt>_pickup"> What happens when nqmgr picks
up the message </a> - how it is assigned to transports, jobs, peers,
entries

<li> <a href="#<tt>nqmgr</tt>_selection"> How the entry selection
works </a>

<li> <a href="#<tt>nqmgr</tt>_preemption"> How the preemption
works </a> - what messages may be preempted and how and what messages
are chosen to preempt them

<li> <a href="#<tt>nqmgr</tt>_concurrency"> How destination concurrency
limits affect the scheduling algorithm </a>

<li> <a href="#<tt>nqmgr</tt>_memory"> Dealing with memory resource
limits </a>

</ul>

<h3> <a name="<tt>nqmgr</tt>_structures"> The structures used by
nqmgr </a> </h3>

<p>

Let's start by recapitulating the structures and terms used when
referring to the queue manager and how it operates. Many of these are
partially described elsewhere, but it is nice to have a coherent
overview in one place:

</p>

<ul>

<li> <p> Each message structure represents one mail message which
Postfix is to deliver. The message recipients specify to what
destinations the message is to be delivered and what transports are
going to be used for the delivery. </p>

<li> <p> Each recipient entry groups a batch of recipients of one
message which are all going to be delivered to the same destination
(and over the same transport). </p>

<li> <p> Each transport structure groups everything that is going
to be delivered by delivery agents dedicated to that transport.
Each transport maintains a set of queues (describing the destinations
it shall talk to) and jobs (referencing the messages it shall
deliver). </p>

<li> <p> Each transport queue (not to be confused with the on-disk
"active" queue or "incoming" queue) groups everything that is going
to be delivered to a given destination (aka nexthop) by its transport.
Each queue belongs to one transport, so each destination may be referred
to by several queues, one for each transport. Each queue maintains
a list of all recipient entries (batches of message recipients)
which shall be delivered to the given destination (the todo list), and
a list of recipient entries already being delivered by the delivery
agents (the busy list). </p>

<li> <p> Each queue corresponds to multiple peer structures. Each
peer structure is like the queue structure, belonging to one transport
and referencing one destination. The difference is that it lists
only the recipient entries which all originate from the same message,
unlike the queue structure, whose entries may originate from various
messages. For messages with few recipients, there is usually just
one recipient entry for each destination, resulting in one recipient
entry per peer.
But for large mailing list messages the recipients
may need to be split into multiple recipient entries, in which case
the peer structure may list many entries for a single destination.
</p>

<li> <p> Each transport job groups everything it takes to deliver
one message via its transport. Each job represents one message
within the context of the transport. The job belongs to one transport
and message, so each message may have multiple jobs, one for each
transport. The job groups all the peer structures, which describe
the destinations the job's message has to be delivered to. </p>

</ul>

<p>

The first four structures are common to both <tt>nqmgr</tt> and
<tt>oqmgr</tt>; the latter two were introduced by <tt>nqmgr</tt>.

</p>

<p>

These terms are used extensively in the text below; feel free to
return to the descriptions above anytime you feel you have lost
track of what is what.

</p>

<h3> <a name="<tt>nqmgr</tt>_pickup"> What happens when nqmgr picks
up the message </a> </h3>

<p>

Whenever <tt>nqmgr</tt> moves a queue file into the "active" queue,
the following happens: It reads all necessary information from the
queue file as <tt>oqmgr</tt> does, and also reads as many recipients
as possible - more on that later, for now let's just pretend it
always reads all recipients.

</p>

<p>

Then it resolves the recipients as <tt>oqmgr</tt> does, which
means obtaining an (address, nexthop, transport) triple for each
recipient. For each triple, it finds the transport; if it does not
exist yet, it instantiates it (unless it's dead). Within the
transport, it finds the destination queue for the given nexthop; if it
does not exist yet, it instantiates it (unless it's dead). The
triple is then bound to the given destination queue. This happens in
qmgr_resolve() and is basically the same as in <tt>oqmgr</tt>.

</p>

<p>

Then for each triple which was bound to some queue (and thus
transport), the program finds the job which represents the message
within that transport's context; if it does not exist yet, it
instantiates it. Within the job, it finds the peer which represents
the bound destination queue within this job's context; if it does
not exist yet, it instantiates it. Finally, it stores the address
from the resolved triple into the recipient entry which is appended
to both the queue entry list and the peer entry list. The addresses
for the same nexthop are batched in the entries up to the
<i>transport</i>_destination_recipient_limit for that transport.
This happens in qmgr_message_assign(), and apart from the fact that
it operates with job and peer structures, it is basically the same
as in <tt>oqmgr</tt>.

</p>

<p>

When the job is instantiated, it is enqueued on the transport's job
list based on the time its message was picked up by <tt>nqmgr</tt>.
For the first batch of recipients this means it is appended to the end
of the job list, but the ordering of the job list by the enqueue
time is important as we will see shortly.

</p>

<p>

[Now you should have a pretty good idea what the state of the
<tt>nqmgr</tt> is after a couple of messages have been picked up, and
what the relation is between all those job, peer, queue and entry
structures.]

</p>
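<p> To make these relations more tangible, here is a toy model of the
structures and of the assignment step described above. It is an
illustration only, not the real C structures from the qmgr source; all
class and field names are invented. </p>

<pre>
from collections import defaultdict

class Entry:                       # one batch of recipients for one destination
    def __init__(self):
        self.recipients = []

class Peer:                        # entries of one message for one destination
    def __init__(self):
        self.entries = []

class Queue:                       # per (transport, destination): todo/busy lists
    def __init__(self):
        self.todo = []
        self.busy = []

class Job:                         # per (transport, message)
    def __init__(self, pickup_time):
        self.pickup_time = pickup_time
        self.peers = defaultdict(Peer)          # keyed by destination

class Transport:
    def __init__(self, destination_recipient_limit):
        self.recipient_limit = destination_recipient_limit
        self.queues = defaultdict(Queue)        # keyed by destination
        self.jobs = []                          # kept in message pickup order

def assign(transport, job, destination, address):
    """Roughly what the text above describes for qmgr_message_assign():
    append the resolved address to a recipient entry that is linked from
    both the destination queue and the job's peer for that destination."""
    queue = transport.queues[destination]
    peer = job.peers[destination]
    if not peer.entries or len(peer.entries[-1].recipients) >= transport.recipient_limit:
        entry = Entry()                         # start a new batch
        peer.entries.append(entry)
        queue.todo.append(entry)
    peer.entries[-1].recipients.append(address)

# When a job is created for a (message, transport) pair it is appended to
# the transport's job list, so the list stays ordered by pickup time:
#     job = Job(pickup_time); transport.jobs.append(job)
</pre>

<p> As in the text above, each entry is shared by exactly one queue and
one peer, a job exists per (message, transport) pair, and the job list
keeps message pickup order. </p>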
853 854</p> 855 856<h3> <a name="<tt>nqmgr</tt>_selection"> How the entry selection 857works </a> </h3> 858 859<p> 860 861Having prepared all those above mentioned structures, the task of 862the <tt>nqmgr</tt>'s scheduler is to choose the recipient entries 863one at a time and pass them to the delivery agent for corresponding 864transport. Now how does this work? 865 866</p> 867 868<p> 869 870The first approximation of the new scheduling algorithm is like this: 871 872</p> 873 874<blockquote> 875<pre> 876foreach transport (round-robin-by-transport) 877do 878 if transport busy continue 879 if transport process limit reached continue 880 foreach transport's job (in the order of the transport's job list) 881 do 882 foreach job's peer (round-robin-by-destination) 883 if peer->queue->concurrency < peer->queue->window 884 return next peer entry. 885 done 886 done 887done 888</pre> 889</blockquote> 890 891<p> 892 893Now what is the "order of the transport's job list"? As we know 894already, the job list is by default kept in the order the message 895was picked up by the <tt>nqmgr</tt>. So by default we get the 896top-level round-robin transport, and within each transport we get 897the FIFO message delivery. The round-robin of the peers by the 898destination is perhaps of little importance in most real-life cases 899(unless the <i>transport</i>_destination_recipient_limit is reached, 900in one job there 901is only one peer structure for each destination), but theoretically 902it makes sure that even within single jobs, destinations are treated 903fairly. 904 905</p> 906 907<p> 908 909[By now you should have a feeling you really know how the scheduler 910works, except for the preemption, under ideal conditions - that is, 911no recipient resource limits and no destination concurrency problems.] 912 913</p> 914 915<h3> <a name="<tt>nqmgr</tt>_preemption"> How the preemption 916works </a> </h3> 917 918<p> 919 920As you might perhaps expect by now, the transport's job list does 921not remain sorted by the job's message enqueue time all the time. 922The most cool thing about <tt>nqmgr</tt> is not the simple FIFO 923delivery, but that it is able to slip mail with little recipients 924past the mailing-list bulk mail. This is what the job preemption 925is about - shuffling the jobs on the transport's job list to get 926the best message delivery rates. Now how is it achieved? 927 928</p> 929 930<p> 931 932First I have to tell you that there are in fact two job lists in 933each transport. One is the scheduler's job list, which the scheduler 934is free to play with, while the other one keeps the jobs always 935listed in the order of the enqueue time and is used for recipient 936pool management we will discuss later. For now, we will deal with 937the scheduler's job list only. 938 939</p> 940 941<p> 942 943So, we have the job list, which is first ordered by the time the 944jobs' messages were enqueued, oldest messages first, the most recently 945picked one at the end. For now, let's assume that there are no 946destination concurrency problems. Without preemption, we pick some 947entry of the first (oldest) job on the queue, assign it to delivery 948agent, pick another one from the same job, assign it again, and so 949on, until all the entries are used and the job is delivered. We 950would then move onto the next job and so on and on. 
<p>

Now how do we manage to sneak in some entries from the recently added
jobs when the first job on the job list belongs to a message going to
a mailing list and has thousands of recipient entries?

</p>

<p>

The <tt>nqmgr</tt>'s answer is that we can artificially "inflate"
the delivery time of that first job by some constant for free - it
is basically the same trick you might remember as "accumulation of
potential" from the amortized complexity lessons. For example,
instead of delivering the entries of the first job on the job list
every time a delivery agent becomes available, we can do it only
every second time. If you view the moments the delivery agent becomes
available on a timeline as "delivery slots", then instead of using
every delivery slot for the first job, we can use only every other
slot, and still the overall delivery efficiency of the first job
remains the same. So the delivery <tt>11112222</tt> becomes
<tt>1.1.1.1.2.2.2.2</tt> (1 and 2 are the imaginary job numbers, .
denotes the free slot). Now what do we do with the free slots?

</p>

<p>

As you might have guessed, we will use them for sneaking in the mail
with few recipients. For example, if we have one four-recipient
mail followed by four one-recipient mails, the delivery sequence
(that is, the sequence in which the jobs are assigned to the
delivery slots) might look like this: <tt>12131415</tt>. Hmm, fine
for sneaking in the single-recipient mail, but how do we sneak in
the mail with more than one recipient? Say we have one four-recipient
mail followed by two two-recipient mails?

</p>

<p>

The simple answer would be to use the delivery sequence <tt>12121313</tt>.
But the problem is that this does not scale well. Imagine you have
mail with a thousand recipients followed by mail with a hundred recipients.
It is tempting to suggest a delivery sequence like <tt>121212....</tt>,
but alas! Imagine there arrives another mail with, say, ten recipients.
But there are no free slots anymore, so it can't slip by, not even
if it had only one recipient. It will be stuck until the
hundred-recipient mail is delivered, which really sucks.

</p>

<p>

So, it becomes obvious that while inflating the message to get
free slots is a great idea, one has to be really careful about how the
free slots are assigned, otherwise one can get cornered. So,
how does <tt>nqmgr</tt> really use the free slots?

</p>

<p>

The key idea is that one does not have to generate the free slots
in a uniform way. The delivery sequence <tt>111...1</tt> is no
worse than <tt>1.1.1.1</tt>; in fact, it is even better, as in the
first case some entries are selected earlier than in the second
case, and none is selected later! So it is possible to first
"accumulate" the free delivery slots and then use them all at once.
It is even possible to accumulate some, then use them, then accumulate
some more and use them again, as in <tt>11..1.1</tt>.

</p>

<p>

Let's get back to the hundred-recipient example. We now know
that we could first accumulate one hundred free slots, and only
then preempt the first job and sneak the hundred-recipient mail in.
Applying the algorithm recursively, we see that the
hundred-recipient job can accumulate ten free delivery slots, and
then we could preempt it and sneak in the ten-recipient mail...
Wait wait wait! Could we? Aren't we overinflating the original
thousand-recipient mail?

</p>

<p>

Well, despite the fact that it looks so at first glance, another trick
will allow us to answer "no, we are not!". If we had said that we would
inflate the delivery time twice at maximum, and then we considered
every other slot as a free slot, then we would overinflate in the case
of recursive preemption. BUT! The trick is that if we use only
every n-th slot as a free slot for n>2, there is always some worst
inflation factor which we can guarantee not to be breached, even
if we apply the algorithm recursively. To be precise, if for every
k>1 normally used slots we accumulate one free delivery slot, then
the inflation factor is not worse than k/(k-1) no matter how many
recursive preemptions happen. And it's not worse than (k+1)/k if
only non-recursive preemption happens. Now, having got through the
theory and the related math, let's see how <tt>nqmgr</tt> implements
this.

</p>

<p>

Each job has a so-called "available delivery slot" counter. Each
transport has a <i>transport</i>_delivery_slot_cost parameter, which
defaults to the default_delivery_slot_cost parameter, which is set to 5
by default. This is the k from the paragraph above. Each time k
entries of the job are selected for delivery, this counter is
incremented by one. Once there are some slots accumulated, a job which
requires no more than that number of slots to be fully delivered
can preempt this job.

</p>

<p>

[Well, the truth is, the counter is incremented every time an entry
is selected and it is divided by k when it is used. But to understand
it, it's good enough to use the above approximation of the truth.]

</p>

<p>

OK, so now we know the conditions which must be satisfied so one
job can preempt another one. But what job gets preempted, how do
we choose what job preempts it if there are several valid candidates,
and when does all this exactly happen?

</p>

<p>

The answer to the first part is simple. The job whose entry was
selected the last time is the so-called current job. Normally, it is
the first job on the scheduler's job list, but destination concurrency
limits may change this as we will see later. It is always only the
current job which may get preempted.

</p>

<p>

Now for the second part. The current job has a certain number of
recipient entries, and as such may accumulate at most a certain number
of available delivery slots. It might have already accumulated some,
and perhaps even already used some when it was preempted before
(remember a job can be preempted several times). In either case,
we know how many are accumulated and how many are left to deliver,
so we know how many it may yet accumulate at maximum. Every other
job which may be delivered using fewer than that number of slots is a
valid candidate for preemption. How do we choose among them?

</p>

<p>

The answer is: the one with the maximum
enqueue_time/recipient_entry_count value.
That is, the older the job is, the more we should try to deliver
it in order to get the best message delivery rates. These rates are of
course subject to how many recipients the message has, therefore
the division by the recipient (entry) count. No one should be surprised
that a message with n recipients takes n times longer to deliver than
a message with one recipient.

</p>
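<p> As a rough sketch of this choice (an illustration, not
qmgr_choose_candidate() itself; the names are invented, and
"enqueue_time" is read here as the time the job has been waiting on
the job list), the selection could look like this: </p>

<pre>
from collections import namedtuple

# waiting_time: how long the job has been on the job list (seconds)
# entries_left: recipient entries still to be delivered
CandidateJob = namedtuple("CandidateJob", "name waiting_time entries_left")

def choose_candidate(candidates, max_slots_current_job_can_accumulate):
    """Among the jobs small enough to fit into the slots the current job
    can still accumulate, pick the one with the best waiting_time/entries
    ratio; return None when nothing fits."""
    fitting = [job for job in candidates
               if job.entries_left <= max_slots_current_job_can_accumulate]
    if not fitting:
        return None
    return max(fitting, key=lambda job: job.waiting_time / job.entries_left)

jobs = [CandidateJob("old-bulk", 600, 80),
        CandidateJob("waiting-small", 60, 2),
        CandidateJob("brand-new-tiny", 5, 1)]
print(choose_candidate(jobs, 20).name)   # "waiting-small", not "brand-new-tiny"
</pre>

<p> Note how the ratio favors the job that has been waiting longer even
though it is not the smallest candidate; the next paragraph explains why
always picking the smallest candidate would be a bad idea. </p>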
<p>

Now let's recap the previous two paragraphs. Isn't it too complicated?
Why aren't the candidates chosen only from among the jobs which can be
delivered within the number of slots the current job has already
accumulated? Why do we need to estimate how much it has yet to
accumulate? If you found out the answer, congratulate yourself. If
we did it this simple way, we would always choose the candidate
with the fewest recipient entries. If there were enough single-recipient
mails coming in, they would always slip past the bulk mail as soon
as possible, and mail with two or more recipients would never get
a chance, no matter how long it has been sitting around in the
job list.

</p>

<p>

This candidate selection has an interesting implication - that when
we choose the best candidate for preemption (this is done in
qmgr_choose_candidate()), it may happen that we cannot use it for
preemption immediately. This leads to an answer to the last part
of the original question - when does the preemption happen?

</p>

<p>

A preemption attempt happens every time the transport's next recipient
entry is to be chosen for delivery. To avoid needless overhead, the
preemption is not attempted if the current job could never accumulate
more than <i>transport</i>_minimum_delivery_slots (defaults to
default_minimum_delivery_slots, which defaults to 3). If there are
already enough accumulated slots to preempt the current job by the
chosen best candidate, it is done immediately. This basically means
that the candidate is moved in front of the current job on the
scheduler's job list and the accumulated slot counter is decreased
by the amount used by the candidate. If there are not enough slots...
well, I could say that nothing happens and another preemption
is attempted the next time. But that's not the whole truth.

</p>

<p>

The truth is that it turns out not to be really necessary to
wait until the job's counter accumulates all the delivery slots in
advance. Say we have a ten-recipient mail followed by two two-recipient
mails. If the preemption happened only when enough delivery slots had
accumulated (assuming a slot cost of 2), the delivery sequence would
become <tt>11112211113311</tt>. Now what would we get if we waited
for only 50% of the necessary slots to accumulate, promising to wait
for the remaining 50% later, after we get back to the preempted job?
If we use such a slot loan, the delivery sequence becomes
<tt>11221111331111</tt>. As we can see, this does not make the delivery
of the ten-recipient mail considerably worse, but it allows the small
messages to be delivered sooner.

</p>
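<p> The first of these two sequences can be reproduced with a small
simulation of the accumulate-and-preempt bookkeeping (without the loan).
This is an illustration only, not the nqmgr code; the job representation
is invented. </p>

<pre>
# Jobs in job-list order: one ten-recipient mail, then two two-recipient
# mails; slot cost 2 (each selected entry earns the job half a slot).

def schedule(jobs, slot_cost):
    # jobs: list of [name, entries_left, available_slots]
    out = []
    while jobs:
        current = jobs[0]
        # Can some younger job be fully delivered with the slots the
        # current job has accumulated so far?
        candidate = next((j for j in jobs[1:] if j[1] <= int(current[2])), None)
        if candidate:
            current[2] -= candidate[1]     # pay for the candidate's entries
            jobs.remove(candidate)
            jobs.insert(0, candidate)      # move it in front of the current job
            current = candidate
        out.append(current[0])             # select one entry for delivery
        current[1] -= 1
        current[2] += 1.0 / slot_cost      # accumulate available delivery slots
        if current[1] == 0:
            jobs.remove(current)
    return "".join(out)

print(schedule([["1", 10, 0.0], ["2", 2, 0.0], ["3", 2, 0.0]], slot_cost=2))
# prints 11112211113311, matching the no-loan sequence above
</pre>

<p> Adding the 50% loan described above would move the two small jobs
forward, producing the second sequence. Note also that job 1 finishes in
slot 14 instead of slot 10, an inflation factor of 1.4, well within the
k/(k-1) = 2 bound for k = 2. </p>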
<p>

The concept of these slot loans is where the
<i>transport</i>_delivery_slot_discount and
<i>transport</i>_delivery_slot_loan parameters come from (they default to
default_delivery_slot_discount and default_delivery_slot_loan, whose
values are by default 50 and 3, respectively). The discount (resp.
loan) specifies what percentage (resp. how many slots) one "gets
in advance", when the number of slots required to deliver the best
candidate is compared with the number of slots the current job has
accumulated so far.

</p>

<p>

And that pretty much concludes this chapter.

</p>

<p>

[Now you should have a feeling that you pretty much understand the
scheduler and the preemption, or at least that you will have
after you read the last chapter a couple more times. You should clearly
see the job list and the preemption happening at its head, in ideal
delivery conditions. The feeling of understanding will last until
you start wondering what happens if some of the jobs are blocked,
which you might eventually figure out correctly from what has been
said already. But I would be surprised if your mental image of the
scheduler's functionality is not completely shattered once you
start wondering how it works when not all recipients may be read
in-core. More on that later.]

</p>

<h3> <a name="<tt>nqmgr</tt>_concurrency"> How destination concurrency
limits affect the scheduling algorithm </a> </h3>

<p>

The <tt>nqmgr</tt> uses the same algorithm for destination concurrency
control as <tt>oqmgr</tt>. Now what happens when the destination
limits are reached and no more entries for that destination may be
selected by the scheduler?

</p>

<p>

From the user's point of view it is all simple. If some of the peers
of a job can't be selected, those peers are simply skipped by the
entry selection algorithm (the pseudo-code described before) and
only the selectable ones are used. If none of the peers may be
selected, the job is declared a "blocker job". Blocker jobs are
skipped by the entry selection algorithm and they are also excluded
from the candidates for preemption of the current job. Thus the scheduler
effectively behaves as if the blocker jobs didn't exist on the job
list at all. As soon as at least one of the peers of a blocker job
becomes unblocked (that is, the delivery agent handling the delivery
of the recipient entry for the given destination successfully finishes),
the job's blocker status is removed and the job again participates
in all further scheduler actions normally.

</p>

<p>

So the summary is that the users don't really have to be concerned
about the interaction of the destination limits and the scheduling
algorithm. It works well on its own and there are no knobs they
would need to control it.

</p>

<p>

From a programmer's point of view, the blocker jobs complicate the
scheduler quite a lot. Without them, the jobs on the job list would
normally be delivered in strict FIFO order. If the current job is
preempted, the job preempting it is completely delivered unless it
is preempted itself. Without blockers, the current job is thus
always either the first job on the job list, or the top of the stack
of jobs preempting the first job on the job list.
1266 1267</p> 1268 1269<p> 1270 1271The visualization of the job list and the preemption stack without 1272blockers would be like this: 1273 1274</p> 1275 1276<blockquote> 1277<pre> 1278first job-> 1--2--3--5--6--8--... <- job list 1279on job list | 1280 4 <- preemption stack 1281 | 1282current job-> 7 1283</pre> 1284</blockquote> 1285 1286<p> 1287 1288In the example above we see that job 1 was preempted by job 4 and 1289then job 4 was preempted by job 7. After job 7 is completed, remaining 1290entries of job 4 are selected, and once they are all selected, job 12911 continues. 1292 1293</p> 1294 1295<p> 1296 1297As we see, it's all very clean and straightforward. Now how does 1298this change because of blockers? 1299 1300</p> 1301 1302<p> 1303 1304The answer is: a lot. Any job may become a blocker job at any time, 1305and also become a normal job again at any time. This has several 1306important implications: 1307 1308</p> 1309 1310<ol> 1311 1312<li> <p> 1313 1314The jobs may be completed in arbitrary order. For example, in the 1315example above, if the current job 7 becomes blocked, the next job 13164 may complete before the job 7 becomes unblocked again. Or if both 13177 and 4 are blocked, then 1 is completed, then 7 becomes unblocked 1318and is completed, then 2 is completed and only after that 4 becomes 1319unblocked and is completed... You get the idea. 1320 1321</p> 1322 1323<p> 1324 1325[Interesting side note: even when jobs are delivered out of order, 1326from a single destination's point of view the jobs are still delivered 1327in the expected order (that is, FIFO unless there was some preemption 1328involved). This is because whenever a destination queue becomes 1329unblocked (the destination limit allows selection of more recipient 1330entries for that destination), all jobs which have peers for that 1331destination are unblocked at once.] 1332 1333</p> 1334 1335<li> <p> 1336 1337The idea of the preemption stack at the head of the job list is 1338gone. That is, it must be possible to preempt any job on the job 1339list. For example, if the jobs 7, 4, 1 and 2 in the example above 1340become all blocked, job 3 becomes the current job. And of course 1341we do not want the preemption to be affected by the fact that there 1342are some blocked jobs or not. Therefore, if it turns out that job 13433 might be preempted by job 6, the implementation shall make it 1344possible. 1345 1346</p> 1347 1348<li> <p> 1349 1350The idea of the linear preemption stack itself is gone. It's no 1351longer true that one job is always preempted by only one job at one 1352time (that is directly preempted, not counting the recursively 1353nested jobs). For example, in the example above, job 1 is directly 1354preempted by only job 4, and job 4 by job 7. Now assume job 7 becomes 1355blocked, and job 4 is being delivered. If it accumulates enough 1356delivery slots, it is natural that it might be preempted for example 1357by job 8. Now job 4 is preempted by both job 7 AND job 8 at the 1358same time. 1359 1360</p> 1361 1362</ol> 1363 1364<p> 1365 1366Now combine the points 2) and 3) with point 1) again and you realize 1367that the relations on the once linear job list became pretty 1368complicated. If we extend the point 3) example: jobs 7 and 8 preempt 1369job 4, now job 8 becomes blocked too, then job 4 completes. Tricky, 1370huh? 
<p>

If I illustrate the relations after the above-mentioned examples
(excluding those in point 1), the situation would look like this:

</p>

<blockquote>
<pre>
                          v- parent

adoptive parent ->  1--2--3--5--...         <- "stack" level 0
                    |     |
parent gone ->      ?     6                 <- "stack" level 1
                   / \
children ->       7   8   ^- child          <- "stack" level 2

                    ^- siblings
</pre>
</blockquote>

<p>

Now how does <tt>nqmgr</tt> deal with all these complicated relations?

</p>

<p>

Well, it maintains them all as described, but fortunately, all these
relations are necessary only for the purpose of proper counting of
available delivery slots. For the purpose of ordering the jobs for
entry selection, the original rule still applies: "the job preempting
the current job is moved in front of the current job on the job
list". So for entry selection purposes, the job relations remain
as simple as this:

</p>

<blockquote>
<pre>
7--8--1--2--6--3--5--..    <- scheduler's job list order
</pre>
</blockquote>

<p>

The job list order and the preemption parent/child/sibling relations
are maintained separately. And because the selection works only
with the job list, you can happily forget about those complicated
relations unless you want to study the <tt>nqmgr</tt> sources. In
that case the text above might provide some helpful introduction
to the problem domain. Otherwise I suggest you just forget about
all this and stick with the user's point of view: the blocker jobs
are simply ignored.

</p>

<p>

[By now, you should have a feeling that there are more things going
on under the hood than you ever wanted to know. You decide that
forgetting about this chapter is the best you can do for the sake
of your mind's health and you basically stick with the idea of how
the scheduler works in ideal conditions, when there are no blockers,
which is good enough.]

</p>

<h3> <a name="<tt>nqmgr</tt>_memory"> Dealing with memory resource
limits </a> </h3>

<p>

When discussing the <tt>nqmgr</tt> scheduler, we have so far assumed
that all recipients of all messages in the "active" queue are completely
read into memory. This is simply not true. There is an upper
bound on the amount of memory the <tt>nqmgr</tt> may use, and
therefore it must impose some limits on the information it may store
in memory at any given time.

</p>

<p>

First of all, not all messages may be read in-core at once. At most
qmgr_message_active_limit messages may be read in-core at any given
time. When read into memory, the messages are picked from the
"incoming" and "deferred" queues and moved to the "active" queue
(incoming having priority), so if more messages are queued than
qmgr_message_active_limit allows into the "active" queue, the rest
will have to wait until (some of) the messages in the "active" queue
are completely delivered (or deferred).

</p>
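<p>

As a toy illustration of this first limit (the real queue handling in
qmgr involves much more than this Python sketch shows; the function
below is invented for the example):

</p>

<blockquote>
<pre>
# Toy illustration of qmgr_message_active_limit: messages enter the
# "active" queue from "incoming" and "deferred" (incoming has priority)
# only while the active queue holds fewer messages than the limit.

def fill_active(active, incoming, deferred, active_limit):
    while len(active) < active_limit and (incoming or deferred):
        source = incoming if incoming else deferred
        active.append(source.pop(0))
    return active

# Only two of the three queued messages fit; the third waits until an
# active message is completely delivered (or deferred).
print(fill_active([], ["msg1", "msg2"], ["msg3"], active_limit=2))
</pre>
</blockquote>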
<p>

Even with the limited number of in-core messages, there is another
limit which must be imposed in order to avoid memory exhaustion.
Each message may contain a huge number of recipients (tens or hundreds
of thousands are not uncommon), so if <tt>nqmgr</tt> were to read all
recipients of all messages in the "active" queue, it could easily run
out of memory. Therefore there must be some upper bound on the
number of message recipients which are read into memory at the
same time.

</p>

<p>

Before discussing how exactly <tt>nqmgr</tt> implements the recipient
limits, let's see how the mere existence of the limits themselves
affects the <tt>nqmgr</tt> and its scheduler.

</p>

<p>

The message limit is straightforward - it just limits the size of
the lookahead the <tt>nqmgr</tt>'s scheduler has when choosing which
message can preempt the current one. Messages not in the "active"
queue are simply not considered at all.

</p>

<p>

The recipient limit complicates things more. First of all, the
message reading code must support reading the recipients in batches,
which among other things means accessing the queue file several
times and continuing where the last recipient batch ended. This is
invoked by the scheduler whenever the current job has space for more
recipients, subject to the transport's refill_limit and refill_delay
parameters. It is also done whenever all in-core recipients of the
message have been dealt with (which may also mean they were deferred)
but there are still more in the queue file.

</p>

<p>

The second complication is that with some recipients left unread
in the queue file, the scheduler can't operate with exact counts
of recipient entries. With unread recipients, it is not clear how
many recipient entries there will be, as they are subject to
per-destination grouping. It is not even clear to what transports
(and thus jobs) the recipients will be assigned. And with messages
coming from the "deferred" queue, it is not even clear how many unread
recipients are still to be delivered. This all means that the
scheduler must work with mere estimates of how many recipient entries
there will be. Fortunately, it is possible to estimate the minimum
and maximum correctly, so the scheduler can always err on the safe
side. Obviously, the better the estimates, the better the results, so
it is best when we are able to read all recipients in-core and turn
the estimates into exact counts, or at least try to read as many
as possible to make the estimates as accurate as possible.

</p>

<p>

The third complication is that it is no longer true that the scheduler
is done with a job once all of its in-core recipients are delivered.
It is possible that the job will be revived later, when another
batch of recipients is read in core. It is also possible that some
jobs will be created for the first time long after the first batch
of recipients was read in core. The <tt>nqmgr</tt> code must be
ready to handle all such situations.

</p>
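<p>

As a rough cartoon of the batched reading described in the first and
third complications above - the condition shown is only a guess at
how refill_limit and refill_delay enter the decision, and every name
below is invented for this sketch:

</p>

<blockquote>
<pre>
# Cartoon of when the scheduler would ask for another recipient batch
# of a message; a simplification of the real checks around the
# transport's refill_limit and refill_delay parameters.

def wants_refill(job, now):
    if job["unread_in_queue_file"] == 0:
        return False                # nothing left to read
    if job["in_core_recipients"] == 0:
        return True                 # all in-core entries were dealt with
    free_slots = job["recipient_slots"] - job["in_core_recipients"]
    # Otherwise refill once enough slots are free, or enough time passed.
    return (free_slots >= job["refill_limit"]
            or now - job["last_refill"] >= job["refill_delay"])
</pre>
</blockquote>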
<p>

And finally, the fourth complication is that the <tt>nqmgr</tt>
code must somehow impose the recipient limit itself. Now how does
it achieve this?

</p>

<p>

Perhaps the easiest solution would be to say that each message may
have at most X recipients stored in-core, but such a solution would
be poor for several reasons. With reasonable qmgr_message_active_limit
values, the X would have to be quite low to maintain a reasonable
memory footprint, and with a low X lots of things would not work well.
The <tt>nqmgr</tt> would have trouble using the
<i>transport</i>_destination_recipient_limit efficiently. The
scheduler's preemption would be suboptimal as the recipient count
estimates would be inaccurate. The message queue file would have
to be accessed many times to read in more recipients again and
again.

</p>

<p>

Therefore it seems reasonable to have a solution which does not use
a limit imposed on a per-message basis, but which maintains a pool
of available recipient slots that can be shared among all messages
in the most efficient manner. And as we want to avoid separate
transports competing for resources whenever possible, it seems
appropriate to maintain such a recipient pool for each transport
separately. This is the general idea; now how does it work in
practice?

</p>

<p>

First we have to solve a little chicken-and-egg problem. If we want
to use the per-transport recipient pools, we first need to know to
what transport(s) the message is assigned. But we will find that
out only after we first read in the recipients. So it is obvious
that we first have to read in some recipients, use them to find out
to what transports the message is to be assigned, and only after
that can we use the per-transport recipient pools.

</p>

<p>

Now how many recipients shall we read the first time? This is
what the qmgr_message_recipient_minimum and qmgr_message_recipient_limit
values control. The qmgr_message_recipient_minimum value specifies
how many recipients of each message we will read the first time,
no matter what. It is necessary to read at least one recipient
before we can assign the message to a transport and create the first
job. However, reading only qmgr_message_recipient_minimum recipients
even if there are only a few messages with few recipients in-core
would be wasteful. Therefore, if there are fewer than
qmgr_message_recipient_limit recipients in-core so far, the first
batch of recipients may be larger than qmgr_message_recipient_minimum
- as large as is required to reach the qmgr_message_recipient_limit
limit.

</p>
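<p>

In Python-like terms, the sizing of that very first batch could be
sketched as follows (the function is invented for illustration; the
numbers in the example are arbitrary, not the actual defaults):

</p>

<blockquote>
<pre>
# Rough sketch of how large the very first recipient batch of a message
# may be; the parameter names mirror the main.cf parameters discussed
# above, the function itself is made up for this example.

def first_batch_size(in_core_total, recipient_minimum, recipient_limit):
    # in_core_total: recipients of all messages currently held in-core.
    if in_core_total < recipient_limit:
        # We may read up to the global limit in one go, but never less
        # than the guaranteed minimum.
        return max(recipient_minimum, recipient_limit - in_core_total)
    return recipient_minimum

# With a nearly empty queue, most or all of a message's recipients may
# fit in the first batch; on a busy system only the minimum is certain.
print(first_batch_size(in_core_total=200, recipient_minimum=100,
                       recipient_limit=1000))    # 800
print(first_batch_size(in_core_total=5000, recipient_minimum=100,
                       recipient_limit=1000))    # 100
</pre>
</blockquote>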
<p>

Once the first batch of recipients has been read in core and the
message's jobs have been created, the size of the subsequent recipient
batches (if any - of course it's best when all recipients are read
in one batch) is based solely on the position of the message's jobs
on their corresponding transports' job lists. Each transport has a
pool of <i>transport</i>_recipient_limit recipient slots which it can
distribute among its jobs (how this is done is described later).
The subsequent recipient batch may be as large as the sum of all
recipient slots of all jobs of the message permits (plus the
qmgr_message_recipient_minimum amount which always applies).

</p>

<p>

For example, if a message has three jobs, the first with 1 recipient
still in-core and 4 recipient slots, the second with 5 recipients
in-core and 5 recipient slots, and the third with 2 recipients in-core
and 0 recipient slots, it has 1+5+2=8 recipients in-core and 4+5+0=9
recipient slots in total. This means that we could immediately
read 2+qmgr_message_recipient_minimum more recipients of that message
in core.

</p>

<p>

The above example illustrates several things which might be worth
mentioning explicitly: first, note that although the per-transport
slots are assigned to particular jobs, we can't guarantee that once
the next batch of recipients is read in core, the corresponding
number of recipients will be assigned to those jobs. The jobs lend
their slots to the message as a whole, so it is possible that some
jobs end up sponsoring other jobs of their message. For example,
if in the example above the 2 newly read recipients were assigned
to the second job, the first job would have sponsored the second job
with 2 slots. The second notable thing is the third job, which has
more recipients in-core than it has slots. Apart from the sponsoring
by another job that we just saw, this can also be the result of the
first recipient batch, which is sponsored from the global recipient
pool of qmgr_message_recipient_limit recipients, or from the message
recipient pool of qmgr_message_recipient_minimum recipients.

</p>

<p>

Now how does each transport distribute the recipient slots among
its jobs? The strategy is quite simple. As most scheduler activity
happens at the head of the job list, it is our intention to make
sure that the scheduler has the best estimates of the recipient
counts for those jobs. As we mentioned above, this means that we
want to try to make sure that the messages of those jobs have all
recipients read in-core. Therefore the transport distributes the
slots "along" the job list from start to end. In this case the job
list sorted by message enqueue time is used, because it doesn't
change over time as the scheduler's job list does.

</p>

<p>

More specifically, each time a job is created and appended to the
job list, it gets all unused recipient slots from its transport's
pool. It keeps them until all recipients of its message are read.
When this happens, all unused recipient slots are transferred to
the next job (which is now in fact the first such job) on the job
list which still has some recipients unread, or eventually back to
the transport pool if there is no such job. Such a transfer also
happens whenever a recipient entry of that job is delivered.

</p>

<p>

There is also a scenario where a job is not appended to the end of
the job list (for example, it was created as a result of a second or
later recipient batch). Then it works exactly as above, except that
if it was put in front of the first unread job (that is, the job
of a message which still has some unread recipients in the queue file),
that job is first forced to return all of its unused recipient slots
to the transport pool.

</p>
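<p>

A condensed sketch of this hand-over, with an invented dict-based
model and none of the real-world edge cases:

</p>

<blockquote>
<pre>
# Condensed sketch of how a transport's recipient slots travel along
# the job list ordered by message enqueue time; invented model only.

def new_job(transport):
    # A job created and appended to the list takes all unused slots.
    job = {"unused_slots": transport["unused_slots"], "has_unread": True}
    transport["unused_slots"] = 0
    return job

def pass_on_slots(transport, job, job_list):
    # Called when all recipients of the job's message have been read,
    # and again whenever one of its recipient entries is delivered.
    for nxt in job_list:
        if nxt is not job and nxt["has_unread"]:
            nxt["unused_slots"] += job["unused_slots"]   # first unread job
            break
    else:
        transport["unused_slots"] += job["unused_slots"] # back to the pool
    job["unused_slots"] = 0
</pre>
</blockquote>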
<p>

The algorithm described above leads to the following state: the first
unread job on the job list always gets all the remaining recipient
slots of that transport (if there are any). The jobs queued before
this job are completely read (that is, all recipients of their
message were already read in core) and have at most as many slots
as they still have recipients in-core (the maximum is there because
of the sponsoring mentioned before). The jobs after this job get
nothing from the transport recipient pool, unless they got something
before and the first unread job was only created and enqueued in
front of them later - in such a case, they too get at most as many
slots as they have recipients in-core.

</p>

<p>

Things work fine in such a state most of the time, because the
current job is either completely read in-core or holds all the
recipient slots there are, but there is one situation which we still
have to take special care of. Imagine that the current job is preempted
by some unread job from the job list and there are no more recipient
slots available, so this new current job could read only batches
of qmgr_message_recipient_minimum recipients at a time. This would
really degrade performance. For this reason, each transport has an
extra pool of <i>transport</i>_extra_recipient_limit recipient
slots, dedicated exactly to this situation. Each time an unread
job preempts the current job, it gets half of the remaining recipient
slots from the normal pool and this extra pool.

</p>

<p>

And that's it. It sure does sound pretty complicated, but fortunately
most people don't really have to care exactly how it works as long
as it works. Perhaps the only important things to know for most
people are the following upper bound formulas:

</p>

<p>

Each transport has at most

</p>

<blockquote>
<pre>
max(
qmgr_message_recipient_minimum * qmgr_message_active_limit
+ *_recipient_limit + *_extra_recipient_limit,
qmgr_message_recipient_limit
)
</pre>
</blockquote>

<p>

recipients in core.

</p>

<p>

The total number of recipients in core is

</p>

<blockquote>
<pre>
max(
qmgr_message_recipient_minimum * qmgr_message_active_limit
+ sum( *_recipient_limit + *_extra_recipient_limit ),
qmgr_message_recipient_limit
)
</pre>
</blockquote>

<p>

where the sum is taken over all transports in use.

</p>

<p>

And this terribly complicated chapter concludes the documentation
of the <tt>nqmgr</tt> scheduler.

</p>

<p>

[By now you should theoretically know the <tt>nqmgr</tt> scheduler
inside out. In practice, you still hope that you will never have
to really understand the last chapter or two completely, and
fortunately most people really won't. Understanding how the scheduler
works in ideal conditions is more than good enough for the vast
majority of users.]

</p>

<h2> <a name="credits"> Credits </a> </h2>

<ul>

<li> Wietse Venema designed and implemented the initial queue manager
with per-domain FIFO scheduling, and per-delivery +/-1 concurrency
feedback.

<li> Patrik Rak designed and implemented preemption where mail with
fewer recipients can slip past mail with more recipients in a
controlled manner, and wrote up its documentation.
<li> Wietse Venema initiated a discussion with Patrik Rak and Victor
Duchovni on alternatives for the +/-1 feedback scheduler's aggressive
behavior. This is when K/N feedback was reviewed (N = concurrency).
The discussion ended without a good solution for both negative
feedback and dead site detection.

<li> Victor Duchovni resumed work on concurrency feedback in the
context of concurrency-limited servers.

<li> Wietse Venema then re-designed the concurrency scheduler in
terms of the simplest possible concepts: less-than-1 concurrency
feedback per delivery, forward and reverse concurrency feedback
hysteresis, and pseudo-cohort failure. At the same time, concurrency
feedback was separated from dead site detection.

<li> These simplifications, and their modular implementation, helped
to develop further insights into the different roles that positive
and negative concurrency feedback play, and helped to identify some
worst-case scenarios.

</ul>

</body>

</html>