<!doctype html public "-//W3C//DTD HTML 4.01 Transitional//EN"
        "http://www.w3.org/TR/html4/loose.dtd">

<html>

<head>

<title>Postfix Queue Scheduler</title>

<meta http-equiv="Content-Type" content="text/html; charset=us-ascii">

</head>

<body>

<h1><img src="postfix-logo.jpg" width="203" height="98" ALT="">Postfix
Queue Scheduler</h1>

<hr>

<h2> Disclaimer </h2>

<p> Many of the <i>transport</i>-specific configuration parameters
discussed in this document will not show up in "postconf" command
output before Postfix version 2.9. This limitation applies to many
parameters whose name is a combination of a master.cf service name
such as "relay" and a built-in suffix such as
"_destination_concurrency_limit". </p>

<h2> Overview </h2>

<p> The queue manager is by far the most complex part of the Postfix
mail system. It schedules delivery of new mail, retries failed
deliveries at specific times, and removes mail from the queue after
the last delivery attempt. There are two major classes of mechanisms
that control the operation of the queue manager. </p>

<p> Topics covered by this document: </p>

<ul>

<li> <a href="#concurrency"> Concurrency scheduling</a>, concerned
with the number of concurrent deliveries to a specific destination,
including decisions on when to suspend deliveries after persistent
failures.

<li> <a href="#jobs"> Preemptive scheduling</a>, concerned with
the selection of email messages and recipients for a given destination.

<li> <a href="#credits"> Credits</a>, something this document would not be
complete without.

</ul>

<!--

<p> Once started, the qmgr(8) process runs until "postfix reload"
or "postfix stop". As a persistent process, the queue manager has
to meet strict requirements with respect to code correctness and
robustness. Unlike non-persistent daemon processes, the queue manager
cannot benefit from Postfix's process rejuvenation mechanism that
limits the impact from resource leaks and other coding errors
(translation: replacing a process after a short time covers up bugs
before they can become a problem). </p>

-->

<h2> <a name="concurrency"> Concurrency scheduling </a> </h2>

<p> The following sections document the Postfix 2.5 concurrency
scheduler, after a discussion of the limitations of the earlier
concurrency scheduler. This is followed by results of medium-concurrency
experiments, and a discussion of trade-offs between performance and
robustness. </p>
<p> The material is organized as follows: </p>

<ul>

<li> <a href="#concurrency_drawbacks"> Drawbacks of the existing
concurrency scheduler </a>

<li> <a href="#concurrency_summary_2_5"> Summary of the Postfix 2.5
concurrency feedback algorithm </a>

<li> <a href="#dead_summary_2_5"> Summary of the Postfix 2.5 "dead
destination" detection algorithm </a>

<li> <a href="#pseudo_code_2_5"> Pseudocode for the Postfix 2.5
concurrency scheduler </a>

<li> <a href="#concurrency_results"> Results for delivery to
concurrency-limited servers </a>

<li> <a href="#concurrency_discussion"> Discussion of concurrency-limited
server results </a>

<li> <a href="#concurrency_limitations"> Limitations of less-than-1
per delivery feedback </a>

<li> <a href="#concurrency_config"> Concurrency configuration
parameters </a>

</ul>

<h3> <a name="concurrency_drawbacks"> Drawbacks of the existing
concurrency scheduler </a> </h3>

<p> From the start, Postfix has used a simple but robust algorithm
where the per-destination delivery concurrency is decremented by 1
after a delivery failed due to connection or handshake failure, and
incremented by 1 otherwise. Of course the concurrency is never
allowed to exceed the maximum per-destination concurrency limit.
And when a destination's concurrency level drops to zero, the
destination is declared "dead" and delivery is suspended. </p>

<p> Drawbacks of +/-1 concurrency feedback per delivery are: </p>

<ul>

<li> <p> Overshoot due to exponential delivery concurrency growth
with each pseudo-cohort(*). This can be an issue with high-concurrency
channels. For example, with the default initial concurrency of 5,
concurrency would proceed over time as (5-10-20). </p>

<li> <p> Throttling down to zero concurrency after a single
pseudo-cohort(*) failure. This was especially an issue with
low-concurrency channels where a single failure could be sufficient
to mark a destination as "dead", causing the suspension of further
deliveries to the affected destination. </p>

</ul>

<p> (*) A pseudo-cohort is a number of delivery requests equal to
a destination's delivery concurrency. </p>

<p> The revised concurrency scheduler has a highly modular structure.
It uses separate mechanisms for per-destination concurrency control
and for "dead destination" detection. The concurrency control in
turn is built from two separate mechanisms: it supports less-than-1
feedback per delivery to allow for more gradual concurrency
adjustments, and it uses feedback hysteresis to suppress concurrency
oscillations. And instead of waiting for delivery concurrency to
throttle down to zero, a destination is declared "dead" after a
configurable number of pseudo-cohorts reports connection or handshake
failure. </p>

<h3> <a name="concurrency_summary_2_5"> Summary of the Postfix 2.5
concurrency feedback algorithm </a> </h3>

<p> We want to increment a destination's delivery concurrency when
some (not necessarily consecutive) number of deliveries complete
without connection or handshake failure. This is implemented with
positive feedback g(N) where N is the destination's delivery
concurrency. With g(N)=1 feedback per delivery, concurrency increases
by 1 after each positive feedback event; this gives us the old
scheduler's exponential growth in time.
With g(N)=1/N feedback per
delivery, concurrency increases by 1 after an entire pseudo-cohort
N of positive feedback reports; this gives us linear growth in time.
Less-than-1 feedback per delivery and integer truncation naturally
give us hysteresis, so that transitions to larger concurrency happen
every 1/g(N) positive feedback events. For example, with g(N)=1/N
and N=5, the concurrency increases from 5 to 6 only after five
deliveries have completed without connection or handshake failure. </p>

<p> We want to decrement a destination's delivery concurrency when
some (not necessarily consecutive) number of deliveries complete
after connection or handshake failure. This is implemented with
negative feedback f(N) where N is the destination's delivery
concurrency. With f(N)=1 feedback per delivery, concurrency decreases
by 1 after each negative feedback event; this gives us the old
scheduler's behavior where concurrency is throttled down dramatically
after a single pseudo-cohort failure. With f(N)=1/N feedback per
delivery, concurrency backs off more gently. Again, less-than-1
feedback per delivery and integer truncation naturally give us
hysteresis, so that transitions to lower concurrency happen every
1/f(N) negative feedback events. </p>

<p> However, with negative feedback we introduce a subtle twist.
We "reverse" the negative hysteresis cycle so that the transition
to lower concurrency happens at the <b>beginning</b> of a sequence
of 1/f(N) negative feedback events. Otherwise, a correction for
overload would be made too late. This makes the choice of f(N)
relatively unimportant, as borne out by measurements later in this
document. </p>

<p> In summary, the main ingredients for the Postfix 2.5 concurrency
feedback algorithm are a) the option of less-than-1 positive feedback
per delivery to avoid overwhelming servers, b) the option of
less-than-1 negative feedback per delivery to avoid giving up too
fast, c) feedback hysteresis to avoid rapid oscillation, and d) a
"reverse" hysteresis cycle for negative feedback, so that it can
correct for overload quickly. </p>

<h3> <a name="dead_summary_2_5"> Summary of the Postfix 2.5 "dead destination" detection algorithm </a> </h3>

<p> We want to suspend deliveries to a specific destination after
some number of deliveries suffers connection or handshake failure.
The old scheduler declares a destination "dead" when negative (-1)
feedback throttles the delivery concurrency down to zero. With
less-than-1 feedback per delivery, this throttling down would
obviously take too long. We therefore have to separate "dead
destination" detection from concurrency feedback. This is implemented
by introducing the concept of pseudo-cohort failure. The Postfix
2.5 concurrency scheduler declares a destination "dead" after a
configurable number of pseudo-cohorts suffers from connection or
handshake failures. The old scheduler corresponds to the special
case where the pseudo-cohort failure limit is equal to 1. </p>

<h3> <a name="pseudo_code_2_5"> Pseudocode for the Postfix 2.5 concurrency scheduler </a> </h3>

<p> The pseudocode below shows how the ideas behind the new concurrency
scheduler are implemented as of November 2007. The actual code can
be found in the module qmgr/qmgr_queue.c. </p>
<pre>
Types:
        Each destination has one set of the following variables
        int concurrency
        double success
        double failure
        double fail_cohorts

Feedback functions:
        N is concurrency; x, y are arbitrary numbers in [0..1] inclusive
        positive feedback: g(N) = x/N | x/sqrt(N) | x
        negative feedback: f(N) = y/N | y/sqrt(N) | y

Initialization:
        concurrency = initial_concurrency
        success = 0
        failure = 0
        fail_cohorts = 0

After success:
        fail_cohorts = 0
        Be prepared for feedback > hysteresis, or rounding error
        success += g(concurrency)
        while (success >= 1)                    Hysteresis 1
            concurrency += 1                    Hysteresis 1
            failure = 0
            success -= 1                        Hysteresis 1
        Be prepared for overshoot
        if (concurrency > concurrency limit)
            concurrency = concurrency limit

Safety:
        Don't apply positive feedback unless
            concurrency &lt; busy_refcount + init_dest_concurrency
        otherwise negative feedback effect could be delayed

After failure:
        if (concurrency > 0)
            fail_cohorts += 1.0 / concurrency
            if (fail_cohorts > cohort_failure_limit)
                concurrency = 0
        if (concurrency > 0)
            Be prepared for feedback > hysteresis, rounding errors
            failure -= f(concurrency)
            while (failure &lt; 0)
                concurrency -= 1                Hysteresis 1
                failure += 1                    Hysteresis 1
                success = 0
            Be prepared for overshoot
            if (concurrency &lt; 1)
                concurrency = 1
</pre>
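<p> For readers who prefer something compilable over pseudocode,
here is a minimal stand-alone C sketch of the same feedback logic,
with both feedback functions hard-coded to the 1/N variant. It is
not the actual qmgr code (names such as dest_success() and
dest_failure() are made up for illustration, and the "Safety" rule
above is omitted); it only mirrors the arithmetic, so that you can
watch the concurrency climb one step per pseudo-cohort of successful
deliveries. </p>

<blockquote>
<pre>
/*
 * Minimal stand-alone sketch of the per-destination concurrency
 * feedback described by the pseudocode above. This is NOT the real
 * qmgr/qmgr_queue.c code; it only mirrors the arithmetic, with
 * g(N) = f(N) = 1/N and illustrative limits.
 */
#include &lt;stdio.h&gt;

struct dest {
    int     concurrency;                /* current delivery concurrency */
    double  success;                    /* accumulated positive feedback */
    double  failure;                    /* accumulated negative feedback */
    double  fail_cohorts;               /* accumulated pseudo-cohort failures */
};

#define INIT_CONCURRENCY        5
#define MAX_CONCURRENCY         20
#define COHORT_FAILURE_LIMIT    1.0

static void dest_init(struct dest *d)
{
    d->concurrency = INIT_CONCURRENCY;
    d->success = d->failure = d->fail_cohorts = 0;
}

/* Delivery completed without connection or handshake failure. */
static void dest_success(struct dest *d)
{
    d->fail_cohorts = 0;
    d->success += 1.0 / d->concurrency;         /* g(N) = 1/N */
    while (d->success >= 1) {                   /* hysteresis 1 */
        d->concurrency += 1;
        d->failure = 0;
        d->success -= 1;
    }
    if (d->concurrency > MAX_CONCURRENCY)
        d->concurrency = MAX_CONCURRENCY;
}

/* Delivery failed with a connection or handshake failure. */
static void dest_failure(struct dest *d)
{
    if (d->concurrency > 0) {
        d->fail_cohorts += 1.0 / d->concurrency;
        if (d->fail_cohorts > COHORT_FAILURE_LIMIT) {
            d->concurrency = 0;                 /* destination is "dead" */
            return;
        }
        d->failure -= 1.0 / d->concurrency;     /* f(N) = 1/N */
        while (d->failure &lt; 0) {              /* "reverse" hysteresis */
            d->concurrency -= 1;
            d->failure += 1;
            d->success = 0;
        }
        if (d->concurrency &lt; 1)
            d->concurrency = 1;
    }
}

int main(void)
{
    struct dest d;
    int     i;

    dest_init(&amp;d);
    for (i = 0; i &lt; 20; i++) {                /* 20 successful deliveries */
        dest_success(&amp;d);
        printf("after success %2d: concurrency %d\n", i + 1, d.concurrency);
    }
    return 0;
}
</pre>
</blockquote>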
<h3> <a name="concurrency_results"> Results for delivery to concurrency-limited servers </a> </h3>

<p> Discussions about the concurrency scheduler redesign started
in early 2004, when the primary goal was to find alternatives that
did not exhibit exponential growth or rapid concurrency throttling.
No code was implemented until late 2007, when the primary concern
had shifted towards better handling of server concurrency limits.
For this reason we measure how well the new scheduler does this
job. The table below compares mail delivery performance of the old
+/-1 feedback per delivery with several less-than-1 feedback
functions, for different limited-concurrency server scenarios.
Measurements were done with a FreeBSD 6.2 client and with FreeBSD
6.2 and various Linux servers. </p>

<p> Server configuration: </p>

<ul>

<li> The mail flow was slowed down with 1 second latency per
recipient ("smtpd_client_restrictions = sleep 1"). The purpose was
to make results less dependent on hardware details, by avoiding
slow-downs by queue file I/O, logging I/O, and network I/O.

<li> Concurrency was limited by the server process limit
("default_process_limit = 5" and "smtpd_client_event_limit_exceptions
= static:all"). Postfix was stopped and started after changing the
process limit, because the same number is also used as the backlog
argument to the listen(2) system call, and "postfix reload" does
not re-issue this call.

<li> Mail was discarded with "local_recipient_maps = static:all" and
"local_transport = discard". The discard action in access maps or
header/body checks could not be used as it fails to update the
in_flow_delay counters.

</ul>

<p> Client configuration: </p>

<ul>

<li> Queue file overhead was minimized by sending one message to a
virtual alias that expanded into 2000 different remote recipients.
All recipients were accounted for according to the maillog file.
The virtual_alias_expansion_limit setting was increased to avoid
complaints from the cleanup(8) server.

<li> The number of deliveries was maximized with
"smtp_destination_recipient_limit = 2". A smaller limit would cause
Postfix to schedule the concurrency per recipient instead of domain,
which is not what we want.

<li> Maximum concurrency was limited with
"smtp_destination_concurrency_limit = 20", and
initial_destination_concurrency was set to the same value.

<li> The positive and negative concurrency feedback hysteresis was
1. Concurrency was incremented by 1 at the END of 1/feedback steps
of positive feedback, and was decremented by 1 at the START of
1/feedback steps of negative feedback.

<li> The SMTP client used the default 30s SMTP connect timeout and
300s SMTP greeting timeout.

</ul>

<h4> Impact of the 30s SMTP connect timeout </h4>

<p> The first results are for a FreeBSD 6.2 server, where our
artificially low listen(2) backlog results in a very short kernel
queue for established connections. The table shows that all deferred
deliveries failed due to a 30s connection timeout, and none failed
due to a server greeting timeout. This measurement simulates what
happens when the server's connection queue is completely full under
load, and the TCP engine drops new connections. </p>

<blockquote>

<table>

<tr> <th>client<br> limit</th> <th>server<br> limit</th> <th>feedback<br>
style</th> <th>connection<br> caching</th> <th>percentage<br>
deferred</th> <th colspan="2">client concurrency<br> average/stddev</th>
<th colspan="2">timed-out in<br> connect/greeting </th> </tr>

<tr> <td align="center" colspan="9"> <hr> </td> </tr>

<tr align="center"> <td>20</td> <td>5</td> <td>1/N</td> <td>no</td>
<td>9.9</td> <td>19.4</td> <td>0.49</td> <td>198</td> <td>-</td> </tr>

<tr align="center"> <td>20</td> <td>5</td> <td>1/N</td> <td>yes</td>
<td>10.3</td> <td>19.4</td> <td>0.49</td> <td>206</td> <td>-</td> </tr>

<tr align="center"> <td>20</td> <td>5</td> <td>1/sqrt(N)</td> <td>no</td>
<td>10.4</td> <td>19.6</td> <td>0.59</td> <td>208</td> <td>-</td> </tr>

<tr align="center"> <td>20</td> <td>5</td> <td>1/sqrt(N)</td> <td>yes</td>
<td>10.6</td> <td>19.6</td> <td>0.61</td> <td>212</td> <td>-</td> </tr>

<tr align="center"> <td>20</td> <td>5</td> <td>1</td> <td>no</td>
<td>10.1</td> <td>19.5</td> <td>1.29</td> <td>202</td> <td>-</td> </tr>

<tr align="center"> <td>20</td> <td>5</td> <td>1</td> <td>yes</td>
<td>10.8</td> <td>19.3</td> <td>1.57</td> <td>216</td> <td>-</td> </tr>

<tr> <td align="center" colspan="9"> <hr> </td> </tr>

</table>
<p> A busy server with a completely full connection queue. N is
the client delivery concurrency. Failed deliveries time out after
30s without completing the TCP handshake. See text for a discussion
of results. </p>

</blockquote>

<h4> Impact of the 300s SMTP greeting timeout </h4>

<p> The next table shows results for a Fedora Core 8 server (results
for RedHat 7.3 are identical). In this case, the artificially small
listen(2) backlog argument does not impact our measurement. The
table shows that practically all deferred deliveries fail after the
300s SMTP greeting timeout. As these timeouts were 10x longer than
with the first measurement, we increased the recipient count (and
thus the running time) by a factor of 10 to keep the results
comparable. The deferred mail percentages are a factor 10 lower
than with the first measurement, because the 1s per-recipient delay
was 1/300th of the greeting timeout instead of 1/30th of the
connection timeout. </p>

<blockquote>

<table>

<tr> <th>client<br> limit</th> <th>server<br> limit</th> <th>feedback<br>
style</th> <th>connection<br> caching</th> <th>percentage<br>
deferred</th> <th colspan="2">client concurrency<br> average/stddev</th>
<th colspan="2">timed-out in<br> connect/greeting </th> </tr>

<tr> <td align="center" colspan="9"> <hr> </td> </tr>

<tr align="center"> <td>20</td> <td>5</td> <td>1/N</td> <td>no</td>
<td>1.16</td> <td>19.8</td> <td>0.37</td> <td>-</td> <td>230</td> </tr>

<tr align="center"> <td>20</td> <td>5</td> <td>1/N</td> <td>yes</td>
<td>1.36</td> <td>19.8</td> <td>0.36</td> <td>-</td> <td>272</td> </tr>

<tr align="center"> <td>20</td> <td>5</td> <td>1/sqrt(N)</td> <td>no</td>
<td>1.21</td> <td>19.9</td> <td>0.23</td> <td>4</td> <td>238</td> </tr>

<tr align="center"> <td>20</td> <td>5</td> <td>1/sqrt(N)</td> <td>yes</td>
<td>1.36</td> <td>20.0</td> <td>0.23</td> <td>-</td> <td>272</td> </tr>

<tr align="center"> <td>20</td> <td>5</td> <td>1</td> <td>no</td>
<td>1.18</td> <td>20.0</td> <td>0.16</td> <td>-</td> <td>236</td> </tr>

<tr align="center"> <td>20</td> <td>5</td> <td>1</td> <td>yes</td>
<td>1.39</td> <td>20.0</td> <td>0.16</td> <td>-</td> <td>278</td> </tr>

<tr> <td align="center" colspan="9"> <hr> </td> </tr>

</table>

<p> A busy server with a non-full connection queue. N is the client
delivery concurrency. Failed deliveries complete at the TCP level,
but time out after 300s while waiting for the SMTP greeting. See
text for a discussion of results. </p>
</blockquote>

<h4> Impact of an active server concurrency limiter </h4>

<p> The final concurrency-limited result shows what happens when
SMTP connections don't time out, but are rejected immediately with
the Postfix server's smtpd_client_connection_count_limit feature
(the server replies with a 421 status and disconnects immediately).
Similar results can be expected with concurrency limiting features
built into other MTAs or firewalls. For this measurement we specified
a server concurrency limit and a client initial destination concurrency
of 5, and a server process limit of 10; all other conditions were
the same as with the first measurement. The same result would be
obtained with a FreeBSD or Linux server, because the "pushing back"
is done entirely by the receiving side. </p>

<blockquote>

<table>

<tr> <th>client<br> limit</th> <th>server<br> limit</th> <th>feedback<br>
style</th> <th>connection<br> caching</th> <th>percentage<br>
deferred</th> <th colspan="2">client concurrency<br> average/stddev</th>
<th>theoretical<br> defer rate</th> </tr>

<tr> <td align="center" colspan="8"> <hr> </td> </tr>

<tr align="center"> <td>20</td> <td>5</td> <td>1/N</td> <td>no</td>
<td>16.5</td> <td>5.17</td> <td>0.38</td> <td>1/6</td> </tr>

<tr align="center"> <td>20</td> <td>5</td> <td>1/N</td> <td>yes</td>
<td>16.5</td> <td>5.17</td> <td>0.38</td> <td>1/6</td> </tr>

<tr align="center"> <td>20</td> <td>5</td> <td>1/sqrt(N)</td> <td>no</td>
<td>24.5</td> <td>5.28</td> <td>0.45</td> <td>1/4</td> </tr>

<tr align="center"> <td>20</td> <td>5</td> <td>1/sqrt(N)</td> <td>yes</td>
<td>24.3</td> <td>5.28</td> <td>0.46</td> <td>1/4</td> </tr>

<tr align="center"> <td>20</td> <td>5</td> <td>1</td> <td>no</td>
<td>49.7</td> <td>5.63</td> <td>0.67</td> <td>1/2</td> </tr>

<tr align="center"> <td>20</td> <td>5</td> <td>1</td> <td>yes</td>
<td>49.7</td> <td>5.68</td> <td>0.70</td> <td>1/2</td> </tr>

<tr> <td align="center" colspan="8"> <hr> </td> </tr>

</table>

<p> A server with an active per-client concurrency limiter that replies
with 421 and disconnects. N is the client delivery concurrency.
The theoretical defer rate is 1/(1+roundup(1/feedback)). This is
always 1/2 with the fixed +/-1 feedback per delivery; with the
concurrency-dependent feedback variants, the defer rate decreases
with increasing concurrency. See text for a discussion of results.
</p>

</blockquote>
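<p> The "theoretical defer rate" column follows directly from the
formula in the caption. The small C sketch below (not Postfix code)
evaluates 1/(1+roundup(1/feedback)) for the three feedback styles
at the initial concurrency of 5 used in this measurement, and
reproduces the 1/6, 1/4 and 1/2 figures from the table. </p>

<blockquote>
<pre>
/*
 * Cross-check of the formula from the table caption:
 * defer rate = 1/(1 + roundup(1/feedback)), evaluated at the initial
 * per-destination concurrency N = 5 used in this measurement.
 */
#include &lt;math.h&gt;
#include &lt;stdio.h&gt;

int main(void)
{
    const double N = 5;
    const struct {
        const char *style;
        double  feedback;
    } styles[] = {
        { "1/N", 1.0 / N },
        { "1/sqrt(N)", 1.0 / sqrt(N) },
        { "1", 1.0 },
    };
    int     i;

    for (i = 0; i &lt; 3; i++) {
        double  rate = 1.0 / (1.0 + ceil(1.0 / styles[i].feedback));

        printf("%-10s defer rate 1/%.0f = %.3f\n",
               styles[i].style, 1.0 / rate, rate);
    }
    return 0;
}
</pre>
</blockquote>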
<h3> <a name="concurrency_discussion"> Discussion of concurrency-limited server results </a> </h3>

<p> All results in the previous sections are based on the first
delivery runs only; they do not include any second and later delivery
attempts. It's also worth noting that the measurements look at
steady-state behavior only. They don't show what happens when the
client starts sending at a much higher or lower concurrency.
</p>

<p> The first two examples show that the effect of feedback
is negligible when concurrency is limited due to congestion. This
is because the initial concurrency is already at the client's
concurrency maximum, and because there is 10-100 times more positive
than negative feedback. Under these conditions, it is no surprise
that the contribution from SMTP connection caching is also negligible.
</p>

<p> In the last example, the old +/-1 feedback per delivery will
defer 50% of the mail when confronted with an active (anvil-style)
server concurrency limit, where the server hangs up immediately
with a 421 status (a TCP-level RST would have the same result).
Less aggressive feedback mechanisms fare better than more aggressive
ones. Concurrency-dependent feedback fares even better at higher
concurrencies than shown here, but has limitations as discussed in
the next section. </p>

<h3> <a name="concurrency_limitations"> Limitations of less-than-1 per delivery feedback </a> </h3>

<p> Less-than-1 feedback is of interest primarily when sending large
amounts of mail to destinations with active concurrency limiters
(servers that reply with 421, or firewalls that send RST). When
sending small amounts of mail per destination, less-than-1 per-delivery
feedback won't have a noticeable effect on the per-destination
concurrency, because the number of deliveries to the same destination
is too small. You might just as well use zero per-delivery feedback
and stay with the initial per-destination concurrency. And when
mail deliveries fail due to congestion instead of active concurrency
limiters, the measurements above show that per-delivery feedback
has no effect. With large amounts of mail you might just as well
use zero per-delivery feedback and start with the maximal per-destination
concurrency. </p>

<p> The scheduler with less-than-1 concurrency
feedback per delivery solves a problem with servers that have active
concurrency limiters. This works only because feedback is handled
in a peculiar manner: positive feedback will increment the concurrency
by 1 at the <b>end</b> of a sequence of events of length 1/feedback,
while negative feedback will decrement concurrency by 1 at the
<b>beginning</b> of such a sequence. This is how Postfix adjusts
quickly for overshoot without causing lots of mail to be deferred.
Without this difference in feedback treatment, less-than-1 feedback
per delivery would defer 50% of the mail, and would be no better
in this respect than the old +/-1 feedback per delivery. </p>

<p> Unfortunately, the same feature that corrects quickly for
concurrency overshoot also makes the scheduler more sensitive to
noisy negative feedback. The reason is that one lonely negative
feedback event has the same effect as a complete sequence of length
1/feedback: in both cases delivery concurrency is dropped by 1
immediately. As a worst-case scenario, consider multiple servers
behind a load balancer on a single IP address, and no backup MX
address.
When 1 out of K servers fails to complete the SMTP handshake
or drops the connection, a scheduler with 1/N (N = concurrency)
feedback stops increasing its concurrency once it reaches a concurrency
level of about K, even though the good servers behind the load
balancer are perfectly capable of handling more traffic. </p>

<p> This noise problem gets worse as the amount of positive feedback
per delivery gets smaller. A compromise is to use fixed less-than-1
positive feedback values instead of concurrency-dependent positive
feedback. For example, to tolerate 1 of 4 bad servers in the above
load balancer scenario, use positive feedback of 1/4 per "good"
delivery (no connect or handshake error), and use an equal or smaller
amount of negative feedback per "bad" delivery. The downside of
using concurrency-independent feedback is that some of the old +/-1
feedback problems will return at large concurrencies. Sites that
must deliver mail at non-trivial per-destination concurrencies will
require special configuration. </p>

<h3> <a name="concurrency_config"> Concurrency configuration parameters </a> </h3>

<p> The Postfix 2.5 concurrency scheduler is controlled with the
following configuration parameters, where "<i>transport</i>_foo"
provides a transport-specific parameter override. All parameter
default settings are compatible with earlier Postfix versions. </p>

<blockquote>

<table border="0">

<tr> <th> Parameter name </th> <th> Postfix version </th> <th>
Description </th> </tr>

<tr> <td colspan="3"> <hr> </td> </tr>

<tr> <td> initial_destination_concurrency<br>
<i>transport</i>_initial_destination_concurrency </td> <td
align="center"> all<br> 2.5 </td> <td> Initial per-destination
delivery concurrency </td> </tr>

<tr> <td> default_destination_concurrency_limit<br>
<i>transport</i>_destination_concurrency_limit </td> <td align="center">
all<br> all </td> <td> Maximum per-destination delivery concurrency
</td> </tr>

<tr> <td> default_destination_concurrency_positive_feedback<br>
<i>transport</i>_destination_concurrency_positive_feedback </td>
<td align="center"> 2.5<br> 2.5 </td> <td> Per-destination positive
feedback amount, per delivery that does not fail with connection
or handshake failure </td> </tr>

<tr> <td> default_destination_concurrency_negative_feedback<br>
<i>transport</i>_destination_concurrency_negative_feedback </td>
<td align="center"> 2.5<br> 2.5 </td> <td> Per-destination negative
feedback amount, per delivery that fails with connection or handshake
failure </td> </tr>

<tr> <td> default_destination_concurrency_failed_cohort_limit<br>
<i>transport</i>_destination_concurrency_failed_cohort_limit </td>
<td align="center"> 2.5<br> 2.5 </td> <td> Number of failed
pseudo-cohorts after which a destination is declared "dead" and
delivery is suspended </td> </tr>

<tr> <td> destination_concurrency_feedback_debug</td> <td align="center">
2.5 </td> <td> Enable verbose logging of concurrency scheduler
activity </td> </tr>

<tr> <td colspan="3"> <hr> </td> </tr>

</table>

</blockquote>
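<p> As an illustration only, a main.cf fragment for the fixed
less-than-1 feedback compromise described in the previous section
(tolerating 1 bad server out of 4 behind a load balancer) might
look as follows. The "smtp_" prefix stands for the master.cf service
name of the transport involved, and the values are examples, not
recommended defaults. </p>

<blockquote>
<pre>
# Illustrative values only, not recommended defaults. "smtp" stands
# for the master.cf name of the transport that delivers to the
# load-balanced destination.
smtp_initial_destination_concurrency = 5
smtp_destination_concurrency_limit = 20

# Fixed less-than-1 feedback: tolerate 1 bad server out of 4.
smtp_destination_concurrency_positive_feedback = 1/4
smtp_destination_concurrency_negative_feedback = 1/8

# Declare the destination "dead" only after two full pseudo-cohorts
# fail with connection or handshake errors.
smtp_destination_concurrency_failed_cohort_limit = 2
</pre>
</blockquote>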
<h2> <a name="jobs"> Preemptive scheduling </a> </h2>

<p>

The following sections describe the new queue manager and its
preemptive scheduler algorithm. Note that the document was originally
written to describe the changes between the new queue manager (in
this text referred to as <tt>nqmgr</tt>, the name it was known by
before it became the default queue manager) and the old queue manager
(referred to as <tt>oqmgr</tt>). This is why it refers to <tt>oqmgr</tt>
every so often.

</p>

<p>

This document is divided into sections as follows:

</p>

<ul>

<li> <a href="#nqmgr_structures"> The structures used by nqmgr </a>

<li> <a href="#nqmgr_pickup"> What happens when nqmgr picks
up the message </a> - how it is assigned to transports, jobs, peers,
entries

<li> <a href="#nqmgr_selection"> How the entry selection works </a>

<li> <a href="#nqmgr_preemption"> How the preemption works </a> -
what messages may be preempted and how, and what messages are chosen
to preempt them

<li> <a href="#nqmgr_concurrency"> How destination concurrency
limits affect the scheduling algorithm </a>

<li> <a href="#nqmgr_memory"> Dealing with memory resource
limits </a>

</ul>

<h3> <a name="nqmgr_structures"> The structures used by nqmgr </a> </h3>

<p>

Let's start by recapitulating the structures and terms used when
referring to the queue manager and how it operates. Many of these are
partially described elsewhere, but it is nice to have a coherent
overview in one place:

</p>

<ul>

<li> <p> Each message structure represents one mail message which
Postfix is to deliver. The message recipients specify to what
destinations the message is to be delivered and what transports are
going to be used for the delivery. </p>

<li> <p> Each recipient entry groups a batch of recipients of one
message which are all going to be delivered to the same destination
(and over the same transport).
</p>

<li> <p> Each transport structure groups everything that is going
to be delivered by delivery agents dedicated to that transport.
Each transport maintains a set of queues (describing the destinations
it shall talk to) and jobs (referencing the messages it shall
deliver). </p>

<li> <p> Each transport queue (not to be confused with the on-disk
active queue or incoming queue) groups everything that is going to be
delivered to a given destination (aka nexthop) by its transport. Each
queue belongs to one transport, so each destination may be referred
to by several queues, one for each transport. Each queue maintains
a list of all recipient entries (batches of message recipients)
which shall be delivered to the given destination (the todo list), and
a list of recipient entries already being delivered by the delivery
agents (the busy list). </p>

<li> <p> Each queue corresponds to multiple peer structures. Each
peer structure is like the queue structure, belonging to one transport
and referencing one destination. The difference is that it lists
only the recipient entries which all originate from the same message,
unlike the queue structure, whose entries may originate from various
messages. For messages with few recipients, there is usually just
one recipient entry for each destination, resulting in one recipient
entry per peer.
But for large mailing list messages the recipients
may need to be split into multiple recipient entries, in which case
the peer structure may list many entries for a single destination.
</p>

<li> <p> Each transport job groups everything it takes to deliver
one message via its transport. Each job represents one message
within the context of the transport. The job belongs to one transport
and message, so each message may have multiple jobs, one for each
transport. The job groups all the peer structures, which describe
the destinations the job's message has to be delivered to. </p>

</ul>

<p>

The first four structures are common to both <tt>nqmgr</tt> and
<tt>oqmgr</tt>; the latter two were introduced by <tt>nqmgr</tt>.

</p>

<p>

These terms are used extensively in the text below; feel free to
refer back to the descriptions above any time you feel you have lost
track of what is what.

</p>

<h3> <a name="nqmgr_pickup"> What happens when nqmgr picks
up the message </a> </h3>

<p>

Whenever <tt>nqmgr</tt> moves a queue file into the active queue,
the following happens: It reads all necessary information from the
queue file as <tt>oqmgr</tt> does, and also reads as many recipients
as possible - more on that later; for now let's just pretend it
always reads all recipients.

</p>

<p>

Then it resolves the recipients as <tt>oqmgr</tt> does, which
means obtaining an (address, nexthop, transport) triple for each
recipient. For each triple, it finds the transport; if it does not
exist yet, it instantiates it (unless it's dead). Within the
transport, it finds the destination queue for the given nexthop; if it
does not exist yet, it instantiates it (unless it's dead). The
triple is then bound to the given destination queue. This happens in
qmgr_resolve() and is basically the same as in <tt>oqmgr</tt>.

</p>

<p>

Then, for each triple which was bound to some queue (and thus
transport), the program finds the job which represents the message
within that transport's context; if it does not exist yet, it
instantiates it. Within the job, it finds the peer which represents
the bound destination queue within this job's context; if it does
not exist yet, it instantiates it. Finally, it stores the address
from the resolved triple into the recipient entry which is appended
to both the queue entry list and the peer entry list. The addresses
for the same nexthop are batched in the entries up to the
recipient_concurrency limit for that transport. This happens in
qmgr_assign() and, apart from the fact that it operates with job and
peer structures, is basically the same as in <tt>oqmgr</tt>.

</p>

<p>

When the job is instantiated, it is enqueued on the transport's job
list based on the time its message was picked up by <tt>nqmgr</tt>.
For the first batch of recipients this means it is appended to the end
of the job list, but the ordering of the job list by the enqueue
time is important as we will see shortly.

</p>

<p>

[Now you should have a pretty good idea of the state of
<tt>nqmgr</tt> after a couple of messages have been picked up, and of
the relation between all those job, peer, queue and entry structures.]

</p>
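<p>

The following is a deliberately simplified C sketch of how these
structures hang together; it is not taken from the Postfix sources,
and the structure and field names are invented for illustration
(reference counts, busy lists and other bookkeeping are omitted).
qmgr_assign(), as described above, essentially wires each new
recipient entry into both its queue's todo list and its peer's entry
list.

</p>

<blockquote>
<pre>
/*
 * Simplified model of the queue manager structures described above.
 * NOT the actual qmgr data types; names and fields are illustrative.
 */
struct entry {                  /* recipient entry: a batch of recipients */
    struct message *message;    /* the mail message they belong to */
    struct queue   *queue;      /* destination queue (per transport) */
    struct peer    *peer;       /* same destination, same message */
    struct entry   *queue_next; /* next entry on the queue's todo list */
    struct entry   *peer_next;  /* next entry on the peer's entry list */
};

struct queue {                  /* per-transport destination (nexthop) */
    struct transport *transport;
    const char     *nexthop;
    struct entry   *todo;       /* entries waiting for a delivery agent */
    struct entry   *busy;       /* entries being delivered right now */
};

struct peer {                   /* one destination within one job */
    struct job     *job;
    struct queue   *queue;
    struct entry   *entries;    /* entries of this job for this queue */
};

struct job {                    /* one message within one transport */
    struct transport *transport;
    struct message *message;
    struct peer    *peers;      /* destinations this message goes to */
};

struct transport {              /* everything one delivery agent type handles */
    const char     *name;       /* e.g. "smtp", "relay", "local" */
    struct queue   *queues;     /* destinations, one queue per nexthop */
    struct job     *job_list;   /* jobs ordered by message pickup time */
};

struct message {                /* one queue file in the active queue */
    struct job     *jobs;       /* one job per transport used */
};
</pre>
</blockquote>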
<h3> <a name="nqmgr_selection"> How the entry selection works </a> </h3>

<p>

Having prepared all the structures described above, the task of
the <tt>nqmgr</tt>'s scheduler is to choose the recipient entries
one at a time and pass them to the delivery agent for the corresponding
transport. Now how does this work?

</p>

<p>

The first approximation of the new scheduling algorithm is like this:

</p>

<blockquote>
<pre>
foreach transport (round-robin-by-transport)
do
    if transport busy continue
    if transport process limit reached continue
    foreach transport's job (in the order of the transport's job list)
    do
        foreach job's peer (round-robin-by-destination)
            if peer->queue->concurrency &lt; peer->queue->window
                return next peer entry.
        done
    done
done
</pre>
</blockquote>

<p>

Now what is the "order of the transport's job list"? As we know
already, the job list is by default kept in the order the message
was picked up by the <tt>nqmgr</tt>. So by default we get round-robin
selection of transports at the top level, and FIFO message delivery
within each transport. The round-robin selection of the peers by
destination is perhaps of little importance in most real-life cases
(unless the recipient_concurrency limit is reached, there is only
one peer structure for each destination within one job), but
theoretically it makes sure that even within a single job, destinations
are treated fairly.

</p>

<p>

[By now you should have a feeling you really know how the scheduler
works, except for the preemption, under ideal conditions - that is,
no recipient resource limits and no destination concurrency problems.]

</p>

<h3> <a name="nqmgr_preemption"> How the preemption works </a> </h3>

<p>

As you might perhaps expect by now, the transport's job list does
not remain sorted by the job's message enqueue time all the time.
The coolest thing about <tt>nqmgr</tt> is not the simple FIFO
delivery, but that it is able to slip mail with few recipients
past mailing-list bulk mail. This is what the job preemption
is about - shuffling the jobs on the transport's job list to get
the best message delivery rates. Now how is this achieved?

</p>

<p>

First I have to tell you that there are in fact two job lists in
each transport. One is the scheduler's job list, which the scheduler
is free to play with, while the other one keeps the jobs always
listed in the order of the enqueue time and is used for recipient
pool management, which we will discuss later. For now, we will deal
with the scheduler's job list only.

</p>

<p>

So, we have the job list, which is first ordered by the time the
jobs' messages were enqueued, oldest messages first, the most recently
picked one at the end. For now, let's assume that there are no
destination concurrency problems. Without preemption, we pick some
entry of the first (oldest) job on the queue, assign it to a delivery
agent, pick another one from the same job, assign it again, and so
on, until all the entries are used and the job is delivered. We
would then move on to the next job, and so on and on.
Now how do we
manage to sneak in some entries from the recently added jobs when
the first job on the job list belongs to a message going to a
mailing list and has thousands of recipient entries?

</p>

<p>

The <tt>nqmgr</tt>'s answer is that we can artificially "inflate"
the delivery time of that first job by some constant for free - it
is basically the same trick you might remember as "accumulation of
potential" from amortized complexity lessons. For example,
instead of delivering the entries of the first job on the job list
every time a delivery agent becomes available, we can do it only
every second time. If you view the moments a delivery agent becomes
available on a timeline as "delivery slots", then instead of using
every delivery slot for the first job, we can use only every other
slot, and still the overall delivery efficiency of the first job
remains the same. So the delivery <tt>11112222</tt> becomes
<tt>1.1.1.1.2.2.2.2</tt> (1 and 2 are the imaginary job numbers, .
denotes a free slot). Now what do we do with the free slots?

</p>

<p>

As you might have guessed, we will use them for sneaking in the mail
with few recipients. For example, if we have one four-recipient
mail followed by four one-recipient mails, the delivery sequence
(that is, the sequence in which the jobs are assigned to the
delivery slots) might look like this: <tt>12131415</tt>. Hmm, fine
for sneaking in the single recipient mail, but how do we sneak in
mail with more than one recipient? Say we have one four-recipient
mail followed by two two-recipient mails?

</p>

<p>

The simple answer would be to use the delivery sequence <tt>12121313</tt>.
But the problem is that this does not scale well. Imagine you have
a mail with a thousand recipients followed by a mail with a hundred
recipients. It is tempting to suggest a delivery sequence like
<tt>121212....</tt>, but alas! Imagine there arrives another mail
with, say, ten recipients. But there are no free slots anymore, so it
can't slip by, not even if it had just one recipient. It will be
stuck until the hundred-recipient mail is delivered, which really
sucks.

</p>

<p>

So, it becomes obvious that while inflating the message to get
free slots is a great idea, one has to be really careful about how the
free slots are assigned, otherwise one might corner oneself. So,
how does <tt>nqmgr</tt> really use the free slots?

</p>

<p>

The key idea is that one does not have to generate the free slots
in a uniform way. The delivery sequence <tt>111...1</tt> is no
worse than <tt>1.1.1.1</tt>; in fact, it is even better, as some
entries are in the first case selected earlier than in the second
case, and none is selected later! So it is possible to first
"accumulate" the free delivery slots and then use them all at once.
It is even possible to accumulate some, then use them, then accumulate
some more and use them again, as in <tt>11..1.1</tt>.

</p>

<p>

Let's get back to the one hundred recipient example. We now know
that we could first accumulate one hundred free slots, and only
then preempt the first job and sneak the one hundred recipient mail
in.
Applying the algorithm recursively, we see that the
hundred recipient job can accumulate ten free delivery slots, and
then we could preempt it and sneak in the ten-recipient mail...
Wait wait wait! Could we? Aren't we overinflating the original one
thousand recipient mail?

</p>

<p>

Well, despite how it looks at first glance, another trick will
allow us to answer "no, we are not!". If we had said that we will
inflate the delivery time twice at maximum, and then we consider
every other slot as a free slot, then we would overinflate in case
of the recursive preemption. BUT! The trick is that if we use only
every n-th slot as a free slot for n>2, there is always some worst
inflation factor which we can guarantee not to be breached, even
if we apply the algorithm recursively. To be precise, if for every
k>1 normally used slots we accumulate one free delivery slot, then
the inflation factor is not worse than k/(k-1) no matter how many
recursive preemptions happen. And it's not worse than (k+1)/k if
only non-recursive preemption happens. With the default k of 5
(see below), this means the delivery time of a preempted job is
stretched by at most 25%, and by at most 20% if only non-recursive
preemption happens. Now, having got through the theory and the
related math, let's see how <tt>nqmgr</tt> implements this.

</p>

<p>

Each job has a so-called "available delivery slot" counter. Each
transport has a <i>transport</i>_delivery_slot_cost parameter, which
defaults to the default_delivery_slot_cost parameter, which is set
to 5 by default. This is the k from the paragraph above. Each time k
entries of the job are selected for delivery, this counter is
incremented by one. Once some slots have accumulated, a job which
requires no more than that number of slots to be fully delivered
can preempt this job.

</p>

<p>

[Well, the truth is, the counter is incremented every time an entry
is selected, and it is divided by k when it is used.
But for the understanding it's good enough to use
the above approximation of the truth.]

</p>
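<p>

To make the bookkeeping concrete, here is a small C sketch of the
slot accounting in the form just described. It is not the qmgr
implementation: it uses the "increment per selected entry, divide
by k when used" formulation from the note above, the function and
field names are invented, and it deliberately ignores both the
candidate selection rule and the slot discount/loan refinements
discussed below.

</p>

<blockquote>
<pre>
/*
 * Simplified slot accounting for job preemption, as described above.
 * NOT the actual qmgr code; the names and the slot_cost handling
 * are illustrative only.
 */
struct sjob {
    int     selected_entries;   /* entries of this job handed to agents */
    int     slots_used;         /* accumulated slots already spent */
    int     remaining_entries;  /* entries still waiting for delivery */
};

/* One slot is "earned" for every slot_cost selected entries (k above). */
static int available_slots(const struct sjob *job, int slot_cost)
{
    return job->selected_entries / slot_cost - job->slots_used;
}

/*
 * A candidate may preempt the current job when the slots the current
 * job has accumulated suffice to deliver the candidate completely.
 */
static int may_preempt(const struct sjob *current,
                       const struct sjob *candidate, int slot_cost)
{
    return available_slots(current, slot_cost) >= candidate->remaining_entries;
}

/* When the candidate is moved in front of the current job, the slots
 * it consumes are charged to the current job's account. */
static void do_preempt(struct sjob *current, const struct sjob *candidate)
{
    current->slots_used += candidate->remaining_entries;
}
</pre>
</blockquote>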
<p>

OK, so now we know the conditions which must be satisfied so that one
job can preempt another one. But what job gets preempted, how do
we choose what job preempts it if there are several valid candidates,
and when does all this exactly happen?

</p>

<p>

The answer to the first part is simple. The job whose entry was
selected the last time is the so-called current job. Normally, it is
the first job on the scheduler's job list, but destination concurrency
limits may change this, as we will see later. It is always only the
current job which may get preempted.

</p>

<p>

Now for the second part. The current job has a certain number of
recipient entries, and as such may accumulate at maximum some number
of available delivery slots. It might have already accumulated some,
and perhaps even already used some when it was preempted before
(remember that a job can be preempted several times). In either case,
we know how many are accumulated and how many are left to deliver,
so we know how many it may yet accumulate at maximum. Every other
job which may be delivered using no more than that number of slots is
a valid candidate for preemption. How do we choose among them?

</p>

<p>

The answer is - the one with the maximum enqueue_time/recipient_entry_count.
That is, the older the job is, the more we should try to deliver
it in order to get the best message delivery rates. These rates are of
course subject to how many recipients the message has, therefore
the division by the recipient (entry) count. No one shall be surprised
that a message with n recipients takes n times longer to deliver than
a message with one recipient.

</p>

<p>

Now let's recap the previous two paragraphs. Isn't it too complicated?
Why don't the candidates come only from among the jobs which can be
delivered within the number of slots the current job has already
accumulated? Why do we need to estimate how much it has yet to
accumulate? If you found out the answer, congratulate yourself. If
we did it this simple way, we would always choose the candidate
with the fewest recipient entries. If there were enough single recipient
mails coming in, they would always slip by the bulk mail as soon
as possible, and mail with two or more recipients would never get
a chance, no matter how long it has been sitting around in the
job list.

</p>

<p>

This candidate selection has an interesting implication - that when
we choose the best candidate for preemption (this is done in
qmgr_choose_candidate()), it may happen that we may not use it for
preemption immediately. This leads to an answer to the last part
of the original question - when does the preemption happen?

</p>

<p>

The preemption attempt happens every time the transport's next recipient
entry is to be chosen for delivery. To avoid needless overhead, the
preemption is not attempted if the current job could never accumulate
more than <i>transport</i>_minimum_delivery_slots (defaults to
default_minimum_delivery_slots, which defaults to 3). If there are
already enough accumulated slots to preempt the current job by the
chosen best candidate, it is done immediately. This basically means
that the candidate is moved in front of the current job on the
scheduler's job list, and the accumulated slot counter is decreased
by the amount used by the candidate. If there are not enough slots...
well, I could say that nothing happens and another preemption
is attempted the next time. But that's not the complete truth.

</p>

<p>

The truth is that it turns out that it is not really necessary to
wait until the job's counter accumulates all the delivery slots in
advance. Say we have a ten-recipient mail followed by two two-recipient
mails. If the preemption happened only when enough delivery slots
had accumulated (assuming slot cost 2), the delivery sequence would
become <tt>11112211113311</tt>. Now what would we get if we waited
only for 50% of the necessary slots to accumulate and promised to
wait for the remaining 50% later, after we get back
to the preempted job? If we use such a slot loan, the delivery sequence
becomes <tt>11221111331111</tt>. As we can see, this makes it not
considerably worse for the delivery of the ten-recipient mail, but
it allows the small messages to be delivered sooner.

</p>

<p>

The concept of these slot loans is where the
<i>transport</i>_delivery_slot_discount and
<i>transport</i>_delivery_slot_loan parameters come from (they default
to default_delivery_slot_discount and default_delivery_slot_loan, whose
values are by default 50 and 3, respectively).
The discount (resp.
loan) specifies how many percent (resp. how many slots) one "gets
in advance", when the number of slots required to deliver the best
candidate is compared with the number of slots the current job has
accumulated so far.

</p>

<p>

And that pretty much concludes this chapter.

</p>

<p>

[Now you should have a feeling that you pretty much understand the
scheduler and the preemption, or at least that you will have it
after you read the last chapter a couple more times. You shall clearly
see the job list and the preemption happening at its head, in ideal
delivery conditions. The feeling of understanding shall last until
you start wondering what happens if some of the jobs are blocked,
which you might eventually figure out correctly from what has been
said already. But I would be surprised if your mental image of the
scheduler's functionality is not completely shattered once you
start wondering how it works when not all recipients may be read
in-core. More on that later.]

</p>

<h3> <a name="nqmgr_concurrency"> How destination concurrency
limits affect the scheduling algorithm </a> </h3>

<p>

The <tt>nqmgr</tt> uses the same algorithm for destination concurrency
control as <tt>oqmgr</tt>. Now what happens when the destination
limits are reached and no more entries for that destination may be
selected by the scheduler?

</p>

<p>

From the user's point of view it is all simple. If some of the peers
of a job can't be selected, those peers are simply skipped by the
entry selection algorithm (the pseudo-code described before) and
only the selectable ones are used. If none of the peers may be
selected, the job is declared a "blocker job". Blocker jobs are
skipped by the entry selection algorithm and they are also excluded
from the candidates for preemption of the current job. Thus the
scheduler effectively behaves as if the blocker jobs didn't exist on
the job list at all. As soon as at least one of the peers of a blocker
job becomes unblocked (that is, the delivery agent handling the delivery
of the recipient entry for the given destination successfully finishes),
the job's blocker status is removed and the job again participates
in all further scheduler actions normally.

</p>

<p>

So the summary is that the users don't really have to be concerned
about the interaction of the destination limits and the scheduling
algorithm. It works well on its own and there are no knobs they
would need to control it with.

</p>

<p>

From a programmer's point of view, the blocker jobs complicate the
scheduler quite a lot. Without them, the jobs on the job list would
normally be delivered in strict FIFO order. If the current job is
preempted, the job preempting it is completely delivered unless it
is preempted itself. Without blockers, the current job is thus
always either the first job on the job list, or the top of the stack
of jobs preempting the first job on the job list.

</p>

<p>

The visualization of the job list and the preemption stack without
blockers would be like this:

</p>
<blockquote>
<pre>
first job->   1--2--3--5--6--8--...  <- job list
on job list   |
              4                      <- preemption stack
              |
current job-> 7
</pre>
</blockquote>

<p>

In the example above we see that job 1 was preempted by job 4 and
then job 4 was preempted by job 7. After job 7 is completed, the
remaining entries of job 4 are selected, and once they are all
selected, job 1 continues.

</p>

<p>

As we see, it's all very clean and straightforward. Now how does
this change because of blockers?

</p>

<p>

The answer is: a lot. Any job may become a blocker job at any time,
and also become a normal job again at any time. This has several
important implications:

</p>

<ol>

<li> <p>

The jobs may be completed in arbitrary order. For example, in the
example above, if the current job 7 becomes blocked, the next job
4 may complete before job 7 becomes unblocked again. Or if both
7 and 4 are blocked, then 1 is completed, then 7 becomes unblocked
and is completed, then 2 is completed and only after that 4 becomes
unblocked and is completed... You get the idea.

</p>

<p>

[Interesting side note: even when jobs are delivered out of order,
from a single destination's point of view the jobs are still delivered
in the expected order (that is, FIFO unless there was some preemption
involved). This is because whenever a destination queue becomes
unblocked (the destination limit allows selection of more recipient
entries for that destination), all jobs which have peers for that
destination are unblocked at once.]

</p>

<li> <p>

The idea of the preemption stack at the head of the job list is
gone. That is, it must be possible to preempt any job on the job
list. For example, if jobs 7, 4, 1 and 2 in the example above
all become blocked, job 3 becomes the current job. And of course
we do not want the preemption to be affected by the fact that there
are some blocked jobs or not. Therefore, if it turns out that job
3 might be preempted by job 6, the implementation shall make it
possible.

</p>

<li> <p>

The idea of the linear preemption stack itself is gone. It's no
longer true that one job is always preempted by only one job at one
time (that is, directly preempted, not counting the recursively
nested jobs). For example, in the example above, job 1 is directly
preempted by only job 4, and job 4 by job 7. Now assume job 7 becomes
blocked, and job 4 is being delivered. If it accumulates enough
delivery slots, it is natural that it might be preempted for example
by job 8. Now job 4 is preempted by both job 7 AND job 8 at the
same time.

</p>

</ol>

<p>

Now combine points 2) and 3) with point 1) again and you realize
that the relations on the once linear job list become pretty
complicated. If we extend the point 3) example: jobs 7 and 8 preempt
job 4, now job 8 becomes blocked too, then job 4 completes. Tricky,
huh?

</p>

<p>

If I illustrate the relations after the above mentioned examples
(except those in point 1), the situation would look like this:

</p>
<p>

[By now, you should have a feeling that there are more things going
on under the hood than you ever wanted to know. You decide that
forgetting about this chapter is the best you can do for the sake
of your mind's health and you basically stick with the idea of how
the scheduler works in ideal conditions, when there are no blockers,
which is good enough.]

</p>

<h3> <a name="nqmgr_memory"> Dealing with memory resource
limits </a> </h3>

<p>

When discussing the <tt>nqmgr</tt> scheduler, we have so far assumed
that all recipients of all messages in the active queue are completely
read into memory. This is simply not true. There is an upper bound
on the amount of memory the <tt>nqmgr</tt> may use, and therefore
it must impose some limits on the information it may store in memory
at any given time.

</p>

<p>

First of all, not all messages may be read in-core at once. At any
time, at most qmgr_message_active_limit messages may be read in-core.
When read into memory, the messages are picked from the incoming
and deferred message queues and moved to the active queue (incoming
having priority), so if there are more than qmgr_message_active_limit
messages queued, the rest will have to wait until (some of) the
messages in the active queue are completely delivered (or deferred).

</p>
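<p>

In other words, admission to the active queue is gated roughly as in
the following sketch (made-up helper names, not the actual qmgr code;
qmgr_message_active_limit is the real configuration parameter):

</p>

<blockquote>
<pre>
/* Sketch only: made-up names and helpers, not the actual qmgr code. */
struct message;

/* Hypothetical helpers standing in for the real queue scanning machinery;
 * both return a message to admit, or a null pointer when their queue is
 * empty. Incoming mail has priority over deferred mail. */
extern struct message *next_incoming_message(void);
extern struct message *next_deferred_message(void);
extern void move_to_active_queue(struct message *m);

static void fill_active_queue(int *active_count, int qmgr_message_active_limit)
{
    struct message *m;

    while (qmgr_message_active_limit > *active_count) {
        m = next_incoming_message();        /* incoming has priority... */
        if (m == 0)
            m = next_deferred_message();    /* ...then deferred */
        if (m == 0)
            break;                          /* nothing left to admit */
        move_to_active_queue(m);
        *active_count += 1;
    }
    /* Everything else waits in the incoming or deferred queue until some
     * active message is completely delivered (or deferred). */
}
</pre>
</blockquote>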
<p>

Even with the limited amount of in-core messages, there is another
limit which must be imposed in order to avoid memory exhaustion.
Each message may contain a huge number of recipients (tens or hundreds
of thousands are not uncommon), so if <tt>nqmgr</tt> read all
recipients of all messages in the active queue, it could easily run
out of memory. Therefore there must be some upper bound on the
amount of message recipients which are read into memory at the
same time.

</p>

<p>

Before discussing how exactly <tt>nqmgr</tt> implements the recipient
limits, let's see how the sole existence of the limits themselves
affects the <tt>nqmgr</tt> and its scheduler.

</p>

<p>

The message limit is straightforward - it just limits the size of
the lookahead the <tt>nqmgr</tt>'s scheduler has when choosing which
message can preempt the current one. Messages not in the active
queue are simply not considered at all.

</p>

<p>

The recipient limit complicates more things. First of all, the
message reading code must support reading the recipients in batches,
which among other things means accessing the queue file several
times and continuing where the last recipient batch ended. This is
invoked by the scheduler whenever the current job has space for more
recipients, subject to the transport's refill_limit and refill_delay
parameters. It is also done any time when all in-core recipients of
the message are dealt with (which may also mean they were deferred)
but there are still more in the queue file.

</p>
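<p>

One plausible reading of the refill rule above, written down as a
sketch (made-up names and structure; the actual decision in qmgr may
differ in detail):

</p>

<blockquote>
<pre>
/* Sketch only: one plausible reading of the refill rule above, with
 * made-up names; this is an illustration, not the actual qmgr logic. */
struct job {
    int  in_core_entries;       /* recipient entries currently in-core */
    int  recipient_slots;       /* recipient slots not yet filled */
    int  unread_recipients;     /* recipients still left in the queue file */
    long last_refill;           /* time when the last batch was read */
};

static int want_recipient_refill(const struct job *job, long now,
                                 int refill_limit, int refill_delay)
{
    if (job->unread_recipients == 0)    /* nothing left to read */
        return 0;
    if (job->in_core_entries == 0)      /* all in-core recipients dealt with */
        return 1;
    if (job->recipient_slots == 0)      /* no space for more recipients */
        return 0;
    /* There is space: read another batch once it is worth the queue file
     * access (refill_limit) or once we have waited long enough (refill_delay). */
    return job->recipient_slots >= refill_limit
        || now - job->last_refill >= refill_delay;
}
</pre>
</blockquote>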
<p>

The second complication is that with some recipients left unread
in the queue file, the scheduler can't operate with exact counts
of recipient entries. With unread recipients, it is not clear how
many recipient entries there will be, as they are subject to
per-destination grouping. It is not even clear to what transports
(and thus jobs) the recipients will be assigned. And with messages
coming from the deferred queue, it is not even clear how many unread
recipients are still to be delivered. This all means that the
scheduler must use only estimates of how many recipient entries
there will be. Fortunately, it is possible to estimate the minimum
and maximum correctly, so the scheduler can always err on the safe
side. Obviously, the better the estimates, the better the results,
so it is best when we are able to read all recipients in-core and
turn the estimates into exact counts, or at least try to read as
many as possible to make the estimates as accurate as possible.

</p>

<p>

The third complication is that it is no longer true that the scheduler
is done with a job once all of its in-core recipients are delivered.
It is possible that the job will be revived later, when another
batch of recipients is read in core. It is also possible that some
jobs will be created for the first time long after the first batch
of recipients was read in core. The <tt>nqmgr</tt> code must be
ready to handle all such situations.

</p>

<p>

And finally, the fourth complication is that the <tt>nqmgr</tt>
code must somehow impose the recipient limit itself. Now how does
it achieve it?

</p>

<p>

Perhaps the easiest solution would be to say that each message may
have at maximum X recipients stored in-core, but such a solution
would be poor for several reasons. With reasonable
qmgr_message_active_limit values, the X would have to be quite low
to maintain a reasonable memory footprint. And with low X lots of
things would not work well. The <tt>nqmgr</tt> would have problems
using the <i>transport</i>_destination_recipient_limit efficiently.
The scheduler's preemption would be suboptimal as the recipient
count estimates would be inaccurate. The message queue file would
have to be accessed many times to read in more recipients again and
again.

</p>

<p>

Therefore it seems reasonable to have a solution which does not use
a limit imposed on a per-message basis, but which maintains a pool
of available recipient slots which can be shared among all messages
in the most efficient manner. And as we do not want separate
transports to compete for resources whenever possible, it seems
appropriate to maintain such a recipient pool for each transport
separately. This is the general idea; now how does it work in
practice?

</p>

<p>

First we have to solve a little chicken-and-egg problem. If we want
to use the per-transport recipient pools, we first need to know to
what transport(s) the message is assigned. But we will find that
out only after we have read in the recipients. So it is obvious
that we first have to read in some recipients, use them to find out
to what transports the message is to be assigned, and only after
that can we use the per-transport recipient pools.

</p>

<p>

Now how many recipients shall we read for the first time? This is
what the qmgr_message_recipient_minimum and qmgr_message_recipient_limit
values control. The qmgr_message_recipient_minimum value specifies
how many recipients of each message we will read the first time,
no matter what. It is necessary to read at least one recipient
before we can assign the message to a transport and create the first
job. However, reading only qmgr_message_recipient_minimum recipients
even if there are only a few messages with few recipients in-core
would be wasteful. Therefore, if there are fewer than
qmgr_message_recipient_limit recipients in-core so far, the first
batch of recipients may be larger than qmgr_message_recipient_minimum
- as large as is required to reach the qmgr_message_recipient_limit
limit.

</p>

<p>

Once the first batch of recipients was read in core and the message's
jobs were created, the size of the subsequent recipient batches (if
any - of course it's best when all recipients are read in one batch)
is based solely on the position of the message's jobs on their
corresponding transports' job lists. Each transport has a pool of
<i>transport</i>_recipient_limit recipient slots which it can
distribute among its jobs (how this is done is described later).
The subsequent recipient batch may be as large as the sum of all
recipient slots of all jobs of the message permits (plus the
qmgr_message_recipient_minimum amount which always applies).

</p>

<p>

For example, if a message has three jobs, the first with 1 recipient
still in-core and 4 recipient slots, the second with 5 recipients
in-core and 5 recipient slots, and the third with 2 recipients in-core
and 0 recipient slots, it has 1+5+2=7 recipients in-core and 4+5+0=9
recipient slots across its jobs in total. This means that we could
immediately read 2+qmgr_message_recipient_minimum more recipients of
that message in core.

</p>
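<p>

The arithmetic of the example can be written down directly. This is
only an illustration of the rule above, with made-up function and
variable names:

</p>

<blockquote>
<pre>
/* Sketch only: the batch size arithmetic from the example above.
 * How many more recipients of a message may be read right now: whatever
 * the jobs' recipient slots allow beyond the recipients already in-core,
 * plus the qmgr_message_recipient_minimum amount on top. */
static int next_batch_size(const int *in_core, const int *slots, int njobs,
                           int recipient_minimum)
{
    int total_in_core = 0;
    int total_slots = 0;

    while (njobs-- > 0) {
        total_in_core += in_core[njobs];
        total_slots += slots[njobs];
    }
    return total_slots - total_in_core + recipient_minimum;
}

/*
 * For the example above: in_core = { 1, 5, 2 } and slots = { 4, 5, 0 },
 * so total_slots - total_in_core = 9 - 7 = 2, and the next batch may be
 * 2 + qmgr_message_recipient_minimum recipients large.
 */
</pre>
</blockquote>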
<p>

The above example illustrates several things which might be worth
mentioning explicitly: first, note that although the per-transport
slots are assigned to particular jobs, we can't guarantee that once
the next batch of recipients is read in core, the corresponding
amounts of recipients will be assigned to those jobs. The jobs lend
their slots to the message as a whole, so it is possible that some
jobs end up sponsoring other jobs of their message. For example,
if in the example above the 2 newly read recipients were assigned
to the second job, the first job sponsored the second job with 2
slots. The second notable thing is the third job, which has more
recipients in-core than it has slots. Apart from the sponsoring by
another job that we just saw, this can be the result of the first
recipient batch, which is sponsored from the global recipient pool
of qmgr_message_recipient_limit recipients. It can also be sponsored
from the message recipient pool of qmgr_message_recipient_minimum
recipients.

</p>

<p>

Now how does each transport distribute the recipient slots among
its jobs? The strategy is quite simple. As most scheduler activity
happens at the head of the job list, it is our intention to make
sure that the scheduler has the best estimates of the recipient
counts for those jobs. As we mentioned above, this means that we
want to try to make sure that the messages of those jobs have all
recipients read in-core. Therefore the transport distributes the
slots "along" the job list from start to end. In this case the job
list sorted by message enqueue time is used, because it doesn't
change over time as the scheduler's job list does.

</p>

<p>

More specifically, each time a job is created and appended to the
job list, it gets all unused recipient slots from its transport's
pool. It keeps them until all recipients of its message are read.
When this happens, all unused recipient slots are transferred to
the next job (which is now in fact the first such job) on the job
list which still has some recipients unread, or eventually back to
the transport pool if there is no such job. Such a transfer then
also happens whenever a recipient entry of that job is delivered.

</p>
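<p>

Written as a sketch (again with made-up names, not the actual qmgr
code), the slot movement described above looks roughly like this:

</p>

<blockquote>
<pre>
/* Sketch only: made-up names, not the actual qmgr code. Jobs are kept on
 * a list sorted by message enqueue time; a job is "unread" while its
 * message still has recipients left in the queue file. */
struct job;

struct transport {
    int         pool_slots;     /* recipient slots not held by any job */
    struct job *jobs_by_time;   /* job list sorted by message enqueue time */
};

struct job {
    struct transport *transport;
    struct job       *time_next;        /* next job in enqueue time order */
    int               slots;            /* recipient slots held by this job */
    int               unread_recipients;
};

/* A job appended to the end of the list takes all unused slots from its
 * transport's pool. */
static void job_created(struct job *job)
{
    job->slots += job->transport->pool_slots;
    job->transport->pool_slots = 0;
}

/* Pass unused slots on to the first unread job further down the list, or
 * back to the transport pool if there is none. Per the text above, this
 * happens when the job's message becomes completely read, and again each
 * time one of the job's recipient entries is delivered. */
static void job_pass_on_slots(struct job *job, int count)
{
    struct job *next;

    for (next = job->time_next; next; next = next->time_next)
        if (next->unread_recipients > 0)
            break;

    job->slots -= count;
    if (next)
        next->slots += count;
    else
        job->transport->pool_slots += count;
}
</pre>
</blockquote>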
<p>

There is also a scenario when a job is not appended to the end of
the job list (for example, it was created as a result of a second
or later recipient batch). Then it works exactly as above, except
that if it was put in front of the first unread job (that is, the
job of a message which still has some unread recipients in the queue
file), that job is first forced to return all of its unused recipient
slots to the transport pool.

</p>

<p>

The algorithm just described leads to the following state: the first
unread job on the job list always gets all the remaining recipient
slots of that transport (if there are any). The jobs queued before
this job are completely read (that is, all recipients of their
message were already read in core) and have at maximum as many slots
as they still have recipients in-core (the maximum is there because
of the sponsoring mentioned before), and the jobs after this job get
nothing from the transport recipient pool (unless they got something
before and the first unread job was then created and enqueued in
front of them later - in such a case they also get at maximum as
many slots as they have recipients in-core).

</p>

<p>

Things work fine in such a state for most of the time, because the
current job is either completely read in-core or has as many recipient
slots as there are available, but there is one situation which we
still have to take care of specially. Imagine that the current job
is preempted by some unread job from the job list and there are no
more recipient slots available, so this new current job could read
only batches of qmgr_message_recipient_minimum recipients at a time.
This would really degrade performance. For this reason, each transport
has an extra pool of <i>transport</i>_extra_recipient_limit recipient
slots, dedicated exactly to this situation. Each time an unread
job preempts the current job, it gets half of the remaining recipient
slots from the normal pool and this extra pool.

</p>

<p>

And that's it. It sure does sound pretty complicated, but fortunately
most people don't really have to care how exactly it works as long
as it works. Perhaps the only important things to know for most
people are the following upper bound formulas:

</p>

<p>

Each transport has at maximum

</p>

<blockquote>
<pre>
max(
qmgr_message_recipient_minimum * qmgr_message_active_limit
+ *_recipient_limit + *_extra_recipient_limit,
qmgr_message_recipient_limit
)
</pre>
</blockquote>

<p>

recipients in core.

</p>

<p>

The total amount of recipients in core is

</p>

<blockquote>
<pre>
max(
qmgr_message_recipient_minimum * qmgr_message_active_limit
+ sum( *_recipient_limit + *_extra_recipient_limit ),
qmgr_message_recipient_limit
)
</pre>
</blockquote>

<p>

where the sum is over all used transports.

</p>
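<p>

To make the formulas concrete, here is a small worked example. The
parameter values are purely illustrative; they are neither defaults
nor recommendations:

</p>

<blockquote>
<pre>
/* Worked example of the bounds above; all values are illustrative
 * only, not defaults or recommendations. */
static int max(int a, int b)
{
    return a > b ? a : b;
}

int main(void)
{
    int recipient_minimum = 10;     /* qmgr_message_recipient_minimum */
    int active_limit = 1000;        /* qmgr_message_active_limit */
    int recipient_limit = 20000;    /* qmgr_message_recipient_limit */

    /* A hypothetical "smtp" transport with smtp_recipient_limit = 10000
     * and smtp_extra_recipient_limit = 1000: */
    int per_transport = max(recipient_minimum * active_limit + 10000 + 1000,
                            recipient_limit);   /* max(21000, 20000) = 21000 */

    /* Add a second transport with 1000 + 100 slots, so the sum over all
     * transports is 11000 + 1100, and the overall bound becomes: */
    int total = max(recipient_minimum * active_limit + 11000 + 1100,
                    recipient_limit);           /* max(22100, 20000) = 22100 */

    (void) per_transport;
    (void) total;
    return 0;
}
</pre>
</blockquote>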
<p>

And this terribly complicated chapter concludes the documentation
of the <tt>nqmgr</tt> scheduler.

</p>

<p>

[By now you should theoretically know the <tt>nqmgr</tt> scheduler
inside out. In practice, you still hope that you will never have
to really understand the last one or two chapters completely, and
fortunately most people really won't. Understanding how the scheduler
works in ideal conditions is more than good enough for the vast
majority of users.]

</p>

<h2> <a name="credits"> Credits </a> </h2>

<ul>

<li> Wietse Venema designed and implemented the initial queue manager
with per-domain FIFO scheduling, and per-delivery +/-1 concurrency
feedback.

<li> Patrik Rak designed and implemented preemption where mail with
fewer recipients can slip past mail with more recipients in a
controlled manner, and wrote up its documentation.

<li> Wietse Venema initiated a discussion with Patrik Rak and Victor
Duchovni on alternatives for the +/-1 feedback scheduler's aggressive
behavior. This is when K/N feedback was reviewed (N = concurrency).
The discussion ended without a good solution for both negative
feedback and dead site detection.

<li> Victor Duchovni resumed work on concurrency feedback in the
context of concurrency-limited servers.

<li> Wietse Venema then re-designed the concurrency scheduler in
terms of the simplest possible concepts: less-than-1 concurrency
feedback per delivery, forward and reverse concurrency feedback
hysteresis, and pseudo-cohort failure. At this same time, concurrency
feedback was separated from dead site detection.

<li> These simplifications, and their modular implementation, helped
to develop further insights into the different roles that positive
and negative concurrency feedback play, and helped to identify some
worst-case scenarios.

</ul>

</body>

</html>