1<!doctype html public "-//W3C//DTD HTML 4.01 Transitional//EN"
2	"http://www.w3.org/TR/html4/loose.dtd">
3
4<html>
5
6<head>
7
8<title>Postfix Queue Scheduler</title>
9
10<meta http-equiv="Content-Type" content="text/html; charset=us-ascii">
11
12</head>
13
14<body>
15
16<h1><img src="postfix-logo.jpg" width="203" height="98" ALT="">Postfix
17Queue Scheduler</h1>
18
19<hr>
20
21<h2> Disclaimer </h2>
22
23<p> Many of the <i>transport</i>-specific configuration parameters
24discussed in this document will not show up in "postconf" command
25output before Postfix version 2.9. This limitation applies to many
26parameters whose name is a combination of a master.cf service name
27such as "relay" and a built-in suffix such as
28"_destination_concurrency_limit". </p>
29
30<h2> Overview </h2>
31
32<p> The queue manager is by far the most complex part of the Postfix
33mail system. It schedules delivery of new mail, retries failed
34deliveries at specific times, and removes mail from the queue after
35the last delivery attempt.  There are two major classes of mechanisms
36that control the operation of the queue manager. </p>
37
38<p> Topics covered by this document: </p>
39
40<ul>
41
42<li> <a href="#concurrency"> Concurrency scheduling</a>, concerned
43with the number of concurrent deliveries to a specific destination,
44including decisions on when to suspend deliveries after persistent
45failures.
46
47<li> <a href="#jobs"> Preemptive scheduling</a>, concerned with
48the selection of email messages and recipients for a given destination.
49
50<li> <a href="#credits"> Credits</a>, something this document would not be
51complete without.
52
53</ul>
54
55<!--
56
57<p> Once started, the qmgr(8) process runs until "postfix reload"
58or "postfix stop".  As a persistent process, the queue manager has
59to meet strict requirements with respect to code correctness and
60robustness. Unlike non-persistent daemon processes, the queue manager
61cannot benefit from Postfix's process rejuvenation mechanism that
62limit the impact from resource leaks and other coding errors
63(translation: replacing a process after a short time covers up bugs
64before they can become a problem).  </p>
65
66-->
67
68<h2> <a name="concurrency"> Concurrency scheduling </a> </h2>
69
70<p> The following sections document the Postfix 2.5 concurrency
71scheduler, after a discussion of the limitations of the earlier
72concurrency scheduler. This is followed by results of medium-concurrency
73experiments, and a discussion of trade-offs between performance and
74robustness.  </p>
75
76<p> The material is organized as follows: </p>
77
78<ul>
79
80<li> <a href="#concurrency_drawbacks"> Drawbacks of the existing
81concurrency scheduler </a>
82
83<li> <a href="#concurrency_summary_2_5"> Summary of the Postfix 2.5
84concurrency feedback algorithm </a>
85
86<li> <a href="#dead_summary_2_5"> Summary of the Postfix 2.5 "dead
87destination" detection algorithm </a>
88
89<li> <a href="#pseudo_code_2_5"> Pseudocode for the Postfix 2.5
90concurrency scheduler </a>
91
92<li> <a href="#concurrency_results"> Results for delivery to
93concurrency limited servers </a>
94
95<li> <a href="#concurrency_discussion"> Discussion of concurrency
96limited server results </a>
97
98<li> <a href="#concurrency_limitations"> Limitations of less-than-1
99per delivery feedback </a>
100
101<li> <a href="#concurrency_config"> Concurrency configuration
102parameters </a>
103
104</ul>
105
106<h3> <a name="concurrency_drawbacks"> Drawbacks of the existing
107concurrency scheduler </a> </h3>
108
109<p> From the start, Postfix has used a simple but robust algorithm
110where the per-destination delivery concurrency is decremented by 1
111after delivery failed due to connection or handshake failure, and
112incremented by 1 otherwise.  Of course the concurrency is never
113allowed to exceed the maximum per-destination concurrency limit.
114And when a destination's concurrency level drops to zero, the
115destination is declared "dead" and delivery is suspended.  </p>
116
<p> Drawbacks of +/-1 concurrency feedback per delivery are: </p>
118
119<ul>
120
121<li> <p> Overshoot due to exponential delivery concurrency growth
122with each pseudo-cohort(*). This can be an issue with high-concurrency
123channels. For example, with the default initial concurrency of 5,
124concurrency would proceed over time as (5-10-20).  </p>
125
126<li> <p> Throttling down to zero concurrency after a single
127pseudo-cohort(*) failure. This was especially an issue with
128low-concurrency channels where a single failure could be sufficient
129to mark a destination as "dead", causing the suspension of further
130deliveries to the affected destination. </p>
131
132</ul>
133
134<p> (*) A pseudo-cohort is a number of delivery requests equal to
135a destination's delivery concurrency. </p>
136
137<p> The revised concurrency scheduler has a highly modular structure.
138It uses separate mechanisms for per-destination concurrency control
139and for "dead destination" detection.  The concurrency control in
140turn is built from two separate mechanisms: it supports less-than-1
141feedback per delivery to allow for more gradual concurrency
142adjustments, and it uses feedback hysteresis to suppress concurrency
143oscillations.  And instead of waiting for delivery concurrency to
144throttle down to zero, a destination is declared "dead" after a
145configurable number of pseudo-cohorts reports connection or handshake
146failure.  </p>
147
148<h3> <a name="concurrency_summary_2_5"> Summary of the Postfix 2.5
149concurrency feedback algorithm </a> </h3>
150
151<p> We want to increment a destination's delivery concurrency when
152some (not necessarily consecutive) number of deliveries complete
153without connection or handshake failure.  This is implemented with
154positive feedback g(N) where N is the destination's delivery
155concurrency.  With g(N)=1 feedback per delivery, concurrency increases
156by 1 after each positive feedback event; this gives us the old
157scheduler's exponential growth in time. With g(N)=1/N feedback per
158delivery, concurrency increases by 1 after an entire pseudo-cohort
159N of positive feedback reports; this gives us linear growth in time.
160Less-than-1 feedback per delivery and integer truncation naturally
161give us hysteresis, so that transitions to larger concurrency happen
162every 1/g(N) positive feedback events.  </p>
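
<p> To make the hysteresis concrete with a small worked example
(simple arithmetic on the definitions above, not additional material
from the measurements below): with g(N)=1/N and a current concurrency
of 5, each "good" delivery adds 1/5 to the feedback accumulator, so
5 such deliveries are needed to reach concurrency 6, then 6 more to
reach 7, and so on; with g(N)=1, the same climb from 5 to 7 takes
only 2 good deliveries. </p>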
163
164<p> We want to decrement a destination's delivery concurrency when
165some (not necessarily consecutive) number of deliveries complete
166after connection or handshake failure.  This is implemented with
167negative feedback f(N) where N is the destination's delivery
168concurrency.  With f(N)=1 feedback per delivery, concurrency decreases
169by 1 after each negative feedback event; this gives us the old
170scheduler's behavior where concurrency is throttled down dramatically
171after a single pseudo-cohort failure.  With f(N)=1/N feedback per
172delivery, concurrency backs off more gently.  Again, less-than-1
173feedback per delivery and integer truncation naturally give us
174hysteresis, so that transitions to lower concurrency happen every
1751/f(N) negative feedback events.  </p>
176
177<p> However, with negative feedback we introduce a subtle twist.
178We "reverse" the negative hysteresis cycle so that the transition
179to lower concurrency happens at the <b>beginning</b> of a sequence
180of 1/f(N) negative feedback events.  Otherwise, a correction for
181overload would be made too late.  This makes the choice of f(N)
182relatively unimportant, as borne out by measurements later in this
183document.  </p>
184
185<p> In summary, the main ingredients for the Postfix 2.5 concurrency
186feedback algorithm are a) the option of less-than-1 positive feedback
187per delivery to avoid overwhelming servers, b) the option of
188less-than-1 negative feedback per delivery to avoid giving up too
189fast, c) feedback hysteresis to avoid rapid oscillation, and d) a
190"reverse" hysteresis cycle for negative feedback, so that it can
191correct for overload quickly.  </p>
192
193<h3> <a name="dead_summary_2_5"> Summary of the Postfix 2.5 "dead destination" detection algorithm </a> </h3>
194
195<p> We want to suspend deliveries to a specific destination after
196some number of deliveries suffers connection or handshake failure.
197The old scheduler declares a destination "dead" when negative (-1)
198feedback throttles the delivery concurrency down to zero. With
199less-than-1 feedback per delivery, this throttling down would
200obviously take too long.  We therefore have to separate "dead
201destination" detection from concurrency feedback.  This is implemented
202by introducing the concept of pseudo-cohort failure. The Postfix
2032.5 concurrency scheduler declares a destination "dead" after a
204configurable number of pseudo-cohorts suffers from connection or
205handshake failures. The old scheduler corresponds to the special
206case where the pseudo-cohort failure limit is equal to 1.  </p>
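
<p> As a worked example (again just arithmetic on the definitions
above): with the backwards-compatible pseudo-cohort failure limit
of 1, a destination is declared "dead" after roughly one full
pseudo-cohort of connection or handshake failures without an
intervening success, i.e. roughly as many consecutive failures as
the current delivery concurrency. </p>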
207
208<h3> <a name="pseudo_code_2_5"> Pseudocode for the Postfix 2.5 concurrency scheduler </a> </h3>
209
<p> The pseudocode below shows how the ideas behind the new concurrency
scheduler are implemented as of November 2007.  The actual code can
be found in the module qmgr/qmgr_queue.c.  </p>
213
214<pre>
215Types:
216        Each destination has one set of the following variables
217        int concurrency
218        double success
219        double failure
220        double fail_cohorts
221
222Feedback functions:
223        N is concurrency; x, y are arbitrary numbers in [0..1] inclusive
224        positive feedback: g(N) = x/N | x/sqrt(N) | x
225        negative feedback: f(N) = y/N | y/sqrt(N) | y
226
227Initialization:
228        concurrency = initial_concurrency
229        success = 0
230        failure = 0
231        fail_cohorts = 0
232
233After success:
234        fail_cohorts = 0
235        Be prepared for feedback &gt; hysteresis, or rounding error
236        success += g(concurrency)
237        while (success >= 1)            Hysteresis 1
238            concurrency += 1            Hysteresis 1
239            failure = 0
240            success -= 1                Hysteresis 1
241        Be prepared for overshoot
242        if (concurrency &gt; concurrency limit)
243            concurrency = concurrency limit
244
245Safety:
246        Don't apply positive feedback unless
247            concurrency &lt; busy_refcount + init_dest_concurrency
248        otherwise negative feedback effect could be delayed
249
250After failure:
251        if (concurrency &gt; 0)
252            fail_cohorts += 1.0 / concurrency
253            if (fail_cohorts &gt; cohort_failure_limit)
254                concurrency = 0
255        if (concurrency &gt; 0)
256            Be prepared for feedback &gt; hysteresis, rounding errors
257            failure -= f(concurrency)
258            while (failure &lt; 0)
259                concurrency -= 1        Hysteresis 1
260                failure += 1            Hysteresis 1
261                success = 0
262            Be prepared for overshoot
263            if (concurrency &lt; 1)
264                concurrency = 1
265</pre>
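
<p> For readers who prefer runnable code over pseudocode, the
following is a small Python sketch of the same feedback logic (an
illustration written for this text, not the actual qmgr/qmgr_queue.c
implementation; the variable names follow the pseudocode above, and
the "Safety" check is omitted). It shows the positive and negative
hysteresis cycles and the pseudo-cohort failure counting. </p>

<pre>
# Illustrative model of the Postfix 2.5 concurrency feedback algorithm.
# Simplified sketch based on the pseudocode above; not the real C code.

class Destination:
    def __init__(self, init_concurrency, limit, g, f, cohort_failure_limit):
        self.concurrency = init_concurrency   # current delivery concurrency
        self.limit = limit                    # maximum delivery concurrency
        self.success = 0.0                    # positive feedback accumulator
        self.failure = 0.0                    # negative feedback accumulator
        self.fail_cohorts = 0.0               # failed pseudo-cohort counter
        self.g, self.f = g, f                 # feedback functions g(N), f(N)
        self.cohort_failure_limit = cohort_failure_limit

    def deliver_ok(self):                     # no connection/handshake failure
        self.fail_cohorts = 0.0
        self.success += self.g(self.concurrency)
        while self.success &gt;= 1:              # hysteresis 1
            self.concurrency += 1
            self.failure = 0.0
            self.success -= 1
        self.concurrency = min(self.concurrency, self.limit)

    def deliver_failed(self):                 # connection/handshake failure
        if self.concurrency &gt; 0:
            self.fail_cohorts += 1.0 / self.concurrency
            if self.fail_cohorts &gt; self.cohort_failure_limit:
                self.concurrency = 0          # destination declared "dead"
                return
            self.failure -= self.f(self.concurrency)
            while self.failure &lt; 0:           # "reverse" hysteresis cycle
                self.concurrency -= 1
                self.failure += 1
                self.success = 0.0
            self.concurrency = max(self.concurrency, 1)

# Example: 1/N feedback in both directions, starting at concurrency 5.
d = Destination(5, 20, lambda n: 1.0 / n, lambda n: 1.0 / n, 1)
for _ in range(15):
    d.deliver_ok()
print(d.concurrency)   # grows roughly linearly over time: 5 -&gt; 7
</pre>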
266
267<h3> <a name="concurrency_results"> Results for delivery to concurrency-limited servers </a> </h3>
268
<p> Discussions about the concurrency scheduler redesign started in
early 2004, when the primary goal was to find alternatives that did
271not exhibit exponential growth or rapid concurrency throttling.  No
272code was implemented until late 2007, when the primary concern had
273shifted towards better handling of server concurrency limits. For
274this reason we measure how well the new scheduler does this
275job.  The table below compares mail delivery performance of the old
276+/-1 feedback per delivery with several less-than-1 feedback
277functions, for different limited-concurrency server scenarios.
278Measurements were done with a FreeBSD 6.2 client and with FreeBSD
2796.2 and various Linux servers.  </p>
280
281<p> Server configuration: </p>
282
283<ul> <li> The mail flow was slowed down with 1 second latency per
284recipient ("smtpd_client_restrictions = sleep 1"). The purpose was
285to make results less dependent on hardware details, by avoiding
286slow-downs by queue file I/O, logging I/O, and network I/O.
287
288<li> Concurrency was limited by the server process limit
289("default_process_limit = 5" and "smtpd_client_event_limit_exceptions
290= static:all"). Postfix was stopped and started after changing the
291process limit, because the same number is also used as the backlog
292argument to the listen(2) system call, and "postfix reload" does
293not re-issue this call.
294
295<li> Mail was discarded with "local_recipient_maps = static:all" and
296"local_transport = discard". The discard action in access maps or
297header/body checks
298could not be used as it fails to update the in_flow_delay counters.
299
300</ul>
301
302<p> Client configuration: </p>
303
304<ul>
305
306<li> Queue file overhead was minimized by sending one message to a
307virtual alias that expanded into 2000 different remote recipients.
308All recipients were accounted for according to the maillog file.
309The virtual_alias_expansion_limit setting was increased to avoid
310complaints from the cleanup(8) server.
311
312<li> The number of deliveries was maximized with
313"smtp_destination_recipient_limit = 2". A smaller limit would cause
314Postfix to schedule the concurrency per recipient instead of domain,
315which is not what we want.
316
317<li> Maximum concurrency was limited with
318"smtp_destination_concurrency_limit = 20", and
319initial_destination_concurrency was set to the same value.
320
321<li> The positive and negative concurrency feedback hysteresis was
3221.  Concurrency was incremented by 1 at the END of 1/feedback steps
323of positive feedback, and was decremented by 1 at the START of
3241/feedback steps of negative feedback.
325
326<li> The SMTP client used the default 30s SMTP connect timeout and
327300s SMTP greeting timeout.
328
329</ul>
330
331<h4> Impact of the 30s SMTP connect timeout </h4>
332
333<p> The first results are for a FreeBSD 6.2 server, where our
334artificially low listen(2) backlog results in a very short kernel
335queue for established connections. The table shows that all deferred
336deliveries failed due to a 30s connection timeout, and none failed
337due to a server greeting timeout.  This measurement simulates what
338happens when the server's connection queue is completely full under
339load, and the TCP engine drops new connections.  </p>
340
341<blockquote>
342
343<table>
344
345<tr> <th>client<br> limit</th> <th>server<br> limit</th> <th>feedback<br>
346style</th> <th>connection<br> caching</th> <th>percentage<br>
347deferred</th> <th colspan="2">client concurrency<br> average/stddev</th>
348<th colspan=2>timed-out in<br> connect/greeting </th> </tr>
349
350<tr> <td align="center" colspan="9"> <hr> </td> </tr>
351
352<tr><td align="center">20</td> <td align="center">5</td> <td
353align="center">1/N</td> <td align="center">no</td> <td
354align="center">9.9</td> <td align="center">19.4</td> <td
355align="center">0.49</td> <td align="center">198</td> <td
356align="center">-</td> </tr>
357
358<tr><td align="center">20</td> <td align="center">5</td> <td
359align="center">1/N</td> <td align="center">yes</td> <td
360align="center">10.3</td> <td align="center">19.4</td> <td
361align="center">0.49</td> <td align="center">206</td> <td
362align="center">-</td> </tr>
363
364<tr><td align="center">20</td> <td align="center">5</td> <td
365align="center">1/sqrt(N)</td> <td align="center">no</td>
366<td align="center">10.4</td> <td align="center">19.6</td> <td
367align="center">0.59</td> <td align="center">208</td> <td
368align="center">-</td> </tr>
369
370<tr><td align="center">20</td> <td align="center">5</td> <td
371align="center">1/sqrt(N)</td> <td align="center">yes</td>
372<td align="center">10.6</td> <td align="center">19.6</td> <td
373align="center">0.61</td> <td align="center">212</td> <td
374align="center">-</td> </tr>
375
376<tr><td align="center">20</td> <td align="center">5</td> <td
377align="center">1</td> <td align="center">no</td> <td
378align="center">10.1</td> <td align="center">19.5</td> <td
379align="center">1.29</td> <td align="center">202</td> <td
380align="center">-</td> </tr>
381
382<tr><td align="center">20</td> <td align="center">5</td> <td
383align="center">1</td> <td align="center">yes</td> <td
384align="center">10.8</td> <td align="center">19.3</td> <td
385align="center">1.57</td> <td align="center">216</td> <td
386align="center">-</td> </tr>
387
388<tr> <td align="center" colspan="9"> <hr> </td> </tr>
389
390</table>
391
392<p> A busy server with a completely full connection queue.  N is
393the client delivery concurrency.  Failed deliveries time out after
39430s without completing the TCP handshake. See text for a discussion
395of results. </p>
396
397</blockquote>
398
399<h4> Impact of the 300s SMTP greeting timeout </h4>
400
401<p> The next table shows results for a Fedora Core 8 server (results
402for RedHat 7.3 are identical). In this case, the artificially small
403listen(2) backlog argument does not impact our measurement.  The
404table shows that practically all deferred deliveries fail after the
405300s SMTP greeting timeout. As these timeouts were 10x longer than
406with the first measurement, we increased the recipient count (and
407thus the running time) by a factor of 10 to keep the results
408comparable. The deferred mail percentages are a factor 10 lower
409than with the first measurement, because the 1s per-recipient delay
410was 1/300th of the greeting timeout instead of 1/30th of the
411connection timeout.  </p>
412
413<blockquote>
414
415<table>
416
417<tr> <th>client<br> limit</th> <th>server<br> limit</th> <th>feedback<br>
418style</th> <th>connection<br> caching</th> <th>percentage<br>
419deferred</th> <th colspan="2">client concurrency<br> average/stddev</th>
420<th colspan=2>timed-out in<br> connect/greeting </th> </tr>
421
422<tr> <td align="center" colspan="9"> <hr> </td> </tr>
423
424<tr> <td align="center">20</td> <td align="center">5</td> <td
425align="center">1/N</td> <td align="center">no</td> <td
426align="center">1.16</td> <td align="center">19.8</td> <td
427align="center">0.37</td> <td align="center">-</td> <td
428align="center">230</td> </tr>
429
430<tr> <td align="center">20</td> <td align="center">5</td> <td
431align="center">1/N</td> <td align="center">yes</td> <td
432align="center">1.36</td> <td align="center">19.8</td> <td
433align="center">0.36</td> <td align="center">-</td> <td
434align="center">272</td> </tr>
435
436<tr> <td align="center">20</td> <td align="center">5</td> <td
437align="center">1/sqrt(N)</td> <td align="center">no</td>
438<td align="center">1.21</td> <td align="center">19.9</td> <td
439align="center">0.23</td> <td align="center">4</td> <td
440align="center">238</td> </tr>
441
442<tr> <td align="center">20</td> <td align="center">5</td> <td
443align="center">1/sqrt(N)</td> <td align="center">yes</td>
444<td align="center">1.36</td> <td align="center">20.0</td> <td
445align="center">0.23</td> <td align="center">-</td> <td
446align="center">272</td> </tr>
447
448<tr> <td align="center">20</td> <td align="center">5</td> <td
449align="center">1</td> <td align="center">no</td> <td
450align="center">1.18</td> <td align="center">20.0</td> <td
451align="center">0.16</td> <td align="center">-</td> <td
452align="center">236</td> </tr>
453
454<tr> <td align="center">20</td> <td align="center">5</td> <td
455align="center">1</td> <td align="center">yes</td> <td
456align="center">1.39</td> <td align="center">20.0</td> <td
457align="center">0.16</td> <td align="center">-</td> <td
458align="center">278</td> </tr>
459
460<tr> <td align="center" colspan="9"> <hr> </td> </tr>
461
462</table>
463
464<p> A busy server with a non-full connection queue.  N is the client
465delivery concurrency. Failed deliveries complete at the TCP level,
466but time out after 300s while waiting for the SMTP greeting.  See
467text for a discussion of results.  </p>
468
469</blockquote>
470
471<h4> Impact of active server concurrency limiter </h4>
472
473<p> The final concurrency-limited result shows what happens when
474SMTP connections don't time out, but are rejected immediately with
475the Postfix server's smtpd_client_connection_count_limit feature
476(the server replies with a 421 status and disconnects immediately).
477Similar results can be expected with concurrency limiting features
478built into other MTAs or firewalls.  For this measurement we specified
479a server concurrency limit and a client initial destination concurrency
480of 5, and a server process limit of 10; all other conditions were
481the same as with the first measurement. The same result would be
482obtained with a FreeBSD or Linux server, because the "pushing back"
483is done entirely by the receiving side. </p>
484
485<blockquote>
486
487<table>
488
489<tr> <th>client<br> limit</th> <th>server<br> limit</th> <th>feedback<br>
490style</th> <th>connection<br> caching</th> <th>percentage<br>
491deferred</th> <th colspan="2">client concurrency<br> average/stddev</th>
492<th>theoretical<br>defer rate</th> </tr>
493
494<tr> <td align="center" colspan="9"> <hr> </td> </tr>
495
496<tr> <td align="center">20</td> <td align="center">5</td> <td
497align="center">1/N</td> <td align="center">no</td> <td
498align="center">16.5</td> <td align="center">5.17</td> <td
499align="center">0.38</td> <td align="center">1/6</td> </tr>
500
501<tr> <td align="center">20</td> <td align="center">5</td> <td
502align="center">1/N</td> <td align="center">yes</td> <td
503align="center">16.5</td> <td align="center">5.17</td> <td
504align="center">0.38</td> <td align="center">1/6</td> </tr>
505
506<tr> <td align="center">20</td> <td align="center">5</td> <td
507align="center">1/sqrt(N)</td> <td align="center">no</td>
508<td align="center">24.5</td> <td align="center">5.28</td> <td
509align="center">0.45</td> <td align="center">1/4</td> </tr>
510
511<tr> <td align="center">20</td> <td align="center">5</td> <td
512align="center">1/sqrt(N)</td> <td align="center">yes</td>
513<td align="center">24.3</td> <td align="center">5.28</td> <td
514align="center">0.46</td> <td align="center">1/4</td> </tr>
515
516<tr> <td align="center">20</td> <td align="center">5</td> <td
517align="center">1</td> <td align="center">no</td> <td
518align="center">49.7</td> <td align="center">5.63</td> <td
519align="center">0.67</td> <td align="center">1/2</td> </tr>
520
521<tr> <td align="center">20</td> <td align="center">5</td> <td
522align="center">1</td> <td align="center">yes</td> <td
523align="center">49.7</td> <td align="center">5.68</td> <td
524align="center">0.70</td> <td align="center">1/2</td> </tr>
525
526<tr> <td align="center" colspan="9"> <hr> </td> </tr>
527
528</table>
529
530<p> A server with active per-client concurrency limiter that replies
531with 421 and disconnects.  N is the client delivery concurrency.
532The theoretical defer rate is 1/(1+roundup(1/feedback)).  This is
533always 1/2 with the fixed +/-1 feedback per delivery; with the
534concurrency-dependent feedback variants, the defer rate decreases
535with increasing concurrency. See text for a discussion of results.
536</p>
537
538</blockquote>
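
<p> As a quick check of the formula above (simple arithmetic, not
additional measurement data): with the server limiting the effective
concurrency to about 5, the 1/N feedback style needs
roundup(1/feedback) = 5 good deliveries per upward concurrency step,
giving a theoretical defer rate of 1/(1+5) = 1/6; the 1/sqrt(N)
style needs roundup(sqrt(5)) = 3, giving 1/4; and the fixed +/-1
style needs 1, giving 1/2. These are the values shown in the last
column of the table. </p>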
539
540<h3> <a name="concurrency_discussion"> Discussion of concurrency-limited server results </a> </h3>
541
<p> All results in the previous sections are based on the first
delivery runs only; they do not include any second or later delivery
attempts. It's also worth noting that the measurements look at
545steady-state behavior only. They don't show what happens when the
546client starts sending at a much higher or lower concurrency.
547</p>
548
549<p> The first two examples show that the effect of feedback
550is negligible when concurrency is limited due to congestion. This
551is because the initial concurrency is already at the client's
552concurrency maximum, and because there is 10-100 times more positive
553than negative feedback.  Under these conditions, it is no surprise
554that the contribution from SMTP connection caching is also negligible.
555</p>
556
557<p> In the last example, the old +/-1 feedback per delivery will
558defer 50% of the mail when confronted with an active (anvil-style)
559server concurrency limit, where the server hangs up immediately
560with a 421 status (a TCP-level RST would have the same result).
561Less aggressive feedback mechanisms fare better than more aggressive
562ones.  Concurrency-dependent feedback fares even better at higher
563concurrencies than shown here, but has limitations as discussed in
564the next section.  </p>
565
566<h3> <a name="concurrency_limitations"> Limitations of less-than-1 per delivery feedback </a> </h3>
567
568<p> Less-than-1 feedback is of interest primarily when sending large
569amounts of mail to destinations with active concurrency limiters
570(servers that reply with 421, or firewalls that send RST).  When
571sending small amounts of mail per destination, less-than-1 per-delivery
572feedback won't have a noticeable effect on the per-destination
573concurrency, because the number of deliveries to the same destination
574is too small. You might just as well use zero per-delivery feedback
575and stay with the initial per-destination concurrency. And when
576mail deliveries fail due to congestion instead of active concurrency
577limiters, the measurements above show that per-delivery feedback
578has no effect.  With large amounts of mail you might just as well
579use zero per-delivery feedback and start with the maximal per-destination
580concurrency.  </p>
581
582<p> The scheduler with less-than-1 concurrency
583feedback per delivery solves a problem with servers that have active
584concurrency limiters.  This works only because feedback is handled
585in a peculiar manner: positive feedback will increment the concurrency
586by 1 at the <b>end</b> of a sequence of events of length 1/feedback,
587while negative feedback will decrement concurrency by 1 at the
588<b>beginning</b> of such a sequence.  This is how Postfix adjusts
589quickly for overshoot without causing lots of mail to be deferred.
590Without this difference in feedback treatment, less-than-1 feedback
591per delivery would defer 50% of the mail, and would be no better
592in this respect than the old +/-1 feedback per delivery.  </p>
593
594<p> Unfortunately, the same feature that corrects quickly for
concurrency overshoot also makes the scheduler more sensitive to
596noisy negative feedback.  The reason is that one lonely negative
597feedback event has the same effect as a complete sequence of length
5981/feedback: in both cases delivery concurrency is dropped by 1
599immediately.  As a worst-case scenario, consider multiple servers
600behind a load balancer on a single IP address, and no backup MX
601address.  When 1 out of K servers fails to complete the SMTP handshake
602or drops the connection, a scheduler with 1/N (N = concurrency)
603feedback stops increasing its concurrency once it reaches a concurrency
604level of about K,  even though the good servers behind the load
605balancer are perfectly capable of handling more traffic. </p>
606
607<p> This noise problem gets worse as the amount of positive feedback
608per delivery gets smaller.  A compromise is to use fixed less-than-1
609positive feedback values instead of concurrency-dependent positive
610feedback.  For example, to tolerate 1 of 4 bad servers in the above
611load balancer scenario, use positive feedback of 1/4 per "good"
612delivery (no connect or handshake error), and use an equal or smaller
613amount of negative feedback per "bad" delivery.  The downside of
614using concurrency-independent feedback is that some of the old +/-1
615feedback problems will return at large concurrencies.  Sites that
616must deliver mail at non-trivial per-destination concurrencies will
617require special configuration.  </p>
618
619<h3> <a name="concurrency_config"> Concurrency configuration parameters </a> </h3>
620
621<p> The Postfix 2.5 concurrency scheduler is controlled with the
622following configuration parameters, where "<i>transport</i>_foo"
623provides a transport-specific parameter override.  All parameter
624default settings are compatible with earlier Postfix versions. </p>
625
626<blockquote>
627
628<table border="0">
629
630<tr> <th> Parameter name </th> <th> Postfix version </th> <th>
631Description </th> </tr>
632
633<tr> <td colspan="3"> <hr> </td> </tr>
634
635<tr> <td> initial_destination_concurrency<br>
636<i>transport</i>_initial_destination_concurrency </td> <td
637align="center"> all<br> 2.5 </td> <td> Initial per-destination
638delivery concurrency </td> </tr>
639
640<tr> <td> default_destination_concurrency_limit<br>
641<i>transport</i>_destination_concurrency_limit </td> <td align="center">
642all<br> all </td> <td> Maximum per-destination delivery concurrency
643</td> </tr>
644
645<tr> <td> default_destination_concurrency_positive_feedback<br>
646<i>transport</i>_destination_concurrency_positive_feedback </td>
647<td align="center"> 2.5<br> 2.5 </td> <td> Per-destination positive
648feedback amount, per delivery that does not fail with connection
649or handshake failure </td> </tr>
650
651<tr> <td> default_destination_concurrency_negative_feedback<br>
652<i>transport</i>_destination_concurrency_negative_feedback </td>
653<td align="center"> 2.5<br> 2.5 </td> <td> Per-destination negative
654feedback amount, per delivery that fails with connection or handshake
655failure </td> </tr>
656
657<tr> <td> default_destination_concurrency_failed_cohort_limit<br>
658<i>transport</i>_destination_concurrency_failed_cohort_limit </td>
659<td align="center"> 2.5<br> 2.5 </td> <td> Number of failed
660pseudo-cohorts after which a destination is declared "dead" and
661delivery is suspended </td> </tr>
662
663<tr> <td> destination_concurrency_feedback_debug</td> <td align="center">
6642.5 </td> <td> Enable verbose logging of concurrency scheduler
665activity </td> </tr>
666
667<tr> <td colspan="3"> <hr> </td> </tr>
668
669</table>
670
671</blockquote>
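
<p> As an illustration of the load-balancer compromise discussed
earlier (a configuration sketch only; the specific values are
assumptions chosen for the example, not recommendations from this
document), fixed 1/4 feedback and a larger pseudo-cohort failure
limit for the smtp transport might be configured along these lines: </p>

<blockquote>
<pre>
# /etc/postfix/main.cf (illustrative settings only)
# Fixed less-than-1 feedback: tolerate 1 in 4 bad servers behind a load balancer.
smtp_destination_concurrency_positive_feedback = 1/4
smtp_destination_concurrency_negative_feedback = 1/4
# Require more than one failed pseudo-cohort before declaring a destination dead.
smtp_destination_concurrency_failed_cohort_limit = 2
</pre>
</blockquote>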
672
673<h2> <a name="jobs"> Preemptive scheduling </a> </h2>
674
675<p>
676
677The following sections describe the new queue manager and its
678preemptive scheduler algorithm. Note that the document was originally
679written to describe the changes between the new queue manager (in
680this text referred to as <tt>nqmgr</tt>, the name it was known by
681before it became the default queue manager) and the old queue manager
682(referred to as <tt>oqmgr</tt>). This is why it refers to <tt>oqmgr</tt>
683every so often.
684
685</p>
686
687<p>
688
689This document is divided into sections as follows:
690
691</p>
692
693<ul>
694
<li> <a href="#nqmgr_structures"> The structures used by
696nqmgr </a>
697
<li> <a href="#nqmgr_pickup"> What happens when nqmgr picks
699up the message </a> - how it is assigned to transports, jobs, peers,
700entries
701
<li> <a href="#nqmgr_selection"> How the entry selection
703works </a>
704
<li> <a href="#nqmgr_preemption"> How the preemption
706works </a> - what messages may be preempted and how and what messages
707are chosen to preempt them
708
<li> <a href="#nqmgr_concurrency"> How destination concurrency
710limits affect the scheduling algorithm </a>
711
<li> <a href="#nqmgr_memory"> Dealing with memory resource
713limits </a>
714
715</ul>
716
<h3> <a name="nqmgr_structures"> The structures used by
718nqmgr </a> </h3>
719
720<p>
721
Let's start by recapitulating the structures and terms used when
referring to the queue manager and how it operates. Many of these are
724partially described elsewhere, but it is nice to have a coherent
725overview in one place:
726
727</p>
728
729<ul>
730
<li> <p> Each message structure represents one mail message which
Postfix is to deliver. The message recipients specify to what
destinations the message is to be delivered and what transports are
going to be used for the delivery. </p>
735
736<li> <p> Each recipient entry groups a batch of recipients of one
737message which are all going to be delivered to the same destination
738(and over the same transport).
739</p>
740
<li> <p> Each transport structure groups everything that is going
to be delivered by the delivery agents dedicated to that transport.
743Each transport maintains a set of queues (describing the destinations
744it shall talk to) and jobs (referencing the messages it shall
745deliver). </p>
746
747<li> <p> Each transport queue (not to be confused with the on-disk
active queue or incoming queue) groups everything that is going to be
delivered to a given destination (aka nexthop) by its transport.  Each
queue belongs to one transport, so each destination may be referred
to by several queues, one for each transport.  Each queue maintains
a list of all recipient entries (batches of message recipients)
which shall be delivered to the given destination (the todo list), and
754a list of recipient entries already being delivered by the delivery
755agents (the busy list). </p>
756
757<li> <p> Each queue corresponds to multiple peer structures.  Each
758peer structure is like the queue structure, belonging to one transport
759and referencing one destination. The difference is that it lists
760only the recipient entries which all originate from the same message,
761unlike the queue structure, whose entries may originate from various
762messages. For messages with few recipients, there is usually just
763one recipient entry for each destination, resulting in one recipient
764entry per peer. But for large mailing list messages the recipients
may need to be split into multiple recipient entries, in which case
the peer structure may list many entries for a single destination.
767</p>
768
769<li> <p> Each transport job groups everything it takes to deliver
770one message via its transport. Each job represents one message
771within the context of the transport. The job belongs to one transport
772and message, so each message may have multiple jobs, one for each
773transport. The job groups all the peer structures, which describe
774the destinations the job's message has to be delivered to. </p>
775
776</ul>
777
778<p>
779
780The first four structures are common to both <tt>nqmgr</tt> and
781<tt>oqmgr</tt>, the latter two were introduced by <tt>nqmgr</tt>.
782
783</p>
784
785<p>
786
These terms are used extensively in the text below; feel free to
refer back to the descriptions above whenever you feel you have
lost track of what is what.
790
791</p>
792
<h3> <a name="nqmgr_pickup"> What happens when nqmgr picks
794up the message </a> </h3>
795
796<p>
797
798Whenever <tt>nqmgr</tt> moves a queue file into the active queue,
799the following happens: It reads all necessary information from the
800queue file as <tt>oqmgr</tt> does, and also reads as many recipients
801as possible - more on that later, for now let's just pretend it
802always reads all recipients.
803
804</p>
805
806<p>
807
Then it resolves the recipients as <tt>oqmgr</tt> does, which
means obtaining an (address, nexthop, transport) triple for each
recipient. For each triple, it finds the transport; if it does not
exist yet, it instantiates it (unless it's dead). Within the
transport, it finds the destination queue for the given nexthop; if it
does not exist yet, it instantiates it (unless it's dead). The
triple is then bound to the given destination queue. This happens in
qmgr_resolve() and is basically the same as in <tt>oqmgr</tt>.
816
817</p>
818
819<p>
820
821Then for each triple which was bound to some queue (and thus
822transport), the program finds the job which represents the message
823within that transport's context; if it does not exist yet, it
824instantiates it. Within the job, it finds the peer which represents
the bound destination queue within this job's context; if it does
not exist yet, it instantiates it.  Finally, it stores the address
from the resolved triple to the recipient entry which is appended
to both the queue entry list and the peer entry list. The addresses
for the same nexthop are batched in the entries up to the recipient_concurrency
limit for that transport. This happens in qmgr_assign() and apart
from the fact that it operates with job and peer structures, it is basically the
832same as in <tt>oqmgr</tt>.
833
834</p>
835
836<p>
837
838When the job is instantiated, it is enqueued on the transport's job
839list based on the time its message was picked up by <tt>nqmgr</tt>.
For the first batch of recipients this means it is appended to the end
841of the job list, but the ordering of the job list by the enqueue
842time is important as we will see shortly.
843
844</p>
845
846<p>
847
[Now you should have a pretty good idea of what the state of
<tt>nqmgr</tt> is after a couple of messages have been picked up, and of the
relation between all those job, peer, queue and entry structures.]
851
852</p>
853
<h3> <a name="nqmgr_selection"> How the entry selection
855works </a> </h3>
856
857<p>
858
859Having prepared all those above mentioned structures, the task of
860the <tt>nqmgr</tt>'s scheduler is to choose the recipient entries
one at a time and pass them to the delivery agent for the corresponding
862transport. Now how does this work?
863
864</p>
865
866<p>
867
868The first approximation of the new scheduling algorithm is like this:
869
870</p>
871
872<blockquote>
873<pre>
874foreach transport (round-robin-by-transport)
875do
876    if transport busy continue
877    if transport process limit reached continue
878    foreach transport's job (in the order of the transport's job list)
879    do
880	foreach job's peer (round-robin-by-destination)
881	     if peer-&gt;queue-&gt;concurrency &lt; peer-&gt;queue-&gt;window
882		 return next peer entry.
883	done
884    done
885done
886</pre>
887</blockquote>
888
889<p>
890
891Now what is the "order of the transport's job list"? As we know
892already, the job list is by default kept in the order the message
893was picked up by the <tt>nqmgr</tt>. So by default we get the
894top-level round-robin transport, and within each transport we get
895the FIFO message delivery. The round-robin of the peers by the
destination is perhaps of little importance in most real-life cases
(unless the recipient_concurrency limit is reached, there is only
one peer structure for each destination within one job), but theoretically
it makes sure that even within a single job, destinations are treated
900fairly.
901
902</p>
903
904<p>
905
906[By now you should have a feeling you really know how the scheduler
907works, except for the preemption, under ideal conditions - that is,
908no recipient resource limits and no destination concurrency problems.]
909
910</p>
911
<h3> <a name="nqmgr_preemption"> How the preemption
913works </a> </h3>
914
915<p>
916
917As you might perhaps expect by now, the transport's job list does
918not remain sorted by the job's message enqueue time all the time.
The coolest thing about <tt>nqmgr</tt> is not the simple FIFO
delivery, but that it is able to slip mail with few recipients
921past the mailing-list bulk mail.  This is what the job preemption
922is about - shuffling the jobs on the transport's job list to get
923the best message delivery rates. Now how is it achieved?
924
925</p>
926
927<p>
928
929First I have to tell you that there are in fact two job lists in
930each transport. One is the scheduler's job list, which the scheduler
931is free to play with, while the other one keeps the jobs always
932listed in the order of the enqueue time and is used for recipient
pool management, which we will discuss later. For now, we will deal with
934the scheduler's job list only.
935
936</p>
937
938<p>
939
940So, we have the job list, which is first ordered by the time the
941jobs' messages were enqueued, oldest messages first, the most recently
942picked one at the end. For now, let's assume that there are no
943destination concurrency problems. Without preemption, we pick some
entry of the first (oldest) job on the queue, assign it to a delivery
agent, pick another one from the same job, assign it again, and so
on, until all the entries are used and the job is delivered. We
would then move on to the next job and so on and on. Now how do we
manage to sneak in some entries from the recently added jobs when
the first job on the job list belongs to a message going to a
mailing list and has thousands of recipient entries?
951
952</p>
953
954<p>
955
956The <tt>nqmgr</tt>'s answer is that we can artificially "inflate"
957the delivery time of that first job by some constant for free - it
958is basically the same trick you might remember as "accumulation of
959potential" from the amortized complexity lessons. For example,
960instead of delivering the entries of the first job on the job list
961every time a delivery agent becomes available, we can do it only
962every second time. If you view the moments the delivery agent becomes
963available on a timeline as "delivery slots", then instead of using
964every delivery slot for the first job, we can use only every other
965slot, and still the overall delivery efficiency of the first job
966remains the same. So the delivery <tt>11112222</tt> becomes
967<tt>1.1.1.1.2.2.2.2</tt> (1 and 2 are the imaginary job numbers, .
968denotes the free slot). Now what do we do with free slots?
969
970</p>
971
972<p>
973
974As you might have guessed, we will use them for sneaking the mail
with few recipients in. For example, if we have one four-recipient
mail followed by four one-recipient mails, the delivery sequence
(that is, the sequence in which the jobs are assigned to the
delivery slots) might look like this: <tt>12131415</tt>. Hmm, fine
for sneaking in the single recipient mail, but how do we sneak in
the mail with more than one recipient? Say we have one four-recipient
981mail followed by two two-recipient mails?
982
983</p>
984
985<p>
986
The simple answer would be to use the delivery sequence <tt>12121313</tt>.
But the problem is that this does not scale well. Imagine you have
a mail with a thousand recipients followed by a mail with a hundred recipients.
It is tempting to suggest a delivery sequence like <tt>121212....</tt>,
but alas! Imagine another mail arrives with, say, ten recipients.
But there are no free slots anymore, so it can't slip by, not even
if it had only one recipient.  It will be stuck until the
994hundred-recipient mail is delivered, which really sucks.
995
996</p>
997
998<p>
999
So, it becomes obvious that while inflating a job's delivery time to get
free slots is a great idea, one has to be really careful about how the
free slots are assigned, otherwise one might corner oneself. So,
1003how does <tt>nqmgr</tt> really use the free slots?
1004
1005</p>
1006
1007<p>
1008
1009The key idea is that one does not have to generate the free slots
1010in a uniform way. The delivery sequence <tt>111...1</tt> is no
1011worse than <tt>1.1.1.1</tt>, in fact, it is even better as some
1012entries are in the first case selected earlier than in the second
1013case, and none is selected later! So it is possible to first
1014"accumulate" the free delivery slots and then use them all at once.
1015It is even possible to accumulate some, then use them, then accumulate
1016some more and use them again, as in <tt>11..1.1</tt> .
1017
1018</p>
1019
1020<p>
1021
Let's get back to the hundred-recipient example. We now know
that we could first accumulate one hundred free slots, and only
then preempt the first job and sneak the hundred-recipient
mail in. Applying the algorithm recursively, we see that the
hundred-recipient job can accumulate ten free delivery slots, and
then we could preempt it and sneak in the ten-recipient mail...
Wait wait wait! Could we? Aren't we overinflating the original
thousand-recipient mail?
1030
1031</p>
1032
1033<p>
1034
Well, despite how it looks at first glance, another trick will
allow us to answer "no, we are not!". If we had said that we would
inflate the delivery time at most twice, and then considered
every other slot as a free slot, then we would overinflate in case
of recursive preemption. BUT! The trick is that if we use only
every n-th slot as a free slot for n&gt;2, there is always some worst
inflation factor which we can guarantee not to be breached, even
if we apply the algorithm recursively. To be precise, if for every
k&gt;1 normally used slots we accumulate one free delivery slot, then
1044the inflation factor is not worse than k/(k-1) no matter how many
1045recursive preemptions happen. And it's not worse than (k+1)/k if
1046only non-recursive preemption happens. Now, having got through the
1047theory and the related math, let's see how <tt>nqmgr</tt> implements
1048this.
1049
1050</p>
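
<p> As a quick worked example of this guarantee (plain arithmetic):
with k=2 the worst-case inflation factor under recursive preemption
is 2/(2-1) = 2, i.e. the preempted mail may take up to twice as long
to deliver; with k=5 (which, as described below, is the default
delivery slot cost) it is 5/4, i.e. at most 25% longer, or
(5+1)/5 = 20% longer if only non-recursive preemption happens. </p>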
1051
1052<p>
1053
Each job has a so-called "available delivery slot" counter. Each
transport has a <i>transport</i>_delivery_slot_cost parameter, which
defaults to the default_delivery_slot_cost parameter, which is set to 5
by default. This is the k from the paragraph above. Each time k
entries of the job are selected for delivery, this counter is
incremented by one. Once there are some slots accumulated, a job which
1060requires no more than that number of slots to be fully delivered
1061can preempt this job.
1062
1063</p>
1064
1065<p>
1066
1067[Well, the truth is, the counter is incremented every time an entry
1068is selected and it is divided by k when it is used.
But for understanding purposes it's good enough to use
the above approximation of the truth.]
1071
1072</p>
1073
1074<p>
1075
OK, so now we know the conditions which must be satisfied so that one
job can preempt another one. But which job gets preempted, how do
we choose which job preempts it if there are several valid candidates,
and when exactly does all this happen?
1080
1081</p>
1082
1083<p>
1084
1085The answer for the first part is simple. The job whose entry was
selected the last time is the so-called current job. Normally, it is
1087the first job on the scheduler's job list, but destination concurrency
1088limits may change this as we will see later. It is always only the
1089current job which may get preempted.
1090
1091</p>
1092
1093<p>
1094
Now for the second part. The current job has a certain number of
recipient entries, and as such may accumulate at most some number
1097of available delivery slots. It might have already accumulated some,
1098and perhaps even already used some when it was preempted before
1099(remember a job can be preempted several times). In either case,
1100we know how many are accumulated and how many are left to deliver,
so we know how many more it may accumulate at most. Every other
1102job which may be delivered by less than that number of slots is a
1103valid candidate for preemption. How do we choose among them?
1104
1105</p>
1106
1107<p>
1108
The answer is: the one with the maximum enqueue_time/recipient_entry_count.
That is, the older the job is, the more we should try to deliver
it in order to get the best message delivery rates. These rates are of
course subject to how many recipients the message has, therefore
the division by the recipient (entry) count. No one shall be surprised
that a message with n recipients takes n times longer to deliver than
a message with one recipient.
1116
1117</p>
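
<p> A small worked example of this criterion (the numbers are
invented purely for illustration): suppose two candidate jobs both
fit within the current job's slot budget; job A was enqueued 200
seconds ago and has 4 recipient entries (ratio 50), while job B was
enqueued 30 seconds ago and has 1 entry (ratio 30). Job A is chosen,
even though job B would be cheaper to squeeze in. </p>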
1118
1119<p>
1120
1121Now let's recap the previous two paragraphs. Isn't it too complicated?
Why aren't the candidates drawn only from the jobs which can be
delivered within the number of slots the current job has already
accumulated? Why do we need to estimate how much it may yet
accumulate? If you figured out the answer, congratulate yourself. If
we did it this simple way, we would always choose the candidate
with the fewest recipient entries. If there were enough single-recipient
mails coming in, they would always slip by the bulk mail as soon
as possible, and the mail with two or more recipients would never get
a chance, no matter how long it has been sitting around in the
1131job list.
1132
1133</p>
1134
1135<p>
1136
This candidate selection has an interesting implication: when
we choose the best candidate for preemption (this is done in
qmgr_choose_candidate()), it may happen that we cannot use it for
1140preemption immediately. This leads to an answer to the last part
1141of the original question - when does the preemption happen?
1142
1143</p>
1144
1145<p>
1146
The preemption attempt happens every time the transport's next recipient
1148entry is to be chosen for delivery. To avoid needless overhead, the
1149preemption is not attempted if the current job could never accumulate
1150more than <i>transport</i>_minimum_delivery_slots (defaults to
default_minimum_delivery_slots which defaults to 3). If there are
already enough accumulated slots to preempt the current job by the
chosen best candidate, it is done immediately. This basically means
that the candidate is moved in front of the current job on the
scheduler's job list and the accumulated slot counter is decreased
by the amount used by the candidate. If there are not enough slots...
well, I could say that nothing happens and another preemption
1158is attempted the next time. But that's not the complete truth.
1159
1160</p>
1161
1162<p>
1163
1164The truth is that it turns out that it is not really necessary to
wait until the job's counter accumulates all the delivery slots in
advance. Say we have a ten-recipient mail followed by two two-recipient
mails. If the preemption happened only when enough delivery slots accumulate
(assuming slot cost 2), the delivery sequence becomes
<tt>11112211113311</tt>. Now what would we get if we waited
only for 50% of the necessary slots to accumulate and promised
to wait for the remaining 50% later, after we get back
to the preempted job? If we use such a slot loan, the delivery sequence
becomes <tt>11221111331111</tt>. As we can see, it does not make
the delivery of the ten-recipient mail considerably worse, but
1175it allows the small messages to be delivered sooner.
1176
1177</p>
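
<p> The following toy simulation (written for this text; it is a
deliberately simplified model, not the actual qmgr code - it assumes
a single delivery slot, preempts only the current job, and picks the
first later job that fits the accumulated slot budget as the
candidate) reproduces the two delivery sequences above for a
ten-recipient job followed by two two-recipient jobs, with slot cost
2 and an optional 50% discount. </p>

<blockquote>
<pre>
def simulate(jobs, slot_cost=2, discount=0.0):
    """jobs: list of (job_id, recipient_entry_count) in enqueue order.
    Returns the delivery sequence as a string of job ids."""
    joblist = [[jid, n] for jid, n in jobs]    # scheduler's job list
    slots = {jid: 0.0 for jid, _ in jobs}      # accumulated delivery slots per job
    sequence = []
    while joblist:
        current = joblist[0]
        # Try to preempt the current job with a later job that fits the budget.
        for cand in joblist[1:]:
            needed = cand[1] * (1.0 - discount)   # slot loan: pay only part up front
            if slots[current[0]] &gt;= needed:
                slots[current[0]] -= cand[1]      # the loaned part may go negative
                joblist.remove(cand)
                joblist.insert(0, cand)           # candidate moves in front
                current = cand
                break
        # Deliver one recipient entry of the current job.
        sequence.append(str(current[0]))
        current[1] -= 1
        slots[current[0]] += 1.0 / slot_cost      # accumulate delivery slots
        if current[1] == 0:
            joblist.pop(0)
    return "".join(sequence)

print(simulate([(1, 10), (2, 2), (3, 2)]))                 # 11112211113311
print(simulate([(1, 10), (2, 2), (3, 2)], discount=0.5))   # 11221111331111
</pre>
</blockquote>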
1178
1179<p>
1180
1181The concept of these slot loans is where the
1182<i>transport</i>_delivery_slot_discount and
1183<i>transport</i>_delivery_slot_loan come from (they default to
1184default_delivery_slot_discount and default_delivery_slot_loan, whose
1185values are by default 50 and 3, respectively). The discount (resp.
1186loan) specifies how many percent (resp. how many slots) one "gets
1187in advance", when the number of slots required to deliver the best
candidate is compared with the number of slots the current job has
1189accumulated so far.
1190
1191</p>
1192
1193<p>
1194
And that pretty much concludes this chapter.
1196
1197</p>
1198
1199<p>
1200
1201[Now you should have a feeling that you pretty much understand the
1202scheduler and the preemption, or at least that you will have it
after you read the last chapter a couple more times. You shall clearly
1204see the job list and the preemption happening at its head, in ideal
1205delivery conditions. The feeling of understanding shall last until
1206you start wondering what happens if some of the jobs are blocked,
which you might eventually figure out correctly from what has been
1208said already. But I would be surprised if your mental image of the
1209scheduler's functionality is not completely shattered once you
1210start wondering how it works when not all recipients may be read
1211in-core.  More on that later.]
1212
1213</p>
1214
<h3> <a name="nqmgr_concurrency"> How destination concurrency
1216limits affect the scheduling algorithm </a> </h3>
1217
1218<p>
1219
1220The <tt>nqmgr</tt> uses the same algorithm for destination concurrency
1221control as <tt>oqmgr</tt>. Now what happens when the destination
1222limits are reached and no more entries for that destination may be
1223selected by the scheduler?
1224
1225</p>
1226
1227<p>
1228
From the user's point of view it is all simple. If some of the peers
1230of a job can't be selected, those peers are simply skipped by the
1231entry selection algorithm (the pseudo-code described before) and
1232only the selectable ones are used. If none of the peers may be
1233selected, the job is declared a "blocker job". Blocker jobs are
1234skipped by the entry selection algorithm and they are also excluded
1235from the candidates for preemption of current job. Thus the scheduler
1236effectively behaves as if the blocker jobs didn't exist on the job
1237list at all. As soon as at least one of the peers of a blocker job
1238becomes unblocked (that is, the delivery agent handling the delivery
of the recipient entry for the given destination successfully finishes),
1240the job's blocker status is removed and the job again participates
1241in all further scheduler actions normally.
1242
1243</p>
1244
1245<p>
1246
1247So the summary is that the users don't really have to be concerned
about the interaction of the destination limits and the scheduling
algorithm. It works well on its own and there are no knobs they
would need to turn to control it.
1251
1252</p>
1253
1254<p>
1255
1256From a programmer's point of view, the blocker jobs complicate the
1257scheduler quite a lot. Without them, the jobs on the job list would
1258be normally delivered in strict FIFO order. If the current job is
1259preempted, the job preempting it is completely delivered unless it
1260is preempted itself. Without blockers, the current job is thus
1261always either the first job on the job list, or the top of the stack
1262of jobs preempting the first job on the job list.
1263
1264</p>
1265
1266<p>
1267
1268The visualization of the job list and the preemption stack without
1269blockers would be like this:
1270
1271</p>
1272
1273<blockquote>
1274<pre>
1275first job-&gt;    1--2--3--5--6--8--...    &lt;- job list
1276on job list    |
1277               4    &lt;- preemption stack
1278               |
1279current job-&gt;  7
1280</pre>
1281</blockquote>
1282
1283<p>
1284
1285In the example above we see that job 1 was preempted by job 4 and
1286then job 4 was preempted by job 7. After job 7 is completed, remaining
1287entries of job 4 are selected, and once they are all selected, job
12881 continues.
1289
1290</p>
1291
1292<p>
1293
1294As we see, it's all very clean and straightforward. Now how does
1295this change because of blockers?
1296
1297</p>
1298
1299<p>
1300
The answer is: a lot. Any job may become a blocker job at any time,
and may also become a normal job again at any time. This has several
1303important implications:
1304
1305</p>
1306
1307<ol>
1308
1309<li> <p>
1310
1311The jobs may be completed in arbitrary order. For example, in the
1312example above, if the current job 7 becomes blocked, the next job
13134 may complete before the job 7 becomes unblocked again. Or if both
13147 and 4 are blocked, then 1 is completed, then 7 becomes unblocked
1315and is completed, then 2 is completed and only after that 4 becomes
1316unblocked and is completed... You get the idea.
1317
1318</p>
1319
1320<p>
1321
1322[Interesting side note: even when jobs are delivered out of order,
1323from single destination's point of view the jobs are still delivered
1324in the expected order (that is, FIFO unless there was some preemption
1325involved). This is because whenever a destination queue becomes
1326unblocked (the destination limit allows selection of more recipient
1327entries for that destination), all jobs which have peers for that
1328destination are unblocked at once.]
1329
1330</p>
1331
1332<li> <p>
1333
1334The idea of the preemption stack at the head of the job list is
1335gone.  That is, it must be possible to preempt any job on the job
1336list. For example, if the jobs 7, 4, 1 and 2 in the example above
all become blocked, job 3 becomes the current job. And of course
we do not want the preemption to be affected by whether there
are some blocked jobs or not. Therefore, if it turns out that job
13403 might be preempted by job 6, the implementation shall make it
1341possible.
1342
1343</p>
1344
1345<li> <p>
1346
1347The idea of the linear preemption stack itself is gone. It's no
1348longer true that one job is always preempted by only one job at one
1349time (that is, directly preempted, not counting the recursively
1350nested jobs). In the example above, job 1 is directly preempted
1351by only job 4, and job 4 by job 7. Now assume job 7 becomes
1352blocked, and job 4 is being delivered. If job 4 accumulates enough
1353delivery slots, it is natural that it might be preempted, for example
1354by job 8. Now job 4 is preempted by both job 7 AND job 8 at the
1355same time.
1356
1357</p>
1358
1359</ol>
1360
1361<p>
1362
1363Now combine points 2) and 3) with point 1) again and you realize
1364that the relations on the once linear job list have become pretty
1365complicated. If we extend the point 3) example: jobs 7 and 8 preempt
1366job 4, now job 8 becomes blocked too, then job 4 completes. Tricky,
1367huh?
1368
1369</p>
1370
1371<p>
1372
1373If I illustrate the relations after the above-mentioned examples
1374(except those in point 1)), the situation would look like this:
1375
1376</p>
1377
1378<blockquote>
1379<pre>
1380                            v- parent
1381
1382adoptive parent -&gt;    1--2--3--5--...      &lt;- "stack" level 0
1383                      |     |
1384parent gone -&gt;        ?     6              &lt;- "stack" level 1
1385                     / \
1386children -&gt;         7   8   ^- child       &lt;- "stack" level 2
1387
1388                      ^- siblings
1389</pre>
1390</blockquote>
1391
1392<p>
1393
1394Now how does <tt>nqmgr</tt> deal with all these complicated relations?
1395
1396</p>
1397
1398<p>
1399
1400Well, it maintains them all as described, but fortunately, all these
1401relations are necessary only for purposes of proper counting of
1402available delivery slots. For purposes of ordering the jobs for
1403entry selection, the original rule still applies: "the job preempting
1404the current job is moved in front of the current job on the job
1405list". So for entry selection purposes, the job relations remain
1406as simple as this:
1407
1408</p>
1409
1410<blockquote>
1411<pre>
14127--8--1--2--6--3--5--..   &lt;- scheduler's job list order
1413</pre>
1414</blockquote>
1415
1416<p>
1417
1418The job list order and the preemption parent/child/siblings relations
1419are maintained separately. And because the selection works only
1420with the job list, you can happily forget about those complicated
1421relations unless you want to study the <tt>nqmgr</tt> sources. In
1422that case the text above might provide some helpful introduction
1423to the problem domain. Otherwise I suggest you just forget about
1424all this and stick with the user's point of view: the blocker jobs
1425are simply ignored.
1426
1427</p>
1428
1429<p>
1430
1431[By now, you should have a feeling that there are more things going
1432on under the hood than you ever wanted to know. You decide that
1433forgetting about this chapter is the best you can do for the sake
1434of your mind's health and you basically stick with the idea of how
1435the scheduler works in ideal conditions, when there are no blockers,
1436which is good enough.]
1437
1438</p>
1439
1440<h3> <a name="nqmgr_memory"> Dealing with memory resource
1441limits </a> </h3>
1442
1443<p>
1444
1445When discussing the <tt>nqmgr</tt> scheduler, we have so far assumed
1446that all recipients of all messages in the active queue are completely
1447read into memory. This is simply not true. There is an upper
1448bound on the amount of memory the <tt>nqmgr</tt> may use, and
1449therefore it must impose some limits on the information it may store
1450in memory at any given time.
1451
1452</p>
1453
1454<p>
1455
1456First of all, not all messages may be read in-core at once. At any
1457time, at most qmgr_message_active_limit messages may be in-core.
1458When read into memory, the messages are picked from the
1459incoming and deferred message queues and moved to the active queue
1460(incoming having priority), so if there are more than
1461qmgr_message_active_limit messages queued, the
1462rest will have to wait until (some of) the messages in the active
1463queue are completely delivered (or deferred).
1464
1465</p>
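
<p>

As an illustration only (this is not the actual qmgr code), the
in-core message limit can be pictured as follows; the function and
variable names are invented for this sketch:

</p>

<blockquote>
<pre>
from collections import deque

def fill_active_queue(incoming, deferred, active, active_limit):
    """Admit messages into the active queue until the
    qmgr_message_active_limit (here 'active_limit') is reached,
    giving the incoming queue priority over the deferred queue."""
    while len(active) &lt; active_limit:
        if incoming:
            active.append(incoming.popleft())   # incoming has priority
        elif deferred:
            active.append(deferred.popleft())
        else:
            break                               # nothing left to admit
    # anything still queued waits until active messages are delivered
    # (or deferred) and leave the active queue

active = []
incoming = deque(["new1", "new2", "new3"])
deferred = deque(["old1", "old2"])
fill_active_queue(incoming, deferred, active, active_limit=4)
print(active)   # ['new1', 'new2', 'new3', 'old1']
</pre>
</blockquote>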
1466
1467<p>
1468
1469Even with the limited number of in-core messages, there is another
1470limit which must be imposed in order to avoid memory exhaustion.
1471Each message may contain a huge number of recipients (tens or hundreds
1472of thousands are not uncommon), so if <tt>nqmgr</tt> were to read all
1473recipients of all messages in the active queue, it could easily run
1474out of memory. Therefore there must be some upper bound on the
1475number of message recipients which are read into memory at the
1476same time.
1477
1478</p>
1479
1480<p>
1481
1482Before discussing how exactly <tt>nqmgr</tt> implements the recipient
1483limits, let's see how the sole existence of the limits themselves
1484affects the <tt>nqmgr</tt> and its scheduler.
1485
1486</p>
1487
1488<p>
1489
1490The message limit is straightforward - it just limits the size
1491of the lookahead that the <tt>nqmgr</tt>'s scheduler has when
1492choosing which message can preempt the current one.
1493Messages not in the active queue are simply not
1494considered at all.
1495
1496</p>
1497
1498<p>
1499
1500The recipient limit complicates things more. First of all, the
1501message reading code must support reading the recipients in batches,
1502which among other things means accessing the queue file several
1503times and continuing where the last recipient batch ended. This is
1504invoked by the scheduler whenever the current job has space for more
1505recipients, subject to the transport's refill_limit and refill_delay
1506parameters.  It is also done any time all in-core recipients of the
1507message are dealt with (which may also mean they were deferred) but
1508there are still more in the queue file.
1509
1510</p>
1511
1512<p>
1513
1514The second complication is that with some recipients left unread
1515in the queue file, the scheduler can't operate with exact counts
1516of recipient entries. With unread recipients, it is not clear how
1517many recipient entries there will be, as they are subject to
1518per-destination grouping. It is not even clear to what transports
1519(and thus jobs) the recipients will be assigned. And with messages
1520coming from the deferred queue, it is not even clear how many unread
1521recipients are still to be delivered. This all means that the
1522scheduler must use only estimates of how many recipient entries
1523there will be.  Fortunately, it is possible to estimate the minimum
1524and maximum correctly, so the scheduler can always err on the safe
1525side.  Obviously, the better the estimates, the better the results, so
1526it is best when we are able to read all recipients in-core and turn
1527the estimates into exact counts, or at least try to read as many
1528as possible to make the estimates as accurate as possible.
1529
1530</p>
1531
1532<p>
1533
1534The third complication is that it is no longer true that the scheduler
1535is done with a job once all of its in-core recipients are delivered.
1536It is possible that the job will be revived later, when another
1537batch of recipients is read in core. It is also possible that some
1538jobs will be created for the first time long after the first batch
1539of recipients was read in core. The <tt>nqmgr</tt> code must be
1540ready to handle all such situations.
1541
1542</p>
1543
1544<p>
1545
1546And finally, the fourth complication is that the <tt>nqmgr</tt>
1547code must somehow impose the recipient limit itself. Now how does
1548it achieve it?
1549
1550</p>
1551
1552<p>
1553
1554Perhaps the easiest solution would be to say that each message may
1555have at most X recipients stored in-core, but such a solution would
1556be poor for several reasons. With reasonable qmgr_message_active_limit
1557values, the X would have to be quite low to maintain a reasonable
1558memory footprint. And with a low X lots of things would not work well.
1559The <tt>nqmgr</tt> would have problems using the
1560<i>transport</i>_destination_recipient_limit efficiently. The
1561scheduler's preemption would be suboptimal as the recipient count
1562estimates would be inaccurate. The message queue file would have
1563to be accessed many times to read in more recipients again and
1564again.
1565
1566</p>
1567
1568<p>
1569
1570Therefore it seems reasonable to have a solution which does not use
1571a limit imposed on a per-message basis, but which maintains a pool
1572of available recipient slots, which can be shared among all messages
1573in the most efficient manner. And as we want to avoid separate
1574transports competing for resources whenever possible, it seems
1575appropriate to maintain such a recipient pool for each transport
1576separately. This is the general idea; now how does it work in
1577practice?
1578
1579</p>
1580
1581<p>
1582
1583First we have to solve a little chicken-and-egg problem. If we want
1584to use the per-transport recipient pools, we first need to know to
1585what transport(s) the message is assigned. But we will find that
1586out only after we have read in the recipients. So it is obvious
1587that we first have to read in some recipients, use them to find out
1588to what transports the message is to be assigned, and only after
1589that can we use the per-transport recipient pools.
1590
1591</p>
1592
1593<p>
1594
1595Now how many recipients shall we read the first time? This is
1596what the qmgr_message_recipient_minimum and qmgr_message_recipient_limit
1597values control. The qmgr_message_recipient_minimum value specifies
1598how many recipients of each message we will read the first time,
1599no matter what.  It is necessary to read at least one recipient
1600before we can assign the message to a transport and create the first
1601job. However, reading only qmgr_message_recipient_minimum recipients
1602even if there are only a few messages with few recipients in-core would
1603be wasteful. Therefore if there are fewer than qmgr_message_recipient_limit
1604recipients in-core so far, the first batch of recipients may be
1605larger than qmgr_message_recipient_minimum - as large as is required
1606to reach the qmgr_message_recipient_limit limit.
1607
1608</p>
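
<p>

A minimal sketch of this first-batch rule, in Python and with
illustrative parameter values (not necessarily the defaults on any
particular system), might look like this:

</p>

<blockquote>
<pre>
def first_batch_size(recipients_in_core_total,
                     qmgr_message_recipient_minimum,
                     qmgr_message_recipient_limit):
    # Always read at least the minimum; while the global in-core count
    # is still below qmgr_message_recipient_limit, the batch may grow
    # as large as is needed to reach that limit.
    headroom = qmgr_message_recipient_limit - recipients_in_core_total
    return max(qmgr_message_recipient_minimum, headroom)

# Global pool nearly full: only the minimum is read.
print(first_batch_size(19995, 10, 20000))   # -&gt; 10
# Global pool empty: up to the whole limit may be read at once.
print(first_batch_size(0, 10, 20000))       # -&gt; 20000
</pre>
</blockquote>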
1609
1610<p>
1611
1612Once the first batch of recipients has been read into core and the
1613message jobs have been created, the size of the subsequent recipient batches (if
1614any - of course it's best when all recipients are read in one batch)
1615is based solely on the position of the message jobs on their
1616corresponding transports' job lists. Each transport has a pool of
1617<i>transport</i>_recipient_limit recipient slots which it can
1618distribute among its jobs (how this is done is described later).
1619The subsequent recipient batch may be as large as the sum of all
1620recipient slots of all jobs of the message permits (plus the
1621qmgr_message_recipient_minimum amount which always applies).
1622
1623</p>
1624
1625<p>
1626
1627For example, if a message has three jobs, the first with 1 recipient
1628still in-core and 4 recipient slots, the second with 5 recipients in-core
1629and 5 recipient slots, and the third with 2 recipients in-core and 0
1630recipient slots, it has 1+5+2=7 recipients in-core and 4+5+0=9
1631recipient slots in total. This means that we could immediately
1632read 2+qmgr_message_recipient_minimum more recipients of that message
1633into core.
1634
1635</p>
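
<p>

The arithmetic of the example above can be written down as a small
Python sketch (again purely illustrative; the function name is
invented for this document):

</p>

<blockquote>
<pre>
def next_batch_size(jobs, qmgr_message_recipient_minimum):
    """jobs: list of (recipients_in_core, recipient_slots) pairs,
    one per job of the message."""
    in_core = sum(r for r, s in jobs)
    slots = sum(s for r, s in jobs)
    # the unused slots of all jobs of the message, plus the minimum
    # amount which always applies
    return max(slots - in_core, 0) + qmgr_message_recipient_minimum

jobs = [(1, 4), (5, 5), (2, 0)]    # 7 recipients in-core, 9 slots in total
print(next_batch_size(jobs, 10))   # 9 - 7 = 2, plus the minimum -&gt; 12
</pre>
</blockquote>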
1636
1637<p>
1638
1639The above example illustrates several things which might be worth
1640mentioning explicitly: first, note that although the per-transport
1641slots are assigned to particular jobs, we can't guarantee that once
1642the next batch of recipients is read into core, the corresponding
1643numbers of recipients will be assigned to those jobs. The jobs lend
1644their slots to the message as a whole, so it is possible that some
1645jobs end up sponsoring other jobs of their message. For example,
1646if in the example above the 2 newly read recipients were assigned
1647to the second job, the first job sponsored the second job with 2
1648slots. The second notable thing is the third job, which has more
1649recipients in-core than it has slots. Apart from the sponsoring by
1650another job that we just saw, this can be a result of the first
1651recipient batch, which is sponsored from the global recipient pool of
1652qmgr_message_recipient_limit recipients. It can also be sponsored from
1653the message recipient pool of qmgr_message_recipient_minimum recipients.
1654
1655</p>
1656
1657<p>
1658
1659Now how does each transport distribute the recipient slots among
1660its jobs?  The strategy is quite simple. As most scheduler activity
1661happens on the head of the job list, it is our intention to make
1662sure that the scheduler has the best estimates of the recipient
1663counts for those jobs. As we mentioned above, this means that we
1664want to try to make sure that the messages of those jobs have all
1665recipients read in-core. Therefore the transport distributes the
1666slots "along" the job list from start to end. In this case the job
1667list sorted by message enqueue time is used, because it doesn't
1668change over time as the scheduler's job list does.
1669
1670</p>
1671
1672<p>
1673
1674More specifically, each time a job is created and appended to the
1675job list, it gets all unused recipient slots from its transport's
1676pool. It keeps them until all recipients of its message are read.
1677When this happens, all unused recipient slots are transferred to
1678the next job (which is in fact now the first such job) on the job
1679list which still has some recipients unread, or eventually back to
1680the transport pool if there is no such job. Such a transfer also
1681happens whenever a recipient entry of that job is delivered.
1682
1683</p>
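
<p>

The following Python sketch shows the general shape of this
slot-passing scheme. It is a simplification for illustration only
(for example, it transfers all unused slots in one step); the real
logic lives in the qmgr C sources.

</p>

<blockquote>
<pre>
class Job:
    def __init__(self, name, unread):
        self.name = name
        self.unread = unread    # recipients still unread in the queue file
        self.slots = 0          # recipient slots currently held by this job

class Transport:
    def __init__(self, recipient_limit):
        self.pool = recipient_limit   # unused slots in the transport pool
        self.jobs = []                # job list ordered by message enqueue time

    def append_job(self, job):
        self.jobs.append(job)
        job.slots, self.pool = self.pool, 0     # new job takes all unused slots

    def release_unused_slots(self, job):
        """Pass a job's unused slots to the first job on the list that
        still has unread recipients, or back to the transport pool."""
        freed, job.slots = job.slots, 0
        for nxt in self.jobs:
            if nxt is not job and nxt.unread &gt; 0:
                nxt.slots += freed
                return
        self.pool += freed

t = Transport(recipient_limit=100)
a = Job("a", unread=0)              # "a" already has all recipients read
b = Job("b", unread=40)
t.append_job(a)
t.append_job(b)
t.release_unused_slots(a)           # a's unused slots move on to "b"
print(a.slots, b.slots, t.pool)     # 0 100 0
</pre>
</blockquote>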
1684
1685<p>
1686
1687There is also a scenario when a job is not appended to the end of
1688the job list (for example it was created as a result of a second or
1689later recipient batch). Then it works exactly as above, except that
1690if it was put in front of the first unread job (that is, the job
1691of a message which still has some unread recipients in the queue file),
1692that job is first forced to return all of its unused recipient slots
1693to the transport pool.
1694
1695</p>
1696
1697<p>
1698
1699The algorithm just described leads to the following state: The first
1700unread job on the job list always gets all the remaining recipient
1701slots of that transport (if there are any). The jobs queued before
1702this job are completely read (that is, all recipients of their
1703message were already read into core) and have at most as many slots
1704as they still have recipients in-core (the maximum is there because
1705of the sponsoring mentioned before), and the jobs after this job get
1706nothing from the transport recipient pool (unless they got something
1707before and the first unread job was then created and enqueued in
1708front of them later - in such a case they also get at most as many
1709slots as they have recipients in-core).
1710
1711</p>
1712
1713<p>
1714
1715Things work fine in such a state most of the time, because the
1716current job is either completely read in-core or has as many recipient
1717slots as there are, but there is one situation which we still have
1718to take care of specially.  Imagine that the current job is preempted
1719by some unread job from the job list and there are no more recipient
1720slots available, so this new current job could read only batches
1721of qmgr_message_recipient_minimum recipients at a time. This would
1722really degrade performance. For this reason, each transport has an
1723extra pool of <i>transport</i>_extra_recipient_limit recipient
1724slots, dedicated exactly to this situation. Each time an unread
1725job preempts the current job, it gets half of the remaining recipient
1726slots from the normal pool and this extra pool.
1727
1728</p>
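
<p>

One way to picture this rule is the Python sketch below; the names
are invented, and how exactly the grant is split between the two
pools is an assumption of this sketch rather than a statement about
the real implementation.

</p>

<blockquote>
<pre>
def grant_on_preemption(pools):
    """An unread job that preempts the current job receives half of
    the recipient slots still remaining in the normal pool plus the
    extra pool.  Draining the normal pool first is an assumption of
    this sketch only."""
    grant = (pools["normal"] + pools["extra"]) // 2
    from_normal = min(grant, pools["normal"])
    pools["normal"] -= from_normal
    pools["extra"] -= grant - from_normal
    return grant

# normal pool exhausted, extra pool untouched:
pools = {"normal": 0, "extra": 1000}
print(grant_on_preemption(pools), pools)   # 500 {'normal': 0, 'extra': 500}
</pre>
</blockquote>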
1729
1730<p>
1731
1732And that's it. It sure does sound pretty complicated, but fortunately
1733most people don't really have to care how exactly it works as long
1734as it works.  Perhaps the only important things to know for most
1735people are the following upper bound formulas:
1736
1737</p>
1738
1739<p>
1740
1741Each transport has at most
1742
1743</p>
1744
1745<blockquote>
1746<pre>
1747max(
1748qmgr_message_recipient_minimum * qmgr_message_active_limit
1749+ *_recipient_limit + *_extra_recipient_limit,
1750qmgr_message_recipient_limit
1751)
1752</pre>
1753</blockquote>
1754
1755<p>
1756
1757recipients in core.
1758
1759</p>
1760
1761<p>
1762
1763The total number of recipients in core is
1764
1765</p>
1766
1767<blockquote>
1768<pre>
1769max(
1770qmgr_message_recipient_minimum * qmgr_message_active_limit
1771+ sum( *_recipient_limit + *_extra_recipient_limit ),
1772qmgr_message_recipient_limit
1773)
1774</pre>
1775</blockquote>
1776
1777<p>
1778
1779where the sum is over all used transports.
1780
1781</p>
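
<p>

For example, with illustrative settings of qmgr_message_recipient_minimum
= 10, qmgr_message_active_limit = 20000, qmgr_message_recipient_limit
= 20000, and a single smtp transport with smtp_recipient_limit =
20000 and smtp_extra_recipient_limit = 1000 (check "postconf" for
the actual values on a given system), the per-transport bound works
out as follows:

</p>

<blockquote>
<pre>
# Worked example of the per-transport upper-bound formula, using the
# illustrative values from the paragraph above.
qmgr_message_recipient_minimum = 10
qmgr_message_active_limit = 20000
qmgr_message_recipient_limit = 20000
smtp_recipient_limit = 20000
smtp_extra_recipient_limit = 1000

per_transport_max = max(
    qmgr_message_recipient_minimum * qmgr_message_active_limit
    + smtp_recipient_limit + smtp_extra_recipient_limit,
    qmgr_message_recipient_limit,
)
print(per_transport_max)   # 221000 recipients in core at most
</pre>
</blockquote>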
1782
1783<p>
1784
1785And this terribly complicated chapter concludes the documentation
1786of the <tt>nqmgr</tt> scheduler.
1787
1788</p>
1789
1790<p>
1791
1792[By now you should theoretically know the <tt>nqmgr</tt> scheduler
1793inside out. In practice, you still hope that you will never have
1794to really understand the last chapter or two completely, and
1795fortunately most people really won't. Understanding how the scheduler
1796works in ideal conditions is more than good enough for the vast
1797majority of users.]
1798
1799</p>
1800
1801<h2> <a name="credits"> Credits </a> </h2>
1802
1803<ul>
1804
1805<li> Wietse Venema designed and implemented the initial queue manager
1806with per-domain FIFO scheduling, and per-delivery +/-1 concurrency
1807feedback.
1808
1809<li> Patrik Rak designed and implemented preemption where mail with
1810fewer recipients can slip past mail with more recipients in a
1811controlled manner, and wrote up its documentation.
1812
1813<li> Wietse Venema initiated a discussion with Patrik Rak and Victor
1814Duchovni on alternatives for the +/-1 feedback scheduler's aggressive
1815behavior. This is when K/N feedback was reviewed (N = concurrency).
1816The discussion ended without a good solution for both negative
1817feedback and dead site detection.
1818
1819<li> Victor Duchovni resumed work on concurrency feedback in the
1820context of concurrency-limited servers.
1821
1822<li> Wietse Venema then re-designed the concurrency scheduler in
1823terms of the simplest possible concepts: less-than-1 concurrency
1824feedback per delivery, forward and reverse concurrency feedback
1825hysteresis, and pseudo-cohort failure. At this same time, concurrency
1826feedback was separated from dead site detection.
1827
1828<li> These simplifications, and their modular implementation, helped
1829to develop further insights into the different roles that positive
1830and negative concurrency feedback play, and helped to identify some
1831worst-case scenarios.
1832
1833</ul>
1834
1835</body>
1836
1837</html>
1838