1<!doctype html public "-//W3C//DTD HTML 4.01 Transitional//EN"
2	"http://www.w3.org/TR/html4/loose.dtd">
3
4<html>
5
6<head>
7
8<title>Postfix Queue Scheduler</title>
9
10<meta http-equiv="Content-Type" content="text/html; charset=utf-8">
11
12</head>
13
14<body>
15
16<h1><img src="postfix-logo.jpg" width="203" height="98" ALT="">Postfix
17Queue Scheduler</h1>
18
19<hr>
20
21<h2> Disclaimer </h2>
22
23<p> Many of the <i>transport</i>-specific configuration parameters
24discussed in this document will not show up in "postconf" command
25output before Postfix version 2.9. This limitation applies to many
26parameters whose name is a combination of a master.cf service name
27such as "relay" and a built-in suffix such as
28"_destination_concurrency_limit". </p>
29
30<h2> Overview </h2>
31
32<p> The queue manager is by far the most complex part of the Postfix
33mail system. It schedules delivery of new mail, retries failed
34deliveries at specific times, and removes mail from the queue after
35the last delivery attempt.  There are two major classes of mechanisms
36that control the operation of the queue manager. </p>
37
38<p> Topics covered by this document: </p>
39
40<ul>
41
42<li> <a href="#concurrency"> Concurrency scheduling</a>, concerned
43with the number of concurrent deliveries to a specific destination,
44including decisions on when to suspend deliveries after persistent
45failures.
46
47<li> <a href="#jobs"> Preemptive scheduling</a>, concerned with
48the selection of email messages and recipients for a given destination.
49
50<li> <a href="#credits"> Credits</a>, something this document would not be
51complete without.
52
53</ul>
54
55<!--
56
57<p> Once started, the qmgr(8) process runs until "postfix reload"
58or "postfix stop".  As a persistent process, the queue manager has
59to meet strict requirements with respect to code correctness and
60robustness. Unlike non-persistent daemon processes, the queue manager
61cannot benefit from Postfix's process rejuvenation mechanism that
62limit the impact from resource leaks and other coding errors
63(translation: replacing a process after a short time covers up bugs
64before they can become a problem).  </p>
65
66-->
67
68<h2> <a name="concurrency"> Concurrency scheduling </a> </h2>
69
70<p> The following sections document the Postfix 2.5 concurrency
71scheduler, after a discussion of the limitations of the earlier
72concurrency scheduler. This is followed by results of medium-concurrency
73experiments, and a discussion of trade-offs between performance and
74robustness.  </p>
75
76<p> The material is organized as follows: </p>
77
78<ul>
79
80<li> <a href="#concurrency_drawbacks"> Drawbacks of the existing
81concurrency scheduler </a>
82
83<li> <a href="#concurrency_summary_2_5"> Summary of the Postfix 2.5
84concurrency feedback algorithm </a>
85
86<li> <a href="#dead_summary_2_5"> Summary of the Postfix 2.5 "dead
87destination" detection algorithm </a>
88
89<li> <a href="#pseudo_code_2_5"> Pseudocode for the Postfix 2.5
90concurrency scheduler </a>
91
92<li> <a href="#concurrency_results"> Results for delivery to
93concurrency limited servers </a>
94
95<li> <a href="#concurrency_discussion"> Discussion of concurrency
96limited server results </a>
97
98<li> <a href="#concurrency_limitations"> Limitations of less-than-1
99per delivery feedback </a>
100
101<li> <a href="#concurrency_config"> Concurrency configuration
102parameters </a>
103
104</ul>
105
106<h3> <a name="concurrency_drawbacks"> Drawbacks of the existing
107concurrency scheduler </a> </h3>
108
<p> From the start, Postfix has used a simple but robust algorithm
where the per-destination delivery concurrency is decremented by 1
after a delivery fails due to a connection or handshake failure, and
incremented by 1 otherwise.  Of course, the concurrency is never
allowed to exceed the maximum per-destination concurrency limit.
And when a destination's concurrency level drops to zero, the
destination is declared "dead" and delivery is suspended.  </p>
116
<p> Drawbacks of +/-1 concurrency feedback per delivery are: </p>
118
119<ul>
120
121<li> <p> Overshoot due to exponential delivery concurrency growth
122with each pseudo-cohort(*). This can be an issue with high-concurrency
123channels. For example, with the default initial concurrency of 5,
124concurrency would proceed over time as (5-10-20).  </p>
125
126<li> <p> Throttling down to zero concurrency after a single
127pseudo-cohort(*) failure. This was especially an issue with
128low-concurrency channels where a single failure could be sufficient
129to mark a destination as "dead", causing the suspension of further
130deliveries to the affected destination. </p>
131
132</ul>
133
134<p> (*) A pseudo-cohort is a number of delivery requests equal to
135a destination's delivery concurrency. </p>
136
137<p> The revised concurrency scheduler has a highly modular structure.
138It uses separate mechanisms for per-destination concurrency control
139and for "dead destination" detection.  The concurrency control in
140turn is built from two separate mechanisms: it supports less-than-1
141feedback per delivery to allow for more gradual concurrency
142adjustments, and it uses feedback hysteresis to suppress concurrency
143oscillations.  And instead of waiting for delivery concurrency to
144throttle down to zero, a destination is declared "dead" after a
145configurable number of pseudo-cohorts reports connection or handshake
146failure.  </p>
147
148<h3> <a name="concurrency_summary_2_5"> Summary of the Postfix 2.5
149concurrency feedback algorithm </a> </h3>
150
151<p> We want to increment a destination's delivery concurrency when
152some (not necessarily consecutive) number of deliveries complete
153without connection or handshake failure.  This is implemented with
154positive feedback g(N) where N is the destination's delivery
155concurrency.  With g(N)=1 feedback per delivery, concurrency increases
156by 1 after each positive feedback event; this gives us the old
157scheduler's exponential growth in time. With g(N)=1/N feedback per
158delivery, concurrency increases by 1 after an entire pseudo-cohort
159N of positive feedback reports; this gives us linear growth in time.
160Less-than-1 feedback per delivery and integer truncation naturally
161give us hysteresis, so that transitions to larger concurrency happen
162every 1/g(N) positive feedback events.  </p>
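
<p> As a worked illustration derived from the definitions above (assuming
every delivery succeeds, starting from the default initial concurrency
of 5): </p>

<blockquote>
<pre>
g(N) = 1   : every delivery adds 1, so each pseudo-cohort doubles the
             concurrency: 5 -&gt; 10 -&gt; 20 -&gt; 40 ...
g(N) = 1/N : 5 deliveries are needed to go from 5 to 6, then 6 more to
             go from 6 to 7, and so on: 5 -&gt; 6 -&gt; 7 -&gt; 8 ...
</pre>
</blockquote>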
163
164<p> We want to decrement a destination's delivery concurrency when
165some (not necessarily consecutive) number of deliveries complete
166after connection or handshake failure.  This is implemented with
167negative feedback f(N) where N is the destination's delivery
168concurrency.  With f(N)=1 feedback per delivery, concurrency decreases
169by 1 after each negative feedback event; this gives us the old
170scheduler's behavior where concurrency is throttled down dramatically
171after a single pseudo-cohort failure.  With f(N)=1/N feedback per
172delivery, concurrency backs off more gently.  Again, less-than-1
173feedback per delivery and integer truncation naturally give us
174hysteresis, so that transitions to lower concurrency happen every
1751/f(N) negative feedback events.  </p>
176
177<p> However, with negative feedback we introduce a subtle twist.
178We "reverse" the negative hysteresis cycle so that the transition
179to lower concurrency happens at the <b>beginning</b> of a sequence
180of 1/f(N) negative feedback events.  Otherwise, a correction for
181overload would be made too late.  This makes the choice of f(N)
182relatively unimportant, as borne out by measurements later in this
183document.  </p>
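
<p> As a worked illustration of this "reverse" hysteresis cycle, consider
f(N)=1/N at a delivery concurrency of 10 (the mechanics follow the
pseudocode given later in this document): </p>

<blockquote>
<pre>
failure counter starts at 0
1st failed delivery : failure = 0 - 1/10 &lt; 0, so the concurrency drops
                      to 9 immediately, and failure becomes 0.9
further failures    : each one subtracts about 1/9, so roughly nine
                      more failures pass before the next drop to 8
</pre>
</blockquote>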
184
185<p> In summary, the main ingredients for the Postfix 2.5 concurrency
186feedback algorithm are a) the option of less-than-1 positive feedback
187per delivery to avoid overwhelming servers, b) the option of
188less-than-1 negative feedback per delivery to avoid giving up too
189fast, c) feedback hysteresis to avoid rapid oscillation, and d) a
190"reverse" hysteresis cycle for negative feedback, so that it can
191correct for overload quickly.  </p>
192
193<h3> <a name="dead_summary_2_5"> Summary of the Postfix 2.5 "dead destination" detection algorithm </a> </h3>
194
195<p> We want to suspend deliveries to a specific destination after
196some number of deliveries suffers connection or handshake failure.
197The old scheduler declares a destination "dead" when negative (-1)
198feedback throttles the delivery concurrency down to zero. With
199less-than-1 feedback per delivery, this throttling down would
200obviously take too long.  We therefore have to separate "dead
201destination" detection from concurrency feedback.  This is implemented
202by introducing the concept of pseudo-cohort failure. The Postfix
2032.5 concurrency scheduler declares a destination "dead" after a
204configurable number of pseudo-cohorts suffers from connection or
205handshake failures. The old scheduler corresponds to the special
206case where the pseudo-cohort failure limit is equal to 1.  </p>
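
<p> As a brief worked example (derived from the definitions above, and
ignoring the fact that negative feedback lowers the concurrency at the
same time): at a delivery concurrency of 10, each failed delivery adds
1/10 to the failed pseudo-cohort count, so with a pseudo-cohort failure
limit of 2 the destination is declared "dead" after roughly 20 failed
deliveries in a row; a single successful delivery resets the count to
zero. </p>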
207
208<h3> <a name="pseudo_code_2_5"> Pseudocode for the Postfix 2.5 concurrency scheduler </a> </h3>
209
<p> The pseudo code shows how the ideas behind the new concurrency
scheduler are implemented as of November 2007.  The actual code can
be found in the module qmgr/qmgr_queue.c.  </p>
213
<pre>
Types:
        Each destination has one set of the following variables
        int concurrency
        double success
        double failure
        double fail_cohorts

Feedback functions:
        N is concurrency; x, y are arbitrary numbers in [0..1] inclusive
        positive feedback: g(N) = x/N | x/sqrt(N) | x
        negative feedback: f(N) = y/N | y/sqrt(N) | y

Initialization:
        concurrency = initial_concurrency
        success = 0
        failure = 0
        fail_cohorts = 0

After success:
        fail_cohorts = 0
        Be prepared for feedback &gt; hysteresis, or rounding error
        success += g(concurrency)
        while (success &gt;= 1)            Hysteresis 1
            concurrency += 1            Hysteresis 1
            failure = 0
            success -= 1                Hysteresis 1
        Be prepared for overshoot
        if (concurrency &gt; concurrency limit)
            concurrency = concurrency limit

Safety:
        Don't apply positive feedback unless
            concurrency &lt; busy_refcount + init_dest_concurrency
        otherwise negative feedback effect could be delayed

After failure:
        if (concurrency &gt; 0)
            fail_cohorts += 1.0 / concurrency
            if (fail_cohorts &gt; cohort_failure_limit)
                concurrency = 0
        if (concurrency &gt; 0)
            Be prepared for feedback &gt; hysteresis, rounding errors
            failure -= f(concurrency)
            while (failure &lt; 0)
                concurrency -= 1        Hysteresis 1
                failure += 1            Hysteresis 1
                success = 0
            Be prepared for overshoot
            if (concurrency &lt; 1)
                concurrency = 1
</pre>
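
<p> For readers who prefer real code over pseudocode, the following is a
small, self-contained C sketch of the same feedback logic.  It is an
illustration only, not the actual qmgr/qmgr_queue.c code: the names,
the hard-coded tunables and the surrounding program structure are
invented for this example. </p>

<blockquote>
<pre>
#include &lt;stdio.h&gt;

/* Simplified per-destination state, mirroring the variables above. */
struct dest {
    int     concurrency;                /* current delivery concurrency */
    double  success;                    /* accumulated positive feedback */
    double  failure;                    /* accumulated negative feedback */
    double  fail_cohorts;               /* failed pseudo-cohort count */
    int     dead;                       /* destination declared "dead" */
};

/* Tunables; in Postfix these come from main.cf parameters. */
#define INIT_CONCURRENCY        5
#define CONCURRENCY_LIMIT       20
#define COHORT_FAILURE_LIMIT    1.0

/* Example feedback functions: 1/N positive and 1/N negative feedback. */
static double g(int n) { return 1.0 / n; }
static double f(int n) { return 1.0 / n; }

static void dest_init(struct dest *d)
{
    d-&gt;concurrency = INIT_CONCURRENCY;
    d-&gt;success = d-&gt;failure = d-&gt;fail_cohorts = 0;
    d-&gt;dead = 0;
}

/* Delivery completed without connection or handshake failure. */
static void feedback_success(struct dest *d)
{
    d-&gt;fail_cohorts = 0;
    d-&gt;success += g(d-&gt;concurrency);
    while (d-&gt;success &gt;= 1) {           /* hysteresis: step up at the END */
        d-&gt;concurrency += 1;            /* of a 1/g(N) sequence */
        d-&gt;failure = 0;
        d-&gt;success -= 1;
    }
    if (d-&gt;concurrency &gt; CONCURRENCY_LIMIT)
        d-&gt;concurrency = CONCURRENCY_LIMIT;
}

/* Delivery failed with a connection or handshake failure. */
static void feedback_failure(struct dest *d)
{
    if (d-&gt;concurrency &gt; 0) {
        d-&gt;fail_cohorts += 1.0 / d-&gt;concurrency;
        if (d-&gt;fail_cohorts &gt; COHORT_FAILURE_LIMIT) {
            d-&gt;concurrency = 0;         /* destination is "dead", */
            d-&gt;dead = 1;                /* deliveries are suspended */
        }
    }
    if (d-&gt;concurrency &gt; 0) {
        d-&gt;failure -= f(d-&gt;concurrency);
        while (d-&gt;failure &lt; 0) {        /* reverse hysteresis: step down at */
            d-&gt;concurrency -= 1;        /* the START of a 1/f(N) sequence */
            d-&gt;failure += 1;
            d-&gt;success = 0;
        }
        if (d-&gt;concurrency &lt; 1)
            d-&gt;concurrency = 1;
    }
}

int main(void)
{
    struct dest d;
    int     i;

    dest_init(&amp;d);
    for (i = 0; i &lt; 20; i++)            /* 20 good deliveries ... */
        feedback_success(&amp;d);
    printf("after 20 successes: concurrency %d\n", d.concurrency);
    for (i = 0; i &lt; 3; i++)             /* ... followed by 3 bad ones */
        feedback_failure(&amp;d);
    printf("after 3 failures:   concurrency %d, dead %d\n",
           d.concurrency, d.dead);
    return 0;
}
</pre>
</blockquote>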
266
267<h3> <a name="concurrency_results"> Results for delivery to concurrency-limited servers </a> </h3>
268
269<p> Discussions about the concurrency scheduler redesign started
270early 2004, when the primary goal was to find alternatives that did
271not exhibit exponential growth or rapid concurrency throttling.  No
272code was implemented until late 2007, when the primary concern had
273shifted towards better handling of server concurrency limits. For
274this reason we measure how well the new scheduler does this
275job.  The table below compares mail delivery performance of the old
276+/-1 feedback per delivery with several less-than-1 feedback
277functions, for different limited-concurrency server scenarios.
278Measurements were done with a FreeBSD 6.2 client and with FreeBSD
2796.2 and various Linux servers.  </p>
280
281<p> Server configuration: </p>
282
283<ul> <li> The mail flow was slowed down with 1 second latency per
284recipient ("smtpd_client_restrictions = sleep 1"). The purpose was
285to make results less dependent on hardware details, by avoiding
286slow-downs by queue file I/O, logging I/O, and network I/O.
287
288<li> Concurrency was limited by the server process limit
289("default_process_limit = 5" and "smtpd_client_event_limit_exceptions
290= static:all"). Postfix was stopped and started after changing the
291process limit, because the same number is also used as the backlog
292argument to the listen(2) system call, and "postfix reload" does
293not re-issue this call.
294
295<li> Mail was discarded with "local_recipient_maps = static:all" and
296"local_transport = discard". The discard action in access maps or
297header/body checks
298could not be used as it fails to update the in_flow_delay counters.
299
300</ul>
301
302<p> Client configuration: </p>
303
304<ul>
305
306<li> Queue file overhead was minimized by sending one message to a
307virtual alias that expanded into 2000 different remote recipients.
308All recipients were accounted for according to the maillog file.
309The virtual_alias_expansion_limit setting was increased to avoid
310complaints from the cleanup(8) server.
311
312<li> The number of deliveries was maximized with
313"smtp_destination_recipient_limit = 2". A smaller limit would cause
314Postfix to schedule the concurrency per recipient instead of domain,
315which is not what we want.
316
317<li> Maximum concurrency was limited with
318"smtp_destination_concurrency_limit = 20", and
319initial_destination_concurrency was set to the same value.
320
321<li> The positive and negative concurrency feedback hysteresis was
3221.  Concurrency was incremented by 1 at the END of 1/feedback steps
323of positive feedback, and was decremented by 1 at the START of
3241/feedback steps of negative feedback.
325
326<li> The SMTP client used the default 30s SMTP connect timeout and
327300s SMTP greeting timeout.
328
329</ul>
330
331<h4> Impact of the 30s SMTP connect timeout </h4>
332
333<p> The first results are for a FreeBSD 6.2 server, where our
334artificially low listen(2) backlog results in a very short kernel
335queue for established connections. The table shows that all deferred
336deliveries failed due to a 30s connection timeout, and none failed
337due to a server greeting timeout.  This measurement simulates what
338happens when the server's connection queue is completely full under
339load, and the TCP engine drops new connections.  </p>
340
341<blockquote>
342
343<table>
344
345<tr> <th>client<br> limit</th> <th>server<br> limit</th> <th>feedback<br>
346style</th> <th>connection<br> caching</th> <th>percentage<br>
347deferred</th> <th colspan="2">client concurrency<br> average/stddev</th>
348<th colspan=2>timed-out in<br> connect/greeting </th> </tr>
349
350<tr> <td align="center" colspan="9"> <hr> </td> </tr>
351
352<tr><td align="center">20</td> <td align="center">5</td> <td
353align="center">1/N</td> <td align="center">no</td> <td
354align="center">9.9</td> <td align="center">19.4</td> <td
355align="center">0.49</td> <td align="center">198</td> <td
356align="center">-</td> </tr>
357
358<tr><td align="center">20</td> <td align="center">5</td> <td
359align="center">1/N</td> <td align="center">yes</td> <td
360align="center">10.3</td> <td align="center">19.4</td> <td
361align="center">0.49</td> <td align="center">206</td> <td
362align="center">-</td> </tr>
363
364<tr><td align="center">20</td> <td align="center">5</td> <td
365align="center">1/sqrt(N)</td> <td align="center">no</td>
366<td align="center">10.4</td> <td align="center">19.6</td> <td
367align="center">0.59</td> <td align="center">208</td> <td
368align="center">-</td> </tr>
369
370<tr><td align="center">20</td> <td align="center">5</td> <td
371align="center">1/sqrt(N)</td> <td align="center">yes</td>
372<td align="center">10.6</td> <td align="center">19.6</td> <td
373align="center">0.61</td> <td align="center">212</td> <td
374align="center">-</td> </tr>
375
376<tr><td align="center">20</td> <td align="center">5</td> <td
377align="center">1</td> <td align="center">no</td> <td
378align="center">10.1</td> <td align="center">19.5</td> <td
379align="center">1.29</td> <td align="center">202</td> <td
380align="center">-</td> </tr>
381
382<tr><td align="center">20</td> <td align="center">5</td> <td
383align="center">1</td> <td align="center">yes</td> <td
384align="center">10.8</td> <td align="center">19.3</td> <td
385align="center">1.57</td> <td align="center">216</td> <td
386align="center">-</td> </tr>
387
388<tr> <td align="center" colspan="9"> <hr> </td> </tr>
389
390</table>
391
392<p> A busy server with a completely full connection queue.  N is
393the client delivery concurrency.  Failed deliveries time out after
39430s without completing the TCP handshake. See text for a discussion
395of results. </p>
396
397</blockquote>
398
399<h4> Impact of the 300s SMTP greeting timeout </h4>
400
401<p> The next table shows results for a Fedora Core 8 server (results
402for RedHat 7.3 are identical). In this case, the artificially small
403listen(2) backlog argument does not impact our measurement.  The
404table shows that practically all deferred deliveries fail after the
405300s SMTP greeting timeout. As these timeouts were 10x longer than
406with the first measurement, we increased the recipient count (and
407thus the running time) by a factor of 10 to keep the results
comparable. The deferred mail percentages are a factor of 10 lower
409than with the first measurement, because the 1s per-recipient delay
410was 1/300th of the greeting timeout instead of 1/30th of the
411connection timeout.  </p>
412
413<blockquote>
414
415<table>
416
417<tr> <th>client<br> limit</th> <th>server<br> limit</th> <th>feedback<br>
418style</th> <th>connection<br> caching</th> <th>percentage<br>
419deferred</th> <th colspan="2">client concurrency<br> average/stddev</th>
420<th colspan=2>timed-out in<br> connect/greeting </th> </tr>
421
422<tr> <td align="center" colspan="9"> <hr> </td> </tr>
423
424<tr> <td align="center">20</td> <td align="center">5</td> <td
425align="center">1/N</td> <td align="center">no</td> <td
426align="center">1.16</td> <td align="center">19.8</td> <td
427align="center">0.37</td> <td align="center">-</td> <td
428align="center">230</td> </tr>
429
430<tr> <td align="center">20</td> <td align="center">5</td> <td
431align="center">1/N</td> <td align="center">yes</td> <td
432align="center">1.36</td> <td align="center">19.8</td> <td
433align="center">0.36</td> <td align="center">-</td> <td
434align="center">272</td> </tr>
435
436<tr> <td align="center">20</td> <td align="center">5</td> <td
437align="center">1/sqrt(N)</td> <td align="center">no</td>
438<td align="center">1.21</td> <td align="center">19.9</td> <td
439align="center">0.23</td> <td align="center">4</td> <td
440align="center">238</td> </tr>
441
442<tr> <td align="center">20</td> <td align="center">5</td> <td
443align="center">1/sqrt(N)</td> <td align="center">yes</td>
444<td align="center">1.36</td> <td align="center">20.0</td> <td
445align="center">0.23</td> <td align="center">-</td> <td
446align="center">272</td> </tr>
447
448<tr> <td align="center">20</td> <td align="center">5</td> <td
449align="center">1</td> <td align="center">no</td> <td
450align="center">1.18</td> <td align="center">20.0</td> <td
451align="center">0.16</td> <td align="center">-</td> <td
452align="center">236</td> </tr>
453
454<tr> <td align="center">20</td> <td align="center">5</td> <td
455align="center">1</td> <td align="center">yes</td> <td
456align="center">1.39</td> <td align="center">20.0</td> <td
457align="center">0.16</td> <td align="center">-</td> <td
458align="center">278</td> </tr>
459
460<tr> <td align="center" colspan="9"> <hr> </td> </tr>
461
462</table>
463
464<p> A busy server with a non-full connection queue.  N is the client
465delivery concurrency. Failed deliveries complete at the TCP level,
466but time out after 300s while waiting for the SMTP greeting.  See
467text for a discussion of results.  </p>
468
469</blockquote>
470
471<h4> Impact of active server concurrency limiter </h4>
472
473<p> The final concurrency-limited result shows what happens when
474SMTP connections don't time out, but are rejected immediately with
475the Postfix server's smtpd_client_connection_count_limit feature
476(the server replies with a 421 status and disconnects immediately).
477Similar results can be expected with concurrency limiting features
478built into other MTAs or firewalls.  For this measurement we specified
479a server concurrency limit and a client initial destination concurrency
480of 5, and a server process limit of 10; all other conditions were
481the same as with the first measurement. The same result would be
482obtained with a FreeBSD or Linux server, because the "pushing back"
483is done entirely by the receiving side. </p>
484
485<blockquote>
486
487<table>
488
489<tr> <th>client<br> limit</th> <th>server<br> limit</th> <th>feedback<br>
490style</th> <th>connection<br> caching</th> <th>percentage<br>
491deferred</th> <th colspan="2">client concurrency<br> average/stddev</th>
492<th>theoretical<br>defer rate</th> </tr>
493
<tr> <td align="center" colspan="8"> <hr> </td> </tr>
495
496<tr> <td align="center">20</td> <td align="center">5</td> <td
497align="center">1/N</td> <td align="center">no</td> <td
498align="center">16.5</td> <td align="center">5.17</td> <td
499align="center">0.38</td> <td align="center">1/6</td> </tr>
500
501<tr> <td align="center">20</td> <td align="center">5</td> <td
502align="center">1/N</td> <td align="center">yes</td> <td
503align="center">16.5</td> <td align="center">5.17</td> <td
504align="center">0.38</td> <td align="center">1/6</td> </tr>
505
506<tr> <td align="center">20</td> <td align="center">5</td> <td
507align="center">1/sqrt(N)</td> <td align="center">no</td>
508<td align="center">24.5</td> <td align="center">5.28</td> <td
509align="center">0.45</td> <td align="center">1/4</td> </tr>
510
511<tr> <td align="center">20</td> <td align="center">5</td> <td
512align="center">1/sqrt(N)</td> <td align="center">yes</td>
513<td align="center">24.3</td> <td align="center">5.28</td> <td
514align="center">0.46</td> <td align="center">1/4</td> </tr>
515
516<tr> <td align="center">20</td> <td align="center">5</td> <td
517align="center">1</td> <td align="center">no</td> <td
518align="center">49.7</td> <td align="center">5.63</td> <td
519align="center">0.67</td> <td align="center">1/2</td> </tr>
520
521<tr> <td align="center">20</td> <td align="center">5</td> <td
522align="center">1</td> <td align="center">yes</td> <td
523align="center">49.7</td> <td align="center">5.68</td> <td
524align="center">0.70</td> <td align="center">1/2</td> </tr>
525
<tr> <td align="center" colspan="8"> <hr> </td> </tr>
527
528</table>
529
<p> A server with an active per-client concurrency limiter that replies
531with 421 and disconnects.  N is the client delivery concurrency.
532The theoretical defer rate is 1/(1+roundup(1/feedback)).  This is
533always 1/2 with the fixed +/-1 feedback per delivery; with the
534concurrency-dependent feedback variants, the defer rate decreases
535with increasing concurrency. See text for a discussion of results.
536</p>
537
538</blockquote>
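
<p> To make the last column of the table concrete, the theoretical defer
rates follow directly from the formula above, evaluated at N = 5 (the
concurrency enforced by the server): </p>

<blockquote>
<pre>
feedback 1         : 1/(1 + roundup(1/1))     = 1/2
feedback 1/N       : 1/(1 + roundup(5))       = 1/6
feedback 1/sqrt(N) : 1/(1 + roundup(sqrt(5))) = 1/(1 + 3) = 1/4
</pre>
</blockquote>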
539
540<h3> <a name="concurrency_discussion"> Discussion of concurrency-limited server results </a> </h3>
541
542<p> All results in the previous sections are based on the first
delivery runs only; they do not include any second or later delivery
544attempts. It's also worth noting that the measurements look at
545steady-state behavior only. They don't show what happens when the
546client starts sending at a much higher or lower concurrency.
547</p>
548
549<p> The first two examples show that the effect of feedback
550is negligible when concurrency is limited due to congestion. This
551is because the initial concurrency is already at the client's
552concurrency maximum, and because there is 10-100 times more positive
553than negative feedback.  Under these conditions, it is no surprise
554that the contribution from SMTP connection caching is also negligible.
555</p>
556
557<p> In the last example, the old +/-1 feedback per delivery will
558defer 50% of the mail when confronted with an active (anvil-style)
559server concurrency limit, where the server hangs up immediately
560with a 421 status (a TCP-level RST would have the same result).
561Less aggressive feedback mechanisms fare better than more aggressive
562ones.  Concurrency-dependent feedback fares even better at higher
563concurrencies than shown here, but has limitations as discussed in
564the next section.  </p>
565
566<h3> <a name="concurrency_limitations"> Limitations of less-than-1 per delivery feedback </a> </h3>
567
568<p> Less-than-1 feedback is of interest primarily when sending large
569amounts of mail to destinations with active concurrency limiters
570(servers that reply with 421, or firewalls that send RST).  When
571sending small amounts of mail per destination, less-than-1 per-delivery
572feedback won't have a noticeable effect on the per-destination
573concurrency, because the number of deliveries to the same destination
574is too small. You might just as well use zero per-delivery feedback
575and stay with the initial per-destination concurrency. And when
576mail deliveries fail due to congestion instead of active concurrency
577limiters, the measurements above show that per-delivery feedback
578has no effect.  With large amounts of mail you might just as well
579use zero per-delivery feedback and start with the maximal per-destination
580concurrency.  </p>
581
582<p> The scheduler with less-than-1 concurrency
583feedback per delivery solves a problem with servers that have active
584concurrency limiters.  This works only because feedback is handled
585in a peculiar manner: positive feedback will increment the concurrency
586by 1 at the <b>end</b> of a sequence of events of length 1/feedback,
587while negative feedback will decrement concurrency by 1 at the
588<b>beginning</b> of such a sequence.  This is how Postfix adjusts
589quickly for overshoot without causing lots of mail to be deferred.
590Without this difference in feedback treatment, less-than-1 feedback
591per delivery would defer 50% of the mail, and would be no better
592in this respect than the old +/-1 feedback per delivery.  </p>
593
594<p> Unfortunately, the same feature that corrects quickly for
concurrency overshoot also makes the scheduler more sensitive to
596noisy negative feedback.  The reason is that one lonely negative
597feedback event has the same effect as a complete sequence of length
5981/feedback: in both cases delivery concurrency is dropped by 1
599immediately.  As a worst-case scenario, consider multiple servers
600behind a load balancer on a single IP address, and no backup MX
601address.  When 1 out of K servers fails to complete the SMTP handshake
602or drops the connection, a scheduler with 1/N (N = concurrency)
603feedback stops increasing its concurrency once it reaches a concurrency
604level of about K,  even though the good servers behind the load
605balancer are perfectly capable of handling more traffic. </p>
606
607<p> This noise problem gets worse as the amount of positive feedback
608per delivery gets smaller.  A compromise is to use fixed less-than-1
609positive feedback values instead of concurrency-dependent positive
610feedback.  For example, to tolerate 1 of 4 bad servers in the above
611load balancer scenario, use positive feedback of 1/4 per "good"
612delivery (no connect or handshake error), and use an equal or smaller
613amount of negative feedback per "bad" delivery.  The downside of
614using concurrency-independent feedback is that some of the old +/-1
615feedback problems will return at large concurrencies.  Sites that
616must deliver mail at non-trivial per-destination concurrencies will
617require special configuration.  </p>
618
619<h3> <a name="concurrency_config"> Concurrency configuration parameters </a> </h3>
620
621<p> The Postfix 2.5 concurrency scheduler is controlled with the
622following configuration parameters, where "<i>transport</i>_foo"
623provides a transport-specific parameter override.  All parameter
624default settings are compatible with earlier Postfix versions. </p>
625
626<blockquote>
627
628<table border="0">
629
630<tr> <th> Parameter name </th> <th> Postfix version </th> <th>
631Description </th> </tr>
632
633<tr> <td colspan="3"> <hr> </td> </tr>
634
635<tr> <td> initial_destination_concurrency<br>
636<i>transport</i>_initial_destination_concurrency </td> <td
637align="center"> all<br> 2.5 </td> <td> Initial per-destination
638delivery concurrency </td> </tr>
639
640<tr> <td> default_destination_concurrency_limit<br>
641<i>transport</i>_destination_concurrency_limit </td> <td align="center">
642all<br> all </td> <td> Maximum per-destination delivery concurrency
643</td> </tr>
644
645<tr> <td> default_destination_concurrency_positive_feedback<br>
646<i>transport</i>_destination_concurrency_positive_feedback </td>
647<td align="center"> 2.5<br> 2.5 </td> <td> Per-destination positive
648feedback amount, per delivery that does not fail with connection
649or handshake failure </td> </tr>
650
651<tr> <td> default_destination_concurrency_negative_feedback<br>
652<i>transport</i>_destination_concurrency_negative_feedback </td>
653<td align="center"> 2.5<br> 2.5 </td> <td> Per-destination negative
654feedback amount, per delivery that fails with connection or handshake
655failure </td> </tr>
656
657<tr> <td> default_destination_concurrency_failed_cohort_limit<br>
658<i>transport</i>_destination_concurrency_failed_cohort_limit </td>
659<td align="center"> 2.5<br> 2.5 </td> <td> Number of failed
660pseudo-cohorts after which a destination is declared "dead" and
661delivery is suspended </td> </tr>
662
663<tr> <td> destination_concurrency_feedback_debug</td> <td align="center">
6642.5 </td> <td> Enable verbose logging of concurrency scheduler
665activity </td> </tr>
666
667<tr> <td colspan="3"> <hr> </td> </tr>
668
669</table>
670
671</blockquote>
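
<p> As an illustration only, a main.cf fragment that tunes these knobs
for the smtp transport might look like the example below; the parameter
names are taken from the table above, the values are arbitrary examples
and not a recommendation. </p>

<blockquote>
<pre>
/etc/postfix/main.cf:
    # At most 20 parallel deliveries per destination, starting with 5.
    smtp_destination_concurrency_limit = 20
    smtp_initial_destination_concurrency = 5
    # Declare a destination "dead" only after 5 failed pseudo-cohorts.
    smtp_destination_concurrency_failed_cohort_limit = 5
    # Log the scheduler's concurrency feedback decisions.
    destination_concurrency_feedback_debug = yes
</pre>
</blockquote>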
672
673<h2> <a name="jobs"> Preemptive scheduling </a> </h2>
674
675<p>
676
677The following sections describe the new queue manager and its
678preemptive scheduler algorithm. Note that the document was originally
679written to describe the changes between the new queue manager (in
680this text referred to as <tt>nqmgr</tt>, the name it was known by
681before it became the default queue manager) and the old queue manager
682(referred to as <tt>oqmgr</tt>). This is why it refers to <tt>oqmgr</tt>
683every so often.
684
685</p>
686
687<p>
688
689This document is divided into sections as follows:
690
691</p>
692
693<ul>
694
695<li> <a href="#<tt>nqmgr</tt>_structures"> The structures used by
696nqmgr </a>
697
698<li> <a href="#<tt>nqmgr</tt>_pickup"> What happens when nqmgr picks
699up the message </a> - how it is assigned to transports, jobs, peers,
700entries
701
702<li> <a href="#<tt>nqmgr</tt>_selection"> How the entry selection
703works </a>
704
705<li> <a href="#<tt>nqmgr</tt>_preemption"> How the preemption
706works </a> - what messages may be preempted and how and what messages
707are chosen to preempt them
708
709<li> <a href="#<tt>nqmgr</tt>_concurrency"> How destination concurrency
710limits affect the scheduling algorithm </a>
711
712<li> <a href="#<tt>nqmgr</tt>_memory"> Dealing with memory resource
713limits </a>
714
715</ul>
716
717<h3> <a name="<tt>nqmgr</tt>_structures"> The structures used by
718nqmgr </a> </h3>
719
720<p>
721
722Let's start by recapitulating the structures and terms used when
723referring to the queue manager and how it operates. Many of these are
724partially described elsewhere, but it is nice to have a coherent
725overview in one place:
726
727</p>
728
729<ul>
730
731<li> <p> Each message structure represents one mail message which
Postfix is to deliver. The message recipients specify to what
destinations the message is to be delivered and what transports are
going to be used for the delivery. </p>
735
736<li> <p> Each recipient entry groups a batch of recipients of one
737message which are all going to be delivered to the same destination
738(and over the same transport).
739</p>
740
<li> <p> Each transport structure groups everything that is going
to be delivered by the delivery agents dedicated to that transport.
743Each transport maintains a set of queues (describing the destinations
744it shall talk to) and jobs (referencing the messages it shall
745deliver). </p>
746
<li> <p> Each transport queue (not to be confused with the on-disk
"active" queue or "incoming" queue) groups everything that is going to be
delivered to a given destination (aka nexthop) by its transport.  Each
queue belongs to one transport, so each destination may be referred
to by several queues, one for each transport.  Each queue maintains
a list of all recipient entries (batches of message recipients)
which shall be delivered to the given destination (the todo list), and
a list of recipient entries already being delivered by the delivery
agents (the busy list). </p>
756
757<li> <p> Each queue corresponds to multiple peer structures.  Each
758peer structure is like the queue structure, belonging to one transport
759and referencing one destination. The difference is that it lists
760only the recipient entries which all originate from the same message,
761unlike the queue structure, whose entries may originate from various
762messages. For messages with few recipients, there is usually just
763one recipient entry for each destination, resulting in one recipient
764entry per peer. But for large mailing list messages the recipients
may need to be split into multiple recipient entries, in which case
the peer structure may list many entries for a single destination.
767</p>
768
769<li> <p> Each transport job groups everything it takes to deliver
770one message via its transport. Each job represents one message
771within the context of the transport. The job belongs to one transport
772and message, so each message may have multiple jobs, one for each
773transport. The job groups all the peer structures, which describe
774the destinations the job's message has to be delivered to. </p>
775
776</ul>
777
778<p>
779
780The first four structures are common to both <tt>nqmgr</tt> and
781<tt>oqmgr</tt>, the latter two were introduced by <tt>nqmgr</tt>.
782
783</p>
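
<p> For readers who think in code, the relations described above can be
summarized with a simplified sketch like the one below.  The type and
field names are invented for illustration; the real definitions live in
the qmgr sources and differ in detail. </p>

<blockquote>
<pre>
/* One mail message taken from the "active" queue. */
struct message {
    struct job  *jobs;                  /* list: one job per transport used */
};

/* One delivery agent type (smtp, local, ...). */
struct transport {
    struct queue *queues;               /* list: one queue per destination */
    struct job   *job_list;             /* jobs, ordered for the scheduler */
};

/* All mail for one destination (nexthop) within one transport. */
struct queue {
    struct transport *transport;
    struct entry *todo;                 /* entries waiting to be delivered */
    struct entry *busy;                 /* entries being delivered right now */
};

/* One message within the context of one transport. */
struct job {
    struct message   *message;
    struct transport *transport;
    struct peer      *peers;            /* list: one peer per destination
                                           that this message is sent to */
};

/* The entries of one job that go to one destination: a per-job,
 * per-destination view of a queue. */
struct peer {
    struct job   *job;
    struct queue *queue;
    struct entry *entries;
};

/* A batch of recipients of one message for one destination, at most
 * transport_destination_recipient_limit recipients per entry. */
struct entry {
    struct queue *queue;
    struct peer  *peer;
    /* ... recipient addresses ... */
};
</pre>
</blockquote>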
784
785<p>
786
These terms are used extensively in the text below; feel free to
look up the descriptions above whenever you feel you have lost
track of what is what.
790
791</p>
792
793<h3> <a name="<tt>nqmgr</tt>_pickup"> What happens when nqmgr picks
794up the message </a> </h3>
795
796<p>
797
798Whenever <tt>nqmgr</tt> moves a queue file into the "active" queue,
799the following happens: It reads all necessary information from the
800queue file as <tt>oqmgr</tt> does, and also reads as many recipients
801as possible - more on that later, for now let's just pretend it
802always reads all recipients.
803
804</p>
805
806<p>
807
Then it resolves the recipients as <tt>oqmgr</tt> does, which
means obtaining an (address, nexthop, transport) triple for each
recipient. For each triple, it finds the transport; if it does not
exist yet, it instantiates it (unless it's dead). Within the
transport, it finds the destination queue for the given nexthop; if it
does not exist yet, it instantiates it (unless it's dead). The
triple is then bound to the given destination queue. This happens in
qmgr_resolve() and is basically the same as in <tt>oqmgr</tt>.
816
817</p>
818
819<p>
820
Then for each triple which was bound to some queue (and thus
transport), the program finds the job which represents the message
within that transport's context; if it does not exist yet, it
instantiates it. Within the job, it finds the peer which represents
the bound destination queue within this job's context; if it does
not exist yet, it instantiates it.  Finally, it stores the address
from the resolved triple into the recipient entry which is appended
to both the queue entry list and the peer entry list. The addresses
for the same nexthop are batched in the entries up to the
<i>transport</i>_destination_recipient_limit for that transport.
This happens in qmgr_message_assign(), and apart from the fact that
it operates with job and peer structures, it is basically the
same as in <tt>oqmgr</tt>.
834
835</p>
836
837<p>
838
839When the job is instantiated, it is enqueued on the transport's job
840list based on the time its message was picked up by <tt>nqmgr</tt>.
841For first batch of recipients this means it is appended to the end
842of the job list, but the ordering of the job list by the enqueue
843time is important as we will see shortly.
844
845</p>
846
847<p>
848
849[Now you should have a pretty good idea what the state of the
850<tt>nqmgr</tt> is after a couple of messages were picked up, and what the
851relation is between all those job, peer, queue and entry structures.]
852
853</p>
854
855<h3> <a name="<tt>nqmgr</tt>_selection"> How the entry selection
856works </a> </h3>
857
858<p>
859
Having prepared all the above-mentioned structures, the task of
the <tt>nqmgr</tt>'s scheduler is to choose the recipient entries
one at a time and pass them to the delivery agent for the corresponding
transport. Now how does this work?
864
865</p>
866
867<p>
868
869The first approximation of the new scheduling algorithm is like this:
870
871</p>
872
<blockquote>
<pre>
foreach transport (round-robin-by-transport)
do
    if transport busy continue
    if transport process limit reached continue
    foreach transport's job (in the order of the transport's job list)
    do
        foreach job's peer (round-robin-by-destination)
            if peer-&gt;queue-&gt;concurrency &lt; peer-&gt;queue-&gt;window
                return next peer entry.
        done
    done
done
</pre>
</blockquote>
889
890<p>
891
Now what is the "order of the transport's job list"? As we know
already, the job list is by default kept in the order the message
was picked up by the <tt>nqmgr</tt>. So by default we get round-robin
over the transports at the top level, and within each transport we
get FIFO message delivery. The round-robin of the peers by the
destination is perhaps of little importance in most real-life cases
(unless the <i>transport</i>_destination_recipient_limit is reached,
a single job has only one peer structure, with a single entry, for
each destination), but theoretically it makes sure that even within
a single job, destinations are treated fairly.
903
904</p>
905
906<p>
907
908[By now you should have a feeling you really know how the scheduler
909works, except for the preemption, under ideal conditions - that is,
910no recipient resource limits and no destination concurrency problems.]
911
912</p>
913
914<h3> <a name="<tt>nqmgr</tt>_preemption"> How the preemption
915works </a> </h3>
916
917<p>
918
919As you might perhaps expect by now, the transport's job list does
920not remain sorted by the job's message enqueue time all the time.
921The most cool thing about <tt>nqmgr</tt> is not the simple FIFO
922delivery, but that it is able to slip mail with little recipients
923past the mailing-list bulk mail.  This is what the job preemption
924is about - shuffling the jobs on the transport's job list to get
925the best message delivery rates. Now how is it achieved?
926
927</p>
928
929<p>
930
931First I have to tell you that there are in fact two job lists in
932each transport. One is the scheduler's job list, which the scheduler
933is free to play with, while the other one keeps the jobs always
listed in the order of the enqueue time and is used for the recipient
pool management which we will discuss later. For now, we will deal with
936the scheduler's job list only.
937
938</p>
939
940<p>
941
942So, we have the job list, which is first ordered by the time the
943jobs' messages were enqueued, oldest messages first, the most recently
944picked one at the end. For now, let's assume that there are no
destination concurrency problems. Without preemption, we pick some
entry of the first (oldest) job on the queue, assign it to a delivery
agent, pick another one from the same job, assign it again, and so
on, until all the entries are used and the job is delivered. We
would then move on to the next job and so on and on. Now how do we
manage to sneak in some entries from the recently added jobs when
the first job on the job list belongs to a message going to a
mailing list and has thousands of recipient entries?
953
954</p>
955
956<p>
957
958The <tt>nqmgr</tt>'s answer is that we can artificially "inflate"
959the delivery time of that first job by some constant for free - it
960is basically the same trick you might remember as "accumulation of
961potential" from the amortized complexity lessons. For example,
962instead of delivering the entries of the first job on the job list
963every time a delivery agent becomes available, we can do it only
964every second time. If you view the moments the delivery agent becomes
965available on a timeline as "delivery slots", then instead of using
966every delivery slot for the first job, we can use only every other
967slot, and still the overall delivery efficiency of the first job
968remains the same. So the delivery <tt>11112222</tt> becomes
969<tt>1.1.1.1.2.2.2.2</tt> (1 and 2 are the imaginary job numbers, .
970denotes the free slot). Now what do we do with free slots?
971
972</p>
973
974<p>
975
As you might have guessed, we will use them for sneaking in the mail
with few recipients. For example, if we have one four-recipient
mail followed by four one-recipient mails, the delivery sequence
(that is, the sequence in which the jobs are assigned to the
delivery slots) might look like this: <tt>12131415</tt>. Hmm, fine
for sneaking in single-recipient mail, but how do we sneak in
mail with more than one recipient? Say we have one four-recipient
mail followed by two two-recipient mails?
984
985</p>
986
987<p>
988
The simple answer would be to use the delivery sequence <tt>12121313</tt>.
But the problem is that this does not scale well. Imagine you have
mail with a thousand recipients followed by mail with a hundred recipients.
It is tempting to suggest a delivery sequence like <tt>121212....</tt>,
but alas! Imagine that another mail arrives with, say, ten recipients.
But there are no free slots anymore, so it can't slip by, not even
if it had only one recipient.  It will be stuck until the
hundred-recipient mail is delivered, which really sucks.
997
998</p>
999
1000<p>
1001
So, it becomes obvious that while inflating the message delivery
time to get free slots is a great idea, one has to be really careful
about how the free slots are assigned, otherwise one might paint
oneself into a corner. So, how does <tt>nqmgr</tt> really use the
free slots?
1006
1007</p>
1008
1009<p>
1010
1011The key idea is that one does not have to generate the free slots
1012in a uniform way. The delivery sequence <tt>111...1</tt> is no
worse than <tt>1.1.1.1</tt>; in fact, it is even better, as in the
first case some entries are selected earlier than in the second
case, and none is selected later! So it is possible to first
1016"accumulate" the free delivery slots and then use them all at once.
1017It is even possible to accumulate some, then use them, then accumulate
1018some more and use them again, as in <tt>11..1.1</tt> .
1019
1020</p>
1021
1022<p>
1023
Let's get back to the hundred-recipient example. We now know
that we could first accumulate one hundred free slots, and only
then preempt the first job and sneak the hundred-recipient
mail in. Applying the algorithm recursively, we see the
hundred-recipient job can accumulate ten free delivery slots, and
then we could preempt it and sneak in the ten-recipient mail...
Wait wait wait! Could we? Aren't we overinflating the original
thousand-recipient mail?
1032
1033</p>
1034
1035<p>
1036
Well, despite the fact that it looks so at first glance, another trick
will allow us to answer "no, we are not!". If we had said that we would
inflate the delivery time at most twice, and then considered
every other slot a free slot, then we would overinflate in the case
of recursive preemption. BUT! The trick is that if we use only
every n-th slot as a free slot for n&gt;2, there is always some worst
inflation factor which we can guarantee not to be breached, even
if we apply the algorithm recursively. To be precise, if for every
k&gt;1 normally used slots we accumulate one free delivery slot, then
the inflation factor is not worse than k/(k-1) no matter how many
recursive preemptions happen. And it's not worse than (k+1)/k if
only non-recursive preemption happens. Now, having got through the
theory and the related math, let's see how <tt>nqmgr</tt> implements
this.
1051
1052</p>
1053
1054<p>
1055
Each job has a so-called "available delivery slot" counter. Each
transport has a <i>transport</i>_delivery_slot_cost parameter, which
defaults to the default_delivery_slot_cost parameter, which is set to 5
by default. This is the k from the paragraph above. Each time k
1060entries of the job are selected for delivery, this counter is
1061incremented by one. Once there are some slots accumulated, a job which
1062requires no more than that number of slots to be fully delivered
1063can preempt this job.
1064
1065</p>
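
<p> With the default slot cost of 5, the worst-case inflation bounds
given above work out as follows: </p>

<blockquote>
<pre>
k = 5 : recursive preemption     : inflation &lt;= k/(k-1) = 5/4 = 1.25
        non-recursive preemption : inflation &lt;= (k+1)/k = 6/5 = 1.20
</pre>
</blockquote>

<p> In other words, the delivery of the large job is stretched by at most
25% (or 20% without recursive preemption), no matter how much small mail
is sneaked in ahead of it. </p>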
1066
1067<p>
1068
1069[Well, the truth is, the counter is incremented every time an entry
1070is selected and it is divided by k when it is used.
But for understanding the algorithm, the above
approximation of the truth is good enough.]
1073
1074</p>
1075
1076<p>
1077
OK, so now we know the conditions which must be satisfied so that one
job can preempt another one. But what job gets preempted, how do
we choose what job preempts it if there are several valid candidates,
and when exactly does all this happen?
1082
1083</p>
1084
1085<p>
1086
The answer for the first part is simple. The job whose entry was
selected last is the so-called current job. Normally, it is
1089the first job on the scheduler's job list, but destination concurrency
1090limits may change this as we will see later. It is always only the
1091current job which may get preempted.
1092
1093</p>
1094
1095<p>
1096
Now for the second part. The current job has a certain number of
recipient entries, and as such can accumulate at most some number
of available delivery slots. It might have already accumulated some,
and perhaps even already used some when it was preempted before
(remember a job can be preempted several times). In either case,
we know how many are accumulated and how many are left to deliver,
so we know at most how many more it can accumulate. Every other
1104job which may be delivered by less than that number of slots is a
1105valid candidate for preemption. How do we choose among them?
1106
1107</p>
1108
1109<p>
1110
1111The answer is - the one with maximum enqueue_time/recipient_entry_count.
1112That is, the older the job is, the more we should try to deliver
it in order to get the best message delivery rates. These rates are of
course subject to how many recipients the message has, therefore
the division by the recipient (entry) count. No one should be surprised
that a message with n recipients takes n times longer to deliver than
1117a message with one recipient.
1118
1119</p>
1120
1121<p>
1122
1123Now let's recap the previous two paragraphs. Isn't it too complicated?
Why aren't the candidates chosen only from among the jobs which can be
delivered within the number of slots the current job has already
accumulated? Why do we need to estimate how much it has yet to
accumulate? If you figured out the answer, congratulate yourself. If
we did it this simple way, we would always choose the candidate
with the fewest recipient entries. If there were enough single-recipient
mails coming in, they would always slip by the bulk mail as soon
as possible, and mails with two or more recipients would never get
a chance, no matter how long they have been sitting around in the
job list.
1134
1135</p>
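
<p> In code, this selection rule might be sketched roughly as follows.
This is a simplified illustration, not the actual qmgr_choose_candidate()
implementation; the field names are invented, "enqueue age" stands for
the document's enqueue_time read as the time the job's message has spent
in the queue (so that older jobs score higher), and the real code
additionally skips blocker jobs (described later) and applies the slot
discount and loan described below. </p>

<blockquote>
<pre>
#include &lt;stddef.h&gt;

struct sched_job {
    struct sched_job *next;             /* scheduler's job list order */
    double  enqueue_age;                /* seconds since the message was
                                           picked up by the queue manager */
    int     entries_left;               /* recipient entries still to deliver */
};

/* Among the jobs that could be fully delivered within the slots the
 * current job can still make available, pick the one with the largest
 * enqueue_age / entries_left ratio. */
static struct sched_job *choose_candidate(struct sched_job *current,
                                          int slots_available)
{
    struct sched_job *job;
    struct sched_job *best = NULL;
    double  best_score = 0;

    for (job = current-&gt;next; job != NULL; job = job-&gt;next) {
        if (job-&gt;entries_left &gt; slots_available)
            continue;                   /* too big to sneak in */
        double  score = job-&gt;enqueue_age / job-&gt;entries_left;
        if (best == NULL || score &gt; best_score) {
            best = job;
            best_score = score;
        }
    }
    return best;                        /* NULL if there is no candidate */
}
</pre>
</blockquote>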
1136
1137<p>
1138
1139This candidate selection has an interesting implication - that when
1140we choose the best candidate for preemption (this is done in
qmgr_choose_candidate()), it may happen that we cannot use it for
preemption immediately. This leads to an answer to the last part
1143of the original question - when does the preemption happen?
1144
1145</p>
1146
1147<p>
1148
The preemption attempt happens every time the transport's next recipient
entry is to be chosen for delivery. To avoid needless overhead, the
preemption is not attempted if the current job could never accumulate
more than <i>transport</i>_minimum_delivery_slots (defaults to
default_minimum_delivery_slots which defaults to 3). If there are
already enough accumulated slots to preempt the current job by the
chosen best candidate, it is done immediately. This basically means
that the candidate is moved in front of the current job on the
scheduler's job list and the accumulated slot counter is decreased
by the amount used by the candidate. If there are not enough slots...
well, I could say that nothing happens and another preemption
is attempted the next time. But that's not the complete truth.
1161
1162</p>
1163
1164<p>
1165
The truth is that it turns out that it is not really necessary to
wait until the job's counter accumulates all the delivery slots in
advance. Say we have a ten-recipient mail followed by two two-recipient
mails. If preemption happened only after enough delivery slots had
accumulated (assuming slot cost 2), the delivery sequence would become
<tt>11112211113311</tt>. Now what would we get if we waited for
only 50% of the necessary slots to accumulate and promised to
wait for the remaining 50% later, after we get back
to the preempted job? If we use such a slot loan, the delivery sequence
becomes <tt>11221111331111</tt>. As we can see, this does not make
the delivery of the ten-recipient mail considerably worse, but
it allows the small messages to be delivered sooner.
1178
1179</p>
1180
1181<p>
1182
1183The concept of these slot loans is where the
1184<i>transport</i>_delivery_slot_discount and
1185<i>transport</i>_delivery_slot_loan come from (they default to
1186default_delivery_slot_discount and default_delivery_slot_loan, whose
1187values are by default 50 and 3, respectively). The discount (resp.
1188loan) specifies how many percent (resp. how many slots) one "gets
1189in advance", when the number of slots required to deliver the best
candidate is compared with the number of slots the current job has
accumulated so far.
1192
1193</p>
1194
1195<p>
1196
1197And that pretty much concludes this chapter.
1198
1199</p>
1200
1201<p>
1202
1203[Now you should have a feeling that you pretty much understand the
1204scheduler and the preemption, or at least that you will have
1205after you read the last chapter a couple more times. You shall clearly
1206see the job list and the preemption happening at its head, in ideal
1207delivery conditions. The feeling of understanding shall last until
1208you start wondering what happens if some of the jobs are blocked,
which you might eventually figure out correctly from what has been
said already. But I would be surprised if your mental image of the
1211scheduler's functionality is not completely shattered once you
1212start wondering how it works when not all recipients may be read
1213in-core.  More on that later.]
1214
1215</p>
1216
1217<h3> <a name="<tt>nqmgr</tt>_concurrency"> How destination concurrency
1218limits affect the scheduling algorithm </a> </h3>
1219
1220<p>
1221
1222The <tt>nqmgr</tt> uses the same algorithm for destination concurrency
1223control as <tt>oqmgr</tt>. Now what happens when the destination
1224limits are reached and no more entries for that destination may be
1225selected by the scheduler?
1226
1227</p>
1228
1229<p>
1230
1231From the user's point of view it is all simple. If some of the peers
1232of a job can't be selected, those peers are simply skipped by the
1233entry selection algorithm (the pseudo-code described before) and
1234only the selectable ones are used. If none of the peers may be
1235selected, the job is declared a "blocker job". Blocker jobs are
1236skipped by the entry selection algorithm and they are also excluded
1237from the candidates for preemption of the current job. Thus the scheduler
1238effectively behaves as if the blocker jobs didn't exist on the job
1239list at all. As soon as at least one of the peers of a blocker job
1240becomes unblocked (that is, the delivery agent handling the delivery
1241of the recipient entry for the given destination successfully finishes),
1242the job's blocker status is removed and the job again participates
1243in all further scheduler actions normally.
1244
1245</p>
1246
1247<p>
1248
1249So the summary is that the users don't really have to be concerned
1250about the interaction of the destination limits and scheduling
1251algorithm. It works well on its own and there are no knobs they
1252would need to control it.
1253
1254</p>
1255
1256<p>
1257
1258From a programmer's point of view, the blocker jobs complicate the
1259scheduler quite a lot. Without them, the jobs on the job list would
1260be normally delivered in strict FIFO order. If the current job is
1261preempted, the job preempting it is completely delivered unless it
1262is preempted itself. Without blockers, the current job is thus
1263always either the first job on the job list, or the top of the stack
1264of jobs preempting the first job on the job list.
1265
1266</p>
1267
1268<p>
1269
1270The visualization of the job list and the preemption stack without
1271blockers would be like this:
1272
1273</p>
1274
<blockquote>
<pre>
first job-&gt;    1--2--3--5--6--8--...    &lt;- job list
on job list    |
               4    &lt;- preemption stack
               |
current job-&gt;  7
</pre>
</blockquote>
1284
1285<p>
1286
1287In the example above we see that job 1 was preempted by job 4 and
1288then job 4 was preempted by job 7. After job 7 is completed, remaining
1289entries of job 4 are selected, and once they are all selected, job
12901 continues.
1291
1292</p>
1293
1294<p>
1295
1296As we see, it's all very clean and straightforward. Now how does
1297this change because of blockers?
1298
1299</p>
1300
1301<p>
1302
1303The answer is: a lot. Any job may become a blocker job at any time,
1304and also become a normal job again at any time. This has several
1305important implications:
1306
1307</p>
1308
1309<ol>
1310
1311<li> <p>
1312
1313The jobs may be completed in arbitrary order. For example, in the
1314example above, if the current job 7 becomes blocked, the next job
13154 may complete before the job 7 becomes unblocked again. Or if both
13167 and 4 are blocked, then 1 is completed, then 7 becomes unblocked
1317and is completed, then 2 is completed and only after that 4 becomes
1318unblocked and is completed... You get the idea.
1319
1320</p>
1321
1322<p>
1323
1324[Interesting side note: even when jobs are delivered out of order,
1325from a single destination's point of view the jobs are still delivered
1326in the expected order (that is, FIFO unless there was some preemption
1327involved). This is because whenever a destination queue becomes
1328unblocked (the destination limit allows selection of more recipient
1329entries for that destination), all jobs which have peers for that
1330destination are unblocked at once.]
1331
1332</p>
1333
1334<li> <p>
1335
The idea of the preemption stack at the head of the job list is
gone.  That is, it must be possible to preempt any job on the job
list. For example, if jobs 7, 4, 1 and 2 in the example above all
become blocked, job 3 becomes the current job. And of course we do
not want preemption to be affected by whether some jobs happen to
be blocked. Therefore, if it turns out that job 3 may be preempted
by job 6, the implementation must make that possible.
1344
1345</p>
1346
1347<li> <p>
1348
The idea of the linear preemption stack itself is gone. It is no
longer true that a job is directly preempted by at most one other
job at a time (not counting the recursively nested jobs). In the
example above, job 1 is directly preempted only by job 4, and job 4
by job 7. Now assume that job 7 becomes blocked and job 4 is being
delivered. If job 4 accumulates enough delivery slots, it may
naturally be preempted again, for example by job 8. Job 4 is then
preempted by both job 7 AND job 8 at the same time.
1358
1359</p>
1360
1361</ol>
1362
1363<p>
1364
Now combine points 2) and 3) with point 1) again and you realize
that the relations on the once-linear job list have become pretty
complicated. To extend the example from point 3): jobs 7 and 8
preempt job 4, then job 8 becomes blocked too, and then job 4
completes. Tricky, huh?
1370
1371</p>
1372
1373<p>
1374
If we illustrate the relations after the examples mentioned above
(excluding those in point 1), the situation looks like this:
1377
1378</p>
1379
1380<blockquote>
1381<pre>
1382                            v- parent
1383
1384adoptive parent -&gt;    1--2--3--5--...      &lt;- "stack" level 0
1385                      |     |
1386parent gone -&gt;        ?     6              &lt;- "stack" level 1
1387                     / \
1388children -&gt;         7   8   ^- child       &lt;- "stack" level 2
1389
1390                      ^- siblings
1391</pre>
1392</blockquote>
1393
1394<p>
1395
1396Now how does <tt>nqmgr</tt> deal with all these complicated relations?
1397
1398</p>
1399
1400<p>
1401
1402Well, it maintains them all as described, but fortunately, all these
1403relations are necessary only for the purpose of proper counting of
1404available delivery slots. For the purpose of ordering the jobs for
1405entry selection, the original rule still applies: "the job preempting
1406the current job is moved in front of the current job on the job
1407list". So for entry selection purposes, the job relations remain
1408as simple as this:
1409
1410</p>
1411
1412<blockquote>
1413<pre>
14147--8--1--2--6--3--5--..   &lt;- scheduler's job list order
1415</pre>
1416</blockquote>
1417
1418<p>
1419
The job list order and the preemption parent/child/sibling relations
are maintained separately. And because the selection works only
with the job list, you can happily forget about those complicated
relations unless you want to study the <tt>nqmgr</tt> sources. In
that case the text above might provide some helpful introduction
to the problem domain. Otherwise I suggest you just forget about
all this and stick with the user's point of view: the blocker jobs
are simply ignored.
1428
1429</p>
1430
1431<p>
1432
[By now, you should have a feeling that there are more things going
on under the hood than you ever wanted to know. You decide that
forgetting about this chapter is the best you can do for the sake
of your mind's health, and you basically stick with the picture of
how the scheduler works in ideal conditions, when there are no
blockers, which is good enough.]
1439
1440</p>
1441
1442<h3> <a name="<tt>nqmgr</tt>_memory"> Dealing with memory resource
1443limits </a> </h3>
1444
1445<p>
1446
1447When discussing the <tt>nqmgr</tt> scheduler, we have so far assumed
1448that all recipients of all messages in the "active" queue are completely
1449read into memory. This is simply not true. There is an upper
1450bound on the amount of memory the <tt>nqmgr</tt> may use, and
1451therefore it must impose some limits on the information it may store
1452in memory at any given time.
1453
1454</p>
1455
1456<p>
1457
First of all, not all messages may be read in-core at once. At most
qmgr_message_active_limit messages may be read in-core at any given
time. Messages are picked from the "incoming" and "deferred" queues
and moved to the "active" queue as they are read into memory (incoming
having priority), so if more messages are queued than
qmgr_message_active_limit allows, the rest have to wait until (some
of) the messages in the "active" queue are completely delivered (or
deferred).
1466
1467</p>
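
<p>

The admission rule can be pictured with the small Python sketch below.
All names and data structures are invented for this illustration; this
is not the actual queue manager code.

</p>

<blockquote>
<pre>
# Illustrative sketch: admit messages into the "active" queue up to
# qmgr_message_active_limit, taking "incoming" mail before "deferred".

def fill_active_queue(active, incoming, deferred, active_limit):
    """Move queued messages into 'active' until the limit is reached."""
    while len(active) &lt; active_limit:
        if incoming:
            active.append(incoming.pop(0))   # incoming has priority
        elif deferred:
            active.append(deferred.pop(0))
        else:
            break                            # nothing left to admit

# Example: with active_limit = 3, two incoming and two deferred messages,
# only one deferred message is admitted; the other waits until a message
# in the active queue is completely delivered (or deferred again).
active, incoming, deferred = [], ["i1", "i2"], ["d1", "d2"]
fill_active_queue(active, incoming, deferred, active_limit=3)
print(active)   # ['i1', 'i2', 'd1']
</pre>
</blockquote>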
1468
1469<p>
1470
Even with a limited number of in-core messages, there is another
limit which must be imposed in order to avoid memory exhaustion.
Each message may contain a huge number of recipients (tens or hundreds
of thousands are not uncommon), so if <tt>nqmgr</tt> were to read all
recipients of all messages in the "active" queue, it could easily run
out of memory. Therefore there must be some upper bound on the
number of message recipients which are read into memory at the
same time.
1479
1480</p>
1481
1482<p>
1483
Before discussing how exactly <tt>nqmgr</tt> implements the recipient
limits, let's see how the mere existence of these limits
affects the <tt>nqmgr</tt> and its scheduler.
1487
1488</p>
1489
1490<p>
1491
1492The message limit is straightforward - it just limits the size of
1493the
1494lookahead the <tt>nqmgr</tt>'s scheduler has when choosing which
1495message can preempt the current one. Messages not in the "active" queue
1496are simply not considered at all.
1497
1498</p>
1499
1500<p>
1501
The recipient limit complicates things more. First of all, the
message reading code must support reading the recipients in batches,
which among other things means accessing the queue file several
times and continuing where the last recipient batch ended. This is
invoked by the scheduler whenever the current job has space for more
recipients, subject to the transport's refill_limit and refill_delay
parameters. It is also done whenever all in-core recipients of the
message have been dealt with (which may also mean they were deferred)
but there are still more in the queue file.
1511
1512</p>
1513
1514<p>
1515
1516The second complication is that with some recipients left unread
1517in the queue file, the scheduler can't operate with exact counts
1518of recipient entries. With unread recipients, it is not clear how
1519many recipient entries there will be, as they are subject to
1520per-destination grouping. It is not even clear to what transports
1521(and thus jobs) the recipients will be assigned. And with messages
1522coming from the "deferred" queue, it is not even clear how many unread
1523recipients are still to be delivered. This all means that the
scheduler must use only estimates of how many recipient entries
1525there will be.  Fortunately, it is possible to estimate the minimum
1526and maximum correctly, so the scheduler can always err on the safe
1527side.  Obviously, the better the estimates, the better the results, so
1528it is best when we are able to read all recipients in-core and turn
1529the estimates into exact counts, or at least try to read as many
1530as possible to make the estimates as accurate as possible.
1531
1532</p>
1533
1534<p>
1535
1536The third complication is that it is no longer true that the scheduler
1537is done with a job once all of its in-core recipients are delivered.
1538It is possible that the job will be revived later, when another
1539batch of recipients is read in core. It is also possible that some
1540jobs will be created for the first time long after the first batch
1541of recipients was read in core. The <tt>nqmgr</tt> code must be
1542ready to handle all such situations.
1543
1544</p>
1545
1546<p>
1547
1548And finally, the fourth complication is that the <tt>nqmgr</tt>
code must somehow impose the recipient limit itself. Now how does
it achieve that?
1551
1552</p>
1553
1554<p>
1555
Perhaps the easiest solution would be to say that each message may
have at most X recipients stored in-core, but such a solution would
be poor for several reasons. With reasonable qmgr_message_active_limit
values, X would have to be quite low to maintain a reasonable
memory footprint, and with a low X many things would not work well.
The <tt>nqmgr</tt> would have trouble using the
<i>transport</i>_destination_recipient_limit efficiently. The
scheduler's preemption would be suboptimal because the recipient
count estimates would be inaccurate. The message queue file would
have to be accessed many times to read in more recipients again and
again.
1567
1568</p>
1569
1570<p>
1571
Therefore it seems reasonable to have a solution which does not use
a limit imposed on a per-message basis, but which maintains a pool
of available recipient slots that can be shared among all messages
in the most efficient manner. And as we do not want separate
transports to compete for resources where this can be avoided, it
seems appropriate to maintain such a recipient pool for each transport
separately. This is the general idea; now how does it work in
practice?
1580
1581</p>
1582
1583<p>
1584
1585First we have to solve a little chicken-and-egg problem. If we want
1586to use the per-transport recipient pools, we first need to know to
1587what transport(s) the message is assigned. But we will find that
1588out only after we first read in the recipients. So it is obvious
1589that we first have to read in some recipients, use them to find out
1590to what transports the message is to be assigned, and only after
1591that can we use the per-transport recipient pools.
1592
1593</p>
1594
1595<p>
1596
Now how many recipients shall we read the first time? This is
what the qmgr_message_recipient_minimum and qmgr_message_recipient_limit
values control. The qmgr_message_recipient_minimum value specifies
how many recipients of each message we will read the first time,
no matter what.  It is necessary to read at least one recipient
before we can assign the message to a transport and create the first
job. However, reading only qmgr_message_recipient_minimum recipients
even when there are only a few messages with few recipients in-core
would be wasteful. Therefore, if there are fewer than
qmgr_message_recipient_limit recipients in-core so far, the first
batch of recipients may be larger than qmgr_message_recipient_minimum:
as large as is required to reach the qmgr_message_recipient_limit
limit.
1609
1610</p>
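
<p>

A minimal Python sketch of this first-batch rule follows. The function
and argument names are invented for the illustration, and the parameter
values in the example are hypothetical.

</p>

<blockquote>
<pre>
# Illustrative sketch of the first-batch rule described above.

def first_batch_size(recipients_in_core, minimum, limit):
    """How many recipients of a new message to read the first time.

    'recipients_in_core' is the number of recipients of all messages
    already held in memory; 'minimum' and 'limit' play the roles of
    qmgr_message_recipient_minimum and qmgr_message_recipient_limit."""
    # Always read at least the minimum; if the global in-core count is
    # still below the limit, read enough to reach that limit.
    return max(minimum, limit - recipients_in_core)

# With minimum=10 and limit=20000 (example values), an otherwise idle
# queue manager reads up to 20000 recipients in the first batch, while
# a busy one falls back to batches of 10.
print(first_batch_size(0, 10, 20000))       # 20000
print(first_batch_size(19995, 10, 20000))   # 10
</pre>
</blockquote>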
1611
1612<p>
1613
Once the first batch of recipients has been read in core and the
message jobs have been created, the size of the subsequent recipient
batches (if any - of course it's best when all recipients are read
in one batch) is based solely on the position of the message jobs on
their corresponding transports' job lists. Each transport has a pool
of <i>transport</i>_recipient_limit recipient slots which it can
distribute among its jobs (how this is done is described later).
A subsequent recipient batch may be as large as the sum of all
recipient slots of all jobs of the message permits (plus the
qmgr_message_recipient_minimum amount which always applies).
1625</p>
1626
1627<p>
1628
For example, if a message has three jobs, the first with 1 recipient
still in-core and 4 recipient slots, the second with 5 recipients
in-core and 5 recipient slots, and the third with 2 recipients in-core
and 0 recipient slots, it has 1+5+2=8 recipients in-core and 4+5+0=9
job recipient slots in total. This means that we could immediately
read 2+qmgr_message_recipient_minimum more recipients of that message
in core.
1636
1637</p>
1638
1639<p>
1640
1641The above example illustrates several things which might be worth
1642mentioning explicitly: first, note that although the per-transport
1643slots are assigned to particular jobs, we can't guarantee that once
1644the next batch of recipients is read in core, that the corresponding
1645amounts of recipients will be assigned to those jobs. The jobs lend
1646its slots to the message as a whole, so it is possible that some
1647jobs end up sponsoring other jobs of their message. For example,
1648if in the example above the 2 newly read recipients were assigned
1649to the second job, the first job sponsored the second job with 2
1650slots. The second notable thing is the third job, which has more
1651recipients in-core than it has slots. Apart from sponsoring by other
1652job we just saw it can be result of the first recipient batch, which
1653is sponsored from global recipient pool of qmgr_message_recipient_limit
1654recipients. It can be also sponsored from the message recipient
1655pool of qmgr_message_recipient_minimum recipients.
1656
1657</p>
1658
1659<p>
1660
Now how does each transport distribute the recipient slots among
its jobs?  The strategy is quite simple. As most scheduler activity
happens at the head of the job list, it is our intention to make
sure that the scheduler has the best estimates of the recipient
counts for those jobs. As we mentioned above, this means that we
want to try to make sure that the messages of those jobs have all
their recipients read in-core. Therefore the transport distributes
the slots "along" the job list from start to end. In this case the
job list sorted by message enqueue time is used, because it doesn't
change over time the way the scheduler's job list does.
1671
1672</p>
1673
1674<p>
1675
1676More specifically, each time a job is created and appended to the
1677job list, it gets all unused recipient slots from its transport's
1678pool. It keeps them until all recipients of its message are read.
1679When this happens, all unused recipient slots are transferred to
1680the next job (which is now in fact the first such job) on the job
1681list which still has some recipients unread, or eventually back to
1682the transport pool if there is no such job. Such a transfer then also
1683happens whenever a recipient entry of that job is delivered.
1684
1685</p>
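
<p>

The following Python sketch captures the idea of handing unused slots
along the enqueue-time-ordered job list. All class and function names
are invented for this illustration and simplify the real accounting
considerably.

</p>

<blockquote>
<pre>
# Illustrative sketch: a newly appended job takes all unused slots from
# its transport's pool; when a job releases slots, they move to the
# first job that still has unread recipients, or back to the pool.

class Job:
    def __init__(self, name, unread=True):
        self.name = name
        self.unread = unread   # message still has recipients in the
                               # queue file that were not read yet
        self.slots = 0         # recipient slots currently held

class Transport:
    def __init__(self, pool):
        self.pool = pool       # unused slots not held by any job
        self.jobs = []         # ordered by message enqueue time

    def append_job(self, job):
        # A newly created job takes all unused slots from the pool.
        self.jobs.append(job)
        job.slots += self.pool
        self.pool = 0

    def release_slots(self, job, count):
        # Called when 'job' no longer needs 'count' of its slots, e.g.
        # all recipients of its message were read, or an entry of a
        # completely read job was delivered.
        job.slots -= count
        for next_job in self.jobs:
            if next_job is not job and next_job.unread:
                next_job.slots += count   # first unread job gets them
                return
        self.pool += count                # or they go back to the pool

# Example: a transport with 1000 unused slots and one fully unread job.
t = Transport(pool=1000)
t.append_job(Job("job A"))
print(t.jobs[0].slots, t.pool)   # 1000 0
</pre>
</blockquote>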
1686
1687<p>
1688
There is also a scenario in which a job is not appended to the end of
the job list (for example, it was created as a result of a second or
later recipient batch). It then works exactly as above, except that
if the new job was put in front of the first unread job (that is, the
job of a message which still has some unread recipients in the queue
file), the job that used to be the first unread job is first forced
to return all of its unused recipient slots to the transport pool.
1696
1697</p>
1698
1699<p>
1700
The algorithm just described leads to the following state: the first
unread job on the job list always gets all the remaining recipient
slots of its transport (if there are any). The jobs queued before
this job are completely read (that is, all recipients of their
message were already read in core) and hold at most as many slots
as they still have recipients in-core (at most, because of the
sponsoring mentioned before). The jobs after this job get nothing
from the transport recipient pool, unless they got something earlier
and the first unread job was later created and enqueued in front of
them; in that case they, too, hold at most as many slots as they
have recipients in-core.
1712
1713</p>
1714
1715<p>
1716
Things work fine in such a state most of the time, because the
current job is either completely read in-core or holds as many
recipient slots as are available. However, there is one situation
which still has to be handled specially. Imagine that the current
job is preempted by some unread job from the job list while there
are no more recipient slots available, so this new current job could
read only batches of qmgr_message_recipient_minimum recipients at a
time. This would really degrade performance. For this reason, each
transport has an extra pool of <i>transport</i>_extra_recipient_limit
recipient slots, dedicated exactly to this situation. Each time an
unread job preempts the current job, it gets half of the remaining
recipient slots from the normal pool and this extra pool.
1729
1730</p>
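
<p>

One plausible reading of that last rule, again as a hypothetical Python
sketch rather than the actual <tt>nqmgr</tt> accounting, is that the
preempting job is granted half of the combined remainder of both pools:

</p>

<blockquote>
<pre>
# Illustrative sketch of the "extra pool" rule; names are invented and
# the exact accounting in nqmgr may differ in detail.

def slots_for_preempting_unread_job(normal_pool, extra_pool):
    """When a job whose message still has unread recipients preempts
    the current job, grant it half of the remaining slots of the
    normal and extra pools, so it does not have to crawl along in
    qmgr_message_recipient_minimum-sized batches."""
    return (normal_pool + extra_pool) // 2

print(slots_for_preempting_unread_job(normal_pool=0, extra_pool=1000))  # 500
</pre>
</blockquote>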
1731
1732<p>
1733
And that's it. It sure does sound pretty complicated, but fortunately
most people don't really have to care exactly how it works as long
as it works.  Perhaps the only important things for most people to
know are the following upper-bound formulas:
1738
1739</p>
1740
1741<p>
1742
Each transport has at most
1744
1745</p>
1746
1747<blockquote>
1748<pre>
1749max(
1750qmgr_message_recipient_minimum * qmgr_message_active_limit
1751+ *_recipient_limit + *_extra_recipient_limit,
1752qmgr_message_recipient_limit
1753)
1754</pre>
1755</blockquote>
1756
1757<p>
1758
1759recipients in core.
1760
1761</p>
1762
1763<p>
1764
The total number of recipients in core is at most
1766
1767</p>
1768
1769<blockquote>
1770<pre>
1771max(
1772qmgr_message_recipient_minimum * qmgr_message_active_limit
1773+ sum( *_recipient_limit + *_extra_recipient_limit ),
1774qmgr_message_recipient_limit
1775)
1776</pre>
1777</blockquote>
1778
1779<p>
1780
1781where the sum is over all used transports.
1782
1783</p>
1784
1785<p>
1786
1787And this terribly complicated chapter concludes the documentation
1788of the <tt>nqmgr</tt> scheduler.
1789
1790</p>
1791
1792<p>
1793
[By now you should theoretically know the <tt>nqmgr</tt> scheduler
inside out. In practice, you still hope that you will never have
to really understand the last chapter or two completely, and
fortunately most people really won't have to. Understanding how the
scheduler works in ideal conditions is more than good enough for the
vast majority of users.]
1800
1801</p>
1802
1803<h2> <a name="credits"> Credits </a> </h2>
1804
1805<ul>
1806
1807<li> Wietse Venema designed and implemented the initial queue manager
1808with per-domain FIFO scheduling, and per-delivery +/-1 concurrency
1809feedback.
1810
1811<li> Patrik Rak designed and implemented preemption where mail with
1812fewer recipients can slip past mail with more recipients in a
1813controlled manner, and wrote up its documentation.
1814
1815<li> Wietse Venema initiated a discussion with Patrik Rak and Victor
1816Duchovni on alternatives for the +/-1 feedback scheduler's aggressive
1817behavior. This is when K/N feedback was reviewed (N = concurrency).
1818The discussion ended without a good solution for both negative
1819feedback and dead site detection.
1820
1821<li> Victor Duchovni resumed work on concurrency feedback in the
1822context of concurrency-limited servers.
1823
1824<li> Wietse Venema then re-designed the concurrency scheduler in
1825terms of the simplest possible concepts: less-than-1 concurrency
1826feedback per delivery, forward and reverse concurrency feedback
1827hysteresis, and pseudo-cohort failure. At this same time, concurrency
1828feedback was separated from dead site detection.
1829
1830<li> These simplifications, and their modular implementation, helped
1831to develop further insights into the different roles that positive
1832and negative concurrency feedback play, and helped to identify some
1833worst-case scenarios.
1834
1835</ul>
1836
1837</body>
1838
1839</html>
1840