1<!doctype html public "-//W3C//DTD HTML 4.01 Transitional//EN"
2	"http://www.w3.org/TR/html4/loose.dtd">
3
4<html>
5
6<head>
7
8<title>Postfix Queue Scheduler</title>
9
10<meta http-equiv="Content-Type" content="text/html; charset=utf-8">
11<link rel='stylesheet' type='text/css' href='postfix-doc.css'>
12
13</head>
14
15<body>
16
17<h1><img src="postfix-logo.jpg" width="203" height="98" ALT="">Postfix
18Queue Scheduler</h1>
19
20<hr>
21
22<h2> Disclaimer </h2>
23
24<p> Many of the <i>transport</i>-specific configuration parameters
25discussed in this document will not show up in "postconf" command
26output before Postfix version 2.9. This limitation applies to many
27parameters whose name is a combination of a master.cf service name
28such as "relay" and a built-in suffix such as
29"_destination_concurrency_limit". </p>
30
31<h2> Overview </h2>
32
33<p> The queue manager is by far the most complex part of the Postfix
34mail system. It schedules delivery of new mail, retries failed
35deliveries at specific times, and removes mail from the queue after
36the last delivery attempt.  There are two major classes of mechanisms
37that control the operation of the queue manager. </p>
38
39<p> Topics covered by this document: </p>
40
41<ul>
42
43<li> <a href="#concurrency"> Concurrency scheduling</a>, concerned
44with the number of concurrent deliveries to a specific destination,
45including decisions on when to suspend deliveries after persistent
46failures.
47
48<li> <a href="#jobs"> Preemptive scheduling</a>, concerned with
49the selection of email messages and recipients for a given destination.
50
51<li> <a href="#credits"> Credits</a>, something this document would not be
52complete without.
53
54</ul>
55
56<!--
57
58<p> Once started, the qmgr(8) process runs until "postfix reload"
59or "postfix stop".  As a persistent process, the queue manager has
60to meet strict requirements with respect to code correctness and
61robustness. Unlike non-persistent daemon processes, the queue manager
62cannot benefit from Postfix's process rejuvenation mechanism that
63limit the impact from resource leaks and other coding errors
64(translation: replacing a process after a short time covers up bugs
65before they can become a problem).  </p>
66
67-->
68
69<h2> <a name="concurrency"> Concurrency scheduling </a> </h2>
70
71<p> The following sections document the Postfix 2.5 concurrency
72scheduler, after a discussion of the limitations of the earlier
73concurrency scheduler. This is followed by results of medium-concurrency
74experiments, and a discussion of trade-offs between performance and
75robustness.  </p>
76
77<p> The material is organized as follows: </p>
78
79<ul>
80
81<li> <a href="#concurrency_drawbacks"> Drawbacks of the existing
82concurrency scheduler </a>
83
84<li> <a href="#concurrency_summary_2_5"> Summary of the Postfix 2.5
85concurrency feedback algorithm </a>
86
87<li> <a href="#dead_summary_2_5"> Summary of the Postfix 2.5 "dead
88destination" detection algorithm </a>
89
90<li> <a href="#pseudo_code_2_5"> Pseudocode for the Postfix 2.5
91concurrency scheduler </a>
92
93<li> <a href="#concurrency_results"> Results for delivery to
94concurrency limited servers </a>
95
96<li> <a href="#concurrency_discussion"> Discussion of concurrency
97limited server results </a>
98
99<li> <a href="#concurrency_limitations"> Limitations of less-than-1
100per delivery feedback </a>
101
102<li> <a href="#concurrency_config"> Concurrency configuration
103parameters </a>
104
105</ul>
106
107<h3> <a name="concurrency_drawbacks"> Drawbacks of the existing
108concurrency scheduler </a> </h3>
109
110<p> From the start, Postfix has used a simple but robust algorithm
111where the per-destination delivery concurrency is decremented by 1
112after delivery failed due to connection or handshake failure, and
113incremented by 1 otherwise.  Of course the concurrency is never
114allowed to exceed the maximum per-destination concurrency limit.
115And when a destination's concurrency level drops to zero, the
116destination is declared "dead" and delivery is suspended.  </p>
117
<p> Drawbacks of +/-1 concurrency feedback per delivery are: </p>
119
120<ul>
121
122<li> <p> Overshoot due to exponential delivery concurrency growth
123with each pseudo-cohort(*). This can be an issue with high-concurrency
124channels. For example, with the default initial concurrency of 5,
125concurrency would proceed over time as (5-10-20).  </p>
126
127<li> <p> Throttling down to zero concurrency after a single
128pseudo-cohort(*) failure. This was especially an issue with
129low-concurrency channels where a single failure could be sufficient
130to mark a destination as "dead", causing the suspension of further
131deliveries to the affected destination. </p>
132
133</ul>
134
135<p> (*) A pseudo-cohort is a number of delivery requests equal to
136a destination's delivery concurrency. </p>
137
138<p> The revised concurrency scheduler has a highly modular structure.
139It uses separate mechanisms for per-destination concurrency control
140and for "dead destination" detection.  The concurrency control in
141turn is built from two separate mechanisms: it supports less-than-1
142feedback per delivery to allow for more gradual concurrency
143adjustments, and it uses feedback hysteresis to suppress concurrency
144oscillations.  And instead of waiting for delivery concurrency to
145throttle down to zero, a destination is declared "dead" after a
146configurable number of pseudo-cohorts reports connection or handshake
147failure.  </p>
148
149<h3> <a name="concurrency_summary_2_5"> Summary of the Postfix 2.5
150concurrency feedback algorithm </a> </h3>
151
152<p> We want to increment a destination's delivery concurrency when
153some (not necessarily consecutive) number of deliveries complete
154without connection or handshake failure.  This is implemented with
155positive feedback g(N) where N is the destination's delivery
156concurrency.  With g(N)=1 feedback per delivery, concurrency increases
157by 1 after each positive feedback event; this gives us the old
158scheduler's exponential growth in time. With g(N)=1/N feedback per
159delivery, concurrency increases by 1 after an entire pseudo-cohort
160N of positive feedback reports; this gives us linear growth in time.
161Less-than-1 feedback per delivery and integer truncation naturally
162give us hysteresis, so that transitions to larger concurrency happen
163every 1/g(N) positive feedback events.  </p>
164
165<p> We want to decrement a destination's delivery concurrency when
166some (not necessarily consecutive) number of deliveries complete
167after connection or handshake failure.  This is implemented with
168negative feedback f(N) where N is the destination's delivery
169concurrency.  With f(N)=1 feedback per delivery, concurrency decreases
170by 1 after each negative feedback event; this gives us the old
171scheduler's behavior where concurrency is throttled down dramatically
172after a single pseudo-cohort failure.  With f(N)=1/N feedback per
173delivery, concurrency backs off more gently.  Again, less-than-1
174feedback per delivery and integer truncation naturally give us
175hysteresis, so that transitions to lower concurrency happen every
1761/f(N) negative feedback events.  </p>
177
178<p> However, with negative feedback we introduce a subtle twist.
179We "reverse" the negative hysteresis cycle so that the transition
180to lower concurrency happens at the <b>beginning</b> of a sequence
181of 1/f(N) negative feedback events.  Otherwise, a correction for
182overload would be made too late.  This makes the choice of f(N)
183relatively unimportant, as borne out by measurements later in this
184document.  </p>
185
186<p> In summary, the main ingredients for the Postfix 2.5 concurrency
187feedback algorithm are a) the option of less-than-1 positive feedback
188per delivery to avoid overwhelming servers, b) the option of
189less-than-1 negative feedback per delivery to avoid giving up too
190fast, c) feedback hysteresis to avoid rapid oscillation, and d) a
191"reverse" hysteresis cycle for negative feedback, so that it can
192correct for overload quickly.  </p>
193
194<h3> <a name="dead_summary_2_5"> Summary of the Postfix 2.5 "dead destination" detection algorithm </a> </h3>
195
196<p> We want to suspend deliveries to a specific destination after
197some number of deliveries suffers connection or handshake failure.
198The old scheduler declares a destination "dead" when negative (-1)
199feedback throttles the delivery concurrency down to zero. With
200less-than-1 feedback per delivery, this throttling down would
201obviously take too long.  We therefore have to separate "dead
202destination" detection from concurrency feedback.  This is implemented
203by introducing the concept of pseudo-cohort failure. The Postfix
2042.5 concurrency scheduler declares a destination "dead" after a
205configurable number of pseudo-cohorts suffers from connection or
206handshake failures. The old scheduler corresponds to the special
207case where the pseudo-cohort failure limit is equal to 1.  </p>
208
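<p> The pseudo-cohort failure limit is exposed as a configuration
parameter (see the <a href="#concurrency_config">table</a> at the end
of this discussion). As a hypothetical main.cf illustration for the
smtp transport: the compatibility default of 1 reproduces the old
on/off behavior, while a larger value requires that many failed
pseudo-cohorts before deliveries to the destination are suspended: </p>

<blockquote>
<pre>
# Sketch: require 10 failed pseudo-cohorts (instead of the
# backwards-compatible 1) before declaring a destination "dead".
smtp_destination_concurrency_failed_cohort_limit = 10
</pre>
</blockquote>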
209<h3> <a name="pseudo_code_2_5"> Pseudocode for the Postfix 2.5 concurrency scheduler </a> </h3>
210
<p> The pseudocode below shows how the ideas behind the new concurrency
scheduler are implemented as of November 2007.  The actual code can
be found in the module qmgr/qmgr_queue.c.  </p>
214
<pre>
Types:
        Each destination has one set of the following variables
        int concurrency
        double success
        double failure
        double fail_cohorts

Feedback functions:
        N is concurrency; x, y are arbitrary numbers in [0..1] inclusive
        positive feedback: g(N) = x/N | x/sqrt(N) | x
        negative feedback: f(N) = y/N | y/sqrt(N) | y

Initialization:
        concurrency = initial_concurrency
        success = 0
        failure = 0
        fail_cohorts = 0

After success:
        fail_cohorts = 0
        Be prepared for feedback &gt; hysteresis, or rounding error
        success += g(concurrency)
        while (success &gt;= 1)            Hysteresis 1
            concurrency += 1            Hysteresis 1
            failure = 0
            success -= 1                Hysteresis 1
        Be prepared for overshoot
        if (concurrency &gt; concurrency limit)
            concurrency = concurrency limit

Safety:
        Don't apply positive feedback unless
            concurrency &lt; busy_refcount + init_dest_concurrency
        otherwise negative feedback effect could be delayed

After failure:
        if (concurrency &gt; 0)
            fail_cohorts += 1.0 / concurrency
            if (fail_cohorts &gt; cohort_failure_limit)
                concurrency = 0
        if (concurrency &gt; 0)
            Be prepared for feedback &gt; hysteresis, rounding errors
            failure -= f(concurrency)
            while (failure &lt; 0)
                concurrency -= 1        Hysteresis 1
                failure += 1            Hysteresis 1
                success = 0
            Be prepared for overshoot
            if (concurrency &lt; 1)
                concurrency = 1
</pre>
267
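<p> As an illustration, here is a small standalone Python sketch (not
the actual Postfix code, which lives in qmgr/qmgr_queue.c) that follows
the pseudocode above for a single destination with 1/N positive and
negative feedback. The initial and maximum concurrency values are
assumptions, and the "Safety" clause is omitted for brevity: </p>

<blockquote>
<pre>
INIT_CONCURRENCY = 5        # assumed initial_destination_concurrency
MAX_CONCURRENCY = 20        # assumed per-destination concurrency limit
COHORT_FAILURE_LIMIT = 1    # assumed failed pseudo-cohort limit

class Destination:
    def __init__(self):
        self.concurrency = INIT_CONCURRENCY
        self.success = self.failure = self.fail_cohorts = 0.0

    def feedback(self, n):
        return 1.0 / n                      # g(N) = f(N) = 1/N

    def deliver_ok(self):
        self.fail_cohorts = 0.0
        self.success += self.feedback(self.concurrency)
        while self.success &gt;= 1.0:          # hysteresis of 1
            self.concurrency += 1
            self.failure = 0.0
            self.success -= 1.0
        self.concurrency = min(self.concurrency, MAX_CONCURRENCY)

    def deliver_failed(self):
        if self.concurrency &gt; 0:
            self.fail_cohorts += 1.0 / self.concurrency
            if self.fail_cohorts &gt; COHORT_FAILURE_LIMIT:
                self.concurrency = 0        # destination declared "dead"
        if self.concurrency &gt; 0:
            self.failure -= self.feedback(self.concurrency)
            while self.failure &lt; 0.0:       # "reverse" hysteresis of 1
                self.concurrency -= 1
                self.failure += 1.0
                self.success = 0.0
            self.concurrency = max(self.concurrency, 1)

d = Destination()
for i in range(15):
    d.deliver_ok()
print("after 15 successes:", d.concurrency)    # 7: linear growth in time
d.deliver_failed()
print("after 1 failure:   ", d.concurrency)    # 6: stepped down immediately
</pre>
</blockquote>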
268<h3> <a name="concurrency_results"> Results for delivery to concurrency-limited servers </a> </h3>
269
270<p> Discussions about the concurrency scheduler redesign started
271early 2004, when the primary goal was to find alternatives that did
272not exhibit exponential growth or rapid concurrency throttling.  No
273code was implemented until late 2007, when the primary concern had
274shifted towards better handling of server concurrency limits. For
275this reason we measure how well the new scheduler does this
276job.  The table below compares mail delivery performance of the old
277+/-1 feedback per delivery with several less-than-1 feedback
278functions, for different limited-concurrency server scenarios.
279Measurements were done with a FreeBSD 6.2 client and with FreeBSD
2806.2 and various Linux servers.  </p>
281
282<p> Server configuration: </p>
283
284<ul> <li> The mail flow was slowed down with 1 second latency per
285recipient ("smtpd_client_restrictions = sleep 1"). The purpose was
286to make results less dependent on hardware details, by avoiding
287slow-downs by queue file I/O, logging I/O, and network I/O.
288
289<li> Concurrency was limited by the server process limit
290("default_process_limit = 5" and "smtpd_client_event_limit_exceptions
291= static:all"). Postfix was stopped and started after changing the
292process limit, because the same number is also used as the backlog
293argument to the listen(2) system call, and "postfix reload" does
294not re-issue this call.
295
296<li> Mail was discarded with "local_recipient_maps = static:all" and
297"local_transport = discard". The discard action in access maps or
298header/body checks
299could not be used as it fails to update the in_flow_delay counters.
300
301</ul>
302
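<p> Collected into main.cf form, the server-side settings above amount
to roughly the following (a sketch; the original text does not show
the actual file): </p>

<blockquote>
<pre>
# Server main.cf (sketch)
smtpd_client_restrictions = sleep 1
default_process_limit = 5
smtpd_client_event_limit_exceptions = static:all
local_recipient_maps = static:all
local_transport = discard
</pre>
</blockquote>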
303<p> Client configuration: </p>
304
305<ul>
306
307<li> Queue file overhead was minimized by sending one message to a
308virtual alias that expanded into 2000 different remote recipients.
309All recipients were accounted for according to the maillog file.
310The virtual_alias_expansion_limit setting was increased to avoid
311complaints from the cleanup(8) server.
312
313<li> The number of deliveries was maximized with
314"smtp_destination_recipient_limit = 2". A smaller limit would cause
315Postfix to schedule the concurrency per recipient instead of domain,
316which is not what we want.
317
318<li> Maximum concurrency was limited with
319"smtp_destination_concurrency_limit = 20", and
320initial_destination_concurrency was set to the same value.
321
322<li> The positive and negative concurrency feedback hysteresis was
3231.  Concurrency was incremented by 1 at the END of 1/feedback steps
324of positive feedback, and was decremented by 1 at the START of
3251/feedback steps of negative feedback.
326
327<li> The SMTP client used the default 30s SMTP connect timeout and
328300s SMTP greeting timeout.
329
330</ul>
331
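<p> Likewise, the client-side settings above correspond roughly to this
main.cf fragment (a sketch; the virtual_alias_expansion_limit value is
not given in the text, only that it was raised above the recipient
count): </p>

<blockquote>
<pre>
# Client main.cf (sketch)
smtp_destination_recipient_limit = 2
smtp_destination_concurrency_limit = 20
initial_destination_concurrency = 20
virtual_alias_expansion_limit = 20000
</pre>
</blockquote>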
332<h4> Impact of the 30s SMTP connect timeout </h4>
333
334<p> The first results are for a FreeBSD 6.2 server, where our
335artificially low listen(2) backlog results in a very short kernel
336queue for established connections. The table shows that all deferred
337deliveries failed due to a 30s connection timeout, and none failed
338due to a server greeting timeout.  This measurement simulates what
339happens when the server's connection queue is completely full under
340load, and the TCP engine drops new connections.  </p>
341
342<blockquote>
343
344<table>
345
346<tr> <th>client<br> limit</th> <th>server<br> limit</th> <th>feedback<br>
347style</th> <th>connection<br> caching</th> <th>percentage<br>
348deferred</th> <th colspan="2">client concurrency<br> average/stddev</th>
349<th colspan=2>timed-out in<br> connect/greeting </th> </tr>
350
351<tr> <td align="center" colspan="9"> <hr> </td> </tr>
352
353<tr><td align="center">20</td> <td align="center">5</td> <td
354align="center">1/N</td> <td align="center">no</td> <td
355align="center">9.9</td> <td align="center">19.4</td> <td
356align="center">0.49</td> <td align="center">198</td> <td
357align="center">-</td> </tr>
358
359<tr><td align="center">20</td> <td align="center">5</td> <td
360align="center">1/N</td> <td align="center">yes</td> <td
361align="center">10.3</td> <td align="center">19.4</td> <td
362align="center">0.49</td> <td align="center">206</td> <td
363align="center">-</td> </tr>
364
365<tr><td align="center">20</td> <td align="center">5</td> <td
366align="center">1/sqrt(N)</td> <td align="center">no</td>
367<td align="center">10.4</td> <td align="center">19.6</td> <td
368align="center">0.59</td> <td align="center">208</td> <td
369align="center">-</td> </tr>
370
371<tr><td align="center">20</td> <td align="center">5</td> <td
372align="center">1/sqrt(N)</td> <td align="center">yes</td>
373<td align="center">10.6</td> <td align="center">19.6</td> <td
374align="center">0.61</td> <td align="center">212</td> <td
375align="center">-</td> </tr>
376
377<tr><td align="center">20</td> <td align="center">5</td> <td
378align="center">1</td> <td align="center">no</td> <td
379align="center">10.1</td> <td align="center">19.5</td> <td
380align="center">1.29</td> <td align="center">202</td> <td
381align="center">-</td> </tr>
382
383<tr><td align="center">20</td> <td align="center">5</td> <td
384align="center">1</td> <td align="center">yes</td> <td
385align="center">10.8</td> <td align="center">19.3</td> <td
386align="center">1.57</td> <td align="center">216</td> <td
387align="center">-</td> </tr>
388
389<tr> <td align="center" colspan="9"> <hr> </td> </tr>
390
391</table>
392
393<p> A busy server with a completely full connection queue.  N is
394the client delivery concurrency.  Failed deliveries time out after
39530s without completing the TCP handshake. See text for a discussion
396of results. </p>
397
398</blockquote>
399
400<h4> Impact of the 300s SMTP greeting timeout </h4>
401
402<p> The next table shows results for a Fedora Core 8 server (results
403for RedHat 7.3 are identical). In this case, the artificially small
404listen(2) backlog argument does not impact our measurement.  The
405table shows that practically all deferred deliveries fail after the
406300s SMTP greeting timeout. As these timeouts were 10x longer than
407with the first measurement, we increased the recipient count (and
408thus the running time) by a factor of 10 to keep the results
409comparable. The deferred mail percentages are a factor 10 lower
410than with the first measurement, because the 1s per-recipient delay
411was 1/300th of the greeting timeout instead of 1/30th of the
412connection timeout.  </p>
413
414<blockquote>
415
416<table>
417
418<tr> <th>client<br> limit</th> <th>server<br> limit</th> <th>feedback<br>
419style</th> <th>connection<br> caching</th> <th>percentage<br>
420deferred</th> <th colspan="2">client concurrency<br> average/stddev</th>
421<th colspan=2>timed-out in<br> connect/greeting </th> </tr>
422
423<tr> <td align="center" colspan="9"> <hr> </td> </tr>
424
425<tr> <td align="center">20</td> <td align="center">5</td> <td
426align="center">1/N</td> <td align="center">no</td> <td
427align="center">1.16</td> <td align="center">19.8</td> <td
428align="center">0.37</td> <td align="center">-</td> <td
429align="center">230</td> </tr>
430
431<tr> <td align="center">20</td> <td align="center">5</td> <td
432align="center">1/N</td> <td align="center">yes</td> <td
433align="center">1.36</td> <td align="center">19.8</td> <td
434align="center">0.36</td> <td align="center">-</td> <td
435align="center">272</td> </tr>
436
437<tr> <td align="center">20</td> <td align="center">5</td> <td
438align="center">1/sqrt(N)</td> <td align="center">no</td>
439<td align="center">1.21</td> <td align="center">19.9</td> <td
440align="center">0.23</td> <td align="center">4</td> <td
441align="center">238</td> </tr>
442
443<tr> <td align="center">20</td> <td align="center">5</td> <td
444align="center">1/sqrt(N)</td> <td align="center">yes</td>
445<td align="center">1.36</td> <td align="center">20.0</td> <td
446align="center">0.23</td> <td align="center">-</td> <td
447align="center">272</td> </tr>
448
449<tr> <td align="center">20</td> <td align="center">5</td> <td
450align="center">1</td> <td align="center">no</td> <td
451align="center">1.18</td> <td align="center">20.0</td> <td
452align="center">0.16</td> <td align="center">-</td> <td
453align="center">236</td> </tr>
454
455<tr> <td align="center">20</td> <td align="center">5</td> <td
456align="center">1</td> <td align="center">yes</td> <td
457align="center">1.39</td> <td align="center">20.0</td> <td
458align="center">0.16</td> <td align="center">-</td> <td
459align="center">278</td> </tr>
460
461<tr> <td align="center" colspan="9"> <hr> </td> </tr>
462
463</table>
464
465<p> A busy server with a non-full connection queue.  N is the client
466delivery concurrency. Failed deliveries complete at the TCP level,
467but time out after 300s while waiting for the SMTP greeting.  See
468text for a discussion of results.  </p>
469
470</blockquote>
471
472<h4> Impact of active server concurrency limiter </h4>
473
474<p> The final concurrency-limited result shows what happens when
475SMTP connections don't time out, but are rejected immediately with
476the Postfix server's smtpd_client_connection_count_limit feature
477(the server replies with a 421 status and disconnects immediately).
478Similar results can be expected with concurrency limiting features
479built into other MTAs or firewalls.  For this measurement we specified
480a server concurrency limit and a client initial destination concurrency
481of 5, and a server process limit of 10; all other conditions were
482the same as with the first measurement. The same result would be
483obtained with a FreeBSD or Linux server, because the "pushing back"
484is done entirely by the receiving side. </p>
485
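<p> In main.cf terms, this setup corresponds roughly to the following
(a sketch; the original text does not show the actual files): </p>

<blockquote>
<pre>
# Server main.cf (sketch): excess connections get 421 and are dropped
smtpd_client_connection_count_limit = 5
default_process_limit = 10

# Client main.cf (sketch): other settings as in the first measurement
initial_destination_concurrency = 5
</pre>
</blockquote>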
486<blockquote>
487
488<table>
489
490<tr> <th>client<br> limit</th> <th>server<br> limit</th> <th>feedback<br>
491style</th> <th>connection<br> caching</th> <th>percentage<br>
492deferred</th> <th colspan="2">client concurrency<br> average/stddev</th>
493<th>theoretical<br>defer rate</th> </tr>
494
495<tr> <td align="center" colspan="9"> <hr> </td> </tr>
496
497<tr> <td align="center">20</td> <td align="center">5</td> <td
498align="center">1/N</td> <td align="center">no</td> <td
499align="center">16.5</td> <td align="center">5.17</td> <td
500align="center">0.38</td> <td align="center">1/6</td> </tr>
501
502<tr> <td align="center">20</td> <td align="center">5</td> <td
503align="center">1/N</td> <td align="center">yes</td> <td
504align="center">16.5</td> <td align="center">5.17</td> <td
505align="center">0.38</td> <td align="center">1/6</td> </tr>
506
507<tr> <td align="center">20</td> <td align="center">5</td> <td
508align="center">1/sqrt(N)</td> <td align="center">no</td>
509<td align="center">24.5</td> <td align="center">5.28</td> <td
510align="center">0.45</td> <td align="center">1/4</td> </tr>
511
512<tr> <td align="center">20</td> <td align="center">5</td> <td
513align="center">1/sqrt(N)</td> <td align="center">yes</td>
514<td align="center">24.3</td> <td align="center">5.28</td> <td
515align="center">0.46</td> <td align="center">1/4</td> </tr>
516
517<tr> <td align="center">20</td> <td align="center">5</td> <td
518align="center">1</td> <td align="center">no</td> <td
519align="center">49.7</td> <td align="center">5.63</td> <td
520align="center">0.67</td> <td align="center">1/2</td> </tr>
521
522<tr> <td align="center">20</td> <td align="center">5</td> <td
523align="center">1</td> <td align="center">yes</td> <td
524align="center">49.7</td> <td align="center">5.68</td> <td
525align="center">0.70</td> <td align="center">1/2</td> </tr>
526
527<tr> <td align="center" colspan="9"> <hr> </td> </tr>
528
529</table>
530
531<p> A server with active per-client concurrency limiter that replies
532with 421 and disconnects.  N is the client delivery concurrency.
533The theoretical defer rate is 1/(1+roundup(1/feedback)).  This is
534always 1/2 with the fixed +/-1 feedback per delivery; with the
535concurrency-dependent feedback variants, the defer rate decreases
536with increasing concurrency. See text for a discussion of results.
537</p>
538
539</blockquote>
540
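<p> The "theoretical defer rate" column in the table above can be
reproduced with a few lines of Python; N=5 is an assumption, roughly
the concurrency the client hovered around in this measurement: </p>

<blockquote>
<pre>
# Theoretical defer rate 1/(1+roundup(1/feedback)) at assumed N = 5.
import math

N = 5
for name, feedback in [("1/N", 1.0 / N),
                       ("1/sqrt(N)", 1.0 / math.sqrt(N)),
                       ("1", 1.0)]:
    rate = 1.0 / (1 + math.ceil(1.0 / feedback))
    print("%-9s defer rate %.3f" % (name, rate))
# Prints 0.167 (1/6), 0.250 (1/4) and 0.500 (1/2), matching the
# measured 16.5%, ~24% and ~50% in the table.
</pre>
</blockquote>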
541<h3> <a name="concurrency_discussion"> Discussion of concurrency-limited server results </a> </h3>
542
<p> All results in the previous sections are based on the first
delivery runs only; they do not include any second or later delivery
545attempts. It's also worth noting that the measurements look at
546steady-state behavior only. They don't show what happens when the
547client starts sending at a much higher or lower concurrency.
548</p>
549
550<p> The first two examples show that the effect of feedback
551is negligible when concurrency is limited due to congestion. This
552is because the initial concurrency is already at the client's
553concurrency maximum, and because there is 10-100 times more positive
554than negative feedback.  Under these conditions, it is no surprise
555that the contribution from SMTP connection caching is also negligible.
556</p>
557
558<p> In the last example, the old +/-1 feedback per delivery will
559defer 50% of the mail when confronted with an active (anvil-style)
560server concurrency limit, where the server hangs up immediately
561with a 421 status (a TCP-level RST would have the same result).
562Less aggressive feedback mechanisms fare better than more aggressive
563ones.  Concurrency-dependent feedback fares even better at higher
564concurrencies than shown here, but has limitations as discussed in
565the next section.  </p>
566
567<h3> <a name="concurrency_limitations"> Limitations of less-than-1 per delivery feedback </a> </h3>
568
569<p> Less-than-1 feedback is of interest primarily when sending large
570amounts of mail to destinations with active concurrency limiters
571(servers that reply with 421, or firewalls that send RST).  When
572sending small amounts of mail per destination, less-than-1 per-delivery
573feedback won't have a noticeable effect on the per-destination
574concurrency, because the number of deliveries to the same destination
575is too small. You might just as well use zero per-delivery feedback
576and stay with the initial per-destination concurrency. And when
577mail deliveries fail due to congestion instead of active concurrency
578limiters, the measurements above show that per-delivery feedback
579has no effect.  With large amounts of mail you might just as well
580use zero per-delivery feedback and start with the maximal per-destination
581concurrency.  </p>
582
583<p> The scheduler with less-than-1 concurrency
584feedback per delivery solves a problem with servers that have active
585concurrency limiters.  This works only because feedback is handled
586in a peculiar manner: positive feedback will increment the concurrency
587by 1 at the <b>end</b> of a sequence of events of length 1/feedback,
588while negative feedback will decrement concurrency by 1 at the
589<b>beginning</b> of such a sequence.  This is how Postfix adjusts
590quickly for overshoot without causing lots of mail to be deferred.
591Without this difference in feedback treatment, less-than-1 feedback
592per delivery would defer 50% of the mail, and would be no better
593in this respect than the old +/-1 feedback per delivery.  </p>
594
595<p> Unfortunately, the same feature that corrects quickly for
concurrency overshoot also makes the scheduler more sensitive to
597noisy negative feedback.  The reason is that one lonely negative
598feedback event has the same effect as a complete sequence of length
5991/feedback: in both cases delivery concurrency is dropped by 1
600immediately.  As a worst-case scenario, consider multiple servers
601behind a load balancer on a single IP address, and no backup MX
602address.  When 1 out of K servers fails to complete the SMTP handshake
603or drops the connection, a scheduler with 1/N (N = concurrency)
604feedback stops increasing its concurrency once it reaches a concurrency
605level of about K,  even though the good servers behind the load
606balancer are perfectly capable of handling more traffic. </p>
607
608<p> This noise problem gets worse as the amount of positive feedback
609per delivery gets smaller.  A compromise is to use fixed less-than-1
610positive feedback values instead of concurrency-dependent positive
611feedback.  For example, to tolerate 1 of 4 bad servers in the above
612load balancer scenario, use positive feedback of 1/4 per "good"
613delivery (no connect or handshake error), and use an equal or smaller
614amount of negative feedback per "bad" delivery.  The downside of
615using concurrency-independent feedback is that some of the old +/-1
616feedback problems will return at large concurrencies.  Sites that
617must deliver mail at non-trivial per-destination concurrencies will
618require special configuration.  </p>
619
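<p> As a concrete illustration of this compromise (a sketch only;
suitable values depend on the site), fixed per-delivery feedback for
the smtp transport could be configured with the parameters described
in the next section: </p>

<blockquote>
<pre>
# Tolerate 1 out of 4 "bad" servers behind a load balancer: fixed
# positive feedback of 1/4 per "good" delivery, and a smaller amount
# of negative feedback per "bad" delivery.
smtp_destination_concurrency_positive_feedback = 1/4
smtp_destination_concurrency_negative_feedback = 1/8
</pre>
</blockquote>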
620<h3> <a name="concurrency_config"> Concurrency configuration parameters </a> </h3>
621
622<p> The Postfix 2.5 concurrency scheduler is controlled with the
623following configuration parameters, where "<i>transport</i>_foo"
624provides a transport-specific parameter override.  All parameter
625default settings are compatible with earlier Postfix versions. </p>
626
627<blockquote>
628
629<table border="0">
630
631<tr> <th> Parameter name </th> <th> Postfix version </th> <th>
632Description </th> </tr>
633
634<tr> <td colspan="3"> <hr> </td> </tr>
635
636<tr> <td> initial_destination_concurrency<br>
637<i>transport</i>_initial_destination_concurrency </td> <td
638align="center"> all<br> 2.5 </td> <td> Initial per-destination
639delivery concurrency </td> </tr>
640
641<tr> <td> default_destination_concurrency_limit<br>
642<i>transport</i>_destination_concurrency_limit </td> <td align="center">
643all<br> all </td> <td> Maximum per-destination delivery concurrency
644</td> </tr>
645
646<tr> <td> default_destination_concurrency_positive_feedback<br>
647<i>transport</i>_destination_concurrency_positive_feedback </td>
648<td align="center"> 2.5<br> 2.5 </td> <td> Per-destination positive
649feedback amount, per delivery that does not fail with connection
650or handshake failure </td> </tr>
651
652<tr> <td> default_destination_concurrency_negative_feedback<br>
653<i>transport</i>_destination_concurrency_negative_feedback </td>
654<td align="center"> 2.5<br> 2.5 </td> <td> Per-destination negative
655feedback amount, per delivery that fails with connection or handshake
656failure </td> </tr>
657
658<tr> <td> default_destination_concurrency_failed_cohort_limit<br>
659<i>transport</i>_destination_concurrency_failed_cohort_limit </td>
660<td align="center"> 2.5<br> 2.5 </td> <td> Number of failed
661pseudo-cohorts after which a destination is declared "dead" and
662delivery is suspended </td> </tr>
663
664<tr> <td> destination_concurrency_feedback_debug</td> <td align="center">
6652.5 </td> <td> Enable verbose logging of concurrency scheduler
666activity </td> </tr>
667
668<tr> <td colspan="3"> <hr> </td> </tr>
669
670</table>
671
672</blockquote>
673
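<p> For example, a main.cf fragment that overrides these settings for
the "relay" transport mentioned in the disclaimer might look like the
following (a sketch; the values are made up for illustration): </p>

<blockquote>
<pre>
relay_initial_destination_concurrency = 2
relay_destination_concurrency_limit = 10
relay_destination_concurrency_positive_feedback = 1/N
relay_destination_concurrency_negative_feedback = 1/N
relay_destination_concurrency_failed_cohort_limit = 10
destination_concurrency_feedback_debug = yes
</pre>
</blockquote>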
674<h2> <a name="jobs"> Preemptive scheduling </a> </h2>
675
676<p>
677
678The following sections describe the new queue manager and its
679preemptive scheduler algorithm. Note that the document was originally
680written to describe the changes between the new queue manager (in
681this text referred to as <tt>nqmgr</tt>, the name it was known by
682before it became the default queue manager) and the old queue manager
683(referred to as <tt>oqmgr</tt>). This is why it refers to <tt>oqmgr</tt>
684every so often.
685
686</p>
687
688<p>
689
690This document is divided into sections as follows:
691
692</p>
693
694<ul>
695
<li> <a href="#nqmgr_structures"> The structures used by
nqmgr </a>

<li> <a href="#nqmgr_pickup"> What happens when nqmgr picks
up the message </a> - how it is assigned to transports, jobs, peers,
entries

<li> <a href="#nqmgr_selection"> How the entry selection
works </a>

<li> <a href="#nqmgr_preemption"> How the preemption
works </a> - what messages may be preempted and how and what messages
are chosen to preempt them

<li> <a href="#nqmgr_concurrency"> How destination concurrency
limits affect the scheduling algorithm </a>

<li> <a href="#nqmgr_memory"> Dealing with memory resource
limits </a>
715
716</ul>
717
<h3> <a name="nqmgr_structures"> The structures used by
nqmgr </a> </h3>
720
721<p>
722
723Let's start by recapitulating the structures and terms used when
724referring to the queue manager and how it operates. Many of these are
725partially described elsewhere, but it is nice to have a coherent
726overview in one place:
727
728</p>
729
730<ul>
731
<li> <p> Each message structure represents one mail message which
Postfix is to deliver. The message recipients specify to what
destinations the message is to be delivered and what transports are
going to be used for the delivery. </p>
736
737<li> <p> Each recipient entry groups a batch of recipients of one
738message which are all going to be delivered to the same destination
739(and over the same transport).
740</p>
741
<li> <p> Each transport structure groups everything that is going
to be delivered by the delivery agents dedicated to that transport.
Each transport maintains a set of queues (describing the destinations
it shall talk to) and jobs (referencing the messages it shall
deliver). </p>
747
<li> <p> Each transport queue (not to be confused with the on-disk
"active" queue or "incoming" queue) groups everything that is going to be
delivered to a given destination (aka nexthop) by its transport.  Each
queue belongs to one transport, so each destination may be referred
to by several queues, one for each transport.  Each queue maintains
a list of all recipient entries (batches of message recipients)
which shall be delivered to the given destination (the todo list), and
a list of recipient entries already being delivered by the delivery
agents (the busy list). </p>
757
758<li> <p> Each queue corresponds to multiple peer structures.  Each
759peer structure is like the queue structure, belonging to one transport
760and referencing one destination. The difference is that it lists
761only the recipient entries which all originate from the same message,
762unlike the queue structure, whose entries may originate from various
763messages. For messages with few recipients, there is usually just
764one recipient entry for each destination, resulting in one recipient
entry per peer. But for large mailing list messages the recipients
may need to be split into multiple recipient entries, in which case
the peer structure may list many entries for a single destination.
768</p>
769
770<li> <p> Each transport job groups everything it takes to deliver
771one message via its transport. Each job represents one message
772within the context of the transport. The job belongs to one transport
773and message, so each message may have multiple jobs, one for each
774transport. The job groups all the peer structures, which describe
775the destinations the job's message has to be delivered to. </p>
776
777</ul>
778
779<p>
780
781The first four structures are common to both <tt>nqmgr</tt> and
782<tt>oqmgr</tt>, the latter two were introduced by <tt>nqmgr</tt>.
783
784</p>
785
786<p>
787
These terms are used extensively in the text below; feel free to
look back at the descriptions above whenever you feel you have lost
track of what is what.
791
792</p>
793
<h3> <a name="nqmgr_pickup"> What happens when nqmgr picks
up the message </a> </h3>
796
797<p>
798
799Whenever <tt>nqmgr</tt> moves a queue file into the "active" queue,
800the following happens: It reads all necessary information from the
801queue file as <tt>oqmgr</tt> does, and also reads as many recipients
802as possible - more on that later, for now let's just pretend it
803always reads all recipients.
804
805</p>
806
807<p>
808
809Then it resolves the recipients as <tt>oqmgr</tt> does, which
means obtaining an (address, nexthop, transport) triple for each
811recipient. For each triple, it finds the transport; if it does not
812exist yet, it instantiates it (unless it's dead). Within the
813transport, it finds the destination queue for the given nexthop; if it
814does not exist yet, it instantiates it (unless it's dead). The
triple is then bound to the given destination queue. This happens in
816qmgr_resolve() and is basically the same as in <tt>oqmgr</tt>.
817
818</p>
819
820<p>
821
822Then for each triple which was bound to some queue (and thus
823transport), the program finds the job which represents the message
824within that transport's context; if it does not exist yet, it
825instantiates it. Within the job, it finds the peer which represents
the bound destination queue within this job's context; if it does
827not exist yet, it instantiates it.  Finally, it stores the address
828from the resolved triple to the recipient entry which is appended
829to both the queue entry list and the peer entry list. The addresses
830for the same nexthop are batched in the entries up to the
831<i>transport</i>_destination_recipient_limit for that transport.
This happens in qmgr_message_assign(), and apart
from the fact that it operates with job and peer structures, it is basically the
same as in <tt>oqmgr</tt>.
835
836</p>
837
838<p>
839
840When the job is instantiated, it is enqueued on the transport's job
841list based on the time its message was picked up by <tt>nqmgr</tt>.
For the first batch of recipients this means it is appended to the end
843of the job list, but the ordering of the job list by the enqueue
844time is important as we will see shortly.
845
846</p>
847
848<p>
849
850[Now you should have a pretty good idea what the state of the
851<tt>nqmgr</tt> is after a couple of messages were picked up, and what the
852relation is between all those job, peer, queue and entry structures.]
853
854</p>
855
<h3> <a name="nqmgr_selection"> How the entry selection
works </a> </h3>
858
859<p>
860
861Having prepared all those above mentioned structures, the task of
862the <tt>nqmgr</tt>'s scheduler is to choose the recipient entries
one at a time and pass them to the delivery agent for the corresponding
864transport. Now how does this work?
865
866</p>
867
868<p>
869
870The first approximation of the new scheduling algorithm is like this:
871
872</p>
873
874<blockquote>
<pre>
foreach transport (round-robin-by-transport)
do
    if transport busy continue
    if transport process limit reached continue
    foreach transport's job (in the order of the transport's job list)
    do
        foreach job's peer (round-robin-by-destination)
            if peer-&gt;queue-&gt;concurrency &lt; peer-&gt;queue-&gt;window
                return next peer entry.
        done
    done
done
</pre>
889</blockquote>
890
891<p>
892
893Now what is the "order of the transport's job list"? As we know
894already, the job list is by default kept in the order the message
was picked up by the <tt>nqmgr</tt>. So by default we get round-robin
selection of the transports at the top level, and FIFO message delivery
within each transport. The round-robin of the peers by the
898destination is perhaps of little importance in most real-life cases
899(unless the <i>transport</i>_destination_recipient_limit is reached,
900in one job there
901is only one peer structure for each destination), but theoretically
902it makes sure that even within single jobs, destinations are treated
903fairly.
904
905</p>
906
907<p>
908
909[By now you should have a feeling you really know how the scheduler
910works, except for the preemption, under ideal conditions - that is,
911no recipient resource limits and no destination concurrency problems.]
912
913</p>
914
<h3> <a name="nqmgr_preemption"> How the preemption
works </a> </h3>
917
918<p>
919
920As you might perhaps expect by now, the transport's job list does
921not remain sorted by the job's message enqueue time all the time.
The coolest thing about <tt>nqmgr</tt> is not the simple FIFO
delivery, but that it is able to slip mail with few recipients
past mailing-list bulk mail.  This is what the job preemption
925is about - shuffling the jobs on the transport's job list to get
926the best message delivery rates. Now how is it achieved?
927
928</p>
929
930<p>
931
932First I have to tell you that there are in fact two job lists in
933each transport. One is the scheduler's job list, which the scheduler
934is free to play with, while the other one keeps the jobs always
935listed in the order of the enqueue time and is used for recipient
936pool management we will discuss later. For now, we will deal with
937the scheduler's job list only.
938
939</p>
940
941<p>
942
943So, we have the job list, which is first ordered by the time the
944jobs' messages were enqueued, oldest messages first, the most recently
945picked one at the end. For now, let's assume that there are no
946destination concurrency problems. Without preemption, we pick some
entry of the first (oldest) job on the queue, assign it to a delivery
agent, pick another one from the same job, assign it again, and so
on, until all the entries are used and the job is delivered. We
would then move on to the next job and so on. Now how do we
manage to sneak in some entries from the recently added jobs when
the first job on the job list belongs to a mailing-list message
with thousands of recipient entries?
954
955</p>
956
957<p>
958
959The <tt>nqmgr</tt>'s answer is that we can artificially "inflate"
960the delivery time of that first job by some constant for free - it
961is basically the same trick you might remember as "accumulation of
962potential" from the amortized complexity lessons. For example,
963instead of delivering the entries of the first job on the job list
964every time a delivery agent becomes available, we can do it only
965every second time. If you view the moments the delivery agent becomes
966available on a timeline as "delivery slots", then instead of using
967every delivery slot for the first job, we can use only every other
968slot, and still the overall delivery efficiency of the first job
969remains the same. So the delivery <tt>11112222</tt> becomes
970<tt>1.1.1.1.2.2.2.2</tt> (1 and 2 are the imaginary job numbers, .
971denotes the free slot). Now what do we do with free slots?
972
973</p>
974
975<p>
976
As you might have guessed, we will use them for sneaking in the mail
with few recipients. For example, if we have one four-recipient
mail followed by four one-recipient mails, the delivery sequence
(that is, the sequence in which the jobs are assigned to the
delivery slots) might look like this: <tt>12131415</tt>. Hmm, fine
for sneaking in the single-recipient mail, but how do we sneak in
mail with more than one recipient? Say we have one four-recipient
mail followed by two two-recipient mails?
985
986</p>
987
988<p>
989
990The simple answer would be to use delivery sequence <tt>12121313</tt>.
991But the problem is that this does not scale well. Imagine you have
992mail with a thousand recipients followed by mail with a hundred recipients.
It is tempting to suggest a delivery sequence like <tt>121212....</tt>,
but alas! Imagine another mail arrives, with say ten recipients.
995But there are no free slots anymore, so it can't slip by, not even
996if it had only one recipient.  It will be stuck until the
997hundred-recipient mail is delivered, which really sucks.
998
999</p>
1000
1001<p>
1002
1003So, it becomes obvious that while inflating the message to get
free slots is a great idea, one has to be really careful about how the
free slots are assigned, otherwise one might get cornered. So,
1006how does <tt>nqmgr</tt> really use the free slots?
1007
1008</p>
1009
1010<p>
1011
1012The key idea is that one does not have to generate the free slots
1013in a uniform way. The delivery sequence <tt>111...1</tt> is no
1014worse than <tt>1.1.1.1</tt>, in fact, it is even better as some
1015entries are in the first case selected earlier than in the second
1016case, and none is selected later! So it is possible to first
1017"accumulate" the free delivery slots and then use them all at once.
1018It is even possible to accumulate some, then use them, then accumulate
1019some more and use them again, as in <tt>11..1.1</tt> .
1020
1021</p>
1022
1023<p>
1024
1025Let's get back to the one hundred recipient example. We now know
that we could first accumulate one hundred free slots, and only
then preempt the first job and sneak the hundred-recipient
mail in. Applying the algorithm recursively, we see the
1029hundred recipient job can accumulate ten free delivery slots, and
1030then we could preempt it and sneak in the ten-recipient mail...
1031Wait wait wait! Could we? Aren't we overinflating the original one
1032thousand recipient mail?
1033
1034</p>
1035
1036<p>
1037
Well, despite the fact that it looks so at first glance, another trick will
1039allow us to answer "no, we are not!". If we had said that we will
1040inflate the delivery time twice at maximum, and then we consider
1041every other slot as a free slot, then we would overinflate in case
1042of the recursive preemption. BUT! The trick is that if we use only
1043every n-th slot as a free slot for n&gt;2, there is always some worst
1044inflation factor which we can guarantee not to be breached, even
1045if we apply the algorithm recursively. To be precise, if for every
k&gt;1 normally used slots we accumulate one free delivery slot, then
1047the inflation factor is not worse than k/(k-1) no matter how many
1048recursive preemptions happen. And it's not worse than (k+1)/k if
1049only non-recursive preemption happens. Now, having got through the
1050theory and the related math, let's see how <tt>nqmgr</tt> implements
1051this.
1052
1053</p>
1054
1055<p>
1056
Each job has a so-called "available delivery slot" counter. Each
transport has a <i>transport</i>_delivery_slot_cost parameter, which
defaults to the default_delivery_slot_cost parameter, which is 5
by default. This is the k from the paragraph above. Each time k
1061entries of the job are selected for delivery, this counter is
1062incremented by one. Once there are some slots accumulated, a job which
1063requires no more than that number of slots to be fully delivered
1064can preempt this job.
1065
1066</p>
1067
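<p> To put numbers on this: with the default delivery slot cost of 5,
a job with 100 recipient entries can earn at most 100/5 = 20 available
delivery slots, so only jobs that can be completely delivered within
20 slots can ever preempt it. And by the bound from the previous
section, its delivery time is inflated by a factor of at most 5/4 when
preemptions nest recursively, and at most 6/5 when they do not. </p>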
1068<p>
1069
1070[Well, the truth is, the counter is incremented every time an entry
1071is selected and it is divided by k when it is used.
1072But to understand, it's good enough to use
1073the above approximation of the truth.]
1074
1075</p>
1076
1077<p>
1078
1079OK, so now we know the conditions which must be satisfied so one
1080job can preempt another one. But what job gets preempted, how do
1081we choose what job preempts it if there are several valid candidates,
1082and when does all this exactly happen?
1083
1084</p>
1085
1086<p>
1087
1088The answer for the first part is simple. The job whose entry was
selected the last time is the so-called current job. Normally, it is
1090the first job on the scheduler's job list, but destination concurrency
1091limits may change this as we will see later. It is always only the
1092current job which may get preempted.
1093
1094</p>
1095
1096<p>
1097
Now for the second part. The current job has a certain number of
recipient entries, and as such may accumulate at most a certain number
of available delivery slots. It might have already accumulated some,
1101and perhaps even already used some when it was preempted before
1102(remember a job can be preempted several times). In either case,
1103we know how many are accumulated and how many are left to deliver,
1104so we know how many it may yet accumulate at maximum. Every other
1105job which may be delivered by less than that number of slots is a
1106valid candidate for preemption. How do we choose among them?
1107
1108</p>
1109
1110<p>
1111
1112The answer is - the one with maximum enqueue_time/recipient_entry_count.
1113That is, the older the job is, the more we should try to deliver
1114it in order to get best message delivery rates. These rates are of
1115course subject to how many recipients the message has, therefore
1116the division by the recipient (entry) count. No one shall be surprised
1117that a message with n recipients takes n times longer to deliver than
1118a message with one recipient.
1119
1120</p>
1121
1122<p>
1123
1124Now let's recap the previous two paragraphs. Isn't it too complicated?
Why aren't the candidates chosen only from among the jobs which can be
1126delivered within the number of slots the current job already
1127accumulated? Why do we need to estimate how much it has yet to
1128accumulate? If you found out the answer, congratulate yourself. If
1129we did it this simple way, we would always choose the candidate
with the fewest recipient entries. If there were enough single-recipient
mails coming in, they would always slip by the bulk mail as soon
as possible, and mails with two or more recipients would never get
a chance, no matter how long they have been sitting around in the
1134job list.
1135
1136</p>
1137
1138<p>
1139
1140This candidate selection has an interesting implication - that when
1141we choose the best candidate for preemption (this is done in
1142qmgr_choose_candidate()), it may happen that we may not use it for
1143preemption immediately. This leads to an answer to the last part
1144of the original question - when does the preemption happen?
1145
1146</p>
1147
1148<p>
1149
The preemption attempt happens every time the transport's next recipient
entry is to be chosen for delivery. To avoid needless overhead, the
1152preemption is not attempted if the current job could never accumulate
1153more than <i>transport</i>_minimum_delivery_slots (defaults to
1154default_minimum_delivery_slots which defaults to 3). If there are
1155already enough accumulated slots to preempt the current job by the
1156chosen best candidate, it is done immediately. This basically means
that the candidate is moved in front of the current job on the
scheduler's job list and the accumulated slot counter is decreased
by the amount used by the candidate. If there are not enough slots...
well, I could say that nothing happens and another preemption
1161is attempted the next time. But that's not the complete truth.
1162
1163</p>
1164
1165<p>
1166
The truth is that it turns out that it is not really necessary to
wait until the job's counter accumulates all the delivery slots in
advance. Say we have a ten-recipient mail followed by two two-recipient
mails. If the preemption happened when enough delivery slots accumulate
(assuming slot cost 2), the delivery sequence becomes
<tt>11112211113311</tt>. Now what would we get if we waited
only for 50% of the necessary slots to accumulate, promising to
wait for the remaining 50% later, after we get back
to the preempted job? If we use such a slot loan, the delivery sequence
becomes <tt>11221111331111</tt>. As we can see, this makes it not
considerably worse for the delivery of the ten-recipient mail, but
1178it allows the small messages to be delivered sooner.
1179
1180</p>
1181
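<p> The following Python sketch (not Postfix code) of this bookkeeping
is greatly simplified - no slot loans, no blockers, no concurrency
limits, and candidate selection reduced to "first job in enqueue order
that fits" - but it reproduces the first (no-loan) delivery sequence
from the example above: </p>

<blockquote>
<pre>
def schedule(jobs, slot_cost):
    """jobs: list of (name, recipient_entry_count) pairs in enqueue order."""
    remaining = {name: count for name, count in jobs}
    slots = {name: 0 for name, _ in jobs}       # available delivery slots
    delivered = {name: 0 for name, _ in jobs}
    order = [name for name, _ in jobs]          # the scheduler's job list
    sequence = []
    while order:
        current = order[0]
        # Preemption attempt: a later job that can be fully delivered
        # within the slots the current job has accumulated jumps in
        # front of it, and the slots it uses are subtracted.
        for cand in order[1:]:
            if remaining[cand] &lt;= slots[current]:
                slots[current] -= remaining[cand]
                order.remove(cand)
                order.insert(0, cand)
                current = cand
                break
        # Deliver one recipient entry of the current job; every
        # slot_cost entries earn the job one more delivery slot.
        sequence.append(current)
        remaining[current] -= 1
        delivered[current] += 1
        if delivered[current] % slot_cost == 0:
            slots[current] += 1
        if remaining[current] == 0:
            order.remove(current)
    return "".join(sequence)

# A ten-recipient job "1" followed by two two-recipient jobs, slot cost 2:
print(schedule([("1", 10), ("2", 2), ("3", 2)], slot_cost=2))
# prints 11112211113311
</pre>
</blockquote>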
1182<p>
1183
1184The concept of these slot loans is where the
1185<i>transport</i>_delivery_slot_discount and
1186<i>transport</i>_delivery_slot_loan come from (they default to
1187default_delivery_slot_discount and default_delivery_slot_loan, whose
1188values are by default 50 and 3, respectively). The discount (resp.
1189loan) specifies how many percent (resp. how many slots) one "gets
1190in advance", when the number of slots required to deliver the best
candidate is compared with the number of slots the current job had
1192accumulated so far.
1193
1194</p>
1195
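<p> In main.cf terms, the preemption knobs discussed in this chapter
and their built-in defaults are as follows (per-transport overrides use
the usual <i>transport</i>_ prefix; the smtp line is just an
illustration): </p>

<blockquote>
<pre>
default_delivery_slot_cost = 5
default_minimum_delivery_slots = 3
default_delivery_slot_discount = 50
default_delivery_slot_loan = 3
# Example override: make preemption cheaper for the smtp transport only.
smtp_delivery_slot_cost = 2
</pre>
</blockquote>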
1196<p>
1197
1198And that pretty much concludes this chapter.
1199
1200</p>
1201
1202<p>
1203
1204[Now you should have a feeling that you pretty much understand the
1205scheduler and the preemption, or at least that you will have
1206after you read the last chapter a couple more times. You shall clearly
1207see the job list and the preemption happening at its head, in ideal
1208delivery conditions. The feeling of understanding shall last until
1209you start wondering what happens if some of the jobs are blocked,
1210which you might eventually figure out correctly from what had been
1211said already. But I would be surprised if your mental image of the
1212scheduler's functionality is not completely shattered once you
1213start wondering how it works when not all recipients may be read
1214in-core.  More on that later.]
1215
1216</p>
1217
<h3> <a name="nqmgr_concurrency"> How destination concurrency
limits affect the scheduling algorithm </a> </h3>
1220
1221<p>
1222
1223The <tt>nqmgr</tt> uses the same algorithm for destination concurrency
1224control as <tt>oqmgr</tt>. Now what happens when the destination
1225limits are reached and no more entries for that destination may be
1226selected by the scheduler?
1227
1228</p>
1229
1230<p>
1231
1232From the user's point of view it is all simple. If some of the peers
1233of a job can't be selected, those peers are simply skipped by the
1234entry selection algorithm (the pseudo-code described before) and
1235only the selectable ones are used. If none of the peers may be
1236selected, the job is declared a "blocker job". Blocker jobs are
1237skipped by the entry selection algorithm and they are also excluded
1238from the candidates for preemption of the current job. Thus the scheduler
1239effectively behaves as if the blocker jobs didn't exist on the job
1240list at all. As soon as at least one of the peers of a blocker job
1241becomes unblocked (that is, the delivery agent handling the delivery
1242of the recipient entry for the given destination successfully finishes),
1243the job's blocker status is removed and the job again participates
1244in all further scheduler actions normally.
1245
1246</p>
1247
1248<p>
1249
1250So the summary is that the users don't really have to be concerned
about the interaction of the destination limits and the scheduling
algorithm. It works well on its own and there are no knobs
needed to control it.
1254
1255</p>
1256
1257<p>
1258
1259From a programmer's point of view, the blocker jobs complicate the
1260scheduler quite a lot. Without them, the jobs on the job list would
1261be normally delivered in strict FIFO order. If the current job is
1262preempted, the job preempting it is completely delivered unless it
1263is preempted itself. Without blockers, the current job is thus
1264always either the first job on the job list, or the top of the stack
1265of jobs preempting the first job on the job list.
1266
1267</p>
1268
1269<p>
1270
1271The visualization of the job list and the preemption stack without
1272blockers would be like this:
1273
1274</p>
1275
1276<blockquote>
<pre>
first job-&gt;    1--2--3--5--6--8--...    &lt;- job list
on job list    |
               4    &lt;- preemption stack
               |
current job-&gt;  7
</pre>
1284</blockquote>
1285
1286<p>
1287
1288In the example above we see that job 1 was preempted by job 4 and
1289then job 4 was preempted by job 7. After job 7 is completed, remaining
1290entries of job 4 are selected, and once they are all selected, job
12911 continues.
1292
1293</p>
1294
1295<p>
1296
1297As we see, it's all very clean and straightforward. Now how does
1298this change because of blockers?
1299
1300</p>
1301
1302<p>
1303
1304The answer is: a lot. Any job may become a blocker job at any time,
1305and also become a normal job again at any time. This has several
1306important implications:
1307
1308</p>
1309
1310<ol>
1311
1312<li> <p>
1313
1314The jobs may be completed in arbitrary order. For example, in the
1315example above, if the current job 7 becomes blocked, the next job
13164 may complete before the job 7 becomes unblocked again. Or if both
13177 and 4 are blocked, then 1 is completed, then 7 becomes unblocked
1318and is completed, then 2 is completed and only after that 4 becomes
1319unblocked and is completed... You get the idea.
1320
1321</p>
1322
1323<p>
1324
1325[Interesting side note: even when jobs are delivered out of order,
1326from a single destination's point of view the jobs are still delivered
1327in the expected order (that is, FIFO unless there was some preemption
1328involved). This is because whenever a destination queue becomes
1329unblocked (the destination limit allows selection of more recipient
1330entries for that destination), all jobs which have peers for that
1331destination are unblocked at once.]
1332
1333</p>

<li> <p>

The idea of the preemption stack at the head of the job list is
gone. That is, it must be possible to preempt any job on the job
list. For example, if jobs 7, 4, 1 and 2 in the example above all
become blocked, job 3 becomes the current job. And of course we do
not want the preemption to be affected by whether or not some jobs
happen to be blocked. Therefore, if it turns out that job 3 might
be preempted by job 6, the implementation must make that possible.

</p>

<li> <p>

The idea of the linear preemption stack itself is gone. It's no
longer true that a job is always directly preempted by only one job
at a time (not counting the recursively nested jobs). In the example
above, job 1 is directly preempted only by job 4, and job 4 by job
7. Now assume job 7 becomes blocked while job 4 is being delivered.
If job 4 accumulates enough delivery slots, it is natural that it
might be preempted, for example by job 8. Now job 4 is preempted by
both job 7 AND job 8 at the same time.

</p>

</ol>

<p>

Now combine points 2) and 3) with point 1) again and you realize
that the relations on the once linear job list have become pretty
complicated. If we extend the point 3) example: jobs 7 and 8 preempt
job 4, then job 8 becomes blocked too, and then job 4 completes.
Tricky, huh?

</p>

<p>

If I illustrate the relations after the above-mentioned examples
(excluding those in point 1), the situation looks like this:

</p>

<blockquote>
<pre>
                            v- parent

adoptive parent -&gt;    1--2--3--5--...      &lt;- "stack" level 0
                      |     |
parent gone -&gt;        ?     6              &lt;- "stack" level 1
                     / \
children -&gt;         7   8   ^- child       &lt;- "stack" level 2

                      ^- siblings
</pre>
</blockquote>

<p>

Now how does <tt>nqmgr</tt> deal with all these complicated relations?

</p>

<p>

Well, it maintains them all as described, but fortunately, all these
relations are necessary only for the purpose of proper counting of
available delivery slots. For the purpose of ordering the jobs for
entry selection, the original rule still applies: "the job preempting
the current job is moved in front of the current job on the job
list". So for entry selection purposes, the job relations remain
as simple as this:

</p>

<blockquote>
<pre>
7--8--1--2--6--3--5--..   &lt;- scheduler's job list order
</pre>
</blockquote>

<p>

The job list order and the preemption parent/child/siblings relations
are maintained separately. And because the selection works only
with the job list, you can happily forget about those complicated
relations unless you want to study the <tt>nqmgr</tt> sources. In
that case the text above might provide some helpful introduction
to the problem domain. Otherwise I suggest you just forget about
all this and stick with the user's point of view: the blocker jobs
are simply ignored.

</p>
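
<p>

The "move in front of the current job" rule is simple enough to
sketch directly. The following Python fragment is an illustrative
model only (the real nqmgr uses linked lists in C); under these
simplifying assumptions it reproduces the job list order shown above:

</p>

<blockquote>
<pre>
# Hypothetical model of job list order maintenance on preemption.
def preempt(job_list, current, preemptor):
    # the preempting job is moved in front of the current job
    if preemptor in job_list:
        job_list.remove(preemptor)
    job_list.insert(job_list.index(current), preemptor)

jobs = [1, 2, 6, 3, 5]      # a simplified scheduler's job list
preempt(jobs, 1, 8)         # job 8 preempts the current job 1
preempt(jobs, 8, 7)         # job 7 preempts the new current job 8
print(jobs)                 # [7, 8, 1, 2, 6, 3, 5]
</pre>
</blockquote>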

<p>

[By now, you should have a feeling that there are more things going
on under the hood than you ever wanted to know. You decide that
forgetting about this chapter is the best you can do for the sake
of your mind's health, and you basically stick with the idea of how
the scheduler works in ideal conditions, when there are no blockers,
which is good enough.]

</p>

<h3> <a name="nqmgr_memory"> Dealing with memory resource
limits </a> </h3>

<p>

When discussing the <tt>nqmgr</tt> scheduler, we have so far assumed
that all recipients of all messages in the "active" queue are completely
read into memory. This is simply not true. There is an upper
bound on the amount of memory the <tt>nqmgr</tt> may use, and
therefore it must impose some limits on the information it may store
in memory at any given time.

</p>

<p>

First of all, not all messages may be read in-core at once. At any
time, at most qmgr_message_active_limit messages may be read in-core.
When read into memory, messages are picked from the "incoming" and
"deferred" queues and moved to the "active" queue (incoming having
priority), so if there are more messages queued than
qmgr_message_active_limit permits, the rest will have to wait until
(some of) the messages in the "active" queue are completely delivered
(or deferred).

</p>
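
<p>

A minimal sketch of this admission policy, assuming hypothetical
queue objects (qmgr_message_active_limit is the real parameter;
everything else is illustrative):

</p>

<blockquote>
<pre>
# Hypothetical sketch of filling the "active" queue.
def fill_active_queue(incoming, deferred, active, active_limit):
    while len(active) &lt; active_limit:
        if incoming:                      # incoming has priority
            active.append(incoming.pop(0))
        elif deferred:
            active.append(deferred.pop(0))
        else:
            break                         # nothing left to admit
</pre>
</blockquote>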

<p>

Even with the limited number of in-core messages, there is another
limit which must be imposed in order to avoid memory exhaustion.
Each message may contain a huge number of recipients (tens or hundreds
of thousands are not uncommon), so if <tt>nqmgr</tt> were to read all
recipients of all messages in the "active" queue, it might easily run
out of memory. Therefore there must be some upper bound on the
number of message recipients which are read into memory at the
same time.

</p>

<p>

Before discussing how exactly <tt>nqmgr</tt> implements the recipient
limits, let's see how the mere existence of these limits affects
the <tt>nqmgr</tt> and its scheduler.

</p>

<p>

The message limit is straightforward - it just limits the size of
the lookahead the <tt>nqmgr</tt>'s scheduler has when choosing which
message can preempt the current one. Messages not in the "active"
queue are simply not considered at all.

</p>

<p>

The recipient limit complicates things considerably more. First of
all, the message reading code must support reading the recipients
in batches, which among other things means accessing the queue file
several times and continuing where the last recipient batch ended.
This is invoked by the scheduler whenever the current job has space
for more recipients, subject to the transport's refill_limit and
refill_delay parameters. It is also done whenever all in-core
recipients of the message have been dealt with (which may also mean
they were deferred) but there are still more in the queue file.

</p>
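
<p>

A hedged sketch of when another batch might be requested
(refill_limit and refill_delay are real per-transport parameters;
the job attributes and the exact trigger logic are simplifying
assumptions, not the actual nqmgr conditions):

</p>

<blockquote>
<pre>
# Hypothetical refill trigger.
def may_refill(job, transport, now):
    if job.unread_recipients == 0:
        return False                 # nothing left to read
    if job.in_core_recipients == 0:
        return True                  # all in-core entries dealt with
    enough_room = job.free_recipient_slots &gt;= transport.refill_limit
    waited = now - job.last_refill_time &gt;= transport.refill_delay
    return enough_room and waited
</pre>
</blockquote>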

<p>

The second complication is that with some recipients left unread
in the queue file, the scheduler can't operate with exact counts
of recipient entries. With unread recipients, it is not clear how
many recipient entries there will be, as they are subject to
per-destination grouping. It is not even clear to what transports
(and thus jobs) the recipients will be assigned. And with messages
coming from the "deferred" queue, it is not even clear how many unread
recipients are still to be delivered. This all means that the
scheduler must work with mere estimates of how many recipient entries
there will be. Fortunately, it is possible to estimate the minimum
and maximum correctly, so the scheduler can always err on the safe
side. Obviously, the better the estimates, the better the results, so
it is best when we are able to read all recipients in-core and turn
the estimates into exact counts, or at least try to read as many
as possible to make the estimates as accurate as possible.

</p>

<p>

The third complication is that it is no longer true that the scheduler
is done with a job once all of its in-core recipients are delivered.
It is possible that the job will be revived later, when another
batch of recipients is read in core. It is also possible that some
jobs will be created for the first time long after the first batch
of recipients was read in core. The <tt>nqmgr</tt> code must be
ready to handle all such situations.

</p>

<p>

And finally, the fourth complication is that the <tt>nqmgr</tt>
code must somehow impose the recipient limit itself. How does it
achieve that?

</p>

<p>

Perhaps the easiest solution would be to say that each message may
have at most X recipients stored in-core, but such a solution would
be poor for several reasons. With reasonable qmgr_message_active_limit
values, the X would have to be quite low to maintain a reasonable
memory footprint, and with a low X lots of things would not work well.
The <tt>nqmgr</tt> would have problems using the
<i>transport</i>_destination_recipient_limit efficiently. The
scheduler's preemption would be suboptimal as the recipient count
estimates would be inaccurate. The message queue file would have
to be accessed many times to read in more recipients again and
again.

</p>

<p>

Therefore it seems reasonable to have a solution which does not use
a limit imposed on a per-message basis, but which maintains a pool
of available recipient slots that can be shared among all messages
in the most efficient manner. And as we want to prevent separate
transports from competing for resources wherever possible, it seems
appropriate to maintain such a recipient pool for each transport
separately. This is the general idea; now how does it work in
practice?

</p>

<p>

First we have to solve a little chicken-and-egg problem. If we want
to use the per-transport recipient pools, we first need to know to
what transport(s) the message is assigned. But we will find that
out only after we first read in the recipients. So it is obvious
that we first have to read in some recipients, use them to find out
to what transports the message is to be assigned, and only after
that can we use the per-transport recipient pools.

</p>

<p>

Now how many recipients shall we read the first time? This is
what the qmgr_message_recipient_minimum and qmgr_message_recipient_limit
values control. The qmgr_message_recipient_minimum value specifies
how many recipients of each message we will read the first time,
no matter what. It is necessary to read at least one recipient
before we can assign the message to a transport and create the first
job. However, reading only qmgr_message_recipient_minimum recipients
even when there are only a few messages with few recipients in-core
would be wasteful. Therefore, if there are fewer than
qmgr_message_recipient_limit recipients in-core so far, the first
batch of recipients may be larger than qmgr_message_recipient_minimum
- as large as is required to reach qmgr_message_recipient_limit.

</p>
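
<p>

In other words, the first batch size could be computed roughly as
follows (a sketch under the stated assumptions; only the two
qmgr_message_recipient_* parameter names are real):

</p>

<blockquote>
<pre>
# Hypothetical first-batch sizing.
def first_batch_size(in_core_total, minimum, limit):
    # always read at least qmgr_message_recipient_minimum recipients,
    # and top up to qmgr_message_recipient_limit while the global
    # in-core total is still below that limit
    return max(minimum, limit - in_core_total)
</pre>
</blockquote>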

<p>

Once the first batch of recipients has been read in core and the
message jobs have been created, the size of the subsequent recipient
batches (if any - of course it's best when all recipients are read
in one batch) is based solely on the position of the message jobs
on their corresponding transports' job lists. Each transport has a
pool of <i>transport</i>_recipient_limit recipient slots which it can
distribute among its jobs (how this is done is described later).
The subsequent recipient batch may be as large as the sum of all
recipient slots of all jobs of the message permits (plus the
qmgr_message_recipient_minimum amount which always applies).

</p>

<p>

For example, if a message has three jobs, the first with 1 recipient
still in-core and 4 recipient slots, the second with 5 recipients
in-core and 5 recipient slots, and the third with 2 recipients in-core
and 0 recipient slots, it has 1+5+2=8 recipients in-core and 4+5+0=9
jobs' recipient slots in total. This means that we could immediately
read 2+qmgr_message_recipient_minimum more recipients of that message
in core.

</p>

<p>

The above example illustrates several things which might be worth
mentioning explicitly: first, note that although the per-transport
slots are assigned to particular jobs, we can't guarantee that once
the next batch of recipients is read in core, the corresponding
amounts of recipients will be assigned to those jobs. The jobs lend
their slots to the message as a whole, so it is possible that some
jobs end up sponsoring other jobs of their message. For example,
if in the example above the 2 newly read recipients were assigned
to the second job, the first job would have sponsored the second
job with 2 slots. The second notable thing is the third job, which
has more recipients in-core than it has slots. Apart from the
sponsoring by another job which we just saw, this can also be the
result of the first recipient batch, which is sponsored from the
global recipient pool of qmgr_message_recipient_limit recipients.
It can likewise be sponsored from the message recipient pool of
qmgr_message_recipient_minimum recipients.

</p>

<p>

Now how does each transport distribute the recipient slots among
its jobs? The strategy is quite simple. As most scheduler activity
happens at the head of the job list, it is our intention to make
sure that the scheduler has the best estimates of the recipient
counts for those jobs. As we mentioned above, this means that we
want to try to make sure that the messages of those jobs have all
recipients read in-core. Therefore the transport distributes the
slots "along" the job list, from start to end. In this case the job
list sorted by message enqueue time is used, because it doesn't
change over time as the scheduler's job list does.

</p>

<p>

More specifically, each time a job is created and appended to the
job list, it gets all unused recipient slots from its transport's
pool. It keeps them until all recipients of its message are read.
When this happens, all unused recipient slots are transferred to
the next job (which is now in fact the first such job) on the job
list which still has some recipients unread, or eventually back to
the transport pool if there is no such job. Such a transfer also
happens whenever a recipient entry of that job is delivered.

</p>
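
<p>

A sketch of this slot hand-off, again with illustrative names only,
and with the by-enqueue-time job list passed in explicitly:

</p>

<blockquote>
<pre>
# Hypothetical transfer of unused recipient slots "along" the list.
def release_unused_slots(job, transport, jobs_by_enqueue_time):
    slots = job.unused_recipient_slots
    job.unused_recipient_slots = 0
    for next_job in jobs_by_enqueue_time:
        if next_job.unread_recipients &gt; 0:
            next_job.unused_recipient_slots += slots
            return                      # first unread job gets them all
    transport.recipient_slot_pool += slots  # no unread job remains
</pre>
</blockquote>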

<p>

There is also a scenario in which a job is not appended to the end of
the job list (for example, it was created as a result of a second or
later recipient batch). Then it works exactly as above, except that
if it was put in front of the first unread job (that is, the job
of a message which still has some unread recipients in the queue file),
that job is first forced to return all of its unused recipient slots
to the transport pool.

</p>

<p>

The algorithm just described leads to the following state: the first
unread job on the job list always gets all the remaining recipient
slots of that transport (if there are any). The jobs queued before
this job are completely read (that is, all recipients of their
message have already been read in core) and have at most as many
slots as they still have recipients in-core (the maximum is there
because of the sponsoring mentioned before). The jobs after this job
get nothing from the transport recipient pool (unless they got
something before and the first unread job was only created and
enqueued in front of them later - in such a case, they too get at
most as many slots as they have recipients in-core).

</p>

<p>

In such a state, things work fine most of the time, because the
current job is either completely read in-core or holds as many
recipient slots as there are. However, there is one situation which
we still have to take care of specially. Imagine that the current job
is preempted by some unread job from the job list and there are no
more recipient slots available, so this new current job could read
only batches of qmgr_message_recipient_minimum recipients at a time.
This would really degrade performance. For this reason, each transport
has an extra pool of <i>transport</i>_extra_recipient_limit recipient
slots, dedicated exactly to this situation. Each time an unread
job preempts the current job, it gets half of the remaining recipient
slots from the normal pool and from this extra pool.

</p>
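
<p>

A sketch of that preemption bonus (the <i>transport</i>_extra_recipient_limit
parameter is real; the pool attributes and the integer halving are
assumptions made for illustration):

</p>

<blockquote>
<pre>
# Hypothetical preemption bonus for an unread preempting job.
def grant_preemption_slots(preemptor, transport):
    normal = transport.recipient_slot_pool // 2
    extra = transport.extra_recipient_slot_pool // 2
    transport.recipient_slot_pool -= normal
    transport.extra_recipient_slot_pool -= extra
    preemptor.unused_recipient_slots += normal + extra
</pre>
</blockquote>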

<p>

And that's it. It sure does sound pretty complicated, but fortunately
most people don't really have to care exactly how it works, as long
as it works. Perhaps the only important things to know for most
people are the following upper bound formulas:

</p>

<p>

Each transport has at most

</p>

<blockquote>
<pre>
max(
qmgr_message_recipient_minimum * qmgr_message_active_limit
+ *_recipient_limit + *_extra_recipient_limit,
qmgr_message_recipient_limit
)
</pre>
</blockquote>

<p>

recipients in core.

</p>

<p>

The total number of recipients in core is at most

</p>

<blockquote>
<pre>
max(
qmgr_message_recipient_minimum * qmgr_message_active_limit
+ sum( *_recipient_limit + *_extra_recipient_limit ),
qmgr_message_recipient_limit
)
</pre>
</blockquote>

<p>

where the sum is taken over all used transports.

</p>
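
<p>

For a feel of the magnitudes involved, here is a purely illustrative
computation of the per-transport bound, using values loosely based
on common defaults (check "postconf -d" on your system; none of this
is a tuning recommendation):

</p>

<blockquote>
<pre>
# Illustrative arithmetic only; the parameter values are assumptions.
qmgr_message_recipient_minimum = 100
qmgr_message_active_limit = 20000
qmgr_message_recipient_limit = 20000
smtp_recipient_limit = 20000          # a *_recipient_limit example
smtp_extra_recipient_limit = 1000     # a *_extra_recipient_limit example

bound = max(qmgr_message_recipient_minimum * qmgr_message_active_limit
            + smtp_recipient_limit + smtp_extra_recipient_limit,
            qmgr_message_recipient_limit)
print(bound)    # 2021000 recipients in core, worst case
</pre>
</blockquote>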

<p>

And this terribly complicated chapter concludes the documentation
of the <tt>nqmgr</tt> scheduler.

</p>

<p>

[By now you should theoretically know the <tt>nqmgr</tt> scheduler
inside out. In practice, you still hope that you will never have
to really understand the last chapter or two completely, and
fortunately most people really won't. Understanding how the scheduler
works in ideal conditions is more than good enough for the vast
majority of users.]

</p>

<h2> <a name="credits"> Credits </a> </h2>

<ul>

<li> Wietse Venema designed and implemented the initial queue manager
with per-domain FIFO scheduling, and per-delivery +/-1 concurrency
feedback.

<li> Patrik Rak designed and implemented preemption where mail with
fewer recipients can slip past mail with more recipients in a
controlled manner, and wrote up its documentation.

<li> Wietse Venema initiated a discussion with Patrik Rak and Victor
Duchovni on alternatives for the +/-1 feedback scheduler's aggressive
behavior. This is when K/N feedback was reviewed (N = concurrency).
The discussion ended without a good solution for both negative
feedback and dead site detection.

<li> Victor Duchovni resumed work on concurrency feedback in the
context of concurrency-limited servers.

<li> Wietse Venema then re-designed the concurrency scheduler in
terms of the simplest possible concepts: less-than-1 concurrency
feedback per delivery, forward and reverse concurrency feedback
hysteresis, and pseudo-cohort failure. At this same time, concurrency
feedback was separated from dead site detection.

<li> These simplifications, and their modular implementation, helped
to develop further insights into the different roles that positive
and negative concurrency feedback play, and helped to identify some
worst-case scenarios.

</ul>

</body>

</html>