xref: /plan9-contrib/sys/doc/sleep.ms (revision 219b2ee8daee37f4aad58d63f21287faa8e4ffdc)
1.TL
2Process Sleep and Wakeup on a Shared-memory Multiprocessor
3.AU
4Rob Pike
5Dave Presotto
6Ken Thompson
7Gerard Holzmann
8.sp
9rob,presotto,ken,gerard@plan9.att.com
10.AB
11The problem of enabling a `sleeping' process on a shared-memory multiprocessor
12is a difficult one, especially if the process is to be awakened by an interrupt-time
13event.  We present here the code
14for sleep and wakeup primitives that we use in our multiprocessor system.
15The code has been exercised by years of active use and by a verification
16system.
17.AE
18.LP
19Our problem is to synchronise processes on a symmetric shared-memory multiprocessor.
20Processes suspend execution, or
21.I sleep,
22while awaiting an enabling event such as an I/O interrupt.
23When the event occurs, the process is issued a
24.I wakeup
25to resume its execution.
26During these events, other processes may be running and other interrupts
27occurring on other processors.
28.LP
29More specifically, we wish to implement subroutines called
30.CW sleep ,
31callable by a process to relinquish control of its current processor,
32and
33.CW wakeup ,
34callable by another process or an interrupt to resume the execution
35of a suspended process.
36The calling conventions of these subroutines will remain unspecified
37for the moment.
38.LP
39We assume the processors have an atomic test-and-set or equivalent
40operation but no other synchronisation method.  Also, we assume interrupts
41can occur on any processor at any time, except on a processor that has
42locally inhibited them.
43.LP
44The problem is the generalisation to a multiprocessor of a familiar
45and well-understood uniprocessor problem.  It may be reduced to a
46uniprocessor problem by using a global test-and-set to serialise the
47sleeps and wakeups,
48which is equivalent to synchronising through a monitor.
49For performance and cleanliness, however,
50we prefer to allow the interrupt handling and process control to be multiprocessed.
51.LP
52Our attempts to solve the sleep/wakeup problem in Plan 9
53[Pik90]
54prompted this paper.
55We implemented solutions several times over several months and each
56time convinced ourselves \(em wrongly \(em they were correct.
57Multiprocessor algorithms can be
58difficult to prove correct by inspection and formal reasoning about them
59is impractical.  We finally developed an algorithm we trust by
60verifying our code using an
61empirical testing tool.
62We present that code here, along with some comments about the process by
63which it was designed.
64.SH
65History
66.LP
67Since processes in Plan 9 and the UNIX
68system have similar structure and properties, one might ask if
69UNIX
70.CW sleep
71and
72.CW wakeup
73[Bac86]
74could not easily be adapted from their standard uniprocessor implementation
75to our multiprocessor needs.
76The short answer is, no.
77.LP
78The
79UNIX
80routines
81take as argument a single global address
82that serves as a unique
83identifier to connect the wakeup with the appropriate process or processes.
84This has several inherent disadvantages.
85From the point of view of
86.CW sleep
87and
88.CW wakeup ,
89it is difficult to associate a data structure with an arbitrary address;
90the routines are unable to maintain a state variable recording the
91status of the event and processes.
92(The reverse is of course easy \(em we could
93require the address to point to a special data structure \(em
94but we are investigating
95UNIX
96.CW sleep
97and
98.CW wakeup ,
99not the code that calls them.)
100Also, multiple processes sleep `on' a given address, so
101.CW wakeup
102must enable them all, and let process scheduling determine which process
103actually benefits from the event.
104This is inefficient;
105a queueing mechanism would be preferable
106but, again, it is difficult to associate a queue with a general address.
107Moreover, the lack of state means that
108.CW sleep
109and
110.CW wakeup
111cannot know what the corresponding process (or interrupt) is doing;
112.CW sleep
113and
114.CW wakeup
115must be executed atomically.
116On a uniprocessor it suffices to disable interrupts during their
117execution.
118On a multiprocessor, however,
119most processors
120can inhibit interrupts only on the current processor,
121so while a process is executing
122.CW sleep
123the desired interrupt can come and go on another processor.
124If the wakeup is to be issued by another process, the problem is even harder.
125Some inter-process mutual exclusion mechanism must be used,
126which, yet again, is difficult to do without a way to communicate state.
127.LP
128In summary, to be useful on a multiprocessor,
129UNIX
130.CW sleep
131and
132.CW wakeup
133must either be made to run atomically on a single
134processor (such as by using a monitor)
135or they need a richer model for their communication.
136.SH
137The design
138.LP
139Consider the case of an interrupt waking up a sleeping process.
140(The other case, a process awakening a second process, is easier because
141atomicity can be achieved using an interlock.)
142The sleeping process is waiting for some event to occur, which may be
143modeled by a condition coming true.
144The condition could be just that the event has happened, or something
145more subtle such as a queue draining below some low-water mark.
146We represent the condition by a function of one
147argument of type
148.CW void* ;
149the code supporting the device generating the interrupts
150provides such a function to be used by
151.CW sleep
152and
153.CW wakeup
154to synchronise.  The function returns
155.CW false
156if the event has not occurred, and
157.CW true
158some time after the event has occurred.
159The
160.CW sleep
161and
162.CW wakeup
163routines must, of course, work correctly if the
164event occurs while the process is executing
165.CW sleep .
166.LP
167We assume that a particular call to
168.CW sleep
169corresponds to a particular call to
170.CW wakeup ,
171that is,
172at most one process is asleep waiting for a particular event.
173This can be guaranteed in the code that calls
174.CW sleep
175and
176.CW wakeup
177by appropriate interlocks.
178We also assume for the moment that there will be only one interrupt
179and that it may occur at any time, even before
180.CW sleep
181has been called.
182.LP
183For performance,
184we desire that multiple instances of
185.CW sleep
186and
187.CW wakeup
188may be running simultaneously on our multiprocessor.
189For example, a process calling
190.CW sleep
191to await a character from an input channel need not
192wait for another process to finish executing
193.CW sleep
194to await a disk block.
195At a finer level, we would like a process reading from one input channel
196to be able to execute
197.CW sleep
198in parallel with a process reading from another input channel.
199A standard approach to synchronisation is to interlock the channel `driver'
200so that only one process may be executing in the channel code at once.
201This method is clearly inadequate for our purposes; we need
202fine-grained synchronisation, and in particular to apply
203interlocks at the level of individual channels rather than at the level
204of the channel driver.
205.LP
206Our approach is to use an object called a
207.I rendezvous ,
208which is a data structure through which
209.CW sleep
210and
211.CW wakeup
212synchronise.
213(The similarly named construct in Ada is a control structure;
214ours is an unrelated data structure.)
215A rendezvous
216is allocated for each active source of events:
217one for each I/O channel,
218one for each end of a pipe, and so on.
219The rendezvous serves as an interlockable structure in which to record
220the state of the sleeping process, so that
221.CW sleep
222and
223.CW wakeup
224can communicate if the event happens before or while
225.CW sleep
226is executing.
227.LP
228Our design for
229.CW sleep
230is therefore a function
231.P1
232void sleep(Rendezvous *r, int (*condition)(void*), void *arg)
233.P2
234called by the sleeping process.
235The argument
236.CW r
237connects the call to
238.CW sleep
239with the call to
240.CW wakeup ,
241and is part of the data structure for the (say) device.
242The function
243.CW condition
244is described above;
245called with argument
246.CW arg ,
247it is used by
248.CW sleep
249to decide whether the event has occurred.
250.CW Wakeup
251has a simpler specification:
252.P1
253void wakeup(Rendezvous *r).
254.P2
255.CW Wakeup
256must be called after the condition has become true.
257.SH
258An implementation
259.LP
260The
261.CW Rendezvous
262data type is defined as
263.P1
264typedef struct{
265	Lock	l;
266	Proc	*p;
267}Rendezvous;
268.P2
269Our
270.CW Locks
271are test-and-set spin locks.
272The routine
273.CW lock(Lock\ *l)
274returns when the current process holds that lock;
275.CW unlock(Lock\ *l)
276releases the lock.
277.LP
278Here is our implementation of
279.CW sleep .
280Its details are discussed below.
281.CW Thisp
282is a pointer to the current process on the current processor.
283(Its value differs on each processor.)
284.P1
285void
286sleep(Rendezvous *r, int (*condition)(void*), void *arg)
287{
288	int s;
289
290	s = inhibit();		/* interrupts */
291	lock(&r->l);
292
293	/*
294	 * if condition happened, never mind
295	 */
296	if((*condition)(arg)){
297		unlock(&r->l);
298		allow();	/* interrupts */
299		return;
300	}
301
302	/*
303	 * now we are committed to
304	 * change state and call scheduler
305	 */
306	if(r->p)
307		error("double sleep %d %d", r->p->pid, thisp->pid);
308	thisp->state = Wakeme;
309	r->p = thisp;
310	unlock(&r->l);
311	allow(s);	/* interrupts */
312	sched();	/* relinquish CPU */
313}
314.P2
315.ne 3i
316Here is
317.CW wakeup.
318.P1
319void
320wakeup(Rendezvous *r)
321{
322	Proc *p;
323	int s;
324
325	s = inhibit();	/* interrupts; return old state */
326	lock(&r->l);
327	p = r->p;
328	if(p){
329		r->p = 0;
330		if(p->state != Wakeme)
331			panic("wakeup: not Wakeme");
332		ready(p);
333	}
334	unlock(&r->l);
335	if(s)
336		allow();
337}
338.P2
339.CW Sleep
340and
341.CW wakeup
342both begin by disabling interrupts
343and then locking the rendezvous structure.
344Because
345.CW wakeup
346may be called in an interrupt routine, the lock must be set only
347with interrupts disabled on the current processor,
348so that if the interrupt comes during
349.CW sleep
350it will occur only on a different processor;
351if it occurred on the processor executing
352.CW sleep ,
353the spin lock in
354.CW wakeup
355would hang forever.
356At the end of each routine, the lock is released and processor priority
357returned to its previous value.
358.CW Wakeup "" (
359needs to inhibit interrupts in case
360it is being called by a process;
361this is a no-op if called by an interrupt.)
362.LP
363.CW Sleep
364checks to see if the condition has become true, and returns if so.
365Otherwise the process posts its name in the rendezvous structure where
366.CW wakeup
367may find it, marks its state as waiting to be awakened
368(this is for error checking only) and goes to sleep by calling
369.CW sched() .
370The manipulation of the rendezvous structure is all done under the lock,
371and
372.CW wakeup
373only examines it under lock, so atomicity and mutual exclusion
374are guaranteed.
375.LP
376.CW Wakeup
377has a simpler job.  When it is called, the condition has implicitly become true,
378so it locks the rendezvous, sees if a process is waiting, and readies it to run.
379.SH
380Discussion
381.LP
382The synchronisation technique used here
383is similar to known methods, even as far back as Saltzer's thesis
384[Sal66].
385The code looks trivially correct in retrospect: all access to data structures is done
386under lock, and there is no place that things may get out of order.
387Nonetheless, it took us several iterations to arrive at the above
388implementation, because the things that
389.I can
390go wrong are often hard to see.  We had four earlier implementations
391that were examined at great length and only found faulty when a new,
392different style of device or activity was added to the system.
393.LP
394.ne 3i
395Here, for example, is an incorrect implementation of wakeup,
396closely related to one of our versions.
397.P1
398void
399wakeup(Rendezvous *r)
400{
401	Proc *p;
402	int s;
403
404	p = r->p;
405	if(p){
406		s = inhibit();
407		lock(&r->l);
408		r->p = 0;
409		if(p->state != Wakeme)
410			panic("wakeup: not Wakeme");
411		ready(p);
412		unlock(&r->l);
413		if(s)
414			allow();
415	}
416}
417.P2
418The mistake is that the reading of
419.CW r->p
420may occur just as the other process calls
421.CW sleep ,
422so when the interrupt examines the structure it sees no one to wake up,
423and the sleeping process misses its wakeup.
424We wrote the code this way because we reasoned that the fetch
425.CW p
426.CW =
427.CW r->p
428was inherently atomic and need not be interlocked.
429The bug was found by examination when a new, very fast device
430was added to the system and sleeps and interrupts were closely overlapped.
431However, it was in the system for a couple of months without causing an error.
432.LP
433How many errors lurk in our supposedly correct implementation above?
434We would like a way to guarantee correctness; formal proofs are beyond
435our abilities when the subtleties of interrupts and multiprocessors are
436involved.
437With that in mind, the first three authors approached the last to see
438if his automated tool for checking protocols
439[Hol91]
440could be
441used to verify our new
442.CW sleep
443and
444.CW wakeup
445for correctness.
446The code was translated into the language for that system
447(with, unfortunately, no way of proving that the translation is itself correct)
448and validated by exhaustive simulation.
449.LP
450The validator found a bug.
451Under our assumption that there is only one interrupt, the bug cannot
452occur, but in the more general case of multiple interrupts synchronising
453through the same condition function and rendezvous,
454the process and interrupt can enter a peculiar state.
455A process may return from
456.CW sleep
457with the condition function false
458if there is a delay between
459the condition coming true and
460.CW wakeup
461being called,
462with the delay occurring
463just as the receiving process calls
464.CW sleep .
465The condition is now true, so that process returns immediately,
466does whatever is appropriate, and then (say) decides to call
467.CW sleep
468again.  This time the condition is false, so it goes to sleep.
469The wakeup process then finds a sleeping process,
470and wakes it up, but the condition is now false.
471.LP
472There is an easy (and verified) solution: at the end of
473.CW sleep
474or after
475.CW sleep
476returns,
477if the condition is false, execute
478.CW sleep
479again.  This re-execution cannot repeat; the second synchronisation is guaranteed
480to function under the external conditions we are supposing.
481.LP
482Even though the original code is completely
483protected by interlocks and had been examined carefully by all of us
484and believed correct, it still had problems.
485It seems to us that some exhaustive automated analysis is
486required of multiprocessor algorithms to guarantee their safety.
487Our experience has confirmed that it is almost impossible to
488guarantee by inspection or simple testing the correctness
489of a multiprocessor algorithm.  Testing can demonstrate the presence
490of bugs but not their absence
491[Dij72].
492.LP
493We close by claiming that the code above with
494the suggested modification passes all tests we have for correctness
495under the assumptions used in the validation.
496We would not, however, go so far as to claim that it is universally correct.
497.SH
498References
499.LP
500[Bac86] Maurice J. Bach,
501.I "The Design of the UNIX Operating System,
502Prentice-Hall,
503Englewood Cliffs,
5041986.
505.LP
506[Dij72] Edsger W. Dijkstra,
507``The Humble Programmer \- 1972 Turing Award Lecture'',
508.I "Comm. ACM,
50915(10), pp. 859-866,
510October 1972.
511.LP
512[Hol91] Gerard J. Holzmann,
513.I "Design and Validation of Computer Protocols,
514Prentice-Hall,
515Englewood Cliffs,
5161991.
517.LP
518[Pik90]
519Rob Pike,
520Dave Presotto,
521Ken Thompson,
522Howard Trickey,
523``Plan 9 from Bell Labs'',
524.I "Proceedings of the Summer 1990 UKUUG Conference,
525pp. 1-9,
526London,
527July, 1990.
528.LP
529[Sal66] Jerome H. Saltzer,
530.I "Traffic Control in a Multiplexed Computer System
531MIT,
532Cambridge, Mass.,
5331966.
534