xref: /plan9/sys/doc/sleep.ms (revision 426d2b71458df9b491ba6c167f699b3f1f7b0428)
1.HTML "Process Sleep and Wakeup on a Shared-memory Multiprocessor
2.TL
3Process Sleep and Wakeup on a Shared-memory Multiprocessor
4.AU
5Rob Pike
6Dave Presotto
7Ken Thompson
8Gerard Holzmann
9.sp
10rob,presotto,ken,gerard@plan9.bell-labs.com
11.AB
12.FS
13Appeared in a slightly different form in
14.I
15Proceedings of the Spring 1991 EurOpen Conference,
16.R
17Tromsø, Norway, 1991, pp. 161-166.
18.FE
19The problem of enabling a `sleeping' process on a shared-memory multiprocessor
20is a difficult one, especially if the process is to be awakened by an interrupt-time
21event.  We present here the code
22for sleep and wakeup primitives that we use in our multiprocessor system.
23The code has been exercised by years of active use and by a verification
24system.
25.AE
26.LP
27Our problem is to synchronise processes on a symmetric shared-memory multiprocessor.
28Processes suspend execution, or
29.I sleep,
30while awaiting an enabling event such as an I/O interrupt.
31When the event occurs, the process is issued a
32.I wakeup
33to resume its execution.
34During these events, other processes may be running and other interrupts
35occurring on other processors.
36.LP
37More specifically, we wish to implement subroutines called
38.CW sleep ,
39callable by a process to relinquish control of its current processor,
40and
41.CW wakeup ,
42callable by another process or an interrupt to resume the execution
43of a suspended process.
44The calling conventions of these subroutines will remain unspecified
45for the moment.
46.LP
47We assume the processors have an atomic test-and-set or equivalent
48operation but no other synchronisation method.  Also, we assume interrupts
49can occur on any processor at any time, except on a processor that has
50locally inhibited them.
51.LP
52The problem is the generalisation to a multiprocessor of a familiar
53and well-understood uniprocessor problem.  It may be reduced to a
54uniprocessor problem by using a global test-and-set to serialise the
55sleeps and wakeups,
56which is equivalent to synchronising through a monitor.
57For performance and cleanliness, however,
58we prefer to allow the interrupt handling and process control to be multiprocessed.
59.LP
60Our attempts to solve the sleep/wakeup problem in Plan 9
61[Pik90]
62prompted this paper.
63We implemented solutions several times over several months and each
64time convinced ourselves \(em wrongly \(em they were correct.
65Multiprocessor algorithms can be
66difficult to prove correct by inspection and formal reasoning about them
67is impractical.  We finally developed an algorithm we trust by
68verifying our code using an
69empirical testing tool.
70We present that code here, along with some comments about the process by
71which it was designed.
72.SH
73History
74.LP
75Since processes in Plan 9 and the UNIX
76system have similar structure and properties, one might ask if
77UNIX
78.CW sleep
79and
80.CW wakeup
81[Bac86]
82could not easily be adapted from their standard uniprocessor implementation
83to our multiprocessor needs.
84The short answer is, no.
85.LP
86The
87UNIX
88routines
89take as argument a single global address
90that serves as a unique
91identifier to connect the wakeup with the appropriate process or processes.
92This has several inherent disadvantages.
93From the point of view of
94.CW sleep
95and
96.CW wakeup ,
97it is difficult to associate a data structure with an arbitrary address;
98the routines are unable to maintain a state variable recording the
99status of the event and processes.
100(The reverse is of course easy \(em we could
101require the address to point to a special data structure \(em
102but we are investigating
103UNIX
104.CW sleep
105and
106.CW wakeup ,
107not the code that calls them.)
108Also, multiple processes sleep `on' a given address, so
109.CW wakeup
110must enable them all, and let process scheduling determine which process
111actually benefits from the event.
112This is inefficient;
113a queueing mechanism would be preferable
114but, again, it is difficult to associate a queue with a general address.
115Moreover, the lack of state means that
116.CW sleep
117and
118.CW wakeup
119cannot know what the corresponding process (or interrupt) is doing;
120.CW sleep
121and
122.CW wakeup
123must be executed atomically.
124On a uniprocessor it suffices to disable interrupts during their
125execution.
126On a multiprocessor, however,
127most processors
128can inhibit interrupts only on the current processor,
129so while a process is executing
130.CW sleep
131the desired interrupt can come and go on another processor.
132If the wakeup is to be issued by another process, the problem is even harder.
133Some inter-process mutual exclusion mechanism must be used,
134which, yet again, is difficult to do without a way to communicate state.
135.LP
136In summary, to be useful on a multiprocessor,
137UNIX
138.CW sleep
139and
140.CW wakeup
141must either be made to run atomically on a single
142processor (such as by using a monitor)
143or they need a richer model for their communication.
144.SH
145The design
146.LP
147Consider the case of an interrupt waking up a sleeping process.
148(The other case, a process awakening a second process, is easier because
149atomicity can be achieved using an interlock.)
150The sleeping process is waiting for some event to occur, which may be
151modeled by a condition coming true.
152The condition could be just that the event has happened, or something
153more subtle such as a queue draining below some low-water mark.
154We represent the condition by a function of one
155argument of type
156.CW void* ;
157the code supporting the device generating the interrupts
158provides such a function to be used by
159.CW sleep
160and
161.CW wakeup
162to synchronise.  The function returns
163.CW false
164if the event has not occurred, and
165.CW true
166some time after the event has occurred.
167The
168.CW sleep
169and
170.CW wakeup
171routines must, of course, work correctly if the
172event occurs while the process is executing
173.CW sleep .
174.LP
175We assume that a particular call to
176.CW sleep
177corresponds to a particular call to
178.CW wakeup ,
179that is,
180at most one process is asleep waiting for a particular event.
181This can be guaranteed in the code that calls
182.CW sleep
183and
184.CW wakeup
185by appropriate interlocks.
186We also assume for the moment that there will be only one interrupt
187and that it may occur at any time, even before
188.CW sleep
189has been called.
190.LP
191For performance,
192we desire that multiple instances of
193.CW sleep
194and
195.CW wakeup
196may be running simultaneously on our multiprocessor.
197For example, a process calling
198.CW sleep
199to await a character from an input channel need not
200wait for another process to finish executing
201.CW sleep
202to await a disk block.
203At a finer level, we would like a process reading from one input channel
204to be able to execute
205.CW sleep
206in parallel with a process reading from another input channel.
207A standard approach to synchronisation is to interlock the channel `driver'
208so that only one process may be executing in the channel code at once.
209This method is clearly inadequate for our purposes; we need
210fine-grained synchronisation, and in particular to apply
211interlocks at the level of individual channels rather than at the level
212of the channel driver.
213.LP
214Our approach is to use an object called a
215.I rendezvous ,
216which is a data structure through which
217.CW sleep
218and
219.CW wakeup
220synchronise.
221(The similarly named construct in Ada is a control structure;
222ours is an unrelated data structure.)
223A rendezvous
224is allocated for each active source of events:
225one for each I/O channel,
226one for each end of a pipe, and so on.
227The rendezvous serves as an interlockable structure in which to record
228the state of the sleeping process, so that
229.CW sleep
230and
231.CW wakeup
232can communicate if the event happens before or while
233.CW sleep
234is executing.
235.LP
236Our design for
237.CW sleep
238is therefore a function
239.P1
240void sleep(Rendezvous *r, int (*condition)(void*), void *arg)
241.P2
242called by the sleeping process.
243The argument
244.CW r
245connects the call to
246.CW sleep
247with the call to
248.CW wakeup ,
249and is part of the data structure for the (say) device.
250The function
251.CW condition
252is described above;
253called with argument
254.CW arg ,
255it is used by
256.CW sleep
257to decide whether the event has occurred.
258.CW Wakeup
259has a simpler specification:
260.P1
261void wakeup(Rendezvous *r).
262.P2
263.CW Wakeup
264must be called after the condition has become true.
265.SH
266An implementation
267.LP
268The
269.CW Rendezvous
270data type is defined as
271.P1
272typedef struct{
273	Lock	l;
274	Proc	*p;
275}Rendezvous;
276.P2
277Our
278.CW Locks
279are test-and-set spin locks.
280The routine
281.CW lock(Lock\ *l)
282returns when the current process holds that lock;
283.CW unlock(Lock\ *l)
284releases the lock.
285.LP
286Here is our implementation of
287.CW sleep .
288Its details are discussed below.
289.CW Thisp
290is a pointer to the current process on the current processor.
291(Its value differs on each processor.)
292.P1
293void
294sleep(Rendezvous *r, int (*condition)(void*), void *arg)
295{
296	int s;
297
298	s = inhibit();		/* interrupts */
299	lock(&r->l);
300
301	/*
302	 * if condition happened, never mind
303	 */
304	if((*condition)(arg)){
305		unlock(&r->l);
306		allow();	/* interrupts */
307		return;
308	}
309
310	/*
311	 * now we are committed to
312	 * change state and call scheduler
313	 */
314	if(r->p)
315		error("double sleep %d %d", r->p->pid, thisp->pid);
316	thisp->state = Wakeme;
317	r->p = thisp;
318	unlock(&r->l);
319	allow(s);	/* interrupts */
320	sched();	/* relinquish CPU */
321}
322.P2
323.ne 3i
324Here is
325.CW wakeup.
326.P1
327void
328wakeup(Rendezvous *r)
329{
330	Proc *p;
331	int s;
332
333	s = inhibit();	/* interrupts; return old state */
334	lock(&r->l);
335	p = r->p;
336	if(p){
337		r->p = 0;
338		if(p->state != Wakeme)
339			panic("wakeup: not Wakeme");
340		ready(p);
341	}
342	unlock(&r->l);
343	if(s)
344		allow();
345}
346.P2
347.CW Sleep
348and
349.CW wakeup
350both begin by disabling interrupts
351and then locking the rendezvous structure.
352Because
353.CW wakeup
354may be called in an interrupt routine, the lock must be set only
355with interrupts disabled on the current processor,
356so that if the interrupt comes during
357.CW sleep
358it will occur only on a different processor;
359if it occurred on the processor executing
360.CW sleep ,
361the spin lock in
362.CW wakeup
363would hang forever.
364At the end of each routine, the lock is released and processor priority
365returned to its previous value.
366.CW Wakeup "" (
367needs to inhibit interrupts in case
368it is being called by a process;
369this is a no-op if called by an interrupt.)
370.LP
371.CW Sleep
372checks to see if the condition has become true, and returns if so.
373Otherwise the process posts its name in the rendezvous structure where
374.CW wakeup
375may find it, marks its state as waiting to be awakened
376(this is for error checking only) and goes to sleep by calling
377.CW sched() .
378The manipulation of the rendezvous structure is all done under the lock,
379and
380.CW wakeup
381only examines it under lock, so atomicity and mutual exclusion
382are guaranteed.
383.LP
384.CW Wakeup
385has a simpler job.  When it is called, the condition has implicitly become true,
386so it locks the rendezvous, sees if a process is waiting, and readies it to run.
387.SH
388Discussion
389.LP
390The synchronisation technique used here
391is similar to known methods, even as far back as Saltzer's thesis
392[Sal66].
393The code looks trivially correct in retrospect: all access to data structures is done
394under lock, and there is no place that things may get out of order.
395Nonetheless, it took us several iterations to arrive at the above
396implementation, because the things that
397.I can
398go wrong are often hard to see.  We had four earlier implementations
399that were examined at great length and only found faulty when a new,
400different style of device or activity was added to the system.
401.LP
402.ne 3i
403Here, for example, is an incorrect implementation of wakeup,
404closely related to one of our versions.
405.P1
406void
407wakeup(Rendezvous *r)
408{
409	Proc *p;
410	int s;
411
412	p = r->p;
413	if(p){
414		s = inhibit();
415		lock(&r->l);
416		r->p = 0;
417		if(p->state != Wakeme)
418			panic("wakeup: not Wakeme");
419		ready(p);
420		unlock(&r->l);
421		if(s)
422			allow();
423	}
424}
425.P2
426The mistake is that the reading of
427.CW r->p
428may occur just as the other process calls
429.CW sleep ,
430so when the interrupt examines the structure it sees no one to wake up,
431and the sleeping process misses its wakeup.
432We wrote the code this way because we reasoned that the fetch
433.CW p
434.CW =
435.CW r->p
436was inherently atomic and need not be interlocked.
437The bug was found by examination when a new, very fast device
438was added to the system and sleeps and interrupts were closely overlapped.
439However, it was in the system for a couple of months without causing an error.
440.LP
441How many errors lurk in our supposedly correct implementation above?
442We would like a way to guarantee correctness; formal proofs are beyond
443our abilities when the subtleties of interrupts and multiprocessors are
444involved.
445With that in mind, the first three authors approached the last to see
446if his automated tool for checking protocols
447[Hol91]
448could be
449used to verify our new
450.CW sleep
451and
452.CW wakeup
453for correctness.
454The code was translated into the language for that system
455(with, unfortunately, no way of proving that the translation is itself correct)
456and validated by exhaustive simulation.
457.LP
458The validator found a bug.
459Under our assumption that there is only one interrupt, the bug cannot
460occur, but in the more general case of multiple interrupts synchronising
461through the same condition function and rendezvous,
462the process and interrupt can enter a peculiar state.
463A process may return from
464.CW sleep
465with the condition function false
466if there is a delay between
467the condition coming true and
468.CW wakeup
469being called,
470with the delay occurring
471just as the receiving process calls
472.CW sleep .
473The condition is now true, so that process returns immediately,
474does whatever is appropriate, and then (say) decides to call
475.CW sleep
476again.  This time the condition is false, so it goes to sleep.
477The wakeup process then finds a sleeping process,
478and wakes it up, but the condition is now false.
479.LP
480There is an easy (and verified) solution: at the end of
481.CW sleep
482or after
483.CW sleep
484returns,
485if the condition is false, execute
486.CW sleep
487again.  This re-execution cannot repeat; the second synchronisation is guaranteed
488to function under the external conditions we are supposing.
489.LP
490Even though the original code is completely
491protected by interlocks and had been examined carefully by all of us
492and believed correct, it still had problems.
493It seems to us that some exhaustive automated analysis is
494required of multiprocessor algorithms to guarantee their safety.
495Our experience has confirmed that it is almost impossible to
496guarantee by inspection or simple testing the correctness
497of a multiprocessor algorithm.  Testing can demonstrate the presence
498of bugs but not their absence
499[Dij72].
500.LP
501We close by claiming that the code above with
502the suggested modification passes all tests we have for correctness
503under the assumptions used in the validation.
504We would not, however, go so far as to claim that it is universally correct.
505.SH
506References
507.LP
508[Bac86] Maurice J. Bach,
509.I "The Design of the UNIX Operating System,
510Prentice-Hall,
511Englewood Cliffs,
5121986.
513.LP
514[Dij72] Edsger W. Dijkstra,
515``The Humble Programmer \- 1972 Turing Award Lecture'',
516.I "Comm. ACM,
51715(10), pp. 859-866,
518October 1972.
519.LP
520[Hol91] Gerard J. Holzmann,
521.I "Design and Validation of Computer Protocols,
522Prentice-Hall,
523Englewood Cliffs,
5241991.
525.LP
526[Pik90]
527Rob Pike,
528Dave Presotto,
529Ken Thompson,
530Howard Trickey,
531``Plan 9 from Bell Labs'',
532.I "Proceedings of the Summer 1990 UKUUG Conference,
533pp. 1-9,
534London,
535July, 1990.
536.LP
537[Sal66] Jerome H. Saltzer,
538.I "Traffic Control in a Multiplexed Computer System
539MIT,
540Cambridge, Mass.,
5411966.
542