1.HTML "Process Sleep and Wakeup on a Shared-memory Multiprocessor 2.TL 3Process Sleep and Wakeup on a Shared-memory Multiprocessor 4.AU 5Rob Pike 6Dave Presotto 7Ken Thompson 8Gerard Holzmann 9.sp 10rob,presotto,ken,gerard@plan9.bell-labs.com 11.AB 12.FS 13Appeared in a slightly different form in 14.I 15Proceedings of the Spring 1991 EurOpen Conference, 16.R 17Tromsø, Norway, 1991, pp. 161-166. 18.FE 19The problem of enabling a `sleeping' process on a shared-memory multiprocessor 20is a difficult one, especially if the process is to be awakened by an interrupt-time 21event. We present here the code 22for sleep and wakeup primitives that we use in our multiprocessor system. 23The code has been exercised by years of active use and by a verification 24system. 25.AE 26.LP 27Our problem is to synchronise processes on a symmetric shared-memory multiprocessor. 28Processes suspend execution, or 29.I sleep, 30while awaiting an enabling event such as an I/O interrupt. 31When the event occurs, the process is issued a 32.I wakeup 33to resume its execution. 34During these events, other processes may be running and other interrupts 35occurring on other processors. 36.LP 37More specifically, we wish to implement subroutines called 38.CW sleep , 39callable by a process to relinquish control of its current processor, 40and 41.CW wakeup , 42callable by another process or an interrupt to resume the execution 43of a suspended process. 44The calling conventions of these subroutines will remain unspecified 45for the moment. 46.LP 47We assume the processors have an atomic test-and-set or equivalent 48operation but no other synchronisation method. Also, we assume interrupts 49can occur on any processor at any time, except on a processor that has 50locally inhibited them. 51.LP 52The problem is the generalisation to a multiprocessor of a familiar 53and well-understood uniprocessor problem. It may be reduced to a 54uniprocessor problem by using a global test-and-set to serialise the 55sleeps and wakeups, 56which is equivalent to synchronising through a monitor. 57For performance and cleanliness, however, 58we prefer to allow the interrupt handling and process control to be multiprocessed. 59.LP 60Our attempts to solve the sleep/wakeup problem in Plan 9 61[Pik90] 62prompted this paper. 63We implemented solutions several times over several months and each 64time convinced ourselves \(em wrongly \(em they were correct. 65Multiprocessor algorithms can be 66difficult to prove correct by inspection and formal reasoning about them 67is impractical. We finally developed an algorithm we trust by 68verifying our code using an 69empirical testing tool. 70We present that code here, along with some comments about the process by 71which it was designed. 72.SH 73History 74.LP 75Since processes in Plan 9 and the UNIX 76system have similar structure and properties, one might ask if 77UNIX 78.CW sleep 79and 80.CW wakeup 81[Bac86] 82could not easily be adapted from their standard uniprocessor implementation 83to our multiprocessor needs. 84The short answer is, no. 85.LP 86The 87UNIX 88routines 89take as argument a single global address 90that serves as a unique 91identifier to connect the wakeup with the appropriate process or processes. 92This has several inherent disadvantages. 93From the point of view of 94.CW sleep 95and 96.CW wakeup , 97it is difficult to associate a data structure with an arbitrary address; 98the routines are unable to maintain a state variable recording the 99status of the event and processes. 100(The reverse is of course easy \(em we could 101require the address to point to a special data structure \(em 102but we are investigating 103UNIX 104.CW sleep 105and 106.CW wakeup , 107not the code that calls them.) 108Also, multiple processes sleep `on' a given address, so 109.CW wakeup 110must enable them all, and let process scheduling determine which process 111actually benefits from the event. 112This is inefficient; 113a queueing mechanism would be preferable 114but, again, it is difficult to associate a queue with a general address. 115Moreover, the lack of state means that 116.CW sleep 117and 118.CW wakeup 119cannot know what the corresponding process (or interrupt) is doing; 120.CW sleep 121and 122.CW wakeup 123must be executed atomically. 124On a uniprocessor it suffices to disable interrupts during their 125execution. 126On a multiprocessor, however, 127most processors 128can inhibit interrupts only on the current processor, 129so while a process is executing 130.CW sleep 131the desired interrupt can come and go on another processor. 132If the wakeup is to be issued by another process, the problem is even harder. 133Some inter-process mutual exclusion mechanism must be used, 134which, yet again, is difficult to do without a way to communicate state. 135.LP 136In summary, to be useful on a multiprocessor, 137UNIX 138.CW sleep 139and 140.CW wakeup 141must either be made to run atomically on a single 142processor (such as by using a monitor) 143or they need a richer model for their communication. 144.SH 145The design 146.LP 147Consider the case of an interrupt waking up a sleeping process. 148(The other case, a process awakening a second process, is easier because 149atomicity can be achieved using an interlock.) 150The sleeping process is waiting for some event to occur, which may be 151modeled by a condition coming true. 152The condition could be just that the event has happened, or something 153more subtle such as a queue draining below some low-water mark. 154We represent the condition by a function of one 155argument of type 156.CW void* ; 157the code supporting the device generating the interrupts 158provides such a function to be used by 159.CW sleep 160and 161.CW wakeup 162to synchronise. The function returns 163.CW false 164if the event has not occurred, and 165.CW true 166some time after the event has occurred. 167The 168.CW sleep 169and 170.CW wakeup 171routines must, of course, work correctly if the 172event occurs while the process is executing 173.CW sleep . 174.LP 175We assume that a particular call to 176.CW sleep 177corresponds to a particular call to 178.CW wakeup , 179that is, 180at most one process is asleep waiting for a particular event. 181This can be guaranteed in the code that calls 182.CW sleep 183and 184.CW wakeup 185by appropriate interlocks. 186We also assume for the moment that there will be only one interrupt 187and that it may occur at any time, even before 188.CW sleep 189has been called. 190.LP 191For performance, 192we desire that multiple instances of 193.CW sleep 194and 195.CW wakeup 196may be running simultaneously on our multiprocessor. 197For example, a process calling 198.CW sleep 199to await a character from an input channel need not 200wait for another process to finish executing 201.CW sleep 202to await a disk block. 203At a finer level, we would like a process reading from one input channel 204to be able to execute 205.CW sleep 206in parallel with a process reading from another input channel. 207A standard approach to synchronisation is to interlock the channel `driver' 208so that only one process may be executing in the channel code at once. 209This method is clearly inadequate for our purposes; we need 210fine-grained synchronisation, and in particular to apply 211interlocks at the level of individual channels rather than at the level 212of the channel driver. 213.LP 214Our approach is to use an object called a 215.I rendezvous , 216which is a data structure through which 217.CW sleep 218and 219.CW wakeup 220synchronise. 221(The similarly named construct in Ada is a control structure; 222ours is an unrelated data structure.) 223A rendezvous 224is allocated for each active source of events: 225one for each I/O channel, 226one for each end of a pipe, and so on. 227The rendezvous serves as an interlockable structure in which to record 228the state of the sleeping process, so that 229.CW sleep 230and 231.CW wakeup 232can communicate if the event happens before or while 233.CW sleep 234is executing. 235.LP 236Our design for 237.CW sleep 238is therefore a function 239.P1 240void sleep(Rendezvous *r, int (*condition)(void*), void *arg) 241.P2 242called by the sleeping process. 243The argument 244.CW r 245connects the call to 246.CW sleep 247with the call to 248.CW wakeup , 249and is part of the data structure for the (say) device. 250The function 251.CW condition 252is described above; 253called with argument 254.CW arg , 255it is used by 256.CW sleep 257to decide whether the event has occurred. 258.CW Wakeup 259has a simpler specification: 260.P1 261void wakeup(Rendezvous *r). 262.P2 263.CW Wakeup 264must be called after the condition has become true. 265.SH 266An implementation 267.LP 268The 269.CW Rendezvous 270data type is defined as 271.P1 272typedef struct{ 273 Lock l; 274 Proc *p; 275}Rendezvous; 276.P2 277Our 278.CW Locks 279are test-and-set spin locks. 280The routine 281.CW lock(Lock\ *l) 282returns when the current process holds that lock; 283.CW unlock(Lock\ *l) 284releases the lock. 285.LP 286Here is our implementation of 287.CW sleep . 288Its details are discussed below. 289.CW Thisp 290is a pointer to the current process on the current processor. 291(Its value differs on each processor.) 292.P1 293void 294sleep(Rendezvous *r, int (*condition)(void*), void *arg) 295{ 296 int s; 297 298 s = inhibit(); /* interrupts */ 299 lock(&r->l); 300 301 /* 302 * if condition happened, never mind 303 */ 304 if((*condition)(arg)){ 305 unlock(&r->l); 306 allow(); /* interrupts */ 307 return; 308 } 309 310 /* 311 * now we are committed to 312 * change state and call scheduler 313 */ 314 if(r->p) 315 error("double sleep %d %d", r->p->pid, thisp->pid); 316 thisp->state = Wakeme; 317 r->p = thisp; 318 unlock(&r->l); 319 allow(s); /* interrupts */ 320 sched(); /* relinquish CPU */ 321} 322.P2 323.ne 3i 324Here is 325.CW wakeup. 326.P1 327void 328wakeup(Rendezvous *r) 329{ 330 Proc *p; 331 int s; 332 333 s = inhibit(); /* interrupts; return old state */ 334 lock(&r->l); 335 p = r->p; 336 if(p){ 337 r->p = 0; 338 if(p->state != Wakeme) 339 panic("wakeup: not Wakeme"); 340 ready(p); 341 } 342 unlock(&r->l); 343 if(s) 344 allow(); 345} 346.P2 347.CW Sleep 348and 349.CW wakeup 350both begin by disabling interrupts 351and then locking the rendezvous structure. 352Because 353.CW wakeup 354may be called in an interrupt routine, the lock must be set only 355with interrupts disabled on the current processor, 356so that if the interrupt comes during 357.CW sleep 358it will occur only on a different processor; 359if it occurred on the processor executing 360.CW sleep , 361the spin lock in 362.CW wakeup 363would hang forever. 364At the end of each routine, the lock is released and processor priority 365returned to its previous value. 366.CW Wakeup "" ( 367needs to inhibit interrupts in case 368it is being called by a process; 369this is a no-op if called by an interrupt.) 370.LP 371.CW Sleep 372checks to see if the condition has become true, and returns if so. 373Otherwise the process posts its name in the rendezvous structure where 374.CW wakeup 375may find it, marks its state as waiting to be awakened 376(this is for error checking only) and goes to sleep by calling 377.CW sched() . 378The manipulation of the rendezvous structure is all done under the lock, 379and 380.CW wakeup 381only examines it under lock, so atomicity and mutual exclusion 382are guaranteed. 383.LP 384.CW Wakeup 385has a simpler job. When it is called, the condition has implicitly become true, 386so it locks the rendezvous, sees if a process is waiting, and readies it to run. 387.SH 388Discussion 389.LP 390The synchronisation technique used here 391is similar to known methods, even as far back as Saltzer's thesis 392[Sal66]. 393The code looks trivially correct in retrospect: all access to data structures is done 394under lock, and there is no place that things may get out of order. 395Nonetheless, it took us several iterations to arrive at the above 396implementation, because the things that 397.I can 398go wrong are often hard to see. We had four earlier implementations 399that were examined at great length and only found faulty when a new, 400different style of device or activity was added to the system. 401.LP 402.ne 3i 403Here, for example, is an incorrect implementation of wakeup, 404closely related to one of our versions. 405.P1 406void 407wakeup(Rendezvous *r) 408{ 409 Proc *p; 410 int s; 411 412 p = r->p; 413 if(p){ 414 s = inhibit(); 415 lock(&r->l); 416 r->p = 0; 417 if(p->state != Wakeme) 418 panic("wakeup: not Wakeme"); 419 ready(p); 420 unlock(&r->l); 421 if(s) 422 allow(); 423 } 424} 425.P2 426The mistake is that the reading of 427.CW r->p 428may occur just as the other process calls 429.CW sleep , 430so when the interrupt examines the structure it sees no one to wake up, 431and the sleeping process misses its wakeup. 432We wrote the code this way because we reasoned that the fetch 433.CW p 434.CW = 435.CW r->p 436was inherently atomic and need not be interlocked. 437The bug was found by examination when a new, very fast device 438was added to the system and sleeps and interrupts were closely overlapped. 439However, it was in the system for a couple of months without causing an error. 440.LP 441How many errors lurk in our supposedly correct implementation above? 442We would like a way to guarantee correctness; formal proofs are beyond 443our abilities when the subtleties of interrupts and multiprocessors are 444involved. 445With that in mind, the first three authors approached the last to see 446if his automated tool for checking protocols 447[Hol91] 448could be 449used to verify our new 450.CW sleep 451and 452.CW wakeup 453for correctness. 454The code was translated into the language for that system 455(with, unfortunately, no way of proving that the translation is itself correct) 456and validated by exhaustive simulation. 457.LP 458The validator found a bug. 459Under our assumption that there is only one interrupt, the bug cannot 460occur, but in the more general case of multiple interrupts synchronising 461through the same condition function and rendezvous, 462the process and interrupt can enter a peculiar state. 463A process may return from 464.CW sleep 465with the condition function false 466if there is a delay between 467the condition coming true and 468.CW wakeup 469being called, 470with the delay occurring 471just as the receiving process calls 472.CW sleep . 473The condition is now true, so that process returns immediately, 474does whatever is appropriate, and then (say) decides to call 475.CW sleep 476again. This time the condition is false, so it goes to sleep. 477The wakeup process then finds a sleeping process, 478and wakes it up, but the condition is now false. 479.LP 480There is an easy (and verified) solution: at the end of 481.CW sleep 482or after 483.CW sleep 484returns, 485if the condition is false, execute 486.CW sleep 487again. This re-execution cannot repeat; the second synchronisation is guaranteed 488to function under the external conditions we are supposing. 489.LP 490Even though the original code is completely 491protected by interlocks and had been examined carefully by all of us 492and believed correct, it still had problems. 493It seems to us that some exhaustive automated analysis is 494required of multiprocessor algorithms to guarantee their safety. 495Our experience has confirmed that it is almost impossible to 496guarantee by inspection or simple testing the correctness 497of a multiprocessor algorithm. Testing can demonstrate the presence 498of bugs but not their absence 499[Dij72]. 500.LP 501We close by claiming that the code above with 502the suggested modification passes all tests we have for correctness 503under the assumptions used in the validation. 504We would not, however, go so far as to claim that it is universally correct. 505.SH 506References 507.LP 508[Bac86] Maurice J. Bach, 509.I "The Design of the UNIX Operating System, 510Prentice-Hall, 511Englewood Cliffs, 5121986. 513.LP 514[Dij72] Edsger W. Dijkstra, 515``The Humble Programmer \- 1972 Turing Award Lecture'', 516.I "Comm. ACM, 51715(10), pp. 859-866, 518October 1972. 519.LP 520[Hol91] Gerard J. Holzmann, 521.I "Design and Validation of Computer Protocols, 522Prentice-Hall, 523Englewood Cliffs, 5241991. 525.LP 526[Pik90] 527Rob Pike, 528Dave Presotto, 529Ken Thompson, 530Howard Trickey, 531``Plan 9 from Bell Labs'', 532.I "Proceedings of the Summer 1990 UKUUG Conference, 533pp. 1-9, 534London, 535July, 1990. 536.LP 537[Sal66] Jerome H. Saltzer, 538.I "Traffic Control in a Multiplexed Computer System 539MIT, 540Cambridge, Mass., 5411966. 542