1.TL 2Process Sleep and Wakeup on a Shared-memory Multiprocessor 3.AU 4Rob Pike 5Dave Presotto 6Ken Thompson 7Gerard Holzmann 8.sp 9rob,presotto,ken,gerard@plan9.att.com 10.AB 11The problem of enabling a `sleeping' process on a shared-memory multiprocessor 12is a difficult one, especially if the process is to be awakened by an interrupt-time 13event. We present here the code 14for sleep and wakeup primitives that we use in our multiprocessor system. 15The code has been exercised by years of active use and by a verification 16system. 17.AE 18.LP 19Our problem is to synchronise processes on a symmetric shared-memory multiprocessor. 20Processes suspend execution, or 21.I sleep, 22while awaiting an enabling event such as an I/O interrupt. 23When the event occurs, the process is issued a 24.I wakeup 25to resume its execution. 26During these events, other processes may be running and other interrupts 27occurring on other processors. 28.LP 29More specifically, we wish to implement subroutines called 30.CW sleep , 31callable by a process to relinquish control of its current processor, 32and 33.CW wakeup , 34callable by another process or an interrupt to resume the execution 35of a suspended process. 36The calling conventions of these subroutines will remain unspecified 37for the moment. 38.LP 39We assume the processors have an atomic test-and-set or equivalent 40operation but no other synchronisation method. Also, we assume interrupts 41can occur on any processor at any time, except on a processor that has 42locally inhibited them. 43.LP 44The problem is the generalisation to a multiprocessor of a familiar 45and well-understood uniprocessor problem. It may be reduced to a 46uniprocessor problem by using a global test-and-set to serialise the 47sleeps and wakeups, 48which is equivalent to synchronising through a monitor. 49For performance and cleanliness, however, 50we prefer to allow the interrupt handling and process control to be multiprocessed. 51.LP 52Our attempts to solve the sleep/wakeup problem in Plan 9 53[Pik90] 54prompted this paper. 55We implemented solutions several times over several months and each 56time convinced ourselves \(em wrongly \(em they were correct. 57Multiprocessor algorithms can be 58difficult to prove correct by inspection and formal reasoning about them 59is impractical. We finally developed an algorithm we trust by 60verifying our code using an 61empirical testing tool. 62We present that code here, along with some comments about the process by 63which it was designed. 64.SH 65History 66.LP 67Since processes in Plan 9 and the UNIX 68system have similar structure and properties, one might ask if 69UNIX 70.CW sleep 71and 72.CW wakeup 73[Bac86] 74could not easily be adapted from their standard uniprocessor implementation 75to our multiprocessor needs. 76The short answer is, no. 77.LP 78The 79UNIX 80routines 81take as argument a single global address 82that serves as a unique 83identifier to connect the wakeup with the appropriate process or processes. 84This has several inherent disadvantages. 85From the point of view of 86.CW sleep 87and 88.CW wakeup , 89it is difficult to associate a data structure with an arbitrary address; 90the routines are unable to maintain a state variable recording the 91status of the event and processes. 92(The reverse is of course easy \(em we could 93require the address to point to a special data structure \(em 94but we are investigating 95UNIX 96.CW sleep 97and 98.CW wakeup , 99not the code that calls them.) 100Also, multiple processes sleep `on' a given address, so 101.CW wakeup 102must enable them all, and let process scheduling determine which process 103actually benefits from the event. 104This is inefficient; 105a queueing mechanism would be preferable 106but, again, it is difficult to associate a queue with a general address. 107Moreover, the lack of state means that 108.CW sleep 109and 110.CW wakeup 111cannot know what the corresponding process (or interrupt) is doing; 112.CW sleep 113and 114.CW wakeup 115must be executed atomically. 116On a uniprocessor it suffices to disable interrupts during their 117execution. 118On a multiprocessor, however, 119most processors 120can inhibit interrupts only on the current processor, 121so while a process is executing 122.CW sleep 123the desired interrupt can come and go on another processor. 124If the wakeup is to be issued by another process, the problem is even harder. 125Some inter-process mutual exclusion mechanism must be used, 126which, yet again, is difficult to do without a way to communicate state. 127.LP 128In summary, to be useful on a multiprocessor, 129UNIX 130.CW sleep 131and 132.CW wakeup 133must either be made to run atomically on a single 134processor (such as by using a monitor) 135or they need a richer model for their communication. 136.SH 137The design 138.LP 139Consider the case of an interrupt waking up a sleeping process. 140(The other case, a process awakening a second process, is easier because 141atomicity can be achieved using an interlock.) 142The sleeping process is waiting for some event to occur, which may be 143modeled by a condition coming true. 144The condition could be just that the event has happened, or something 145more subtle such as a queue draining below some low-water mark. 146We represent the condition by a function of one 147argument of type 148.CW void* ; 149the code supporting the device generating the interrupts 150provides such a function to be used by 151.CW sleep 152and 153.CW wakeup 154to synchronise. The function returns 155.CW false 156if the event has not occurred, and 157.CW true 158some time after the event has occurred. 159The 160.CW sleep 161and 162.CW wakeup 163routines must, of course, work correctly if the 164event occurs while the process is executing 165.CW sleep . 166.LP 167We assume that a particular call to 168.CW sleep 169corresponds to a particular call to 170.CW wakeup , 171that is, 172at most one process is asleep waiting for a particular event. 173This can be guaranteed in the code that calls 174.CW sleep 175and 176.CW wakeup 177by appropriate interlocks. 178We also assume for the moment that there will be only one interrupt 179and that it may occur at any time, even before 180.CW sleep 181has been called. 182.LP 183For performance, 184we desire that multiple instances of 185.CW sleep 186and 187.CW wakeup 188may be running simultaneously on our multiprocessor. 189For example, a process calling 190.CW sleep 191to await a character from an input channel need not 192wait for another process to finish executing 193.CW sleep 194to await a disk block. 195At a finer level, we would like a process reading from one input channel 196to be able to execute 197.CW sleep 198in parallel with a process reading from another input channel. 199A standard approach to synchronisation is to interlock the channel `driver' 200so that only one process may be executing in the channel code at once. 201This method is clearly inadequate for our purposes; we need 202fine-grained synchronisation, and in particular to apply 203interlocks at the level of individual channels rather than at the level 204of the channel driver. 205.LP 206Our approach is to use an object called a 207.I rendezvous , 208which is a data structure through which 209.CW sleep 210and 211.CW wakeup 212synchronise. 213(The similarly named construct in Ada is a control structure; 214ours is an unrelated data structure.) 215A rendezvous 216is allocated for each active source of events: 217one for each I/O channel, 218one for each end of a pipe, and so on. 219The rendezvous serves as an interlockable structure in which to record 220the state of the sleeping process, so that 221.CW sleep 222and 223.CW wakeup 224can communicate if the event happens before or while 225.CW sleep 226is executing. 227.LP 228Our design for 229.CW sleep 230is therefore a function 231.P1 232void sleep(Rendezvous *r, int (*condition)(void*), void *arg) 233.P2 234called by the sleeping process. 235The argument 236.CW r 237connects the call to 238.CW sleep 239with the call to 240.CW wakeup , 241and is part of the data structure for the (say) device. 242The function 243.CW condition 244is described above; 245called with argument 246.CW arg , 247it is used by 248.CW sleep 249to decide whether the event has occurred. 250.CW Wakeup 251has a simpler specification: 252.P1 253void wakeup(Rendezvous *r). 254.P2 255.CW Wakeup 256must be called after the condition has become true. 257.SH 258An implementation 259.LP 260The 261.CW Rendezvous 262data type is defined as 263.P1 264typedef struct{ 265 Lock l; 266 Proc *p; 267}Rendezvous; 268.P2 269Our 270.CW Locks 271are test-and-set spin locks. 272The routine 273.CW lock(Lock\ *l) 274returns when the current process holds that lock; 275.CW unlock(Lock\ *l) 276releases the lock. 277.LP 278Here is our implementation of 279.CW sleep . 280Its details are discussed below. 281.CW Thisp 282is a pointer to the current process on the current processor. 283(Its value differs on each processor.) 284.P1 285void 286sleep(Rendezvous *r, int (*condition)(void*), void *arg) 287{ 288 int s; 289 290 s = inhibit(); /* interrupts */ 291 lock(&r->l); 292 293 /* 294 * if condition happened, never mind 295 */ 296 if((*condition)(arg)){ 297 unlock(&r->l); 298 allow(); /* interrupts */ 299 return; 300 } 301 302 /* 303 * now we are committed to 304 * change state and call scheduler 305 */ 306 if(r->p) 307 error("double sleep %d %d", r->p->pid, thisp->pid); 308 thisp->state = Wakeme; 309 r->p = thisp; 310 unlock(&r->l); 311 allow(s); /* interrupts */ 312 sched(); /* relinquish CPU */ 313} 314.P2 315.ne 3i 316Here is 317.CW wakeup. 318.P1 319void 320wakeup(Rendezvous *r) 321{ 322 Proc *p; 323 int s; 324 325 s = inhibit(); /* interrupts; return old state */ 326 lock(&r->l); 327 p = r->p; 328 if(p){ 329 r->p = 0; 330 if(p->state != Wakeme) 331 panic("wakeup: not Wakeme"); 332 ready(p); 333 } 334 unlock(&r->l); 335 if(s) 336 allow(); 337} 338.P2 339.CW Sleep 340and 341.CW wakeup 342both begin by disabling interrupts 343and then locking the rendezvous structure. 344Because 345.CW wakeup 346may be called in an interrupt routine, the lock must be set only 347with interrupts disabled on the current processor, 348so that if the interrupt comes during 349.CW sleep 350it will occur only on a different processor; 351if it occurred on the processor executing 352.CW sleep , 353the spin lock in 354.CW wakeup 355would hang forever. 356At the end of each routine, the lock is released and processor priority 357returned to its previous value. 358.CW Wakeup "" ( 359needs to inhibit interrupts in case 360it is being called by a process; 361this is a no-op if called by an interrupt.) 362.LP 363.CW Sleep 364checks to see if the condition has become true, and returns if so. 365Otherwise the process posts its name in the rendezvous structure where 366.CW wakeup 367may find it, marks its state as waiting to be awakened 368(this is for error checking only) and goes to sleep by calling 369.CW sched() . 370The manipulation of the rendezvous structure is all done under the lock, 371and 372.CW wakeup 373only examines it under lock, so atomicity and mutual exclusion 374are guaranteed. 375.LP 376.CW Wakeup 377has a simpler job. When it is called, the condition has implicitly become true, 378so it locks the rendezvous, sees if a process is waiting, and readies it to run. 379.SH 380Discussion 381.LP 382The synchronisation technique used here 383is similar to known methods, even as far back as Saltzer's thesis 384[Sal66]. 385The code looks trivially correct in retrospect: all access to data structures is done 386under lock, and there is no place that things may get out of order. 387Nonetheless, it took us several iterations to arrive at the above 388implementation, because the things that 389.I can 390go wrong are often hard to see. We had four earlier implementations 391that were examined at great length and only found faulty when a new, 392different style of device or activity was added to the system. 393.LP 394.ne 3i 395Here, for example, is an incorrect implementation of wakeup, 396closely related to one of our versions. 397.P1 398void 399wakeup(Rendezvous *r) 400{ 401 Proc *p; 402 int s; 403 404 p = r->p; 405 if(p){ 406 s = inhibit(); 407 lock(&r->l); 408 r->p = 0; 409 if(p->state != Wakeme) 410 panic("wakeup: not Wakeme"); 411 ready(p); 412 unlock(&r->l); 413 if(s) 414 allow(); 415 } 416} 417.P2 418The mistake is that the reading of 419.CW r->p 420may occur just as the other process calls 421.CW sleep , 422so when the interrupt examines the structure it sees no one to wake up, 423and the sleeping process misses its wakeup. 424We wrote the code this way because we reasoned that the fetch 425.CW p 426.CW = 427.CW r->p 428was inherently atomic and need not be interlocked. 429The bug was found by examination when a new, very fast device 430was added to the system and sleeps and interrupts were closely overlapped. 431However, it was in the system for a couple of months without causing an error. 432.LP 433How many errors lurk in our supposedly correct implementation above? 434We would like a way to guarantee correctness; formal proofs are beyond 435our abilities when the subtleties of interrupts and multiprocessors are 436involved. 437With that in mind, the first three authors approached the last to see 438if his automated tool for checking protocols 439[Hol91] 440could be 441used to verify our new 442.CW sleep 443and 444.CW wakeup 445for correctness. 446The code was translated into the language for that system 447(with, unfortunately, no way of proving that the translation is itself correct) 448and validated by exhaustive simulation. 449.LP 450The validator found a bug. 451Under our assumption that there is only one interrupt, the bug cannot 452occur, but in the more general case of multiple interrupts synchronising 453through the same condition function and rendezvous, 454the process and interrupt can enter a peculiar state. 455A process may return from 456.CW sleep 457with the condition function false 458if there is a delay between 459the condition coming true and 460.CW wakeup 461being called, 462with the delay occurring 463just as the receiving process calls 464.CW sleep . 465The condition is now true, so that process returns immediately, 466does whatever is appropriate, and then (say) decides to call 467.CW sleep 468again. This time the condition is false, so it goes to sleep. 469The wakeup process then finds a sleeping process, 470and wakes it up, but the condition is now false. 471.LP 472There is an easy (and verified) solution: at the end of 473.CW sleep 474or after 475.CW sleep 476returns, 477if the condition is false, execute 478.CW sleep 479again. This re-execution cannot repeat; the second synchronisation is guaranteed 480to function under the external conditions we are supposing. 481.LP 482Even though the original code is completely 483protected by interlocks and had been examined carefully by all of us 484and believed correct, it still had problems. 485It seems to us that some exhaustive automated analysis is 486required of multiprocessor algorithms to guarantee their safety. 487Our experience has confirmed that it is almost impossible to 488guarantee by inspection or simple testing the correctness 489of a multiprocessor algorithm. Testing can demonstrate the presence 490of bugs but not their absence 491[Dij72]. 492.LP 493We close by claiming that the code above with 494the suggested modification passes all tests we have for correctness 495under the assumptions used in the validation. 496We would not, however, go so far as to claim that it is universally correct. 497.SH 498References 499.LP 500[Bac86] Maurice J. Bach, 501.I "The Design of the UNIX Operating System, 502Prentice-Hall, 503Englewood Cliffs, 5041986. 505.LP 506[Dij72] Edsger W. Dijkstra, 507``The Humble Programmer \- 1972 Turing Award Lecture'', 508.I "Comm. ACM, 50915(10), pp. 859-866, 510October 1972. 511.LP 512[Hol91] Gerard J. Holzmann, 513.I "Design and Validation of Computer Protocols, 514Prentice-Hall, 515Englewood Cliffs, 5161991. 517.LP 518[Pik90] 519Rob Pike, 520Dave Presotto, 521Ken Thompson, 522Howard Trickey, 523``Plan 9 from Bell Labs'', 524.I "Proceedings of the Summer 1990 UKUUG Conference, 525pp. 1-9, 526London, 527July, 1990. 528.LP 529[Sal66] Jerome H. Saltzer, 530.I "Traffic Control in a Multiplexed Computer System 531MIT, 532Cambridge, Mass., 5331966. 534