1.HTML "The Organization of Networks in Plan 9 2.TL 3The Organization of Networks in Plan 9 4.AU 5Dave Presotto 6Phil Winterbottom 7.sp 8presotto,philw@plan9.bell-labs.com 9.AB 10.FS 11Originally appeared in 12.I 13Proc. of the Winter 1993 USENIX Conf., 14.R 15pp. 271-280, 16San Diego, CA 17.FE 18In a distributed system networks are of paramount importance. This 19paper describes the implementation, design philosophy, and organization 20of network support in Plan 9. Topics include network requirements 21for distributed systems, our kernel implementation, network naming, user interfaces, 22and performance. We also observe that much of this organization is relevant to 23current systems. 24.AE 25.NH 26Introduction 27.PP 28Plan 9 [Pike90] is a general-purpose, multi-user, portable distributed system 29implemented on a variety of computers and networks. 30What distinguishes Plan 9 is its organization. 31The goals of this organization were to 32reduce administration 33and to promote resource sharing. One of the keys to its success as a distributed 34system is the organization and management of its networks. 35.PP 36A Plan 9 system comprises file servers, CPU servers and terminals. 37The file servers and CPU servers are typically centrally 38located multiprocessor machines with large memories and 39high speed interconnects. 40A variety of workstation-class machines 41serve as terminals 42connected to the central servers using several networks and protocols. 43The architecture of the system demands a hierarchy of network 44speeds matching the needs of the components. 45Connections between file servers and CPU servers are high-bandwidth point-to-point 46fiber links. 47Connections from the servers fan out to local terminals 48using medium speed networks 49such as Ethernet [Met80] and Datakit [Fra80]. 50Low speed connections via the Internet and 51the AT&T backbone serve users in Oregon and Illinois. 52Basic Rate ISDN data service and 9600 baud serial lines provide slow 53links to users at home. 54.PP 55Since CPU servers and terminals use the same kernel, 56users may choose to run programs locally on 57their terminals or remotely on CPU servers. 58The organization of Plan 9 hides the details of system connectivity 59allowing both users and administrators to configure their environment 60to be as distributed or centralized as they wish. 61Simple commands support the 62construction of a locally represented name space 63spanning many machines and networks. 64At work, users tend to use their terminals like workstations, 65running interactive programs locally and 66reserving the CPU servers for data or compute intensive jobs 67such as compiling and computing chess endgames. 68At home or when connected over 69a slow network, users tend to do most work on the CPU server to minimize 70traffic on the slow links. 71The goal of the network organization is to provide the same 72environment to the user wherever resources are used. 73.NH 74Kernel Network Support 75.PP 76Networks play a central role in any distributed system. This is particularly 77true in Plan 9 where most resources are provided by servers external to the kernel. 78The importance of the networking code within the kernel 79is reflected by its size; 80of 25,000 lines of kernel code, 12,500 are network and protocol related. 81Networks are continually being added and the fraction of code 82devoted to communications 83is growing. 84Moreover, the network code is complex. 
Protocol implementations consist almost entirely of
synchronization and dynamic memory management, areas demanding
subtle error recovery
strategies.
The kernel currently supports Datakit, point-to-point fiber links,
an Internet (IP) protocol suite and ISDN data service.
The variety of networks and machines
has raised issues not addressed by other systems running on commercial
hardware supporting only Ethernet or FDDI.
.NH 2
The File System protocol
.PP
A central idea in Plan 9 is the representation of a resource as a hierarchical
file system.
Each process assembles a view of the system by building a
.I "name space"
[Needham] connecting its resources.
File systems need not represent disc files; in fact, most Plan 9 file systems have no
permanent storage.
A typical file system dynamically represents
some resource like a set of network connections or the process table.
Communication between the kernel, device drivers, and local or remote file servers uses a
protocol called 9P. The protocol consists of 17 messages
describing operations on files and directories.
Kernel resident device and protocol drivers use a procedural version
of the protocol while external file servers use an RPC form.
Nearly all traffic between Plan 9 systems consists
of 9P messages.
9P relies on several properties of the underlying transport protocol.
It assumes messages arrive reliably and in sequence and
that delimiters between messages
are preserved.
When a protocol does not meet these
requirements (for example, TCP does not preserve delimiters)
we provide mechanisms to marshal messages before handing them
to the system.
.PP
A kernel data structure, the
.I channel ,
is a handle to a file server.
Operations on a channel generate the following 9P messages.
The
.CW session
and
.CW attach
messages authenticate a connection, established by means external to 9P,
and validate its user.
The result is an authenticated
channel
referencing the root of the
server.
The
.CW clone
message makes a new channel identical to an existing channel, much like
the
.CW dup
system call.
A
channel
may be moved to a file on the server using a
.CW walk
message to descend each level in the hierarchy.
The
.CW stat
and
.CW wstat
messages read and write the attributes of the file referenced by a channel.
The
.CW open
message prepares a channel for subsequent
.CW read
and
.CW write
messages to access the contents of the file.
.CW Create
and
.CW remove
perform the actions implied by their names on the file
referenced by the channel.
The
.CW clunk
message discards a channel without affecting the file.
.PP
A kernel resident file server called the
.I "mount driver"
converts the procedural version of 9P into RPCs.
The
.I mount
system call provides a file descriptor, which can be
a pipe to a user process or a network connection to a remote machine, to
be associated with the mount point.
After a mount, operations
on the file tree below the mount point are sent as messages to the file server.
The
mount
driver manages buffers, packs and unpacks parameters from
messages, and demultiplexes among processes using the file server.
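.PP
To make this concrete, the following sketch hands a 9P conversation to the mount
driver; thereafter, ordinary file operations below the mount point are converted
into 9P RPCs.
It assumes the
.CW dial
routine described later in this paper, the mount flag
.CW MREPL ,
and the
.CW mount
signature of this era; the server name and mount point are illustrative.
.P1
.ta 8n 16n 24n 32n 40n 48n 56n 64n
#include <u.h>
#include <libc.h>

/*
 * Sketch: attach a remote file server's tree at /n/remote.
 * Assumes dial() (described below) and a mount() of the form
 * mount(int fd, char *old, int flag, char *spec); the server
 * name helix and the mount point are illustrative.
 */
void
attachremote(void)
{
	int fd;

	fd = dial("net!helix!9fs", 0, 0, 0);	/* transport carrying 9P */
	if(fd < 0)
		return;
	/* from here on, operations under /n/remote become 9P RPCs on fd */
	if(mount(fd, "/n/remote", MREPL, "") < 0)
		close(fd);
}
.P2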
.NH 2
Kernel Organization
.PP
The network code in the kernel is divided into three layers: hardware interface,
protocol processing, and program interface.
A device driver typically uses streams to connect the two interface layers.
Additional stream modules may be pushed on
a device to process protocols.
Each device driver is a kernel-resident file system.
Simple device drivers serve a single level
directory containing just a few files;
for example, we represent each UART
by a data and a control file.
.P1
cpu% cd /dev
cpu% ls -l eia*
--rw-rw-rw- t 0 bootes bootes 0 Jul 16 17:28 eia1
--rw-rw-rw- t 0 bootes bootes 0 Jul 16 17:28 eia1ctl
--rw-rw-rw- t 0 bootes bootes 0 Jul 16 17:28 eia2
--rw-rw-rw- t 0 bootes bootes 0 Jul 16 17:28 eia2ctl
cpu%
.P2
The control file is used to control the device;
writing the string
.CW b1200
to
.CW /dev/eia1ctl
sets the line to 1200 baud.
.PP
Multiplexed devices present
a more complex interface structure.
For example, the LANCE Ethernet driver
serves a two level file tree (Figure 1)
providing
.IP \(bu
device control and configuration
.IP \(bu
user-level protocols like ARP
.IP \(bu
diagnostic interfaces for snooping software.
.LP
The top directory contains a
.CW clone
file and a directory for each connection, numbered
.CW 1
to
.CW n .
Each connection directory corresponds to an Ethernet packet type.
Opening the
.CW clone
file finds an unused connection directory
and opens its
.CW ctl
file.
Reading the control file returns the ASCII connection number; the user
process can use this value to construct the name of the proper
connection directory.
In each connection directory files named
.CW ctl ,
.CW data ,
.CW stats ,
and
.CW type
provide access to the connection.
Writing the string
.CW "connect 2048"
to the
.CW ctl
file sets the packet type to 2048
and
configures the connection to receive
all IP packets sent to the machine.
Subsequent reads of the file
.CW type
yield the string
.CW 2048 .
The
.CW data
file accesses the media;
reading it
returns the
next packet of the selected type.
Writing the file
queues a packet for transmission after
appending a packet header containing the source address and packet type.
The
.CW stats
file returns ASCII text containing the interface address,
packet input/output counts, error statistics, and general information
about the state of the interface.
.so tree.pout
.PP
If several connections on an interface
are configured for a particular packet type, each receives a
copy of the incoming packets.
The special packet type
.CW -1
selects all packets.
Writing the strings
.CW promiscuous
and
.CW connect
.CW -1
to the
.CW ctl
file
configures a conversation to receive all packets on the Ethernet.
.PP
Although the driver interface may seem elaborate,
the representation of a device as a set of files using ASCII strings for
communication has several advantages.
Any mechanism supporting remote access to files immediately
allows a remote machine to use our interfaces as gateways.
Using ASCII strings to control the interface avoids byte order problems and
ensures a uniform representation for
devices on the same machine and even allows devices to be accessed remotely.
Representing dissimilar devices by the same set of files allows common tools
to serve
several networks or interfaces.
Programs like
.CW stty
are replaced by
.CW echo
and shell redirection.
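.PP
A program can perform the steps just described directly.
The following sketch opens an Ethernet conversation for IP packets;
the path
.CW /net/ether
for the driver's file tree is an assumption, since the tree may be bound
elsewhere on a given machine.
.P1
.ta 8n 16n 24n 32n 40n 48n 56n 64n
#include <u.h>
#include <libc.h>

/*
 * Sketch: open an Ethernet conversation for IP packets (type 2048).
 * The path /net/ether is an assumption.
 */
int
ipconversation(void)
{
	char num[16], path[64];
	int cfd, dfd, n;

	/* opening the clone file reserves an unused conversation */
	cfd = open("/net/ether/clone", ORDWR);
	if(cfd < 0)
		return -1;

	/* the ctl file reads back as the ASCII conversation number */
	n = read(cfd, num, sizeof(num)-1);
	if(n <= 0){
		close(cfd);
		return -1;
	}
	num[n] = 0;

	/* select Ethernet packet type 2048 (IP) */
	if(write(cfd, "connect 2048", 12) != 12){
		close(cfd);
		return -1;
	}

	/* each read of the data file returns one packet of that type */
	snprint(path, sizeof(path), "/net/ether/%d/data", atoi(num));
	dfd = open(path, ORDWR);
	return dfd;	/* the conversation persists while cfd or dfd is open */
}
.P2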
.NH 2
Protocol devices
.PP
Network connections are represented as pseudo-devices called protocol devices.
Protocol device drivers exist for the Datakit URP protocol and for each of the
Internet IP protocols TCP, UDP, and IL.
IL, described below, is a new communication protocol used by Plan 9 for
transmitting file system RPCs.
All protocol devices look identical so user programs contain no
network-specific code.
.PP
Each protocol device driver serves a directory structure
similar to that of the Ethernet driver.
The top directory contains a
.CW clone
file and a directory for each connection numbered
.CW 0
to
.CW n .
Each connection directory contains files to control one
connection and to send and receive information.
A TCP connection directory looks like this:
.P1
cpu% cd /net/tcp/2
cpu% ls -l
--rw-rw---- I 0 ehg bootes 0 Jul 13 21:14 ctl
--rw-rw---- I 0 ehg bootes 0 Jul 13 21:14 data
--rw-rw---- I 0 ehg bootes 0 Jul 13 21:14 listen
--r--r--r-- I 0 bootes bootes 0 Jul 13 21:14 local
--r--r--r-- I 0 bootes bootes 0 Jul 13 21:14 remote
--r--r--r-- I 0 bootes bootes 0 Jul 13 21:14 status
cpu% cat local remote status
135.104.9.31 5012
135.104.53.11 564
tcp/2 1 Established connect
cpu%
.P2
The files
.CW local ,
.CW remote ,
and
.CW status
supply information about the state of the connection.
The
.CW data
and
.CW ctl
files
provide access to the process end of the stream implementing the protocol.
The
.CW listen
file is used to accept incoming calls from the network.
.PP
The following steps establish a connection (the sketch below carries them out by hand).
.IP 1)
The clone device of the
appropriate protocol directory is opened to reserve an unused connection.
.IP 2)
The file descriptor returned by the open points to the
.CW ctl
file of the new connection.
Reading that file descriptor returns an ASCII string containing
the connection number.
.IP 3)
A protocol/network specific ASCII address string is written to the
.CW ctl
file.
.IP 4)
The path of the
.CW data
file is constructed using the connection number.
When the
.CW data
file is opened the connection is established.
.LP
A process can read and write this file descriptor
to send and receive messages from the network.
If the process opens the
.CW listen
file it blocks until an incoming call is received.
An address string written to the
.CW ctl
file before the listen selects the
ports or services the process is prepared to accept.
When an incoming call is received, the open completes
and returns a file descriptor
pointing to the
.CW ctl
file of the new connection.
Reading the
.CW ctl
file yields a connection number used to construct the path of the
.CW data
file.
A connection remains established while any of the files in the connection directory
are referenced or until a close is received from the network.
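.PP
The following sketch carries out steps 1) through 4) by hand for TCP.
The form of the address string written to the
.CW ctl
file (here a
.CW connect
directive naming an IP address and port) is an assumption of the sketch and is
protocol specific; real programs use the
.CW dial
library routine described later instead.
.P1
.ta 8n 16n 24n 32n 40n 48n 56n 64n
#include <u.h>
#include <libc.h>

/*
 * Sketch: establish a TCP connection by hand, following steps 1-4.
 * The address-string format is an assumption; dial() hides it.
 */
int
tcpconnect(char *ipaddr, char *port)
{
	char num[16], path[64], msg[128];
	int cfd, dfd, n;

	/* 1) open the clone file to reserve an unused connection */
	cfd = open("/net/tcp/clone", ORDWR);
	if(cfd < 0)
		return -1;

	/* 2) the descriptor is the connection's ctl file; reading it
	 *    yields the ASCII connection number */
	n = read(cfd, num, sizeof(num)-1);
	if(n <= 0){
		close(cfd);
		return -1;
	}
	num[n] = 0;

	/* 3) write a protocol-specific address string to ctl */
	snprint(msg, sizeof(msg), "connect %s!%s", ipaddr, port);
	if(write(cfd, msg, strlen(msg)) < 0){
		close(cfd);
		return -1;
	}

	/* 4) opening the data file completes the connection */
	snprint(path, sizeof(path), "/net/tcp/%d/data", atoi(num));
	dfd = open(path, ORDWR);
	return dfd;	/* the connection persists while cfd or dfd is open */
}
.P2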
.NH 2
Streams
.PP
A
.I stream
[Rit84a][Presotto] is a bidirectional channel connecting a
physical or pseudo-device to user processes.
The user processes insert and remove data at one end of the stream.
Kernel processes acting on behalf of a device insert data at
the other end.
Asynchronous communications channels such as pipes,
TCP conversations, Datakit conversations, and RS232 lines are implemented using
streams.
.PP
A stream comprises a linear list of
.I "processing modules" .
Each module has both an upstream (toward the process) and
downstream (toward the device)
.I "put routine" .
Calling the put routine of the module on either end of the stream
inserts data into the stream.
Each module calls the succeeding one to send data up or down the stream.
.PP
An instance of a processing module is represented by a pair of
.I queues ,
one for each direction.
The queues point to the put procedures and can be used
to queue information traveling along the stream.
Some put routines queue data locally and send it along the stream at some
later time, either due to a subsequent call or an asynchronous
event such as a retransmission timer or a device interrupt.
Processing modules create helper kernel processes to
provide a context for handling asynchronous events.
For example, a helper kernel process awakens periodically
to perform any necessary TCP retransmissions.
The use of kernel processes instead of serialized run-to-completion service routines
differs from the implementation of Unix streams.
Unix service routines cannot
use any blocking kernel resource and they lack local, long-lived state.
Helper kernel processes solve these problems and simplify the stream code.
.PP
There is no implicit synchronization in our streams.
Each processing module must ensure that concurrent processes using the stream
are synchronized.
This maximizes concurrency but introduces the
possibility of deadlock.
However, deadlocks are easily avoided by careful programming; to
date they have not caused us problems.
.PP
Information is represented by linked lists of kernel structures called
.I blocks .
Each block contains a type, some state flags, and pointers to
an optional buffer.
Block buffers can hold either data or control information, i.e., directives
to the processing modules.
Blocks and block buffers are dynamically allocated from kernel memory.
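.PP
For concreteness, the structures just described might be declared along
the following lines.
This is an illustrative sketch; the field names and layout are assumptions,
not the kernel's actual declarations.
.P1
.ta 8n 16n 24n 32n 40n 48n 56n 64n
/* Illustrative sketch of blocks and queues; not the real declarations. */
typedef struct Block Block;
typedef struct Queue Queue;

struct Block {
	Block		*next;	/* next block in the list */
	unsigned char	*base;	/* start of the buffer */
	unsigned char	*lim;	/* end of the buffer */
	unsigned char	*rptr;	/* first valid byte */
	unsigned char	*wptr;	/* next free byte */
	int		type;	/* data or control */
	int		flags;	/* e.g. delimiter flag */
};

struct Queue {
	Queue	*next;			/* next module in this direction */
	Queue	*other;			/* queue for the opposite direction */
	void	(*put)(Queue*, Block*);	/* module's put routine */
	Block	*first;			/* blocks queued locally */
	Block	*last;
};
.P2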
.NH 3
User Interface
.PP
A stream is represented at user level as two files,
.CW ctl
and
.CW data .
The actual names can be changed by the device driver using the stream,
as we saw earlier in the example of the UART driver.
The first process to open either file creates the stream automatically.
The last close destroys it.
Writing to the
.CW data
file copies the data into kernel blocks
and passes them to the downstream put routine of the first processing module.
A write of less than 32K is guaranteed to be contained by a single block.
Concurrent writes to the same stream are not synchronized, although the
32K block size assures atomic writes for most protocols.
The last block written is flagged with a delimiter
to alert downstream modules that care about write boundaries.
In most cases the first put routine calls the second, the second
calls the third, and so on until the data is output.
As a consequence, most data is output without context switching.
.PP
Reading from the
.CW data
file returns data queued at the top of the stream.
The read terminates when the read count is reached
or when the end of a delimited block is encountered.
A per-stream read lock ensures only one process
can read from a stream at a time and guarantees
that the bytes read were contiguous bytes from the
stream.
.PP
Like UNIX streams [Rit84a],
Plan 9 streams can be dynamically configured.
The stream system intercepts and interprets
the following control blocks:
.IP "\f(CWpush\fP \fIname\fR" 15
adds an instance of the processing module
.I name
to the top of the stream.
.IP \f(CWpop\fP 15
removes the top module of the stream.
.IP \f(CWhangup\fP 15
sends a hangup message
up the stream from the device end.
.LP
Other control blocks are module-specific and are interpreted by each
processing module
as they pass.
.PP
The convoluted syntax and semantics of the UNIX
.CW ioctl
system call convinced us to leave it out of Plan 9.
Instead,
.CW ioctl
is replaced by the
.CW ctl
file.
Writing to the
.CW ctl
file
is identical to writing to a
.CW data
file except the blocks are of type
.I control .
A processing module parses each control block it sees.
Commands in control blocks are ASCII strings, so
byte ordering is not an issue when one system
controls streams in a name space implemented on another processor.
The time to parse control blocks is not important, since control
operations are rare.
.NH 3
Device Interface
.PP
The module at the downstream end of the stream is part of a device interface.
The particulars of the interface vary with the device.
Most device interfaces consist of an interrupt routine, an output
put routine, and a kernel process.
The output put routine stages data for the
device and starts the device if it is stopped.
The interrupt routine wakes up the kernel process whenever
the device has input to be processed or needs more output staged.
The kernel process puts information up the stream or stages more data for output.
The division of labor among the different pieces varies depending on
how much must be done at interrupt level.
However, the interrupt routine may not allocate blocks or call
a put routine since both actions require a process context.
.NH 3
Multiplexing
.PP
The conversations using a protocol device must be
multiplexed onto a single physical wire.
We push a multiplexer processing module
onto the physical device stream to group the conversations.
The device end modules on the conversations add the necessary header
onto downstream messages and then put them to the module downstream
of the multiplexer.
The multiplexing module looks at each message moving up its stream and
puts it to the correct conversation stream after stripping
the header controlling the demultiplexing.
.PP
This is similar to the Unix implementation of multiplexer streams.
The major difference is that we have no general structure that
corresponds to a multiplexer.
Each attempt to produce a generalized multiplexer created a more complicated
structure and underlined the basic difficulty of generalizing this mechanism.
We now code each multiplexer from scratch and favor simplicity over
generality.
.NH 3
Reflections
.PP
Despite five years' experience and the efforts of many programmers,
we remain dissatisfied with the stream mechanism.
Performance is not an issue;
the time to process protocols and drive
device interfaces continues to dwarf the
time spent allocating, freeing, and moving blocks
of data.
However, the mechanism remains inordinately
complex.
Much of the complexity results from our efforts
to make streams dynamically configurable, to
reuse processing modules on different devices,
and to provide kernel synchronization
to ensure data structures
don't disappear underfoot.
This is particularly irritating since we seldom use these properties.
.PP
Streams remain in our kernel because we are unable to
devise a better alternative.
Larry Peterson's X-kernel [Pet89a]
is the closest contender but
doesn't offer enough advantage to switch.
If we were to rewrite the streams code, we would probably statically
allocate resources for a large fixed number of conversations and burn
memory in favor of less complexity.
.NH
The IL Protocol
.PP
None of the standard IP protocols is suitable for transmission of
9P messages over an Ethernet or the Internet.
TCP has a high overhead and does not preserve delimiters.
UDP, while cheap, does not provide reliable sequenced delivery.
Early versions of the system used a custom protocol that was
efficient but unsatisfactory for internetwork transmission.
When we implemented IP, TCP, and UDP we looked around for a suitable
replacement with the following properties:
.IP \(bu
Reliable datagram service with sequenced delivery
.IP \(bu
Runs over IP
.IP \(bu
Low complexity, high performance
.IP \(bu
Adaptive timeouts
.LP
None met our needs, so a new protocol was designed.
IL is a lightweight protocol designed to be encapsulated by IP.
It is a connection-based protocol
providing reliable transmission of sequenced messages between machines.
No provision is made for flow control since the protocol is designed to transport RPC
messages between client and server.
A small outstanding message window prevents too
many incoming messages from being buffered;
messages outside the window are discarded
and must be retransmitted.
Connection setup uses a two-way handshake to generate
initial sequence numbers at each end of the connection;
subsequent data messages increment the
sequence numbers allowing
the receiver to resequence out-of-order messages.
In contrast to other protocols, IL does not do blind retransmission.
If a message is lost and a timeout occurs, a query message is sent.
The query message is a small control message containing the current
sequence numbers as seen by the sender.
The receiver responds to a query by retransmitting missing messages.
This allows the protocol to behave well in congested networks,
where blind retransmission would cause further
congestion.
Like TCP, IL has adaptive timeouts.
A round-trip timer is used
to calculate acknowledge and retransmission times in terms of the network speed.
This allows the protocol to perform well on both the Internet and on local Ethernets.
.PP
In keeping with the minimalist design of the rest of the kernel, IL is small.
The entire protocol is 847 lines of code, compared to 2200 lines for TCP.
IL is our protocol of choice.
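.PP
The design implies a compact header: a message type, a connection identified by
a pair of ports, and sequence and acknowledge identifiers used to resequence
messages and to answer queries.
The following declaration is an illustrative sketch of such a header, using byte
arrays to stay independent of machine byte order; it is not a normative
definition of the IL wire format.
.P1
.ta 8n 16n 24n 32n 40n 48n 56n 64n
/* Illustrative sketch of an IL-style header; not the wire format. */
typedef struct Ilhdr Ilhdr;
struct Ilhdr {
	unsigned char	sum[2];	/* checksum of header and data */
	unsigned char	len[2];	/* length of the IL message */
	unsigned char	type;	/* data, ack, query, reset, ... */
	unsigned char	spec;	/* reserved */
	unsigned char	src[2];	/* source port */
	unsigned char	dst[2];	/* destination port */
	unsigned char	id[4];	/* sequence identifier of this message */
	unsigned char	ack[4];	/* highest in-sequence message received */
};
.P2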
.NH
Network Addressing
.PP
A uniform interface to protocols and devices is not sufficient to
support the transparency we require.
Since each network uses a different
addressing scheme,
the ASCII strings written to a control file have no common format.
As a result, every tool must know the specifics of the networks it
is capable of addressing.
Moreover, since each machine supplies a subset
of the available networks, each user must be aware of the networks supported
by every terminal and server machine.
This is obviously unacceptable.
.PP
Several possible solutions were considered and rejected; one deserves
more discussion.
We could have used a user-level file server
to represent the network name space as a Plan 9 file tree.
This global naming scheme has been implemented in other distributed systems.
The file hierarchy provides paths to
directories representing network domains.
Each directory contains
files representing the names of the machines in that domain;
an example might be the path
.CW /net/name/usa/edu/mit/ai .
Each machine file contains information like the IP address of the machine.
We rejected this representation for several reasons.
First, it is hard to devise a hierarchy encompassing all representations
of the various network addressing schemes in a uniform manner.
Datakit and Ethernet address strings have nothing in common.
Second, the address of a machine is
often only a small part of the information required to connect to a service on
the machine.
For example, the IP protocols require symbolic service names to be mapped into
numeric port numbers, some of which are privileged and hence special.
Information of this sort is hard to represent in terms of file operations.
Finally, the size and number of the networks being represented burden users with
an unacceptably large amount of information about the organization of the network
and its connectivity.
In this case the Plan 9 representation of a
resource as a file is not appropriate.
.PP
If tools are to be network independent, a third-party server must resolve
network names.
A server on each machine, with local knowledge, can select the best network
for any particular destination machine or service.
Since the network devices present a common interface,
the only operation that differs between networks is name resolution.
A symbolic name must be translated to
the path of the clone file of a protocol
device and an ASCII address string to write to the
.CW ctl
file.
A connection server (CS) provides this service.
.NH 2
Network Database
.PP
On most systems several
files such as
.CW /etc/hosts ,
.CW /etc/networks ,
.CW /etc/services ,
.CW /etc/hosts.equiv ,
.CW /etc/bootptab ,
and
.CW /etc/named.d
hold network information.
Much time and effort is spent
administering these files and keeping
them mutually consistent.
Tools attempt to
automatically derive one or more of the files from
information in other files but maintenance continues to be
difficult and error-prone.
.PP
Since we were writing an entirely new system, we were free to
try a simpler approach.
One database on a shared server contains all the information
needed for network administration.
Two ASCII files comprise the main database:
.CW /lib/ndb/local
contains locally administered information and
.CW /lib/ndb/global
contains information imported from elsewhere.
The files contain sets of attribute/value pairs of the form
.I attr\f(CW=\fPvalue ,
where
.I attr
and
.I value
are alphanumeric strings.
Systems are described by multi-line entries;
a header line at the left margin begins each entry followed by zero or more
indented attribute/value pairs specifying
names, addresses, properties, etc.
For example, the entry for our CPU server
specifies a domain name, an IP address, an Ethernet address,
a Datakit address, a boot file, and supported protocols.
.P1
sys=helix
	dom=helix.research.bell-labs.com
	bootf=/mips/9power
	ip=135.104.9.31 ether=0800690222f0
	dk=nj/astro/helix
	proto=il flavor=9cpu
.P2
If several systems share entries such as
network mask and gateway, we specify that information
with the network or subnetwork instead of the system.
The following entries define a Class B IP network and
a few subnets derived from it.
The entry for the network specifies the IP mask,
file system, and authentication server for all systems
on the network.
Each subnetwork specifies its default IP gateway.
.P1
ipnet=mh-astro-net ip=135.104.0.0 ipmask=255.255.255.0
	fs=bootes.research.bell-labs.com
	auth=1127auth
ipnet=unix-room ip=135.104.117.0
	ipgw=135.104.117.1
ipnet=third-floor ip=135.104.51.0
	ipgw=135.104.51.1
ipnet=fourth-floor ip=135.104.52.0
	ipgw=135.104.52.1
.P2
Database entries also define the mapping of service names
to port numbers for TCP, UDP, and IL.
.P1
tcp=echo	port=7
tcp=discard	port=9
tcp=systat	port=11
tcp=daytime	port=13
.P2
.PP
All programs read the database directly so
consistency problems are rare.
However, the database files can become large.
Our global file, containing all information about
both Datakit and Internet systems in AT&T, has 43,000
lines.
To speed searches, we build hash table files for each
attribute we expect to search often.
The hash file entries point to entries
in the master files.
Every hash file contains the modification time of its master
file so we can avoid using an out-of-date hash table.
Searches for attributes that aren't hashed or whose hash table
is out-of-date still work; they just take longer.
.NH 2
Connection Server
.PP
On each system a user-level connection server process, CS, translates
symbolic names to addresses.
CS uses information about available networks, the network database, and
other servers (such as DNS) to translate names.
CS is a file server serving a single file,
.CW /net/cs .
A client writes a symbolic name to
.CW /net/cs
then reads one line for each matching destination reachable
from this system.
The lines are of the form
.I "filename message" ,
where
.I filename
is the path of the clone file to open for a new connection and
.I message
is the string to write to it to make the connection.
The following example illustrates this.
.CW Ndb/csquery
is a program that prompts for strings to write to
.CW /net/cs
and prints the replies.
.P1
% ndb/csquery
> net!helix!9fs
/net/il/clone 135.104.9.31!17008
/net/dk/clone nj/astro/helix!9fs
.P2
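.PP
A client of CS is only a few lines of code.
The following sketch writes a symbolic name to
.CW /net/cs
and prints each translation; the seek back to offset zero before reading follows
the usual convention and is an assumption of this sketch.
.P1
.ta 8n 16n 24n 32n 40n 48n 56n 64n
#include <u.h>
#include <libc.h>

/* Sketch: print each "filename message" translation of a name. */
void
translate(char *name)
{
	char buf[128];
	int fd, n;

	fd = open("/net/cs", ORDWR);
	if(fd < 0)
		return;
	if(write(fd, name, strlen(name)) < 0){
		close(fd);
		return;
	}
	seek(fd, 0, 0);			/* rewind before reading replies */
	while((n = read(fd, buf, sizeof(buf)-1)) > 0){
		buf[n] = 0;
		print("%s\n", buf);	/* e.g. /net/il/clone 135.104.9.31!17008 */
	}
	close(fd);
}
.P2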
.PP
CS provides meta-name translation to perform complicated
searches.
The special network name
.CW net
selects any network in common between source and
destination supporting the specified service.
A host name of the form \f(CW$\fIattr\f1
is the name of an attribute in the network database.
The database search returns the value
of the matching attribute/value pair
most closely associated with the source host.
``Most closely associated'' is defined on a per-network basis.
For example, the symbolic name
.CW tcp!$auth!rexauth
causes CS to search for the
.CW auth
attribute in the database entry for the source system, then its
subnetwork (if there is one) and then its network.
.P1
% ndb/csquery
> net!$auth!rexauth
/net/il/clone 135.104.9.34!17021
/net/dk/clone nj/astro/p9auth!rexauth
/net/il/clone 135.104.9.6!17021
/net/dk/clone nj/astro/musca!rexauth
.P2
.PP
Normally CS derives naming information from its database files.
For domain names, however, CS first consults another user-level
process, the domain name server (DNS).
If no DNS is reachable, CS relies on its own tables.
.PP
Like CS, the domain name server is a user-level process providing
one file,
.CW /net/dns .
A client writes a request of the form
.I "domain-name type" ,
where
.I type
is a domain name service resource record type.
DNS performs a recursive query through the
Internet domain name system producing one line
per resource record found. The client reads
.CW /net/dns
to retrieve the records.
Like other domain name servers, DNS caches information
learned from the network.
DNS is implemented as a multi-process shared memory application
with separate processes listening for network and local requests.
.NH
Library routines
.PP
The section on protocol devices described the details
of making and receiving connections across a network.
The dance is straightforward but tedious.
Library routines are provided to relieve
the programmer of the details.
.NH 2
Connecting
.PP
The
.CW dial
library call establishes a connection to a remote destination.
It
returns an open file descriptor for the
.CW data
file in the connection directory.
.P1
int  dial(char *dest, char *local, char *dir, int *cfdp)
.P2
.IP \f(CWdest\fP 10
is the symbolic name/address of the destination.
.IP \f(CWlocal\fP 10
is the local address.
Since most networks do not support this, it is
usually zero.
.IP \f(CWdir\fP 10
is a pointer to a buffer to hold the path name of the protocol directory
representing this connection.
.CW Dial
fills this buffer if the pointer is non-zero.
.IP \f(CWcfdp\fP 10
is a pointer to a file descriptor for the
.CW ctl
file of the connection.
If the pointer is non-zero,
.CW dial
opens the control file and tucks the file descriptor here.
.LP
Most programs call
.CW dial
with a destination name and all other arguments zero.
.CW Dial
uses CS to
translate the symbolic name to all possible destination addresses
and attempts to connect to each in turn until one works.
Specifying the special name
.CW net
in the network portion of the destination
allows CS to pick a network/protocol in common
with the destination for which the requested service is valid.
For example, assume the system
.CW research.bell-labs.com
has the Datakit address
.CW nj/astro/research
and IP addresses
.CW 135.104.117.5
and
.CW 129.11.4.1 .
The call
.P1
fd = dial("net!research.bell-labs.com!login", 0, 0, 0);
.P2
tries in succession to connect to
.CW nj/astro/research!login
on the Datakit and both
.CW 135.104.117.5!513
and
.CW 129.11.4.1!513
across the Internet.
.PP
.CW Dial
accepts addresses instead of symbolic names.
For example, the destinations
.CW tcp!135.104.117.5!513
and
.CW tcp!research.bell-labs.com!login
are equivalent
references to the same machine.
.NH 2
Listening
.PP
A program uses
four routines to listen for incoming connections.
It first
.CW announce() s
its intention to receive connections,
then
.CW listen() s
for calls and finally
.CW accept() s
or
.CW reject() s
them.
.CW Announce
returns an open file descriptor for the
.CW ctl
file of a connection and fills
.CW dir
with the
path of the protocol directory
for the announcement.
.P1
int  announce(char *addr, char *dir)
.P2
.CW Addr
is the symbolic name/address announced;
if it does not contain a service, the announcement is for
all services not explicitly announced.
Thus, one can easily write the equivalent of the
.CW inetd
program without
having to announce each separate service.
An announcement remains in force until the control file is
closed.
.LP
.CW Listen
returns an open file descriptor for the
.CW ctl
file and fills
.CW ldir
with the path
of the protocol directory
for the received connection.
It is passed
.CW dir
from the announcement.
.P1
int  listen(char *dir, char *ldir)
.P2
.LP
.CW Accept
and
.CW reject
are called with the control file descriptor and
.CW ldir
returned by
.CW listen .
Some networks such as Datakit accept a reason for a rejection;
networks such as IP ignore the third argument.
.P1
int  accept(int ctl, char *ldir)
int  reject(int ctl, char *ldir, char *reason)
.P2
.PP
The following code implements a typical TCP listener.
It announces itself, listens for connections, and forks a new
process for each.
The new process echoes data on the connection until the
remote end closes it.
The "*" in the symbolic name means the announcement is valid for
any addresses bound to the machine the program is run on.
.P1
.ta 8n 16n 24n 32n 40n 48n 56n 64n
int
echo_server(void)
{
	int afd, dfd, lcfd;
	char adir[40], ldir[40];
	int n;
	char buf[256];

	afd = announce("tcp!*!echo", adir);
	if(afd < 0)
		return -1;

	for(;;){
		/* listen for a call */
		lcfd = listen(adir, ldir);
		if(lcfd < 0)
			return -1;

		/* fork a process to echo */
		switch(fork()){
		case 0:
			/* accept the call and open the data file */
			dfd = accept(lcfd, ldir);
			if(dfd < 0)
				return -1;

			/* echo until EOF */
			while((n = read(dfd, buf, sizeof(buf))) > 0)
				write(dfd, buf, n);
			exits(0);
		case -1:
			perror("forking");
		default:
			close(lcfd);
			break;
		}

	}
}
.P2
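.PP
The client side is simpler still.
The following sketch, a companion to the listener above, dials the echo service
on a named machine, sends a short message, and prints whatever comes back;
the routine and the machine name are illustrative.
.P1
.ta 8n 16n 24n 32n 40n 48n 56n 64n
#include <u.h>
#include <libc.h>

/* Sketch: dial a machine's echo service and echo one message. */
int
echo_client(char *machine)
{
	char addr[64], buf[256];
	int fd, n;

	/* let CS pick any common network offering the echo service */
	snprint(addr, sizeof(addr), "net!%s!echo", machine);
	fd = dial(addr, 0, 0, 0);
	if(fd < 0)
		return -1;

	/* send a message and print the reply */
	write(fd, "hello", 5);
	n = read(fd, buf, sizeof(buf));
	if(n > 0)
		write(1, buf, n);
	close(fd);
	return 0;
}
.P2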
.NH
User Level
.PP
Communication between Plan 9 machines is done almost exclusively in
terms of 9P messages. Only the two services
.CW cpu
and
.CW exportfs
are used.
The
.CW cpu
service is analogous to
.CW rlogin .
However, rather than emulating a terminal session
across the network,
.CW cpu
creates a process on the remote machine whose name space is an analogue of the window
in which it was invoked.
.CW Exportfs
is a user-level file server which allows a piece of name space to be
exported from machine to machine across a network. It is used by the
.CW cpu
command to serve the files in the terminal's name space when they are
accessed from the
cpu server.
.PP
By convention, the protocol and device driver file systems are mounted in a
directory called
.CW /net .
Although the per-process name space allows users to configure an
arbitrary view of the system, in practice their profiles build
a conventional name space.
.NH 2
Exportfs
.PP
.CW Exportfs
is invoked by an incoming network call.
The
.I listener
(the Plan 9 equivalent of
.CW inetd )
runs the profile of the user
requesting the service to construct a name space before starting
.CW exportfs .
After an initial protocol
establishes the root of the file tree being
exported,
the remote process mounts the connection,
allowing
.CW exportfs
to act as a relay file server. Operations in the imported file tree
are executed on the remote server and the results returned.
As a result
the name space of the remote machine appears to be exported into a
local file tree.
.PP
The
.CW import
command calls
.CW exportfs
on a remote machine, mounts the result in the local name space,
and
exits.
No local process is required to serve mounts;
9P messages are generated by the kernel's mount driver and sent
directly over the network.
.PP
.CW Exportfs
must be multithreaded since the system calls
.CW open ,
.CW read
and
.CW write
may block.
Plan 9 does not implement the
.CW select
system call but does allow processes to share file descriptors,
memory and other resources.
.CW Exportfs
and the configurable name space
provide a means of sharing resources between machines.
It is a building block for constructing complex name spaces
served from many machines.
.PP
The simplicity of the interfaces encourages naive users to exploit the potential
of a richly connected environment.
Using these tools it is easy to gateway between networks.
For example, a terminal with only a Datakit connection can import from the server
.CW helix :
.P1
import -a helix /net
telnet ai.mit.edu
.P2
The
.CW import
command makes a Datakit connection to the machine
.CW helix
where
it starts an instance of
.CW exportfs
to serve
.CW /net .
The
.CW import
command mounts the remote
.CW /net
directory after (the
.CW -a
option to
.CW import )
the existing contents
of the local
.CW /net
directory.
The directory contains the union of the local and remote contents of
.CW /net .
Local entries supersede remote ones of the same name so
networks on the local machine are chosen in preference
to those supplied remotely.
However, unique entries in the remote directory are now visible in the local
.CW /net
directory.
All the networks connected to
.CW helix ,
not just Datakit,
are now available in the terminal.
The effect on the name space is shown by the following
example:
.P1
philw-gnot% ls /net
/net/cs
/net/dk
philw-gnot% import -a musca /net
philw-gnot% ls /net
/net/cs
/net/cs
/net/dk
/net/dk
/net/dns
/net/ether
/net/il
/net/tcp
/net/udp
.P2
.NH 2
Ftpfs
.PP
We decided to make our interface to FTP
a file system rather than the traditional command.
Our command,
.I ftpfs ,
dials the FTP port of a remote system, prompts for login and password, sets image mode,
and mounts the remote file system onto
.CW /n/ftp .
Files and directories are cached to reduce traffic.
The cache is updated whenever a file is created.
Ftpfs works with TOPS-20, VMS, and various Unix flavors
as the remote system.
.NH
Cyclone Fiber Links
.PP
The file servers and CPU servers are connected by
high-bandwidth
point-to-point links.
A link consists of two VME cards connected by a pair of optical
fibers.
The VME cards use 33MHz Intel 960 processors and AMD's TAXI
fiber transmitter/receivers to drive the lines at 125 Mbit/sec.
Software in the VME card reduces latency by copying messages from system memory
to fiber without intermediate buffering.
.NH
Performance
.PP
We measured both latency and throughput
of reading and writing bytes between two processes
for a number of different paths.
Measurements were made on two- and four-CPU SGI Power Series processors.
The CPUs are 25 MHz MIPS 3000s.
The latency is measured as the round trip time
for a byte sent from one process to another and
back again.
Throughput is measured using 16k writes from
one process to another.
.DS C
.TS
box, tab(:);
c s s
c | c | c
l | n | n.
Table 1 - Performance
_
test:throughput:latency
:MBytes/sec:millisec
_
pipes:8.15:0.255
_
IL/ether:1.02:1.42
_
URP/Datakit:0.22:1.75
_
Cyclone:3.2:0.375
.TE
.DE
.NH
Conclusion
.PP
The representation of all resources as file systems
coupled with an ASCII interface has proved more powerful
than we had originally imagined.
Resources can be used by any computer in our networks
independent of byte ordering or CPU type.
The connection server provides an elegant means
of decoupling tools from the networks they use.
Users successfully use Plan 9 without knowing the
topology of the system or the networks they use.
More information about 9P can be found in Section 5 of the Plan 9 Programmer's
Manual, Volume I.
.NH
References
.LP
[Pike90] R. Pike, D. Presotto, K. Thompson, H. Trickey,
``Plan 9 from Bell Labs'',
.I
UKUUG Proc. of the Summer 1990 Conf. ,
.R
London, England,
1990.
.LP
[Needham] R. Needham, ``Names'', in
.I
Distributed Systems ,
.R
S. Mullender, ed.,
Addison Wesley, 1989.
.LP
[Presotto] D. Presotto, ``Multiprocessor Streams for Plan 9'',
.I
UKUUG Proc. of the Summer 1990 Conf. ,
.R
London, England, 1990.
.LP
[Met80] R. Metcalfe, D. Boggs, C. Crane, E. Taft and J. Hupp, ``The
Ethernet Local Network: Three Reports'',
.I
CSL-80-2 ,
.R
XEROX Palo Alto Research Center, February 1980.
.LP
[Fra80] A. G. Fraser, ``Datakit - A Modular Network for Synchronous
and Asynchronous Traffic'',
.I
Proc. Int'l Conf.
on Communication ,
.R
Boston, June 1980.
.LP
[Pet89a] L. Peterson, ``RPC in the X-Kernel: Evaluating New Design Techniques'',
.I
Proc. Twelfth Symp. on Op. Sys. Princ. ,
.R
Litchfield Park, AZ, December 1989.
.LP
[Rit84a] D. M. Ritchie, ``A Stream Input-Output System'',
.I
AT&T Bell Laboratories Technical Journal, 63(8),
.R
October 1984.