.TL
The Organization of Networks in Plan 9
.AU
Dave Presotto
Phil Winterbottom
.sp
presotto,philw@plan9.att.com
.AB
.FS
Originally appeared in
.I
Proc. of the Winter 1993 USENIX Conf.,
.R
pp. 271-280,
San Diego, CA
.FE
In a distributed system networks are of paramount importance. This
paper describes the implementation, design philosophy, and organization
of network support in Plan 9. Topics include network requirements
for distributed systems, our kernel implementation, network naming, user interfaces,
and performance. We also observe that much of this organization is relevant to
current systems.
.AE
.NH
Introduction
.PP
Plan 9 [Pike90] is a general-purpose, multi-user, portable distributed system
implemented on a variety of computers and networks.
What distinguishes Plan 9 is its organization.
The goals of this organization were to
reduce administration
and to promote resource sharing. One of the keys to its success as a distributed
system is the organization and management of its networks.
.PP
A Plan 9 system comprises file servers, CPU servers and terminals.
The file servers and CPU servers are typically centrally
located multiprocessor machines with large memories and
high speed interconnects.
A variety of workstation-class machines
serve as terminals
connected to the central servers using several networks and protocols.
The architecture of the system demands a hierarchy of network
speeds matching the needs of the components.
Connections between file servers and CPU servers are high-bandwidth point-to-point
fiber links.
Connections from the servers fan out to local terminals
using medium speed networks
such as Ethernet [Met80] and Datakit [Fra80].
Low speed connections via the Internet and
the AT&T backbone serve users in Oregon and Illinois.
Basic Rate ISDN data service and 9600 baud serial lines provide slow
links to users at home.
.PP
Since CPU servers and terminals use the same kernel,
users may choose to run programs locally on
their terminals or remotely on CPU servers.
The organization of Plan 9 hides the details of system connectivity
allowing both users and administrators to configure their environment
to be as distributed or centralized as they wish.
Simple commands support the
construction of a locally represented name space
spanning many machines and networks.
At work, users tend to use their terminals like workstations,
running interactive programs locally and
reserving the CPU servers for data or compute intensive jobs
such as compiling and computing chess endgames.
At home or when connected over
a slow network, users tend to do most work on the CPU server to minimize
traffic on the slow links.
The goal of the network organization is to provide the same
environment to the user wherever resources are used.
.NH
Kernel Network Support
.PP
Networks play a central role in any distributed system. This is particularly
true in Plan 9 where most resources are provided by servers external to the kernel.
The importance of the networking code within the kernel
is reflected by its size;
of 25,000 lines of kernel code, 12,500 are network and protocol related.
Networks are continually being added and the fraction of code
devoted to communications
is growing.
Moreover, the network code is complex.
Protocol implementations consist almost entirely of
synchronization and dynamic memory management, areas demanding
subtle error recovery
strategies.
The kernel currently supports Datakit, point-to-point fiber links,
an Internet (IP) protocol suite and ISDN data service.
The variety of networks and machines
has raised issues not addressed by other systems running on commercial
hardware supporting only Ethernet or FDDI.
.NH 2
The File System protocol
.PP
A central idea in Plan 9 is the representation of a resource as a hierarchical
file system.
Each process assembles a view of the system by building a
.I "name space"
[Needham] connecting its resources.
File systems need not represent disc files; in fact, most Plan 9 file systems have no
permanent storage.
A typical file system dynamically represents
some resource like a set of network connections or the process table.
Communication between the kernel, device drivers, and local or remote file servers uses a
protocol called 9P. The protocol consists of 17 messages
describing operations on files and directories.
Kernel resident device and protocol drivers use a procedural version
of the protocol while external file servers use an RPC form.
Nearly all traffic between Plan 9 systems consists
of 9P messages.
9P relies on several properties of the underlying transport protocol.
It assumes messages arrive reliably and in sequence and
that delimiters between messages
are preserved.
When a protocol does not meet these
requirements (for example, TCP does not preserve delimiters)
we provide mechanisms to marshal messages before handing them
to the system.
.PP
A kernel data structure, the
.I channel ,
is a handle to a file server.
Operations on a channel generate the following 9P messages.
The
.CW session
and
.CW attach
messages authenticate a connection, established by means external to 9P,
and validate its user.
The result is an authenticated
channel
referencing the root of the
server.
The
.CW clone
message makes a new channel identical to an existing channel, much like
the
.CW dup
system call.
A
channel
may be moved to a file on the server using a
.CW walk
message to descend each level in the hierarchy.
The
.CW stat
and
.CW wstat
messages read and write the attributes of the file referenced by a channel.
The
.CW open
message prepares a channel for subsequent
.CW read
and
.CW write
messages to access the contents of the file.
.CW Create
and
.CW remove
perform the actions implied by their names on the file
referenced by the channel.
The
.CW clunk
message discards a channel without affecting the file.
.PP
A kernel resident file server called the
.I "mount driver"
converts the procedural version of 9P into RPCs.
The
.I mount
system call provides a file descriptor, which can be
a pipe to a user process or a network connection to a remote machine, to
be associated with the mount point.
After a mount, operations
on the file tree below the mount point are sent as messages to the file server.
The
mount
driver manages buffers, packs and unpacks parameters from
messages, and demultiplexes among processes using the file server.
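.PP
The following minimal sketch shows the mount driver from the caller's side:
a user process attaches a 9P conversation, carried on an ordinary file
descriptor, to its name space, after which file operations below the mount
point become 9P messages.
The four-argument form of
.CW mount
and the
.CW MREPL
flag are assumptions drawn from the system's manual rather than anything
this paper specifies, and the mount point
.CW /n/remote
is only an example.
.P1
.ta 8n 16n 24n 32n 40n 48n
#include <u.h>
#include <libc.h>

/*
 * Sketch: attach a file server reached over fd to /n/remote.
 * Assumes mount(fd, old, flag, spec); see mount(2) for the exact
 * form used by a given edition of the system.
 */
void
attachremote(int fd)
{
	if(mount(fd, "/n/remote", MREPL, "") < 0)
		sysfatal("mount: %r");
	/* opens below /n/remote are now 9P messages generated
	 * by the kernel's mount driver */
}
.P2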
.NH 2
Kernel Organization
.PP
The network code in the kernel is divided into three layers: hardware interface,
protocol processing, and program interface.
A device driver typically uses streams to connect the two interface layers.
Additional stream modules may be pushed on
a device to process protocols.
Each device driver is a kernel-resident file system.
Simple device drivers serve a single level
directory containing just a few files;
for example, we represent each UART
by a data and a control file.
.P1
cpu% cd /dev
cpu% ls -l eia*
--rw-rw-rw- t 0 bootes bootes 0 Jul 16 17:28 eia1
--rw-rw-rw- t 0 bootes bootes 0 Jul 16 17:28 eia1ctl
--rw-rw-rw- t 0 bootes bootes 0 Jul 16 17:28 eia2
--rw-rw-rw- t 0 bootes bootes 0 Jul 16 17:28 eia2ctl
cpu%
.P2
The control file is used to control the device;
writing the string
.CW b1200
to
.CW /dev/eia1ctl
sets the line to 1200 baud.
.PP
Multiplexed devices present
a more complex interface structure.
For example, the LANCE Ethernet driver
serves a two level file tree (Figure 1)
providing
.IP \(bu
device control and configuration
.IP \(bu
user-level protocols like ARP
.IP \(bu
diagnostic interfaces for snooping software.
.LP
The top directory contains a
.CW clone
file and a directory for each connection, numbered
.CW 1
to
.CW n .
Each connection directory corresponds to an Ethernet packet type.
Opening the
.CW clone
file finds an unused connection directory
and opens its
.CW ctl
file.
Reading the control file returns the ASCII connection number; the user
process can use this value to construct the name of the proper
connection directory.
In each connection directory files named
.CW ctl ,
.CW data ,
.CW stats ,
and
.CW type
provide access to the connection.
Writing the string
.CW "connect 2048"
to the
.CW ctl
file sets the packet type to 2048
and
configures the connection to receive
all IP packets sent to the machine.
Subsequent reads of the file
.CW type
yield the string
.CW 2048 .
The
.CW data
file accesses the media;
reading it
returns the
next packet of the selected type.
Writing the file
queues a packet for transmission after
appending a packet header containing the source address and packet type.
The
.CW stats
file returns ASCII text containing the interface address,
packet input/output counts, error statistics, and general information
about the state of the interface.
.so tree.pout
.PP
If several connections on an interface
are configured for a particular packet type, each receives a
copy of the incoming packets.
The special packet type
.CW -1
selects all packets.
Writing the strings
.CW promiscuous
and
.CW connect
.CW -1
to the
.CW ctl
file
configures a conversation to receive all packets on the Ethernet.
.PP
Although the driver interface may seem elaborate,
the representation of a device as a set of files using ASCII strings for
communication has several advantages.
Any mechanism supporting remote access to files immediately
allows a remote machine to use our interfaces as gateways.
Using ASCII strings to control the interface avoids byte order problems and
ensures a uniform representation for
devices on the same machine and even allows devices to be accessed remotely.
Representing dissimilar devices by the same set of files allows common tools
to serve
several networks or interfaces.
Programs like
.CW stty
are replaced by
.CW echo
and shell redirection.
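.PP
The same economy carries over into programs.
The following minimal sketch does in C what
.CW "echo b1200 >/dev/eia1ctl"
does from the shell; the use of
.CW sysfatal
for error handling is ours, not something the driver requires.
.P1
.ta 8n 16n 24n 32n 40n 48n
#include <u.h>
#include <libc.h>

/* sketch: set /dev/eia1 to 1200 baud by writing its control file */
void
setbaud(void)
{
	int fd;

	fd = open("/dev/eia1ctl", OWRITE);
	if(fd < 0)
		sysfatal("open /dev/eia1ctl: %r");
	if(write(fd, "b1200", 5) != 5)
		sysfatal("write: %r");
	close(fd);
}
.P2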
.NH 2
Protocol devices
.PP
Network connections are represented as pseudo-devices called protocol devices.
Protocol device drivers exist for the Datakit URP protocol and for each of the
Internet IP protocols TCP, UDP, and IL.
IL, described below, is a new communication protocol used by Plan 9 for
transmitting file system RPCs.
All protocol devices look identical, so user programs contain no
network-specific code.
.PP
Each protocol device driver serves a directory structure
similar to that of the Ethernet driver.
The top directory contains a
.CW clone
file and a directory for each connection numbered
.CW 0
to
.CW n .
Each connection directory contains files to control one
connection and to send and receive information.
A TCP connection directory looks like this:
.P1
cpu% cd /net/tcp/2
cpu% ls -l
--rw-rw---- I 0 ehg bootes 0 Jul 13 21:14 ctl
--rw-rw---- I 0 ehg bootes 0 Jul 13 21:14 data
--rw-rw---- I 0 ehg bootes 0 Jul 13 21:14 listen
--r--r--r-- I 0 bootes bootes 0 Jul 13 21:14 local
--r--r--r-- I 0 bootes bootes 0 Jul 13 21:14 remote
--r--r--r-- I 0 bootes bootes 0 Jul 13 21:14 status
cpu% cat local remote status
135.104.9.31 5012
135.104.53.11 564
tcp/2 1 Established connect
cpu%
.P2
The files
.CW local ,
.CW remote ,
and
.CW status
supply information about the state of the connection.
The
.CW data
and
.CW ctl
files
provide access to the process end of the stream implementing the protocol.
The
.CW listen
file is used to accept incoming calls from the network.
.PP
The following steps establish a connection.
.IP 1)
The clone device of the
appropriate protocol directory is opened to reserve an unused connection.
.IP 2)
The file descriptor returned by the open points to the
.CW ctl
file of the new connection.
Reading that file descriptor returns an ASCII string containing
the connection number.
.IP 3)
A protocol/network specific ASCII address string is written to the
.CW ctl
file.
.IP 4)
The path of the
.CW data
file is constructed using the connection number.
When the
.CW data
file is opened the connection is established.
.LP
A process can read and write this file descriptor
to send and receive messages from the network.
If the process opens the
.CW listen
file it blocks until an incoming call is received.
An address string written to the
.CW ctl
file before the listen selects the
ports or services the process is prepared to accept.
When an incoming call is received, the open completes
and returns a file descriptor
pointing to the
.CW ctl
file of the new connection.
Reading the
.CW ctl
file yields a connection number used to construct the path of the
.CW data
file.
A connection remains established while any of the files in the connection directory
are referenced or until a close is received from the network.
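.PP
The sketch below performs the four steps by hand for TCP.
The form of the address string written in step 3,
.CW "connect host!port" ,
is an assumption made by analogy with the Ethernet driver's
.CW connect
message; the
.CW dial
library routine described later hides these details.
.P1
.ta 8n 16n 24n 32n 40n 48n
#include <u.h>
#include <libc.h>

/* sketch: establish a TCP connection by hand; addr is e.g. "135.104.9.31!564" */
int
manualdial(char *addr)
{
	char path[64], num[32];
	int cfd, dfd, n;

	/* 1) open the clone file to reserve an unused connection */
	cfd = open("/net/tcp/clone", ORDWR);
	if(cfd < 0)
		return -1;
	/* 2) the descriptor points to ctl; read the connection number */
	n = read(cfd, num, sizeof(num)-1);
	if(n <= 0){
		close(cfd);
		return -1;
	}
	num[n] = 0;
	/* 3) write the address string to the ctl file */
	if(fprint(cfd, "connect %s", addr) < 0){
		close(cfd);
		return -1;
	}
	/* 4) opening the data file establishes the connection */
	snprint(path, sizeof(path), "/net/tcp/%s/data", num);
	dfd = open(path, ORDWR);
	close(cfd);
	return dfd;
}
.P2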
.NH 2
Streams
.PP
A
.I stream
[Rit84a][Presotto] is a bidirectional channel connecting a
physical or pseudo-device to user processes.
The user processes insert and remove data at one end of the stream.
Kernel processes acting on behalf of a device insert data at
the other end.
Asynchronous communications channels such as pipes,
TCP conversations, Datakit conversations, and RS232 lines are implemented using
streams.
.PP
A stream comprises a linear list of
.I "processing modules" .
Each module has both an upstream (toward the process) and
downstream (toward the device)
.I "put routine" .
Calling the put routine of the module on either end of the stream
inserts data into the stream.
Each module calls the succeeding one to send data up or down the stream.
.PP
An instance of a processing module is represented by a pair of
.I queues ,
one for each direction.
The queues point to the put procedures and can be used
to queue information traveling along the stream.
Some put routines queue data locally and send it along the stream at some
later time, either due to a subsequent call or an asynchronous
event such as a retransmission timer or a device interrupt.
Processing modules create helper kernel processes to
provide a context for handling asynchronous events.
For example, a helper kernel process awakens periodically
to perform any necessary TCP retransmissions.
The use of kernel processes instead of serialized run-to-completion service routines
differs from the implementation of Unix streams.
Unix service routines cannot
use any blocking kernel resource and they lack a local long-lived state.
Helper kernel processes solve these problems and simplify the stream code.
.PP
There is no implicit synchronization in our streams.
Each processing module must ensure that concurrent processes using the stream
are synchronized.
This maximizes concurrency but introduces the
possibility of deadlock.
However, deadlocks are easily avoided by careful programming; to
date they have not caused us problems.
.PP
Information is represented by linked lists of kernel structures called
.I blocks .
Each block contains a type, some state flags, and pointers to
an optional buffer.
Block buffers can hold either data or control information, i.e., directives
to the processing modules.
Blocks and block buffers are dynamically allocated from kernel memory.
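.PP
The description above implies a structure roughly like the following.
This is an illustrative sketch, not the kernel's declaration; the field
names are ours.
.P1
.ta 8n 16n 24n 32n 40n 48n
typedef struct Block Block;
struct Block
{
	Block	*next;		/* blocks are kept on linked lists */
	int	type;		/* data or control */
	int	flags;		/* state flags, e.g. a delimiter mark */
	unsigned char	*base;	/* optional buffer ... */
	unsigned char	*rptr;	/* ... next byte to be consumed */
	unsigned char	*wptr;	/* ... next free byte */
};
.P2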
.NH 3
User Interface
.PP
A stream is represented at user level as two files,
.CW ctl
and
.CW data .
The actual names can be changed by the device driver using the stream,
as we saw earlier in the example of the UART driver.
The first process to open either file creates the stream automatically.
The last close destroys it.
Writing to the
.CW data
file copies the data into kernel blocks
and passes them to the downstream put routine of the first processing module.
A write of less than 32K is guaranteed to be contained by a single block.
Concurrent writes to the same stream are not synchronized, although the
32K block size assures atomic writes for most protocols.
The last block written is flagged with a delimiter
to alert downstream modules that care about write boundaries.
In most cases the first put routine calls the second, the second
calls the third, and so on until the data is output.
As a consequence, most data is output without context switching.
.PP
Reading from the
.CW data
file returns data queued at the top of the stream.
The read terminates when the read count is reached
or when the end of a delimited block is encountered.
A per stream read lock ensures only one process
can read from a stream at a time and guarantees
that the bytes read were contiguous bytes from the
stream.
.PP
Like UNIX streams [Rit84a],
Plan 9 streams can be dynamically configured.
The stream system intercepts and interprets
the following control blocks:
.IP "\f(CWpush\fP \fIname\fR" 15
adds an instance of the processing module
.I name
to the top of the stream.
.IP \f(CWpop\fP 15
removes the top module of the stream.
.IP \f(CWhangup\fP 15
sends a hangup message
up the stream from the device end.
.LP
Other control blocks are module-specific and are interpreted by each
processing module
as they pass.
.PP
The convoluted syntax and semantics of the UNIX
.CW ioctl
system call convinced us to leave it out of Plan 9.
Instead,
.CW ioctl
is replaced by the
.CW ctl
file.
Writing to the
.CW ctl
file
is identical to writing to a
.CW data
file except the blocks are of type
.I control .
A processing module parses each control block it sees.
Commands in control blocks are ASCII strings, so
byte ordering is not an issue when one system
controls streams in a name space implemented on another processor.
The time to parse control blocks is not important, since control
operations are rare.
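.PP
A control operation is therefore just another write.
The sketch below pushes a processing module onto a stream and later pops
it off by writing to the stream's
.CW ctl
file; the module name
.CW compress
is purely hypothetical.
.P1
.ta 8n 16n 24n 32n 40n 48n
#include <u.h>
#include <libc.h>

/*
 * Sketch: configure a stream through its ctl file.
 * cfd is an open descriptor for the ctl file; "compress"
 * stands for whatever module the application needs.
 */
void
withmodule(int cfd)
{
	if(fprint(cfd, "push compress") < 0)
		sysfatal("push: %r");
	/* ... use the stream's data file ... */
	if(fprint(cfd, "pop") < 0)
		sysfatal("pop: %r");
}
.P2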
.NH 3
Device Interface
.PP
The module at the downstream end of the stream is part of a device interface.
The particulars of the interface vary with the device.
Most device interfaces consist of an interrupt routine, an output
put routine, and a kernel process.
The output put routine stages data for the
device and starts the device if it is stopped.
The interrupt routine wakes up the kernel process whenever
the device has input to be processed or needs more output staged.
The kernel process puts information up the stream or stages more data for output.
The division of labor among the different pieces varies depending on
how much must be done at interrupt level.
However, the interrupt routine may not allocate blocks or call
a put routine since both actions require a process context.
.NH 3
Multiplexing
.PP
The conversations using a protocol device must be
multiplexed onto a single physical wire.
We push a multiplexer processing module
onto the physical device stream to group the conversations.
The device end modules on the conversations add the necessary header
onto downstream messages and then put them to the module downstream
of the multiplexer.
The multiplexing module looks at each message moving up its stream and
puts it to the correct conversation stream after stripping
the header controlling the demultiplexing.
.PP
This is similar to the Unix implementation of multiplexer streams.
The major difference is that we have no general structure that
corresponds to a multiplexer.
Each attempt to produce a generalized multiplexer created a more complicated
structure and underlined the basic difficulty of generalizing this mechanism.
We now code each multiplexer from scratch and favor simplicity over
generality.
.NH 3
Reflections
.PP
Despite five years' experience and the efforts of many programmers,
we remain dissatisfied with the stream mechanism.
Performance is not an issue;
the time to process protocols and drive
device interfaces continues to dwarf the
time spent allocating, freeing, and moving blocks
of data.
However, the mechanism remains inordinately
complex.
Much of the complexity results from our efforts
to make streams dynamically configurable, to
reuse processing modules on different devices
and to provide kernel synchronization
to ensure data structures
don't disappear under foot.
This is particularly irritating since we seldom use these properties.
.PP
Streams remain in our kernel because we are unable to
devise a better alternative.
Larry Peterson's X-kernel [Pet89a]
is the closest contender but
doesn't offer enough advantage to switch.
If we were to rewrite the streams code, we would probably statically
allocate resources for a large fixed number of conversations and burn
memory in favor of less complexity.
.NH
The IL Protocol
.PP
None of the standard IP protocols is suitable for transmission of
9P messages over an Ethernet or the Internet.
TCP has a high overhead and does not preserve delimiters.
UDP, while cheap, does not provide reliable sequenced delivery.
Early versions of the system used a custom protocol that was
efficient but unsatisfactory for internetwork transmission.
When we implemented IP, TCP, and UDP we looked around for a suitable
replacement with the following properties:
.IP \(bu
Reliable datagram service with sequenced delivery
.IP \(bu
Runs over IP
.IP \(bu
Low complexity, high performance
.IP \(bu
Adaptive timeouts
.LP
None met our needs so a new protocol was designed.
IL is a lightweight protocol designed to be encapsulated by IP.
It is a connection-based protocol
providing reliable transmission of sequenced messages between machines.
No provision is made for flow control since the protocol is designed to transport RPC
messages between client and server.
A small outstanding message window prevents too
many incoming messages from being buffered;
messages outside the window are discarded
and must be retransmitted.
Connection setup uses a two way handshake to generate
initial sequence numbers at each end of the connection;
subsequent data messages increment the
sequence numbers allowing
the receiver to resequence out of order messages.
In contrast to other protocols, IL does not do blind retransmission.
If a message is lost and a timeout occurs, a query message is sent.
The query message is a small control message containing the current
sequence numbers as seen by the sender.
The receiver responds to a query by retransmitting missing messages.
This allows the protocol to behave well in congested networks,
where blind retransmission would cause further
congestion.
Like TCP, IL has adaptive timeouts.
A round-trip timer is used
to calculate acknowledge and retransmission times in terms of the network speed.
This allows the protocol to perform well on both the Internet and on local Ethernets.
.PP
In keeping with the minimalist design of the rest of the kernel, IL is small.
The entire protocol is 847 lines of code, compared to 2200 lines for TCP.
IL is our protocol of choice.
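.PP
The receiver's acceptance rule implied by this description can be sketched
as follows.
The window size and the sequence number type are assumptions made for
illustration, and sequence number wraparound is ignored for brevity;
this is not the kernel's code.
.P1
.ta 8n 16n 24n 32n 40n 48n
enum
{
	Window	= 4	/* assumed size; the paper does not give one */
};

/*
 * Sketch: a message is buffered for resequencing only if its
 * sequence number lies within a small window beyond the next
 * message expected; anything else is discarded and must be
 * retransmitted, perhaps in response to a query message.
 */
int
inwindow(unsigned long seq, unsigned long next)
{
	return seq >= next && seq < next+Window;
}
.P2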
.NH
Network Addressing
.PP
A uniform interface to protocols and devices is not sufficient to
support the transparency we require.
Since each network uses a different
addressing scheme,
the ASCII strings written to a control file have no common format.
As a result, every tool must know the specifics of the networks it
is capable of addressing.
Moreover, since each machine supplies a subset
of the available networks, each user must be aware of the networks supported
by every terminal and server machine.
This is obviously unacceptable.
.PP
Several possible solutions were considered and rejected; one deserves
more discussion.
We could have used a user-level file server
to represent the network name space as a Plan 9 file tree.
This global naming scheme has been implemented in other distributed systems.
The file hierarchy provides paths to
directories representing network domains.
Each directory contains
files representing the names of the machines in that domain;
an example might be the path
.CW /net/name/usa/edu/mit/ai .
Each machine file contains information like the IP address of the machine.
We rejected this representation for several reasons.
First, it is hard to devise a hierarchy encompassing all representations
of the various network addressing schemes in a uniform manner.
Datakit and Ethernet address strings have nothing in common.
Second, the address of a machine is
often only a small part of the information required to connect to a service on
the machine.
For example, the IP protocols require symbolic service names to be mapped into
numeric port numbers, some of which are privileged and hence special.
Information of this sort is hard to represent in terms of file operations.
Finally, the size and number of the networks being represented burdens users with
an unacceptably large amount of information about the organization of the network
and its connectivity.
In this case the Plan 9 representation of a
resource as a file is not appropriate.
.PP
If tools are to be network independent, a third-party server must resolve
network names.
A server on each machine, with local knowledge, can select the best network
for any particular destination machine or service.
Since the network devices present a common interface,
the only operation which differs between networks is name resolution.
A symbolic name must be translated to
the path of the clone file of a protocol
device and an ASCII address string to write to the
.CW ctl
file.
A connection server (CS) provides this service.
.NH 2
Network Database
.PP
On most systems several
files such as
.CW /etc/hosts ,
.CW /etc/networks ,
.CW /etc/services ,
.CW /etc/hosts.equiv ,
.CW /etc/bootptab ,
and
.CW /etc/named.d
hold network information.
Much time and effort is spent
administering these files and keeping
them mutually consistent.
Tools attempt to
automatically derive one or more of the files from
information in other files but maintenance continues to be
difficult and error prone.
.PP
Since we were writing an entirely new system, we were free to
try a simpler approach.
One database on a shared server contains all the information
needed for network administration.
Two ASCII files comprise the main database:
.CW /lib/ndb/local
contains locally administered information and
.CW /lib/ndb/global
contains information imported from elsewhere.
The files contain sets of attribute/value pairs of the form
.I attr\f(CW=\fPvalue ,
where
.I attr
and
.I value
are alphanumeric strings.
Systems are described by multi-line entries;
a header line at the left margin begins each entry followed by zero or more
indented attribute/value pairs specifying
names, addresses, properties, etc.
For example, the entry for our CPU server
specifies a domain name, an IP address, an Ethernet address,
a Datakit address, a boot file, and supported protocols.
.P1
sys = helix
	dom=helix.research.att.com
	bootf=/mips/9power
	ip=135.104.9.31 ether=0800690222f0
	dk=nj/astro/helix
	proto=il flavor=9cpu
.P2
If several systems share entries such as
network mask and gateway, we specify that information
with the network or subnetwork instead of the system.
The following entries define a Class B IP network and
a few subnets derived from it.
The entry for the network specifies the IP mask,
file system, and authentication server for all systems
on the network.
Each subnetwork specifies its default IP gateway.
.P1
ipnet=mh-astro-net ip=135.104.0.0 ipmask=255.255.255.0
	fs=bootes.research.att.com
	auth=1127auth
ipnet=unix-room ip=135.104.117.0
	ipgw=135.104.117.1
ipnet=third-floor ip=135.104.51.0
	ipgw=135.104.51.1
ipnet=fourth-floor ip=135.104.52.0
	ipgw=135.104.52.1
.P2
Database entries also define the mapping of service names
to port numbers for TCP, UDP, and IL.
.P1
tcp=echo	port=7
tcp=discard	port=9
tcp=systat	port=11
tcp=daytime	port=13
.P2
.PP
All programs read the database directly so
consistency problems are rare.
However, the database files can become large.
Our global file, containing all information about
both Datakit and Internet systems in AT&T, has 43,000
lines.
To speed searches, we build hash table files for each
attribute we expect to search often.
The hash file entries point to entries
in the master files.
Every hash file contains the modification time of its master
file so we can avoid using an out-of-date hash table.
Searches for attributes that aren't hashed or whose hash table
is out-of-date still work; they just take longer.
.NH 2
Connection Server
.PP
On each system a user level connection server process, CS, translates
symbolic names to addresses.
CS uses information about available networks, the network database, and
other servers (such as DNS) to translate names.
CS is a file server serving a single file,
.CW /net/cs .
A client writes a symbolic name to
.CW /net/cs
then reads one line for each matching destination reachable
from this system.
The lines are of the form
.I "filename message" ,
where
.I filename
is the path of the clone file to open for a new connection and
.I message
is the string to write to it to make the connection.
The following example illustrates this.
.CW Ndb/csquery
is a program that prompts for strings to write to
.CW /net/cs
and prints the replies.
.P1
% ndb/csquery
> net!helix!9fs
/net/il/clone 135.104.9.31!17008
/net/dk/clone nj/astro/helix!9fs
.P2
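.PP
A program can use CS directly with ordinary file I/O.
The sketch below writes a symbolic name to
.CW /net/cs
and prints each translation it reads back; the rewind before reading is an
assumption about the server, and in practice the
.CW dial
routine described later performs this exchange for the caller.
The same write-then-read pattern applies to the
.CW /net/dns
file described below.
.P1
.ta 8n 16n 24n 32n 40n 48n
#include <u.h>
#include <libc.h>

/* sketch: ask CS to translate a name such as "net!helix!9fs" */
void
cslookup(char *name)
{
	char buf[128];
	int fd, n;

	fd = open("/net/cs", ORDWR);
	if(fd < 0)
		sysfatal("open /net/cs: %r");
	if(write(fd, name, strlen(name)) < 0)
		sysfatal("cs: %r");
	seek(fd, 0, 0);		/* assumed: rewind before reading replies */
	while((n = read(fd, buf, sizeof(buf)-1)) > 0){
		buf[n] = 0;
		print("%s\n", buf);	/* e.g. /net/il/clone 135.104.9.31!17008 */
	}
	close(fd);
}
.P2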
.PP
CS provides meta-name translation to perform complicated
searches.
The special network name
.CW net
selects any network in common between source and
destination supporting the specified service.
A host name of the form \f(CW$\fIattr\f1
is the name of an attribute in the network database.
The database search returns the value
of the matching attribute/value pair
most closely associated with the source host.
``Most closely associated'' is defined on a per-network basis.
For example, the symbolic name
.CW tcp!$auth!rexauth
causes CS to search for the
.CW auth
attribute in the database entry for the source system, then its
subnetwork (if there is one) and then its network.
.P1
% ndb/csquery
> net!$auth!rexauth
/net/il/clone 135.104.9.34!17021
/net/dk/clone nj/astro/p9auth!rexauth
/net/il/clone 135.104.9.6!17021
/net/dk/clone nj/astro/musca!rexauth
.P2
.PP
Normally CS derives naming information from its database files.
For domain names, however, CS first consults another user level
process, the domain name server (DNS).
If no DNS is reachable, CS relies on its own tables.
.PP
Like CS, the domain name server is a user level process providing
one file,
.CW /net/dns .
A client writes a request of the form
.I "domain-name type" ,
where
.I type
is a domain name service resource record type.
DNS performs a recursive query through the
Internet domain name system producing one line
per resource record found. The client reads
.CW /net/dns
to retrieve the records.
Like other domain name servers, DNS caches information
learned from the network.
DNS is implemented as a multi-process shared memory application
with separate processes listening for network and local requests.
.NH
Library routines
.PP
The section on protocol devices described the details
of making and receiving connections across a network.
The dance is straightforward but tedious.
Library routines are provided to relieve
the programmer of the details.
.NH 2
Connecting
.PP
The
.CW dial
library call establishes a connection to a remote destination.
It
returns an open file descriptor for the
.CW data
file in the connection directory.
.P1
int dial(char *dest, char *local, char *dir, int *cfdp)
.P2
.IP \f(CWdest\fP 10
is the symbolic name/address of the destination.
.IP \f(CWlocal\fP 10
is the local address.
Since most networks do not support this, it is
usually zero.
.IP \f(CWdir\fP 10
is a pointer to a buffer to hold the path name of the protocol directory
representing this connection.
.CW Dial
fills this buffer if the pointer is non-zero.
.IP \f(CWcfdp\fP 10
is a pointer to a file descriptor for the
.CW ctl
file of the connection.
If the pointer is non-zero,
.CW dial
opens the control file and tucks the file descriptor here.
.LP
Most programs call
.CW dial
with a destination name and all other arguments zero.
.CW Dial
uses CS to
translate the symbolic name to all possible destination addresses
and attempts to connect to each in turn until one works.
Specifying the special name
.CW net
in the network portion of the destination
allows CS to pick a network/protocol in common
with the destination for which the requested service is valid.
For example, assume the system
.CW research.att.com
has the Datakit address
.CW nj/astro/research
and IP addresses
.CW 135.104.117.5
and
.CW 129.11.4.1 .
The call
.P1
fd = dial("net!research.att.com!login", 0, 0, 0);
.P2
tries in succession to connect to
.CW nj/astro/research!login
on the Datakit and both
.CW 135.104.117.5!513
and
.CW 129.11.4.1!513
across the Internet.
.PP
.CW Dial
accepts addresses instead of symbolic names.
For example, the destinations
.CW tcp!135.104.117.5!513
and
.CW tcp!research.att.com!login
are equivalent
references to the same machine.
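.PP
A complete client is therefore only a few lines.
The sketch below connects to the echo service used by the listener in the
next section and copies one message through it; the destination
.CW net!helix!echo
is illustrative.
.P1
.ta 8n 16n 24n 32n 40n 48n
#include <u.h>
#include <libc.h>

/* sketch: a minimal dial client for the echo service */
void
echoclient(void)
{
	char buf[128];
	int fd, n;

	fd = dial("net!helix!echo", 0, 0, 0);
	if(fd < 0)
		sysfatal("dial: %r");
	if(write(fd, "hello", 5) != 5)
		sysfatal("write: %r");
	n = read(fd, buf, sizeof(buf));
	if(n > 0)
		write(1, buf, n);	/* copy the echoed bytes to standard output */
	close(fd);
}
.P2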
.NH 2
Listening
.PP
A program uses
four routines to listen for incoming connections.
It first
.CW announce() s
its intention to receive connections,
then
.CW listen() s
for calls and finally
.CW accept() s
or
.CW reject() s
them.
.CW Announce
returns an open file descriptor for the
.CW ctl
file of a connection and fills
.CW dir
with the
path of the protocol directory
for the announcement.
.P1
int announce(char *addr, char *dir)
.P2
.CW Addr
is the symbolic name/address announced;
if it does not contain a service, the announcement is for
all services not explicitly announced.
Thus, one can easily write the equivalent of the
.CW inetd
program without
having to announce each separate service.
An announcement remains in force until the control file is
closed.
.LP
.CW Listen
returns an open file descriptor for the
.CW ctl
file and fills
.CW ldir
with the path
of the protocol directory
for the received connection.
It is passed
.CW dir
from the announcement.
.P1
int listen(char *dir, char *ldir)
.P2
.LP
.CW Accept
and
.CW reject
are called with the control file descriptor and
.CW ldir
returned by
.CW listen .
Some networks such as Datakit accept a reason for a rejection;
networks such as IP ignore the third argument.
.P1
int accept(int ctl, char *ldir)
int reject(int ctl, char *ldir, char *reason)
.P2
.PP
The following code implements a typical TCP listener.
It announces itself, listens for connections, and forks a new
process for each.
The new process echoes data on the connection until the
remote end closes it.
The "*" in the symbolic name means the announcement is valid for
any addresses bound to the machine the program is run on.
.P1
.ta 8n 16n 24n 32n 40n 48n 56n 64n
int
echo_server(void)
{
	int afd, dfd, lcfd;
	char adir[40], ldir[40];
	int n;
	char buf[256];

	afd = announce("tcp!*!echo", adir);
	if(afd < 0)
		return -1;

	for(;;){
		/* listen for a call */
		lcfd = listen(adir, ldir);
		if(lcfd < 0)
			return -1;

		/* fork a process to echo */
		switch(fork()){
		case 0:
			/* accept the call and open the data file */
			dfd = accept(lcfd, ldir);
			if(dfd < 0)
				return -1;

			/* echo until EOF */
			while((n = read(dfd, buf, sizeof(buf))) > 0)
				write(dfd, buf, n);
			exits(0);
		case -1:
			perror("forking");
			/* fall through to close the listen descriptor */
		default:
			close(lcfd);
			break;
		}

	}
}
.P2
.NH
User Level
.PP
Communication between Plan 9 machines is done almost exclusively in
terms of 9P messages. Only the two services
.CW cpu
and
.CW exportfs
are used.
The
.CW cpu
service is analogous to
.CW rlogin .
However, rather than emulating a terminal session
across the network,
.CW cpu
creates a process on the remote machine whose name space is an analogue of the window
in which it was invoked.
.CW Exportfs
is a user level file server which allows a piece of name space to be
exported from machine to machine across a network. It is used by the
.CW cpu
command to serve the files in the terminal's name space when they are
accessed from the
cpu server.
.PP
By convention, the protocol and device driver file systems are mounted in a
directory called
.CW /net .
Although the per-process name space allows users to configure an
arbitrary view of the system, in practice their profiles build
a conventional name space.
.NH 2
Exportfs
.PP
.CW Exportfs
is invoked by an incoming network call.
The
.I listener
(the Plan 9 equivalent of
.CW inetd )
runs the profile of the user
requesting the service to construct a name space before starting
.CW exportfs .
After an initial protocol
establishes the root of the file tree being
exported,
the remote process mounts the connection,
allowing
.CW exportfs
to act as a relay file server. Operations in the imported file tree
are executed on the remote server and the results returned.
As a result
the name space of the remote machine appears to be exported into a
local file tree.
.PP
The
.CW import
command calls
.CW exportfs
on a remote machine, mounts the result in the local name space,
and
exits.
No local process is required to serve mounts;
9P messages are generated by the kernel's mount driver and sent
directly over the network.
.PP
.CW Exportfs
must be multithreaded since the system calls
.CW open ,
.CW read
and
.CW write
may block.
Plan 9 does not implement the
.CW select
system call but does allow processes to share file descriptors,
memory and other resources.
.CW Exportfs
and the configurable name space
provide a means of sharing resources between machines.
It is a building block for constructing complex name spaces
served from many machines.
.PP
The simplicity of the interfaces encourages naive users to exploit the potential
of a richly connected environment.
Using these tools it is easy to gateway between networks.
For example, a terminal with only a Datakit connection can import from the server
.CW helix :
.P1
import -a helix /net
telnet ai.mit.edu
.P2
The
.CW import
command makes a Datakit connection to the machine
.CW helix
where
it starts an instance of
.CW exportfs
to serve
.CW /net .
The
.CW import
command mounts the remote
.CW /net
directory after (the
.CW -a
option to
.CW import )
the existing contents
of the local
.CW /net
directory.
The directory contains the union of the local and remote contents of
.CW /net .
Local entries supersede remote ones of the same name so
networks on the local machine are chosen in preference
to those supplied remotely.
However, unique entries in the remote directory are now visible in the local
.CW /net
directory.
All the networks connected to
.CW helix ,
not just Datakit,
are now available in the terminal.
The effect on the name space is shown by the following
example:
.P1
philw-gnot% ls /net
/net/cs
/net/dk
philw-gnot% import -a musca /net
philw-gnot% ls /net
/net/cs
/net/cs
/net/dk
/net/dk
/net/dns
/net/ether
/net/il
/net/tcp
/net/udp
.P2
.NH 2
Ftpfs
.PP
We decided to make our interface to FTP
a file system rather than the traditional command.
Our command,
.I ftpfs ,
dials the FTP port of a remote system, prompts for login and password, sets image mode,
and mounts the remote file system onto
.CW /n/ftp .
Files and directories are cached to reduce traffic.
The cache is updated whenever a file is created.
Ftpfs works with TOPS-20, VMS, and various Unix flavors
as the remote system.
.NH
Cyclone Fiber Links
.PP
The file servers and CPU servers are connected by
high-bandwidth
point-to-point links.
A link consists of two VME cards connected by a pair of optical
fibers.
The VME cards use 33MHz Intel 960 processors and AMD's TAXI
fiber transmitter/receivers to drive the lines at 125 Mbit/sec.
Software in the VME card reduces latency by copying messages from system memory
to fiber without intermediate buffering.
.NH
Performance
.PP
We measured both latency and throughput
of reading and writing bytes between two processes
for a number of different paths.
Measurements were made on two- and four-CPU SGI Power Series machines.
The CPUs are 25 MHz MIPS 3000s.
The latency is measured as the round trip time
for a byte sent from one process to another and
back again.
Throughput is measured using 16k writes from
one process to another.
.DS C
.TS
box, tab(:);
c s s
c | c | c
l | n | n.
Table 1 - Performance
_
test:throughput:latency
:MBytes/sec:millisec
_
pipes:8.15:.255
_
IL/ether:1.02:1.42
_
URP/Datakit:0.22:1.75
_
Cyclone:3.2:0.375
.TE
.DE
.NH
Conclusion
.PP
The representation of all resources as file systems
coupled with an ASCII interface has proved more powerful
than we had originally imagined.
Resources can be used by any computer in our networks
independent of byte ordering or CPU type.
The connection server provides an elegant means
of decoupling tools from the networks they use.
Users successfully use Plan 9 without knowing the
topology of the system or the networks they use.
More information about 9P can be found in Section 5 of the Plan 9 Programmer's
Manual, Volume I.
.NH
References
.LP
[Pike90] R. Pike, D. Presotto, K. Thompson, H. Trickey,
``Plan 9 from Bell Labs'',
.I
UKUUG Proc. of the Summer 1990 Conf.,
.R
London, England,
1990.
.LP
[Needham] R. Needham, ``Names'', in
.I
Distributed Systems,
.R
S. Mullender, ed.,
Addison Wesley, 1989.
.LP
[Presotto] D. Presotto, ``Multiprocessor Streams for Plan 9'',
.I
UKUUG Proc. of the Summer 1990 Conf.,
.R
London, England, 1990.
.LP
[Met80] R. Metcalfe, D. Boggs, C. Crane, E. Taft and J. Hupp, ``The
Ethernet Local Network: Three Reports'',
.I
CSL-80-2,
.R
XEROX Palo Alto Research Center, February 1980.
.LP
[Fra80] A. G. Fraser, ``Datakit - A Modular Network for Synchronous
and Asynchronous Traffic'',
.I
Proc. Int'l Conf. on Communication,
.R
Boston, June 1980.
.LP
[Pet89a] L. Peterson, ``RPC in the X-Kernel: Evaluating New Design Techniques'',
.I
Proc. Twelfth Symp. on Op. Sys. Princ.,
.R
Litchfield Park, AZ, December 1989.
.LP
[Rit84a] D. M. Ritchie, ``A Stream Input-Output System'',
.I
AT&T Bell Laboratories Technical Journal, 63(8),
.R
October 1984.