1.TL
2The Organization of Networks in Plan 9
3.AU
4Dave Presotto
5Phil Winterbottom
6.sp
7presotto,philw@plan9.att.com
8.AB
9.FS
10Originally appeared in
11.I
12Proc. of the Winter 1993 USENIX Conf.,
13.R
14pp. 271-280,
15San Diego, CA
16.FE
17In a distributed system networks are of paramount importance. This
18paper describes the implementation, design philosophy, and organization
19of network support in Plan 9. Topics include network requirements
20for distributed systems, our kernel implementation, network naming, user interfaces,
21and performance. We also observe that much of this organization is relevant to
22current systems.
23.AE
24.NH
25Introduction
26.PP
27Plan 9 [Pike90] is a general-purpose, multi-user, portable distributed system
28implemented on a variety of computers and networks.
29What distinguishes Plan 9 is its organization.
30The goals of this organization were to
31reduce administration
32and to promote resource sharing. One of the keys to its success as a distributed
33system is the organization and management of its networks.
34.PP
35A Plan 9 system comprises file servers, CPU servers and terminals.
36The file servers and CPU servers are typically centrally
37located multiprocessor machines with large memories and
38high speed interconnects.
39A variety of workstation-class machines
40serve as terminals
41connected to the central servers using several networks and protocols.
42The architecture of the system demands a hierarchy of network
43speeds matching the needs of the components.
44Connections between file servers and CPU servers are high-bandwidth point-to-point
45fiber links.
46Connections from the servers fan out to local terminals
47using medium speed networks
48such as Ethernet [Met80] and Datakit [Fra80].
49Low speed connections via the Internet and
50the AT&T backbone serve users in Oregon and Illinois.
51Basic Rate ISDN data service and 9600 baud serial lines provide slow
52links to users at home.
53.PP
54Since CPU servers and terminals use the same kernel,
55users may choose to run programs locally on
56their terminals or remotely on CPU servers.
57The organization of Plan 9 hides the details of system connectivity
58allowing both users and administrators to configure their environment
59to be as distributed or centralized as they wish.
60Simple commands support the
61construction of a locally represented name space
62spanning many machines and networks.
63At work, users tend to use their terminals like workstations,
64running interactive programs locally and
65reserving the CPU servers for data or compute intensive jobs
66such as compiling and computing chess endgames.
67At home or when connected over
68a slow network, users tend to do most work on the CPU server to minimize
69traffic on the slow links.
70The goal of the network organization is to provide the same
71environment to the user wherever resources are used.
72.NH
73Kernel Network Support
74.PP
75Networks play a central role in any distributed system. This is particularly
76true in Plan 9 where most resources are provided by servers external to the kernel.
77The importance of the networking code within the kernel
78is reflected by its size;
79of 25,000 lines of kernel code, 12,500 are network and protocol related.
80Networks are continually being added and the fraction of code
81devoted to communications
82is growing.
83Moreover, the network code is complex.
84Protocol implementations consist almost entirely of
85synchronization and dynamic memory management, areas demanding
86subtle error recovery
87strategies.
88The kernel currently supports Datakit, point-to-point fiber links,
89an Internet (IP) protocol suite and ISDN data service.
90The variety of networks and machines
91has raised issues not addressed by other systems running on commercial
92hardware supporting only Ethernet or FDDI.
93.NH 2
94The File System protocol
95.PP
96A central idea in Plan 9 is the representation of a resource as a hierarchical
97file system.
98Each process assembles a view of the system by building a
99.I "name space
100[Needham] connecting its resources.
101File systems need not represent disc files; in fact, most Plan 9 file systems have no
102permanent storage.
103A typical file system dynamically represents
104some resource like a set of network connections or the process table.
105Communication between the kernel, device drivers, and local or remote file servers uses a
106protocol called 9P. The protocol consists of 17 messages
107describing operations on files and directories.
108Kernel resident device and protocol drivers use a procedural version
109of the protocol while external file servers use an RPC form.
110Nearly all traffic between Plan 9 systems consists
111of 9P messages.
1129P relies on several properties of the underlying transport protocol.
113It assumes messages arrive reliably and in sequence and
114that delimiters between messages
115are preserved.
116When a protocol does not meet these
117requirements (for example, TCP does not preserve delimiters)
118we provide mechanisms to marshal messages before handing them
119to the system.
120.PP
121A kernel data structure, the
122.I channel ,
123is a handle to a file server.
124Operations on a channel generate the following 9P messages.
125The
126.CW session
127and
128.CW attach
129messages authenticate a connection, established by means external to 9P,
130and validate its user.
131The result is an authenticated
132channel
133referencing the root of the
134server.
135The
136.CW clone
137message makes a new channel identical to an existing channel, much like
138the
139.CW dup
140system call.
141A
142channel
143may be moved to a file on the server using a
144.CW walk
145message to descend each level in the hierarchy.
146The
147.CW stat
148and
149.CW wstat
150messages read and write the attributes of the file referenced by a channel.
151The
152.CW open
153message prepares a channel for subsequent
154.CW read
155and
156.CW write
157messages to access the contents of the file.
158.CW Create
159and
160.CW remove
161perform the actions implied by their names on the file
162referenced by the channel.
163The
164.CW clunk
165message discards a channel without affecting the file.
166.PP
167A kernel resident file server called the
168.I "mount driver"
169converts the procedural version of 9P into RPCs.
170The
171.I mount
172system call provides a file descriptor, which can be
173a pipe to a user process or a network connection to a remote machine, to
174be associated with the mount point.
175After a mount, operations
176on the file tree below the mount point are sent as messages to the file server.
177The
178mount
179driver manages buffers, packs and unpacks parameters from
180messages, and demultiplexes among processes using the file server.
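.PP
For illustration, the following sketch attaches a file server connection
to the name space by hand.
It assumes the current
.CW mount
system call interface, whose second argument is an authentication file
descriptor (here unused), and it assumes
.CW fd
is already connected to a 9P server, for example by a pipe to a user process.
.P1
#include <u.h>
#include <libc.h>

/*
 * Attach the 9P server reachable through fd to the name space
 * at mtpt.  Afterwards, operations on files below mtpt are
 * converted by the mount driver into 9P messages sent over fd.
 * A sketch only; error reporting is left to the caller.
 */
int
mountserver(int fd, char *mtpt)
{
	if(mount(fd, -1, mtpt, MREPL|MCREATE, "") < 0)
		return -1;
	return 0;
}
.P2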
181.NH 2
182Kernel Organization
183.PP
184The network code in the kernel is divided into three layers: hardware interface,
185protocol processing, and program interface.
186A device driver typically uses streams to connect the two interface layers.
187Additional stream modules may be pushed on
188a device to process protocols.
189Each device driver is a kernel-resident file system.
190Simple device drivers serve a single level
191directory containing just a few files;
192for example, we represent each UART
193by a data and a control file.
194.P1
195cpu% cd /dev
196cpu% ls -l eia*
197--rw-rw-rw- t 0 bootes bootes 0 Jul 16 17:28 eia1
198--rw-rw-rw- t 0 bootes bootes 0 Jul 16 17:28 eia1ctl
199--rw-rw-rw- t 0 bootes bootes 0 Jul 16 17:28 eia2
200--rw-rw-rw- t 0 bootes bootes 0 Jul 16 17:28 eia2ctl
201cpu%
202.P2
203The control file is used to control the device;
204writing the string
205.CW b1200
206to
207.CW /dev/eia1ctl
208sets the line to 1200 baud.
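.PP
The same operation from a C program is a single
.CW open
and
.CW write ;
the following minimal sketch uses the device from the listing above.
.P1
#include <u.h>
#include <libc.h>

void
setbaud(void)
{
	int fd;

	/* write an ASCII command to the UART control file */
	fd = open("/dev/eia1ctl", OWRITE);
	if(fd < 0)
		sysfatal("open /dev/eia1ctl: %r");
	if(write(fd, "b1200", 5) != 5)
		sysfatal("set baud rate: %r");
	close(fd);
}
.P2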
209.PP
210Multiplexed devices present
211a more complex interface structure.
212For example, the LANCE Ethernet driver
213serves a two level file tree (Figure 1)
214providing
215.IP \(bu
216device control and configuration
217.IP \(bu
218user-level protocols like ARP
219.IP \(bu
220diagnostic interfaces for snooping software.
221.LP
222The top directory contains a
223.CW clone
224file and a directory for each connection, numbered
225.CW 1
226to
227.CW n .
228Each connection directory corresponds to an Ethernet packet type.
229Opening the
230.CW clone
231file finds an unused connection directory
232and opens its
233.CW ctl
234file.
235Reading the control file returns the ASCII connection number; the user
236process can use this value to construct the name of the proper
237connection directory.
238In each connection directory files named
239.CW ctl ,
240.CW data ,
241.CW stats ,
242and
243.CW type
244provide access to the connection.
245Writing the string
246.CW "connect 2048"
247to the
248.CW ctl
249file sets the packet type to 2048
250and
251configures the connection to receive
252all IP packets sent to the machine.
253Subsequent reads of the file
254.CW type
255yield the string
256.CW 2048 .
257The
258.CW data
259file accesses the media;
260reading it
261returns the
262next packet of the selected type.
263Writing the file
264queues a packet for transmission after
265appending a packet header containing the source address and packet type.
266The
267.CW stats
268file returns ASCII text containing the interface address,
269packet input/output counts, error statistics, and general information
270about the state of the interface.
271.so tree.pout
272.PP
273If several connections on an interface
274are configured for a particular packet type, each receives a
275copy of the incoming packets.
276The special packet type
277.CW -1
278selects all packets.
279Writing the strings
280.CW promiscuous
281and
282.CW connect
283.CW -1
284to the
285.CW ctl
286file
287configures a conversation to receive all packets on the Ethernet.
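.PP
The following sketch performs this sequence from a C program.
It assumes the interface is bound at
.CW /net/ether ,
as in the name space listings later in this paper;
the buffer sizes are illustrative.
.P1
#include <u.h>
#include <libc.h>

/*
 * Reserve an Ethernet connection and configure it to receive
 * IP packets (type 2048).  Returns a descriptor for the data
 * file, which reads and writes whole packets.
 */
int
etherip(void)
{
	int cfd, dfd, n;
	char num[32], path[64];

	/* opening clone reserves a connection and opens its ctl file */
	cfd = open("/net/ether/clone", ORDWR);
	if(cfd < 0)
		return -1;

	/* reading ctl yields the ASCII connection number */
	n = read(cfd, num, sizeof num - 1);
	if(n <= 0){
		close(cfd);
		return -1;
	}
	num[n] = 0;

	/* select the packet type */
	if(write(cfd, "connect 2048", 12) < 0){
		close(cfd);
		return -1;
	}

	snprint(path, sizeof path, "/net/ether/%d/data", atoi(num));
	dfd = open(path, ORDWR);
	return dfd;
}
.P2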
288.PP
289Although the driver interface may seem elaborate,
290the representation of a device as a set of files using ASCII strings for
291communication has several advantages.
292Any mechanism supporting remote access to files immediately
293allows a remote machine to use our interfaces as gateways.
294Using ASCII strings to control the interface avoids byte order problems and
295ensures a uniform representation for
296devices on the same machine and even allows devices to be accessed remotely.
297Representing dissimilar devices by the same set of files allows common tools
298to serve
299several networks or interfaces.
300Programs like
301.CW stty
302are replaced by
303.CW echo
304and shell redirection.
305.NH 2
306Protocol devices
307.PP
308Network connections are represented as pseudo-devices called protocol devices.
309Protocol device drivers exist for the Datakit URP protocol and for each of the
310Internet IP protocols TCP, UDP, and IL.
311IL, described below, is a new communication protocol used by Plan 9 for
transmitting file system RPCs.
313All protocol devices look identical so user programs contain no
314network-specific code.
315.PP
316Each protocol device driver serves a directory structure
317similar to that of the Ethernet driver.
318The top directory contains a
319.CW clone
320file and a directory for each connection numbered
321.CW 0
322to
323.CW n .
324Each connection directory contains files to control one
325connection and to send and receive information.
326A TCP connection directory looks like this:
327.P1
328cpu% cd /net/tcp/2
329cpu% ls -l
330--rw-rw---- I 0 ehg    bootes 0 Jul 13 21:14 ctl
331--rw-rw---- I 0 ehg    bootes 0 Jul 13 21:14 data
332--rw-rw---- I 0 ehg    bootes 0 Jul 13 21:14 listen
333--r--r--r-- I 0 bootes bootes 0 Jul 13 21:14 local
334--r--r--r-- I 0 bootes bootes 0 Jul 13 21:14 remote
335--r--r--r-- I 0 bootes bootes 0 Jul 13 21:14 status
336cpu% cat local remote status
337135.104.9.31 5012
338135.104.53.11 564
339tcp/2 1 Established connect
340cpu%
341.P2
342The files
343.CW local ,
344.CW remote ,
345and
346.CW status
347supply information about the state of the connection.
348The
349.CW data
350and
351.CW ctl
352files
353provide access to the process end of the stream implementing the protocol.
354The
355.CW listen
356file is used to accept incoming calls from the network.
357.PP
358The following steps establish a connection.
359.IP 1)
360The clone device of the
361appropriate protocol directory is opened to reserve an unused connection.
362.IP 2)
363The file descriptor returned by the open points to the
364.CW ctl
365file of the new connection.
366Reading that file descriptor returns an ASCII string containing
367the connection number.
368.IP 3)
369A protocol/network specific ASCII address string is written to the
370.CW ctl
371file.
372.IP 4)
373The path of the
374.CW data
375file is constructed using the connection number.
376When the
377.CW data
378file is opened the connection is established.
379.LP
380A process can read and write this file descriptor
381to send and receive messages from the network.
382If the process opens the
383.CW listen
384file it blocks until an incoming call is received.
385An address string written to the
386.CW ctl
387file before the listen selects the
388ports or services the process is prepared to accept.
389When an incoming call is received, the open completes
390and returns a file descriptor
391pointing to the
392.CW ctl
393file of the new connection.
394Reading the
395.CW ctl
396file yields a connection number used to construct the path of the
397.CW data
398file.
399A connection remains established while any of the files in the connection directory
400are referenced or until a close is received from the network.
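.PP
The sketch below walks through these steps by hand for a TCP connection.
The address written to the
.CW ctl
file is illustrative (it is the remote address from the listing above);
in practice the
.CW dial
library routine described later does this work.
.P1
#include <u.h>
#include <libc.h>

int
tcpconnect(void)
{
	int cfd, dfd, n;
	char num[32], path[64];

	/* 1: open the clone file to reserve a connection */
	cfd = open("/net/tcp/clone", ORDWR);
	if(cfd < 0)
		return -1;

	/* 2: the descriptor refers to the new ctl file; read the number */
	n = read(cfd, num, sizeof num - 1);
	if(n <= 0){
		close(cfd);
		return -1;
	}
	num[n] = 0;

	/* 3: write a protocol-specific ASCII address string */
	if(write(cfd, "connect 135.104.53.11!564", 25) < 0){
		close(cfd);
		return -1;
	}

	/* 4: opening the data file completes the connection; the
	 * conversation persists while any of its files remain open */
	snprint(path, sizeof path, "/net/tcp/%d/data", atoi(num));
	dfd = open(path, ORDWR);
	return dfd;
}
.P2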
401.NH 2
402Streams
403.PP
404A
405.I stream
406[Rit84a][Presotto] is a bidirectional channel connecting a
407physical or pseudo-device to user processes.
408The user processes insert and remove data at one end of the stream.
409Kernel processes acting on behalf of a device insert data at
410the other end.
411Asynchronous communications channels such as pipes,
412TCP conversations, Datakit conversations, and RS232 lines are implemented using
413streams.
414.PP
415A stream comprises a linear list of
416.I "processing modules" .
417Each module has both an upstream (toward the process) and
418downstream (toward the device)
419.I "put routine" .
420Calling the put routine of the module on either end of the stream
421inserts data into the stream.
422Each module calls the succeeding one to send data up or down the stream.
423.PP
424An instance of a processing module is represented by a pair of
425.I queues ,
426one for each direction.
427The queues point to the put procedures and can be used
428to queue information traveling along the stream.
429Some put routines queue data locally and send it along the stream at some
430later time, either due to a subsequent call or an asynchronous
431event such as a retransmission timer or a device interrupt.
432Processing modules create helper kernel processes to
433provide a context for handling asynchronous events.
434For example, a helper kernel process awakens periodically
435to perform any necessary TCP retransmissions.
436The use of kernel processes instead of serialized run-to-completion service routines
437differs from the implementation of Unix streams.
438Unix service routines cannot
use any blocking kernel resource, and they lack local, long-lived state.
440Helper kernel processes solve these problems and simplify the stream code.
441.PP
442There is no implicit synchronization in our streams.
443Each processing module must ensure that concurrent processes using the stream
444are synchronized.
445This maximizes concurrency but introduces the
446possibility of deadlock.
447However, deadlocks are easily avoided by careful programming; to
448date they have not caused us problems.
449.PP
450Information is represented by linked lists of kernel structures called
451.I blocks .
452Each block contains a type, some state flags, and pointers to
453an optional buffer.
454Block buffers can hold either data or control information, i.e., directives
455to the processing modules.
456Blocks and block buffers are dynamically allocated from kernel memory.
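.PP
In outline, the structures look something like the following.
The declarations are simplified and the field names illustrative;
they are not the kernel's exact definitions.
.P1
#include <u.h>

typedef struct Block Block;
typedef struct Queue Queue;

struct Block		/* a unit of data or control information */
{
	Block	*next;	/* next block in the list */
	int	type;	/* data or control */
	int	flags;	/* e.g. the delimiter flag */
	uchar	*base;	/* optional buffer */
	uchar	*rptr;	/* next byte to be consumed */
	uchar	*wptr;	/* next free byte */
};

struct Queue		/* one direction of a processing module */
{
	Queue	*next;	/* next module in this direction */
	Queue	*other;	/* queue for the opposite direction */
	void	(*put)(Queue*, Block*);	/* the module's put routine */
	Block	*first;	/* blocks held for later processing */
	Block	*last;
};
.P2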
457.NH 3
458User Interface
459.PP
460A stream is represented at user level as two files,
461.CW ctl
462and
463.CW data .
464The actual names can be changed by the device driver using the stream,
465as we saw earlier in the example of the UART driver.
466The first process to open either file creates the stream automatically.
467The last close destroys it.
468Writing to the
469.CW data
470file copies the data into kernel blocks
471and passes them to the downstream put routine of the first processing module.
472A write of less than 32K is guaranteed to be contained by a single block.
473Concurrent writes to the same stream are not synchronized, although the
47432K block size assures atomic writes for most protocols.
475The last block written is flagged with a delimiter
476to alert downstream modules that care about write boundaries.
477In most cases the first put routine calls the second, the second
478calls the third, and so on until the data is output.
479As a consequence, most data is output without context switching.
480.PP
481Reading from the
482.CW data
483file returns data queued at the top of the stream.
484The read terminates when the read count is reached
485or when the end of a delimited block is encountered.
486A per stream read lock ensures only one process
487can read from a stream at a time and guarantees
488that the bytes read were contiguous bytes from the
489stream.
490.PP
491Like UNIX streams [Rit84a],
492Plan 9 streams can be dynamically configured.
493The stream system intercepts and interprets
494the following control blocks:
495.IP "\f(CWpush\fP \fIname\fR" 15
496adds an instance of the processing module
497.I name
498to the top of the stream.
499.IP \f(CWpop\fP 15
500removes the top module of the stream.
501.IP \f(CWhangup\fP 15
502sends a hangup message
503up the stream from the device end.
504.LP
505Other control blocks are module-specific and are interpreted by each
506processing module
507as they pass.
508.PP
509The convoluted syntax and semantics of the UNIX
510.CW ioctl
511system call convinced us to leave it out of Plan 9.
512Instead,
513.CW ioctl
514is replaced by the
515.CW ctl
516file.
517Writing to the
518.CW ctl
519file
520is identical to writing to a
521.CW data
522file except the blocks are of type
523.I control .
524A processing module parses each control block it sees.
525Commands in control blocks are ASCII strings, so
526byte ordering is not an issue when one system
527controls streams in a name space implemented on another processor.
528The time to parse control blocks is not important, since control
529operations are rare.
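.PP
For example, a user process can reconfigure a stream it has open
by writing a command to the
.CW ctl
file.
In this sketch the module name is hypothetical.
.P1
#include <u.h>
#include <libc.h>

/*
 * Push a processing module onto the stream whose ctl file is
 * named by ctlpath.  The module name "fcall" is illustrative.
 */
void
pushmodule(char *ctlpath)
{
	int fd;

	fd = open(ctlpath, OWRITE);
	if(fd < 0)
		sysfatal("open %s: %r", ctlpath);
	if(write(fd, "push fcall", 10) < 0)
		sysfatal("push: %r");
	close(fd);
}
.P2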
530.NH 3
531Device Interface
532.PP
533The module at the downstream end of the stream is part of a device interface.
534The particulars of the interface vary with the device.
535Most device interfaces consist of an interrupt routine, an output
536put routine, and a kernel process.
537The output put routine stages data for the
538device and starts the device if it is stopped.
539The interrupt routine wakes up the kernel process whenever
540the device has input to be processed or needs more output staged.
541The kernel process puts information up the stream or stages more data for output.
542The division of labor among the different pieces varies depending on
543how much must be done at interrupt level.
544However, the interrupt routine may not allocate blocks or call
545a put routine since both actions require a process context.
546.NH 3
547Multiplexing
548.PP
549The conversations using a protocol device must be
550multiplexed onto a single physical wire.
551We push a multiplexer processing module
552onto the physical device stream to group the conversations.
553The device end modules on the conversations add the necessary header
554onto downstream messages and then put them to the module downstream
555of the multiplexer.
556The multiplexing module looks at each message moving up its stream and
557puts it to the correct conversation stream after stripping
558the header controlling the demultiplexing.
559.PP
560This is similar to the Unix implementation of multiplexer streams.
561The major difference is that we have no general structure that
562corresponds to a multiplexer.
563Each attempt to produce a generalized multiplexer created a more complicated
564structure and underlined the basic difficulty of generalizing this mechanism.
565We now code each multiplexer from scratch and favor simplicity over
566generality.
567.NH 3
568Reflections
569.PP
Despite five years' experience and the efforts of many programmers,
571we remain dissatisfied with the stream mechanism.
572Performance is not an issue;
573the time to process protocols and drive
574device interfaces continues to dwarf the
575time spent allocating, freeing, and moving blocks
576of data.
However, the mechanism remains inordinately
578complex.
579Much of the complexity results from our efforts
580to make streams dynamically configurable, to
581reuse processing modules on different devices
582and to provide kernel synchronization
583to ensure data structures
584don't disappear under foot.
585This is particularly irritating since we seldom use these properties.
586.PP
587Streams remain in our kernel because we are unable to
588devise a better alternative.
589Larry Peterson's X-kernel [Pet89a]
590is the closest contender but
591doesn't offer enough advantage to switch.
592If we were to rewrite the streams code, we would probably statically
593allocate resources for a large fixed number of conversations and burn
594memory in favor of less complexity.
595.NH
596The IL Protocol
597.PP
598None of the standard IP protocols is suitable for transmission of
5999P messages over an Ethernet or the Internet.
600TCP has a high overhead and does not preserve delimiters.
601UDP, while cheap, does not provide reliable sequenced delivery.
602Early versions of the system used a custom protocol that was
603efficient but unsatisfactory for internetwork transmission.
604When we implemented IP, TCP, and UDP we looked around for a suitable
605replacement with the following properties:
606.IP \(bu
607Reliable datagram service with sequenced delivery
608.IP \(bu
609Runs over IP
610.IP \(bu
611Low complexity, high performance
612.IP \(bu
613Adaptive timeouts
614.LP
615None met our needs so a new protocol was designed.
616IL is a lightweight protocol designed to be encapsulated by IP.
617It is a connection-based protocol
618providing reliable transmission of sequenced messages between machines.
619No provision is made for flow control since the protocol is designed to transport RPC
620messages between client and server.
621A small outstanding message window prevents too
622many incoming messages from being buffered;
623messages outside the window are discarded
624and must be retransmitted.
Connection setup uses a two-way handshake to generate
626initial sequence numbers at each end of the connection;
627subsequent data messages increment the
628sequence numbers allowing
629the receiver to resequence out of order messages.
630In contrast to other protocols, IL does not do blind retransmission.
631If a message is lost and a timeout occurs, a query message is sent.
632The query message is a small control message containing the current
633sequence numbers as seen by the sender.
634The receiver responds to a query by retransmitting missing messages.
635This allows the protocol to behave well in congested networks,
636where blind retransmission would cause further
637congestion.
638Like TCP, IL has adaptive timeouts.
639A round-trip timer is used
640to calculate acknowledge and retransmission times in terms of the network speed.
641This allows the protocol to perform well on both the Internet and on local Ethernets.
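.PP
The state carried by each IL message is correspondingly simple.
The declaration below is only a sketch of the information described
above; the field names, widths, and layout are illustrative and do not
reproduce the actual header.
.P1
#include <u.h>

typedef struct Ilhdr Ilhdr;
struct Ilhdr		/* illustrative, not the on-the-wire format */
{
	uchar	type;	/* data, acknowledge, query, reply, ... */
	ulong	src;	/* source port of the connection */
	ulong	dst;	/* destination port of the connection */
	ulong	id;	/* sequence number of this message */
	ulong	ack;	/* highest in-sequence message received */
};
.P2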
642.PP
643In keeping with the minimalist design of the rest of the kernel, IL is small.
644The entire protocol is 847 lines of code, compared to 2200 lines for TCP.
645IL is our protocol of choice.
646.NH
647Network Addressing
648.PP
649A uniform interface to protocols and devices is not sufficient to
650support the transparency we require.
651Since each network uses a different
652addressing scheme,
653the ASCII strings written to a control file have no common format.
654As a result, every tool must know the specifics of the networks it
655is capable of addressing.
656Moreover, since each machine supplies a subset
657of the available networks, each user must be aware of the networks supported
658by every terminal and server machine.
659This is obviously unacceptable.
660.PP
661Several possible solutions were considered and rejected; one deserves
662more discussion.
663We could have used a user-level file server
664to represent the network name space as a Plan 9 file tree.
665This global naming scheme has been implemented in other distributed systems.
666The file hierarchy provides paths to
667directories representing network domains.
668Each directory contains
669files representing the names of the machines in that domain;
670an example might be the path
671.CW /net/name/usa/edu/mit/ai .
672Each machine file contains information like the IP address of the machine.
673We rejected this representation for several reasons.
674First, it is hard to devise a hierarchy encompassing all representations
675of the various network addressing schemes in a uniform manner.
676Datakit and Ethernet address strings have nothing in common.
677Second, the address of a machine is
678often only a small part of the information required to connect to a service on
679the machine.
680For example, the IP protocols require symbolic service names to be mapped into
681numeric port numbers, some of which are privileged and hence special.
682Information of this sort is hard to represent in terms of file operations.
683Finally, the size and number of the networks being represented burdens users with
684an unacceptably large amount of information about the organization of the network
685and its connectivity.
686In this case the Plan 9 representation of a
687resource as a file is not appropriate.
688.PP
689If tools are to be network independent, a third-party server must resolve
690network names.
691A server on each machine, with local knowledge, can select the best network
692for any particular destination machine or service.
693Since the network devices present a common interface,
694the only operation which differs between networks is name resolution.
695A symbolic name must be translated to
696the path of the clone file of a protocol
697device and an ASCII address string to write to the
698.CW ctl
699file.
700A connection server (CS) provides this service.
701.NH 2
702Network Database
703.PP
704On most systems several
705files such as
706.CW /etc/hosts ,
707.CW /etc/networks ,
708.CW /etc/services ,
709.CW /etc/hosts.equiv ,
710.CW /etc/bootptab ,
711and
712.CW /etc/named.d
713hold network information.
714Much time and effort is spent
715administering these files and keeping
716them mutually consistent.
717Tools attempt to
718automatically derive one or more of the files from
719information in other files but maintenance continues to be
720difficult and error prone.
721.PP
722Since we were writing an entirely new system, we were free to
723try a simpler approach.
724One database on a shared server contains all the information
725needed for network administration.
726Two ASCII files comprise the main database:
727.CW /lib/ndb/local
728contains locally administered information and
729.CW /lib/ndb/global
730contains information imported from elsewhere.
731The files contain sets of attribute/value pairs of the form
732.I attr\f(CW=\fPvalue ,
733where
734.I attr
735and
736.I value
737are alphanumeric strings.
738Systems are described by multi-line entries;
739a header line at the left margin begins each entry followed by zero or more
740indented attribute/value pairs specifying
741names, addresses, properties, etc.
742For example, the entry for our CPU server
743specifies a domain name, an IP address, an Ethernet address,
744a Datakit address, a boot file, and supported protocols.
745.P1
746sys = helix
747	dom=helix.research.att.com
748	bootf=/mips/9power
749	ip=135.104.9.31 ether=0800690222f0
750	dk=nj/astro/helix
751	proto=il flavor=9cpu
752.P2
753If several systems share entries such as
754network mask and gateway, we specify that information
755with the network or subnetwork instead of the system.
756The following entries define a Class B IP network and
757a few subnets derived from it.
758The entry for the network specifies the IP mask,
759file system, and authentication server for all systems
760on the network.
761Each subnetwork specifies its default IP gateway.
762.P1
763ipnet=mh-astro-net ip=135.104.0.0 ipmask=255.255.255.0
764	fs=bootes.research.att.com
765	auth=1127auth
766ipnet=unix-room ip=135.104.117.0
767	ipgw=135.104.117.1
768ipnet=third-floor ip=135.104.51.0
769	ipgw=135.104.51.1
770ipnet=fourth-floor ip=135.104.52.0
771	ipgw=135.104.52.1
772.P2
773Database entries also define the mapping of service names
774to port numbers for TCP, UDP, and IL.
775.P1
776tcp=echo	port=7
777tcp=discard	port=9
778tcp=systat	port=11
779tcp=daytime	port=13
780.P2
781.PP
782All programs read the database directly so
783consistency problems are rare.
However, the database files can become large.
785Our global file, containing all information about
786both Datakit and Internet systems in AT&T, has 43,000
787lines.
788To speed searches, we build hash table files for each
789attribute we expect to search often.
790The hash file entries point to entries
791in the master files.
792Every hash file contains the modification time of its master
793file so we can avoid using an out-of-date hash table.
794Searches for attributes that aren't hashed or whose hash table
is out-of-date still work; they just take longer.
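.PP
The freshness test is a comparison of modification times.
The sketch below simply compares the two files' times using
.CW dirstat ;
the real code keeps the master's time inside the hash file itself,
as described above.
.P1
#include <u.h>
#include <libc.h>

/* Is the hash file at least as new as its master database file? */
int
hashusable(char *master, char *hashfile)
{
	Dir *dm, *dh;
	int ok;

	dm = dirstat(master);
	dh = dirstat(hashfile);
	ok = dm != nil && dh != nil && dh->mtime >= dm->mtime;
	free(dm);
	free(dh);
	return ok;
}
.P2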
796.NH 2
797Connection Server
798.PP
799On each system a user level connection server process, CS, translates
800symbolic names to addresses.
801CS uses information about available networks, the network database, and
802other servers (such as DNS) to translate names.
803CS is a file server serving a single file,
804.CW /net/cs .
805A client writes a symbolic name to
806.CW /net/cs
807then reads one line for each matching destination reachable
808from this system.
809The lines are of the form
810.I "filename message",
811where
812.I filename
813is the path of the clone file to open for a new connection and
814.I message
815is the string to write to it to make the connection.
816The following example illustrates this.
817.CW Ndb/csquery
818is a program that prompts for strings to write to
819.CW /net/cs
820and prints the replies.
821.P1
822% ndb/csquery
823> net!helix!9fs
824/net/il/clone 135.104.9.31!17008
825/net/dk/clone nj/astro/helix!9fs
826.P2
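.PP
A C program performs the same query by writing the symbolic name to
.CW /net/cs
and reading back the translations.
The sketch below returns only the first; rewinding the descriptor
before reading follows the convention used by
.CW ndb/csquery .
.P1
#include <u.h>
#include <libc.h>

/*
 * Ask CS to translate name.  Each read returns one line of the
 * form "clone-file address"; this sketch keeps only the first.
 */
int
csfirst(char *name, char *reply, int nreply)
{
	int fd, n;

	fd = open("/net/cs", ORDWR);
	if(fd < 0)
		return -1;
	if(write(fd, name, strlen(name)) < 0){
		close(fd);
		return -1;
	}
	seek(fd, 0, 0);
	n = read(fd, reply, nreply-1);
	close(fd);
	if(n < 0)
		return -1;
	reply[n] = 0;
	return n;
}
.P2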
827.PP
828CS provides meta-name translation to perform complicated
829searches.
830The special network name
831.CW net
832selects any network in common between source and
833destination supporting the specified service.
834A host name of the form \f(CW$\fIattr\f1
835is the name of an attribute in the network database.
836The database search returns the value
837of the matching attribute/value pair
838most closely associated with the source host.
``Most closely associated'' is defined on a per-network basis.
840For example, the symbolic name
841.CW tcp!$auth!rexauth
842causes CS to search for the
843.CW auth
844attribute in the database entry for the source system, then its
845subnetwork (if there is one) and then its network.
846.P1
847% ndb/csquery
848> net!$auth!rexauth
849/net/il/clone 135.104.9.34!17021
850/net/dk/clone nj/astro/p9auth!rexauth
851/net/il/clone 135.104.9.6!17021
852/net/dk/clone nj/astro/musca!rexauth
853.P2
854.PP
855Normally CS derives naming information from its database files.
856For domain names however, CS first consults another user level
857process, the domain name server (DNS).
858If no DNS is reachable, CS relies on its own tables.
859.PP
860Like CS, the domain name server is a user level process providing
861one file,
862.CW /net/dns .
863A client writes a request of the form
864.I "domain-name type" ,
865where
866.I type
867is a domain name service resource record type.
868DNS performs a recursive query through the
869Internet domain name system producing one line
870per resource record found.  The client reads
871.CW /net/dns
872to retrieve the records.
873Like other domain name servers, DNS caches information
874learned from the network.
875DNS is implemented as a multi-process shared memory application
876with separate processes listening for network and local requests.
877.NH
878Library routines
879.PP
880The section on protocol devices described the details
881of making and receiving connections across a network.
882The dance is straightforward but tedious.
883Library routines are provided to relieve
884the programmer of the details.
885.NH 2
886Connecting
887.PP
888The
889.CW dial
890library call establishes a connection to a remote destination.
891It
892returns an open file descriptor for the
893.CW data
894file in the connection directory.
895.P1
896int  dial(char *dest, char *local, char *dir, int *cfdp)
897.P2
898.IP \f(CWdest\fP 10
899is the symbolic name/address of the destination.
900.IP \f(CWlocal\fP 10
901is the local address.
902Since most networks do not support this, it is
903usually zero.
904.IP \f(CWdir\fP 10
905is a pointer to a buffer to hold the path name of the protocol directory
906representing this connection.
907.CW Dial
908fills this buffer if the pointer is non-zero.
909.IP \f(CWcfdp\fP 10
910is a pointer to a file descriptor for the
911.CW ctl
912file of the connection.
913If the pointer is non-zero,
914.CW dial
915opens the control file and tucks the file descriptor here.
916.LP
917Most programs call
918.CW dial
919with a destination name and all other arguments zero.
920.CW Dial
921uses CS to
922translate the symbolic name to all possible destination addresses
923and attempts to connect to each in turn until one works.
924Specifying the special name
925.CW net
926in the network portion of the destination
927allows CS to pick a network/protocol in common
928with the destination for which the requested service is valid.
929For example, assume the system
930.CW research.att.com
931has the Datakit address
932.CW nj/astro/research
933and IP addresses
934.CW 135.104.117.5
935and
936.CW 129.11.4.1 .
937The call
938.P1
fd = dial("net!research.att.com!login", 0, 0, 0);
940.P2
941tries in succession to connect to
942.CW nj/astro/research!login
943on the Datakit and both
944.CW 135.104.117.5!513
945and
946.CW 129.11.4.1!513
947across the Internet.
948.PP
949.CW Dial
950accepts addresses instead of symbolic names.
951For example, the destinations
952.CW tcp!135.104.117.5!513
953and
954.CW tcp!research.att.com!login
955are equivalent
956references to the same machine.
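.PP
A complete client is therefore only a few lines.
The destination below combines a machine and a service that appear
elsewhere in this paper and is meant purely as an illustration.
.P1
#include <u.h>
#include <libc.h>

/* Connect to an echo service, send a message, and copy back the reply. */
void
echoclient(void)
{
	int fd, n;
	char buf[128];

	fd = dial("net!research.att.com!echo", 0, 0, 0);
	if(fd < 0)
		sysfatal("dial: %r");
	if(write(fd, "hello", 5) != 5)
		sysfatal("write: %r");
	n = read(fd, buf, sizeof buf);
	if(n < 0)
		sysfatal("read: %r");
	write(1, buf, n);
	close(fd);
}
.P2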
957.NH 2
958Listening
959.PP
960A program uses
961four routines to listen for incoming connections.
962It first
963.CW announce() s
964its intention to receive connections,
965then
966.CW listen() s
967for calls and finally
968.CW accept() s
969or
970.CW reject() s
971them.
972.CW Announce
973returns an open file descriptor for the
974.CW ctl
975file of a connection and fills
976.CW dir
977with the
978path of the protocol directory
979for the announcement.
980.P1
981int  announce(char *addr, char *dir)
982.P2
983.CW Addr
984is the symbolic name/address announced;
985if it does not contain a service, the announcement is for
986all services not explicitly announced.
987Thus, one can easily write the equivalent of the
988.CW inetd
989program without
990having to announce each separate service.
991An announcement remains in force until the control file is
992closed.
993.LP
994.CW Listen
995returns an open file descriptor for the
996.CW ctl
997file and fills
998.CW ldir
999with the path
1000of the protocol directory
1001for the received connection.
1002It is passed
1003.CW dir
1004from the announcement.
1005.P1
1006int  listen(char *dir, char *ldir)
1007.P2
1008.LP
1009.CW Accept
1010and
1011.CW reject
1012are called with the control file descriptor and
1013.CW ldir
1014returned by
.CW listen .
1016Some networks such as Datakit accept a reason for a rejection;
1017networks such as IP ignore the third argument.
1018.P1
1019int  accept(int ctl, char *ldir)
1020int  reject(int ctl, char *ldir, char *reason)
1021.P2
1022.PP
1023The following code implements a typical TCP listener.
1024It announces itself, listens for connections, and forks a new
1025process for each.
1026The new process echoes data on the connection until the
1027remote end closes it.
1028The "*" in the symbolic name means the announcement is valid for
1029any addresses bound to the machine the program is run on.
1030.P1
1031.ta 8n 16n 24n 32n 40n 48n 56n 64n
#include <u.h>
#include <libc.h>

int
echo_server(void)
{
	int afd, dfd, lcfd;
1036	char adir[40], ldir[40];
1037	int n;
1038	char buf[256];
1039
1040	afd = announce("tcp!*!echo", adir);
1041	if(afd < 0)
1042		return -1;
1043
1044	for(;;){
1045		/* listen for a call */
1046		lcfd = listen(adir, ldir);
1047		if(lcfd < 0)
1048			return -1;
1049
1050		/* fork a process to echo */
1051		switch(fork()){
1052		case 0:
1053			/* accept the call and open the data file */
1054			dfd = accept(lcfd, ldir);
1055			if(dfd < 0)
1056				return -1;
1057
1058			/* echo until EOF */
1059			while((n = read(dfd, buf, sizeof(buf))) > 0)
1060				write(dfd, buf, n);
1061			exits(0);
1062		case -1:
1063			perror("forking");
1064		default:
1065			close(lcfd);
1066			break;
1067		}
1068
1069	}
1070}
1071.P2
1072.NH
1073User Level
1074.PP
1075Communication between Plan 9 machines is done almost exclusively in
1076terms of 9P messages. Only the two services
1077.CW cpu
1078and
1079.CW exportfs
1080are used.
1081The
1082.CW cpu
1083service is analogous to
1084.CW rlogin .
1085However, rather than emulating a terminal session
1086across the network,
1087.CW cpu
1088creates a process on the remote machine whose name space is an analogue of the window
1089in which it was invoked.
1090.CW Exportfs
1091is a user level file server which allows a piece of name space to be
1092exported from machine to machine across a network. It is used by the
1093.CW cpu
1094command to serve the files in the terminal's name space when they are
1095accessed from the
1096cpu server.
1097.PP
1098By convention, the protocol and device driver file systems are mounted in a
1099directory called
1100.CW /net .
1101Although the per-process name space allows users to configure an
1102arbitrary view of the system, in practice their profiles build
1103a conventional name space.
1104.NH 2
1105Exportfs
1106.PP
1107.CW Exportfs
1108is invoked by an incoming network call.
1109The
1110.I listener
1111(the Plan 9 equivalent of
1112.CW inetd )
1113runs the profile of the user
1114requesting the service to construct a name space before starting
1115.CW exportfs .
1116After an initial protocol
1117establishes the root of the file tree being
1118exported,
1119the remote process mounts the connection,
1120allowing
1121.CW exportfs
1122to act as a relay file server. Operations in the imported file tree
1123are executed on the remote server and the results returned.
1124As a result
1125the name space of the remote machine appears to be exported into a
1126local file tree.
1127.PP
1128The
1129.CW import
1130command calls
1131.CW exportfs
1132on a remote machine, mounts the result in the local name space,
1133and
1134exits.
1135No local process is required to serve mounts;
11369P messages are generated by the kernel's mount driver and sent
1137directly over the network.
1138.PP
1139.CW Exportfs
1140must be multithreaded since the system calls
.CW open ,
1142.CW read
1143and
1144.CW write
1145may block.
1146Plan 9 does not implement the
1147.CW select
1148system call but does allow processes to share file descriptors,
1149memory and other resources.
1150.CW Exportfs
1151and the configurable name space
1152provide a means of sharing resources between machines.
1153It is a building block for constructing complex name spaces
1154served from many machines.
1155.PP
1156The simplicity of the interfaces encourages naive users to exploit the potential
1157of a richly connected environment.
1158Using these tools it is easy to gateway between networks.
For example, a terminal with only a Datakit connection can import from the server
1160.CW helix :
1161.P1
1162import -a helix /net
1163telnet ai.mit.edu
1164.P2
1165The
1166.CW import
1167command makes a Datakit connection to the machine
1168.CW helix
1169where
it starts an instance of
1171.CW exportfs
1172to serve
1173.CW /net .
1174The
1175.CW import
1176command mounts the remote
1177.CW /net
1178directory after (the
1179.CW -a
1180option to
1181.CW import )
1182the existing contents
1183of the local
1184.CW /net
1185directory.
1186The directory contains the union of the local and remote contents of
1187.CW /net .
1188Local entries supersede remote ones of the same name so
1189networks on the local machine are chosen in preference
1190to those supplied remotely.
1191However, unique entries in the remote directory are now visible in the local
1192.CW /net
1193directory.
1194All the networks connected to
1195.CW helix ,
1196not just Datakit,
1197are now available in the terminal. The effect on the name space is shown by the following
1198example:
1199.P1
1200philw-gnot% ls /net
1201/net/cs
1202/net/dk
1203philw-gnot% import -a musca /net
1204philw-gnot% ls /net
1205/net/cs
1206/net/cs
1207/net/dk
1208/net/dk
1209/net/dns
1210/net/ether
1211/net/il
1212/net/tcp
1213/net/udp
1214.P2
1215.NH 2
1216Ftpfs
1217.PP
1218We decided to make our interface to FTP
1219a file system rather than the traditional command.
1220Our command,
.I ftpfs ,
1222dials the FTP port of a remote system, prompts for login and password, sets image mode,
1223and mounts the remote file system onto
1224.CW /n/ftp .
1225Files and directories are cached to reduce traffic.
1226The cache is updated whenever a file is created.
1227Ftpfs works with TOPS-20, VMS, and various Unix flavors
1228as the remote system.
1229.NH
1230Cyclone Fiber Links
1231.PP
1232The file servers and CPU servers are connected by
1233high-bandwidth
1234point-to-point links.
1235A link consists of two VME cards connected by a pair of optical
1236fibers.
1237The VME cards use 33MHz Intel 960 processors and AMD's TAXI
1238fiber transmitter/receivers to drive the lines at 125 Mbit/sec.
1239Software in the VME card reduces latency by copying messages from system memory
1240to fiber without intermediate buffering.
1241.NH
1242Performance
1243.PP
1244We measured both latency and throughput
1245of reading and writing bytes between two processes
1246for a number of different paths.
1247Measurements were made on two- and four-CPU SGI Power Series processors.
1248The CPUs are 25 MHz MIPS 3000s.
1249The latency is measured as the round trip time
1250for a byte sent from one process to another and
1251back again.
1252Throughput is measured using 16k writes from
1253one process to another.
1254.DS C
1255.TS
1256box, tab(:);
1257c s s
1258c | c | c
1259l | n | n.
1260Table 1 - Performance
1261_
1262test:throughput:latency
1263:MBytes/sec:millisec
1264_
1265pipes:8.15:.255
1266_
1267IL/ether:1.02:1.42
1268_
1269URP/Datakit:0.22:1.75
1270_
1271Cyclone:3.2:0.375
1272.TE
1273.DE
1274.NH
1275Conclusion
1276.PP
1277The representation of all resources as file systems
1278coupled with an ASCII interface has proved more powerful
1279than we had originally imagined.
1280Resources can be used by any computer in our networks
1281independent of byte ordering or CPU type.
1282The connection server provides an elegant means
1283of decoupling tools from the networks they use.
1284Users successfully use Plan 9 without knowing the
1285topology of the system or the networks they use.
More information about 9P can be found in Section 5 of the Plan 9 Programmer's
1287Manual, Volume I.
1288.NH
1289References
1290.LP
1291[Pike90] R. Pike, D. Presotto, K. Thompson, H. Trickey,
1292``Plan 9 from Bell Labs'',
1293.I
UKUUG Proc. of the Summer 1990 Conf. ,
.R
1295London, England,
12961990.
1297.LP
1298[Needham] R. Needham, ``Names'', in
1299.I
1300Distributed systems,
1301.R
1302S. Mullender, ed.,
1303Addison Wesley, 1989.
1304.LP
1305[Presotto] D. Presotto, ``Multiprocessor Streams for Plan 9'',
1306.I
1307UKUUG Proc. of the Summer 1990 Conf. ,
1308.R
1309London, England, 1990.
1310.LP
[Met80] R. Metcalfe, D. Boggs, C. Crane, E. Taft and J. Hupp, ``The
1312Ethernet Local Network: Three reports'',
1313.I
1314CSL-80-2,
1315.R
1316XEROX Palo Alto Research Center, February 1980.
1317.LP
1318[Fra80] A. G. Fraser, ``Datakit - A Modular Network for Synchronous
1319and Asynchronous Traffic'',
1320.I
1321Proc. Int'l Conf. on Communication,
1322.R
1323Boston, June 1980.
1324.LP
1325[Pet89a] L. Peterson, ``RPC in the X-Kernel: Evaluating new Design Techniques'',
1326.I
1327Proc. Twelfth Symp. on Op. Sys. Princ.,
1328.R
Litchfield Park, AZ, December 1989.
1330.LP
1331[Rit84a] D. M. Ritchie, ``A Stream Input-Output System'',
1332.I
AT&T Bell Laboratories Technical Journal, 63(8),
1334.R
1335October 1984.
1336