1.HTML "The Organization of Networks in Plan 9
2.TL
3The Organization of Networks in Plan 9
4.AU
5Dave Presotto
6Phil Winterbottom
7.sp
8presotto,philw@plan9.bell-labs.com
9.AB
10.FS
11Originally appeared in
12.I
13Proc. of the Winter 1993 USENIX Conf.,
14.R
15pp. 271-280,
16San Diego, CA
17.FE
18In a distributed system networks are of paramount importance. This
19paper describes the implementation, design philosophy, and organization
20of network support in Plan 9. Topics include network requirements
21for distributed systems, our kernel implementation, network naming, user interfaces,
22and performance. We also observe that much of this organization is relevant to
23current systems.
24.AE
25.NH
26Introduction
27.PP
28Plan 9 [Pike90] is a general-purpose, multi-user, portable distributed system
29implemented on a variety of computers and networks.
30What distinguishes Plan 9 is its organization.
31The goals of this organization were to
32reduce administration
33and to promote resource sharing. One of the keys to its success as a distributed
34system is the organization and management of its networks.
35.PP
36A Plan 9 system comprises file servers, CPU servers and terminals.
37The file servers and CPU servers are typically centrally
38located multiprocessor machines with large memories and
39high speed interconnects.
40A variety of workstation-class machines
41serve as terminals
42connected to the central servers using several networks and protocols.
43The architecture of the system demands a hierarchy of network
44speeds matching the needs of the components.
45Connections between file servers and CPU servers are high-bandwidth point-to-point
46fiber links.
47Connections from the servers fan out to local terminals
48using medium speed networks
49such as Ethernet [Met80] and Datakit [Fra80].
50Low speed connections via the Internet and
51the AT&T backbone serve users in Oregon and Illinois.
52Basic Rate ISDN data service and 9600 baud serial lines provide slow
53links to users at home.
54.PP
55Since CPU servers and terminals use the same kernel,
56users may choose to run programs locally on
57their terminals or remotely on CPU servers.
58The organization of Plan 9 hides the details of system connectivity
59allowing both users and administrators to configure their environment
60to be as distributed or centralized as they wish.
61Simple commands support the
62construction of a locally represented name space
63spanning many machines and networks.
64At work, users tend to use their terminals like workstations,
65running interactive programs locally and
66reserving the CPU servers for data or compute intensive jobs
67such as compiling and computing chess endgames.
68At home or when connected over
69a slow network, users tend to do most work on the CPU server to minimize
70traffic on the slow links.
71The goal of the network organization is to provide the same
72environment to the user wherever resources are used.
73.NH
74Kernel Network Support
75.PP
76Networks play a central role in any distributed system. This is particularly
77true in Plan 9 where most resources are provided by servers external to the kernel.
78The importance of the networking code within the kernel
79is reflected by its size;
80of 25,000 lines of kernel code, 12,500 are network and protocol related.
81Networks are continually being added and the fraction of code
82devoted to communications
83is growing.
84Moreover, the network code is complex.
85Protocol implementations consist almost entirely of
86synchronization and dynamic memory management, areas demanding
87subtle error recovery
88strategies.
89The kernel currently supports Datakit, point-to-point fiber links,
90an Internet (IP) protocol suite and ISDN data service.
91The variety of networks and machines
92has raised issues not addressed by other systems running on commercial
93hardware supporting only Ethernet or FDDI.
94.NH 2
95The File System protocol
96.PP
97A central idea in Plan 9 is the representation of a resource as a hierarchical
98file system.
99Each process assembles a view of the system by building a
100.I "name space
101[Needham] connecting its resources.
102File systems need not represent disc files; in fact, most Plan 9 file systems have no
103permanent storage.
104A typical file system dynamically represents
105some resource like a set of network connections or the process table.
106Communication between the kernel, device drivers, and local or remote file servers uses a
107protocol called 9P. The protocol consists of 17 messages
108describing operations on files and directories.
109Kernel resident device and protocol drivers use a procedural version
110of the protocol while external file servers use an RPC form.
111Nearly all traffic between Plan 9 systems consists
112of 9P messages.
1139P relies on several properties of the underlying transport protocol.
114It assumes messages arrive reliably and in sequence and
115that delimiters between messages
116are preserved.
117When a protocol does not meet these
118requirements (for example, TCP does not preserve delimiters)
119we provide mechanisms to marshal messages before handing them
120to the system.
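.PP
One simple way to provide such delimiters, shown here only as an
illustration and not as the marshaling code Plan 9 actually uses,
is to prefix each message with its length before handing it to a
byte-stream transport such as TCP:
.P1
#include <u.h>
#include <libc.h>

/* frame one 9P message for a transport that does not preserve
 * message boundaries: send a 4-byte little-endian length first */
int
sendmsg9p(int fd, uchar *msg, long n)
{
	uchar hdr[4];

	hdr[0] = n;
	hdr[1] = n>>8;
	hdr[2] = n>>16;
	hdr[3] = n>>24;
	if(write(fd, hdr, 4) != 4)
		return -1;
	if(write(fd, msg, n) != n)
		return -1;
	return 0;
}
.P2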
121.PP
122A kernel data structure, the
123.I channel ,
124is a handle to a file server.
125Operations on a channel generate the following 9P messages.
126The
127.CW session
128and
129.CW attach
130messages authenticate a connection, established by means external to 9P,
131and validate its user.
132The result is an authenticated
133channel
134referencing the root of the
135server.
136The
137.CW clone
138message makes a new channel identical to an existing channel, much like
139the
140.CW dup
141system call.
142A
143channel
144may be moved to a file on the server using a
145.CW walk
146message to descend each level in the hierarchy.
147The
148.CW stat
149and
150.CW wstat
151messages read and write the attributes of the file referenced by a channel.
152The
153.CW open
154message prepares a channel for subsequent
155.CW read
156and
157.CW write
158messages to access the contents of the file.
159.CW Create
160and
161.CW remove
162perform the actions implied by their names on the file
163referenced by the channel.
164The
165.CW clunk
166message discards a channel without affecting the file.
167.PP
168A kernel resident file server called the
169.I "mount driver"
170converts the procedural version of 9P into RPCs.
171The
172.I mount
173system call provides a file descriptor, which can be
174a pipe to a user process or a network connection to a remote machine, to
175be associated with the mount point.
176After a mount, operations
177on the file tree below the mount point are sent as messages to the file server.
178The
179mount
180driver manages buffers, packs and unpacks parameters from
181messages, and demultiplexes among processes using the file server.
182.NH 2
183Kernel Organization
184.PP
185The network code in the kernel is divided into three layers: hardware interface,
186protocol processing, and program interface.
187A device driver typically uses streams to connect the two interface layers.
188Additional stream modules may be pushed on
189a device to process protocols.
190Each device driver is a kernel-resident file system.
191Simple device drivers serve a single level
192directory containing just a few files;
193for example, we represent each UART
194by a data and a control file.
.P1
cpu% cd /dev
cpu% ls -l eia*
--rw-rw-rw- t 0 bootes bootes 0 Jul 16 17:28 eia1
--rw-rw-rw- t 0 bootes bootes 0 Jul 16 17:28 eia1ctl
--rw-rw-rw- t 0 bootes bootes 0 Jul 16 17:28 eia2
--rw-rw-rw- t 0 bootes bootes 0 Jul 16 17:28 eia2ctl
cpu%
.P2
204The control file is used to control the device;
205writing the string
206.CW b1200
207to
208.CW /dev/eia1ctl
209sets the line to 1200 baud.
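.PP
Because control is ordinary file I/O, no special system call is involved;
a minimal C sketch of the same operation (assuming the device files
shown above) is:
.P1
#include <u.h>
#include <libc.h>

void
main(void)
{
	int fd;

	/* open the UART control file and set the line speed */
	fd = open("/dev/eia1ctl", OWRITE);
	if(fd < 0)
		exits("open");
	fprint(fd, "b1200");	/* same effect as: echo b1200 >/dev/eia1ctl */
	close(fd);
	exits(0);
}
.P2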
210.PP
211Multiplexed devices present
212a more complex interface structure.
213For example, the LANCE Ethernet driver
214serves a two level file tree (Figure 1)
215providing
216.IP \(bu
217device control and configuration
218.IP \(bu
219user-level protocols like ARP
220.IP \(bu
221diagnostic interfaces for snooping software.
222.LP
223The top directory contains a
224.CW clone
225file and a directory for each connection, numbered
226.CW 1
227to
228.CW n .
229Each connection directory corresponds to an Ethernet packet type.
230Opening the
231.CW clone
232file finds an unused connection directory
233and opens its
234.CW ctl
235file.
236Reading the control file returns the ASCII connection number; the user
237process can use this value to construct the name of the proper
238connection directory.
239In each connection directory files named
240.CW ctl ,
241.CW data ,
242.CW stats ,
243and
244.CW type
245provide access to the connection.
246Writing the string
247.CW "connect 2048"
248to the
249.CW ctl
250file sets the packet type to 2048
251and
252configures the connection to receive
253all IP packets sent to the machine.
254Subsequent reads of the file
255.CW type
256yield the string
257.CW 2048 .
258The
259.CW data
260file accesses the media;
261reading it
262returns the
263next packet of the selected type.
264Writing the file
265queues a packet for transmission after
266appending a packet header containing the source address and packet type.
267The
268.CW stats
269file returns ASCII text containing the interface address,
270packet input/output counts, error statistics, and general information
271about the state of the interface.
272.so tree.pout
273.PP
274If several connections on an interface
275are configured for a particular packet type, each receives a
276copy of the incoming packets.
277The special packet type
278.CW -1
279selects all packets.
280Writing the strings
281.CW promiscuous
282and
283.CW connect
284.CW -1
285to the
286.CW ctl
287file
288configures a conversation to receive all packets on the Ethernet.
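.PP
The following sketch shows how a user program might use this interface
to read IP packets; the mount point
.CW /net/ether0
is an assumption, but the file names follow the description above.
.P1
#include <u.h>
#include <libc.h>

void
main(void)
{
	int cfd, dfd, n;
	char dir[40], buf[8192];

	/* opening the clone file reserves a connection and yields its ctl file */
	cfd = open("/net/ether0/clone", ORDWR);
	if(cfd < 0)
		exits("clone");

	/* the ctl file reads back as the connection number */
	n = read(cfd, buf, sizeof(buf)-1);
	if(n <= 0)
		exits("read");
	buf[n] = 0;
	snprint(dir, sizeof(dir), "/net/ether0/%d/data", atoi(buf));

	/* select Ethernet packet type 2048 (IP) */
	if(write(cfd, "connect 2048", 12) != 12)
		exits("connect");

	/* read packets of the selected type from the data file */
	dfd = open(dir, ORDWR);
	if(dfd < 0)
		exits("data");
	while((n = read(dfd, buf, sizeof(buf))) > 0)
		;	/* process the packet here */
	exits(0);
}
.P2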
289.PP
290Although the driver interface may seem elaborate,
291the representation of a device as a set of files using ASCII strings for
292communication has several advantages.
293Any mechanism supporting remote access to files immediately
294allows a remote machine to use our interfaces as gateways.
Using ASCII strings to control the interface avoids byte order problems,
ensures a uniform representation for
devices on the same machine, and even allows devices to be accessed remotely.
298Representing dissimilar devices by the same set of files allows common tools
299to serve
300several networks or interfaces.
301Programs like
302.CW stty
303are replaced by
304.CW echo
305and shell redirection.
306.NH 2
307Protocol devices
308.PP
309Network connections are represented as pseudo-devices called protocol devices.
310Protocol device drivers exist for the Datakit URP protocol and for each of the
311Internet IP protocols TCP, UDP, and IL.
312IL, described below, is a new communication protocol used by Plan 9 for
transmitting file system RPCs.
314All protocol devices look identical so user programs contain no
315network-specific code.
316.PP
317Each protocol device driver serves a directory structure
318similar to that of the Ethernet driver.
319The top directory contains a
320.CW clone
321file and a directory for each connection numbered
322.CW 0
323to
324.CW n .
325Each connection directory contains files to control one
326connection and to send and receive information.
327A TCP connection directory looks like this:
.P1
cpu% cd /net/tcp/2
cpu% ls -l
--rw-rw---- I 0 ehg    bootes 0 Jul 13 21:14 ctl
--rw-rw---- I 0 ehg    bootes 0 Jul 13 21:14 data
--rw-rw---- I 0 ehg    bootes 0 Jul 13 21:14 listen
--r--r--r-- I 0 bootes bootes 0 Jul 13 21:14 local
--r--r--r-- I 0 bootes bootes 0 Jul 13 21:14 remote
--r--r--r-- I 0 bootes bootes 0 Jul 13 21:14 status
cpu% cat local remote status
135.104.9.31 5012
135.104.53.11 564
tcp/2 1 Established connect
cpu%
.P2
343The files
344.CW local ,
345.CW remote ,
346and
347.CW status
348supply information about the state of the connection.
349The
350.CW data
351and
352.CW ctl
353files
354provide access to the process end of the stream implementing the protocol.
355The
356.CW listen
357file is used to accept incoming calls from the network.
358.PP
359The following steps establish a connection.
360.IP 1)
361The clone device of the
362appropriate protocol directory is opened to reserve an unused connection.
363.IP 2)
364The file descriptor returned by the open points to the
365.CW ctl
366file of the new connection.
367Reading that file descriptor returns an ASCII string containing
368the connection number.
369.IP 3)
370A protocol/network specific ASCII address string is written to the
371.CW ctl
372file.
373.IP 4)
374The path of the
375.CW data
376file is constructed using the connection number.
377When the
378.CW data
379file is opened the connection is established.
380.LP
381A process can read and write this file descriptor
382to send and receive messages from the network.
383If the process opens the
384.CW listen
385file it blocks until an incoming call is received.
386An address string written to the
387.CW ctl
388file before the listen selects the
389ports or services the process is prepared to accept.
390When an incoming call is received, the open completes
391and returns a file descriptor
392pointing to the
393.CW ctl
394file of the new connection.
395Reading the
396.CW ctl
397file yields a connection number used to construct the path of the
398.CW data
399file.
400A connection remains established while any of the files in the connection directory
401are referenced or until a close is received from the network.
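.PP
A minimal C sketch of these four steps for TCP, assuming the conventional
mount point
.CW /net/tcp
and a
.CW connect
command of the same form as in the Ethernet example, is:
.P1
#include <u.h>
#include <libc.h>

void
main(void)
{
	int cfd, dfd, n;
	char num[32], data[64];

	/* 1: open the clone file to reserve a connection */
	cfd = open("/net/tcp/clone", ORDWR);
	if(cfd < 0)
		exits("clone");

	/* 2: read back the connection number */
	n = read(cfd, num, sizeof(num)-1);
	if(n <= 0)
		exits("read");
	num[n] = 0;

	/* 3: write the protocol-specific address string to the ctl file */
	if(fprint(cfd, "connect 135.104.53.11!564") < 0)
		exits("connect");

	/* 4: opening the data file completes the connection */
	snprint(data, sizeof(data), "/net/tcp/%d/data", atoi(num));
	dfd = open(data, ORDWR);
	if(dfd < 0)
		exits("open data");
	write(dfd, "hello", 5);
	exits(0);
}
.P2
The
.CW dial
library routine described later performs this sequence on the caller's
behalf, after asking the connection server to produce the address string.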
402.NH 2
403Streams
404.PP
405A
406.I stream
407[Rit84a][Presotto] is a bidirectional channel connecting a
408physical or pseudo-device to user processes.
409The user processes insert and remove data at one end of the stream.
410Kernel processes acting on behalf of a device insert data at
411the other end.
412Asynchronous communications channels such as pipes,
413TCP conversations, Datakit conversations, and RS232 lines are implemented using
414streams.
415.PP
416A stream comprises a linear list of
417.I "processing modules" .
418Each module has both an upstream (toward the process) and
419downstream (toward the device)
420.I "put routine" .
421Calling the put routine of the module on either end of the stream
422inserts data into the stream.
423Each module calls the succeeding one to send data up or down the stream.
424.PP
425An instance of a processing module is represented by a pair of
426.I queues ,
427one for each direction.
428The queues point to the put procedures and can be used
429to queue information traveling along the stream.
430Some put routines queue data locally and send it along the stream at some
431later time, either due to a subsequent call or an asynchronous
432event such as a retransmission timer or a device interrupt.
433Processing modules create helper kernel processes to
434provide a context for handling asynchronous events.
435For example, a helper kernel process awakens periodically
436to perform any necessary TCP retransmissions.
437The use of kernel processes instead of serialized run-to-completion service routines
438differs from the implementation of Unix streams.
439Unix service routines cannot
440use any blocking kernel resource and they lack a local long-lived state.
441Helper kernel processes solve these problems and simplify the stream code.
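.PP
The shape of this interface can be suggested by a skeletal declaration;
the actual kernel structures differ in detail, so the following is only
a sketch:
.P1
typedef struct Block Block;
typedef struct Queue Queue;

struct Queue
{
	void	(*put)(Queue*, Block*);	/* pass a block to this module */
	Queue	*next;			/* queue of the succeeding module */
	Block	*first;			/* blocks held locally, if any */
	Block	*last;
};

/* a pass-through put routine: hand each block to the next module */
void
passput(Queue *q, Block *b)
{
	(*q->next->put)(q->next, b);
}
.P2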
442.PP
443There is no implicit synchronization in our streams.
444Each processing module must ensure that concurrent processes using the stream
445are synchronized.
446This maximizes concurrency but introduces the
447possibility of deadlock.
448However, deadlocks are easily avoided by careful programming; to
449date they have not caused us problems.
450.PP
451Information is represented by linked lists of kernel structures called
452.I blocks .
453Each block contains a type, some state flags, and pointers to
454an optional buffer.
455Block buffers can hold either data or control information, i.e., directives
456to the processing modules.
457Blocks and block buffers are dynamically allocated from kernel memory.
458.NH 3
459User Interface
460.PP
461A stream is represented at user level as two files,
462.CW ctl
463and
464.CW data .
465The actual names can be changed by the device driver using the stream,
466as we saw earlier in the example of the UART driver.
467The first process to open either file creates the stream automatically.
468The last close destroys it.
469Writing to the
470.CW data
471file copies the data into kernel blocks
472and passes them to the downstream put routine of the first processing module.
473A write of less than 32K is guaranteed to be contained by a single block.
474Concurrent writes to the same stream are not synchronized, although the
47532K block size assures atomic writes for most protocols.
476The last block written is flagged with a delimiter
477to alert downstream modules that care about write boundaries.
478In most cases the first put routine calls the second, the second
479calls the third, and so on until the data is output.
480As a consequence, most data is output without context switching.
481.PP
482Reading from the
483.CW data
484file returns data queued at the top of the stream.
485The read terminates when the read count is reached
486or when the end of a delimited block is encountered.
487A per stream read lock ensures only one process
488can read from a stream at a time and guarantees
489that the bytes read were contiguous bytes from the
490stream.
491.PP
492Like UNIX streams [Rit84a],
493Plan 9 streams can be dynamically configured.
494The stream system intercepts and interprets
495the following control blocks:
496.IP "\f(CWpush\fP \fIname\fR" 15
497adds an instance of the processing module
498.I name
499to the top of the stream.
500.IP \f(CWpop\fP 15
501removes the top module of the stream.
502.IP \f(CWhangup\fP 15
503sends a hangup message
504up the stream from the device end.
505.LP
506Other control blocks are module-specific and are interpreted by each
507processing module
508as they pass.
509.PP
510The convoluted syntax and semantics of the UNIX
511.CW ioctl
512system call convinced us to leave it out of Plan 9.
513Instead,
514.CW ioctl
515is replaced by the
516.CW ctl
517file.
518Writing to the
519.CW ctl
520file
521is identical to writing to a
522.CW data
523file except the blocks are of type
524.I control .
525A processing module parses each control block it sees.
526Commands in control blocks are ASCII strings, so
527byte ordering is not an issue when one system
528controls streams in a name space implemented on another processor.
529The time to parse control blocks is not important, since control
530operations are rare.
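.PP
For example, a user program can reconfigure a stream simply by writing
such a command to its control file; in this sketch both the connection
directory and the module name
.CW compress
are illustrative:
.P1
#include <u.h>
#include <libc.h>

void
main(void)
{
	int cfd;

	/* ctl file of an existing conversation */
	cfd = open("/net/dk/5/ctl", OWRITE);
	if(cfd < 0)
		exits("open");
	/* the stream system interprets this block as a push request */
	if(fprint(cfd, "push compress") < 0)
		exits("push");
	close(cfd);
	exits(0);
}
.P2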
531.NH 3
532Device Interface
533.PP
534The module at the downstream end of the stream is part of a device interface.
535The particulars of the interface vary with the device.
536Most device interfaces consist of an interrupt routine, an output
537put routine, and a kernel process.
538The output put routine stages data for the
539device and starts the device if it is stopped.
540The interrupt routine wakes up the kernel process whenever
541the device has input to be processed or needs more output staged.
542The kernel process puts information up the stream or stages more data for output.
543The division of labor among the different pieces varies depending on
544how much must be done at interrupt level.
545However, the interrupt routine may not allocate blocks or call
546a put routine since both actions require a process context.
547.NH 3
548Multiplexing
549.PP
550The conversations using a protocol device must be
551multiplexed onto a single physical wire.
552We push a multiplexer processing module
553onto the physical device stream to group the conversations.
554The device end modules on the conversations add the necessary header
555onto downstream messages and then put them to the module downstream
556of the multiplexer.
557The multiplexing module looks at each message moving up its stream and
558puts it to the correct conversation stream after stripping
559the header controlling the demultiplexing.
560.PP
561This is similar to the Unix implementation of multiplexer streams.
562The major difference is that we have no general structure that
563corresponds to a multiplexer.
564Each attempt to produce a generalized multiplexer created a more complicated
565structure and underlined the basic difficulty of generalizing this mechanism.
566We now code each multiplexer from scratch and favor simplicity over
567generality.
568.NH 3
569Reflections
570.PP
Despite five years' experience and the efforts of many programmers,
572we remain dissatisfied with the stream mechanism.
573Performance is not an issue;
574the time to process protocols and drive
575device interfaces continues to dwarf the
576time spent allocating, freeing, and moving blocks
577of data.
However, the mechanism remains inordinately
579complex.
580Much of the complexity results from our efforts
581to make streams dynamically configurable, to
582reuse processing modules on different devices
583and to provide kernel synchronization
584to ensure data structures
585don't disappear under foot.
586This is particularly irritating since we seldom use these properties.
587.PP
588Streams remain in our kernel because we are unable to
589devise a better alternative.
590Larry Peterson's X-kernel [Pet89a]
591is the closest contender but
592doesn't offer enough advantage to switch.
593If we were to rewrite the streams code, we would probably statically
594allocate resources for a large fixed number of conversations and burn
595memory in favor of less complexity.
596.NH
597The IL Protocol
598.PP
599None of the standard IP protocols is suitable for transmission of
6009P messages over an Ethernet or the Internet.
601TCP has a high overhead and does not preserve delimiters.
602UDP, while cheap, does not provide reliable sequenced delivery.
603Early versions of the system used a custom protocol that was
604efficient but unsatisfactory for internetwork transmission.
605When we implemented IP, TCP, and UDP we looked around for a suitable
606replacement with the following properties:
607.IP \(bu
608Reliable datagram service with sequenced delivery
609.IP \(bu
610Runs over IP
611.IP \(bu
612Low complexity, high performance
613.IP \(bu
614Adaptive timeouts
615.LP
616None met our needs so a new protocol was designed.
617IL is a lightweight protocol designed to be encapsulated by IP.
618It is a connection-based protocol
619providing reliable transmission of sequenced messages between machines.
620No provision is made for flow control since the protocol is designed to transport RPC
621messages between client and server.
622A small outstanding message window prevents too
623many incoming messages from being buffered;
624messages outside the window are discarded
625and must be retransmitted.
626Connection setup uses a two way handshake to generate
627initial sequence numbers at each end of the connection;
628subsequent data messages increment the
629sequence numbers allowing
the receiver to resequence out-of-order messages.
631In contrast to other protocols, IL does not do blind retransmission.
632If a message is lost and a timeout occurs, a query message is sent.
633The query message is a small control message containing the current
634sequence numbers as seen by the sender.
635The receiver responds to a query by retransmitting missing messages.
636This allows the protocol to behave well in congested networks,
637where blind retransmission would cause further
638congestion.
639Like TCP, IL has adaptive timeouts.
640A round-trip timer is used
641to calculate acknowledge and retransmission times in terms of the network speed.
642This allows the protocol to perform well on both the Internet and on local Ethernets.
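.PP
As an illustration only, a smoothed round-trip estimator of the usual
form can drive such a timer; the constants below are typical of this
kind of estimator and are not taken from IL:
.P1
/* illustrative smoothed round-trip estimator; not IL's actual code */
int
rttupdate(int *srtt, int measured)
{
	/* new estimate = 7/8 old estimate + 1/8 new measurement */
	*srtt = (7 * *srtt + measured) / 8;

	/* retransmit after a small multiple of the estimate */
	return 2 * *srtt;
}
.P2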
643.PP
644In keeping with the minimalist design of the rest of the kernel, IL is small.
645The entire protocol is 847 lines of code, compared to 2200 lines for TCP.
646IL is our protocol of choice.
647.NH
648Network Addressing
649.PP
650A uniform interface to protocols and devices is not sufficient to
651support the transparency we require.
652Since each network uses a different
653addressing scheme,
654the ASCII strings written to a control file have no common format.
655As a result, every tool must know the specifics of the networks it
656is capable of addressing.
657Moreover, since each machine supplies a subset
658of the available networks, each user must be aware of the networks supported
659by every terminal and server machine.
660This is obviously unacceptable.
661.PP
662Several possible solutions were considered and rejected; one deserves
663more discussion.
664We could have used a user-level file server
665to represent the network name space as a Plan 9 file tree.
666This global naming scheme has been implemented in other distributed systems.
667The file hierarchy provides paths to
668directories representing network domains.
669Each directory contains
670files representing the names of the machines in that domain;
671an example might be the path
672.CW /net/name/usa/edu/mit/ai .
673Each machine file contains information like the IP address of the machine.
674We rejected this representation for several reasons.
675First, it is hard to devise a hierarchy encompassing all representations
676of the various network addressing schemes in a uniform manner.
677Datakit and Ethernet address strings have nothing in common.
678Second, the address of a machine is
679often only a small part of the information required to connect to a service on
680the machine.
681For example, the IP protocols require symbolic service names to be mapped into
682numeric port numbers, some of which are privileged and hence special.
683Information of this sort is hard to represent in terms of file operations.
684Finally, the size and number of the networks being represented burdens users with
685an unacceptably large amount of information about the organization of the network
686and its connectivity.
687In this case the Plan 9 representation of a
688resource as a file is not appropriate.
689.PP
690If tools are to be network independent, a third-party server must resolve
691network names.
692A server on each machine, with local knowledge, can select the best network
693for any particular destination machine or service.
694Since the network devices present a common interface,
695the only operation which differs between networks is name resolution.
696A symbolic name must be translated to
697the path of the clone file of a protocol
698device and an ASCII address string to write to the
699.CW ctl
700file.
701A connection server (CS) provides this service.
702.NH 2
703Network Database
704.PP
705On most systems several
706files such as
707.CW /etc/hosts ,
708.CW /etc/networks ,
709.CW /etc/services ,
710.CW /etc/hosts.equiv ,
711.CW /etc/bootptab ,
712and
713.CW /etc/named.d
714hold network information.
715Much time and effort is spent
716administering these files and keeping
717them mutually consistent.
718Tools attempt to
719automatically derive one or more of the files from
720information in other files but maintenance continues to be
721difficult and error prone.
722.PP
723Since we were writing an entirely new system, we were free to
724try a simpler approach.
725One database on a shared server contains all the information
726needed for network administration.
727Two ASCII files comprise the main database:
728.CW /lib/ndb/local
729contains locally administered information and
730.CW /lib/ndb/global
731contains information imported from elsewhere.
732The files contain sets of attribute/value pairs of the form
733.I attr\f(CW=\fPvalue ,
734where
735.I attr
736and
737.I value
738are alphanumeric strings.
739Systems are described by multi-line entries;
740a header line at the left margin begins each entry followed by zero or more
741indented attribute/value pairs specifying
742names, addresses, properties, etc.
743For example, the entry for our CPU server
744specifies a domain name, an IP address, an Ethernet address,
745a Datakit address, a boot file, and supported protocols.
.P1
sys=helix
	dom=helix.research.bell-labs.com
	bootf=/mips/9power
	ip=135.104.9.31 ether=0800690222f0
	dk=nj/astro/helix
	proto=il flavor=9cpu
.P2
754If several systems share entries such as
755network mask and gateway, we specify that information
756with the network or subnetwork instead of the system.
757The following entries define a Class B IP network and
758a few subnets derived from it.
759The entry for the network specifies the IP mask,
760file system, and authentication server for all systems
761on the network.
762Each subnetwork specifies its default IP gateway.
.P1
ipnet=mh-astro-net ip=135.104.0.0 ipmask=255.255.255.0
	fs=bootes.research.bell-labs.com
	auth=1127auth
ipnet=unix-room ip=135.104.117.0
	ipgw=135.104.117.1
ipnet=third-floor ip=135.104.51.0
	ipgw=135.104.51.1
ipnet=fourth-floor ip=135.104.52.0
	ipgw=135.104.52.1
.P2
774Database entries also define the mapping of service names
775to port numbers for TCP, UDP, and IL.
.P1
tcp=echo	port=7
tcp=discard	port=9
tcp=systat	port=11
tcp=daytime	port=13
.P2
782.PP
783All programs read the database directly so
784consistency problems are rare.
However, the database files can become large.
786Our global file, containing all information about
787both Datakit and Internet systems in AT&T, has 43,000
788lines.
789To speed searches, we build hash table files for each
790attribute we expect to search often.
791The hash file entries point to entries
792in the master files.
793Every hash file contains the modification time of its master
794file so we can avoid using an out-of-date hash table.
795Searches for attributes that aren't hashed or whose hash table
is out-of-date still work; they just take longer.
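.PP
For illustration, a lookup using the current
.I ndb
library interface (the attribute and value come from the example entry
above, and the library details postdate the code described here) might
look like this:
.P1
#include <u.h>
#include <libc.h>
#include <bio.h>
#include <ndb.h>

void
main(void)
{
	Ndb *db;
	Ndbs s;
	Ndbtuple *t, *nt;

	db = ndbopen("/lib/ndb/local");
	if(db == nil)
		exits("ndbopen");
	/* find the entry whose sys attribute is helix and print its pairs */
	t = ndbsearch(db, &s, "sys", "helix");
	for(nt = t; nt != nil; nt = nt->entry)
		print("%s=%s\n", nt->attr, nt->val);
	if(t != nil)
		ndbfree(t);
	exits(0);
}
.P2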
797.NH 2
798Connection Server
799.PP
800On each system a user level connection server process, CS, translates
801symbolic names to addresses.
802CS uses information about available networks, the network database, and
803other servers (such as DNS) to translate names.
804CS is a file server serving a single file,
805.CW /net/cs .
806A client writes a symbolic name to
807.CW /net/cs
808then reads one line for each matching destination reachable
809from this system.
810The lines are of the form
811.I "filename message",
812where
813.I filename
814is the path of the clone file to open for a new connection and
815.I message
816is the string to write to it to make the connection.
817The following example illustrates this.
818.CW Ndb/csquery
819is a program that prompts for strings to write to
820.CW /net/cs
821and prints the replies.
.P1
% ndb/csquery
> net!helix!9fs
/net/il/clone 135.104.9.31!17008
/net/dk/clone nj/astro/helix!9fs
.P2
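.PP
A program can use CS directly in the same way; the following sketch
writes a request and reads back the translations, seeking back to
offset 0 before reading, as is conventional for such request/response
files:
.P1
#include <u.h>
#include <libc.h>

void
main(void)
{
	int fd, n;
	char buf[128];

	fd = open("/net/cs", ORDWR);
	if(fd < 0)
		exits("open");
	/* write the symbolic name, then read back one line per destination */
	if(write(fd, "net!helix!9fs", 13) < 0)
		exits("translate");
	seek(fd, 0, 0);
	while((n = read(fd, buf, sizeof(buf)-1)) > 0){
		buf[n] = 0;
		print("%s\n", buf);	/* clone file and dial string */
	}
	exits(0);
}
.P2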
828.PP
829CS provides meta-name translation to perform complicated
830searches.
831The special network name
832.CW net
833selects any network in common between source and
834destination supporting the specified service.
835A host name of the form \f(CW$\fIattr\f1
836is the name of an attribute in the network database.
837The database search returns the value
838of the matching attribute/value pair
839most closely associated with the source host.
840Most closely associated is defined on a per network basis.
841For example, the symbolic name
842.CW tcp!$auth!rexauth
843causes CS to search for the
844.CW auth
845attribute in the database entry for the source system, then its
846subnetwork (if there is one) and then its network.
.P1
% ndb/csquery
> net!$auth!rexauth
/net/il/clone 135.104.9.34!17021
/net/dk/clone nj/astro/p9auth!rexauth
/net/il/clone 135.104.9.6!17021
/net/dk/clone nj/astro/musca!rexauth
.P2
855.PP
856Normally CS derives naming information from its database files.
857For domain names however, CS first consults another user level
858process, the domain name server (DNS).
859If no DNS is reachable, CS relies on its own tables.
860.PP
861Like CS, the domain name server is a user level process providing
862one file,
863.CW /net/dns .
864A client writes a request of the form
865.I "domain-name type" ,
866where
867.I type
868is a domain name service resource record type.
869DNS performs a recursive query through the
870Internet domain name system producing one line
871per resource record found.  The client reads
872.CW /net/dns
873to retrieve the records.
874Like other domain name servers, DNS caches information
875learned from the network.
876DNS is implemented as a multi-process shared memory application
877with separate processes listening for network and local requests.
878.NH
879Library routines
880.PP
881The section on protocol devices described the details
882of making and receiving connections across a network.
883The dance is straightforward but tedious.
884Library routines are provided to relieve
885the programmer of the details.
886.NH 2
887Connecting
888.PP
889The
890.CW dial
891library call establishes a connection to a remote destination.
892It
893returns an open file descriptor for the
894.CW data
895file in the connection directory.
.P1
int  dial(char *dest, char *local, char *dir, int *cfdp)
.P2
899.IP \f(CWdest\fP 10
900is the symbolic name/address of the destination.
901.IP \f(CWlocal\fP 10
902is the local address.
903Since most networks do not support this, it is
904usually zero.
905.IP \f(CWdir\fP 10
906is a pointer to a buffer to hold the path name of the protocol directory
907representing this connection.
908.CW Dial
909fills this buffer if the pointer is non-zero.
910.IP \f(CWcfdp\fP 10
911is a pointer to a file descriptor for the
912.CW ctl
913file of the connection.
914If the pointer is non-zero,
915.CW dial
916opens the control file and tucks the file descriptor here.
917.LP
918Most programs call
919.CW dial
920with a destination name and all other arguments zero.
921.CW Dial
922uses CS to
923translate the symbolic name to all possible destination addresses
924and attempts to connect to each in turn until one works.
925Specifying the special name
926.CW net
927in the network portion of the destination
928allows CS to pick a network/protocol in common
929with the destination for which the requested service is valid.
930For example, assume the system
931.CW research.bell-labs.com
932has the Datakit address
933.CW nj/astro/research
934and IP addresses
935.CW 135.104.117.5
936and
937.CW 129.11.4.1 .
938The call
.P1
fd = dial("net!research.bell-labs.com!login", 0, 0, 0);
.P2
942tries in succession to connect to
943.CW nj/astro/research!login
944on the Datakit and both
945.CW 135.104.117.5!513
946and
947.CW 129.11.4.1!513
948across the Internet.
949.PP
950.CW Dial
951accepts addresses instead of symbolic names.
952For example, the destinations
953.CW tcp!135.104.117.5!513
954and
955.CW tcp!research.bell-labs.com!login
956are equivalent
957references to the same machine.
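.PP
A complete client is correspondingly small; this sketch dials the echo
service (served by the listener example in the next section) and relays
one message:
.P1
#include <u.h>
#include <libc.h>

void
main(void)
{
	int fd, n;
	char buf[256];

	/* let CS pick any network in common that offers the echo service */
	fd = dial("net!helix!echo", 0, 0, 0);
	if(fd < 0)
		exits("dial");
	write(fd, "hello", 5);
	n = read(fd, buf, sizeof(buf));
	if(n > 0)
		write(1, buf, n);
	exits(0);
}
.P2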
958.NH 2
959Listening
960.PP
961A program uses
962four routines to listen for incoming connections.
963It first
964.CW announce() s
965its intention to receive connections,
966then
967.CW listen() s
968for calls and finally
969.CW accept() s
970or
971.CW reject() s
972them.
973.CW Announce
974returns an open file descriptor for the
975.CW ctl
976file of a connection and fills
977.CW dir
978with the
979path of the protocol directory
980for the announcement.
.P1
int  announce(char *addr, char *dir)
.P2
984.CW Addr
985is the symbolic name/address announced;
986if it does not contain a service, the announcement is for
987all services not explicitly announced.
988Thus, one can easily write the equivalent of the
989.CW inetd
990program without
991having to announce each separate service.
992An announcement remains in force until the control file is
993closed.
994.LP
995.CW Listen
996returns an open file descriptor for the
997.CW ctl
998file and fills
999.CW ldir
1000with the path
1001of the protocol directory
1002for the received connection.
1003It is passed
1004.CW dir
1005from the announcement.
.P1
int  listen(char *dir, char *ldir)
.P2
1009.LP
1010.CW Accept
1011and
1012.CW reject
1013are called with the control file descriptor and
1014.CW ldir
1015returned by
.CW listen .
1017Some networks such as Datakit accept a reason for a rejection;
1018networks such as IP ignore the third argument.
.P1
int  accept(int ctl, char *ldir)
int  reject(int ctl, char *ldir, char *reason)
.P2
1023.PP
1024The following code implements a typical TCP listener.
1025It announces itself, listens for connections, and forks a new
1026process for each.
1027The new process echoes data on the connection until the
1028remote end closes it.
1029The "*" in the symbolic name means the announcement is valid for
1030any addresses bound to the machine the program is run on.
.P1
.ta 8n 16n 24n 32n 40n 48n 56n 64n
int
echo_server(void)
{
	int afd, dfd, lcfd;
	char adir[40], ldir[40];
	int n;
	char buf[256];

	afd = announce("tcp!*!echo", adir);
	if(afd < 0)
		return -1;

	for(;;){
		/* listen for a call */
		lcfd = listen(adir, ldir);
		if(lcfd < 0)
			return -1;

		/* fork a process to echo */
		switch(fork()){
		case 0:
			/* accept the call and open the data file */
			dfd = accept(lcfd, ldir);
			if(dfd < 0)
				return -1;

			/* echo until EOF */
			while((n = read(dfd, buf, sizeof(buf))) > 0)
				write(dfd, buf, n);
			exits(0);
		case -1:
			perror("forking");
		default:
			close(lcfd);
			break;
		}

	}
}
.P2
1073.NH
1074User Level
1075.PP
1076Communication between Plan 9 machines is done almost exclusively in
1077terms of 9P messages. Only the two services
1078.CW cpu
1079and
1080.CW exportfs
1081are used.
1082The
1083.CW cpu
1084service is analogous to
1085.CW rlogin .
1086However, rather than emulating a terminal session
1087across the network,
1088.CW cpu
1089creates a process on the remote machine whose name space is an analogue of the window
1090in which it was invoked.
1091.CW Exportfs
1092is a user level file server which allows a piece of name space to be
1093exported from machine to machine across a network. It is used by the
1094.CW cpu
1095command to serve the files in the terminal's name space when they are
1096accessed from the
1097cpu server.
1098.PP
1099By convention, the protocol and device driver file systems are mounted in a
1100directory called
1101.CW /net .
1102Although the per-process name space allows users to configure an
1103arbitrary view of the system, in practice their profiles build
1104a conventional name space.
1105.NH 2
1106Exportfs
1107.PP
1108.CW Exportfs
1109is invoked by an incoming network call.
1110The
1111.I listener
1112(the Plan 9 equivalent of
1113.CW inetd )
1114runs the profile of the user
1115requesting the service to construct a name space before starting
1116.CW exportfs .
1117After an initial protocol
1118establishes the root of the file tree being
1119exported,
1120the remote process mounts the connection,
1121allowing
1122.CW exportfs
1123to act as a relay file server. Operations in the imported file tree
1124are executed on the remote server and the results returned.
1125As a result
1126the name space of the remote machine appears to be exported into a
1127local file tree.
1128.PP
1129The
1130.CW import
1131command calls
1132.CW exportfs
1133on a remote machine, mounts the result in the local name space,
1134and
1135exits.
1136No local process is required to serve mounts;
11379P messages are generated by the kernel's mount driver and sent
1138directly over the network.
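.PP
In essence,
.CW import
dials a remote
.CW exportfs
and hands the resulting connection to the mount driver.
A much-simplified sketch, ignoring authentication and the initial
protocol and assuming the current five-argument form of
.CW mount
(the service name here is illustrative), is:
.P1
#include <u.h>
#include <libc.h>

void
main(void)
{
	int fd;

	/* connect to an exportfs serving the name space of helix */
	fd = dial("net!helix!exportfs", 0, 0, 0);
	if(fd < 0)
		exits("dial");
	/* mount the 9P connection after the existing /net, as import -a does */
	if(mount(fd, -1, "/net", MAFTER, "") < 0)
		exits("mount");
	exits(0);
}
.P2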
1139.PP
1140.CW Exportfs
1141must be multithreaded since the system calls
.CW open ,
1143.CW read
1144and
1145.CW write
1146may block.
1147Plan 9 does not implement the
1148.CW select
1149system call but does allow processes to share file descriptors,
1150memory and other resources.
1151.CW Exportfs
1152and the configurable name space
1153provide a means of sharing resources between machines.
1154It is a building block for constructing complex name spaces
1155served from many machines.
1156.PP
1157The simplicity of the interfaces encourages naive users to exploit the potential
1158of a richly connected environment.
1159Using these tools it is easy to gateway between networks.
1160For example a terminal with only a Datakit connection can import from the server
1161.CW helix :
.P1
import -a helix /net
telnet ai.mit.edu
.P2
1166The
1167.CW import
1168command makes a Datakit connection to the machine
1169.CW helix
1170where
it starts an instance of
1172.CW exportfs
1173to serve
1174.CW /net .
1175The
1176.CW import
1177command mounts the remote
1178.CW /net
1179directory after (the
1180.CW -a
1181option to
1182.CW import )
1183the existing contents
1184of the local
1185.CW /net
1186directory.
1187The directory contains the union of the local and remote contents of
1188.CW /net .
1189Local entries supersede remote ones of the same name so
1190networks on the local machine are chosen in preference
1191to those supplied remotely.
1192However, unique entries in the remote directory are now visible in the local
1193.CW /net
1194directory.
1195All the networks connected to
1196.CW helix ,
1197not just Datakit,
1198are now available in the terminal. The effect on the name space is shown by the following
1199example:
.P1
philw-gnot% ls /net
/net/cs
/net/dk
philw-gnot% import -a musca /net
philw-gnot% ls /net
/net/cs
/net/cs
/net/dk
/net/dk
/net/dns
/net/ether
/net/il
/net/tcp
/net/udp
.P2
1216.NH 2
1217Ftpfs
1218.PP
1219We decided to make our interface to FTP
1220a file system rather than the traditional command.
1221Our command,
1222.I ftpfs,
1223dials the FTP port of a remote system, prompts for login and password, sets image mode,
1224and mounts the remote file system onto
1225.CW /n/ftp .
1226Files and directories are cached to reduce traffic.
1227The cache is updated whenever a file is created.
1228Ftpfs works with TOPS-20, VMS, and various Unix flavors
1229as the remote system.
1230.NH
1231Cyclone Fiber Links
1232.PP
1233The file servers and CPU servers are connected by
1234high-bandwidth
1235point-to-point links.
1236A link consists of two VME cards connected by a pair of optical
1237fibers.
1238The VME cards use 33MHz Intel 960 processors and AMD's TAXI
1239fiber transmitter/receivers to drive the lines at 125 Mbit/sec.
1240Software in the VME card reduces latency by copying messages from system memory
1241to fiber without intermediate buffering.
1242.NH
1243Performance
1244.PP
1245We measured both latency and throughput
1246of reading and writing bytes between two processes
1247for a number of different paths.
1248Measurements were made on two- and four-CPU SGI Power Series processors.
1249The CPUs are 25 MHz MIPS 3000s.
1250The latency is measured as the round trip time
1251for a byte sent from one process to another and
1252back again.
1253Throughput is measured using 16k writes from
1254one process to another.
.DS C
.TS
box, tab(:);
c s s
c | c | c
l | n | n.
Table 1 - Performance
_
test:throughput:latency
:MBytes/sec:millisec
_
pipes:8.15:.255
_
IL/ether:1.02:1.42
_
URP/Datakit:0.22:1.75
_
Cyclone:3.2:0.375
.TE
.DE
1275.NH
1276Conclusion
1277.PP
1278The representation of all resources as file systems
1279coupled with an ASCII interface has proved more powerful
1280than we had originally imagined.
1281Resources can be used by any computer in our networks
1282independent of byte ordering or CPU type.
1283The connection server provides an elegant means
1284of decoupling tools from the networks they use.
1285Users successfully use Plan 9 without knowing the
1286topology of the system or the networks they use.
More information about 9P can be found in Section 5 of the Plan 9 Programmer's
1288Manual, Volume I.
1289.NH
1290References
1291.LP
1292[Pike90] R. Pike, D. Presotto, K. Thompson, H. Trickey,
1293``Plan 9 from Bell Labs'',
1294.I
1295UKUUG Proc. of the Summer 1990 Conf. ,
1296London, England,
12971990.
1298.LP
1299[Needham] R. Needham, ``Names'', in
1300.I
1301Distributed systems,
1302.R
1303S. Mullender, ed.,
1304Addison Wesley, 1989.
1305.LP
1306[Presotto] D. Presotto, ``Multiprocessor Streams for Plan 9'',
1307.I
1308UKUUG Proc. of the Summer 1990 Conf. ,
1309.R
1310London, England, 1990.
1311.LP
[Met80] R. Metcalfe, D. Boggs, C. Crane, E. Taft and J. Hupp, ``The
1313Ethernet Local Network: Three reports'',
1314.I
1315CSL-80-2,
1316.R
1317XEROX Palo Alto Research Center, February 1980.
1318.LP
1319[Fra80] A. G. Fraser, ``Datakit - A Modular Network for Synchronous
1320and Asynchronous Traffic'',
1321.I
1322Proc. Int'l Conf. on Communication,
1323.R
1324Boston, June 1980.
1325.LP
1326[Pet89a] L. Peterson, ``RPC in the X-Kernel: Evaluating new Design Techniques'',
1327.I
1328Proc. Twelfth Symp. on Op. Sys. Princ.,
1329.R
Litchfield Park, AZ, December 1989.
1331.LP
1332[Rit84a] D. M. Ritchie, ``A Stream Input-Output System'',
1333.I
AT&T Bell Laboratories Technical Journal, 63(8),
1335.R
1336October 1984.