.ds TM \u\s-2TM\s+2\d
.nr dT 6
.nr XT 6
.TL
The Styx Architecture for Distributed Systems
.AU
Rob Pike
Dennis M. Ritchie
.AI
Computing Science Research Center
Lucent Technologies, Bell Labs
Murray Hill, New Jersey
USA
.FS
.FA
Originally appeared in
.I "Bell Labs Technical Journal" ,
Vol. 4,
No. 2,
April-June 1999,
pp. 146-152.
.br
Copyright © 1999 Lucent Technologies Inc.  All rights reserved.
.FE
.AB
A distributed system is constructed from a set of relatively
independent components that form a unified, but geographically and
functionally diverse entity.  Examples include networked operating
systems, Internet services, the national telephone
switching system, and in general
all the technology using today's diverse digital
networks.  Nevertheless, distributed systems remain difficult
to design, build, and maintain, primarily because of the lack
of a clean, perspicuous interconnection model for the
components.
.LP
Our experience with two distributed operating systems,
Plan 9 and Inferno, encourages us to propose such a model.
These systems depend on, advocate, and generally push to the
limit a fruitful idea: to present their
resources as files in a hierarchical name space.
The objects appearing as files may represent stored data, but may
also be devices, dynamic information sources, interfaces to services,
and control points.  The approach unifies and provides basic naming,
structuring, and access control mechanisms for all system resources.
A simple underlying network protocol, Styx, forms
the core of the architecture by presenting a common
language for communication within the system.
.LP
Even within non-distributed systems, the presentation of services
as files advantageously extends a familiar scheme for naming, classifying,
and connecting to system resources.
More important, the approach provides a natural way to build
distributed systems, by using well-known technology for attaching
remote file systems.
If resources are represented as files,
and there are remote file systems, one has
a distributed system: resources available in one place
are usable from another.
.AE
.SH
Introduction
.LP
The Styx protocol is a variant of a protocol called
.I 9P
that
was developed for the Plan 9 operating system[9man].
For simplicity, we will use the name
Styx throughout this paper; the difference concerns only the initialization of
a connection.
.LP
The original idea behind Styx was to encode file operations between
client programs and the file system,
to be translated into messages for transmission on a computer network.
Using this technology,
Plan 9 separates the file server\(ema central repository for
permanent file storage\(emboth from the CPU server\(ema large
shared-memory multiprocessor\(emand from the user terminals.
This physical separation of function was central to the original
design of the system;
what was unexpected was how well the model could be used to
solve a wide variety of problems not usually thought of as
file system issues.
.LP
The breakthrough was to realize that by representing
a computing resource as a form of file system,
many of the difficulties of making that resource available
across the network would disappear naturally, because
Styx could export the resource transparently.
For example,
the Plan 9 window system,
.CW 8½
[Pike91],
is implemented as a dynamic file server that publishes
files with names like
.CW /dev/mouse
and
.CW /dev/screen
to provide access to the local hardware.
The
.CW /dev/mouse
file, for instance,
may be opened and read like a regular file, in the manner of UNIX\*(TM device
files, but under
.CW 8½
it is multiplexed: each client program has a private
.CW /dev/mouse
file that returns mouse events only when the client's window
is the active one on the display.
This design provides a clean, simple mechanism for controlling
access to the mouse.
Its real strength, though, is that the representation of the window system's
resources as files allows Styx to make those resources available across the
network.
For example, an interactive graphics program may be run on a CPU server
simply by having
.CW 8½
serve the appropriate files to that machine.
.LP
Note that although the resources published by Styx behave like files\(emthey
have file names, file permissions, and file access methods\(emthey do not
need to exist as standard files on disk.
The
.CW /dev/mouse
file is accessed by standard file I/O mechanisms but is nonetheless a
transient object fabricated dynamically by a running program;
it has no permanent existence.
.LP
By following this approach throughout the system, Plan 9 achieves
a remarkable degree of transparency in the distribution of resources[PPTTW93].
Besides interactive graphics, services such as debugging, maintenance,
file backup, and even access to the underlying network hardware
can be made available across the network using Styx, permitting
the construction of distributed applications and services
using nothing more sophisticated than file I/O.
.SH
The Styx protocol
.LP
Styx's place in the world is analogous to
Sun NFS[RFC][NFS] or Microsoft CIFS[CIFS], although it is simpler and easier to implement
[Welc94].
Furthermore, NFS and CIFS are designed for sharing regular disk files; NFS in particular
is intimately tied to the implementation and caching strategy
of the underlying UNIX file system.
Compared with Styx, NFS and CIFS are clumsier at exporting dynamic device-like
files such as
.CW /dev/mouse .
.LP
Styx provides a view of a hierarchical, tree-shaped
file system name space[Nee89], together with access information about
the files (permissions, sizes, dates) and the means to read and write
the files.
Its users (that is, the people who write application programs)
don't see the protocol itself; instead they see files that they
read and write, and that provide information or change information.
.LP
In use, a Styx
.I client
is an entity on one machine that establishes communication with
another entity, the
.I server ,
on the same or another machine.
The client mechanisms may be built into the operating system, as they
are in Plan 9 or Inferno[INF1][INF2], or into application libraries;
a server may be part of the operating system, or just as often
may be application code on a separate server machine.  In any case, the
client and server entities
communicate by exchanging messages, and the effect is that the client
sees a hierarchical file system that exists on the server.
The Styx protocol is the specification of the messages that are exchanged.
.LP
At one level, Styx consists of messages of 13 types for
.RS
.IP \(bu
Starting communication (attaching to a file system);
.IP \(bu
Navigating the file system (that is, specifying and
gaining a handle for a named file);
.IP \(bu
Reading and writing a file; and
.IP \(bu
Performing file status inquiries and changes.
.RE
.LP
However, application writers simply code requests to open, read, or write
files; a library or the operating system translates the requests
into the necessary byte sequences transmitted over a communication
channel.  The Styx protocol proper specifies the interpretation of these
byte sequences.  It fits, approximately, at the OSI Session Layer level
of the ISO standard classification.
Its specification is independent of most details of machine architecture
and it has been successfully used among machines of varying instruction
sets and data layout.
The protocol is summarized in Table 1.
.KF
.TS
center box;
l l
--
lfCW l.
Name	Description
attach	Authenticate user of connection; return FID
clone	Duplicate FID
walk	Advance FID one level of name hierarchy
open	Check permissions for file I/O
create	Create new file
read	Read contents of file
write	Write contents of file
close	Discard FID
remove	Remove file
stat	Report file state: permissions, etc.
wstat	Modify file state
error	Return error condition for failed operation
flush	Disregard outstanding I/O requests
.TE
.ce 100
.ps -1
Table 1. Summary of Styx messages.
.ps
.ce 0
.KE
.LP
In use, an operation such as
.P1
open("/usr/rob/.profile", O_READ);
.P2
is translated by the underlying system into a sequence of Styx messages.
After establishing the initial connection to the
file server, an
.CW attach
message authenticates the user (the person or agent accessing the files) and
returns an object called a
.CW FID
(file ID) that represents the root of the hierarchy on the server.
When the
.CW open()
operation is executed, it proceeds as follows.
.RS
.IP \(bu
A
.CW clone
message duplicates the root
.CW FID ,
returning a new
.CW FID
that can navigate the hierarchy without losing the connection to the root.
.IP \(bu
The new
.CW FID
is then moved to the file
.CW /usr/rob/.profile
by a sequence of
.CW walk
messages that step along, one path component at a time
.CW usr , (
.CW rob ,
.CW .profile ).
.IP \(bu
Finally, an
.CW open
message checks that the user has permission to read the file,
permitting subsequent
.CW read
and
.CW write
operations (messages) on the
.CW FID .
.IP \(bu
Once I/O is completed, the
.CW close
message will release the
.CW FID .
.RE
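.LP
The steps above can be sketched as a small simulation.
The Python fragment below is illustrative only: the helper
.CW styx_open_trace ,
its tuple layout, and the FID numbers are assumptions made for this sketch,
not part of the Styx message format.
It merely lists, in order, the messages a client would issue for a path.

```python
def styx_open_trace(path, root_fid=0, new_fid=1):
    """Return the Styx message names a client would send to open `path`.

    Hypothetical helper: real messages carry tags, binary fields and
    replies; here each message is just a (name, fid, argument) tuple.
    """
    # Split the path into components, dropping the empty one before '/'.
    components = [c for c in path.split("/") if c]
    trace = [("clone", root_fid, new_fid)]      # duplicate the root FID
    for name in components:                     # one walk per path component
        trace.append(("walk", new_fid, name))
    trace.append(("open", new_fid, "read"))     # permission check for I/O
    # read and write messages would be exchanged here
    trace.append(("close", new_fid, None))      # release the FID after I/O
    return trace
```

.LP
For
.CW /usr/rob/.profile
the trace holds one
.CW clone ,
three
.CW walk
messages, one
.CW open ,
and one
.CW close .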
.LP
At a lower level, implementations of Styx depend only on a reliable,
byte-stream Transport communications layer. For example, it runs over either
TCP/IP, the standard transmission control protocol
and Internet protocol,
or Internet link (IL), which is a sequenced, reliable datagram protocol
using IP packets.
It is worth emphasizing, though, that the model does not require the
existence of a network to join the components; Styx runs fine
over a Unix pipe or even using shared memory.
The strength of the approach is not so much how it works over a network
as that its behavior over a network is identical to its behavior locally.
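.LP
The point about transports can be illustrated with a short Python sketch in which
the same message routines run unchanged over a local pipe and over a socket pair.
The four-byte length prefix used here is a framing invented for the example,
not the Styx encoding, and the names
.CW send_msg ,
.CW recv_msg ,
.CW over_pipe ,
and
.CW over_socket
are hypothetical.

```python
import os
import socket
import struct


def send_msg(write, payload):
    """Write one length-prefixed message (hypothetical framing, not Styx's)."""
    write(struct.pack("<I", len(payload)) + payload)


def recv_msg(read):
    """Read one length-prefixed message using only a blocking read(n) call."""
    def read_exact(n):
        data = b""
        while len(data) < n:
            chunk = read(n - len(data))
            if not chunk:
                raise EOFError("stream closed mid-message")
            data += chunk
        return data
    (length,) = struct.unpack("<I", read_exact(4))
    return read_exact(length)


def over_pipe(payload):
    """Send and receive one small message through a local pipe."""
    r, w = os.pipe()
    try:
        send_msg(lambda b: os.write(w, b), payload)
        return recv_msg(lambda n: os.read(r, n))
    finally:
        os.close(r)
        os.close(w)


def over_socket(payload):
    """Send and receive the same message through a connected socket pair."""
    a, b = socket.socketpair()
    try:
        send_msg(a.sendall, payload)
        return recv_msg(b.recv)
    finally:
        a.close()
        b.close()
```

.LP
Only the two small wrappers know which transport is in use;
the message code above them sees a reliable byte stream in both cases.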
.SH
Architectural approach
.LP
Styx, as a file system protocol, is merely a component in a
more encompassing approach
to system design: the presentation of resources as files.
This approach will be discussed using a sequence of examples.
.SH
.I "Example: networking
.LP
As an example, access to a TCP/IP network in Inferno and Plan 9 systems
appears as a piece of a file system, with (abbreviated) structure
as follows[PrWi93]:
.P1
/net/
	dns/
	tcp/
		clone
		stats
		0/
			ctl
			status
			data
			listen
		1/
			...
		...
	ether0/
		0/
			ctl
			status
			...
		1/
			...
	...
.P2
This represents a file system structure in which one can name, read, and write `files' with
names like
.CW /net/dns ,
.CW /net/tcp/clone ,
.CW /net/tcp/0/ctl
and so on;
there are directories of files
.CW /net/tcp
and
.CW /net/ether0 .
On the machine that actually has the network interface, all of these
things that look like files are constructed by the kernel drivers that maintain
the TCP/IP stack; they are not real files on a disk.
Operations on the `files' turn into operations sent to the device drivers.
.LP
Suppose an application wishes to establish a connection over TCP/IP to
.CW www.bell-labs.com .
The first task is to translate the domain name
.CW www.bell-labs.com
to a numerical internet address; this is a complicated process, generally
involving communicating with local and remote Domain Name Servers.
In the Styx model, this is done by opening the file
.CW /net/dns
and writing the literal string
.CW www.bell-labs.com
on the file; then the same file is read.
It will return the string
.CW 204.178.16.5
as a sequence of 12 characters.
.LP
Once the numerical Internet address is acquired, the connection must be established;
this is done by opening
.CW /net/tcp/clone
and reading from it a string that specifies a directory like
.CW /net/tcp/43 ,
which represents a new, unique TCP/IP channel.
To establish the connection,
write a message like
.CW "connect 204.178.16.5
on the control file for that connection,
.CW /net/tcp/43/ctl .
Subsequently, communication with
.CW www.bell-labs.com
is done by reading and
writing on the file
.CW /net/tcp/43/data .
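.LP
The lookup and connection steps above can be sketched in Python against an
in-memory stand-in for the
.CW /net
tree.
Everything in the sketch is an assumption made for illustration: the class
.CW FakeNet ,
its
.CW read
and
.CW write
methods, and the function
.CW dial
are hypothetical, the lookup file is taken to be
.CW /net/dns
as in the tree shown earlier, and the data file simply returns what was written to it.
On a real system the same sequence would be ordinary file I/O on the served files.

```python
class FakeNet:
    """In-memory stand-in for a /net tree (purely illustrative)."""

    def __init__(self, hosts):
        self.hosts = hosts          # domain name -> dotted address
        self.answer = ""            # pending reply on /net/dns
        self.next_conn = 43         # first connection number to hand out
        self.ctl_log = {}           # connection directory -> control messages
        self.data = {}              # connection directory -> text written

    def write(self, path, text):
        if path == "/net/dns":
            self.answer = self.hosts[text]
        elif path.endswith("/ctl"):
            self.ctl_log[path[:-len("/ctl")]].append(text)
        elif path.endswith("/data"):
            self.data[path[:-len("/data")]] += text
        else:
            raise FileNotFoundError(path)

    def read(self, path):
        if path == "/net/dns":
            return self.answer
        if path == "/net/tcp/clone":
            conn = "/net/tcp/%d" % self.next_conn   # allocate a new channel
            self.next_conn += 1
            self.ctl_log[conn] = []
            self.data[conn] = ""
            return conn
        if path.endswith("/data"):
            return self.data[path[:-len("/data")]]  # echo back what was written
        raise FileNotFoundError(path)


def dial(fs, host):
    """Follow the steps in the text: resolve, clone, then connect."""
    fs.write("/net/dns", host)
    address = fs.read("/net/dns")
    conn = fs.read("/net/tcp/clone")
    fs.write(conn + "/ctl", "connect " + address)
    return conn + "/data"
```

.LP
The caller of
.CW dial
receives only a file name; all further communication is reading and writing that file.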
.LP
There are several things to note about this approach.
.RS
.IP \(bu
All the interface points look like files, and are
accessed by the same I/O mechanisms already available in
programming languages like C, C++, or Java. However, they do not
correspond to ordinary data files on disk, but instead are creations
of a middleware code layer.
.IP \(bu
Communication across the interface, by convention, uses printable character strings where
feasible instead of binary information.  This means that the syntax
of communication does not depend on CPU architecture or language details.
.IP \(bu
Because the interface, as in this example with
.CW /net
as the interface with networking facilities, looks like a piece of a
hierarchical file system, it can easily and nearly automatically
be exported to a remote machine and used from afar.
.RE
.LP
In particular, the Styx implementation encourages a natural way of providing
controlled access to networks.
Lucent, like many organizations, has an internal network not
accessible to the international Internet, and has a few
gateways between the inside and outside networks.
Only the gateway machines are connected to both, and they implement
the administrative controls for safety and security.
The advantage of the Styx model is the ease with which
the outside Internet can be used from inside.
If the
.CW /net
file tree described above is provided on a gateway machine,
it can be used as a remote file system from machines on the
inside.  This is safe, because this connection is one-way:
inside machines can see the external network interfaces,
but outside machines cannot see the inside.
.SH
.I "Example: debugging
.LP
A similar approach, borrowed and generalized from the UNIX
system [Kill84], is useful for controlling and discovering the status
of the running processes in the operating system.
Here a directory
.CW /proc
contains a subdirectory for each process running on the
system; the names of the subdirectories correspond to
process IDs:
.P1
/proc/
	1/
		status
		ctl
		fd
		text
		mem
		...
	2/
		status
		ctl
		...
	...
.P2
The file names in the process directories refer to various aspects
of the corresponding process:
.CW status
contains information about the state of the process;
.CW ctl ,
when written, performs operations like pausing, restarting,
or killing the process;
.CW fd
names and describes the files open in the process;
.CW text
and
.CW mem
represent the program code and the data respectively.
.LP
Where possible, the information and control are again
represented as text strings.  For example, one line
from the
.CW status
file of a typical process might be
.DS
.CW "samterm dmr Read 0 20 2478910 0 0 ...
.DE
which shows the name of the program, the owner, its state, and several numbers
representing CPU time in various categories.
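.LP
Because the status is plain text, a client needs no machine-specific decoding to use it.
The Python sketch below splits such a line into fields; the function
.CW parse_status
and the dictionary keys are names chosen for this example.
Only the first three fields are named in the text, so the remaining numbers are kept
as an uninterpreted list, and the elided tail of the sample line is not reconstructed.

```python
def parse_status(line):
    """Split one status line into named fields.

    The meaning of the individual numbers is not assumed here;
    they are returned in order as integers.
    """
    fields = line.split()
    name, owner, state = fields[:3]
    times = [int(f) for f in fields[3:]]
    return {"name": name, "owner": owner, "state": state, "times": times}
```

.LP
The same function works whether the
.CW status
file is local or served from another machine, since only text crosses the connection.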
.LP
Once again, the approach provides several payoffs.
Because process information is represented in file form,
remote debugging (debugging programs on another machine)
is possible immediately by remote-mounting the
.CW /proc
tree on another machine.
The machine-independent representation of information means
that most operations work properly even if the remote machine
uses a different CPU architecture from the one doing the
debugging.
Most of the programs that deal
with status and control contain no machine-dependent parts
and are completely portable.
(A few are not, however: no attempt is made to render the
memory data or instructions in machine-independent form.)
.SH
.I "Example: PathStar\*(TM Access Server
.LP
The data shelf of Lucent's PathStar Access Server[PATH] uses Styx to connect
the line cards and other devices on the shelf to the control computer.
In fact, Styx is the protocol for high-level communication on the backplane.
.LP
The file system hierarchy served by the control computer includes a structure
like this:
.P1
/trip/
	config
	admin/
		ospfctl
		...
	boot/
		0/
			ctl
			eeprom
			memory
			msg
			pack
			alarm
			...
		1/
			...
/net/
	...
.P2
The directories under
.CW /net
are similar to those in Plan 9 or Inferno; they form the interface to the
external IP network.
The
.CW /trip
hierarchy represents the control structure of the shelf.
.LP
The subdirectories under
.CW /trip/boot
each provide access to one of the line cards or other devices in the shelf.
For example, to initialize a card one writes the text string
.CW reset
to the
.CW ctl
file of the card, while bootstrapping is done by copying the control
software for the card into the
.CW memory
file and writing a
.CW reset
message to
.CW ctl .
Once the line card is running,
the other files present an interface to the higher-level structure of the device:
.CW pack
is the port through which IP packets are transferred to and from the card,
.CW alarm
may be read to discover outstanding conditions on the card, and so on.
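.LP
The bootstrapping sequence just described amounts to two file writes.
The Python sketch below shows the order of operations using an ordinary temporary
directory in place of a card directory such as
.CW /trip/boot/0 ;
the functions
.CW write_file ,
.CW boot_card ,
and
.CW demo
are hypothetical, and on a plain disk directory the files are inert,
so the sketch demonstrates only what a client would write, not what a card would do.

```python
import os
import tempfile


def write_file(path, data):
    """Write bytes to a file, as a client would write to a served file."""
    with open(path, "wb") as f:
        f.write(data)


def boot_card(card_dir, image):
    """Load control software into a card, then reset it.

    The file names memory and ctl and the reset string come from
    the text; card_dir stands for one subdirectory of /trip/boot.
    """
    write_file(os.path.join(card_dir, "memory"), image)
    write_file(os.path.join(card_dir, "ctl"), b"reset")


def demo():
    """Exercise boot_card against an ordinary temporary directory."""
    with tempfile.TemporaryDirectory() as card_dir:
        boot_card(card_dir, b"\x00control software\x00")
        with open(os.path.join(card_dir, "ctl"), "rb") as f:
            ctl = f.read()
        with open(os.path.join(card_dir, "memory"), "rb") as f:
            memory = f.read()
    return ctl, memory
```

.LP
Nothing in
.CW boot_card
depends on where the directory is served from, which is the property the shelf design relies on.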
.LP
All this structure is exported from the shelf using Styx.
The external element management software (EMS) controls and monitors the
shelf using Styx operations.
For example, the EMS may read
.CW /trip/boot/7/alarm
and discover a diagnostic condition.
By reading and writing the other files under
.CW /trip/boot/7/ ,
the card may be taken off line, diagnosed, and perhaps reset or substituted,
all from the system running the EMS, which may be elsewhere in the network.
.LP
Another example is the implementation of SNMP in the PathStar Access Server.
The functionality of SNMP is usually distributed through the various components
of a network, but here it is a straightforward adaptation process,
running anywhere in the network, that translates SNMP requests to Styx
operations in the network element.
Besides dramatically simplifying the implementation, the natural
ability for aggregation permits
a single process to provide SNMP access to an arbitrarily complex network subsystem.
Yet the structure is secure: the file-oriented nature of the operations makes it
easy to establish standard authentication and security controls to guarantee
that only trusted parties have access to the SNMP operations.
.LP
There are local benefits to this architecture, as well.
Styx provides a single point in the design where control can be separated
from the details of the underlying fabric, isolating both from changes in the
other.  Components become more adaptable: software can be upgraded
without worrying about hidden dependencies on the hardware,
and new hardware may be installed without updating the control
software above.
.SH
Security issues
.LP
Styx provides several security mechanisms for
discouraging hostile or accidental actions that injure the integrity
of a system.
.LP
The underlying file-communication protocol includes
user and group identifiers that a server may check against
other authentication.
For example, a server may check, on a request to open a file,
that the user ID associated with the request is permitted to
perform the operation.
This mechanism is familiar from general-purpose operating
systems, and its use is well-known.
It depends on passwords or stronger mechanisms for authenticating
the identity of clients.
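.LP
A server-side check of this kind might look like the following Python sketch.
It assumes the owner, group, and other permission bits familiar from general-purpose
operating systems; that layout, the function
.CW may_open ,
and its parameters are assumptions of the example rather than a description of any
particular Styx server.

```python
def may_open(file_owner, file_group, mode_bits, user, groups, want):
    """Decide whether `user` may open a file for `want` ('r' or 'w').

    mode_bits uses the owner/group/other octal layout (for example
    0o640); this layout is an assumption of the sketch.
    """
    need = {"r": 4, "w": 2}[want]
    if user == file_owner:
        granted = (mode_bits >> 6) & 7      # owner bits
    elif file_group in groups:
        granted = (mode_bits >> 3) & 7      # group bits
    else:
        granted = mode_bits & 7             # everyone else
    return bool(granted & need)
```

.LP
The check is meaningful only if the user ID presented with the request has already been
authenticated, which is why the mechanism depends on passwords or something stronger.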
.LP
The Styx approach of providing remote resources
as file systems over a network encourages genuinely secure access
to the resources in a way transparent to applications, so that
authentication transactions need not be provided as part of each application.
For example, in Inferno, the negotiation of an initial connection
between client and server may include installation of any of
several encrypting or message-digesting protocols that
supervise the channel.
All application use of the resources provided by the server
is then protected against interference, and the server
has strong assurance that its facilities are being used in
an authorized way.
This matters for general-purpose file servers and,
in the telephony field, is especially useful for safe
remote administration.
.SH
Summary
.LP
Presentation of resources as a piece of a possibly remote file system
is an attractive way of creating distributed systems that treads a
path between two extremes:
.IP 1
All communication with other parts of the system is by
explicit messages sent between components.
This communication differs in style from applications' use
of local resources.
.IP 2
All communication is by means of
closely shared resources: the CPU-addressable memory in
various parts is made directly available across a big network;
applications can read and write far-away objects exactly as
they do those on the same motherboard as their own CPU.
.LP
Something like the first of these extremes is usually more evident
in today's systems, although either the operating system or software
layered upon it usually papers over some of the rough spots.
The second remains more difficult to approach, because
networks (especially big ones like the Internet) are not very
reliable, and because
the machines on them are diverse in processor architecture
and in installed software.
.LP
The design plan described and advocated in this paper
lies between the two extremes.
It has these advantages:
.IP \(bu
.I "A simple, familiar programming model for reading and writing named files" .
File systems have well-defined naming, access, and permissions structures.
.IP \(bu
.I "Platform and language independence" .
Underlying access to resources is
at the file level, which is provided nearly everywhere, instead
of depending on facilities available only with particular languages
or operating systems.
C++ or Java classes and C libraries can be constructed
to access the facilities.
.IP \(bu
.I "A hierarchical naming and access control structure" .
This encourages clean
and well-structured design of resource naming and access.
.IP \(bu
.I "Easy testing and debugging" .
By using well-specified, narrow interfaces
at the file level, it is straightforward to observe the communication
between distributed entities.
.IP \(bu
.I "Low cost" .
Support software, at both client and server,
can be written in a few thousand lines
of code, and will occupy only small space in products.
.LP
This approach to building systems is successful in the general-purpose
systems Plan 9 and Inferno;
it has also been used to construct systems specialized for telephony, such
as Mantra[MAN] and the PathStar Access Server.
It supplies a coherent, extensible structure both to the internal communication
within a single system and to the external communication between heterogeneous
components of a large digital network.
.LP
.SH
References
.nr PS -1
.nr VS -1
.IP [NFS] 11
R. Sandberg, D. Goldberg, S. Kleiman, D. Walsh, and
B. Lyon,
``Design and Implementation of the Sun Network File System'',
.I "Proc. Summer 1985 USENIX Conf." ,
Portland, Oregon, June 1985,
pp. 119-130.
.IP [RFC] 11
Internet RFC 1094.
.IP [9man] 11
.I "Plan 9 Programmer's Manual" ,
Second Edition,
Vol. 1 and 2,
Bell Laboratories,
Murray Hill, N.J.,
1995.
.IP [Kill84] 11
T. J. Killian,
``Processes as Files'',
.I "Proc. Summer 1984 USENIX Conf." ,
Salt Lake City, Utah, June 1984, pp. 203-207.
.IP [Pike91] 11
R. Pike,
``8½, the Plan 9 Window System'',
.I "Proc. Summer 1991 USENIX Conf." ,
Nashville TN, June 1991, pp. 257-265.
.IP "[PPTTW93] " 11
R. Pike, D.L. Presotto, K. Thompson, H. Trickey, and P. Winterbottom, ``The Use of Name Spaces in Plan 9'',
.I "Op. Sys. Rev." ,
Vol. 27, No. 2, April 1993, pp. 72-76.
.IP [PrWi93] 11
D. L. Presotto and P. Winterbottom,
``The Organization of Networks in Plan 9'',
.I "Proc. Winter 1993 USENIX Conf." ,
San Diego, Calif., Jan. 1993, pp. 43-50.
.IP [Nee89] 11
R. Needham, ``Names'', in
.I "Distributed systems" ,
edited by S. Mullender,
Addison-Wesley,
Reading, Mass., 1989, pp. 89-101.
.IP [CIFS]
Paul Leach and Dan Perry, ``CIFS: A Common Internet File System'', Nov. 1996,
.I "http://www.microsoft.com/mind/1196/cifs.htm" .
.IP [INF1]
.I "Inferno Programmer's Manual" ,
Third Edition,
Vol. 1 and 2, Vita Nuova Holdings Limited, York, England, 2000.
.IP [INF2]
S.M. Dorward, R. Pike, D. L. Presotto, D. M. Ritchie, H. Trickey,
and P. Winterbottom, ``The Inferno Operating System'',
.I "Bell Labs Technical Journal" ,
Vol. 2,
No. 1,
Winter 1997.
.IP [MAN]
R. A. Lakshmi-Ratan,
``The Lucent Technologies Softswitch\-Realizing the Promise of Convergence'',
.I "Bell Labs Technical Journal" ,
Vol. 4,
No. 2,
April-June 1999,
pp. 174-196.
.IP [PATH]
J. M. Fossaceca, J. D. Sandoz, and P. Winterbottom,
``The PathStar Access Server: Facilitating Carrier-Scale Packet Telephony'',
.I "Bell Labs Technical Journal" ,
Vol. 3,
No. 4,
October-December 1998,
pp. 86-102.
.IP [Welc94]
B. Welch,
``A Comparison of Three Distributed File System Architectures: Vnode, Sprite, and Plan 9'',
.I "Computing Systems" ,
Vol. 7, No. 2, pp. 175-199 (1994).
.nr PS +1
.nr VS +1