.if \nv .rm CM
.de UX
.ie \\n(UX \s-1UNIX\s0\\$1
.el \{\
\s-1UNIX\s0\\$1\(dg
.FS
\(dg \s-1UNIX\s0 is a registered trademark of AT&T.
.FE
.nr UX 1
.\}
..
.TL
Toward a Compatible Filesystem Interface
.AU
Michael J. Karels
Marshall Kirk McKusick
.AI
Computer Systems Research Group
Computer Science Division
Department of Electrical Engineering and Computer Science
University of California, Berkeley
Berkeley, California  94720
.AB
.LP
As network or remote filesystems have been implemented for
.UX ,
several stylized interfaces between the filesystem implementation
and the rest of the kernel have been developed.
Notable among these are Sun Microsystems' Virtual Filesystem interface (VFS)
using vnodes, Digital Equipment's Generic File System (GFS) architecture,
and AT&T's File System Switch (FSS).
Each design attempts to isolate filesystem-dependent details
below a generic interface and to provide a framework within which
new filesystems may be incorporated.
However, each of these interfaces is different from
and incompatible with the others.
Each of them addresses somewhat different design goals.
Each was based on a different starting version of
.UX ,
targeted a different set of filesystems with varying characteristics,
and uses a different set of primitive operations provided by the filesystem.
The current study compares the various filesystem interfaces.
Criteria for comparison include generality, completeness, robustness,
efficiency and esthetics.
Several of the underlying design issues are examined in detail.
As a result of this comparison, a proposal for a new filesystem interface
is advanced that includes the best features of the existing implementations.
The proposal adopts the calling convention for name lookup introduced
in 4.3BSD, but is otherwise closely related to Sun's VFS.
A prototype implementation is now being developed at Berkeley.
This proposal and the rationale underlying its development
have been presented to major software vendors
as an early step toward convergence on a compatible filesystem interface.
.AE
.SH
Introduction
.PP
As network communications and workstation environments
have become common elements in
.UX
systems, several vendors of
.UX
systems have designed and built network filesystems
that allow client processes on one
.UX
machine to access files on a server machine.
Examples include Sun's Network File System, NFS [Sandberg85],
AT&T's recently-announced Remote File Sharing, RFS [Rifkin86],
the LOCUS distributed filesystem [Walker85],
and Masscomp's extended filesystem [Cole85].
Other remote filesystems have been implemented in research or university groups
for internal use, notably the network filesystem in the Eighth Edition
.UX
system [Weinberger84] and two different filesystems used at Carnegie-Mellon
University [Satyanarayanan85].
Numerous other remote file access methods have been devised for use
within individual
.UX
processes,
many of them by modifications to the C I/O library
similar to those in the Newcastle Connection [Brownbridge82].
.PP
Multiple network filesystems may frequently
be found in use within a single organization.
These circumstances make it highly desirable to be able to transport filesystem
implementations from one system to another.
Such portability is considerably enhanced by the use of a stylized interface
with carefully-defined entry points to separate the filesystem from the rest
of the operating system.
This interface should be similar to the interface between device drivers
and the kernel.
Although varying somewhat among the common versions of
.UX ,
the device driver interfaces are sufficiently similar that device drivers
may be moved from one system to another without major problems.
A clean, well-defined interface to the filesystem also allows a single
system to support multiple local filesystem types.
.PP
For reasons such as these, several filesystem interfaces have been used
when integrating new filesystems into the system.
The best-known of these are Sun Microsystems' Virtual File System interface,
VFS [Kleiman86], and AT&T's File System Switch, FSS.
Another interface, known as the Generic File System, GFS,
has been implemented for the ULTRIX\(dd
.FS
\(dd ULTRIX is a trademark of Digital Equipment Corp.
.FE
system by Digital [Rodriguez86].
There are numerous differences among these designs.
The differences may be understood from the varying philosophies
and design goals of the groups involved, from the systems under which
the implementations were done, and from the filesystems originally targeted
by the designs.
These differences are summarized in the following sections
within the limitations of the published specifications.
.SH
Design goals
.PP
There are several design goals which, in varying degrees,
have driven the various designs.
Each attempts to divide the filesystem into a filesystem-type-independent
layer and individual filesystem implementations.
The division between these layers occurs at somewhat different places
in these systems, reflecting different views of the diversity and types
of the filesystems that may be accommodated.
Compatibility with existing local filesystems has varying importance;
at the user-process level, each attempts to be completely transparent
except for a few filesystem-related system management programs.
The AT&T interface also makes a major effort to retain familiar internal
system interfaces, and even to retain object-file-level binary compatibility
with operating system modules such as device drivers.
Both Sun and DEC were willing to change internal data structures and interfaces,
even though other operating system modules might then require recompilation
or source-code modification.
.PP
AT&T's interface both allows and requires filesystems to support the full
and exact semantics of their previous filesystem,
including interruptions of system calls on slow operations.
System calls that deal with remote files are encapsulated
with their environment and sent to a server where execution continues.
The system call may be aborted by either client or server, returning
control to the client.
Most system calls that descend into the filesystem-dependent layer
of a filesystem other than the standard local filesystem do not return
to the higher-level kernel calling routines.
Instead, the filesystem-dependent code completes the requested
operation and then executes a non-local goto (\fIlongjmp\fP) to exit the
system call.
These efforts to avoid modification of main-line kernel code
indicate a far greater emphasis on internal compatibility than on modularity,
clean design, or efficiency.
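.PP
The control flow can be illustrated in miniature with a user-level
sketch using \fIsetjmp\fP/\fIlongjmp\fP;
the names here are invented for illustration,
as the FSS internals have not been published.
.DS
#include <setjmp.h>
#include <stdio.h>

static jmp_buf syscall_exit;	/* context saved by the handler */

/*
 * Filesystem-dependent layer: completes the request itself,
 * then exits the system call with a non-local goto rather
 * than returning to the generic calling routines.
 */
void
remotefs_op()
{
	/* ... operation completed on the remote server ... */
	longjmp(syscall_exit, 1);
}

/*
 * Stand-in for the unmodified system-call handler;
 * it regains control only through the longjmp above.
 */
int
handler()
{
	if (setjmp(syscall_exit))
		return (0);	/* resumed here after longjmp */
	remotefs_op();		/* does not return normally */
	return (-1);		/* not reached */
}

int
main()
{
	printf("handler returned %d\en", handler());
	return (0);
}
.DE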
.PP
In contrast, the Sun VFS interface makes major modifications to the internal
interfaces in the kernel, with a very clear separation
of filesystem-independent and -dependent data structures and operations.
The semantics of the filesystem are largely retained for local operations,
although this is achieved at some expense where it does not fit the internal
structuring well.
The filesystem implementations are not required to support the same
semantics as local
.UX
filesystems.
Several historical features of
.UX
filesystem behavior are difficult to achieve using the VFS interface,
including the atomicity of file and link creation and the use of open files
whose names have been removed.
.PP
A major design objective of Sun's network filesystem,
statelessness,
permeates the VFS interface.
No locking may be done in the filesystem-independent layer,
and locking in the filesystem-dependent layer may occur only during
a single call into that layer.
.PP
A final design goal of most implementors is performance.
For remote filesystems,
this goal tends to be in conflict with the goals of complete semantic
consistency, compatibility and modularity.
Sun has chosen performance over modularity in some areas,
but has emphasized clean separation of the layers within the filesystem
at the expense of performance.
Although the performance of RFS is yet to be seen,
AT&T seems to have considered compatibility far more important than modularity
or performance.
.SH
Differences among filesystem interfaces
.PP
The existing filesystem interfaces may be characterized
in several ways.
Each system is centered around a few data structures or objects,
along with a set of primitives for performing operations upon these objects.
In the original
.UX
filesystem [Ritchie74],
the basic object used by the filesystem is the inode, or index node.
The inode contains all of the information about a file except its name:
its type, identification, ownership, permissions, timestamps and location.
Inodes are identified by the filesystem device number and the index within
the filesystem.
The major entry points to the filesystem are \fInamei\fP,
which translates a filesystem pathname into the underlying inode,
and \fIiget\fP, which locates an inode by number and installs it in the in-core
inode table.
\fINamei\fP performs name translation by iterative lookup
of each component name in its directory to find its inumber,
and then uses \fIiget\fP to return the actual inode.
If the last component has been reached, this inode is returned;
otherwise, the inode describes the next directory to be searched.
The inode returned may be used in various ways by the caller;
it may be examined, the file may be read or written,
types and access may be checked, and fields may be modified.
Modified inodes are automatically written back to the filesystem
on disk when the last reference is released with \fIiput\fP.
Although the details are considerably different,
the same general scheme is used in the fast filesystem in 4.2BSD
.UX
[McKusick84].
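.PP
The iteration may be sketched as follows;
this is illustrative pseudo-code rather than the actual kernel routine,
and \fInext_component\fP and \fIdirlookup\fP are hypothetical helpers
standing in for the pathname parsing and directory scan.
.DS
/*
 * Hypothetical sketch of the namei iteration.
 * dirlookup() scans a directory inode for a component name,
 * returning its inumber or 0 if it is absent;
 * next_component() extracts the next pathname component.
 */
struct inode *
namei_sketch(dp, path)
	struct inode *dp;	/* starting directory, from iget */
	char *path;
{
	char name[MAXNAMLEN + 1];
	ino_t ino;

	while (next_component(&path, name)) {
		ino = dirlookup(dp, name);
		iput(dp);			/* release this directory */
		if (ino == 0)
			return (NULL);		/* component not found */
		dp = iget(dp->i_dev, ino);	/* inode by number */
	}
	return (dp);	/* inode of the final component */
}
.DE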
.PP
Both the AT&T interface and, to a lesser extent, the DEC interface
attempt to preserve the inode-oriented interface.
Each modifies the inode to allow different varieties of the structure
for different filesystem types by separating the filesystem-dependent
parts of the inode into a separate structure or one arm of a union.
Both interfaces allow operations
equivalent to the \fInamei\fP and \fIiget\fP operations
of the old filesystem to be performed in the filesystem-independent
layer, with entry points to the individual filesystem implementations to support
the type-specific parts of these operations.  Implicit in this interface
is that files may conveniently be named by and located using a single
index within a filesystem.
The GFS provides specific entry points to the filesystems
to change most file properties rather than allowing arbitrary changes
to be made to the generic part of the inode.
.PP
In contrast, the Sun VFS interface replaces the inode as the primary object
with the vnode.
The vnode contains no filesystem-dependent fields except the pointer
to the set of operations implemented by the filesystem.
Properties of a vnode that might be transient, such as the ownership,
permissions, size and timestamps, are maintained by the lower layer.
These properties may be presented in a generic format upon request;
callers are expected not to hold this information for any length of time,
as they may not be up-to-date later on.
The vnode operations do not include a counterpart of \fIiget\fP;
the only external interface for obtaining vnodes for specific files
is the name lookup operation.
(Separate procedures are provided outside of this interface
that obtain a ``file handle'' for a vnode which may be given
to a client by a server, such that the vnode may be retrieved
upon later presentation of the file handle.)
.SH
Name translation issues
.PP
Each of the systems described includes a mechanism for performing
pathname-to-internal-representation translation.
The style of the name translation function differs considerably
among the three systems.
As described above, the AT&T and DEC systems retain the \fInamei\fP function.
The two are quite different, however, as the ULTRIX interface uses
the \fInamei\fP calling convention introduced in 4.3BSD.
The parameters and context for the name lookup operation
are collected in a \fInameidata\fP structure which is passed to \fInamei\fP
for operation.
Intent to create or delete the named file is declared in advance,
so that the final directory scan in \fInamei\fP may retain information
such as the offset in the directory at which the modification will be made.
Filesystems that use such mechanisms to avoid redundant work
must therefore lock the directory to be modified so that it may not
be modified by another process before completion.
In the System V filesystem, as in previous versions of
.UX ,
this information is stored in the per-process \fIuser\fP structure
by \fInamei\fP for use by a low-level routine called after performing
the actual creation or deletion of the file itself.
In 4.3BSD and in the GFS interface, these side effects of \fInamei\fP
are stored in the \fInameidata\fP structure given as argument to \fInamei\fP,
which is also presented to the routine implementing file creation or deletion.
.PP
The ULTRIX \fInamei\fP routine is responsible for the generic
parts of the name translation process, such as copying the name into
an internal buffer, validating it, interpolating
the contents of symbolic links, and indirecting at mount points.
As in 4.3BSD, the name is copied into the buffer in a single call,
according to the location of the name.
After determining the type of the filesystem at the start of translation
(the current directory or root directory), it calls the filesystem's
\fInamei\fP entry with the same structure it received from its caller.
The filesystem-specific routine translates the name, component by component,
as long as no mount points are reached.
It may return after any number of components have been processed.
\fINamei\fP performs any processing at mount points, then calls
the correct translation routine for the next filesystem.
Network filesystems may pass the remaining pathname to a server for translation,
or they may look up the pathname components one at a time.
The former strategy would be more efficient,
but the latter scheme allows mount points within a remote filesystem
without server knowledge of all client mounts.
.PP
The AT&T \fInamei\fP interface is presumably the same as that in previous
.UX
systems, accepting the name of a routine to fetch pathname characters
and an operation (one of: lookup, lookup for creation, or lookup for deletion).
It translates, component by component, as before.
If it detects that a mount point crosses to a remote filesystem,
it passes the remainder of the pathname to the remote server.
A pathname-oriented request other than open may be completed
within the \fInamei\fP call,
avoiding return to the (unmodified) system call handler
that called \fInamei\fP.
.PP
In contrast to the first two systems, Sun's VFS interface has replaced
\fInamei\fP with \fIlookupname\fP.
This routine simply calls a new pathname-handling module to allocate
a pathname buffer and copy in the pathname (copying a character per call),
then calls \fIlookuppn\fP.
\fILookuppn\fP performs the iteration over the directories leading
to the destination file; it copies each pathname component to a local buffer,
then calls the filesystem \fIlookup\fP entry to locate the vnode
for that file in the current directory.
Per-filesystem \fIlookup\fP routines may translate only one component
per call.
For creation or deletion of files, the lookup operation is unmodified;
the lookup of the final component only serves to check for the existence
of the file.
The subsequent creation or deletion call, if any, must repeat the final
name translation and associated directory scan.
For new file creation in particular, this is rather inefficient,
as file creation requires two complete scans of the directory.
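.PP
The per-component iteration may be sketched as the following fragment;
the call signature of the \fIlookup\fP entry and the
\fIget_component\fP helper are simplified for illustration
and are not the exact Sun definitions.
.DS
/*
 * Hedged sketch of lookuppn-style iteration:
 * exactly one component is translated per call
 * into the filesystem-dependent layer.
 */
dvp = startdir;				/* current directory vnode */
while (get_component(&pn, name)) {	/* hypothetical helper */
	error = (*dvp->v_op->vn_lookup)(dvp, name, &vp, cred);
	if (error)
		return (error);
	vrele(dvp);			/* release the directory */
	dvp = vp;			/* descend one level */
}
.DE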
.PP
Several of the important performance improvements in 4.3BSD
were related to the name translation process [McKusick85][Leffler84].
The following changes were made:
.IP 1. 4
A system-wide cache of recent translations is maintained.
The cache is separate from the inode cache, so that multiple names
for a file may be present in the cache.
The cache does not hold ``hard'' references to the inodes,
so that the normal reference pattern is not disturbed.
.IP 2.
A per-process cache is kept of the directory and offset
at which the last successful name lookup was done.
This allows sequential lookups of all the entries in a directory to be done
in linear time.
.IP 3.
The entire pathname is copied into a kernel buffer in a single operation,
rather than using two subroutine calls per character.
.IP 4.
A pool of pathname buffers is held by \fInamei\fP, avoiding allocation
overhead.
.LP
All of these performance improvements from 4.3BSD are well worth using
within a more generalized filesystem framework.
The generalization of the structure may otherwise make an already-expensive
function even more costly.
Most of these improvements are present in the GFS system, as it derives
from the beta-test version of 4.3BSD.
The Sun system uses a name-translation cache generally like that in 4.3BSD.
The name cache is a filesystem-independent facility provided for the use
of the filesystem-specific lookup routines.
The Sun cache, like that first used at Berkeley but unlike that in 4.3BSD,
holds a ``hard'' reference to the vnode (increments the reference count).
The ``soft'' reference scheme in 4.3BSD cannot be used with the current
NFS implementation, as NFS allocates vnodes dynamically and frees them
when the reference count returns to zero rather than caching them.
As a result, fewer names may be held in the cache
than (local filesystem) vnodes, and the cache distorts the normal reference
patterns otherwise seen by the LRU cache.
As the name cache references overflow the local filesystem inode table,
the name cache must be purged to make room in the inode table.
Also, to determine whether a vnode is in use (for example,
before mounting upon it), the cache must be flushed to free any
cache reference.
These problems should be corrected
by the use of the soft cache reference scheme.
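.PP
For reference, a ``soft'' cache entry in the style of 4.3BSD,
recast here in terms of vnodes, might look as follows;
the field names are illustrative.
.DS
/*
 * Sketch of a soft name-cache entry.  No reference count is
 * held on nc_vp; instead each vnode is assumed to carry a
 * capability number that is incremented when the vnode is
 * reused, so stale entries are detected and discarded on use.
 */
struct namecache {
	struct	vnode *nc_dvp;		/* directory searched */
	long	nc_dvpid;		/* capability of nc_dvp */
	struct	vnode *nc_vp;		/* resulting vnode (no reference) */
	long	nc_vpid;		/* capability of nc_vp */
	char	nc_name[NCSIZE];	/* component name (NCSIZE illustrative) */
};
/* an entry is valid only while both capabilities still match */
.DE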
.PP
A final observation on the efficiency of name translation in the current
Sun VFS architecture is that the number of subroutine calls used
by a multi-component name lookup is dramatically larger
than in the other systems.
The name lookup scheme in GFS suffers much less from this problem,
without any violation of the layering.
.PP
A final problem to be considered is synchronization and consistency.
As the filesystem operations are more stylized and broken into separate
entry points for parts of operations, it is more difficult to guarantee
consistency throughout an operation and/or to synchronize with other
processes using the same filesystem objects.
The Sun interface suffers most severely from this,
as it forbids the filesystems from locking objects across calls
to the filesystem.
It is possible that a file may be created between the time that a lookup
is performed and a subsequent creation is requested.
Perhaps more strangely, after a lookup fails to find the target
of a creation attempt, the actual creation might find that the target
now exists and is a symbolic link.
The call will either fail unexpectedly, as the target is of the wrong type,
or the generic creation routine will have to note the error
and restart the operation from the lookup.
This problem will always exist in a stateless filesystem,
but the VFS interface forces all filesystems to share the problem.
This restriction against locking between calls also
forces duplication of work during file creation and deletion.
This is considered unacceptable.
.SH
Support facilities and other interactions
.PP
Several support facilities are used by the current
.UX
filesystem and require generalization for use by other filesystem types.
For filesystem implementations to be portable,
it is desirable that these modified support facilities
should also have a uniform interface and
behave in a consistent manner in target systems.
A prominent example is the filesystem buffer cache.
The buffer cache in a standard (System V or 4.3BSD)
.UX
system contains physical disk blocks with no reference to the files containing
them.
This works well for the local filesystem, but has obvious problems
for remote filesystems.
Sun has modified the buffer cache routines to describe buffers by vnode
rather than by device.
For remote files, the vnode used is that of the file, and the block
numbers are virtual data blocks.
For local filesystems, a vnode for the block device is used for cache reference,
and the block numbers are filesystem physical blocks.
Use of per-file cache description does not easily accommodate
caching of indirect blocks, inode blocks, superblocks or cylinder group blocks.
However, the vnode describing the block device for the cache
is one created internally,
rather than the vnode for the device looked up when mounting,
and it is located by searching a private list of vnodes
rather than by holding it in the mount structure.
Although the Sun modification makes it possible to use the buffer
cache for data blocks of remote files, a better generalization
of the buffer cache is needed.
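.PP
One form such a generalization might take is sketched below
under our own naming: each buffer is identified by a
(vnode, logical block) pair, and a miss is filled by a read
through the vnode's filesystem.
\fIincore\fP and \fIgetnewbuf\fP stand for the usual hash lookup
and buffer replacement, and are hypothetical in this form.
.DS
/*
 * Hedged sketch of a vnode-based buffer-cache lookup.
 */
struct buf *
vn_bread(vp, lbn, size)
	struct vnode *vp;	/* file or block-device vnode */
	daddr_t lbn;		/* logical block within vp */
	int size;
{
	struct buf *bp;

	if ((bp = incore(vp, lbn)) != NULL)
		return (bp);		/* cache hit */
	bp = getnewbuf(vp, lbn, size);	/* miss: assign a buffer */
	/* ... fill bp by a read through vp's filesystem ... */
	return (bp);
}
.DE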
.PP
The RFS filesystem used by AT&T does not currently cache data blocks
on client systems; thus the buffer cache is probably unmodified.
The form of the buffer cache in ULTRIX is unknown to us.
.PP
Another subsystem that has a large interaction with the filesystem
is the virtual memory system.
The virtual memory system must read data from the filesystem
to satisfy fill-on-demand page faults.
For efficiency, this read call is arranged to place the data directly
into the physical pages assigned to the process (a ``raw'' read) to avoid
copying the data.
Although the read operation normally bypasses the filesystem buffer cache,
consistency must be maintained by checking the buffer cache and copying
or flushing modified data not yet stored on disk.
The 4.2BSD virtual memory system, like that of Sun and ULTRIX,
maintains its own cache of reusable text pages.
This creates additional complications.
As the virtual memory systems are redesigned, these problems should be
resolved by reading through the buffer cache, then mapping the cached
data into the user address space.
If the buffer cache or the process pages are changed while the other reference
remains, the data would have to be copied (``copy-on-write'').
.PP
In the meantime, the current virtual memory systems must be used
with the new filesystem framework.
Both the Sun and AT&T filesystem interfaces
provide entry points to the filesystem for optimization of the virtual
memory system by performing logical-to-physical block number translation
when setting up a fill-on-demand image for a process.
The VFS provides a vnode operation analogous to the \fIbmap\fP function of the
.UX
filesystem.
Given a vnode and logical block number, it returns a vnode and block number
which may be read to obtain the data.
If the filesystem is local, it returns the private vnode for the block device
and the physical block number.
As the \fIbmap\fP operations are all performed at one time, during process
startup, any indirect blocks for the file will remain in the cache
after they are once read.
In addition, the interface provides a \fIstrategy\fP entry that may be used
for ``raw'' reads from a filesystem device,
used to read data blocks into an address space without copying.
This entry uses a buffer header (\fIbuf\fP structure)
to describe the I/O operation
instead of a \fIuio\fP structure.
The buffer-style interface is the same as that used by disk drivers internally.
This difference allows the current \fIuio\fP primitives to be avoided,
as they copy all data to/from the current user process address space.
Instead, for local filesystems these operations could be done internally
with the standard raw disk read routines,
which use a \fIuio\fP interface.
When loading from a remote filesystem,
the data will be received in a network buffer.
If network buffers are suitably aligned,
the data may be mapped into the process address space by a page swap
without copying.
In either case, it should be possible to use the standard filesystem
read entry from the virtual memory system.
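.PP
The \fIbmap\fP-style translation may be sketched for a local filesystem
as follows; the \fIlocalinode\fP structure and \fIblkmap\fP routine
are hypothetical stand-ins for the filesystem's private data
and indirect-block walk.
.DS
/*
 * Given a vnode and logical block number, return the vnode
 * and block number which may be read to obtain the data;
 * for a local filesystem, the private block-device vnode
 * and the physical block.
 */
int
localfs_bmap(vp, lbn, vpp, bnp)
	struct vnode *vp;
	daddr_t lbn;
	struct vnode **vpp;	/* out: vnode to be read */
	daddr_t *bnp;		/* out: block number on *vpp */
{
	struct localinode *ip = (struct localinode *)vp->v_data;

	*vpp = ip->i_devvp;	/* internally-created device vnode */
	*bnp = blkmap(ip, lbn);	/* walk direct/indirect blocks */
	return (0);
}
.DE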
.PP
Other issues that must be considered in devising a portable
filesystem implementation include kernel memory allocation,
the implicit use of user-structure global context,
which may create problems with reentrancy,
the style of the system call interface,
and the conventions for synchronization
(sleep/wakeup, handling of interrupted system calls, semaphores).
.SH
The Berkeley Proposal
.PP
The Sun VFS interface has been the most widely used of the three described here.
It is also the most general of the three, in that filesystem-specific
data and operations are best separated from the generic layer.
Although it has several disadvantages which were described above,
most of them may be corrected with minor changes to the interface
(and, in a few areas, philosophical changes).
The DEC GFS has other advantages, in particular the use of the 4.3BSD
\fInamei\fP interface and optimizations.
It allows single or multiple components of a pathname
to be translated in a single call to the specific filesystem
and thus accommodates filesystems with either preference.
The FSS is least well understood, as there is little public information
about the interface.
However, the design goals are the least consistent with those of the Berkeley
research groups.
Accordingly, a new filesystem interface has been devised to avoid
some of the problems in the other systems.
The proposed interface derives directly from Sun's VFS,
but, like GFS, uses a 4.3BSD-style name lookup interface.
Additional context information has been moved from the \fIuser\fP structure
to the \fInameidata\fP structure so that name translation may be independent
of the global context of a user process.
This is especially desired in any system where kernel-mode servers
operate as light-weight or interrupt-level processes,
or where a server may store or cache context for several clients.
This calling interface has the additional advantage
that the call parameters need not all be pushed onto the stack for each call
through the filesystem interface,
and they may be accessed using short offsets from a base pointer
(unlike global variables in the \fIuser\fP structure).
.PP
The proposed filesystem interface is described very tersely here.
For the most part, data structures and procedures are analogous
to those used by VFS, and only the changes will be treated here.
See [Kleiman86] for complete descriptions of the vfs and vnode operations
in Sun's interface.
.PP
The central data structure for name translation is the \fInameidata\fP
structure.
The same structure is used to pass parameters to \fInamei\fP,
to pass these same parameters to filesystem-specific lookup routines,
to communicate completion status from the lookup routines back to \fInamei\fP,
and to return completion status to the calling routine.
For creation or deletion requests, the parameters to the filesystem operation
to complete the request are also passed in this same structure.
The form of the \fInameidata\fP structure is:
.br
.ne 2i
.ID
.nf
.ta .5i +\w'caddr_t\0\0\0'u +\w'struct\0\0'u +\w'vnode *nc_prevdir;\0\0\0\0\0'u
/*
 * Encapsulation of namei parameters.
 * One of these is located in the u. area to
 * minimize space allocated on the kernel stack
 * and to retain per-process context.
 */
struct nameidata {
		/* arguments to namei and related context: */
	caddr_t	ni_dirp;		/* pathname pointer */
	enum	uio_seg ni_seg;		/* location of pathname */
	short	ni_nameiop;		/* see below */
	struct	vnode *ni_cdir;		/* current directory */
	struct	vnode *ni_rdir;		/* root directory, if not normal root */
	struct	ucred *ni_cred;		/* credentials */

		/* shared between namei, lookup routines and commit routines: */
	caddr_t	ni_pnbuf;		/* pathname buffer */
	char	*ni_ptr;		/* current location in pathname */
	int	ni_pathlen;		/* remaining chars in path */
	short	ni_more;		/* more left to translate in pathname */
	short	ni_loopcnt;		/* count of symlinks encountered */

		/* results: */
	struct	vnode *ni_vp;		/* vnode of result */
	struct	vnode *ni_dvp;		/* vnode of intermediate directory */

		/* side effects: */
/* BEGIN UFS SPECIFIC */
	off_t	ni_endoff;		/* end of useful stuff in directory */
	struct diroffcache {		/* last successful directory search */
		struct	vnode *nc_prevdir;	/* terminal directory */
		long	nc_id;			/* directory's unique id */
		off_t	nc_prevoffset;		/* where last entry found */
	} ni_nc;
	struct ndirinfo {		/* saved info for new dir entry */
		struct	iovec nd_iovec;		/* pointed to by ni_iov */
		struct	uio nd_uio;		/* directory I/O parameters */
		struct	direct nd_dent;		/* current directory entry */
	} ni_nd;
/* END UFS SPECIFIC */
};
.DE
.DS
.ta \w'#define\0\0'u +\w'WANTPARENT\0\0'u +\w'0x40\0\0\0\0\0\0\0'u
/*
 * namei operations and modifiers
 */
#define	LOOKUP	0	/* perform name lookup only */
#define	CREATE	1	/* setup for file creation */
#define	DELETE	2	/* setup for file deletion */
#define	WANTPARENT	0x10	/* return parent directory vnode also */
#define	NOCACHE	0x20	/* name must not be left in cache */
#define	FOLLOW	0x40	/* follow symbolic links */
#define	NOFOLLOW	0x0	/* don't follow symbolic links (pseudo) */
.DE
As in current systems other than Sun's VFS, \fInamei\fP is called
with an operation request, one of LOOKUP, CREATE or DELETE.
For a LOOKUP, the operation is exactly like the lookup in VFS.
CREATE and DELETE allow the filesystem to ensure consistency
by locking the parent inode (private to the filesystem),
and (for the local filesystem) to avoid duplicate directory scans
by storing the new directory entry and its offset in the directory
in the \fIndirinfo\fP structure.
This is intended to be opaque to the filesystem-independent levels.
Not all lookups for creation or deletion are actually followed
by the intended operation; permission may be denied, the filesystem
may be read-only, etc.
Therefore, an entry point to the filesystem is provided
to abort a creation or deletion operation
and allow release of any locked internal data.
After a \fInamei\fP with a CREATE or DELETE flag, the pathname pointer
is set to point to the last filename component.
Filesystems that choose to implement creation or deletion entirely
within the subsequent call to a create or delete entry
are thus free to do so.
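.PP
A file-creation path through this interface might look like the
following fragment; the declarations, credential setup and error
handling are abbreviated, and the fragment is a sketch of intended
use rather than actual kernel code.
.DS
struct nameidata nd;
int error;

nd.ni_nameiop = CREATE | FOLLOW;
nd.ni_seg = UIO_USERSPACE;
nd.ni_dirp = upath;		/* user-supplied pathname */
if (error = namei(&nd))
	return (error);
if (nd.ni_vp != NULL) {		/* target already exists */
	(*nd.ni_dvp->v_op->vn_abortop)(&nd);
	return (EEXIST);
}
error = (*nd.ni_dvp->v_op->vn_create)(&nd, &va, fflags);
.DE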
.PP
The \fInameidata\fP is used to store context used during name translation.
The current and root directories for the translation are stored here.
For the local filesystem, the per-process directory offset cache
is also kept here.
A file server could leave the directory offset cache empty,
could use a single cache for all clients,
or could hold caches for several recent clients.
.PP
Several other data structures are used in the filesystem operations.
One is the \fIucred\fP structure which describes a client's credentials
to the filesystem.
This is modified slightly from the Sun structure;
the ``accounting'' group ID has been merged into the groups array.
The actual number of groups in the array is given explicitly
to avoid use of a reserved group ID as a terminator.
Also, typedefs introduced in 4.3BSD for user and group ID's have been used.
The \fIucred\fP structure is thus:
.DS
.ta .5i +\w'caddr_t\0\0\0'u +\w'struct\0\0'u +\w'vnode *nc_prevdir;\0\0\0\0\0'u
/*
 * Credentials.
 */
struct ucred {
	u_short	cr_ref;			/* reference count */
	uid_t	cr_uid;			/* effective user id */
	short	cr_ngroups;		/* number of groups */
	gid_t	cr_groups[NGROUPS];	/* groups */
	/*
	 * The following either should not be here,
	 * or should be treated as opaque.
	 */
	uid_t   cr_ruid;		/* real user id */
	gid_t   cr_svgid;		/* saved set-group id */
};
.DE
.PP
A final structure used by the filesystem interface is the \fIuio\fP
structure mentioned earlier.
This structure describes the source or destination of an I/O
operation, with provision for scatter/gather I/O.
It is used in the read and write entries to the filesystem.
The \fIuio\fP structure presented here is modified from the one
used in 4.2BSD to specify the location of each vector of the operation
(user or kernel space)
and to allow an alternate function to be used to implement the data movement.
The alternate function might perform page remapping rather than a copy,
for example.
.DS
.ta .5i +\w'caddr_t\0\0\0'u +\w'struct\0\0'u +\w'vnode *nc_prevdir;\0\0\0\0\0'u
/*
 * Description of an I/O operation which potentially
 * involves scatter-gather, with individual sections
 * described by iovec, below.  uio_resid is initially
 * set to the total size of the operation, and is
 * decremented as the operation proceeds.  uio_offset
 * is incremented by the amount of each operation.
 * uio_iov is incremented and uio_iovcnt is decremented
 * after each vector is processed.
 */
struct uio {
	struct	iovec *uio_iov;
	int	uio_iovcnt;
	off_t	uio_offset;
	int	uio_resid;
	enum	uio_rw uio_rw;
};

enum	uio_rw { UIO_READ, UIO_WRITE };
.DE
.DS
.ta .5i +\w'caddr_t\0\0\0'u +\w'vnode *nc_prevdir;\0\0\0\0\0'u
/*
 * Description of a contiguous section of an I/O operation.
 * If iov_op is non-null, it is called to implement the copy
 * operation, possibly by remapping, with the call
 *	(*iov_op)(from, to, count);
 * where from and to are caddr_t and count is int.
 * Otherwise, the copy is done in the normal way,
 * treating base as a user or kernel virtual address
 * according to iov_segflg.
 */
struct iovec {
	caddr_t	iov_base;
	int	iov_len;
	enum	uio_seg iov_segflg;
	int	(*iov_op)();
};
.DE
.DS
.ta .5i +\w'UIO_USERISPACE\0\0\0\0\0'u
/*
 * Segment flag values.
 */
enum	uio_seg {
	UIO_USERSPACE,		/* from user data space */
	UIO_SYSSPACE,		/* from system space */
	UIO_USERISPACE		/* from user I space */
};
.DE
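.PP
As an illustration of these declarations, a read of a single
kernel-space buffer might be set up as follows;
the vnode \fIvp\fP, credentials \fIcred\fP and offset \fIoff\fP
are assumed to be in hand.
.DS
char buffer[1024];
struct iovec iov;
struct uio uio;
int error;

iov.iov_base = buffer;
iov.iov_len = sizeof(buffer);
iov.iov_segflg = UIO_SYSSPACE;	/* kernel address */
iov.iov_op = NULL;		/* normal copy, no remapping */
uio.uio_iov = &iov;
uio.uio_iovcnt = 1;
uio.uio_offset = off;
uio.uio_resid = sizeof(buffer);
uio.uio_rw = UIO_READ;
error = (*vp->v_op->vn_read)(vp, &uio, 0, cred);
/* bytes transferred: sizeof(buffer) - uio.uio_resid */
.DE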
.SH
File and filesystem operations
.PP
With the introduction of the data structures used by the filesystem
operations, the complete list of filesystem entry points may be listed.
As noted, they derive mostly from the Sun VFS interface.
Lines marked with \fB+\fP are additions to the Sun definitions;
lines marked with \fB!\fP are modified from VFS.
.PP
The structure describing the externally-visible features of a mounted
filesystem, \fIvfs\fP, is:
.DS
.ta .5i +\w'struct vfsops\0\0\0'u +\w'*vfs_vnodecovered;\0\0\0\0\0'u
/*
 * Structure per mounted file system.
 * Each mounted file system has an array of
 * operations and an instance record.
 * The file systems are put on a doubly linked list.
 */
struct vfs {
	struct vfs	*vfs_next;		/* next vfs in vfs list */
\fB+\fP	struct vfs	*vfs_prev;		/* prev vfs in vfs list */
	struct vfsops	*vfs_op;		/* operations on vfs */
	struct vnode	*vfs_vnodecovered;	/* vnode we mounted on */
	int	vfs_flag;		/* flags */
\fB!\fP	int	vfs_bsize;		/* basic block size */
\fB+\fP	int	vfs_tsize;		/* optimal transfer size */
\fB!\fP	uid_t	vfs_exroot;		/* exported fs uid 0 mapping */
	short	vfs_exflags;		/* exported fs flags */
	caddr_t	vfs_data;		/* private data */
};
.DE
.DS
.ta \w'\fB+\fP 'u +\w'#define\0\0'u +\w'VFS_EXPORTED\0\0'u +\w'0x40\0\0\0\0\0'u
	/*
	 * vfs flags.
	 * VFS_MLOCK locks the vfs so that name lookup cannot proceed past the vfs.
	 * This keeps the subtree stable during mounts and unmounts.
	 */
	#define	VFS_RDONLY	0x01		/* read only vfs */
\fB+\fP	#define	VFS_NOEXEC	0x02		/* can't exec from filesystem */
	#define	VFS_MLOCK	0x04		/* lock vfs so that subtree is stable */
	#define	VFS_MWAIT	0x08		/* someone is waiting for lock */
	#define	VFS_NOSUID	0x10		/* don't honor setuid bits on vfs */
	#define	VFS_EXPORTED	0x20		/* file system is exported (NFS) */

	/*
	 * exported vfs flags.
	 */
	#define	EX_RDONLY	0x01		/* exported read only */
.DE
.LP
The operations supported by the filesystem-specific layer
on an individual filesystem are:
.DS
.ta .5i +\w'struct vfsops\0\0\0'u +\w'*vfs_vnodecovered;\0\0\0\0\0'u
/*
 * Operations supported on virtual file system.
 */
struct vfsops {
\fB!\fP	int	(*vfs_mount)(		/* vfs, path, data, datalen */ );
\fB!\fP	int	(*vfs_unmount)(		/* vfs, forcibly */ );
\fB+\fP	int	(*vfs_mountroot)();
	int	(*vfs_root)(		/* vfs, vpp */ );
	int	(*vfs_statfs)(		/* vfs, sbp */ );
\fB!\fP	int	(*vfs_sync)(		/* vfs, waitfor */ );
\fB+\fP	struct vnode *	(*vfs_fhtovp)(	/* vfs, fh */ );
};
.DE
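.LP
The filesystem-independent layer reaches these operations through
the \fIvfs_op\fP vector; wrappers of roughly the following form
(shown here only as an illustration of the dispatch) keep the
calling code readable:
.DS
#define	VFS_ROOT(vfsp, vpp) \e
	(*(vfsp)->vfs_op->vfs_root)(vfsp, vpp)
#define	VFS_SYNC(vfsp, waitfor) \e
	(*(vfsp)->vfs_op->vfs_sync)(vfsp, waitfor)
.DE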
.LP
The \fIvfs_statfs\fP entry returns a structure of the form:
.DS
.ta .5i +\w'struct vfsops\0\0\0'u +\w'*vfs_vnodecovered;\0\0\0\0\0'u
/*
 * file system statistics
 */
struct statfs {
\fB!\fP	short	f_type;			/* type of filesystem */
\fB+\fP	short	f_flags;		/* copy of vfs (mount) flags */
	long	f_bsize;		/* fundamental file system block size */
\fB+\fP	long	f_tsize;		/* optimal transfer block size */
	long	f_blocks;		/* total data blocks in file system */
	long	f_bfree;		/* free blocks in fs */
	long	f_bavail;		/* free blocks avail to non-superuser */
	long	f_files;		/* total file nodes in file system */
	long	f_ffree;		/* free file nodes in fs */
	fsid_t	f_fsid;			/* file system id */
	long	f_spare[7];		/* spare for later */
};

typedef long fsid_t[2];			/* file system id type */
.DE
.LP
Finally, the external form of a filesystem object, the \fIvnode\fP, is:
.DS
.ta .5i +\w'struct vnodeops\0\0'u +\w'*v_vfsmountedhere;\0\0\0'u
/*
 * vnode types. VNON means no type.
 */
enum vtype 	{ VNON, VREG, VDIR, VBLK, VCHR, VLNK, VSOCK };

struct vnode {
	u_short	v_flag;			/* vnode flags (see below) */
	u_short	v_count;		/* reference count */
	u_short	v_shlockc;		/* count of shared locks */
	u_short	v_exlockc;		/* count of exclusive locks */
	struct vfs	*v_vfsmountedhere;	/* ptr to vfs mounted here */
	struct vfs	*v_vfsp;		/* ptr to vfs we are in */
	struct vnodeops	*v_op;			/* vnode operations */
\fB+\fP	struct text	*v_text;		/* text/mapped region */
	enum vtype	v_type;			/* vnode type */
	caddr_t	v_data;			/* private data for fs */
};
.DE
.DS
.ta \w'#define\0\0'u +\w'NOFOLLOW\0\0'u +\w'0x40\0\0\0\0\0\0\0'u
/*
 * vnode flags.
 */
#define	VROOT	0x01	/* root of its file system */
#define	VTEXT	0x02	/* vnode is a pure text prototype */
#define	VEXLOCK	0x10	/* exclusive lock */
#define	VSHLOCK	0x20	/* shared lock */
#define	VLWAIT	0x40	/* proc is waiting on shared or excl. lock */
.DE
.LP
The operations supported by the filesystems on individual \fIvnode\fP\^s
are:
.DS
.ta .5i +\w'int\0\0\0\0\0'u  +\w'(*vn_getattr)(\0\0\0\0\0'u
/*
 * Operations on vnodes.
 */
struct vnodeops {
\fB!\fP	int	(*vn_lookup)(		/* ndp */ );
\fB!\fP	int	(*vn_create)(		/* ndp, vap, fflags */ );
\fB!\fP	int	(*vn_open)(		/* vp, fflags, cred */ );
	int	(*vn_close)(		/* vp, fflags, cred */ );
	int	(*vn_access)(		/* vp, mode, cred */ );
	int	(*vn_getattr)(		/* vp, vap, cred */ );
	int	(*vn_setattr)(		/* vp, vap, cred */ );

\fB+\fP	int	(*vn_read)(		/* vp, uiop, ioflag, cred */ );
\fB+\fP	int	(*vn_write)(		/* vp, uiop, ioflag, cred */ );
\fB!\fP	int	(*vn_ioctl)(		/* vp, com, data, fflag, cred */ );
	int	(*vn_select)(		/* vp, which, cred */ );
\fB+\fP	int	(*vn_mmap)(		/* vp, ..., cred */ );
	int	(*vn_fsync)(		/* vp, cred */ );
\fB+\fP	off_t	(*vn_seek)(		/* vp, old off, new off, whence */ );

\fB!\fP	int	(*vn_remove)(		/* ndp */ );
\fB!\fP	int	(*vn_link)(		/* vp, ndp */ );
\fB!\fP	int	(*vn_rename)(		/* src ndp, target ndp */ );
\fB!\fP	int	(*vn_mkdir)(		/* ndp, vap */ );
\fB!\fP	int	(*vn_rmdir)(		/* ndp */ );
\fB!\fP	int	(*vn_symlink)(		/* ndp, vap, nm */ );
	int	(*vn_readdir)(		/* vp, uiop, ioflag, cred */ );
	int	(*vn_readlink)(		/* vp, uiop, ioflag, cred */ );

\fB+\fP	int	(*vn_abortop)(		/* ndp */ );
\fB!\fP	int	(*vn_inactive)(		/* vp */ );
\fB+\fP	int	(*vn_vptofh)(		/* vp, fhp */ );
};
.DE
.DS
.ta \w'#define\0\0'u +\w'NOFOLLOW\0\0'u +\w'0x40\0\0\0\0\0'u
/*
 * flags for ioflag
 */
#define	IO_UNIT	0x01		/* do io as atomic unit for VOP_RDWR */
#define	IO_APPEND	0x02		/* append write for VOP_RDWR */
#define	IO_SYNC	0x04		/* sync io for VOP_RDWR */
.DE
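.LP
As with the \fIvfs\fP operations, calls are dispatched through the
\fIv_op\fP vector; illustrative wrappers (not part of the proposal
itself) might read:
.DS
#define	VOP_LOOKUP(dvp, ndp) \e
	(*(dvp)->v_op->vn_lookup)(ndp)
#define	VOP_READ(vp, uiop, ioflag, cred) \e
	(*(vp)->v_op->vn_read)(vp, uiop, ioflag, cred)
.DE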
.LP
The argument types listed in the comments following each operation are:
.sp
.IP ndp 10
A pointer to a \fInameidata\fP structure.
.IP vap
A pointer to a \fIvattr\fP structure (vnode attributes; see below).
.IP fflags
File open flags, possibly including O_APPEND, O_CREAT, O_TRUNC and O_EXCL.
.IP vp
A pointer to a \fIvnode\fP previously obtained with \fIvn_lookup\fP.
.IP cred
A pointer to a \fIucred\fP credentials structure.
.IP mode
File access modes; one or more of F_OK, X_OK, W_OK, and R_OK.
.IP uiop
A pointer to a \fIuio\fP structure.
.IP ioflag
Any of the IO flags defined above.
.IP com
An \fIioctl\fP command, with type \fIunsigned long\fP.
.IP data
A pointer to a character buffer used to pass data to or from an \fIioctl\fP.
.IP which
One of FREAD, FWRITE or 0 (select for exceptional conditions).
.IP off
A file offset of type \fIoff_t\fP.
.IP whence
One of L_SET, L_INCR, or L_XTND.
.IP fhp
A pointer to a file handle buffer.
.sp
.PP
Several changes have been made to Sun's set of vnode operations.
Most obviously, the \fIvn_lookup\fP receives a \fInameidata\fP structure
containing its arguments and context as described.
The same structure is also passed to one of the creation or deletion
entries if the lookup operation is for CREATE or DELETE to complete
an operation, or to the \fIvn_abortop\fP entry if no operation
is undertaken.
For filesystems that perform no locking between lookup for creation
or deletion and the call to implement that action,
the final pathname component may be left untranslated by the lookup
routine.
In any case, the pathname pointer points at the final name component,
and the \fInameidata\fP contains a reference to the vnode of the parent
directory.
The interface is thus flexible enough to accommodate filesystems
that are fully stateful or fully stateless, while avoiding redundant
operations whenever possible.
One operation remains problematical, the \fIvn_rename\fP call.
It is tempting to look up the source of the rename for deletion
and the target for creation.
However, filesystems that lock directories during such lookups must avoid
deadlock if the two paths cross.
For that reason, the source is translated for LOOKUP only,
with the WANTPARENT flag set;
the target is then translated with an operation of CREATE.
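.PP
The rename sequence just described may be sketched as follows;
declarations and error handling are abbreviated.
.DS
struct nameidata fromnd, tond;
int error;

fromnd.ni_nameiop = LOOKUP | WANTPARENT;	/* source: lookup only */
fromnd.ni_seg = UIO_USERSPACE;
fromnd.ni_dirp = frompath;
if (error = namei(&fromnd))
	return (error);

tond.ni_nameiop = CREATE;		/* target: setup for creation */
tond.ni_seg = UIO_USERSPACE;
tond.ni_dirp = topath;
if (error = namei(&tond)) {
	(*fromnd.ni_dvp->v_op->vn_abortop)(&fromnd);
	return (error);
}
error = (*fromnd.ni_dvp->v_op->vn_rename)(&fromnd, &tond);
.DE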
.PP
In addition to the changes concerned with the \fInameidata\fP interface,
several other changes were made in the vnode operations.
The \fIvn_rdwr\fP entry was split into \fIvn_read\fP and \fIvn_write\fP;
frequently, the read/write entry amounts to a routine that checks
the direction flag, then calls either a read routine or a write routine.
The two entries may be identical for any given filesystem;
the direction flag is contained in the \fIuio\fP given as an argument.
.PP
All of the read and write operations use a \fIuio\fP to describe
the file offset and buffer locations.
All of these fields must be updated before return.
In particular, the \fIvn_readdir\fP entry uses this
to return a new file offset token for its current location.
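.PP
For a filesystem in which reads and writes share one code path,
the two entries may simply converge on a common routine;
the \fIxfs\fP names below are hypothetical.
.DS
int
xfs_read(vp, uiop, ioflag, cred)
	struct vnode *vp;
	struct uio *uiop;
	int ioflag;
	struct ucred *cred;
{
	/* uiop->uio_rw == UIO_READ */
	return (xfs_rdwr(vp, uiop, ioflag, cred));
}

int
xfs_write(vp, uiop, ioflag, cred)
	struct vnode *vp;
	struct uio *uiop;
	int ioflag;
	struct ucred *cred;
{
	/* uiop->uio_rw == UIO_WRITE */
	return (xfs_rdwr(vp, uiop, ioflag, cred));
}
.DE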
.PP
Several new operations have been added.
The first, \fIvn_seek\fP, is a concession to record-oriented files
such as directories.
It allows the filesystem to verify that a seek leaves a file at a sensible
offset, or to return a new offset token relative to an earlier one.
For most filesystems and files, this operation amounts to performing
simple arithmetic.
Another new entry point is \fIvn_mmap\fP, for use in mapping device memory
into a user process address space.
Its semantics are not yet decided.
The final addition is the \fIvn_vptofh\fP entry.
It is provided for the use of file servers, which need to obtain an opaque
file handle to represent the current vnode for transmission to clients.
This file handle may later be used to relocate the vnode using the vfs
entry \fIvfs_fhtovp\fP.
.PP
The attributes of a vnode are not stored in the vnode,
as they might change with time and may need to be read from a remote
source.
Attributes have the form:
.DS
.ta .5i +\w'struct vnodeops\0\0'u +\w'*v_vfsmountedhere;\0\0\0'u
/*
 * Vnode attributes.  A field value of -1
 * represents a field whose value is unavailable
 * (getattr) or which is not to be changed (setattr).
 */
struct vattr {
	enum vtype	va_type;	/* vnode type (for create) */
	u_short	va_mode;	/* files access mode and type */
\fB!\fP	uid_t	va_uid;		/* owner user id */
\fB!\fP	gid_t	va_gid;		/* owner group id */
	long	va_fsid;	/* file system id (dev for now) */
\fB!\fP	long	va_fileid;	/* file id */
	short	va_nlink;	/* number of references to file */
	u_long	va_size;	/* file size in bytes (quad?) */
\fB+\fP	u_long	va_size1;	/* reserved if not quad */
	long	va_blocksize;	/* blocksize preferred for i/o */
	struct timeval	va_atime;	/* time of last access */
	struct timeval	va_mtime;	/* time of last modification */
	struct timeval	va_ctime;	/* time file changed */
	dev_t	va_rdev;	/* device the file represents */
	u_long	va_blocks;	/* bytes of disk space held by file */
\fB+\fP	u_long	va_blocks1;	/* reserved if va_blocks not a quad */
};
.DE
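.LP
The -1 convention makes selective updates simple;
for example, to change only the access mode
(\fIvattr_null\fP is a hypothetical helper that sets
every field of the structure to -1):
.DS
struct vattr va;
int error;

vattr_null(&va);		/* all fields ``unspecified'' */
va.va_mode = 0644;		/* new access mode */
error = (*vp->v_op->vn_setattr)(vp, &va, cred);
.DE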
.SH
Conclusions
.PP
The Sun VFS filesystem interface is the most widely used generic
filesystem interface.
Of the interfaces examined, it creates the cleanest separation
between the filesystem-independent and -dependent layers and data structures.
It has several flaws, but it is felt that certain changes in the interface
can ameliorate most of them.
The interface proposed here includes those changes.
The proposed interface is now being implemented by the Computer Systems
Research Group at Berkeley.
If the design succeeds in improving the flexibility and performance
of the filesystem layering, it will be advanced as a model interface.
.SH
Acknowledgements
.PP
The filesystem interface described here is derived from Sun's VFS interface.
It also includes features similar to those of DEC's GFS interface.
We are indebted to members of the Sun and DEC system groups
for long discussions of the issues involved.
.br
.ne 2i
.SH
References

.IP Brownbridge82 \w'Satyanarayanan85\0\0'u
Brownbridge, D.R., L.F. Marshall, B. Randell,
``The Newcastle Connection, or UNIXes of the World Unite!,''
\fISoftware\- Practice and Experience\fP, Vol. 12, pp. 1147-1162, 1982.

.IP Cole85
Cole, C.T., P.B. Flinn, A.B. Atlas,
``An Implementation of an Extended File System for UNIX,''
\fIUsenix Conference Proceedings\fP,
pp. 131-150, June, 1985.

.IP Kleiman86
Kleiman, S.R.,
``Vnodes: An Architecture for Multiple File System Types in Sun UNIX,''
\fIUsenix Conference Proceedings\fP,
pp. 238-247, June, 1986.

.IP Leffler84
Leffler, S., M.K. McKusick, M. Karels,
``Measuring and Improving the Performance of 4.2BSD,''
\fIUsenix Conference Proceedings\fP, pp. 237-252, June, 1984.

.IP McKusick84
McKusick, M.K., W.N. Joy, S.J. Leffler, R.S. Fabry,
``A Fast File System for UNIX,'' \fITransactions on Computer Systems\fP,
Vol. 2, pp. 181-197,
ACM, August, 1984.

.IP McKusick85
McKusick, M.K., M. Karels, S. Leffler,
``Performance Improvements and Functional Enhancements in 4.3BSD,''
\fIUsenix Conference Proceedings\fP, pp. 519-531, June, 1985.

.IP Rifkin86
Rifkin, A.P., M.P. Forbes, R.L. Hamilton, M. Sabrio, S. Shah, and K. Yueh,
``RFS Architectural Overview,'' \fIUsenix Conference Proceedings\fP,
pp. 248-259, June, 1986.

.IP Ritchie74
Ritchie, D.M. and K. Thompson, ``The Unix Time-Sharing System,''
\fICommunications of the ACM\fP, Vol. 17, pp. 365-375, July, 1974.

.IP Rodriguez86
Rodriguez, R., M. Koehler, R. Hyde,
``The Generic File System,'' \fIUsenix Conference Proceedings\fP,
pp. 260-269, June, 1986.

.IP Sandberg85
Sandberg, R., D. Goldberg, S. Kleiman, D. Walsh, B. Lyon,
``Design and Implementation of the Sun Network Filesystem,''
\fIUsenix Conference Proceedings\fP,
pp. 119-130, June, 1985.

.IP Satyanarayanan85
Satyanarayanan, M., \fIet al.\fP,
``The ITC Distributed File System: Principles and Design,''
\fIProc. 10th Symposium on Operating Systems Principles\fP, pp. 35-50,
ACM, December, 1985.

.IP Walker85
Walker, B.J. and S.H. Kiser, ``The LOCUS Distributed Filesystem,''
\fIThe LOCUS Distributed System Architecture\fP,
G.J. Popek and B.J. Walker, ed., The MIT Press, Cambridge, MA, 1985.

.IP Weinberger84
Weinberger, P.J., ``The Version 8 Network File System,''
\fIUsenix Conference presentation\fP,
June, 1984.