From cf75bd0d98d91debae974f213700ad77d21e267b Mon Sep 17 00:00:00 2001 From: phk Date: Mon, 18 Feb 2002 09:48:59 +0000 Subject: [PATCH] The DEVFS paper presented at BSDcon-euro 2001 and BSDcon-2002. --- share/doc/papers/Makefile | 2 +- share/doc/papers/devfs/Makefile | 8 + share/doc/papers/devfs/paper.me | 1277 +++++++++++++++++++++++++++++++ 3 files changed, 1286 insertions(+), 1 deletion(-) create mode 100644 share/doc/papers/devfs/Makefile create mode 100644 share/doc/papers/devfs/paper.me diff --git a/share/doc/papers/Makefile b/share/doc/papers/Makefile index f8538316ddfa..1be030cb4710 100644 --- a/share/doc/papers/Makefile +++ b/share/doc/papers/Makefile @@ -1,7 +1,7 @@ # $FreeBSD$ SUBDIR= beyond4.3 bufbio diskperf fsinterface jail kernmalloc kerntune \ - malloc newvm nqnfs px relengr sysperf \ + malloc newvm nqnfs px relengr sysperf devfs \ contents .include diff --git a/share/doc/papers/devfs/Makefile b/share/doc/papers/devfs/Makefile new file mode 100644 index 000000000000..6b0e6fed8071 --- /dev/null +++ b/share/doc/papers/devfs/Makefile @@ -0,0 +1,8 @@ +# $FreeBSD$ + +VOLUME= papers +DOC= devfs +SRCS= paper.me +MACROS= -me + +.include diff --git a/share/doc/papers/devfs/paper.me b/share/doc/papers/devfs/paper.me new file mode 100644 index 000000000000..34ffd1712e7f --- /dev/null +++ b/share/doc/papers/devfs/paper.me @@ -0,0 +1,1277 @@ +.\" format with ditroff -me +.\" $FreeBSD$ +.\" format made to look as a paper for the proceedings is to look +.\" (as specified in the text) +.if n \{ .po 0 +. ll 78n +. na +.\} +.if t \{ .po 1.0i +. ll 6.5i +. nr pp 10 \" text point size +. nr sp \n(pp+2 \" section heading point size +. nr ss 1.5v \" spacing before section headings +.\} +.nr tm 1i +.nr bm 1i +.nr fm 2v +.he '''' +.de bu +.ip \0\s-2\(bu\s+2 +.. 
+.lp +.rs +.ce 5 +.sp +.sz 14 +.b "Rethinking /dev and devices in the UNIX kernel" +.sz 12 +.sp +.i "Poul-Henning Kamp" +.sp .1 +.i "" +.i "The FreeBSD Project" +.i +.sp 1.5 +.b Abstract +.lp +An outstanding novelty in UNIX at its introduction was the notion +of ``a file is a file is a file and even a device is a file.'' +Going from ``hardware only changes when the DEC Field engineer is here'' +to ``my toaster has USB'' has put serious strain on the rather crude +implementation of the ``devices as files'' concept, an implementation which +has survived practically unchanged for 30 years in most UNIX variants. +Starting from a high-level view of devices and the semantics that +have grown around them over the years, this paper takes the audience on a +grand tour of the redesigned FreeBSD device-I/O system, +to convey an overview of how it all fits together, and to explain why +things ended up as they did, how to use the new features and +in particular how not to. +.sp +.if t \{ +.2c +.\} +.\" end boilerplate... paper starts here. +.sh 1 "Introduction" +.sp +There are really only two fundamental ways to conceptualise +I/O devices in an operating system: +The usual way and the UNIX way. +.lp +The usual way is to treat I/O devices as their own class of things, +possibly several classes of things, and provide APIs tailored +to the semantics of the devices. +In practice this means that a program must know what it is dealing +with, it has to interact with disks one way, tapes another and +rodents yet a third way, all of which are different from how it +interacts with a plain disk file. +.lp +The UNIX way has never been described better than in the very first +paper +published on UNIX by Ritchie and Thompson [Ritchie74]: +.(q +Special files constitute the most unusual feature of the UNIX filesystem. +Each supported I/O device is associated with at least one such file. 
+Special files are read and written just like ordinary disk files, +but requests to read or write result in activation of the associated device. +An entry for each special file resides in directory /dev, +although a link may be made to one of these files just as it may to an +ordinary file. +Thus, for example, to write on a magnetic tape one may write on the file /dev/mt. + +Special files exist for each communication line, each disk, each tape drive, +and for physical main memory. +Of course, the active disks and the memory special files are protected from indiscriminate access. + +There is a threefold advantage in treating I/O devices this way: +file and device I/O are as similar as possible; +file and device names have the same syntax and meaning, +so that a program expecting a file name as a parameter can be passed a device name; +finally, special files are subject to the same protection mechanism as regular files. +.)q +.lp +.\" (Why was this so special at the time?) +At the time, this was quite a strange concept; it was totally accepted +for instance, that neither the system administrator nor the users were +able to interact with a disk as a disk. +Operating systems simply +did not provide access to disk other than as a filesystem. +Most vendors did not even release a program to initialise a +disk-pack with a filesystem: selling pre-initialised and ``quality +tested'' disk-packs was quite a profitable business. +.lp +In many cases some kind of API for reading and +writing individual sectors on a disk pack +did exist in the operating system, +but more often than not +it was not listed in the public documentation. +.sh 2 "The traditional implementation" +.lp +.\" (Explain how opening /dev/lpt0 lands you in the right device driver) +The initial implementation used hardcoded inode numbers [Ritchie98]. +The console +device would be inode number 5, the paper-tape-punch number 6 and so on, +even if those inodes were also actual regular files in the filesystem. 
+.lp +For reasons one can only too vividly imagine, this was changed and +Thompson +[Thompson78] +describes how the implementation now used ``major and minor'' +device numbers to index through the devsw array to the correct device driver. +.lp +For all intents and purposes, this is the implementation which survives +in most UNIX-like systems even to this day. +Apart from the access control and timestamp information which is +found in all inodes, the special inodes in the filesystem contain only +one piece of information: the major and minor device numbers, often +logically OR'ed to one field. +.lp +When a program opens a special file, the kernel uses the major number +to find the entry points in the device driver, and passes the combined +major and minor numbers as a parameter to the device driver. +.sh 1 "The challenge" +.lp +Now, we did not talk much about where the special inodes came from +to begin with. +They were created by hand, using the +mknod(2) system call, usually through the mknod(8) program. +.lp +In those days a +computer had a very static hardware configuration\** +.(f +\** Unless your assigned field engineer was present on site. +.)f +and it certainly did not +change while the system was up and running, so creating device nodes +by hand was certainly an acceptable solution. +.lp +The first sign that this would not hold up as a solution came with +the advent of TCP/IP and the telnet(1) program, or more precisely +with the telnetd(8) daemon. +In order to support remote login a ``pseudo-tty'' device driver was implemented, +basically a tty driver which instead of hardware had another device which +would allow a process to ``act as hardware'' for the tty. +The telnetd(8) daemon would read and write data on the ``master'' side of +the pseudo-tty and the user would be running on the ``slave'' side, +which would act just like any other tty: you could change the erase +character if you wanted to and all the signals and all that stuff worked.
+.lp +Obviously with a device requiring no hardware, you can compile as many +instances into the kernel as you like, as long as you do not use +too much memory. +As system after system was connected +to the ARPANet, ``increasing number of ptys'' became a regular task +for system administrators, and part of this task was to create +more special nodes in the filesystem. +.lp +Several UNIX vendors also noticed an issue when they sold minicomputers +in many different configurations: explaining to system administrators +just which special nodes they would need and how to create them was +a significant documentation hassle. Some opted for the simple solution +and pre-populated /dev with every conceivable device node, resulting +in a predictable slowdown on access to filenames in /dev. +.lp +System V UNIX provided a band-aid solution: +a special boot sequence would take effect if the kernel or +the hardware had changed since last reboot. +This boot procedure would +amongst other things create the necessary special files in the filesystem, +based on an intricate system of per device driver configuration files. +.lp +In recent years, we have become used to hardware which changes +configuration at any time: people plug USB, Firewire and PCCard +devices into their computers. +These devices can be anything from modems and disks to GPS receivers +and fingerprint authentication hardware. +Suddenly maintaining the +correct set of special devices in ``/dev'' became a major headache. +.lp +Along the way, UNIX kernels had learned to deal with multiple filesystem +types [Heidemann91a] and a ``device-pseudo-filesystem'' was a pretty +obvious idea. +The device drivers have a pretty good idea which +devices they have found in the configuration, so all that is needed is +to present this information as a filesystem filled with just the right +special files. +Experience has shown that this, like most other ``pseudo +filesystems'', sounds a lot simpler in theory than in practice.
+.sh 1 "Truly understanding devices" +.lp +Before we continue, we need to fully understand the +``device special file'' in UNIX. +.lp +First we need to realize that a special file has the nature of +a pointer from the filesystem into a different namespace; +a little understood fact with far reaching consequences. +.lp +One implication of this is that several special files can +exist in the filename namespace all pointing to the same device +but each having their own access and timestamp attributes: +.lp +.(b M +.vs -3 +\fC\s-3guest# ls -l /dev/fd0 /tmp/fd0 +crw-r----- 1 root operator 9, 0 Sep 27 19:21 /dev/fd0 +crw-rw-rw- 1 root wheel 9, 0 Sep 27 19:24 /tmp/fd0\fP\s+3 +.vs +3 +.)b +Obviously, the administrator needs to be on top of this: +one popular way to exploit an unguarded root prompt is +to create a replica of the special file /dev/kmem +in a location where it will not be noticed. +Since /dev/kmem gives access to the kernel memory, +gaining any particular +privilege can be arranged by suitably modifying the kernel's +data structures through the illicit special file. +.lp +When NFS appeared it opened a new avenue for this attack: +People may have root privilege on one machine but not another. +Since device nodes are not interpreted on the NFS server +but rather on the local computer, +a user with root privilege on a NFS client +computer can create a device node to his liking on a filesystem +mounted from an NFS server. +This device node can in turn be used to +circumvent the security of other computers which mount that filesystem, +including the server, unless they protect themselves by not +trusting any device entries on untrusted filesystem by mounting such +filesystems with the \fCnodev\fP mount-option. 
+.lp +The fact that the device itself does not actually exist inside the +filesystem which holds the special file makes it possible +to perform boot-strapping stunts in the spirit +of Baron Von Münchausen [raspe1785], +where a filesystem is (re)mounted using one of its own +device vnodes: +.(b M +.vs -3 +\fC\s-2guest# mount -o ro /dev/fd0 /mnt +guest# fsck /mnt/dev/fd0 +guest# mount -u -o rw /mnt/dev/fd0 /mnt\fP\s+2 +.vs +3 +.)b +.lp +Other interesting details are chroot(2) and jail(2) [Kamp2000] which +provide filesystem isolation for process-trees. +Whereas chroot(2) was not implemented as a security tool [Mckusick1999] +(although it has been widely used as such), the jail(2) security +facility in FreeBSD provides a pretty convincing ``virtual machine'' +where even the root privilege is isolated and restricted to the designated +area of the machine. +Obviously chroot(2) and jail(2) may require access to a well-defined +subset of devices like /dev/null, /dev/zero and /dev/tty, +whereas access to other devices such as /dev/kmem +or any disks could be used to compromise the integrity of the jail(2) +confinement. +.lp +For a long time FreeBSD, like almost all UNIX-like systems had two kinds +of devices, ``block'' and +``character'' special files, the difference being that ``block'' +devices would provide caching and alignment for disk device access. +This was one of those minor architectural mistakes which took +forever to correct. +.lp +The argument that block devices were a mistake is really very +very simple: Many devices other than disks have multiple modes +of access which you select by choosing which special file to use. +.lp +Pick any old timer and he will be able to recite painful +sagas about the crucial difference between the /dev/rmt +and /dev/nrmt devices for tape access.\** +.(f +\** Make absolutely sure you know the difference before you take +important data on a multi-file 9-track tape to remote locations. 
+.)f +.lp +Tapes, asynchronous ports, line printer ports and many other devices +have implemented submodes, selectable by the user +at a special filename level, but that has not earned them their +own special file types. +Only disks\** +.(f +\** Well, OK: and some 9-track tapes. +.)f +have enjoyed the privilege of getting an entire file type dedicated to +a minor device mode. +.lp +Caching and alignment modes should have been enabled by setting +some bit in the minor device number on the disk special file, +not by polluting the filesystem code with another file type. +.lp +In FreeBSD block devices were not even implemented in a fashion +which would be of any use, since any write errors would never be +reported to the writing process. For this reason, and since no +applications +were found to be in existence which relied on block devices +and since historical usage was indeed historical [Mckusick2000], +block devices were removed from the FreeBSD system. +This greatly simplified the task of keeping track of open(2) +reference counts for disks and +removed much magic special-case code throughout. +.lp +.sh 1 "Files, sockets, pipes, SVID IPC and devices" +.sp +It is an instructive lesson in inconsistency to look at the +various types of ``things'' a process can access in UNIX-like +systems today. +.lp +First there are normal files, which are our reference yardstick here: +they are accessed with open(2), read(2), write(2), mmap(2), close(2) +and various other auxiliary system calls. +.lp +Sockets and pipes are also accessed via file handles but each has +its own namespace. That means you cannot open(2) a socket,\** +.(f +\** This is particularly bizarre in the case of UNIX domain sockets +which use the filesystem as their namespace and appear in directory +listings. +.)f +but you can read(2) and write(2) to it. +Sockets and pipes vector off at the file descriptor level and do +not get in touch with the vnode based part of the kernel at all.
+.lp +Devices land somewhere in the middle between pipes and sockets on +one side and normal files on the other. +They use the filesystem +namespace, are implemented with vnodes, and can be operated +on like normal files, but don't actually live in the filesystem. +.lp +Devices are in fact special-cased all the way through the vnode system. +For one thing devices break the ``one file-one vnode'' +rule, making it necessary to chain all vnodes for the same +device together in +order to be able to find ``the canonical vnode for this device node'', +but more importantly, many operations have to be specifically denied +on special file vnodes since they do not make any sense. +.lp +For true inconsistency, consider the SVID IPC mechanisms - not +only do they not operate via file handles, +but they also sport a singularly +ill-conceived 32-bit numeric namespace and a dedicated set of +system calls for access. +.lp +Several people have convincingly argued that this is an inconsistent +mess, and have proposed and implemented more consistent operating systems +like Plan 9 from Bell Labs [Pike90a] [Pike92a]. +Unfortunately reality is that people are not interested in learning a new +operating system when the one they have is pretty darn good, and +consequently research into better and more consistent ways is +a pretty frustrating [Pike2000] but by no means irrelevant topic. +.sh 1 "Solving the /dev maintenance problem" +.lp +There are a number of obvious, simple but wrong ways one could +go about solving the ``/dev'' maintenance problem. +.lp +The very straightforward way is to hack the namei() kernel function +responsible for filename translation and lookup. +It is only a minor matter of programming to +add code to special-case any lookup which ends up in ``/dev''.
+But this leads to problems: in the case of chroot(2) or jail(2), the +administrator will want to present only a subset of the available +devices in ``/dev'', so some kind of state will have to be kept per +chroot(2)/jail(2) about which devices are visible and +which devices are hidden, but no obvious location for this information +is available in the absence of a mount data structure. +.lp +It also leads to some unpleasant issues +because of the fact that ``/dev/foo'' is a synthesised directory +entry which may or may not actually be present on the filesystem +which seems to provide ``/dev''. +The vnodes either have to belong to a filesystem or they +must be special-cased throughout the vnode layer of the kernel. +.lp +Finally there is the simple matter of generality: +hardcoding the string "/dev" in the kernel is not very general. +.lp +A cruder solution is to leave it to a daemon: make a special +device driver, have a daemon read messages from it and create and +destroy nodes in ``/dev'' in response to these messages. +.lp +The main drawback to this idea is that now we have added IPC +to the mix introducing new and interesting race conditions. +.lp +Otherwise this solution is surprisingly effective, +but chroot(2)/jail(2) requirements prevent a simple implementation +and running a daemon per jail would become an administrative +nightmare. +.lp +Another pitfall of +this approach is that we are not able to remount the root filesystem +read-write at boot until we have a device node for the root device, +but if this node is missing we cannot create it with a daemon since +the root filesystem (and hence /dev) is read-only. +Adding a read-write memory-filesystem mounted on /dev to solve this problem +does not improve +the architectural qualities further and certainly the KISS principle has +been violated by now. +.lp +The final and in the end only satisfactory solution is to write a ``DEVFS'' +which mounts on ``/dev''.
+.lp +The good news is that it does solve the problem with chroot(2) and jail(2): +just mount a DEVFS instance on the ``dev'' directory inside the filesystem +subtree where the chroot or jail lives. Having a mountpoint gives us +a convenient place to keep track of the local state of this DEVFS mount. +.lp +The bad news is that it takes a lot of cleanup and care to implement +a DEVFS into a UNIX kernel. +.sh 1 "DEVFS architectural decisions" +.lp +Before implementing a DEVFS, it is necessary to decide on a range +of corner cases in behaviour, and some of these choices have proved +surprisingly hard to settle for the FreeBSD project. +.sh 2 "The ``persistence'' issue" +.lp +When DEVFS in FreeBSD was initially presented at a BoF at the 1995 +USENIX Technical Conference in New Orleans, +a group of people demanded that it provide ``persistence'' +for administrative changes. +.lp +When trying to get a definition of ``persistence'', people can generally +agree that if the administrator changes the access control bits of +a device node, they want that mode to survive across reboots. +.lp +Once more tricky examples of the sort of manipulations one can do +on special files are proposed, people rapidly disagree about what +should be supported and what should not. +.lp +For instance, imagine a +system with one floppy drive which appears in DEVFS as ``/dev/fd0''. +Now the administrator, in order to get some badly written software +to run, links this to ``/dev/fd1'': +.(b M +\fC\s-2ln /dev/fd0 /dev/fd1\fP\s+2 +.)b +This works as expected and with persistence in DEVFS, the link is +still there after a reboot. +But what if after a reboot another floppy drive has been connected +to the system? +This drive would naturally have the name ``/dev/fd1'', +but this name is now occupied by the administrator's hard link. +Should the link be broken? +Should the new floppy drive be called +``/dev/fd2''? Nobody can agree on anything but the ugliness of the +situation.
+.lp +Given that we are no longer dependent on DEC Field engineers to +change all four wheels to see which one is flat, the basic assumption +that the machine has a constant hardware configuration is simply no +longer true. +The new assumption one should start from when analysing this +issue is that when the system boots, we cannot know what devices we +will find, and we cannot know if the devices we do find +are the same ones we had when the system was last shut down. +.lp +And in fact, this is very much the case with laptops today: if I attach +my IOmega Zip drive to my laptop it appears like a SCSI disk named +``/dev/da0'', but so does the RAID-5 array attached to the PCI SCSI controller +installed in my laptop's docking station. If I change mode to ``a+rw'' +on the Zip drive, do I want that mode to apply to the RAID-5 as well? +Unlikely. +.lp +And what if we have persistent information about the mode of +device ``/dev/sio0'', but we boot and do not find any sio devices? +Do we keep the information in our device-persistence registry? +How long do we keep it? If I borrow a modem card, +set the permissions to some non-standard value like 0666, +and then attach some other serial device a year from now - do I +want some old permission changes to come back and haunt me, +just because they both happened to be ``/dev/sio0''? +Unlikely. +.lp +The fact that more people have laptop computers today than +five years ago, and the fact that nobody has been able to credibly +propose where a persistent DEVFS would actually store the +information about these things in the first place has settled the issue. +.lp +Persistence may be the right answer, but to the +wrong question: persistence is not a desirable property for a DEVFS +when the hardware configuration may change literally at any time. +.sh 2 "Who decides on the names?"
+.lp +In a DEVFS-enabled system, the responsibility for creating nodes in +/dev shifts to the device drivers, and consequently the device +drivers get to choose the names of the device files. +In addition an initial value for owner, group and mode bits are +provided by the device driver. +.lp +But should it be possible to rename ``/dev/lpt0'' to ``/dev/myprinter''? +While the obvious affirmative answer is easy to arrive at, it leaves +a lot to be desired once the implications are unmasked. +.lp +Most device drivers know their own name and use it purposefully in +their debug and log messages to identify themselves. +Furthermore, the ``NewBus'' [NewBus] infrastructure facility, +which ties hardware to device drivers, identifies things by name +and unit numbers. +.lp +A very common way to report errors in fact: +.(b M +.vs -3 +\fC\s-2#define LPT_NAME "lpt" /* our official name */ +[...] +printf(LPT_NAME + ": cannot alloc ppbus (%d)!", error);\fP\s+2 +.vs +3 +.)b +.lp +So despite the user renaming the device node pointing to the printer +to ``myprinter'', this has absolutely no effect in the kernel and can +be considered a userland aliasing operation. +.lp +The decision was therefore made that it should not be possible to rename +device nodes since it would only lead to confusion and because the desired +effect could be attained by giving the user the ability to create +symlinks in DEVFS. +.sh 2 "On-demand device creation" +.lp +Pseudo-devices like pty, tun and bpf, +but also some real devices, may not pre-emptively create entries for all +possible device nodes. It would be a pointless waste of resources +to always create 1000 ptys just in case they are needed, +and in the worst case more than 1800 device nodes would be needed per +physical disk to represent all possible slices and partitions. 
+.lp +For pseudo-devices the task at hand is to make a magic device node, +``/dev/pty'', which when opened will magically transmogrify into the +first available pty subdevice, maybe ``/dev/pty123''. +.lp +Device submodes, on the other hand, work by having multiple +entries in /dev, each with a different minor number, as a way to instruct +the device driver in aspects of its operation. The most widespread +example is probably ``/dev/mt0'' and ``/dev/nmt0'', where the node +with the extra ``n'' +instructs the tape device driver to not rewind on close.\** +.(f +\** This is the answer to the question in footnote number 2. +.)f +.lp +Some UNIX systems have solved the problem for pseudo-devices by +creating magic cloning devices like ``/dev/tcp''. +When a cloning device is opened, +it finds a free instance and through vnode and file descriptor mangling +returns this new device to the opening process. +.lp +This scheme has two disadvantages: the complexity of switching vnodes +in midstream is non-trivial, but even worse is the fact that it +does not work for +submodes for a device because it only reacts to one particular /dev entry. +.lp +The solution for both needs is a more flexible on-demand device +creation, implemented in FreeBSD as a two-level lookup. +When a +filename is looked up in DEVFS, a match in the existing device nodes is +sought first and if found, returned. +If no match is found, device drivers are polled in turn to ask if +they would be able to synthesise a device node of the given name. +.lp +The device driver gets a chance to modify the name +and create a device with make_dev().
+If one of the drivers succeeds in this, the lookup is started over and +the newly found device node is returned: +.(b M +.vs -3 +\fC\s-2pty_clone() + if (name != "pty") + return(NULL); /* no luck */ + n = find_next_unit(); + dev = make_dev(...,n,"pty%d",n); + name = dev->name; + return(dev);\fP\s+2 +.vs +3 +.)b +.lp +An interesting mixed use of this mechanism is with the sound device drivers. +Modern sound devices have multiple channels, presumably to allow the +user to listen to CNN, Napstered MP3 files and Quake sound effects at +the same time. +The only problem is that all applications attempt to open ``/dev/dsp'' +since they have no concept of multiple sound devices. +The sound device drivers use the cloning facility to direct ``/dev/dsp'' +to the first available sound channel completely transparently to the +process. +.lp +There are very few drawbacks to this mechanism, the major one being +that ``ls /dev'' now errs on the sparse side instead of the rich when used +as a system device inventory, a practice which has always been +of dubious precision at best. +.sh 2 "Deleting and recreating devices" +.lp +Deleting device nodes is no problem to implement, but as likely as not, +some people will want a method to get them back. +Since only the device driver knows how to create a given device, +recreation cannot be performed solely on the basis of the parameters +provided by a process in userland. +.lp +In order to not complicate the code which updates the directory +structure for a mountpoint to reflect changes in the DEVFS inode list, +a deleted entry is merely marked with DE_WHITEOUT instead of being +removed entirely. +Otherwise a separate list would be needed for inodes which we had +deleted so that they would not be mistaken for new inodes. +.lp +The obvious way to recreate deleted devices is to let mknod(2) do it +by matching the name and disregarding the major/minor arguments. +Recreating the device with mknod(2) will simply remove the DE_WHITEOUT +flag.
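The whiteout behaviour described in this subsection can be illustrated with a small userland model. This is only a sketch under stated assumptions, not the FreeBSD DEVFS code: the toy_ names, the flat array standing in for the per-mountpoint directory, and the flag value are all hypothetical; only the idea of marking an entry with a whiteout flag and clearing it again from mknod(2) by name comes from the text.

```c
#include <stddef.h>
#include <string.h>

#define TOY_DE_WHITEOUT 0x1	/* hypothetical value: hidden, still listed */

struct toy_dirent {
	const char *de_name;
	int de_flags;
};

/* Find an entry by name, whether or not it is whited out. */
static struct toy_dirent *
toy_find(struct toy_dirent *dir, size_t n, const char *name)
{
	for (size_t i = 0; i < n; i++)
		if (strcmp(dir[i].de_name, name) == 0)
			return (&dir[i]);
	return (NULL);
}

/* Name lookup as seen by processes: whited-out entries are invisible. */
static struct toy_dirent *
toy_lookup(struct toy_dirent *dir, size_t n, const char *name)
{
	struct toy_dirent *de = toy_find(dir, n, name);

	if (de != NULL && (de->de_flags & TOY_DE_WHITEOUT))
		return (NULL);
	return (de);
}

/* Deleting a node only marks it, so it is not mistaken for a new one. */
static int
toy_remove(struct toy_dirent *dir, size_t n, const char *name)
{
	struct toy_dirent *de = toy_lookup(dir, n, name);

	if (de == NULL)
		return (-1);
	de->de_flags |= TOY_DE_WHITEOUT;
	return (0);
}

/* Recreation matches on the name alone; major/minor are disregarded. */
static int
toy_mknod(struct toy_dirent *dir, size_t n, const char *name,
    unsigned major, unsigned minor)
{
	struct toy_dirent *de = toy_find(dir, n, name);

	(void)major;
	(void)minor;
	if (de == NULL || !(de->de_flags & TOY_DE_WHITEOUT))
		return (-1);	/* only the driver can create new devices */
	de->de_flags &= ~TOY_DE_WHITEOUT;
	return (0);
}
```

In this model a process can bring back ``fd0'' after removing it, but an attempt to create a name no driver ever announced fails, mirroring the rule that only the device driver knows how to create a given device.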
+.sh 2 "Jail(2), chroot(2) and DEVFS" +.lp +The primary requirement from facilities like jail(2) and chroot(2) +is that it must be possible to control the contents of a DEVFS mount +point. +.lp +Obviously, it would not be desirable for dynamic devices to pop +into existence in the carefully pruned /dev of jails so it must be +possible to mark a DEVFS mountpoint as ``no new devices''. +And in the same way, the jailed root should not be able to recreate +device nodes which the real root has removed. +.lp +These behaviours will be controlled with mount options, but these have not +yet been implemented because FreeBSD has run out of bitmap flags for +mount options, and a new unlimited mount option implementation is +still not in place at the time of writing. +.lp +One mount option ``jaildevfs'', will restrict the contents of the +DEVFS mountpoint to the ``normal set'' of devices for a jail and +automatically hide all future devices and make it impossible +for a jailed root to un-hide hidden entries while letting an un-jailed +root do so. +.lp +Mounting or remounting read-only, will prevent all future +devices from appearing and will make it impossible to +hide or un-hide entries in the mountpoint. +This is probably only useful for chroots or jails where no tty +access is intended since cloning will not work either. +.lp +More mount options may be needed as more experience is gained. +.sh 2 "Default mode, owner & group" +.lp +When a device driver creates a device node, and a DEVFS mount adds it +to its directory tree, it needs to have some values for the access +control fields: mode, owner and group. +.lp +Currently, the device driver specifies the initial values in the +make_dev() call, but this is far from optimal. +For one thing, embedding magic UIDs and GIDs in the kernel is simply +bad style unless they are numerically zero. +More seriously, they represent compile-time defaults which in these +enlightened days is rather old-fashioned. 
+.lp +.sh 1 "Cleaning up before we build: struct specinfo and dev_t" +.lp +Most of the rest of the paper will be about the various challenges +and issues in the implementation of DEVFS in FreeBSD. +All of this should be applicable to other systems derived from +4.4BSD-Lite as well. +.lp +POSIX has defined a type called ``dev_t'' which is the identity of a device. +This is mainly for use in the few system calls which know about devices: +stat(2), fstat(2) and mknod(2). +A dev_t is constructed by logically OR'ing +the major# and minor# for the device. +Since those have been defined +as having no overlapping bits, the major# and minor# +can be retrieved from the dev_t by a simple masking operation. +.lp +Although the kernel had a well-defined concept of any particular +device it did not have a data structure to represent "a device". +The device driver has such a structure, traditionally called ``softc'' +but the high kernel does not (and should not!) have access to the +device driver's private data structures. +.lp +It is an interesting tale how things got to be this way,\** +.(f +\** Basically, devices should have been moved up with sockets and +pipes at the file descriptor level when the VFS layering was introduced, +rather than have all the special casing throughout the vnode system. +.)f +but for now just record for +a fact how the actual relationship between the data structures was +in the 4.4BSD release (Fig. 1). [44BSDBook] +.(z +.PS 3 +F: box "file" "handle" +arrow down from F.s +V: box "vnode" +arrow right from V.e +S: box "specinfo" +arrow down from V.s +I: box "inode" +arrow right from I.e +C: box invis "devsw[]" "[major#]" +arrow down from C.s +D: box "device" "driver" +line right from D.e +box invis "softc[]" "[minor#]" +F2: box "file" "handle" at F + (2.5,0) +arrow down from F2.s +V2: box "vnode" +arrow right from V2.e +S2: box "specinfo" +arrow down from V2.s +I2: box "inode" +arrow left from I2.w +.PE +.ce 1 +Fig. 
1 - Data structures in 4.4BSD +.)z +.lp +As for all other files, a vnode references a filesystem inode, but +in addition it points to a ``specinfo'' structure. In the inode +we find the dev_t which is used to reference the device driver. +.lp +Access to the device driver happens by extracting the major# from +the dev_t, indexing through the global devsw[] array to locate +the device driver's entry point. +.lp +The device driver will extract the minor# from the dev_t and use +that as the index into the softc array of private data per device. +.lp +The ``specinfo'' structure is a little sidekick vnodes grew along the way, +and is used to find all vnodes which reference the same device (i.e. +they have the same major# and minor#). +This linkage is used to determine +which vnode is the ``chosen one'' for this device, and to keep track of +open(2)/close(2) against this device. +The actual implementation was an inefficient hash table +which, depending on the vnode reclamation rate and /dev directory lookup +traffic, could become a measurable performance liability. +.sh 2 "The new vnode/inode/dev_t layout" +.lp +In the new layout (Fig. 2) the specinfo structure takes a central +role. There is only one instance of struct specinfo per +device (i.e. unique major# +and minor# combination) and all vnodes referencing this device point +to this structure directly. +.(z +.PS 2.25 +F: box "file" "handle" +arrow down from F.s +V: box "vnode" +arrow right from V.e +S: box "specinfo" +arrow down from V.s +I: box "inode" +F2: box "file" "handle" at F + (2.5,0) +arrow down from F2.s +V2: box "vnode" +arrow left from V2.w +arrow down from V2.s +I2: box "inode" +arrow down from S.s +D: box "device" "driver" +.PE +.ce 1 +Fig. 2 - The new FreeBSD data structures. +.)z +.lp +In userland, a dev_t is still the logical OR of the major# and +minor#, but this entity is now called a udev_t in the kernel. +In the kernel a dev_t is now a pointer to a struct specinfo.
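The split between the userland number and the kernel pointer can be illustrated with a small C sketch. Everything named sketch_* is hypothetical, and the bit layout (eight bits of major# above eight bits of minor#) is an assumption chosen only to keep the example short; it is not the layout any particular system uses.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Assumed layout, for illustration only: 8 bits of major# placed
 * above 8 bits of minor#, so the two fields never overlap.
 */
typedef uint32_t sketch_udev_t;

#define SKETCH_MINOR_BITS	8u
#define SKETCH_MINOR_MASK	0x00ffu
#define SKETCH_MAJOR_MASK	0xff00u

/* Build the userland-visible number by OR'ing the two fields. */
static sketch_udev_t
sketch_makeudev(unsigned major, unsigned minor)
{
	assert(major <= 0xffu && minor <= 0xffu);
	return ((sketch_udev_t)((major << SKETCH_MINOR_BITS) | minor));
}

/* Recover each field with a mask (and a shift for the major#). */
static unsigned
sketch_major(sketch_udev_t d)
{
	return ((d & SKETCH_MAJOR_MASK) >> SKETCH_MINOR_BITS);
}

static unsigned
sketch_minor(sketch_udev_t d)
{
	return (d & SKETCH_MINOR_MASK);
}

/*
 * Kernel side of the sketch: the "dev_t" is a pointer to a per-device
 * structure which remembers the number and a driver-owned pointer.
 */
struct sketch_specinfo {
	sketch_udev_t	 si_udev;	/* hypothetical field name */
	void		*si_drv1;	/* driver-owned, e.g. the softc */
};
typedef struct sketch_specinfo *sketch_dev_t;
```

With this layout, sketch_makeudev(4, 7) yields 0x0407, and both fields come back out by masking, while code holding a sketch_dev_t can reach the driver's data without any array indexing.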
+.lp +All vnodes referencing a device are linked to a list hanging +directly off the specinfo structure, removing the need for the +hash table and consequently simplifying and speeding up a lot +of code dealing with vnode instantiation, retirement and +name-caching. +.lp +The entry points to the device driver are stored in the specinfo +structure, removing the need for the devsw[] array and allowing +device drivers to use separate entrypoints for various minor numbers. +.lp +This is very convenient for devices which have a ``control'' +device for management and tuning. The control device almost always +has entirely separate open/close/ioctl implementations [MD.C]. +.lp +In addition to this, two data elements are included in the specinfo +structure but ``owned'' by the device driver. Typically the +device driver will store a pointer to the softc structure in +one of these, and unit number or mode information in the other. +.lp +This removes the need for drivers to find the softc using array +indexing based on the minor#, and at the same time has obviated +the need for the compiled-in ``NFOO'' constants which traditionally +determined how many softc structures and therefore devices +the driver could support.\** +.(f +\** Not to mention all the drivers which implemented panic(2) +because they forgot to perform bounds checking on the index before +using it on their softc arrays. +.)f +.lp +There are some trivial technical issues relating to allocating +the storage for specinfo early in the boot sequence and how to +find a specinfo from the udev_t/major#+minor#, but they will +not be discussed here. +.sh 2 "Creating and destroying devices" +.lp +Ideally, devices should only be created and +destroyed by the device drivers which know what devices are present. +This is accomplished with the make_dev() and destroy_dev() +function calls. +.lp +Life is seldom quite that simple.
The operating system might be called +on to act as an NFS server for a diskless workstation, possibly even +of a different architecture, so we still need to be able to represent +device nodes with no device driver backing in the filesystems and +consequently we need to be able to create a specinfo from +the major#+minor# in these inodes when we encounter them. +In practice this is quite trivial, but in a few places in the code +one needs to be aware of the existence +of both ``named'' and ``anonymous'' specinfo structures. +.lp +The make_dev() call creates a specinfo structure and populates +it with driver entry points, major#, minor#, device node name +(for instance ``lpt0''), UID, GID and access mode bits. The return +value is a dev_t (i.e., a pointer to struct specinfo). +If the device driver determines that the device is no longer +present, it calls destroy_dev(), giving a dev_t as argument +and the dev_t will be cleaned and converted to an anonymous dev_t. +.lp +Once created with make_dev() a named dev_t exists until destroy_dev() +is called by the driver. The driver can rely on this and keep state +in the fields in dev_t which are reserved for driver use. +.sh 1 "DEVFS" +.lp +By now we have all the relevant information about each device node +collected in struct specinfo but we still have one problem to +solve before we can add the DEVFS filesystem on top of it. +.sh 2 "The interrupt problem" +.lp +Some device drivers, notably the CAM/SCSI subsystem in FreeBSD, +will discover changes in the device configuration inside an interrupt +routine. +.lp +This imposes some limitations on what can and should be done: +first one should minimise the amount +of work done in an interrupt routine for performance reasons; +second, to avoid deadlocks, vnodes and mountpoints should not be +accessed from an interrupt routine.
+.lp +Also, in addition to the locking issue, +a machine can have many instances of DEVFS mounted: +for a jail(8) based virtual-machine web-server several hundred instances +is not unheard of, making it far too expensive to update all of them +in an interrupt routine. +.lp +The solution to this problem is to do all the filesystem work on +the filesystem side of DEVFS and use atomically manipulated integer indices +(``inode numbers'') as the barrier between the two sides. +.lp +The functions called from the device drivers, make_dev(), destroy_dev() +&c. only manipulate the DEVFS inode number of the dev_t in +question and do not even get near any mountpoints or vnodes. +.lp +For make_dev() the task is to assign a unique inode number to the +dev_t and store the dev_t in the DEVFS-global inode-to-dev_t array. +.(b M +.vs -3 +\fC\s-2make_dev(...) + store argument values in dev_t + assign unique inode number to dev_t + atomically insert dev_t into inode_array\fP\s+2 +.vs +3 +.)b +.lp +For destroy_dev() the task is the opposite: clear the inode number +in the dev_t and NULL the pointer in the devfs-global inode-to-dev_t +array. +.(b M +.vs -3 +\fC\s-2destroy_dev(...) + clear fields in dev_t + zero dev_t inode number. + atomically clear entry in inode_array\fP\s+2 +.vs +3 +.)b +.lp +Both functions conclude by atomically incrementing a global variable +\fCdevfs_generation\fP to leave an indication to the filesystem +side that something has changed. +.lp +By modifying the global state only with atomic instructions, locks +have been entirely avoided in this part of the code which means that +the make_dev() and destroy_dev() functions can be called from practically +anywhere in the kernel at any time. 
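The device-driver side described above can be sketched in portable C11 using <stdatomic.h>. This is only an illustration under stated assumptions, not the FreeBSD code: the sketch_* names, the fixed table size and the use of compare-and-swap to claim a slot are all hypothetical choices made to keep the example self-contained.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

#define SKETCH_NINODES	64	/* hypothetical fixed table size */

struct sketch_dev {
	const char	*name;
	int		 inode;	/* 0 means "no inode assigned" */
};

/* Global inode-to-device table and change counter (both hypothetical). */
static _Atomic(struct sketch_dev *) sketch_inode_array[SKETCH_NINODES];
static atomic_uint sketch_generation;

/* Claim the first free slot with compare-and-swap; no locks are taken. */
static int
sketch_make_dev(struct sketch_dev *dev, const char *name)
{
	assert(dev != NULL);
	dev->name = name;
	for (int i = 1; i < SKETCH_NINODES; i++) {
		struct sketch_dev *expected = NULL;

		dev->inode = i;		/* fill in before publishing */
		if (atomic_compare_exchange_strong(&sketch_inode_array[i],
		    &expected, dev)) {
			/* Tell the filesystem side something changed. */
			atomic_fetch_add(&sketch_generation, 1);
			return (i);
		}
	}
	dev->inode = 0;
	return (-1);			/* table full */
}

/* Clear the inode number, drop the table entry, bump the counter. */
static void
sketch_destroy_dev(struct sketch_dev *dev)
{
	int i = dev->inode;

	if (i <= 0 || i >= SKETCH_NINODES)
		return;
	dev->inode = 0;
	atomic_store(&sketch_inode_array[i], (struct sketch_dev *)NULL);
	atomic_fetch_add(&sketch_generation, 1);
}

/* A mountpoint is stale when its saved counter differs from the global. */
struct sketch_mount {
	unsigned	generation;
};

static int
sketch_mount_is_stale(const struct sketch_mount *mp)
{
	return (mp->generation != atomic_load(&sketch_generation));
}
```

Neither function touches a vnode or a mountpoint; the only shared state is the table and the counter, which is what makes this kind of code callable from contexts where sleeping or locking is not an option.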
+.lp +On the filesystem side of DEVFS, the only two vnode methods which examine +or rely on the directory structure, VOP_LOOKUP and VOP_READDIR, +call the function devfs_populate() to update their mountpoint's view +of the device hierarchy to match current reality before doing any work. +.(b M +.vs -3 +\fC\s-2devfs_readdir(...) + devfs_populate(...) + ...\fP\s+2 +.vs +3 +.)b +.lp +The devfs_populate() function compares the current \fCdevfs_generation\fP +to the value saved in the mountpoint last time devfs_populate() completed +and if (actually: while) they differ a linear run is made through the +devfs-global inode-array and the directory tree of the mountpoint is +brought up to date. +.lp +The actual code is slightly more complicated than shown in the pseudo-code +here because it has to deal with subdirectories and hidden entries. +.(b M +.vs -3 +\fC\s-2devfs_populate(...) + while (mount->generation != devfs_generation) + for i in all inodes + if inode created + create directory entry + else if inode destroyed + remove directory entry\fP\s+2 +.vs +3 +.)b +.lp +Access to the global DEVFS inode table is again implemented +with atomic instructions and failsafe retries to avoid the +need for locking. +.lp +From a performance point of view this scheme also means that a particular +DEVFS mountpoint is not updated until it needs to be, and then always by +a process belonging to the jail in question thus minimising and +distributing the CPU load. +.sh 1 "Device-driver impact" +.lp +All these changes have had a significant impact on how device drivers +interact with the rest of the kernel regarding registration of +devices. +.lp +If we look first at the ``before'' image in Fig. 3, we notice first +the NFOO define which imposes a firm upper limit on the number of +devices the kernel can deal with. +Also notice that the softc structure for all of them is allocated +at compile time.
+This is because most device drivers (and texts on writing device +drivers) are from before the general +kernel malloc facility [Mckusick1988] was introduced into the BSD kernel. +.lp +.(b M +.vs -3 +\fC\s-2 +#ifndef NFOO +# define NFOO 4 +#endif + +struct foo_softc { + ... +} foo_softc[NFOO]; + +int nfoo = 0; + +foo_open(dev, ...) +{ + int unit = minor(dev); + struct foo_softc *sc; + + if (unit >= NFOO || unit >= nfoo) + return (ENXIO); + + sc = &foo_softc[unit]; + + ... +} + +foo_attach(...) +{ + struct foo_softc *sc; + static int once; + + ... + if (nfoo >= NFOO) { + /* Have hardware, can't handle */ + return (-1); + } + sc = &foo_softc[nfoo++]; + if (!once) { + cdevsw_add(&cdevsw); + once++; + } + ... +} +\fP\s+2 +Fig. 3 - Device-driver, old style. +.vs +3 +.)b +.lp +Also notice how range checking is needed to make sure that the +minor# is within range. This code gets more complex if device-numbering +is sparse. Code equivalent to that shown in the foo_open() routine +would also be needed in foo_read(), foo_write(), foo_ioctl() &c. +.lp +Finally notice how the attach routine needs to remember to register +the cdevsw structure (not shown) when the first device is found. +.lp +Now, compare this to our ``after'' image in Fig. 4. +NFOO is totally gone and so is the compile time allocation +of space for softc structures. +.lp +The foo_open (and foo_close, foo_ioctl &c) functions can now +derive the softc pointer directly from the dev_t they receive +as an argument. +.lp +.(b M +.vs -3 +\fC\s-2 +struct foo_softc { + .... +}; + +int nfoo; + +foo_open(dev, ...) +{ + struct foo_softc *sc = dev->si_drv1; + + ... +} + +foo_attach(...) +{ + struct foo_softc *sc; + + ... + sc = MALLOC(..., M_ZERO); + if (sc == NULL) { + /* Have hardware, can't handle */ + return (-1); + } + sc->dev = make_dev(&cdevsw, nfoo, + UID_ROOT, GID_WHEEL, 0644, + "foo%d", nfoo); + nfoo++; + sc->dev->si_drv1 = sc; + ... +} +\fP\s+2 +Fig. 4 - Device-driver, new style.
+.vs +3 +.)b +.lp +In foo_attach() we can now attach to all the devices we can +allocate memory for and we register the cdevsw structure per +dev_t rather than globally. +.lp +This last trick is what allows us to discard all bounds checking +in the foo_open() &c. routines, because they can only be +called through the cdevsw, and the cdevsw is only attached to +dev_t's which foo_attach() has created. +There is no way to end +up in foo_open() with a dev_t not created by foo_attach(). +.lp +In the two examples here, the difference is only 10 lines of source +code, primarily because only one of the worker functions of the +device driver is shown. +In real device drivers it is not uncommon to save 50 or more lines +of source code which typically is about a percent or two of the +entire driver. +.sh 1 "Future work" +.lp +Apart from some minor issues to be cleaned up, DEVFS is now a reality +and future work therefore is likely to concentrate on applying the +facilities and functionality of DEVFS to FreeBSD. +.sh 2 "devd" +.lp +It would be logical to complement DEVFS with a ``device-daemon'' which +could configure and de-configure devices as they come and go. +When a disk appears, mount it. +When a network interface appears, configure it. +And in some configurable way allow the user to customise the action, +so that for instance images will automatically be copied off the +flash-based media from a camera, &c. +.lp +In this context it is good to question how we view dynamic devices. +If for instance a printer is removed in the middle of a print job +and another printer arrives a moment later, should the system +automatically continue the print job on this new printer? +When a disk-like device arrives, should we always mount it? Should +we have a database of known disk-like devices to tell us where to +mount it and what permissions to give the mountpoint? +Some computers come in multiple configurations, for instance laptops +with and without their docking station.
How do we want to present +this to the users and what behaviour do the users expect? +.sh 2 "Pathname length limitations" +.lp +In order to simplify memory management in the early stages of boot, +the pathname relative to the mountpoint is presently stored in a +small fixed size buffer inside struct specinfo. +It should be possible to use filenames as long as the system otherwise +permits, so some kind of extension mechanism is called for. +.lp +Since it cannot be guaranteed that memory can be allocated in +all the possible scenarios where make_dev() can be called, it may +be necessary to mandate that the caller allocates the buffer if +the content will not fit inside the default buffer size. +.sh 2 "Initial access parameter selection" +.lp +As it is now, device drivers propose the initial mode, owner and group +for the device nodes, but it would be more flexible if it were possible +to give the kernel a set of rules, much like packet filtering rules, +which allow the user to set the wanted policy for new devices. +Such a mechanism could also be used to filter new devices for mount +points in jails and to determine other behaviour. +.lp +Doing these things from userland results in some awkward race conditions +and software bloat for embedded systems, so a kernel approach may be more +suitable. +.sh 2 "Applications of on-demand device creation" +.lp +The facility for on-demand creation of devices has some very interesting +possibilities. +.lp +One planned use is to enable user-controlled labelling +of disks. +Today disks have names like /dev/da0, /dev/ad4, but since +this numbering is topological any change in the hardware configuration +may rename the disks, causing /etc/fstab and backup procedures +to get out of sync with the hardware. 
+.lp +The current idea is to store on the media of the disk a user-chosen +disk name and allow access through this name, so that for instance +/dev/mydisk0 +would be a symlink to whatever topological name the disk might have +at any given time. +.lp +To simplify this and to avoid a forest of symlinks, it will probably +be decided to move all the sub-divisions of a disk into one subdirectory +per disk so just a single symlink can do the job. +In practice that means that the current /dev/ad0s2f will become +something like /dev/ad0/s2f and so on. +Obviously, in the same way, disks could also be accessed by their +topological address, down to the specific path in a SAN environment. +.lp +Another potential use could be for automated offline data media libraries. +It would be quite trivial to make it possible to access all the media +in the library using /dev/lib/$LABEL which would be a remarkable +simplification compared with most current automated retrieval facilities. +.lp +Another use could be to access devices by parameter rather than by +name. One could imagine sending a printjob to /dev/printer/color/A2 +and behind the scenes a search would be made for a device with the +correct properties and paper-handling facilities. +.sh 1 "Conclusion" +.lp +DEVFS has been successfully implemented in FreeBSD, +including a powerful, simple and flexible solution supporting +pseudo-devices and on-demand device node creation. +.lp +Contrary to the trend, the implementation added functionality +with a net decrease in source lines, +primarily because of the improved API seen from the device drivers' point of view. +.lp +Even if DEVFS is not desired, other 4.4BSD-derived UNIX variants +would probably benefit from adopting the dev_t/specinfo related +cleanup. +.sh 1 "Acknowledgements" +.lp +I first got started on DEVFS in 1989 because the abysmal performance +of the Olivetti M250 computer forced me to implement a network-disk-device +for Minix in order to retain my sanity.
+That initial work led to a +crude but working DEVFS for Minix, so obviously both Andrew Tanenbaum +and Olivetti deserve credit for inspiration. +.lp +Julian Elischer implemented a DEVFS for FreeBSD around 1994 which never +quite made it to maturity and subsequently was abandoned. +.lp +Bruce Evans deserves special credit not only for his keen eye for detail, +and his competent criticism but also for his enthusiastic resistance to the +very concept of DEVFS. +.lp +Many thanks to the people who took time to help me stamp out ``Danglish'' +through their reviews and comments: Chris Demetriou, Paul Richards, +Brian Somers, Nik Clayton, and Hanne Munkholm. +Any remaining insults to proper use of the English language are my own fault. +.\" (list & why) +.sh 1 "References" +.lp +[44BSDBook] +Mckusick, Bostic, Karels & Quarterman: +``The Design and Implementation of the 4.4BSD Operating System.'' +Addison Wesley, 1996, ISBN 0-201-54979-4. +.lp +[Heidemann91a] +John S. Heidemann: +``Stackable layers: an architecture for filesystem development.'' +Master's thesis, University of California, Los Angeles, July 1991. +Available as UCLA technical report CSD-910056. +.lp +[Kamp2000] +Poul-Henning Kamp and Robert N. M. Watson: +``Confining the Omnipotent root.'' +Proceedings of the SANE 2000 Conference. +Available in FreeBSD distributions in \fC/usr/share/papers\fP. +.lp +[MD.C] +Poul-Henning Kamp et al: +FreeBSD memory disk driver: +\fCsrc/sys/dev/md/md.c\fP +.lp +[Mckusick1988] +Marshall Kirk Mckusick, Mike J. Karels: +``Design of a General Purpose Memory Allocator for the 4.3BSD UNIX-Kernel'' +Proceedings of the San Francisco USENIX Conference, pp. 295-303, June 1988. +.lp +[Mckusick1999] +Dr. Marshall Kirk Mckusick: +Private email communication. +\fI``According to the SCCS logs, the chroot call was added by Bill Joy +on March 18, 1982 approximately 1.5 years before 4.2BSD was released.
+That was well before we had ftp servers of any sort (ftp did not +show up in the source tree until January 1983). My best guess as +to its purpose was to allow Bill to chroot into the /4.2BSD build +directory and build a system using only the files, include files, +etc contained in that tree. That was the only use of chroot that +I remember from the early days.''\fP +.lp +[Mckusick2000] +Dr. Marshall Kirk Mckusick: +Private communication at BSDcon2000 conference. +\fI``I have not used block devices since I wrote FFS and that +was \fPmany\fI years ago.''\fP +.lp +[NewBus] +NewBus is a subsystem which provides most of the glue between +hardware and device drivers. Despite the importance of this +there has never been published any good overview documentation +for it. +The following article by Alexander Langer in ``Dæmonnews'' is +the best reference I can come up with: +\fC\s-2http://www.daemonnews.org/200007/newbus-intro.html\fP\s+2 +.lp +[Pike2000] +Rob Pike: +``Systems Software Research is Irrelevant.'' +\fC\s-2http://www.cs.bell\-labs.com/who/rob/utah2000.pdf\fP\s+2 +.lp +[Pike90a] +Rob Pike, Dave Presotto, Ken Thompson and Howard Trickey: +``Plan 9 from Bell Labs.'' +Proceedings of the Summer 1990 UKUUG Conference. +.lp +[Pike92a] +Rob Pike, Dave Presotto, Ken Thompson, Howard Trickey and Phil Winterbottom: +``The Use of Name Spaces in Plan 9.'' +Proceedings of the 5th ACM SIGOPS Workshop. +.lp +[Raspe1785] +Rudolf Erich Raspe: +``Baron Münchhausen's Narrative of his marvellous Travels and Campaigns in Russia.'' +Kearsley, 1785. +.lp +[Ritchie74] +D.M. Ritchie and K. Thompson: +``The UNIX Time-Sharing System'' +Communications of the ACM, Vol. 17, No. 7, July 1974. +.lp +[Ritchie98] +Dennis Ritchie: private conversation at USENIX Annual Technical Conference +New Orleans, 1998. +.lp +[Thompson78] +Ken Thompson: +``UNIX Implementation'' +The Bell System Technical Journal, vol 57, 1978, number 6 (part 2) p. 1931ff.