.\" format with ditroff -me
.\" $FreeBSD$
.\" format made to look as a paper for the proceedings is to look
.\" (as specified in the text)
.if n \{ .po 0
. ll 78n
. na
.\}
.if t \{ .po 1.0i
. ll 6.5i
. nr pp 10 \" text point size
. nr sp \n(pp+2 \" section heading point size
. nr ss 1.5v \" spacing before section headings
.\}
.nr tm 1i
.nr bm 1i
.nr fm 2v
.he ''''
.de bu
.ip \0\s-2\(bu\s+2
..
.lp
.rs
.ce 5
.sp
.sz 14
.b "Rethinking /dev and devices in the UNIX kernel"
.sz 12
.sp
.i "Poul-Henning Kamp"
.sp .1
.i "<phk@FreeBSD.org>"
.i "The FreeBSD Project"
.i
.sp 1.5
.b Abstract
.lp
An outstanding novelty in UNIX at its introduction was the notion
of ``a file is a file is a file and even a device is a file.''
Going from ``hardware only changes when the DEC Field engineer is here''
to ``my toaster has USB'' has put serious strain on the rather crude
implementation of the ``devices as files'' concept, an implementation which
has survived practically unchanged for 30 years in most UNIX variants.
Starting from a high-level view of devices and the semantics that
have grown around them over the years, this paper takes the audience on a
grand tour of the redesigned FreeBSD device-I/O system,
to convey an overview of how it all fits together, and to explain why
things ended up as they did, how to use the new features and
in particular how not to.
.sp
.if t \{
.2c
.\}
.\" end boilerplate... paper starts here.
.sh 1 "Introduction"
.sp
There are really only two fundamental ways to conceptualise
I/O devices in an operating system:
The usual way and the UNIX way.
.lp
The usual way is to treat I/O devices as their own class of things,
possibly several classes of things, and provide APIs tailored
to the semantics of the devices.
In practice this means that a program must know what it is dealing
with: it has to interact with disks one way, tapes another and
rodents yet a third way, all of which are different from how it
interacts with a plain disk file.
.lp
The UNIX way has never been described better than in the very first
paper
published on UNIX by Ritchie and Thompson [Ritchie74]:
.(q
Special files constitute the most unusual feature of the UNIX filesystem.
Each supported I/O device is associated with at least one such file.
Special files are read and written just like ordinary disk files,
but requests to read or write result in activation of the associated device.
An entry for each special file resides in directory /dev,
although a link may be made to one of these files just as it may to an
ordinary file.
Thus, for example, to write on a magnetic tape one may write on the file /dev/mt.
Special files exist for each communication line, each disk, each tape drive,
and for physical main memory.
Of course, the active disks and the memory special files are protected from indiscriminate access.
There is a threefold advantage in treating I/O devices this way:
file and device I/O are as similar as possible;
file and device names have the same syntax and meaning,
so that a program expecting a file name as a parameter can be passed a device name;
finally, special files are subject to the same protection mechanism as regular files.
.)q
.lp
.\" (Why was this so special at the time?)
At the time, this was quite a strange concept; it was totally accepted
for instance, that neither the system administrator nor the users were
able to interact with a disk as a disk.
Operating systems simply
did not provide access to disk other than as a filesystem.
Most vendors did not even release a program to initialise a
disk-pack with a filesystem: selling pre-initialised and ``quality
tested'' disk-packs was quite a profitable business.
.lp
In many cases some kind of API for reading and
writing individual sectors on a disk pack
did exist in the operating system,
but more often than not
it was not listed in the public documentation.
.sh 2 "The traditional implementation"
.lp
.\" (Explain how opening /dev/lpt0 lands you in the right device driver)
The initial implementation used hardcoded inode numbers [Ritchie98].
The console
device would be inode number 5, the paper-tape-punch number 6 and so on,
even if those inodes were also actual regular files in the filesystem.
.lp
For reasons one can only too vividly imagine, this was changed and
Thompson
[Thompson78]
describes how the implementation now used ``major and minor''
device numbers to index through the devsw array to the correct device driver.
.lp
For all intents and purposes, this is the implementation which survives
in most UNIX-like systems even to this day.
Apart from the access control and timestamp information which is
found in all inodes, the special inodes in the filesystem contain only
one piece of information: the major and minor device numbers, often
logically OR'ed to one field.
.lp
When a program opens a special file, the kernel uses the major number
to find the entry points in the device driver, and passes the combined
major and minor numbers as a parameter to the device driver.
.sh 1 "The challenge"
.lp
Now, we did not talk much about where the special inodes came from
to begin with.
They were created by hand, using the
mknod(2) system call, usually through the mknod(8) program.
.lp
In those days a
computer had a very static hardware configuration\**
.(f
\** Unless your assigned field engineer was present on site.
.)f
and it certainly did not
change while the system was up and running, so creating device nodes
by hand was certainly an acceptable solution.
.lp
The first sign that this would not hold up as a solution came with
the advent of TCP/IP and the telnet(1) program, or more precisely
with the telnetd(8) daemon.
In order to support remote login a ``pseudo-tty'' device driver was implemented,
basically a tty driver which instead of hardware had another device which
would allow a process to ``act as hardware'' for the tty.
The telnetd(8) daemon would read and write data on the ``master'' side of
the pseudo-tty and the user would be running on the ``slave'' side,
which would act just like any other tty: you could change the erase
character if you wanted to and all the signals and all that stuff worked.
.lp
Obviously with a device requiring no hardware, you can compile as many
instances into the kernel as you like, as long as you do not use
too much memory.
As system after system was connected
to the ARPANet, ``increasing number of ptys'' became a regular task
for system administrators, and part of this task was to create
more special nodes in the filesystem.
.lp
Several UNIX vendors also noticed an issue when they sold minicomputers
in many different configurations: explaining to system administrators
just which special nodes they would need and how to create them was
a significant documentation hassle. Some opted for the simple solution
and pre-populated /dev with every conceivable device node, resulting
in a predictable slowdown on access to filenames in /dev.
.lp
System V UNIX provided a band-aid solution:
a special boot sequence would take effect if the kernel or
the hardware had changed since the last reboot.
This boot procedure would
amongst other things create the necessary special files in the filesystem,
based on an intricate system of per device driver configuration files.
.lp
In recent years, we have become used to hardware which changes
configuration at any time: people plug USB, Firewire and PCCard
devices into their computers.
These devices can be anything from modems and disks to GPS receivers
and fingerprint authentication hardware.
Suddenly maintaining the
correct set of special devices in ``/dev'' became a major headache.
.lp
Along the way, UNIX kernels had learned to deal with multiple filesystem
types [Heidemann91a] and a ``device-pseudo-filesystem'' was a pretty
obvious idea.
The device drivers have a pretty good idea which
devices they have found in the configuration, so all that is needed is
to present this information as a filesystem filled with just the right
special files.
Experience has shown that this, like most other ``pseudo
filesystems'', sounds a lot simpler in theory than it is in practice.
.sh 1 "Truly understanding devices"
.lp
Before we continue, we need to fully understand the
``device special file'' in UNIX.
.lp
First we need to realize that a special file has the nature of
a pointer from the filesystem into a different namespace;
a little understood fact with far reaching consequences.
.lp
One implication of this is that several special files can
exist in the filename namespace all pointing to the same device
but each having their own access and timestamp attributes:
.lp
.(b M
.vs -3
\fC\s-3guest# ls -l /dev/fd0 /tmp/fd0
crw-r----- 1 root operator 9, 0 Sep 27 19:21 /dev/fd0
crw-rw-rw- 1 root wheel 9, 0 Sep 27 19:24 /tmp/fd0\fP\s+3
.vs +3
.)b
Obviously, the administrator needs to be on top of this:
one popular way to exploit an unguarded root prompt is
to create a replica of the special file /dev/kmem
in a location where it will not be noticed.
Since /dev/kmem gives access to the kernel memory,
gaining any particular
privilege can be arranged by suitably modifying the kernel's
data structures through the illicit special file.
.lp
When NFS appeared it opened a new avenue for this attack:
People may have root privilege on one machine but not another.
Since device nodes are not interpreted on the NFS server
but rather on the local computer,
a user with root privilege on an NFS client
computer can create a device node to his liking on a filesystem
mounted from an NFS server.
This device node can in turn be used to
circumvent the security of other computers which mount that filesystem,
including the server, unless they protect themselves by not
trusting any device entries on untrusted filesystems, mounting such
filesystems with the \fCnodev\fP mount-option.
.lp
The fact that the device itself does not actually exist inside the
filesystem which holds the special file makes it possible
to perform boot-strapping stunts in the spirit
of Baron Von Münchausen [raspe1785],
where a filesystem is (re)mounted using one of its own
device vnodes:
.(b M
.vs -3
\fC\s-2guest# mount -o ro /dev/fd0 /mnt
guest# fsck /mnt/dev/fd0
guest# mount -u -o rw /mnt/dev/fd0 /mnt\fP\s+2
.vs +3
.)b
.lp
Other interesting details are chroot(2) and jail(2) [Kamp2000] which
provide filesystem isolation for process-trees.
Whereas chroot(2) was not implemented as a security tool [Mckusick1999]
(although it has been widely used as such), the jail(2) security
facility in FreeBSD provides a pretty convincing ``virtual machine''
where even the root privilege is isolated and restricted to the designated
area of the machine.
Obviously chroot(2) and jail(2) may require access to a well-defined
subset of devices like /dev/null, /dev/zero and /dev/tty,
whereas access to other devices such as /dev/kmem
or any disks could be used to compromise the integrity of the jail(2)
confinement.
.lp
For a long time FreeBSD, like almost all UNIX-like systems, had two kinds
of devices, ``block'' and
``character'' special files, the difference being that ``block''
devices would provide caching and alignment for disk device access.
This was one of those minor architectural mistakes which took
forever to correct.
.lp
The argument that block devices were a mistake is really very
simple: many devices other than disks have multiple modes
of access which you select by choosing which special file to use.
.lp
Pick any old timer and he will be able to recite painful
sagas about the crucial difference between the /dev/rmt
and /dev/nrmt devices for tape access.\**
.(f
\** Make absolutely sure you know the difference before you take
important data on a multi-file 9-track tape to remote locations.
.)f
.lp
Tapes, asynchronous ports, line printer ports and many other devices
have implemented submodes, selectable by the user
at a special filename level, but that has not earned them their
own special file types.
Only disks\**
.(f
\** Well, OK: and some 9-track tapes.
.)f
have enjoyed the privilege of getting an entire file type dedicated to
a minor device mode.
.lp
Caching and alignment modes should have been enabled by setting
some bit in the minor device number on the disk special file,
not by polluting the filesystem code with another file type.
.lp
In FreeBSD block devices were not even implemented in a fashion
which would be of any use, since any write errors would never be
reported to the writing process. For this reason, and since no
applications
were found to be in existence which relied on block devices
and since historical usage was indeed historical [Mckusick2000],
block devices were removed from the FreeBSD system.
This greatly simplified the task of keeping track of open(2)
reference counts for disks and
removed much magic special-case code throughout.
.lp
.sh 1 "Files, sockets, pipes, SVID IPC and devices"
.sp
It is an instructive lesson in inconsistency to look at the
various types of ``things'' a process can access in UNIX-like
systems today.
.lp
First there are normal files, which are our reference yardstick here:
they are accessed with open(2), read(2), write(2), mmap(2), close(2)
and various other auxiliary system calls.
.lp
Sockets and pipes are also accessed via file handles but each has
its own namespace. That means you cannot open(2) a socket,\**
.(f
\** This is particularly bizarre in the case of UNIX domain sockets
which use the filesystem as their namespace and appear in directory
listings.
.)f
but you can read(2) and write(2) to it.
Sockets and pipes vector off at the file descriptor level and do
not get in touch with the vnode based part of the kernel at all.
.lp
Devices land somewhere in the middle between pipes and sockets on
one side and normal files on the other.
They use the filesystem
namespace, are implemented with vnodes, and can be operated
on like normal files, but don't actually live in the filesystem.
.lp
Devices are in fact special-cased all the way through the vnode system.
For one thing devices break the ``one file-one vnode''
rule, making it necessary to chain all vnodes for the same
device together in
order to be able to find ``the canonical vnode for this device node'',
but more importantly, many operations have to be specifically denied
on special file vnodes since they do not make any sense.
.lp
For true inconsistency, consider the SVID IPC mechanisms - not
only do they not operate via file handles,
but they also sport a singularly
ill-conceived 32-bit numeric namespace and a dedicated set of
system calls for access.
.lp
Several people have convincingly argued that this is an inconsistent
mess, and have proposed and implemented more consistent operating systems
like Plan 9 from Bell Labs [Pike90a] [Pike92a].
Unfortunately, the reality is that people are not interested in learning a new
operating system when the one they have is pretty darn good, and
consequently research into better and more consistent ways is
a pretty frustrating [Pike2000] but by no means irrelevant topic.
.sh 1 "Solving the /dev maintenance problem"
.lp
There are a number of obvious, simple but wrong ways one could
go about solving the ``/dev'' maintenance problem.
.lp
The very straightforward way is to hack the namei() kernel function
responsible for filename translation and lookup.
It is only a minor matter of programming to
add code to special-case any lookup which ends up in ``/dev''.
But this leads to problems: in the case of chroot(2) or jail(2), the
administrator will want to present only a subset of the available
devices in ``/dev'', so some kind of state will have to be kept per
chroot(2)/jail(2) about which devices are visible and
which devices are hidden, but no obvious location for this information
is available in the absence of a mount data structure.
.lp
It also leads to some unpleasant issues
because of the fact that ``/dev/foo'' is a synthesised directory
entry which may or may not actually be present on the filesystem
which seems to provide ``/dev''.
The vnodes either have to belong to a filesystem or they
must be special-cased throughout the vnode layer of the kernel.
.lp
Finally there is the simple matter of generality:
hardcoding the string ``/dev'' in the kernel is not very general.
.lp
A cruder solution is to leave it to a daemon: make a special
device driver, have a daemon read messages from it and create and
destroy nodes in ``/dev'' in response to these messages.
.lp
The main drawback to this idea is that now we have added IPC
to the mix introducing new and interesting race conditions.
.lp
Otherwise this solution is surprisingly effective,
but chroot(2)/jail(2) requirements prevent a simple implementation
and running a daemon per jail would become an administrative
nightmare.
.lp
Another pitfall of
this approach is that we are not able to remount the root filesystem
read-write at boot until we have a device node for the root device,
but if this node is missing we cannot create it with a daemon since
the root filesystem (and hence /dev) is read-only.
Adding a read-write memory-filesystem mount on /dev to solve this problem
does not improve
the architectural qualities further and certainly the KISS principle has
been violated by now.
.lp
The final and in the end only satisfactory solution is to write a ``DEVFS''
which mounts on ``/dev''.
.lp
The good news is that it does solve the problem with chroot(2) and jail(2):
just mount a DEVFS instance on the ``dev'' directory inside the filesystem
subtree where the chroot or jail lives. Having a mountpoint gives us
a convenient place to keep track of the local state of this DEVFS mount.
.lp
The bad news is that it takes a lot of cleanup and care to implement
a DEVFS in a UNIX kernel.
.sh 1 "DEVFS architectural decisions"
.lp
Before implementing a DEVFS, it is necessary to decide on a range
of corner cases in behaviour, and some of these choices have proved
surprisingly hard to settle for the FreeBSD project.
.sh 2 "The ``persistence'' issue"
.lp
When DEVFS in FreeBSD was initially presented at a BoF at the 1995
USENIX Technical Conference in New Orleans,
a group of people demanded that it provide ``persistence''
for administrative changes.
.lp
When trying to get a definition of ``persistence'', people can generally
agree that if the administrator changes the access control bits of
a device node, they want that mode to survive across reboots.
.lp
Once more tricky examples of the sort of manipulations one can do
on special files are proposed, people rapidly disagree about what
should be supported and what should not.
.lp
For instance, imagine a
system with one floppy drive which appears in DEVFS as ``/dev/fd0''.
Now the administrator, in order to get some badly written software
to run, links this to ``/dev/fd1'':
.(b M
\fC\s-2ln /dev/fd0 /dev/fd1\fP\s+2
.)b
This works as expected and with persistence in DEVFS, the link is
still there after a reboot.
But what if after a reboot another floppy drive has been connected
to the system?
This drive would naturally have the name ``/dev/fd1'',
but this name is now occupied by the administrator's hard link.
Should the link be broken?
Should the new floppy drive be called
``/dev/fd2''? Nobody can agree on anything but the ugliness of the
situation.
.lp
Given that we are no longer dependent on DEC Field engineers to
change all four wheels to see which one is flat, the basic assumption
that the machine has a constant hardware configuration is simply no
longer true.
The new assumption one should start from when analysing this
issue is that when the system boots, we cannot know what devices we
will find, and we cannot know if the devices we do find
are the same ones we had when the system was last shut down.
.lp
And in fact, this is very much the case with laptops today: if I attach
my IOmega Zip drive to my laptop it appears like a SCSI disk named
``/dev/da0'', but so does the RAID-5 array attached to the PCI SCSI controller
installed in my laptop's docking station. If I change mode to ``a+rw''
on the Zip drive, do I want that mode to apply to the RAID-5 as well?
Unlikely.
.lp
And what if we have persistent information about the mode of
device ``/dev/sio0'', but we boot and do not find any sio devices?
Do we keep the information in our device-persistence registry?
How long do we keep it? If I borrow a modem card,
set the permissions to some non-standard value like 0666,
and then attach some other serial device a year from now - do I
want some old permission changes to come back and haunt me,
just because they both happened to be ``/dev/sio0''?
Unlikely.
.lp
The fact that more people have laptop computers today than
five years ago, and the fact that nobody has been able to credibly
propose where a persistent DEVFS would actually store the
information about these things in the first place has settled the issue.
.lp
Persistence may be the right answer, but to the
wrong question: persistence is not a desirable property for a DEVFS
when the hardware configuration may change literally at any time.
.sh 2 "Who decides on the names?"
.lp
In a DEVFS-enabled system, the responsibility for creating nodes in
/dev shifts to the device drivers, and consequently the device
drivers get to choose the names of the device files.
In addition, initial values for the owner, group and mode bits are
provided by the device driver.
.lp
But should it be possible to rename ``/dev/lpt0'' to ``/dev/myprinter''?
While the obvious affirmative answer is easy to arrive at, it leaves
a lot to be desired once the implications are unmasked.
.lp
Most device drivers know their own name and use it purposefully in
their debug and log messages to identify themselves.
Furthermore, the ``NewBus'' [NewBus] infrastructure facility,
which ties hardware to device drivers, identifies things by name
and unit numbers.
.lp
In fact, this is a very common way to report errors:
.(b M
.vs -3
\fC\s-2#define LPT_NAME "lpt" /* our official name */
[...]
printf(LPT_NAME
": cannot alloc ppbus (%d)!", error);\fP\s+2
.vs +3
.)b
.lp
So despite the user renaming the device node pointing to the printer
to ``myprinter'', this has absolutely no effect in the kernel and can
be considered a userland aliasing operation.
.lp
The decision was therefore made that it should not be possible to rename
device nodes since it would only lead to confusion and because the desired
effect could be attained by giving the user the ability to create
symlinks in DEVFS.
.sh 2 "On-demand device creation"
.lp
Pseudo-devices like pty, tun and bpf,
but also some real devices, may not pre-emptively create entries for all
possible device nodes. It would be a pointless waste of resources
to always create 1000 ptys just in case they are needed,
and in the worst case more than 1800 device nodes would be needed per
physical disk to represent all possible slices and partitions.
.lp
For pseudo-devices the task at hand is to make a magic device node,
``/dev/pty'', which when opened will magically transmogrify into the
first available pty subdevice, maybe ``/dev/pty123''.
.lp
Device submodes, on the other hand, work by having multiple
entries in /dev, each with a different minor number, as a way to instruct
the device driver in aspects of its operation. The most widespread
example is probably ``/dev/mt0'' and ``/dev/nmt0'', where the node
with the extra ``n''
instructs the tape device driver to not rewind on close.\**
.(f
\** This is the answer to the question in footnote number 2.
.)f
.lp
Some UNIX systems have solved the problem for pseudo-devices by
creating magic cloning devices like ``/dev/tcp''.
When a cloning device is opened,
it finds a free instance and, through vnode and file descriptor mangling,
returns this new device to the opening process.
.lp
This scheme has two disadvantages: the complexity of switching vnodes
in midstream is non-trivial, but even worse is the fact that it
does not work for
submodes for a device because it only reacts to one particular /dev entry.
.lp
The solution for both needs is a more flexible on-demand device
creation, implemented in FreeBSD as a two-level lookup.
When a
filename is looked up in DEVFS, a match in the existing device nodes is
sought first and if found, returned.
If no match is found, device drivers are polled in turn to ask if
they would be able to synthesise a device node of the given name.
.lp
The device driver gets a chance to modify the name
and create a device with make_dev().
If one of the drivers succeeds in this, the lookup is started over and
the newly found device node is returned:
.(b M
.vs -3
\fC\s-2pty_clone()
    if (name != "pty")
        return(NULL); /* no luck */
    n = find_next_unit();
    dev = make_dev(...,n,"pty%d",n);
    name = dev->name;
    return(dev);\fP\s+2
.vs +3
.)b
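.lp
The two-level lookup can be sketched in ordinary user-space C.
This is an illustration only, not the FreeBSD implementation:
the names devfs_lookup(), find_node(), make_node() and the handler
table are hypothetical stand-ins, and make_node() merely mimics the
role that make_dev() plays in the text.
.(b M
```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

#define MAXNODES 16
#define NAMELEN  32

/* Hypothetical, much simplified device node: only a name. */
struct devnode { char name[NAMELEN]; };

static struct devnode nodes[MAXNODES];
static int nnodes;

/* Stand-in for make_dev(): register a node under the given name. */
static struct devnode *make_node(const char *name)
{
    struct devnode *dn;

    if (nnodes >= MAXNODES)
        return NULL;
    dn = &nodes[nnodes++];
    snprintf(dn->name, sizeof(dn->name), "%s", name);
    return dn;
}

/* Level 1: search the nodes which already exist. */
static struct devnode *find_node(const char *name)
{
    for (int i = 0; i < nnodes; i++)
        if (strcmp(nodes[i].name, name) == 0)
            return &nodes[i];
    return NULL;
}

/* A clone handler may rewrite the name and create the matching node. */
typedef int (*clone_fn)(char *name, size_t len);

static int next_pty;

static int pty_clone(char *name, size_t len)
{
    char unit[NAMELEN];

    assert(name != NULL && len > 0);
    if (strcmp(name, "pty") != 0)
        return 0;                       /* not ours, no luck */
    snprintf(unit, sizeof(unit), "pty%d", next_pty++);
    if (make_node(unit) == NULL)
        return 0;
    snprintf(name, len, "%s", unit);    /* report the new name */
    return 1;
}

static clone_fn handlers[] = { pty_clone };

struct devnode *devfs_lookup(const char *wanted)
{
    char name[NAMELEN];
    struct devnode *dn;

    snprintf(name, sizeof(name), "%s", wanted);
    if ((dn = find_node(name)) != NULL)
        return dn;
    /* Level 2: poll the drivers, then start the lookup over. */
    for (size_t i = 0; i < sizeof(handlers) / sizeof(handlers[0]); i++)
        if (handlers[i](name, sizeof(name)))
            return find_node(name);
    return NULL;
}
```
.)b
.lp
In this sketch, looking up ``pty'' twice yields ``pty0'' and then ``pty1'',
while a later lookup of ``pty0'' is satisfied at the first level without
consulting any handler.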
.lp
An interesting mixed use of this mechanism is with the sound device drivers.
Modern sound devices have multiple channels, presumably to allow the
user to listen to CNN, Napstered MP3 files and Quake sound effects at
the same time.
The only problem is that all applications attempt to open ``/dev/dsp''
since they have no concept of multiple sound devices.
The sound device drivers use the cloning facility to direct ``/dev/dsp''
to the first available sound channel completely transparently to the
process.
.lp
There are very few drawbacks to this mechanism, the major one being
that ``ls /dev'' now errs on the sparse side instead of the rich when used
as a system device inventory, a practice which has always been
of dubious precision at best.
.sh 2 "Deleting and recreating devices"
.lp
Deleting device nodes is no problem to implement, but as likely as not,
some people will want a method to get them back.
Since only the device driver knows how to create a given device,
recreation cannot be performed solely on the basis of the parameters
provided by a process in userland.
.lp
In order to not complicate the code which updates the directory
structure for a mountpoint to reflect changes in the DEVFS inode list,
a deleted entry is merely marked with DE_WHITEOUT instead of being
removed entirely.
Otherwise a separate list would be needed for inodes which we had
deleted so that they would not be mistaken for new inodes.
.lp
The obvious way to recreate deleted devices is to let mknod(2) do it
by matching the name and disregarding the major/minor arguments.
Recreating the device with mknod(2) will simply remove the DE_WHITEOUT
flag.
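.lp
A minimal sketch of this whiteout behaviour, assuming a flat per-mount
directory, might look as follows.
Only the DE_WHITEOUT flag name comes from the text; the structure and
the sk_lookup(), sk_unlink() and sk_mknod() functions are hypothetical
and exist purely for illustration.
.(b M
```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define DE_WHITEOUT 0x1                 /* flag name taken from the text */

/* Hypothetical directory entry for one DEVFS mount. */
struct dirent_sk {
    const char *name;
    int flags;
};

/* Entries created by device drivers (example names). */
static struct dirent_sk dir[] = {
    { "fd0", 0 },
    { "lpt0", 0 },
};
#define NDIR (sizeof(dir) / sizeof(dir[0]))

static struct dirent_sk *find_entry(const char *name)
{
    assert(name != NULL);
    for (size_t i = 0; i < NDIR; i++)
        if (strcmp(dir[i].name, name) == 0)
            return &dir[i];
    return NULL;
}

/* Lookups skip whited-out entries, so they appear deleted. */
int sk_lookup(const char *name)
{
    struct dirent_sk *de = find_entry(name);

    return (de != NULL && !(de->flags & DE_WHITEOUT)) ? 0 : -1;
}

/* Deleting only marks the entry; it stays in the list. */
int sk_unlink(const char *name)
{
    struct dirent_sk *de = find_entry(name);

    if (de == NULL || (de->flags & DE_WHITEOUT))
        return -1;
    de->flags |= DE_WHITEOUT;
    return 0;
}

/* Recreation matches on the name; major and minor are ignored. */
int sk_mknod(const char *name, int major, int minor)
{
    struct dirent_sk *de = find_entry(name);

    (void)major;
    (void)minor;
    if (de == NULL || !(de->flags & DE_WHITEOUT))
        return -1;      /* unknown to any driver, or not deleted */
    de->flags &= ~DE_WHITEOUT;
    return 0;
}
```
.)b
.lp
Note that in this sketch sk_mknod() cannot invent a device: a name that
no driver has created is refused whatever numbers are supplied.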
.sh 2 "Jail(2), chroot(2) and DEVFS"
.lp
The primary requirement from facilities like jail(2) and chroot(2)
is that it must be possible to control the contents of a DEVFS mount
point.
.lp
Obviously, it would not be desirable for dynamic devices to pop
into existence in the carefully pruned /dev of jails so it must be
possible to mark a DEVFS mountpoint as ``no new devices''.
And in the same way, the jailed root should not be able to recreate
device nodes which the real root has removed.
.lp
These behaviours will be controlled with mount options, but these have not
yet been implemented because FreeBSD has run out of bitmap flags for
mount options, and a new unlimited mount option implementation is
still not in place at the time of writing.
.lp
One mount option, ``jaildevfs'', will restrict the contents of the
DEVFS mountpoint to the ``normal set'' of devices for a jail and
automatically hide all future devices and make it impossible
for a jailed root to un-hide hidden entries while letting an un-jailed
root do so.
.lp
Mounting or remounting read-only will prevent all future
devices from appearing and will make it impossible to
hide or un-hide entries in the mountpoint.
This is probably only useful for chroots or jails where no tty
access is intended since cloning will not work either.
.lp
More mount options may be needed as more experience is gained.
.sh 2 "Default mode, owner & group"
.lp
When a device driver creates a device node, and a DEVFS mount adds it
to its directory tree, it needs to have some values for the access
control fields: mode, owner and group.
.lp
Currently, the device driver specifies the initial values in the
make_dev() call, but this is far from optimal.
For one thing, embedding magic UIDs and GIDs in the kernel is simply
bad style unless they are numerically zero.
More seriously, they represent compile-time defaults, which in these
enlightened days are rather old-fashioned.
.lp
.sh 1 "Cleaning up before we build: struct specinfo and dev_t"
.lp
Most of the rest of the paper will be about the various challenges
and issues in the implementation of DEVFS in FreeBSD.
All of this should be applicable to other systems derived from
4.4BSD-Lite as well.
.lp
POSIX has defined a type called ``dev_t'' which is the identity of a device.
This is mainly for use in the few system calls which know about devices:
stat(2), fstat(2) and mknod(2).
A dev_t is constructed by logically OR'ing
the major# and minor# for the device.
Since those have been defined
as having no overlapping bits, the major# and minor#
can be retrieved from the dev_t by a simple masking operation.
.lp
Although the kernel had a well-defined concept of any particular
device it did not have a data structure to represent ``a device''.
The device driver has such a structure, traditionally called ``softc''
but the high kernel does not (and should not!) have access to the
device driver's private data structures.
.lp
It is an interesting tale how things got to be this way,\**
.(f
\** Basically, devices should have been moved up with sockets and
pipes at the file descriptor level when the VFS layering was introduced,
rather than have all the special casing throughout the vnode system.
.)f
but for now we simply record as
a fact how the data structures were related
in the 4.4BSD release (Fig. 1) [44BSDBook].
.(z
.PS 3
F: box "file" "handle"
arrow down from F.s
V: box "vnode"
arrow right from V.e
S: box "specinfo"
arrow down from V.s
I: box "inode"
arrow right from I.e
C: box invis "devsw[]" "[major#]"
arrow down from C.s
D: box "device" "driver"
line right from D.e
box invis "softc[]" "[minor#]"
F2: box "file" "handle" at F + (2.5,0)
arrow down from F2.s
V2: box "vnode"
arrow right from V2.e
S2: box "specinfo"
arrow down from V2.s
I2: box "inode"
arrow left from I2.w
.PE
.ce 1
Fig. 1 - Data structures in 4.4BSD
.)z
.lp
As for all other files, a vnode references a filesystem inode, but
in addition it points to a ``specinfo'' structure. In the inode
we find the dev_t which is used to reference the device driver.
.lp
Access to the device driver happens by extracting the major# from
the dev_t, indexing through the global devsw[] array to locate
the device driver's entry point.
.lp
The device driver will extract the minor# from the dev_t and use
that as the index into the softc array of private data per device.
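.lp
The traditional dispatch path can be sketched as below.
This is not the 4.4BSD source: the 8-bit major#/minor# split, the
devnum_t type, the MKDEV(), MAJOR() and MINOR() macros and the ``foo''
driver are assumptions chosen to keep the example short.
.(b M
```c
#include <assert.h>
#include <stddef.h>

typedef unsigned int devnum_t;          /* stand-in for the classic dev_t */

/* Assumed layout: major# in bits 8-15, minor# in bits 0-7. */
#define MKDEV(ma, mi)  ((devnum_t)(((ma) << 8) | (mi)))
#define MAJOR(d)       (((d) >> 8) & 0xffu)
#define MINOR(d)       ((d) & 0xffu)

#define NFOO 2                          /* compiled-in unit limit */

struct foo_softc { int open_count; };
static struct foo_softc foo_softc[NFOO];

/* Driver side: the minor# indexes the private softc array. */
static int foo_open(devnum_t dev)
{
    unsigned unit = MINOR(dev);

    if (unit >= NFOO)                   /* the easily forgotten check */
        return -1;
    foo_softc[unit].open_count++;
    return 0;
}

struct devsw_entry { int (*d_open)(devnum_t dev); };

/* Kernel side: one slot per major#; major 0 is unused here. */
static struct devsw_entry devsw[] = {
    { NULL },
    { foo_open },
};
#define NDEVSW (sizeof(devsw) / sizeof(devsw[0]))

int spec_open(devnum_t dev)
{
    unsigned maj = MAJOR(dev);

    /* Packing and unpacking must round-trip for the 16 bits in use. */
    assert(MKDEV(maj, MINOR(dev)) == (dev & 0xffffu));
    if (maj >= NDEVSW || devsw[maj].d_open == NULL)
        return -1;                      /* no such driver */
    return devsw[maj].d_open(dev);
}
```
.)b
.lp
Both levels of indexing depend on compile-time table sizes, NDEVSW and
NFOO in this sketch, and each needs its own bounds check.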
.lp
The ``specinfo'' structure is a little sidekick vnodes grew along the way,
and is used to find all vnodes which reference the same device (i.e.
they have the same major# and minor#).
This linkage is used to determine
which vnode is the ``chosen one'' for this device, and to keep track of
open(2)/close(2) against this device.
The actual implementation was an inefficient hash table
which, depending on the vnode reclamation rate and /dev directory lookup
traffic, could become a measurable performance liability.
.sh 2 "The new vnode/inode/dev_t layout"
.lp
In the new layout (Fig. 2) the specinfo structure takes a central
role. There is only one instance of struct specinfo per
device (i.e. unique major#
and minor# combination) and all vnodes referencing this device point
to this structure directly.
.(z
.PS 2.25
F: box "file" "handle"
arrow down from F.s
V: box "vnode"
arrow right from V.e
S: box "specinfo"
arrow down from V.s
I: box "inode"
F2: box "file" "handle" at F + (2.5,0)
arrow down from F2.s
V2: box "vnode"
arrow left from V2.w
arrow down from V2.s
I2: box "inode"
arrow down from S.s
D: box "device" "driver"
.PE
.ce 1
Fig. 2 - The new FreeBSD data structures.
.)z
.lp
In userland, a dev_t is still the logical OR of the major# and
minor#, but this entity is now called a udev_t in the kernel.
In the kernel a dev_t is now a pointer to a struct specinfo.
.lp
All vnodes referencing a device are linked to a list hanging
directly off the specinfo structure, removing the need for the
hash table and consequently simplifying and speeding up a lot
of code dealing with vnode instantiation, retirement and
name-caching.
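.lp
As an illustration of this linkage, the following userland sketch
hangs the vnodes directly off the per-device structure.
The structure and field names are invented for the example and the
real kernel structures carry many more fields.
.(b M
```c
#include <assert.h>
#include <stddef.h>

struct sk_vnode;

/* One instance per device; field names are invented for this sketch. */
struct sk_specinfo {
    unsigned int     si_udev;    /* major#/minor# as seen from userland */
    struct sk_vnode *si_vnodes;  /* head of the list of vnodes */
};
typedef struct sk_specinfo *sk_dev_t;  /* in-kernel dev_t is a pointer */

struct sk_vnode {
    sk_dev_t         v_rdev;     /* device this vnode refers to */
    struct sk_vnode *v_specnext; /* next vnode for the same device */
};

/* Link a new vnode to its device: constant time, no hash lookup. */
void
sk_vnode_attach(struct sk_vnode *vp, sk_dev_t dev)
{
    assert(vp != NULL && dev != NULL);
    vp->v_rdev = dev;
    vp->v_specnext = dev->si_vnodes;
    dev->si_vnodes = vp;
}

/* Unlink a vnode from its device when the vnode is retired. */
void
sk_vnode_detach(struct sk_vnode *vp)
{
    struct sk_vnode **pp;

    for (pp = &vp->v_rdev->si_vnodes; *pp != NULL;
        pp = &(*pp)->v_specnext) {
        if (*pp == vp) {
            *pp = vp->v_specnext;
            break;
        }
    }
    vp->v_rdev = NULL;
    vp->v_specnext = NULL;
}

/* Count the vnodes which currently reference a device. */
int
sk_vcount(sk_dev_t dev)
{
    const struct sk_vnode *vp;
    int n = 0;

    for (vp = dev->si_vnodes; vp != NULL; vp = vp->v_specnext)
        n++;
    return (n);
}
```
.)b
.lp
Finding every vnode for a device is now a walk over a list which
only contains the vnodes of interest.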
.lp
The entry points to the device driver are stored in the specinfo
structure, removing the need for the devsw[] array and allowing
device drivers to use separate entrypoints for various minor numbers.
.lp
This is very convenient for devices which have a ``control''
device for management and tuning. The control device almost always
has entirely separate open/close/ioctl implementations [MD.C].
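.lp
A minimal sketch of per-device entry points could look as follows.
The names, the single ioctl entry point and the SK_CTL_TUNE command
are assumptions made for the example, not the actual FreeBSD
definitions.
.(b M
```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

struct sk_specinfo;
typedef struct sk_specinfo *sk_dev_t;

struct sk_cdevsw {
    int (*d_ioctl)(sk_dev_t dev, unsigned long cmd);
};

struct sk_specinfo {
    const struct sk_cdevsw *si_devsw;  /* entry points of this device */
};

#define SK_CTL_TUNE 1UL  /* hypothetical management command */

/* Data device: knows nothing about management commands. */
int
sk_foo_ioctl(sk_dev_t dev, unsigned long cmd)
{
    (void)dev;
    (void)cmd;
    return (ENOTTY);
}

/* Control device: a separate implementation for the same driver. */
int
sk_fooctl_ioctl(sk_dev_t dev, unsigned long cmd)
{
    (void)dev;
    return (cmd == SK_CTL_TUNE ? 0 : ENOTTY);
}

static const struct sk_cdevsw sk_foo_cdevsw = { sk_foo_ioctl };
static const struct sk_cdevsw sk_fooctl_cdevsw = { sk_fooctl_ioctl };

/* Generic layer: no table indexed by major#, just follow the pointer. */
int
sk_dev_ioctl(sk_dev_t dev, unsigned long cmd)
{
    assert(dev != NULL && dev->si_devsw != NULL);
    return (dev->si_devsw->d_ioctl(dev, cmd));
}
```
.)b
.lp
Neither ioctl routine needs to examine the minor# to find out which
kind of device it was called for.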
.lp
In addition to this, two data elements are included in the specinfo
structure but ``owned'' by the device driver. Typically the
device driver will store a pointer to the softc structure in
one of these, and unit number or mode information in the other.
.lp
This removes the need for drivers to find the softc using array
indexing based on the minor#, and at the same time has obviated
the need for the compiled-in ``NFOO'' constants which traditionally
determined how many softc structures and therefore devices
the driver could support.\**
.(f
\** Not to mention all the drivers which implemented panic(2)
because they forgot to perform bounds checking on the index before
using it on their softc arrays.
.)f
.lp
There are some trivial technical issues relating to allocating
the storage for specinfo early in the boot sequence and how to
find a specinfo from the udev_t/major#+minor#, but they will
not be discussed here.
.sh 2 "Creating and destroying devices"
.lp
Ideally, devices should only be created and
destroyed by the device drivers which know what devices are present.
This is accomplished with the make_dev() and destroy_dev()
function calls.
.lp
Life is seldom quite that simple. The operating system might be called
on to act as an NFS server for a diskless workstation, possibly even
of a different architecture, so we still need to be able to represent
device nodes with no device driver backing in the filesystems and
consequently we need to be able to create a specinfo from
the major#+minor# in these inodes when we encounter them.
In practice this is quite trivial, but in a few places in the code
one needs to be aware of the existence
of both ``named'' and ``anonymous'' specinfo structures.
.lp
The make_dev() call creates a specinfo structure and populates
it with driver entry points, major#, minor#, device node name
(for instance ``lpt0''), UID, GID and access mode bits. The return
value is a dev_t (i.e., a pointer to struct specinfo).
If the device driver determines that the device is no longer
present, it calls destroy_dev(), giving a dev_t as argument
and the dev_t will be cleaned and converted to an anonymous dev_t.
.lp
Once created with make_dev() a named dev_t exists until destroy_dev()
is called by the driver. The driver can rely on this and keep state
in the fields in dev_t which are reserved for driver use.
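.lp
The life cycle of named and anonymous dev_t's can be sketched in
userland as shown here.
The fixed size table, the simplified argument lists and all sk_
prefixed names are assumptions for the example; the real make_dev()
takes the arguments listed above.
.(b M
```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>
#include <string.h>

#define SK_NDEV    16
#define SK_NAMELEN 16

struct sk_specinfo {
    int          si_inuse;
    unsigned int si_udev;              /* major#/minor# pair */
    char         si_name[SK_NAMELEN];  /* empty string: anonymous */
    void        *si_drv1;              /* owned by the driver */
};
typedef struct sk_specinfo *sk_dev_t;

static struct sk_specinfo sk_devs[SK_NDEV];

/* Find the specinfo for a device number, creating an anonymous one
 * if it has not been seen before (e.g. a device node on an NFS export). */
sk_dev_t
sk_udev2dev(unsigned int udev)
{
    sk_dev_t spare = NULL;

    for (int i = 0; i < SK_NDEV; i++) {
        if (sk_devs[i].si_inuse && sk_devs[i].si_udev == udev)
            return (&sk_devs[i]);
        if (!sk_devs[i].si_inuse && spare == NULL)
            spare = &sk_devs[i];
    }
    if (spare == NULL)
        return (NULL);
    memset(spare, 0, sizeof(*spare));
    spare->si_inuse = 1;
    spare->si_udev = udev;
    return (spare);
}

/* Driver side: turn the specinfo into a named one. */
sk_dev_t
sk_make_dev(unsigned int udev, const char *name)
{
    sk_dev_t dev = sk_udev2dev(udev);

    if (dev != NULL)
        snprintf(dev->si_name, sizeof(dev->si_name), "%s", name);
    return (dev);
}

/* Driver side: clean the specinfo; it lives on as an anonymous one. */
void
sk_destroy_dev(sk_dev_t dev)
{
    dev->si_name[0] = 0;
    dev->si_drv1 = NULL;
}

int
sk_is_named(sk_dev_t dev)
{
    return (dev->si_name[0] != 0);
}
```
.)b
.lp
In this sketch a pointer handed out for a given major#+minor# stays
valid across destroy_dev(), which is what lets inodes without a
driver behind them keep referring to it.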
.sh 1 "DEVFS"
.lp
By now we have all the relevant information about each device node
collected in struct specinfo but we still have one problem to
solve before we can add the DEVFS filesystem on top of it.
.sh 2 "The interrupt problem"
.lp
Some device drivers, notably the CAM/SCSI subsystem in FreeBSD,
will discover changes in the device configuration inside an interrupt
routine.
.lp
This imposes some limitations on what can and should be done:
first one should minimise the amount
of work done in an interrupt routine for performance reasons;
second, to avoid deadlocks, vnodes and mountpoints should not be
accessed from an interrupt routine.
.lp
Also, in addition to the locking issue,
a machine can have many instances of DEVFS mounted:
for a jail(8) based virtual-machine web-server several hundred instances
are not unheard of, making it far too expensive to update all of them
in an interrupt routine.
.lp
The solution to this problem is to do all the filesystem work on
the filesystem side of DEVFS and use atomically manipulated integer indices
(``inode numbers'') as the barrier between the two sides.
.lp
The functions called from the device drivers, make_dev(), destroy_dev()
&c. only manipulate the DEVFS inode number of the dev_t in
question and do not even get near any mountpoints or vnodes.
.lp
For make_dev() the task is to assign a unique inode number to the
dev_t and store the dev_t in the DEVFS-global inode-to-dev_t array.
.(b M
.vs -3
\fC\s-2make_dev(...)
    store argument values in dev_t
    assign unique inode number to dev_t
    atomically insert dev_t into inode_array\fP\s+2
.vs +3
.)b
.lp
For destroy_dev() the task is the opposite: clear the inode number
in the dev_t and NULL the pointer in the devfs-global inode-to-dev_t
array.
.(b M
.vs -3
\fC\s-2destroy_dev(...)
    clear fields in dev_t
    zero dev_t inode number
    atomically clear entry in inode_array\fP\s+2
.vs +3
.)b
.lp
Both functions conclude by atomically incrementing a global variable
\fCdevfs_generation\fP to leave an indication to the filesystem
side that something has changed.
.lp
By modifying the global state only with atomic instructions, locks
have been entirely avoided in this part of the code which means that
the make_dev() and destroy_dev() functions can be called from practically
anywhere in the kernel at any time.
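.lp
The driver side can be approximated in portable C11 with
<stdatomic.h>.
This is a sketch under stated assumptions: the table size, the sk_
prefixed names and the use of compare-and-swap to claim a free slot
are choices made for the example, and the kernel uses its own atomic
primitives rather than the C11 ones.
.(b M
```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

#define SK_NINODE 8   /* assumed table size; inode number 0 means none */

struct sk_specinfo {
    int         si_inode;
    const char *si_name;
};
typedef struct sk_specinfo *sk_dev_t;

static _Atomic(sk_dev_t) sk_inode_array[SK_NINODE];
static atomic_uint sk_generation;

/* Claim a free inode number with compare-and-swap; no lock is taken. */
int
sk_make_dev(sk_dev_t dev, const char *name)
{
    dev->si_name = name;
    for (int i = 1; i < SK_NINODE; i++) {
        sk_dev_t expected = NULL;

        dev->si_inode = i;
        if (atomic_compare_exchange_strong(&sk_inode_array[i],
            &expected, dev)) {
            /* Tell the filesystem side that something changed. */
            atomic_fetch_add(&sk_generation, 1u);
            return (0);
        }
    }
    dev->si_inode = 0;
    return (-1);   /* table full */
}

/* Clear the dev_t and its slot, then bump the generation. */
void
sk_destroy_dev(sk_dev_t dev)
{
    int ino = dev->si_inode;

    if (ino == 0)
        return;
    dev->si_name = NULL;
    dev->si_inode = 0;
    atomic_store(&sk_inode_array[ino], (sk_dev_t)NULL);
    atomic_fetch_add(&sk_generation, 1u);
}
```
.)b
.lp
Neither function touches a vnode or a mountpoint; the only shared
state is the array and the generation counter.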
.lp
On the filesystem side of DEVFS, the only two vnode methods which examine
or rely on the directory structure, VOP_LOOKUP and VOP_READDIR,
call the function devfs_populate() to update their mountpoint's view
of the device hierarchy to match current reality before doing any work.
.(b M
.vs -3
\fC\s-2devfs_readdir(...)
    devfs_populate(...)
    ...\fP\s+2
.vs +3
.)b
.lp
The devfs_populate() function compares the current \fCdevfs_generation\fP
to the value saved in the mountpoint last time devfs_populate() completed
and if (actually: while) they differ a linear run is made through the
devfs-global inode-array and the directory tree of the mountpoint is
brought up to date.
.lp
The actual code is slightly more complicated than shown in the pseudo-code
here because it has to deal with subdirectories and hidden entries.
.(b M
.vs -3
\fC\s-2devfs_populate(...)
    while (mount->generation != devfs_generation)
        for i in all inodes
            if inode created
                create directory entry
            else if inode destroyed
                remove directory entry\fP\s+2
.vs +3
.)b
.lp
Access to the global DEVFS inode table is again implemented
with atomic instructions and failsafe retries to avoid the
need for locking.
.lp
From a performance point of view this scheme also means that a particular
DEVFS mountpoint is not updated until it needs to be, and then always by
a process belonging to the jail in question thus minimising and
distributing the CPU load.
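.lp
The filesystem side can be sketched in the same spirit.
Again this is a userland approximation using C11 atomics: the flat
per-mount array stands in for the directory tree, and every sk_
prefixed name is an assumption made for the example.
.(b M
```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

#define SK_NINODE 8

struct sk_specinfo {
    const char *si_name;
};
typedef struct sk_specinfo *sk_dev_t;

/* Global state, written by the driver side only. */
static _Atomic(sk_dev_t) sk_inode_array[SK_NINODE];
static atomic_uint sk_generation;

/* Per-mountpoint state. */
struct sk_mount {
    unsigned int m_generation;         /* generation last caught up to */
    sk_dev_t     m_dirent[SK_NINODE];  /* this mount's view of /dev */
};

/* Bring one mountpoint up to date; retry if the world moved meanwhile. */
void
sk_populate(struct sk_mount *mp)
{
    unsigned int gen;

    while (mp->m_generation != (gen = atomic_load(&sk_generation))) {
        for (int i = 0; i < SK_NINODE; i++) {
            sk_dev_t dev = atomic_load(&sk_inode_array[i]);

            /* Create, remove or replace the directory entry. */
            if (dev != mp->m_dirent[i])
                mp->m_dirent[i] = dev;
        }
        mp->m_generation = gen;
    }
}

/* Like VOP_READDIR: catch up first, then do the actual work. */
int
sk_readdir(struct sk_mount *mp, const char **names, int max)
{
    int n = 0;

    sk_populate(mp);
    for (int i = 0; i < SK_NINODE && n < max; i++)
        if (mp->m_dirent[i] != NULL)
            names[n++] = mp->m_dirent[i]->si_name;
    return (n);
}
```
.)b
.lp
A mountpoint which nobody looks at never runs the loop, so its
saved generation simply falls behind until the next lookup or
readdir.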
.sh 1 "Device-driver impact"
.lp
All these changes have had a significant impact on how device drivers
interact with the rest of the kernel regarding registration of
devices.
.lp
If we look first at the ``before'' image in Fig. 3, we notice first
the NFOO define which imposes a firm upper limit on the number of
devices the kernel can deal with.
Also notice that the softc structure for all of them is allocated
at compile time.
This is because most device drivers (and texts on writing device
drivers) are from before the general
kernel malloc facility [Mckusick1988] was introduced into the BSD kernel.
.lp
.(b M
.vs -3
\fC\s-2
#ifndef NFOO
#  define NFOO 4
#endif
struct foo_softc {
    ...
} foo_softc[NFOO];
int nfoo = 0;
foo_open(dev, ...)
{
    int unit = minor(dev);
    struct foo_softc *sc;
    if (unit >= NFOO || unit >= nfoo)
        return (ENXIO);
    sc = &foo_softc[unit];
    ...
}
foo_attach(...)
{
    struct foo_softc *sc;
    static int once;
    ...
    if (nfoo >= NFOO) {
        /* Have hardware, can't handle */
        return (-1);
    }
    sc = &foo_softc[nfoo++];
    if (!once) {
        cdevsw_add(&cdevsw);
        once++;
    }
    ...
}
\fP\s+2
Fig. 3 - Device-driver, old style.
.vs +3
.)b
.lp
Also notice how range checking is needed to make sure that the
minor# is inside range. This code gets more complex if device-numbering
is sparse. Code equivalent to that shown in the foo_open() routine
would also be needed in foo_read(), foo_write(), foo_ioctl() &c.
.lp
Finally notice how the attach routine needs to remember to register
the cdevsw structure (not shown) when the first device is found.
.lp
Now, compare this to our ``after'' image in Fig. 4.
NFOO is totally gone and so is the compile time allocation
of space for softc structures.
.lp
The foo_open (and foo_close, foo_ioctl &c) functions can now
derive the softc pointer directly from the dev_t they receive
as an argument.
.lp
.(b M
.vs -3
\fC\s-2
struct foo_softc {
    ...
};
int nfoo;
foo_open(dev, ...)
{
    struct foo_softc *sc = dev->si_drv1;
    ...
}
foo_attach(...)
{
    struct foo_softc *sc;
    ...
    sc = MALLOC(..., M_ZERO);
    if (sc == NULL) {
        /* Have hardware, can't handle */
        return (-1);
    }
    sc->dev = make_dev(&cdevsw, nfoo,
        UID_ROOT, GID_WHEEL, 0644,
        "foo%d", nfoo);
    nfoo++;
    sc->dev->si_drv1 = sc;
    ...
}
\fP\s+2
Fig. 4 - Device-driver, new style.
.vs +3
.)b
.lp
In foo_attach() we can now attach to all the devices we can
allocate memory for and we register the cdevsw structure per
dev_t rather than globally.
.lp
This last trick is what allows us to discard all bounds checking
in the foo_open() &c. routines, because they can only be
called through the cdevsw, and the cdevsw is only attached to
dev_t's which foo_attach() has created.
There is no way to end
up in foo_open() with a dev_t not created by foo_attach().
.lp
In the two examples here, the difference is only 10 lines of source
code, primarily because only one of the worker functions of the
device driver is shown.
In real device drivers it is not uncommon to save 50 or more lines
of source code which typically is about a percent or two of the
entire driver.
.sh 1 "Future work"
.lp
Apart from some minor issues to be cleaned up, DEVFS is now a reality
and future work therefore is likely to concentrate on applying the
facilities and functionality of DEVFS to FreeBSD.
.sh 2 "devd"
.lp
It would be logical to complement DEVFS with a ``device-daemon'' which
could configure and de-configure devices as they come and go.
When a disk appears, mount it.
When a network interface appears, configure it.
And in some configurable way allow the user to customise the action,
so that for instance images will automatically be copied off the
flash-based media from a camera, &c.
.lp
In this context it is good to question how we view dynamic devices.
If for instance a printer is removed in the middle of a print job
and another printer arrives a moment later, should the system
automatically continue the print job on this new printer?
When a disk-like device arrives, should we always mount it? Should
we have a database of known disk-like devices to tell us where to
mount it, what permissions to give the mountpoint?
Some computers come in multiple configurations, for instance laptops
with and without their docking station. How do we want to present
this to the users and what behaviour do the users expect?
.sh 2 "Pathname length limitations"
.lp
In order to simplify memory management in the early stages of boot,
the pathname relative to the mountpoint is presently stored in a
small fixed size buffer inside struct specinfo.
It should be possible to use filenames as long as the system otherwise
permits, so some kind of extension mechanism is called for.
.lp
Since it cannot be guaranteed that memory can be allocated in
all the possible scenarios where make_dev() can be called, it may
be necessary to mandate that the caller allocates the buffer if
the content will not fit inside the default buffer size.
.sh 2 "Initial access parameter selection"
.lp
As it is now, device drivers propose the initial mode, owner and group
for the device nodes, but it would be more flexible if it were possible
to give the kernel a set of rules, much like packet filtering rules,
which allow the user to set the wanted policy for new devices.
Such a mechanism could also be used to filter new devices for mount
points in jails and to determine other behaviour.
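.lp
To make the idea concrete, here is one way such a rule set could be
evaluated.
No such facility is described in this paper, so everything below is
hypothetical: the rule structure, the first-match policy and the use
of fnmatch(3) patterns on the device node name are assumptions made
purely for illustration.
.(b M
```c
#include <assert.h>
#include <fnmatch.h>
#include <stddef.h>

struct sk_perm {
    int uid;
    int gid;
    int mode;
};

/* Hypothetical rule: a name pattern plus the action to take. */
struct sk_rule {
    const char    *r_pattern;  /* shell-style pattern on the node name */
    int            r_hide;     /* non-zero: do not show the node at all */
    struct sk_perm r_perm;     /* otherwise: ownership and mode to apply */
};

/*
 * Apply the first rule which matches the node name.
 * On entry *perm holds the values proposed by the driver.
 * Returns 0 if the node should be visible, -1 if it is filtered out.
 */
int
sk_apply_rules(const struct sk_rule *rules, size_t nrules,
    const char *name, struct sk_perm *perm)
{
    for (size_t i = 0; i < nrules; i++) {
        if (fnmatch(rules[i].r_pattern, name, 0) != 0)
            continue;
        if (rules[i].r_hide)
            return (-1);
        *perm = rules[i].r_perm;
        return (0);
    }
    return (0);   /* no rule matched: keep the driver's proposal */
}
```
.)b
.lp
A jail could then be given a rule set which hides everything but a
handful of harmless devices, while the host keeps the proposals made
by the drivers.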
.lp
Doing these things from userland results in some awkward race conditions
and software bloat for embedded systems, so a kernel approach may be more
suitable.
.sh 2 "Applications of on-demand device creation"
.lp
The facility for on-demand creation of devices has some very interesting
possibilities.
.lp
One planned use is to enable user-controlled labelling
of disks.
Today disks have names like /dev/da0, /dev/ad4, but since
this numbering is topological any change in the hardware configuration
may rename the disks, causing /etc/fstab and backup procedures
to get out of sync with the hardware.
.lp
The current idea is to store on the media of the disk a user-chosen
disk name and allow access through this name, so that for instance
/dev/mydisk0
would be a symlink to whatever topological name the disk might have
at any given time.
.lp
To simplify this and to avoid a forest of symlinks, it will probably
be decided to move all the sub-divisions of a disk into one subdirectory
per disk so just a single symlink can do the job.
In practice that means that the current /dev/ad0s2f will become
something like /dev/ad0/s2f and so on.
Obviously, in the same way, disks could also be accessed by their
topological address, down to the specific path in a SAN environment.
.lp
Another potential use could be for automated offline data media libraries.
It would be quite trivial to make it possible to access all the media
in the library using /dev/lib/$LABEL which would be a remarkable
simplification compared with most current automated retrieval facilities.
.lp
Another use could be to access devices by parameter rather than by
name. One could imagine sending a printjob to /dev/printer/color/A2
and behind the scenes a search would be made for a device with the
correct properties and paper-handling facilities.
.sh 1 "Conclusion"
.lp
DEVFS has been successfully implemented in FreeBSD,
including a powerful, simple and flexible solution supporting
pseudo-devices and on-demand device node creation.
.lp
Contrary to the trend, the implementation added functionality
with a net decrease in source lines,
primarily because of the improved API seen from the device drivers' point of view.
.lp
Even if DEVFS is not desired, other 4.4BSD derived UNIX variants
would probably benefit from adopting the dev_t/specinfo related
cleanup.
.sh 1 "Acknowledgements"
.lp
I first got started on DEVFS in 1989 because the abysmal performance
of the Olivetti M250 computer forced me to implement a network-disk-device
for Minix in order to retain my sanity.
That initial work led to a
crude but working DEVFS for Minix, so obviously both Andrew Tanenbaum
and Olivetti deserve credit for inspiration.
.lp
Julian Elischer implemented a DEVFS for FreeBSD around 1994 which never
quite made it to maturity and subsequently was abandoned.
.lp
Bruce Evans deserves special credit not only for his keen eye for detail,
and his competent criticism but also for his enthusiastic resistance to the
very concept of DEVFS.
.lp
Many thanks to the people who took time to help me stamp out ``Danglish''
through their reviews and comments: Chris Demetriou, Paul Richards,
Brian Somers, Nik Clayton, and Hanne Munkholm.
Any remaining insults to the proper use of the English language are my own fault.
.\" (list & why)
.sh 1 "References"
.lp
[44BSDBook]
McKusick, Bostic, Karels & Quarterman:
``The Design and Implementation of the 4.4BSD Operating System.''
Addison Wesley, 1996, ISBN 0-201-54979-4.
.lp
[Heidemann91a]
John S. Heidemann:
``Stackable layers: an architecture for filesystem development.''
Master's thesis, University of California, Los Angeles, July 1991.
Available as UCLA technical report CSD-910056.
.lp
[Kamp2000]
Poul-Henning Kamp and Robert N. M. Watson:
``Confining the Omnipotent root.''
Proceedings of the SANE 2000 Conference.
Available in FreeBSD distributions in \fC/usr/share/papers\fP.
.lp
[MD.C]
Poul-Henning Kamp et al:
FreeBSD memory disk driver:
\fCsrc/sys/dev/md/md.c\fP
.lp
[Mckusick1988]
Marshall Kirk McKusick, Mike J. Karels:
``Design of a General Purpose Memory Allocator for the 4.3BSD UNIX-Kernel.''
Proceedings of the San Francisco USENIX Conference, pp. 295-303, June 1988.
.lp
[Mckusick1999]
Dr. Marshall Kirk McKusick:
Private email communication.
\fI``According to the SCCS logs, the chroot call was added by Bill Joy
on March 18, 1982 approximately 1.5 years before 4.2BSD was released.
That was well before we had ftp servers of any sort (ftp did not
show up in the source tree until January 1983). My best guess as
to its purpose was to allow Bill to chroot into the /4.2BSD build
directory and build a system using only the files, include files,
etc contained in that tree. That was the only use of chroot that
I remember from the early days.''\fP
.lp
[Mckusick2000]
Dr. Marshall Kirk McKusick:
Private communication at BSDcon2000 conference.
\fI``I have not used block devices since I wrote FFS and that
was \fPmany\fI years ago.''\fP
.lp
[NewBus]
NewBus is a subsystem which provides most of the glue between
hardware and device drivers. Despite the importance of this
there has never been published any good overview documentation
for it.
The following article by Alexander Langer in ``Dæmonnews'' is
the best reference I can come up with:
\fC\s-2http://www.daemonnews.org/200007/newbus-intro.html\fP\s+2
.lp
[Pike2000]
Rob Pike:
``Systems Software Research is Irrelevant.''
\fC\s-2http://www.cs.bell\-labs.com/who/rob/utah2000.pdf\fP\s+2
.lp
[Pike90a]
Rob Pike, Dave Presotto, Ken Thompson and Howard Trickey:
``Plan 9 from Bell Labs.''
Proceedings of the Summer 1990 UKUUG Conference.
.lp
[Pike92a]
Rob Pike, Dave Presotto, Ken Thompson, Howard Trickey and Phil Winterbottom:
``The Use of Name Spaces in Plan 9.''
Proceedings of the 5th ACM SIGOPS Workshop.
.lp
[Raspe1785]
Rudolf Erich Raspe:
``Baron Münchhausen's Narrative of his marvellous Travels and Campaigns in Russia.''
Kearsley, 1785.
.lp
[Ritchie74]
D.M. Ritchie and K. Thompson:
``The UNIX Time-Sharing System''
Communications of the ACM, Vol. 17, No. 7, July 1974.
.lp
[Ritchie98]
Dennis Ritchie: private conversation at USENIX Annual Technical Conference
New Orleans, 1998.
.lp
[Thompson78]
Ken Thompson:
``UNIX Implementation''
The Bell System Technical Journal, vol 57, 1978, number 6 (part 2) p. 1931ff.