Extend the ino_t, dev_t, nlink_t types to 64-bit ints. Modify
struct dirent layout to add d_off, increase the size of d_fileno
to 64-bits, increase the size of d_namlen to 16-bits, and change
the required alignment. Increase struct statfs f_mntfromname[] and
f_mntonname[] array length MNAMELEN to 1024.
ABI breakage is mitigated by providing compatibility using versioned
symbols, ingenious use of the existing padding in structures, and
by employing other tricks. Unfortunately, not everything can be
fixed, especially outside the base system. For instance, third-party
APIs which pass struct stat around are broken in backward and
forward incompatible ways.
Kinfo sysctl MIBs ABI is changed in backward-compatible way, but
there is no general mechanism to handle other sysctl MIBS which
return structures where the layout has changed. It was considered
that the breakage is either in the management interfaces, where we
usually allow ABI slip, or is not important.
Struct xvnode changed layout, no compat shims are provided.
For struct xtty, dev_t tty device member was reduced to uint32_t.
It was decided that keeping ABI compat in this case is more useful
than reporting 64-bit dev_t, for the sake of pstat.
Update note: strictly follow the instructions in UPDATING. Build
and install the new kernel with COMPAT_FREEBSD11 option enabled,
then reboot, and only then install new world.
Credits: The 64-bit inode project, also known as ino64, started life
many years ago as a project by Gleb Kurtsou (gleb). Kirk McKusick
(mckusick) then picked up and updated the patch, and acted as a
flag-waver. Feedback, suggestions, and discussions were carried
by Ed Maste (emaste), John Baldwin (jhb), Jilles Tjoelker (jilles),
and Rick Macklem (rmacklem). Kris Moore (kris) performed an initial
ports investigation followed by an exp-run by Antoine Brodin (antoine).
Essential and all-embracing testing was done by Peter Holm (pho).
The heavy lifting of coordinating all these efforts and bringing the
project to completion were done by Konstantin Belousov (kib).
Sponsored by: The FreeBSD Foundation (emaste, kib)
Differential revision: https://reviews.freebsd.org/D10439
kernel for FreeBSD 9.0:
Add a new capability mask argument to fget(9) and friends, allowing system
call code to declare what capabilities are required when an integer file
descriptor is converted into an in-kernel struct file *. With options
CAPABILITIES compiled into the kernel, this enforces capability
protection; without, this change is effectively a no-op.
Some cases require special handling, such as mmap(2), which must preserve
information about the maximum rights at the time of mapping in the memory
map so that they can later be enforced in mprotect(2) -- this is done by
narrowing the rights in the existing max_protection field used for similar
purposes with file permissions.
In namei(9), we assert that the code is not reached from within capability
mode, as we're not yet ready to enforce namespace capabilities there.
This will follow in a later commit.
Update two capability names: CAP_EVENT and CAP_KEVENT become
CAP_POST_KEVENT and CAP_POLL_KEVENT to more accurately indicate what they
represent.
Approved by: re (bz)
Submitted by: jonathan
Sponsored by: Google Inc
I noticed several drivers in our tree don't actually care about parity
and framing, such as pts(4), snp(4) (and my partially finished console
driver). Instead of duplicating a lot of code, I think we'd better add a
utility function for those drivers to quickly process a buffer of input.
Also change pts(4) and snp(4) to use this function.
As an experiment, I changed snp(4) to use a mutex instead of an sx lock.
We can't enable this right now, because Syscons still picks up Giant.
It's nice to already have the framework there.
The TTY buffers used the standard <sys/queue.h> lists. Unfortunately
they have a big shortcoming. If you want to have a double linked list,
but no tail pointer, it's still not possible to obtain the previous
element in the list. Inside the buffers we don't need them. This is why
I switched to custom linked list macros. The macros will also keep track
of the amount of items in the list. Because it doesn't use a sentinel,
we can just initialize the queues with zero.
In its simplest form (the output queue), we will only keep two
references to blocks in the queue, namely the head of the list and the
last block in use. All free blocks are stored behind the last block in
use.
I noticed there was a very subtle bug in the previous code: in a very
uncommon corner case, it would uma_zfree() a block in the queue before
calling memcpy() to extract the data from the block.
Because the TTY hooks interface was not finished when I imported the
MPSAFE TTY layer, I had to disconnect the snp(4) driver. This snp(4)
implementation has been sitting in my P4 branch for some time now.
Unfortunately it still doesn't use the same error handling as snp(4)
(returning codes through FIONREAD), but it should already be usable.
I'm committing this to SVN, hoping someone else could polish off its
rough edges. It's always better than having a broken driver sitting in
the tree.
kib@ and I have decided we will MFC the bpf(4)/snp(4) fixes after we've
released 7.1. Make sure the code in HEAD doesn't refer to a flag we
don't need anyway.
snp(4) in the MPSAFE TTY P4 branch already works, but still needs some
polishing before it can be integrated to SVN.
Now we have a single /dev/snp device node, which can be opened by
watch(8) multiple times. Even though snp(4) will be dead as of next
week, it's nice having this in SVN, because:
- We may want to MFC it to RELENG_7.
- By the time we fix snp(4) again, it's already there, existing watch(8)
binaries should already work.
Just like bpf(4), I'm adding a symlink from snp0 to snp to remain binary
compatible.
Except for the case where we use the cloner library (clone_create() and
friends), there is no reason to enforce a unique device minor number
policy. There are various drivers in the source tree that allocate unr
pools and such to provide minor numbers, without using them themselves.
Because we still need to support unique device minor numbers for the
cloner library, introduce a new flag called D_NEEDMINOR. All cdevsw's
that are used in combination with the cloner library should be marked
with this flag to make the cloning work.
This means drivers can now freely use si_drv0 to store their own flags
and state, making it effectively the same as si_drv1 and si_drv2. We
still keep the minor() and dev2unit() routines around to make drivers
happy.
The NTFS code also used the minor number in its hash table. We should
not do this anymore. If the si_drv0 field would be changed, it would no
longer end up in the same list.
Approved by: philip (mentor)
fget() call, that is sleeping point, and possibly dropping Giant.
The snp_target == NULL implies the snp_tty == NULL. Remove the code
that is put under snp_target == NULL and snp_tty != NULL clause.
In snpclose(), do the snp_detach() before scheduling the snp device
destruction. Otherwise, after the return from snpclose(), the snp
device is already removed from the snp_list, but tty is still in
snooped state. Any attempt to do i/o on such tty cause panic because
ttytosnp() returns NULL.
Tested by: Peter Holm
MFC after: 1 week
attached. Otherwise, the snp->snp_tty would be overwritten, while the
tty line discipline still set to the snpdisc. Then snplwrite() causes
panic because ttytosnp() cannot find the snp.
MFC after: 1 week
event handler, dev_clone, which accepts a credential argument.
Implementors of the event can ignore it if they're not interested,
and most do. This avoids having multiple event handler types and
fall-back/precedence logic in devfs.
This changes the kernel API for /dev cloning, and may affect third
party packages containg cloning kernel modules.
Requested by: phk
MFC after: 3 days
future:
rename ttyopen() -> tty_open() and ttyclose() -> tty_close().
We need the ttyopen() and ttyclose() for the new generic cdevsw
functions for tty devices in order to have consistent naming.
for unknown events.
A number of modules return EINVAL in this instance, and I have left
those alone for now and instead taught MOD_QUIESCE to accept this
as "didn't do anything".
The big lines are:
NODEV -> NULL
NOUDEV -> NODEV
udev_t -> dev_t
udev2dev() -> findcdev()
Various minor adjustments including handling of userland access to kernel
space struct cdev etc.
Remove the unused second argument from udev2dev().
Convert all remaining users of makedev() to use udev2dev(). The
semantic difference is that udev2dev() will only locate a pre-existing
dev_t, it will not line makedev() create a new one.
Apart from the tiny well controlled windown in D_PSEUDO drivers,
there should no longer be any "anonymous" dev_t's in the system
now, only dev_t's created with make_dev() and make_dev_alias()
Introduce d_version field in struct cdevsw, this must always be
initialized to D_VERSION.
Flip sense of D_NOGIANT flag to D_NEEDGIANT, this involves removing
four D_NOGIANT flags and adding 145 D_NEEDGIANT flags.
This commit adds a couple of functions for pseudodrivers to use for
implementing cloning in a manner we will be able to lock down (shortly).
Basically what happens is that pseudo drivers get a way to ask for
"give me the dev_t with this unit number" or alternatively "give
me a dev_t with the lowest guaranteed free unit number" (there is
unfortunately a lot of non-POLA in the exact numeric value of this
number, just live with it for now)
Managing the unit number space this way removes the need to use
rman(9) to do so in the drivers this greatly simplifies the code in
the drivers because even using rman(9) they still needed to manage
their dev_t's anyway.
I have taken the if_tun, if_tap, snp and nmdm drivers through the
mill, partly because they (ab)used makedev(), but mostly because
together they represent three different problems for device-cloning:
if_tun and snp is the plain case: just give me a device.
if_tap has two kinds of devices, with a flag for device type.
nmdm has paired devices (ala pty) can you can clone either of them.
thread being waken up. The thread waken up can run at a priority as
high as after tsleep().
- Replace selwakeup()s with selwakeuppri()s and pass appropriate
priorities.
- Add cv_broadcastpri() which raises the priority of the broadcast
threads. Used by selwakeuppri() if collision occurs.
Not objected in: -arch, -current
branches:
Initialize struct cdevsw using C99 sparse initializtion and remove
all initializations to default values.
This patch is automatically generated and has been tested by compiling
LINT with all the fields in struct cdevsw in reverse order on alpha,
sparc64 and i386.
Approved by: re(scottl)
submitter, this permits Russian (and probably other locales') characters
to be entered via watch(8).
PR: 35636
Submitted by: Gleb Smirnoff <glebius@rinet.ru>
Problem:
selwakeup required calling pfind which would cause lock order
reversals with the allproc_lock and the per-process filedesc lock.
Solution:
Instead of recording the pid of the select()'ing process into the
selinfo structure, actually record a pointer to the thread. To
avoid dereferencing a bad address all the selinfo structures that
are in use by a thread are kept in a list hung off the thread
(protected by sellock). When a selwakeup occurs the selinfo is
removed from that threads list, it is also removed on the way out
of select or poll where the thread will traverse its list removing
all the selinfos from its own list.
Problem:
Previously the PROC_LOCK was used to provide the mutual exclusion
needed to ensure proper locking, this couldn't work because there
was a single condvar used for select and poll and condvars can
only be used with a single mutex.
Solution:
Introduce a global mutex 'sellock' which is used to provide mutual
exclusion when recording events to wait on as well as performing
notification when an event occurs.
Interesting note:
schedlock is required to manipulate the per-thread TDF_SELECT
flag, however if given its own field it would not need schedlock,
also because TDF_SELECT is only manipulated under sellock one
doesn't actually use schedlock for syncronization, only to protect
against corruption.
Proc locks are no longer used in select/poll.
Portions contributed by: davidc