eventfd is a Linux system call that produces special file descriptors
for event notification. When porting Linux software, it is currently
usually emulated by epoll-shim on top of kqueues. Unfortunately, kqueues
are not passable between processes. And, as noted by the author of
epoll-shim, even if they were, the library state would also have to be
passed somehow. This came up when debugging strange HW video decode
failures in Firefox. A native implementation would avoid these problems
and help with porting Linux software.
Since we now already have an eventfd implementation in the kernel (for
the Linuxulator), it's pretty easy to expose it natively, which is what
this patch does.
Submitted by: greg@unrelenting.technology
Reviewed by: markj (previous version)
MFC after: 2 weeks
Differential Revision: https://reviews.freebsd.org/D26668
refcount_acquire_if_not_zero returns true on saturation.
The case of 0 is handled by looping again, after which the originally
found pointer will no longer be there.
Noted by: kib
p_fd nullification in fdescfree serializes against new threads transitioning
the count 1 -> 2, meaning that fdescfree_fds observing the count of 1 can
safely assume there is nobody else using the table. Losing the race and
observing > 1 is harmless.
Reviewed by: markj
Differential Revision: https://reviews.freebsd.org/D27522
To export information from fd tables we have several loops which do
this:
FILDESC_SLOCK(fdp);
for (i = 0; fdp->fd_refcount > 0 && i <= lastfile; i++)
<export info for fd i>;
FILDESC_SUNLOCK(fdp);
Before r367777, fdescfree() acquired the fd table exclusive lock between
decrementing fdp->fd_refcount and freeing table entries. This
serialized with the loop above, so the file at descriptor i would remain
valid until the lock is dropped. Now there is no serialization, so the
loops may race with teardown of file descriptor tables.
Acquire the exclusive fdtable lock after releasing the final table
reference to provide a barrier synchronizing with these loops.
Reported by: pho
Reviewed by: kib (previous version), mjg
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D27513
All paths leading into closefp() will either replace or remove the fd from
the filedesc table, and closefp() will call fo_close methods that can and do
currently sleep without regard for the possibility of an ERESTART. This can
be dangerous in multithreaded applications as another thread could have
opened another file in its place that is subsequently operated on upon
restart.
The following are seemingly the only ones that will pass back ERESTART
in-tree:
- sockets (SO_LINGER)
- fusefs
- nfsclient
Reviewed by: jilles, kib
MFC after: 1 week
Differential Revision: https://reviews.freebsd.org/D27310
oldfde may be invalidated if the table has grown due to the operation that
we're performing, either via fdalloc() or a direct fdgrowtable_exp().
This was technically OK before rS367927 because the old table remained valid
until the filedesc became unused, but now it may be freed immediately if
it's an unshared table in a single-threaded process, so it is no longer a
good assumption to make.
This fixes dup/dup2 invocations that grow the file table; in the initial
report, it manifested as a kernel panic in devel/gmake's configure script.
Reported by: Guy Yur <guyyur gmail com>
Reviewed by: rew
Differential Revision: https://reviews.freebsd.org/D27319
During the life of a process, new file descriptor tables may be allocated. When
a new table is allocated, the old table is placed in a free list and held onto
until all processes referencing them exit.
When a new file descriptor table is allocated, the old file descriptor table
can be freed when the current process has a single-thread and the file
descriptor table is not being shared with any other processes.
Reviewed by: kevans
Approved by: kevans (mentor)
Differential Revision: https://reviews.freebsd.org/D18617
No functional change intended.
Tracking these structures separately for each proc enables future work to
correctly emulate clone(2) in linux(4).
__FreeBSD_version is bumped (to 1300130) for consumption by, e.g., lsof.
Reviewed by: kib
Discussed with: markj, mjg
Differential Revision: https://reviews.freebsd.org/D27037
First, funsetownlst() list looks at the first element of the list to see
whether it's processing a process or a process group list. Then it
acquires the global sigio lock and processes the list. However, nothing
prevents the first sigio tracker from being freed by a concurrent
funsetown() before the sigio lock is acquired.
Fix this by acquiring the global sigio lock immediately after checking
whether the list is empty. Callers of funsetownlst() ensure that new
sigio trackers cannot be added concurrently.
Second, fsetown() uses funsetown() to remove an existing sigio structure
from a file object. However, funsetown() uses a racy check to avoid the
sigio lock, so two threads may call fsetown() on the same file object,
both observe that no sigio tracker is present, and enqueue two sigio
trackers for the same file object. However, if the file object is
destroyed, funsetown() will only remove one sigio tracker, and
funsetownlst() may later trigger a use-after-free when it clears the
file object reference for each entry in the list.
Fix this by introducing funsetown_locked(), which avoids the racy check.
Reviewed by: kib
Reported by: pho
Tested by: pho
MFC after: 1 week
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D27157
Since the code exits smr section prior to calling pwd_hold, the used
pwd can be freed and a new one allocated with the same address, making
the comparison erroneously true.
Note it is very unlikely anyone ran into it.
The pointer to vnode is already stored into f_vnode, so f_data can be
reused. Fix all found users of f_data for DTYPE_VNODE.
Provide finit_vnode() helper to initialize file of DTYPE_VNODE type.
Reviewed by: markj (previous version)
Discussed with: freqlabs (openzfs chunk)
Tested by: pho (previous version)
Sponsored by: The FreeBSD Foundation
Differential revision: https://reviews.freebsd.org/D26346
It keeps recalculated way more often than it is needed.
Provide a routine (fdlastfile) to get it if necessary.
Consumers may be better off with a bitmap iterator instead.
We don't expect to fail acquiring the reference unless running into a corner
case. Just in case ensure forward progress by taking the lock.
Reviewed by: kib, markj
Differential Revision: https://reviews.freebsd.org/D25616
For software like PostgreSQL and SQLite that sometimes reads sequentially
while also writing sequentially some distance behind with interleaved
syscalls on the same fd, performance is better on UFS if we do
sequential access heuristics separately for reads and writes.
Patch originally by Andrew Gierth in 2008, updated and proposed by me with
his permission.
Reviewed by: mjg, kib, tmunro
Approved by: mjg (mentor)
Obtained from: Andrew Gierth <andrew@tao11.riddles.org.uk>
Differential Revision: https://reviews.freebsd.org/D25024
Include a temporarily compatibility shim as well for kernels predating
close_range, since closefrom is used in some critical areas.
Reviewed by: markj (previous version), kib
Differential Revision: https://reviews.freebsd.org/D24399
close_range will clamp the range between [0, fdp->fd_lastfile], but failed
to take into account that fdp->fd_lastfile can become -1 if all fds are
closed. =-( In this scenario, just return because there's nothing further we
can do at the moment.
Add a test case for this, fork() and simply closefrom(0) twice in the child;
on the second invocation, fdp->fd_lastfile == -1 and will trigger a panic
before this change.
X-MFC-With: r359836
close_range(min, max, flags) allows for a range of descriptors to be
closed. The Python folk have indicated that they would much prefer this
interface to closefrom(2), as the case may be that they/someone have special
fds dup'd to higher in the range and they can't necessarily closefrom(min)
because they don't want to hit the upper range, but relocating them to lower
isn't necessarily feasible.
sys_closefrom has been rewritten to use kern_close_range() using ~0U to
indicate closing to the end of the range. This was chosen rather than
requiring callers of kern_close_range() to hold FILEDESC_SLOCK across the
call to kern_close_range for simplicity.
The flags argument of close_range(2) is currently unused, so any flags set
is currently EINVAL. It was added to the interface in Linux so that future
flags could be added for, e.g., "halt on first error" and things of this
nature.
This patch is based on a syscall of the same design that is expected to be
merged into Linux.
Reviewed by: kib, markj, vangyzen (all slightly earlier revisions)
Differential Revision: https://reviews.freebsd.org/D21627
filecaps_free_prep() bzeros the capabilities structure and we need to be
careful to synchronize with unlocked readers, which expect a consistent
rights structure.
Reviewed by: kib, mjg
Reported by: syzbot+5f30b507f91ddedded21@syzkaller.appspotmail.com
MFC after: 2 weeks
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D24120
This has a side effect of eliminating filedesc slock/sunlock during path
lookup, which in turn removes contention vs concurrent modifications to the fd
table.
Reviewed by: markj, kib
Differential Revision: https://reviews.freebsd.org/D23889
The new structure is copy-on-write. With the assumption that path lookups are
significantly more frequent than chdirs and chrooting this is a win.
This provides stable root and jail root vnodes without the need to reference
them on lookup, which in turn means less work on globally shared structures.
Note this also happens to fix a bug where jail vnode was never referenced,
meaning subsequent access on lookup could run into use-after-free.
Reviewed by: kib
Differential Revision: https://reviews.freebsd.org/D23884
The latter is a typedef of the former; the typedef exists and these bits are
representing vmprot values, so use the correct type.
Submitted by: sigsys@gmail.com
MFC after: 3 days
clang has the unfortunate property of paying little attention to prediction
hints when faced with a loop spanning the majority of the rotuine.
In particular fget_unlocked has an unlikely corner case where it starts almost
from scratch. Faced with this clang generates a maze of taken jumps, whereas
gcc produces jump-free code (in the expected case).
Work around the problem by providing a variant which only tries once and
resorts to calling the original code if anything goes wrong.
While here note that the 'seq' parameter is almost never passed, thus the
seldom users are redirected to call it directly.
This eliminates a branch from its consumers trading it for an extra call
if ktrace is enabled for curthread. Given that this is almost never true,
the tradeoff is worth it.
Instead of doing a 2 iteration loop (determined at runeimt), take advantage
of the fact that the size is already known.
While here provdie cap_check_inline so that fget_unlocked does not have to
do a function call.
Verified with the capsicum suite /usr/tests.
The code was using a hand-rolled fcmpset loop, while in other places the same
count is manipulated with the refcount API.
This transferred from a stylistic issue into a bug after the API got extended
to support flags. As a result the hand-rolled loop could bump the count high
enough to set the bit flag. Another bump + refcount_release would then free
the file prematurely.
The bug is only present in -CURRENT.
clang inlines fget -> _fget into kern_fstat and eliminates several checkes,
but prior to this change it would assume fget_unlocked was likely to fail
and consequently avoidable jumps got generated.