Commit Graph

655 Commits

Author SHA1 Message Date
Mark Johnston
f52979098d Fix a pair of races in SIGIO registration
First, funsetownlst() list looks at the first element of the list to see
whether it's processing a process or a process group list.  Then it
acquires the global sigio lock and processes the list.  However, nothing
prevents the first sigio tracker from being freed by a concurrent
funsetown() before the sigio lock is acquired.

Fix this by acquiring the global sigio lock immediately after checking
whether the list is empty.  Callers of funsetownlst() ensure that new
sigio trackers cannot be added concurrently.

Second, fsetown() uses funsetown() to remove an existing sigio structure
from a file object.  However, funsetown() uses a racy check to avoid the
sigio lock, so two threads may call fsetown() on the same file object,
both observe that no sigio tracker is present, and enqueue two sigio
trackers for the same file object.  However, if the file object is
destroyed, funsetown() will only remove one sigio tracker, and
funsetownlst() may later trigger a use-after-free when it clears the
file object reference for each entry in the list.

Fix this by introducing funsetown_locked(), which avoids the racy check.

Reviewed by:	kib
Reported by:	pho
Tested by:	pho
MFC after:	1 week
Sponsored by:	The FreeBSD Foundation
Differential Revision:	https://reviews.freebsd.org/D27157
2020-11-11 13:44:27 +00:00
Mateusz Guzik
3c50616fc1 fd: make all f_count uses go through refcount_* 2020-11-05 02:12:33 +00:00
Mateusz Guzik
d737e9eaf5 fd: hide _fdrop 0 count check behind INVARIANTS
While here use refcount_load and make sure to report the tested value.
2020-11-05 02:12:08 +00:00
Mateusz Guzik
dd28b379cb vfs: support lockless dirfd lookups 2020-10-10 03:48:17 +00:00
Mateusz Guzik
4e2266100d cache: fix pwd use-after-free in setting up fallback
Since the code exits smr section prior to calling pwd_hold, the used
pwd can be freed and a new one allocated with the same address, making
the comparison erroneously true.

Note it is very unlikely anyone ran into it.
2020-10-05 19:38:51 +00:00
Konstantin Belousov
96474d2a3f Do not copy vp into f_data for DTYPE_VNODE files.
The pointer to vnode is already stored into f_vnode, so f_data can be
reused.  Fix all found users of f_data for DTYPE_VNODE.

Provide finit_vnode() helper to initialize file of DTYPE_VNODE type.

Reviewed by:	markj (previous version)
Discussed with:	freqlabs (openzfs chunk)
Tested by:	pho (previous version)
Sponsored by:	The FreeBSD Foundation
Differential revision:	https://reviews.freebsd.org/D26346
2020-09-15 21:55:21 +00:00
Mateusz Guzik
54052edaa0 fd: fix fhold on an uninitialized var in fdcopy_remapped
Reported by:	gcc9
2020-09-08 16:07:47 +00:00
Mateusz Guzik
cd4a1797b0 fd: pwd_drop after releasing filedesc lock
Fixes a potential LOR against vnode lock.
2020-08-22 16:57:45 +00:00
Mateusz Guzik
e914224af1 fd: put back FILEDESC_SUNLOCK to pwd_hold lost during rebase
Reported by:	pho
2020-07-25 15:34:29 +00:00
Mateusz Guzik
07d2145a17 vfs: add the infrastructure for lockless lookup
Reviewed by:    kib
Tested by:      pho (in a patchset)
Differential Revision:	https://reviews.freebsd.org/D25577
2020-07-25 10:32:45 +00:00
Mateusz Guzik
d8bc2a17a5 fd: remove fd_lastfile
It keeps recalculated way more often than it is needed.

Provide a routine (fdlastfile) to get it if necessary.

Consumers may be better off with a bitmap iterator instead.
2020-07-15 10:24:04 +00:00
Mateusz Guzik
7177149a4d fd: add obvious branch predictions to fdalloc 2020-07-15 10:14:00 +00:00
Mateusz Guzik
373278a7f6 fd: stop looping in pwd_hold
We don't expect to fail acquiring the reference unless running into a corner
case. Just in case ensure forward progress by taking the lock.

Reviewed by:	kib, markj
Differential Revision: https://reviews.freebsd.org/D25616
2020-07-11 21:57:03 +00:00
Thomas Munro
f270658873 vfs: track sequential reads and writes separately
For software like PostgreSQL and SQLite that sometimes reads sequentially
while also writing sequentially some distance behind with interleaved
syscalls on the same fd, performance is better on UFS if we do
sequential access heuristics separately for reads and writes.

Patch originally by Andrew Gierth in 2008, updated and proposed by me with
his permission.

Reviewed by:	mjg, kib, tmunro
Approved by:	mjg (mentor)
Obtained from:	Andrew Gierth <andrew@tao11.riddles.org.uk>
Differential Revision:	https://reviews.freebsd.org/D25024
2020-06-21 08:51:24 +00:00
Mateusz Guzik
21d3be9105 pwd: unbreak repeated calls to set_rootvnode
Prior to the change the once set pointer would never be updated.

Unbreaks reboot -r.

Reported by:	Ross Gohlke
2020-04-27 13:54:00 +00:00
Kyle Evans
7d03e08112 Mark closefrom(2) COMPAT12, reimplement in libc to wrap close_range
Include a temporarily compatibility shim as well for kernels predating
close_range, since closefrom is used in some critical areas.

Reviewed by:	markj (previous version), kib
Differential Revision:	https://reviews.freebsd.org/D24399
2020-04-14 18:07:42 +00:00
Kyle Evans
605c4cda2f close_range/closefrom: fix regression from close_range introduction
close_range will clamp the range between [0, fdp->fd_lastfile], but failed
to take into account that fdp->fd_lastfile can become -1 if all fds are
closed. =-( In this scenario, just return because there's nothing further we
can do at the moment.

Add a test case for this, fork() and simply closefrom(0) twice in the child;
on the second invocation, fdp->fd_lastfile == -1 and will trigger a panic
before this change.

X-MFC-With:	r359836
2020-04-13 17:55:31 +00:00
Kyle Evans
472ced39ef Implement a close_range(2) syscall
close_range(min, max, flags) allows for a range of descriptors to be
closed. The Python folk have indicated that they would much prefer this
interface to closefrom(2), as the case may be that they/someone have special
fds dup'd to higher in the range and they can't necessarily closefrom(min)
because they don't want to hit the upper range, but relocating them to lower
isn't necessarily feasible.

sys_closefrom has been rewritten to use kern_close_range() using ~0U to
indicate closing to the end of the range. This was chosen rather than
requiring callers of kern_close_range() to hold FILEDESC_SLOCK across the
call to kern_close_range for simplicity.

The flags argument of close_range(2) is currently unused, so any flags set
is currently EINVAL. It was added to the interface in Linux so that future
flags could be added for, e.g., "halt on first error" and things of this
nature.

This patch is based on a syscall of the same design that is expected to be
merged into Linux.

Reviewed by:	kib, markj, vangyzen (all slightly earlier revisions)
Differential Revision:	https://reviews.freebsd.org/D21627
2020-04-12 21:23:19 +00:00
Mark Johnston
429537caeb kern_dup(): Call filecaps_free_prep() in a write section.
filecaps_free_prep() bzeros the capabilities structure and we need to be
careful to synchronize with unlocked readers, which expect a consistent
rights structure.

Reviewed by:	kib, mjg
Reported by:	syzbot+5f30b507f91ddedded21@syzkaller.appspotmail.com
MFC after:	2 weeks
Sponsored by:	The FreeBSD Foundation
Differential Revision:	https://reviews.freebsd.org/D24120
2020-03-19 15:40:05 +00:00
Mateusz Guzik
d2222aa0e9 fd: use smr for managing struct pwd
This has a side effect of eliminating filedesc slock/sunlock during path
lookup, which in turn removes contention vs concurrent modifications to the fd
table.

Reviewed by:	markj, kib
Differential Revision:	https://reviews.freebsd.org/D23889
2020-03-08 00:23:36 +00:00
Mateusz Guzik
8d03b99b9d fd: move vnodes out of filedesc into a dedicated structure
The new structure is copy-on-write. With the assumption that path lookups are
significantly more frequent than chdirs and chrooting this is a win.

This provides stable root and jail root vnodes without the need to reference
them on lookup, which in turn means less work on globally shared structures.
Note this also happens to fix a bug where jail vnode was never referenced,
meaning subsequent access on lookup could run into use-after-free.

Reviewed by:	kib
Differential Revision:	https://reviews.freebsd.org/D23884
2020-03-01 21:53:46 +00:00
Mateusz Guzik
8243063f9b fd: make fgetvp_rights work without the filedesc lock
Reviewed by:	kib
Differential Revision:	https://reviews.freebsd.org/D23883
2020-03-01 21:50:13 +00:00
Mateusz Guzik
32a86c44ee fd: use new capsicum helpers 2020-02-15 01:28:55 +00:00
Mateusz Guzik
8f86349f8b fd: remove no longer needed atomic_load_ptr casts 2020-02-14 23:18:22 +00:00
Mateusz Guzik
6ed30ea4c0 fd: annotate finstall with prediction branches 2020-02-14 11:22:12 +00:00
Kyle Evans
0f5f49eff7 u_char -> vm_prot_t in a couple of places, NFC
The latter is a typedef of the former; the typedef exists and these bits are
representing vmprot values, so use the correct type.

Submitted by:	sigsys@gmail.com
MFC after:	3 days
2020-02-14 02:22:08 +00:00
Mateusz Guzik
1a9fe4528b fd: always nullify *fdp in fget* routines
Some consumers depend on the pointer being NULL if an error is returned.

The guarantee got broken in r357469.

Reported by:	https://syzkaller.appspot.com/bug?extid=0c9b05e2b727aae21eef
Noted by:	markj
2020-02-05 00:20:26 +00:00
Mateusz Guzik
8151b6e92a fd: partially unengrish the previous commit 2020-02-03 22:34:50 +00:00
Mateusz Guzik
e10f063b30 fd: streamline fget_unlocked
clang has the unfortunate property of paying little attention to prediction
hints when faced with a loop spanning the majority of the rotuine.

In particular fget_unlocked has an unlikely corner case where it starts almost
from scratch. Faced with this clang generates a maze of taken jumps, whereas
gcc produces jump-free code (in the expected case).

Work around the problem by providing a variant which only tries once and
resorts to calling the original code if anything goes wrong.

While here note that the 'seq' parameter is almost never passed, thus the
seldom users are redirected to call it directly.
2020-02-03 22:32:49 +00:00
Mateusz Guzik
52604ed792 fd: remove the seq argument from fget_unlocked
It is almost always NULL.
2020-02-03 22:27:55 +00:00
Mateusz Guzik
7f1566f884 fd: remove the seq argument from fget routines
It is almost always NULL.
2020-02-03 22:27:03 +00:00
Mateusz Guzik
0a1427c5ab ktrace: provide ktrstat_error
This eliminates a branch from its consumers trading it for an extra call
if ktrace is enabled for curthread. Given that this is almost never true,
the tradeoff is worth it.
2020-02-03 22:26:00 +00:00
Mateusz Guzik
bcd1cf4f03 capsicum: faster cap_rights_contains
Instead of doing a 2 iteration loop (determined at runeimt), take advantage
of the fact that the size is already known.

While here provdie cap_check_inline so that fget_unlocked does not have to
do a function call.

Verified with the capsicum suite /usr/tests.
2020-02-03 17:08:11 +00:00
Mateusz Guzik
fee204544e fd: fix f_count acquire in fget_unlocked
The code was using a hand-rolled fcmpset loop, while in other places the same
count is manipulated with the refcount API.

This transferred from a stylistic issue into a bug after the API got extended
to support flags. As a result the hand-rolled loop could bump the count high
enough to set the bit flag. Another bump + refcount_release would then free
the file prematurely.

The bug is only present in -CURRENT.
2020-02-03 14:28:31 +00:00
Mateusz Guzik
2568d5bb79 fd: sprinkle some predits around fget
clang inlines fget -> _fget into kern_fstat and eliminates several checkes,
but prior to this change it would assume fget_unlocked was likely to fail
and consequently avoidable jumps got generated.
2020-02-02 09:38:40 +00:00
Mateusz Guzik
da4f45ea5c fd: use atomic_load_ptr instead of hand-rolled cast through volatile
No change in assembly.
2020-02-02 09:37:16 +00:00
Mateusz Guzik
d3cc535474 vfs: provide F_ISUNIONSTACK as a kludge for libc
Prior to introduction of this op libc's readdir would call fstatfs(2), in
effect unnecessarily copying kilobytes of data just to check fs name and a
mount flag.

Reviewed by:	kib (previous version)
Differential Revision:	https://reviews.freebsd.org/D23162
2020-01-17 14:42:25 +00:00
Mateusz Guzik
b249ce48ea vfs: drop the mostly unused flags argument from VOP_UNLOCK
Filesystems which want to use it in limited capacity can employ the
VOP_UNLOCK_FLAGS macro.

Reviewed by:	kib (previous version)
Differential Revision:	https://reviews.freebsd.org/D21427
2020-01-03 22:29:58 +00:00
Mateusz Guzik
55eb92db8d fd: static-ize and devolatile openfiles
Almost all access is using atomics. The only read is sysctl which should use
a whole-int-at-a-time friendly read internally.
2019-12-11 23:09:12 +00:00
Mark Johnston
4a7b33ecf4 Disallow fcntl(F_READAHEAD) when the vnode is not a regular file.
The mountpoint may not have defined an iosize parameter, so an attempt
to configure readahead on a device file can lead to a divide-by-zero
crash.

The sequential heuristic is not applied to I/O to or from device files,
and posix_fadvise(2) returns an error when v_type != VREG, so perform
the same check here.

Reported by:	syzbot+e4b682208761aa5bc53a@syzkaller.appspotmail.com
Reviewed by:	kib
MFC after:	3 days
Sponsored by:	The FreeBSD Foundation
Differential Revision:	https://reviews.freebsd.org/D21864
2019-10-02 15:45:49 +00:00
Kyle Evans
af755d3e48 [1/3] Add mostly Linux-compatible file sealing support
File sealing applies protections against certain actions
(currently: write, growth, shrink) at the inode level. New fileops are added
to accommodate seals - EINVAL is returned by fcntl(2) if they are not
implemented.

Reviewed by:	markj, kib
Differential Revision:	https://reviews.freebsd.org/D21391
2019-09-25 17:32:43 +00:00
Konstantin Belousov
f1cf2b9dcb Check and avoid overflow when incrementing fp->f_count in
fget_unlocked() and fhold().

On sufficiently large machine, f_count can be legitimately very large,
e.g. malicious code can dup same fd up to the per-process
filedescriptors limit, and then fork as much as it can.
On some smaller machine, I see
	kern.maxfilesperproc: 939132
	kern.maxprocperuid: 34203
which already overflows u_int.  More, the malicious code can create
transient references by sending fds over unix sockets.

I realized that this check is missed after reading
https://secfault-security.com/blog/FreeBSD-SA-1902.fd.html

Reviewed by:	markj (previous version), mjg
Tested by:	pho (previous version)
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
Differential revision:	https://reviews.freebsd.org/D20947
2019-07-21 15:07:12 +00:00
Mark Johnston
7c3703a694 Use a consistent snapshot of the fd's rights in fget_mmap().
fget_mmap() translates rights on the descriptor to a VM protection
mask.  It was doing so without holding any locks on the descriptor
table, so a writer could simultaneously be modifying those rights.
Such a situation would be detected using a sequence counter, but
not before an inconsistency could trigger assertion failures in
the capability code.

Fix the problem by copying the fd's rights to a structure on the stack,
and perform the translation only once we know that that snapshot is
consistent.

Reported by:	syzbot+ae359438769fda1840f8@syzkaller.appspotmail.com
Reviewed by:	brooks, mjg
MFC after:	1 week
Sponsored by:	The FreeBSD Foundation
Differential Revision:	https://reviews.freebsd.org/D20800
2019-06-29 16:11:09 +00:00
Alan Somers
38b06f8ac4 fcntl: fix overflow when setting F_READAHEAD
VOP_READ and VOP_WRITE take the seqcount in blocks in a 16-bit field.
However, fcntl allows you to set the seqcount in bytes to any nonnegative
31-bit value. The result can be a 16-bit overflow, which will be
sign-extended in functions like ffs_read. Fix this by sanitizing the
argument in kern_fcntl. As a matter of policy, limit to IO_SEQMAX rather
than INT16_MAX.

Also, fifos have overloaded the f_seqcount field for a completely different
purpose ever since r238936.  Formalize that by using a union type.

Reviewed by:	cem
MFC after:	2 weeks
Sponsored by:	The FreeBSD Foundation
Differential Revision:	https://reviews.freebsd.org/D20710
2019-06-20 23:07:20 +00:00
Konstantin Belousov
bc2d137acb Make pack_kinfo() available for external callers.
Reviewed by:	jilles, tmunro
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
Differential revision:	https://reviews.freebsd.org/D20258
2019-05-23 12:25:03 +00:00
Mark Johnston
fd76e780a7 Reject F_SETLK_REMOTE commands when sysid == 0.
A sysid of 0 denotes the local system, and some handlers for remote
locking commands do not attempt to deal with local locks.  Note that
F_SETLK_REMOTE is only available to privileged users as it is intended
to be used as a testing interface.

Reviewed by:	kib
Reported by:	syzbot+9c457a6ae014a3281eb8@syzkaller.appspotmail.com
MFC after:	2 weeks
Sponsored by:	The FreeBSD Foundation
Differential Revision:	https://reviews.freebsd.org/D19702
2019-03-25 21:38:58 +00:00
Mateusz Guzik
55fda58146 Rename seq to seqc to avoid namespace clashes with Linux
Linux generates the content of procfs files using a mechanism prefixed with
seq_*. This in particular came up with recent gcov import.

Sponsored by:	The FreeBSD Foundation
2019-02-27 22:56:55 +00:00
Matt Macy
ebe0b35a18 Change seq_read to seq_load to avoid namespace conflicts with lkpi
MFC after:	1 week
Sponsored by:	iX Systems
2019-02-23 21:04:48 +00:00
Mark Johnston
093295ae49 Remove an obsolete comment.
MFC after:	3 days
2019-02-20 18:29:52 +00:00
Mateusz Guzik
24d64be4c5 vfs: mostly depessimize NDINIT_ALL
1) filecaps_init was unnecesarily a function call
2) an asignment at the end was preventing tail calling of cap_rights_init

Sponsored by:	The FreeBSD Foundation
2018-12-14 03:55:08 +00:00