Commit Graph

18320 Commits

Author SHA1 Message Date
Mateusz Guzik
852088f6af vfs: add missing atomic conversion to writecount adjustment
Fixes:	("vfs: lockless writecount adjustment in set/unset text")
2021-05-14 17:42:05 +02:00
Mateusz Guzik
ca1ce50b2b vfs: add more safety against concurrent forced unmount to vn_write
1. stop re-reading ->v_mount (can become NULL)
2. stop re-reading ->v_type (can change to VBAD)
2021-05-14 14:22:22 +00:00
Mateusz Guzik
b5fb9ae687 vfs: lockless writecount adjustment in set/unset text
... for cases where this is not the first/last exec.
2021-05-14 14:22:21 +00:00
Mark Johnston
2cca77ee01 kqueue timer: Remove detached knotes from the process stop queue
There are some scenarios where a timer event may be detached when it is
on the process' kqueue timer stop queue.  If kqtimer_proc_continue() is
called after that point, it will iterate over the queue and access freed
timer structures.

It is also possible, at least in a multithreaded program, for a stopped
timer event to be scheduled without removing it from the process' stop
queue.  Ensure that we do not doubly enqueue the event structure in this
case.

Reported by:	syzbot+cea0931bb4e34cd728bd@syzkaller.appspotmail.com
Reported by:	syzbot+9e1a2f3734652015998c@syzkaller.appspotmail.com
Reviewed by:	kib
MFC after:	1 week
Sponsored by:	The FreeBSD Foundation
Differential Revision:	https://reviews.freebsd.org/D30251
2021-05-14 10:08:14 -04:00
Ed Maste
2c9764f36b regen syscall files after d51198d63b63 2021-05-13 14:09:58 -04:00
Mark Johnston
8b3c4231ab posix timers: Check for overflow when converting to ns
Disallow a time or timer period value when the conversion to nanoseconds
would overflow.  Otherwise it is possible to trigger a divison by zero
in realtime_expire_l(), where we compute the number of overruns by
dividing by the timer interval.

Fixes:	7995dae9 ("posix timers: Improve the overrun calculation")
Reported by:	syzbot+5ab360bd3d3e3c5a6e0e@syzkaller.appspotmail.com
Reported by:	syzbot+157b74ff493140d86eac@syzkaller.appspotmail.com
Reviewed by:	kib
MFC after:	1 week
Sponsored by:	The FreeBSD Foundation
Differential Revision:	https://reviews.freebsd.org/D30233
2021-05-13 08:34:03 -04:00
Mark Johnston
9246b3090c fork: Suspend other threads if both RFPROC and RFMEM are not set
Otherwise, a multithreaded parent process may trigger races in
vm_forkproc() if one thread calls rfork() with RFMEM set and another
calls rfork() without RFMEM.

Also simplify vm_forkproc() a bit, vmspace_unshare() already checks to
see if the address space is shared.

Reported by:	syzbot+0aa7c2bec74c4066c36f@syzkaller.appspotmail.com
Reported by:	syzbot+ea84cb06937afeae609d@syzkaller.appspotmail.com
Reviewed by:	kib
MFC after:	1 week
Sponsored by:	The FreeBSD Foundation
Differential Revision:	https://reviews.freebsd.org/D30220
2021-05-13 08:33:23 -04:00
Mateusz Guzik
cef8a95acb vfs: fix vnode use count leak in O_EMPTY_PATH support
The vnode returned by namei_setup is already referenced.

Reported by:	pho
2021-05-13 09:39:27 +00:00
Konstantin Belousov
6de3cf14c4 vn_open_cred(): disallow O_CREAT | O_EMPTY_PATH
This combination does not make sense, and cannot be satisfied by lookup.
In particular, lookup cannot supply dvp, it only can directly return vp.

Reported and reviewed by:	markj using syzkaller
Sponsored by:	The FreeBSD Foundation
MFC after:	3 days
2021-05-13 02:32:04 +03:00
Mark Johnston
d8acd2681b Fix mbuf leaks in various pru_send implementations
The various protocol implementations are not very consistent about
freeing mbufs in error paths.  In general, all protocols must free both
"m" and "control" upon an error, except if PRUS_NOTREADY is specified
(this is only implemented by TCP and unix(4) and requires further work
not handled in this diff), in which case "control" still must be freed.

This diff plugs various leaks in the pru_send implementations.

Reviewed by:	tuexen
MFC after:	2 weeks
Sponsored by:	The FreeBSD Foundation
Differential Revision:	https://reviews.freebsd.org/D30151
2021-05-12 13:00:09 -04:00
Mateusz Guzik
12288bd999 cache: fix lockless absolute symlink traversal to non-fp mounts
Said lookups would incorrectly fail with EOPNOTSUP.

Reported by:	kib
2021-05-11 04:30:12 +00:00
Mark Johnston
c8bbb1272c vfs: Fix error handling in vn_fullpath_hardlink()
vn_fullpath_any_smr() will return a positive error number if the
caller-supplied buffer isn't big enough.  In this case the error must be
propagated up, otherwise we may copy out uninitialized bytes.

Reported by:	syzkaller+KMSAN
Reviewed by:	mjg, kib
MFC aftr:	3 days
Sponsored by:	The FreeBSD Foundation
Differential Revision:	https://reviews.freebsd.org/D30198
2021-05-10 20:22:27 -04:00
Konstantin Belousov
5e7cdf1817 openat(2): add O_EMPTY_PATH
It reopens the passed file descriptor, checking the file backing vnode'
current access rights against open mode. In particular, this flag allows
to convert file descriptor opened with O_PATH, into operable file
descriptor, assuming permissions allow that.

Reviewed by:	markj
Tested by:	Andrew Walker <awalker@ixsystems.com>
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
Differential revision:	https://reviews.freebsd.org/D30148
2021-05-11 02:39:24 +03:00
Konstantin Belousov
d474440ab3 Constify vm_pager-related virtual tables.
Reviewed by:	markj
Tested by:	pho
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
Differential revision:	https://reviews.freebsd.org/D30070
2021-05-07 17:08:03 +03:00
Mark Johnston
9a7c2de364 realloc: Fix KASAN(9) shadow map updates
When copying from the old buffer to the new buffer, we don't know the
requested size of the old allocation, but only the size of the
allocation provided by UMA.  This value is "alloc".  Because the copy
may access bytes in the old allocation's red zone, we must mark the full
allocation valid in the shadow map.  Do so using the correct size.

Reported by:	kp
Tested by:	kp
Sponsored by:	The FreeBSD Foundation
2021-05-05 17:12:51 -04:00
Warner Losh
a512d0ab00 kern: clarify boot time
In FreeBSD, the current time is computed from uptime + boottime. Uptime
is a continuous, smooth function that's monotonically increasing. To
effect changes to the current time, boottime is adjusted.  boottime is
mutable and shouldn't be cached against future need. Document the
current implementation, with the caveat that we may stop stepping
boottime on resume in the future and will step uptime instead (noted in
the commit message, but not in the code).

Sponsored by:		Netflix
Reviewed by:		phk, rpokala
Differential Revision:	https://reviews.freebsd.org/D30116
2021-05-05 12:32:13 -06:00
Elliott Mitchell
a3c7da3d08 kern/intr: declare interrupt vectors unsigned
These should never get values large enough for sign to matter, but one
of them becoming negative could cause problems.

MFC after:	1 week
Reviewed by:	markj
Differential Revision:	https://reviews.freebsd.org/D29327
2021-05-03 13:24:30 -04:00
Mark Johnston
2b2d77e720 VOP_STAT: Provide a default value for va_gen
Some filesystems, e.g., pseudofs and the NFSv3 client, do not provide
one.

Reviewed by:	kib
Reported by:	KMSAN
Sponsored by:	The FreeBSD Foundation
Differential Revision:	https://reviews.freebsd.org/D30091
2021-05-03 13:24:30 -04:00
Mark Johnston
cdfcfc607a smp: Initialize arg->cpus sooner in smp_rendezvous_cpus_retry()
Otherwise, if !smp_started is true, then smp_rendezvous_cpus_done() will
harmlessly perform an atomic RMW on an uninitialized variable.

Reported by:	KMSAN
MFC after:	1 week
Sponsored by:	The FreeBSD Foundation
2021-05-03 13:24:30 -04:00
Konstantin Belousov
7cb40543e9 filt_timerexpire: do not iterate over the interval
User-supplied data might make this loop too time-consuming. Divide
directly, and handle both the possibility that we were woken up earlier,
and arithmetic overflows/underflows from the calculation.

Reported and tested by:	pho (previous version)
Reviewed by:	markj
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
Differential revision:	https://reviews.freebsd.org/D30069
2021-05-03 19:49:54 +03:00
Konstantin Belousov
87a64872cd Add ptrace(PT_COREDUMP)
It writes the core of live stopped process to the file descriptor
provided as an argument.

Based on the initial version from https://reviews.freebsd.org/D29691,
submitted by Michał Górny <mgorny@gentoo.org>.

Reviewed by:	markj
Tested by:	pho
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
Differential revision:	https://reviews.freebsd.org/D29955
2021-05-03 19:18:26 +03:00
Konstantin Belousov
68d311b666 ptracestop: mark threads suspended there with the new TDB_SSWITCH flag
This way threads in ptracestop can be discovered by debugger

Reviewed by:	markj
Tested by:	pho
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
Differential revision:	https://reviews.freebsd.org/D29955
2021-05-03 19:18:25 +03:00
Konstantin Belousov
9ebf9100ba ptrace: do not allow for parallel ptrace requests
Set a new P2_PTRACEREQ flag around the request Wait for the target     .
process P2_PTRACEREQ flag to clear before setting ours                 .

Otherwise, we rely on the moment that the process lock is not dropped
until the stopped target state is important.  This is going to be no
longer true after some future change.

Reviewed by:	markj
Tested by:	pho
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
Differential revision:	https://reviews.freebsd.org/D29955
2021-05-03 19:16:30 +03:00
Konstantin Belousov
54c8baa021 kern_ptrace(): extract code to determine ptrace eligibility into helper
Reviewed by:	markj
Tested by:	pho
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
Differential revision:	https://reviews.freebsd.org/D29955
2021-05-03 19:13:48 +03:00
Konstantin Belousov
2bd0506c8d kern_ptrace: change type of proctree_locked to bool
Tested by:	pho
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
Differential revision:	https://reviews.freebsd.org/D29955
2021-05-03 19:13:48 +03:00
Konstantin Belousov
af928fded0 Add thread_run_flash() helper
It unsuspends single suspended thread, passed as the argument.
It is up to the caller to arrange the target thread to suspend later,
since the state of the process is not changed from stopped.  In particular,
the unsuspended thread must not leave to userspace, since boundary code
is not prepared to this situation.

Reviewed by:	markj
Tested by:	pho
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
Differential revision:	https://reviews.freebsd.org/D29955
2021-05-03 19:13:47 +03:00
Konstantin Belousov
15465a2c25 Add sleepq_remove_nested()
The helper removes the thread from a sleep queue, assuming that it would
need to sleep. The sleepq_remove_nested() function is intended for quite
special case, where suspended thread from traced stopped process is
temporary unsuspended to do some work on behalf of the debugger in the
target context, and this work might require sleep.

Reviewed by:	markj
Tested by:	pho
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
Differential revision:	https://reviews.freebsd.org/D29955
2021-05-03 19:13:47 +03:00
Konstantin Belousov
86ffb3d1a0 ELF coredump: define several useful flags for the coredump operations
- SVC_ALL request dumping all map entries, including those marked as
  non-dumpable
- SVC_NOCOMPRESS disallows compressing the dump regardless of the sysctl
  policy
- SVC_PC_COREDUMP is provided for future use by userspace core dump
  request

Reviewed by:	markj
Tested by:	pho
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
Differential revision:	https://reviews.freebsd.org/D29955
2021-05-03 19:13:47 +03:00
Konstantin Belousov
5bc3c61780 imgact_elf: consistently pass flags from coredump down to helper functions
Reviewed by:	markj
Tested by:	pho
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
Differential revision:	https://reviews.freebsd.org/D29955
2021-05-03 19:13:47 +03:00
Rick Macklem
4f592683c3 copy_file_range(2): improve copying of a large hole to EOF
PR#255523 reported that a file copy for a file with a large hole
to EOF on ZFS ran slowly over NFSv4.2.
The problem was that vn_generic_copy_file_range() would
loop around reading the hole's data and then see it is all
0s. It was coded this way since UFS always allocates a data
block near the end of the file, such that a hole to EOF never exists.

This patch modifies vn_generic_copy_file_range() to check for a
ENXIO returned from VOP_IOCTL(..FIOSEEKDATA..) and handle that
case as a hole to EOF. asomers@ confirms that it works for his
ZFS test case.

PR:	255523
Tested by:	asomers
Reviewed by:	asomers
MFC after:	2 weeks
Differential Revision:	https://reviews.freebsd.org/D30076
2021-05-02 16:04:27 -07:00
Konstantin Belousov
2082565798 O_PATH: disable kqfilter for fifos
Filter on fifos is real filter for the object, and not a filesystem
events filter like EVFILT_VNODE.

Reported by:	markj using syzkaller
Reviewed by:	markj
Sponsored by:	The FreeBSD Foundation
MFC after:	3 days
2021-04-30 17:43:45 +03:00
Mark Johnston
20e3b9d8bd kasan: Use vm_offset_t for the first parameter to kasan_shadow_map()
No functional change intended.

Sponsored by:	The FreeBSD Foundation
2021-04-29 11:39:02 -04:00
Mateusz Guzik
074abaccfa cache: remove incomplete lockless lockout support during resize
This is already properly handled thanks to 2 step hash replacement.
2021-04-28 19:53:25 +00:00
Mark Johnston
d1e9441583 pipe: Avoid calling selrecord() on a closing pipe
pipe_poll() may add the calling thread to the selinfo lists of both ends
of a pipe.  It is ok to do this for the local end, since we know we hold
a reference on the file and so the local end is not closed.  It is not
ok to do this for the remote end, which may already be closed and have
called seldrain().  In this scenario, when the polling thread wakes up,
it may end up referencing a freed selinfo.

Guard the selrecord() call appropriately.

Reviewed by:	kib
Reported by:	syzkaller+KASAN
MFC after:	1 week
Differential Revision:	https://reviews.freebsd.org/D30016
2021-04-28 10:43:29 -04:00
Thomas Munro
3aaaa2efde poll(2): Add POLLRDHUP.
Teach poll(2) to support Linux-style POLLRDHUP events for sockets, if
requested.  Triggered when the remote peer shuts down writing or closes
its end.

Reviewed by:	kib
MFC after:	1 month
Differential Revision:	https://reviews.freebsd.org/D29757
2021-04-28 23:00:31 +12:00
Mark Johnston
409ab7e109 imgact_elf: Ensure that the return value in parse_notes is initialized
parse_notes relies on the caller-supplied callback to initialize "res".
Two callbacks are used in practice, brandnote_cb and note_fctl_cb, and
the latter fails to initialize res.  Fix it.

In the worst case, the bug would cause the inner loop of check_note to
examine more program headers than necessary, and the note header usually
comes last anyway.

Reviewed by:	kib
Reported by:	KMSAN
MFC after:	1 week
Sponsored by:	The FreeBSD Foundation
Differential Revision:	https://reviews.freebsd.org/D29986
2021-04-26 14:53:16 -04:00
Edward Tomasz Napierala
5d1d844a77 kern_linkat: modify to accept AT_ flags instead of FOLLOW/NOFOLLOW
This makes this API match other kern_xxxat() functions.

Reviewed By:	kib
Sponsored By:	EPSRC
Differential Revision:	https://reviews.freebsd.org/D29776
2021-04-25 14:13:12 +01:00
Robert Watson
af14713d49 Support run-time configuration of the PIPE_MINDIRECT threshold.
PIPE_MINDIRECT determines at what (blocking) write size one-copy
optimizations are applied in pipe(2) I/O.  That threshold hasn't
been tuned since the 1990s when this code was originally
committed, and allowing run-time reconfiguration will make it
easier to assess whether contemporary microarchitectures would
prefer a different threshold.

(On our local RPi4 baords, the 8k default would ideally be at least
32k, but it's not clear how generalizable that observation is.)

MFC after:	3 weeks
Reviewers:	jrtc27, arichardson
Differential Revision: https://reviews.freebsd.org/D29819
2021-04-24 20:04:28 +01:00
Mark Johnston
8e8f1cc9bb Re-enable network ioctls in capability mode
This reverts a portion of 274579831b ("capsicum: Limit socket
operations in capability mode") as at least rtsol and dhcpcd rely on
being able to configure network interfaces while in capability mode.

Reported by:	bapt, Greg V
Sponsored by:	The FreeBSD Foundation
2021-04-23 09:22:49 -04:00
Warner Losh
df456a1fcf newbus: style nit (align comments)
Sponsored by:		Netflix
2021-04-21 15:37:24 -06:00
Warner Losh
1eebd6158c newbus: Optimize/Simplify kobj_class_compile_common a little
"i" is not used in this loop at all. There's no need to initialize and
increment it.

Reviewed by:		markj@
Sponsored by:		Netflix
Differential Revision:	https://reviews.freebsd.org/D29898
2021-04-21 15:37:24 -06:00
Konstantin Belousov
54f98c4dbf vn_open_vnode(): handle error when fp == NULL
If VOP_ADD_WRITECOUNT() or adv locking failed, so VOP_CLOSE() needs to
be called, we cannot use fp fo_close() when there is no fp.  This occurs
when e.g. kernel code directly calls vn_open() instead of the open(2)
syscall.

In this case, VOP_CLOSE() can be called directly, after possible lock
upgrade.

Reported by:	nvass@gmx.com
PR:	255119
Tested by:	pho
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
Differential Revision:	https://reviews.freebsd.org/D29830
2021-04-21 18:06:51 +03:00
Konstantin Belousov
ecfbddf0cd sysctl vm.objects: report backing object and swap use
For anonymous objects, provide a handle kvo_me naming the object,
and report the handle of the backing object.  This allows userspace
to deconstruct the shadow chain.  Right now the handle is the address
of the object in KVA, but this is not guaranteed.

For the same anonymous objects, report the swap space used for actually
swapped out pages, in kvo_swapped field.  I do not believe that it is
useful to report full 64bit counter there, so only uint32_t value is
returned, clamped to the max.

For kinfo_vmentry, report anonymous object handle backing the entry,
so that the shadow chain for the specific mapping can be deconstructed.

Reviewed by:	markj
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
Differential revision:	https://reviews.freebsd.org/D29771
2021-04-19 21:32:01 +03:00
Konstantin Belousov
4342ba184c sysctl_handle_string: do not malloc when SYSCTL_IN cannot fault
In particular, this avoids malloc(9) calls when from early tunable handling,
with no working malloc yet.

Reported and tested by:	mav
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
2021-04-19 21:32:01 +03:00
Konstantin Belousov
578c26f31c linkat(2): check NIRES_EMPTYPATH on the first fd arg
Reported by:	arichardson
Reviewed by:	markj
MFC after:	1 week
Differential revision:	https://reviews.freebsd.org/D29834
2021-04-19 21:32:01 +03:00
Warner Losh
571a1a64b1 Minor style tidy: if( -> if (
Fix a few 'if(' to be 'if (' in a few places, per style(9) and
overwhelming usage in the rest of the kernel / tree.

MFC After:		3 days
Sponsored by:		Netflix
2021-04-18 11:19:15 -06:00
Warner Losh
f1f9870668 Minor style cleanup
We prefer 'while (0)' to 'while(0)' according to grep and stlye(9)'s
space after keyword rule. Remove a few stragglers of the latter.
Many of these usages were inconsistent within the file.

MFC After:		3 days
Sponsored by:		Netflix
2021-04-18 11:14:17 -06:00
Konstantin Belousov
bbf7a4e878 O_PATH: allow vnode kevent filter on such files
if VREAD access is checked as allowed during open

Requested by:	wulf
Reviewed by:	markj
Tested by:	pho
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
Differential revision:	https://reviews.freebsd.org/D29323
2021-04-15 12:49:18 +03:00
Konstantin Belousov
f9b923af34 O_PATH: Allow to open symlink
When O_NOFOLLOW is specified, namei() returns the symlink itself.  In
this case, open(O_PATH) should be allowed, to denote the location of symlink
itself.

Prevent O_EXEC in this case, execve(2) code is not ready to try to execute
symlinks.

Reported by:	wulf
Reviewed by:	markj
Tested by:	pho
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
Differential revision:	https://reviews.freebsd.org/D29323
2021-04-15 12:49:09 +03:00
Konstantin Belousov
a5970a529c Make files opened with O_PATH to not block non-forced unmount
by only keeping hold count on the vnode, instead of the use count.

Reviewed by:	markj
Tested by:	pho
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
Differential revision:	https://reviews.freebsd.org/D29323
2021-04-15 12:48:27 +03:00