17217 Commits

Author SHA1 Message Date
mjg
693023e586 vfs: switch to smp_rendezvous_cpus_retry for vfs_op_thread_enter/exit
In particular on amd64 this eliminates an atomic op in the common case,
trading it for IPIs in the uncommon case of catching CPUs executing the
code while the filesystem is getting suspended or unmounted.
2020-02-12 11:17:45 +00:00
mjg
e40f1f910d rms: use smp_rendezvous_cpus_retry instead of a hand-rolled variant 2020-02-12 11:17:18 +00:00
mjg
e165df173d Add smp_rendezvous_cpus_retry
This is a wrapper around smp_rendezvous_cpus which enables use of IPI
handlers which can fail and require retrying.

The wait_func argument is added to provide a routine which can be used to
poll the CPU of interest for when the IPI can be retried.

Handlers which succeed must call smp_rendezvous_cpus_done to denote that
fact.

Discussed with:	 jeff
Differential Revision:	https://reviews.freebsd.org/D23582
2020-02-12 11:16:55 +00:00
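
The retry pattern described in the commit above can be sketched in plain
userspace C. This is only an illustration under loud assumptions: there are
no IPIs here, the "CPUs" are array slots, and every name (rendezvous_retry,
handler_fn, wait_fn, demo_handler, demo_wait, busy) is hypothetical. The
handler reports success through its return value, whereas the commit says
real handlers call smp_rendezvous_cpus_done.

```c
#include <stdbool.h>
#include <stddef.h>

#define NCPU	4

/* Hypothetical handler: returns true when done, false to request a retry. */
typedef bool (*handler_fn)(int cpu, void *arg);
/* Hypothetical wait routine: polls a CPU until a retry is worthwhile. */
typedef void (*wait_fn)(int cpu, void *arg);

/* Run the handler for every CPU, retrying the ones that failed. */
static int
rendezvous_retry(handler_fn handler, wait_fn wait, void *arg)
{
	bool done[NCPU] = { false };
	bool pending;
	int rounds = 0;

	do {
		pending = false;
		rounds++;
		for (int cpu = 0; cpu < NCPU; cpu++) {
			if (done[cpu])
				continue;
			if (handler(cpu, arg)) {
				done[cpu] = true;
			} else {
				/* Wait for the CPU, then retry next round. */
				wait(cpu, arg);
				pending = true;
			}
		}
	} while (pending);
	return (rounds);
}

/* Demo state: nonzero means the CPU is inside the protected section. */
static int busy[NCPU];

static bool
demo_handler(int cpu, void *arg)
{
	(void)arg;
	return (busy[cpu] == 0);
}

static void
demo_wait(int cpu, void *arg)
{
	(void)arg;
	busy[cpu] = 0;	/* pretend the CPU has left the section */
}
```

With one slot marked busy, rendezvous_retry(demo_handler, demo_wait, NULL)
needs a second round; with none busy it finishes in one.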
mjg
d3a83bbe26 Store offset into zpcpu allocations in the per-cpu area.
This shortens zpcpu_get and allows more optimizations.

Reviewed by:	jeff
Differential Revision:	https://reviews.freebsd.org/D23570
2020-02-12 11:11:22 +00:00
mjg
bb81af52b3 epoch: convert zpcpu_get_cpua(.., curcpu) to zpcpu_get 2020-02-12 11:10:10 +00:00
glebius
755fde96b6 Add flag to struct task to mark the task as requiring network epoch.
When processing a taskqueue, if a task has an associated epoch, enter
it for the duration of the task.  If consecutive tasks belong to the
same epoch, batch them.  For now this applies to the network epoch
only.

Shrink the ta_priority size to 8 bits.  No current consumers use
a priority that won't fit into 8 bits.  Also, the complexity of
taskqueue_enqueue() is the square of the maximum priority value, so
we are unlikely to ever want to go over UCHAR_MAX here.

Reviewed by:	hselasky
Differential Revision:	https://reviews.freebsd.org/D23518
2020-02-11 18:48:07 +00:00
mjg
cbc55c85d0 vfs: fix vhold race in mnt_vnode_next_lazy_relock
vdrop can set the hold count to 0 and wait for the ->mnt_listmtx held by
mnt_vnode_next_lazy_relock caller. The routine incorrectly asserted the
count has to be > 0.

Reported by:	pho
Tested by:	pho
2020-02-11 18:19:56 +00:00
mjg
90f728256d capsicum: restore the cap_rights_contains symbol
It is expected to be provided by libc.

PR:		244033
Reported by:	 Jan Kokemueller
2020-02-11 18:13:53 +00:00
mjg
11a0ed9355 vfs: fix device count leak on vrele racing with vgone
The race is:

CPU1                                CPU2
                                    devfs_reclaim_vchr
make v_usecount 0
                                      VI_LOCK
                                      sees v_usecount == 0, no updates
                                      vp->v_rdev = NULL;
                                      ...
                                      VI_UNLOCK
VI_LOCK
v_decr_devcount
  sees v_rdev == NULL, no updates

In this scenario si_devcount decrement is not performed.

Note this can only happen if the vnode lock is not held.

Reviewed by:	kib
Tested by:	pho
Differential Revision:	https://reviews.freebsd.org/D23529
2020-02-10 22:28:54 +00:00
lwhsu
9a6525b977 Restore the behavior of allowing empty string in a string sysctl
Added as a special case to avoid unnecessary memory operations.

Reviewed by:	delphij
Sponsored by:	The FreeBSD Foundation
2020-02-10 20:53:59 +00:00
hselasky
c27aafcb66 Fix for unbalanced EPOCH(9) usage in the generic kernel interrupt
handler.

Interrupt handlers are removed via intr_event_execute_handlers() when
IH_DEAD is set. The thread removing the interrupt is woken up, and
calls intr_event_update(). When this happens, the ie_hflags are
cleared and re-built from all the remaining handlers sharing the
event. When the last IH_NET handler is removed, the IH_NET flag will
be cleared from ih_hflags (or ie_hflags may still be being rebuilt in
a different context), and the ithread_execute_handlers() may return
with ie_hflags missing IH_NET. This can lead to a scenario where
IH_NET was present before calling ithread_execute_handlers, and is not
present at its return, meaning the need for epoch must be cached
locally.

This can happen when loading and unloading network drivers. Also make
sure the ie_hflags is not cleared before being updated.

This is a regression issue after r357004.

Backtrace:
panic()
# trying to access epoch tracker on stack of dead thread
_epoch_enter_preempt()
ifunit_ref()
ifioctl()
fo_ioctl()
kern_ioctl()
sys_ioctl()
syscallenter()
amd64_syscall()

Differential Revision:	https://reviews.freebsd.org/D23483
Reviewed by:	glebius@, gallatin@, mav@, jeff@ and kib@
Sponsored by:	Mellanox Technologies
2020-02-10 20:23:08 +00:00
mjg
7591469890 vfs: fix lock recursion in vrele
vrele is supposed to be called with an unlocked vnode, but this was never
asserted when v_usecount was > 0. For such counts the lock is never touched
by the routine. As a result the kernel has several consumers which expect
vunref semantics and get away with calling vrele since they happen to never do
it when this is the last reference (and for some of them this may happen to be
a guarantee).

Work around the problem by changing vrele semantics to tolerate being called
with a lock. This eliminates a possible bug where the lock is already held and
vputx takes it anyway.

Reviewed by:	kib
Tested by:	pho
Differential Revision:	https://reviews.freebsd.org/D23528
2020-02-10 13:54:34 +00:00
kib
29edd909b0 Add sysctl kern.proc.sigfastblk for reporting sigfastblock word address.
Tested by:	pho
Discussed with:	cem, emaste, jilles
Sponsored by:	The FreeBSD Foundation
Differential revision:	https://reviews.freebsd.org/D12773
2020-02-09 12:29:51 +00:00
kib
a9939ce863 Add AT_BSDFLAGS auxv entry.
The intent is to provide bsd-specific flags relevant to interpreter
and C runtime.  I did not want to reuse AT_FLAGS which is common ELF
auxv entry.

Use bsdflags to report kernel support for sigfastblock(2).  This
allows rtld and libthr to safely infer the syscall presence without
SIGSYS.  The tunable kern.elf{32,64}.sigfastblock blocks reporting.

Tested by:	pho
Discussed with:	cem, emaste, jilles
Sponsored by:	The FreeBSD Foundation
Differential revision:	https://reviews.freebsd.org/D12773
2020-02-09 12:10:37 +00:00
kib
625da238b0 Regen. 2020-02-09 11:53:37 +00:00
kib
c3e1a2bd2c Add a way to manage thread signal mask using shared word, instead of syscall.
A new syscall sigfastblock(2) is added which registers a uint32_t
variable as containing the count of blocks for signal delivery.  Its
content is read by the kernel on each syscall entry and on AST processing;
a non-zero count of blocks is interpreted the same as a signal mask
blocking all signals.

The biggest downside of the feature that I see is that memory
corruption affecting the registered fast sigblock location would
cause quite strange application misbehavior. For instance, the process
would be immune to ^C (but killable by SIGKILL).

With consumers (rtld and libthr) added, benchmarks do not show a
slow-down of the syscalls in micro-measurements, and macro benchmarks
like buildworld do not demonstrate a difference. Part of the reason is
that buildworld time is dominated by compiler, and clang already links
to libthr. On the other hand, small utilities typically used by shell
scripts have the total number of syscalls cut by half.

The syscall is not exported from the stable libc version namespace on
purpose.  It is intended to be used only by our C runtime
implementation internals.

Tested by:	pho
Discussed with:	cem, emaste, jilles
Sponsored by:	The FreeBSD Foundation
Differential revision:	https://reviews.freebsd.org/D12773
2020-02-09 11:53:12 +00:00
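
The shared-word idea from the commit above can be sketched as a userspace
simulation. Everything here is an assumption made for illustration: the
names block_word, sig_block, sig_unblock and delivery_allowed are
hypothetical, the "kernel side" is just a function, and the real word may
carry more than a plain count.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-in for the word registered with the kernel. */
static uint32_t block_word;

/* Userspace side: enter a critical section without making a syscall. */
static void
sig_block(void)
{
	block_word++;
}

/* Userspace side: leave a critical section; blocks may nest. */
static void
sig_unblock(void)
{
	assert(block_word > 0);
	block_word--;
}

/* Simulated kernel side: a nonzero count means all signals are blocked. */
static bool
delivery_allowed(void)
{
	return (block_word == 0);
}
```

The point of the design is that sig_block and sig_unblock are plain memory
writes, so nested critical sections cost no syscalls.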
mjg
5dfa167a47 vfs: tidy up vget_finish and vn_lock
- remove assertion which duplicates vn_lock
- use VNPASS instead of retyping the failure
- report what flags were passed if panicking on them
2020-02-08 15:52:20 +00:00
mjg
a894ce38c7 vfs: remove now useless ENODEV handling from vn_fullpath consumers
Noted by:	ngie
2020-02-08 15:51:08 +00:00
kib
d5dfeba734 Correct the function name in the comment.
Sponsored by:	The FreeBSD Foundation
MFC after:	3 days
2020-02-08 15:06:06 +00:00
mjg
942c9548fc rms: use newly added zpcpu routines instead of direct access where appropriate 2020-02-07 22:44:41 +00:00
jeff
5545e52295 Fix a race in smr_advance() that could result in unnecessary poll calls.
This was relatively harmless but surprising to see in counters.  The
race occurred when rd_seq was read after the goal was updated and we
incorrectly calculated the delta between them.

Reviewed by:	rlibby
Differential Revision:	https://reviews.freebsd.org/D23464
2020-02-06 20:51:46 +00:00
jeff
f97e1b2be3 Add some global counters for SMR. These may eventually become per-smr
counters.  In my stress test there is only one poll for every 15,000
frees.  This means we are effectively amortizing the cache coherency
overhead even with very high write rates (3M/s/core).

Reviewed by:	markj, rlibby
Differential Revision:	https://reviews.freebsd.org/D23463
2020-02-06 20:10:21 +00:00
kaktus
dfcb6515fc sysctl(9): add CTLFLAG_NEEDGIANT flag
Add CTLFLAG_NEEDGIANT flag (modelled after D_NEEDGIANT) that will be used to
mark sysctls that still require locking Giant.

Rewrite sysctl_handle_string() to use internal locking instead of locking
Giant.

Mark SYSCTL_STRING, SYSCTL_OPAQUE and their variants as MPSAFE.

Add infrastructure support for enforcing proper use of CTLFLAG_NEEDGIANT
and CTLFLAG_MPSAFE flags with SYSCTL_PROC and SYSCTL_NODE, not enabled yet.

Reviewed by:	kib (mentor)
Approved by:	kib (mentor)
Differential Revision:	https://reviews.freebsd.org/D23378
2020-02-06 12:45:58 +00:00
markj
2a4d78a968 Avoid releasing object PIP in vn_sendfile() if no pages were grabbed.
sendfile(2) optionally takes a set of headers that get prepended to the
file data.  If the request length is less than that of the headers,
sendfile may not allocate an sfio structure, in which case its pointer
is null and we should be careful not to dereference.  This was
introduced in r356902.

Reported by:	syzkaller
Sponsored by:	The FreeBSD Foundation
2020-02-05 16:09:21 +00:00
luporl
478cce2129 Add SYSCTL to get KERNBASE and relocated KERNBASE
This change adds 2 new SYSCTLs, to retrieve the original and relocated KERNBASE
values. This provides an easy, architecture independent way to calculate the
running kernel displacement (current/load address minus original base address).

The initial goal for this change is to add a new libkvm function that returns
the kernel displacement, both for live kernels and crashdumps. This would in
turn be used by kgdb to find out how to relocate kernel symbols (if needed).

Reviewed by:	jhb
Differential Revision:	https://reviews.freebsd.org/D23284
2020-02-05 11:34:10 +00:00
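
The displacement arithmetic described in the commit above is simple enough
to sketch. The helper names below (kernel_displacement, relocate_symbol) are
hypothetical and are not the libkvm interface the commit anticipates; the
two base values would come from the new sysctls.

```c
#include <stdint.h>

/* Displacement: where the kernel actually runs minus where it was linked. */
static uint64_t
kernel_displacement(uint64_t relocated_base, uint64_t original_base)
{
	return (relocated_base - original_base);
}

/* Translate a link-time symbol address to its address in the running kernel. */
static uint64_t
relocate_symbol(uint64_t link_addr, uint64_t displacement)
{
	return (link_addr + displacement);
}
```

A debugger would compute the displacement once and then add it to every
symbol address it reads from the kernel image.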
mjg
adbdb89768 fd: always nullify *fdp in fget* routines
Some consumers depend on the pointer being NULL if an error is returned.

The guarantee got broken in r357469.

Reported by:	https://syzkaller.appspot.com/bug?extid=0c9b05e2b727aae21eef
Noted by:	markj
2020-02-05 00:20:26 +00:00
rlibby
d44438f9d0 uma: convert mbuf_jumbo_alloc to UMA_ZONE_CONTIG & tag others
Remove mbuf_jumbo_alloc and let large mbuf zones use the new uma default
contig allocator (a copy of mbuf_jumbo_alloc).  Tag other zones which
require contiguous objects, even if they don't use the new default
contig allocator, so that uma knows about their constraints.

Reviewed by:	jeff, markj
Sponsored by:	Dell EMC Isilon
Differential Revision:	https://reviews.freebsd.org/D23238
2020-02-04 22:40:23 +00:00
kib
55aa9af7e9 Remove unneeded assert for curproc. Simplify.
Reported by:	syzkaller by markj
Sponsored by:	The FreeBSD Foundation
2020-02-04 21:02:08 +00:00
markj
fa5ce7a2a6 Correct the malloc tag used when freeing the temporary semop(2) buffer.
MFC after:	1 week
Sponsored by:	The FreeBSD Foundation
2020-02-04 20:00:45 +00:00
dchagin
e9fa3d123f For code reuse in the Linuxulator, rename get_process_cputime()
and get_thread_cputime() and add prototypes for them to <sys/syscallsubr.h>.

As both functions become a public interface, add a process lock assert
to ensure that the process is not exiting under it.

Fix whitespace nit while here.

Reviewed by:		kib
Differential Revision:	https://reviews.freebsd.org/D23340
MFC after:		2 weeks
2020-02-04 05:25:51 +00:00
jeff
41f5785ede Implement a deferred write advancement feature that can be used to further
amortize shared cacheline writes.

Discussed with: rlibby
Differential Revision:	https://reviews.freebsd.org/D23462
2020-02-04 02:44:52 +00:00
jeff
324d06e970 Fix a recursion on the thread lock by acquiring it after calling rtp_to_pri().
Reported by:	swills
Reviewed by:	kib, markj
Differential Revision:	https://reviews.freebsd.org/D23495
2020-02-04 02:42:54 +00:00
markj
b7b7769e94 Fix the !SMP case in sched_add() after r355779.
If the thread's lock is already that of the runqueue, don't recurse on
the queue lock.

Reviewed by:	jeff, kib
Sponsored by:	The FreeBSD Foundation
Differential Revision:	https://reviews.freebsd.org/D23492
2020-02-03 22:49:05 +00:00
mjg
4fae937212 fd: partially unengrish the previous commit 2020-02-03 22:34:50 +00:00
mjg
e0daf4956a fd: streamline fget_unlocked
clang has the unfortunate property of paying little attention to prediction
hints when faced with a loop spanning the majority of the routine.

In particular fget_unlocked has an unlikely corner case where it starts almost
from scratch. Faced with this clang generates a maze of taken jumps, whereas
gcc produces jump-free code (in the expected case).

Work around the problem by providing a variant which only tries once and
resorts to calling the original code if anything goes wrong.

While here, note that the 'seq' parameter is almost never passed, thus the
rare users are redirected to call the original routine directly.
2020-02-03 22:32:49 +00:00
mjg
ecb991e675 fd: remove the seq argument from fget_unlocked
It is almost always NULL.
2020-02-03 22:27:55 +00:00
mjg
89d1f1812c fd: remove the seq argument from fget routines
It is almost always NULL.
2020-02-03 22:27:03 +00:00
mjg
21df633ceb ktrace: provide ktrstat_error
This eliminates a branch from its consumers trading it for an extra call
if ktrace is enabled for curthread. Given that this is almost never true,
the tradeoff is worth it.
2020-02-03 22:26:00 +00:00
glebius
ea2553b9e0 A couple of protocol drain routines (frag6_drain and sctp_drain) may send
packets, which is unexpected behaviour for a memory reclamation routine.
Regardless, we need to enter the network epoch to do that.
2020-02-03 20:48:57 +00:00
kevans
d4577c7d2d namei: preserve errors from fget_cap_locked
Most notably, we want to make sure we don't clobber any capabilities-related
errors. This is a regression from r357412 (O_SEARCH) that was picked up by
the capsicum tests.

PR:		243839
Reviewed by:	kib (committed form recommended by)
Tested by:	lwhsu
Differential Revision:	https://reviews.freebsd.org/D23479
2020-02-03 18:59:07 +00:00
imp
48b94864c5 Remove sparc64 kernel support
Remove all sparc64 specific files
Remove all sparc64 ifdefs
Remove indirect sparc64 ifdefs
2020-02-03 17:35:11 +00:00
mjg
763314a492 capsicum: faster cap_rights_contains
Instead of doing a 2 iteration loop (determined at runtime), take advantage
of the fact that the size is already known.

While here provide cap_check_inline so that fget_unlocked does not have to
do a function call.

Verified with the capsicum suite /usr/tests.
2020-02-03 17:08:11 +00:00
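
The loop-versus-known-size change in the commit above can be sketched as
follows. The struct and function names (struct rights, contains_loop,
contains_fixed) are hypothetical stand-ins, not the capsicum types; the
sketch only assumes a rights set made of two 64-bit words.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-in for a rights set whose size is fixed at build time. */
#define NRIGHTS	2
struct rights {
	uint64_t r[NRIGHTS];
};

/* Loop form: the trip count is a runtime value checked on each iteration. */
static bool
contains_loop(const struct rights *big, const struct rights *little, int n)
{
	assert(n <= NRIGHTS);
	for (int i = 0; i < n; i++) {
		if ((big->r[i] & little->r[i]) != little->r[i])
			return (false);
	}
	return (true);
}

/* Fixed form: both words are tested directly, with no loop bookkeeping. */
static bool
contains_fixed(const struct rights *big, const struct rights *little)
{
	return ((big->r[0] & little->r[0]) == little->r[0] &&
	    (big->r[1] & little->r[1]) == little->r[1]);
}
```

Both forms answer the same question (is every bit of little also set in
big); the fixed form simply gives the compiler nothing to iterate over.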
mjg
f8e2d90c73 fd: fix f_count acquire in fget_unlocked
The code was using a hand-rolled fcmpset loop, while in other places the same
count is manipulated with the refcount API.

This turned from a stylistic issue into a bug after the API got extended
to support flags. As a result the hand-rolled loop could bump the count high
enough to set the bit flag. Another bump + refcount_release would then free
the file prematurely.

The bug is only present in -CURRENT.
2020-02-03 14:28:31 +00:00
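
The failure mode described in the commit above can be sketched with C11
atomics. The layout is an assumption made for illustration (top bit as the
flag, low 31 bits as the count), and the names REF_FLAG, REF_COUNT,
naive_acquire and checked_acquire are hypothetical; this is not the kernel's
refcount API.

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical layout: the top bit is a flag, the low 31 bits the count. */
#define REF_FLAG	(UINT32_C(1) << 31)
#define REF_COUNT(v)	((v) & ~REF_FLAG)

/* Hand-rolled acquire: bumps the word without looking at the flag bit. */
static bool
naive_acquire(_Atomic uint32_t *cnt)
{
	uint32_t old = atomic_load(cnt);

	do {
		if (old == 0)
			return (false);	/* object is already being freed */
	} while (!atomic_compare_exchange_weak(cnt, &old, old + 1));
	return (true);
}

/* Flag-aware acquire: refuses to let the count carry into the flag bit. */
static bool
checked_acquire(_Atomic uint32_t *cnt)
{
	uint32_t old = atomic_load(cnt);

	do {
		if (REF_COUNT(old) == 0 || REF_COUNT(old) == REF_FLAG - 1)
			return (false);
	} while (!atomic_compare_exchange_weak(cnt, &old, old + 1));
	return (true);
}
```

Starting from a saturated count, naive_acquire succeeds and leaves the word
equal to REF_FLAG with a count of zero, while checked_acquire fails and
leaves the word untouched.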
mjg
8559b9f2e3 Fix up various vnode-related asserts which did not dump the used vnode 2020-02-03 14:25:32 +00:00
kevans
11e74d9fa2 Provide O_SEARCH
O_SEARCH is defined by POSIX [0] to open a directory for searching, skipping
permissions checks on the directory itself after the initial open(). This is
close to the semantics we've historically applied for O_EXEC on a directory,
which is UB according to POSIX. Conveniently, O_SEARCH on a file is also
explicitly undefined behavior according to POSIX, so O_EXEC would be a fine
choice. The spec goes on to state that O_SEARCH and O_EXEC need not be
distinct values, but they're not defined to be the same value.

This was pointed out as an incompatibility with other systems that had made
its way into libarchive, which had assumed that O_EXEC was an alias for
O_SEARCH.

This defines compatibility O_SEARCH/FSEARCH (equivalent to O_EXEC and FEXEC
respectively) and expands our UB for O_EXEC on a directory. O_EXEC on a
directory is checked in vn_open_vnode already, so for completeness we add a
NOEXECCHECK when O_SEARCH has been specified on the top-level fd and do not
re-check that when descending in namei.

[0] https://pubs.opengroup.org/onlinepubs/9699919799/

Reviewed by:	kib
Differential Revision:	https://reviews.freebsd.org/D23247
2020-02-02 16:34:57 +00:00
mjg
e98538b601 fd: sprinkle some predicts around fget
clang inlines fget -> _fget into kern_fstat and eliminates several checks,
but prior to this change it would assume fget_unlocked was likely to fail
and consequently avoidable jumps got generated.
2020-02-02 09:38:40 +00:00
mjg
ebb1f3a14f fd: use atomic_load_ptr instead of hand-rolled cast through volatile
No change in assembly.
2020-02-02 09:37:16 +00:00
mjg
94ee14c445 vfs: remove the now empty vop_unlock_post 2020-02-02 09:36:32 +00:00
mjg
609d31f8f4 cache: replace kern___getcwd with vn_getcwd
The previous routine resulted in extra data copies, most notably in
linux_getcwd.
2020-02-01 20:38:38 +00:00
mjg
3bbc775c01 cache: return the total length from vn_fullpath1
This removes strlen from getcwd.
2020-02-01 20:37:11 +00:00