This was relatively harmless but surprising to see in counters. The
race occurred when rd_seq was read after the goal was updated and we
incorrectly calculated the delta between them.
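The fixed ordering can be sketched as follows; the types and helper mirror
sys/smr.h, but the function itself is illustrative, not the committed diff:

    #include <stdint.h>

    typedef uint32_t smr_seq_t;
    typedef int32_t  smr_delta_t;

    /* Wrap-safe sequence arithmetic, as in sys/smr.h. */
    #define SMR_SEQ_DELTA(a, b)  ((smr_delta_t)((a) - (b)))

    static smr_delta_t
    poll_delta(volatile smr_seq_t *rd_seqp, volatile smr_seq_t *goalp)
    {
        smr_seq_t rd_seq, goal;

        /*
         * Snapshot rd_seq before the goal.  If the goal is read first,
         * a concurrent update can advance rd_seq past it and the
         * computed delta is nonsense.
         */
        rd_seq = *rd_seqp;
        goal = *goalp;
        return (SMR_SEQ_DELTA(goal, rd_seq));
    }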
Reviewed by: rlibby
Differential Revision: https://reviews.freebsd.org/D23464
counters. In my stress test there is only one poll for every 15,000
frees. This means we are effectively amortizing the cache coherency
overhead even with very high write rates (3M/s/core).
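The shape of the amortization, with all names hypothetical (the real logic
lives in the SMR-backed counter code):

    #define POLL_INTERVAL   1000            /* hypothetical batch size */

    static void defer_free(void *);         /* hypothetical: queue item */
    static void smr_poll_batch(void);       /* hypothetical: reclaim pass */

    static int pending;                     /* frees since the last poll */

    static void
    counter_free(void *item)
    {
        defer_free(item);
        if (++pending >= POLL_INTERVAL) {
            pending = 0;
            smr_poll_batch();               /* one poll amortized per batch */
        }
    }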
Reviewed by: markj, rlibby
Differential Revision: https://reviews.freebsd.org/D23463
Add CTLFLAG_NEEDGIANT flag (modelled after D_NEEDGIANT) that will be used to
mark sysctls that still require locking Giant.
Rewrite sysctl_handle_string() to use internal locking instead of locking
Giant.
Mark SYSCTL_STRING, SYSCTL_OPAQUE and their variants as MPSAFE.
Add infrastructure support for enforcing proper use of CTLFLAG_NEEDGIANT
and CTLFLAG_MPSAFE flags with SYSCTL_PROC and SYSCTL_NODE, not enabled yet.
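For illustration, a handler that still needs Giant would be declared along
these lines (the oid name and handler are made up):

    static int example_value;

    static int
    sysctl_example(SYSCTL_HANDLER_ARGS)
    {
        /* Handler body that still relies on Giant for its locking. */
        return (sysctl_handle_int(oidp, &example_value, 0, req));
    }
    SYSCTL_PROC(_kern, OID_AUTO, example,
        CTLTYPE_INT | CTLFLAG_RD | CTLFLAG_NEEDGIANT, NULL, 0,
        sysctl_example, "I", "illustrative Giant-locked sysctl");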
Reviewed by: kib (mentor)
Approved by: kib (mentor)
Differential Revision: https://reviews.freebsd.org/D23378
sendfile(2) optionally takes a set of headers that get prepended to the
file data. If the request length is less than that of the headers,
sendfile may not allocate an sfio structure, in which case its pointer
is null and we should be careful not to dereference it. This was
introduced in r356902.
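The fix reduces to a guard of roughly this shape (illustrative, not the
committed diff):

    /* sfio is NULL when the request does not extend past the headers. */
    if (sfio != NULL)
        sfio_complete(sfio);        /* hypothetical completion path */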
Reported by: syzkaller
Sponsored by: The FreeBSD Foundation
This change adds two new sysctls to retrieve the original and relocated KERNBASE
values. This provides an easy, architecture independent way to calculate the
running kernel displacement (current/load address minus original base address).
The initial goal for this change is to add a new libkvm function that returns
the kernel displacement, both for live kernels and crashdumps. This would in
turn be used by kgdb to find out how to relocate kernel symbols (if needed).
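A hedged sketch of the intended consumer, assuming both values are exported
as unsigned longs; the OID names below are placeholders for whatever the
change settles on:

    #include <sys/types.h>
    #include <sys/sysctl.h>
    #include <stdio.h>

    int
    main(void)
    {
        unsigned long base, relbase;
        size_t len;

        len = sizeof(base);
        if (sysctlbyname("kern.base_address", &base, &len, NULL, 0) == -1)
            return (1);
        len = sizeof(relbase);
        if (sysctlbyname("kern.relbase_address", &relbase, &len, NULL, 0) == -1)
            return (1);
        /* Displacement: load address minus original base address. */
        printf("displacement: %#lx\n", relbase - base);
        return (0);
    }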
Reviewed by: jhb
Differential Revision: https://reviews.freebsd.org/D23284
Remove mbuf_jumbo_alloc and let large mbuf zones use the new uma default
contig allocator (a copy of mbuf_jumbo_alloc). Tag other zones which
require contiguous objects, even if they don't use the new default
contig allocator, so that uma knows about their constraints.
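Tagging such a zone can be sketched as follows (zone name and size are
illustrative; UMA_ZONE_CONTIG is the constraint flag):

    static uma_zone_t contig_zone;

    static void
    contig_zone_init(void)
    {
        /* Multi-page items that must be physically contiguous. */
        contig_zone = uma_zcreate("example_contig", 16 * 1024,
            NULL, NULL, NULL, NULL, UMA_ALIGN_PTR, UMA_ZONE_CONTIG);
    }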
Reviewed by: jeff, markj
Sponsored by: Dell EMC Isilon
Differential Revision: https://reviews.freebsd.org/D23238
and get_thread_cputime() and add prototypes for them to <sys/syscallsubr.h>.
As both functions become a public interface, add a process lock assertion
to ensure that the process is not exiting underneath them.
Fix whitespace nit while here.
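The added assertion takes roughly this shape (sketch; the body is elided):

    void
    get_thread_cputime(struct thread *targettd, struct timespec *ats)
    {
        /*
         * Callers must hold the process lock so the process cannot
         * exit underneath us.
         */
        PROC_LOCK_ASSERT(targettd->td_proc, MA_OWNED);
        /* ... */
    }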
Reviewed by: kib
Differential Revision: https://reviews.freebsd.org/D23340
MFC after 2 weeks
If the thread's lock is already that of the runqueue, don't recurse on
the queue lock.
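In sketch form, with the lock name hypothetical since the exact call site is
scheduler-specific:

    /* Take the queue lock only if it is not already the thread's lock. */
    if (td->td_lock != queue_lock)
        mtx_lock_spin(queue_lock);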
Reviewed by: jeff, kib
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D23492
clang has the unfortunate property of paying little attention to prediction
hints when faced with a loop spanning the majority of the routine.
In particular fget_unlocked has an unlikely corner case where it starts almost
from scratch. Faced with this clang generates a maze of taken jumps, whereas
gcc produces jump-free code (in the expected case).
Work around the problem by providing a variant which only tries once and
resorts to calling the original code if anything goes wrong.
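The shape of the workaround, with the try-once helper hypothetical and the
fallback being the pre-existing routine:

    static int
    fget_unlocked_fast(struct filedesc *fdp, int fd, cap_rights_t *needrightsp,
        struct file **fpp)
    {
        /* One straight-line attempt, no retry loop. */
        if (__predict_true(fget_try_once(fdp, fd, needrightsp, fpp)))
            return (0);
        /* Anything unusual: punt to the full routine. */
        return (fget_unlocked(fdp, fd, needrightsp, fpp, NULL));
    }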
While here, note that the 'seq' parameter is almost never passed, so the
rare users are redirected to call it directly.
This eliminates a branch from its consumers, trading it for an extra call
if ktrace is enabled for curthread. Given that this is almost never true,
the tradeoff is worth it.
Most notably, we want to make sure we don't clobber any capabilities-related
errors. This is a regression from r357412 (O_SEARCH) that was picked up by
the capsicum tests.
PR: 243839
Reviewed by: kib (committed form recommended by)
Tested by: lwhsu
Differential Revision: https://reviews.freebsd.org/D23479
Instead of doing a 2-iteration loop (determined at runtime), take advantage
of the fact that the size is already known.
While here provide cap_check_inline so that fget_unlocked does not have to
do a function call.
Verified with the capsicum suite in /usr/tests.
The code was using a hand-rolled fcmpset loop, while in other places the same
count is manipulated with the refcount API.
This turned from a stylistic issue into a bug after the API was extended
to support flags. As a result the hand-rolled loop could bump the count high
enough to set the bit flag. Another bump + refcount_release would then free
the file prematurely.
The bug is only present in -CURRENT.
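Roughly, the change is from the raw loop to the refcount(9) KPI
(illustrative, not the exact diff):

    /*
     * Before: a hand-rolled loop, oblivious to the flag bits that
     * refcount(9) now embeds in the counter.
     */
    u_int old;

    old = fp->f_count;
    while (!atomic_fcmpset_int(&fp->f_count, &old, old + 1))
        continue;       /* 'old' is refreshed by fcmpset on failure */

    /* After: the KPI, which knows about the flag bits. */
    refcount_acquire(&fp->f_count);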
O_SEARCH is defined by POSIX [0] to open a directory for searching, skipping
permissions checks on the directory itself after the initial open(). This is
close to the semantics we've historically applied for O_EXEC on a directory,
which is UB according to POSIX. Conveniently, O_SEARCH on a file is also
explicitly undefined behavior according to POSIX, so O_EXEC would be a fine
choice. The spec goes on to state that O_SEARCH and O_EXEC need not be
distinct values, but they're not defined to be the same value.
This was pointed out as an incompatibility with other systems that had made
its way into libarchive, which had assumed that O_EXEC was an alias for
O_SEARCH.
This defines compatibility O_SEARCH/FSEARCH (equivalent to O_EXEC and FEXEC
respectively) and expands our UB for O_EXEC on a directory. O_EXEC on a
directory is checked in vn_open_vnode already, so for completeness we add a
NOEXECCHECK when O_SEARCH has been specified on the top-level fd and do not
re-check that when descending in namei.
[0] https://pubs.opengroup.org/onlinepubs/9699919799/
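The compatibility definitions then reduce to simple aliases, along these
lines (a sketch of the fcntl.h shape):

    #define O_SEARCH        O_EXEC          /* open for search only */
    #define FSEARCH         FEXEC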
Reviewed by: kib
Differential Revision: https://reviews.freebsd.org/D23247
clang inlines fget -> _fget into kern_fstat and eliminates several checks,
but prior to this change it would assume fget_unlocked was likely to fail
and consequently generate avoidable jumps.
There are 2 back-to-back atomics on the vnode, but we can check upfront if one
is sufficient. Similarly we can handle relative lookups where current working
directory == root directory.
Reviewed by: kib
Differential Revision: https://reviews.freebsd.org/D23427
inspection and after a lengthy discussion with jhb and kib. They have not
produced test failures.
Don't pointer chase through cpu0's smr. Use the CPU-correct smr even when not
in a critical section to reduce the likelihood of false sharing.
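In sketch form (zpcpu_get() resolves the calling CPU's copy of a per-CPU
allocation):

    /*
     * Resolve this CPU's smr instead of always dereferencing cpu0's,
     * so neighbouring CPUs stop pulling the same cache line.
     */
    smr = zpcpu_get(smr);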
After r355784 the td_oncpu field is no longer synchronized by the thread
lock, so the stack capture interrupt cannot be delivered precisely.
Fix this using a loop which drops the thread lock and restarts if the
wrong thread was sampled from the stack capture interrupt handler.
Change the implementation to use a regular interrupt instead of an NMI.
Now that we drop the thread lock, there is no advantage to the latter.
Simplify the KPIs. Remove stack_save_td_running() and add a return
value to stack_save_td(). On platforms that do not support stack
capture of running threads, stack_save_td() returns EOPNOTSUPP. If the
target thread is running in user mode, stack_save_td() returns EBUSY.
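Callers of the simplified KPI then look roughly like this:

    struct stack st;
    int error;

    stack_zero(&st);
    error = stack_save_td(&st, td);
    if (error == EOPNOTSUPP) {
        /* Platform cannot capture the stack of a running thread. */
    } else if (error == EBUSY) {
        /* Target thread is running in user mode. */
    }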
Reviewed by: kib
Reported by: mjg, pho
Tested by: pho
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D23355
The intent was to make it more likely to catch filesystems with custom
need_inactive routines which fail to call vn_need_pageq_flush (or do an
equivalent).
One immediate case which is missed is vgone called from inactive itself.
A better assertion may land later. The routine is not added to vputx because
it is of no use to tmpfs et al.
Reported by: syzbot+5f697ec11f89b60941db@syzkaller.appspotmail.com
This is in the same family of algorithms as Epoch/QSBR/RCU/PARSEC but is
a unique algorithm. This has 3x the performance of epoch in a write-heavy
workload with less than half of the read-side cost. The memory overhead
is significantly lessened by limiting the free-to-use latency. A synthetic
test uses 1/20th of the memory vs Epoch. There is significant further
discussion in the comments and code review.
This code should be considered experimental. I will write a man page after
it has settled. After further validation the VM will begin using this
feature to permit lockless page lookups.
Both markj and cperciva tested on arm64 at large core counts to verify
fences on weakly ordered architectures. I will commit a stress testing
tool in a follow-up.
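A minimal sketch of the read and write sides as exposed by sys/smr.h; the
smr zone variable is illustrative:

    smr_seq_t goal;

    /* Read side: per-CPU, no shared locks. */
    smr_enter(my_smr);
    /* ... lockless reads of SMR-protected structures ... */
    smr_exit(my_smr);

    /* Write side: stamp a goal, then poll until readers catch up. */
    goal = smr_advance(my_smr);
    if (!smr_poll(my_smr, goal, false)) {
        /* Readers still active; defer reclamation and retry later. */
    }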
Reviewed by: mmacy, markj, rlibby, hselasky
Discussed with: sbahara
Differential Revision: https://reviews.freebsd.org/D22586
Otherwise we risk running into a use-after-free.
In particular this codepath ends up dropping all protection before
suspending writes:
ufs_quotactl -> quotaoff_inchange -> vfs_write_suspend_umnt
Reported by: pho
ctx (and thus ctx.flags) is stack garbage at the start of this
function, so initialize ctx.flags to an explicit value instead of
using binary operations on the garbage.
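The fix is simply to start from a known value (struct and flag names
hypothetical):

    struct xfer_ctx ctx;            /* hypothetical context type */

    ctx.flags = 0;                  /* explicit value, not |= on garbage */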
Reported by: gcc9
Reviewed by: imp
Differential Revision: https://reviews.freebsd.org/D23368
With this change, holding the listmtx lock postpones dooming the vnode.
Use this fact to simplify iteration over the lazy list. It also allows
filters to safely access ->v_data.
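A hedged sketch of the iteration; the mount fields match struct mount, while
the filter callback is illustrative:

    struct vnode *vp;

    mtx_lock(&mp->mnt_listmtx);
    TAILQ_FOREACH(vp, &mp->mnt_lazyvnodelist, v_lazylist) {
        /*
         * The vnode cannot be doomed while mnt_listmtx is held,
         * so the filter may inspect vp->v_data safely.
         */
        if (!filter(vp, filter_arg))    /* hypothetical callback */
            continue;
        /* ... */
    }
    mtx_unlock(&mp->mnt_listmtx);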
Reviewed by: kib (early version)
Differential Revision: https://reviews.freebsd.org/D23397