Assert that sema[idx] allocation from sem[] is sane.
Also assert that sem_mtx is owned, it protects the SEM_ALLOC flag.
Reviewed by: markj
Tested by: pho
Sponsored by: The FreeBSD Foundation (kib)
MFC after: 1 week
Differential revision: https://reviews.freebsd.org/D23694
r357614 added CTLFLAG_NEEDGIANT to make it easier to find nodes that are
still not MPSAFE (or already are but aren’t properly marked). Use it in
preparation for a general review of all nodes.
This is non-functional change that adds annotations to SYSCTL_NODE and
SYSCTL_PROC nodes using one of the soon-to-be-required flags.
Reviewed by: kib, trasz
Approved by: kib (mentor)
Differential Revision: https://reviews.freebsd.org/D23640
At the time opt-in was introduced adding yourself as a writer was esrializing
across the mount point. Nowadays it is fully per-cpu, the only impact being
a small single-threaded hit on top of what's there right now.
Vast majority of the overhead stems from the call to VOP_GETWRITEMOUNT which
has is done regardless.
Should someone want to microoptimize this single-threaded they can coalesce
looking the mount up with adding a write to it.
bintime()/binuptime().
The algorithm to read the consistent snapshot of current timehand is
repeated in each accessor, including the details proper rollup
detection and synchronization with the writer. In fact there are only
two different kind of readers: one for bintime()/binuptime() which has
to do the in-place calculation, and another kind which fetches some
member from struct timehand.
Extract the logic into type-checked macros, GETTHBINTIME() for bintime
calculation, and GETTHMEMBER() for safe read of a structure' member.
This way, the synchronization is only written in bintime_off() and
getthmember().
In bintime_off(), use overflow-safe calculation of th_scale *
delta(timecounter). In tc_windup, pre-calculate the min delta value
which overflows and require slow algorithm, into the new timehands
th_large_delta member.
This part with overflow fix was written by Bruce Evans.
Reported by: Mark Millard <marklmi@yahoo.com> (the overflow issue)
Tested by: pho
Discussed with: emaste
Sponsored by: The FreeBSD Foundation (kib)
MFC after: 3 weeks
This in particular significantly shortens amd64_syscall, which otherwise
keeps jumping forward over 2KB of code in total.
Note some of these branches should be either eliminated altogether or
coalesced.
The latter is a typedef of the former; the typedef exists and these bits are
representing vmprot values, so use the correct type.
Submitted by: sigsys@gmail.com
MFC after: 3 days
During buildkernel there are very frequent calls to priv_check and they
all are for PRIV_VFS_GENERATION (coming from stat/fstat).
This results in branching on several potential privileges checking if
perhaps that's the one which has to be evaluated.
Instead of the kitchen-sink approach provide a way to have commonly used
privs directly evaluated.
As with e.g. getgroups and getlogin it allows querying current process
credential state.
Reported by: sigsys@gmail.com via kevans
Sponsored by: The FreeBSD Foundation
fdatasync is essentially a subset of fsync (and may be exactly fsync,
depending on filesystem and development effort) and operates only on
a provided fd.
MFC after: 1 week
Sponsored by: The FreeBSD Foundation
In particular on amd64 this eliminates an atomic op in the common case,
trading it for IPIs in the uncommon case of catching CPUs executing the
code while the filesystem is getting suspended or unmounted.
This is a wrapper around smp_rendezvous_cpus which enables use of IPI
handlers which can fail and require retrying.
wait_func argument is added to to provide a routine which can be used to
poll CPU of interest for when the IPI can be retried.
Handlers which succeed must call smp_rendezvous_cpus_done to denote that
fact.
Discussed with: jeff
Differential Revision: https://reviews.freebsd.org/D23582
When processing a taskqueue and a task has associated epoch, then
enter for duration of the task. If consecutive tasks belong to the
same epoch, batch them. Now we are talking about the network epoch
only.
Shrink the ta_priority size to 8-bits. No current consumers use
a priority that won't fit into 8 bits. Also complexity of
taskqueue_enqueue() is a square of maximum value of priority, so
we unlikely ever want to go over UCHAR_MAX here.
Reviewed by: hselasky
Differential Revision: https://reviews.freebsd.org/D23518
vdrop can set the hold count to 0 and wait for the ->mnt_listmtx held by
mnt_vnode_next_lazy_relock caller. The routine incorrectly asserted the
count has to be > 0.
Reported by: pho
Tested by: pho
The race is:
CPU1 CPU2
devfs_reclaim_vchr
make v_usecount 0
VI_LOCK
sees v_usecount == 0, no updates
vp->v_rdev = NULL;
...
VI_UNLOCK
VI_LOCK
v_decr_devcount
sees v_rdev == NULL, no updates
In this scenario si_devcount decrement is not performed.
Note this can only happen if the vnode lock is not held.
Reviewed by: kib
Tested by: pho
Differential Revision: https://reviews.freebsd.org/D23529
handler.
Interrupt handlers are removed via intr_event_execute_handlers() when
IH_DEAD is set. The thread removing the interrupt is woken up, and
calls intr_event_update(). When this happens, the ie_hflags are
cleared and re-built from all the remaining handlers sharing the
event. When the last IH_NET handler is removed, the IH_NET flag will
be cleared from ih_hflags (or ie_hflags may still be being rebuilt in
a different context), and the ithread_execute_handlers() may return
with ie_hflags missing IH_NET. This can lead to a scenario where
IH_NET was present before calling ithread_execute_handlers, and is not
present at its return, meaning the need for epoch must be cached
locally.
This can happen when loading and unloading network drivers. Also make
sure the ie_hflags is not cleared before being updated.
This is a regression issue after r357004.
Backtrace:
panic()
# trying to access epoch tracker on stack of dead thread
_epoch_enter_preempt()
ifunit_ref()
ifioctl()
fo_ioctl()
kern_ioctl()
sys_ioctl()
syscallenter()
amd64_syscall()
Differential Revision: https://reviews.freebsd.org/D23483
Reviewed by: glebius@, gallatin@, mav@, jeff@ and kib@
Sponsored by: Mellanox Technologies
vrele is supposed to be called with an unlocked vnode, but this was never
asserted for if v_usecount was > 0. For such counts the lock is never touched
by the routine. As a result the kernel has several consumers which expect
vunref semantics and get away with calling vrele since they happen to never do
it when this is the last reference (and for some of them this may happen to be
a guarantee).
Work around the problem by changing vrele semantics to tolerate being called
with a lock. This eliminates a possible bug where the lock is already held and
vputx takes it anyway.
Reviewed by: kib
Tested by: pho
Differential Revision: https://reviews.freebsd.org/D23528
The intent is to provide bsd-specific flags relevant to interpreter
and C runtime. I did not want to reuse AT_FLAGS which is common ELF
auxv entry.
Use bsdflags to report kernel support for sigfastblock(2). This
allows rtld and libthr to safely infer the syscall presence without
SIGSYS. The tunable kern.elf{32,64}.sigfastblock blocks reporting.
Tested by: pho
Disscussed with: cem, emaste, jilles
Sponsored by: The FreeBSD Foundation
Differential revision: https://reviews.freebsd.org/D12773
A new syscall sigfastblock(2) is added which registers a uint32_t
variable as containing the count of blocks for signal delivery. Its
content is read by kernel on each syscall entry and on AST processing,
non-zero count of blocks is interpreted same as the signal mask
blocking all signals.
The biggest downside of the feature that I see is that memory
corruption that affects the registered fast sigblock location, would
cause quite strange application misbehavior. For instance, the process
would be immune to ^C (but killable by SIGKILL).
With consumers (rtld and libthr added), benchmarks do not show a
slow-down of the syscalls in micro-measurements, and macro benchmarks
like buildworld do not demonstrate a difference. Part of the reason is
that buildworld time is dominated by compiler, and clang already links
to libthr. On the other hand, small utilities typically used by shell
scripts have the total number of syscalls cut by half.
The syscall is not exported from the stable libc version namespace on
purpose. It is intended to be used only by our C runtime
implementation internals.
Tested by: pho
Disscussed with: cem, emaste, jilles
Sponsored by: The FreeBSD Foundation
Differential revision: https://reviews.freebsd.org/D12773
This was relatively harmless but surprising to see in counters. The
race occurred when rd_seq was read after the goal was updated and we
incorrectly calculated the delta between them.
Reviewed by: rlibby
Differential Revision: https://reviews.freebsd.org/D23464
counters. In my stress test there is only one poll for every 15,000
frees. This means we are effectively amortizing the cache coherency
overhead even with very high write rates (3M/s/core).
Reviewed by: markj, rlibby
Differential Revision: https://reviews.freebsd.org/D23463
Add CTLFLAG_NEEDGIANT flag (modelled after D_NEEDGIANT) that will be used to
mark sysctls that still require locking Giant.
Rewrite sysctl_handle_string() to use internal locking instead of locking
Giant.
Mark SYSCTL_STRING, SYSCTL_OPAQUE and their variants as MPSAFE.
Add infrastructure support for enforcing proper use of CTLFLAG_NEEDGIANT
and CTLFLAG_MPSAFE flags with SYSCTL_PROC and SYSCTL_NODE, not enabled yet.
Reviewed by: kib (mentor)
Approved by: kib (mentor)
Differential Revision: https://reviews.freebsd.org/D23378
sendfile(2) optionally takes a set of headers that get prepended to the
file data. If the request length is less than that of the headers,
sendfile may not allocate an sfio structure, in which case its pointer
is null and we should be careful not to dereference. This was
introduced in r356902.
Reported by: syzkaller
Sponsored by: The FreeBSD Foundation
This change adds 2 new SYSCTLs, to retrieve the original and relocated KERNBASE
values. This provides an easy, architecture independent way to calculate the
running kernel displacement (current/load address minus original base address).
The initial goal for this change is to add a new libkvm function that returns
the kernel displacement, both for live kernels and crashdumps. This would in
turn be used by kgdb to find out how to relocate kernel symbols (if needed).
Reviewed by: jhb
Differential Revision: https://reviews.freebsd.org/D23284
Remove mbuf_jumbo_alloc and let large mbuf zones use the new uma default
contig allocator (a copy of mbuf_jumbo_alloc). Tag other zones which
require contiguous objects, even if they don't use the new default
contig allocator, so that uma knows about their constraints.
Reviewed by: jeff, markj
Sponsored by: Dell EMC Isilon
Differential Revision: https://reviews.freebsd.org/D23238
and get_thread_cputime() and add prototypes for it to <sys/syscallsubr.h>.
As both functions become a public interface add process lock assert
to ensure that the process is not exiting under it.
Fix whitespace nit while here.
Reviewed by: kib
Differential Revision: https://reviews.freebsd.org/D23340
MFC after 2 weeks
If the thread's lock is already that of the runqueue, don't recurse on
the queue lock.
Reviewed by: jeff, kib
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D23492
clang has the unfortunate property of paying little attention to prediction
hints when faced with a loop spanning the majority of the rotuine.
In particular fget_unlocked has an unlikely corner case where it starts almost
from scratch. Faced with this clang generates a maze of taken jumps, whereas
gcc produces jump-free code (in the expected case).
Work around the problem by providing a variant which only tries once and
resorts to calling the original code if anything goes wrong.
While here note that the 'seq' parameter is almost never passed, thus the
seldom users are redirected to call it directly.
This eliminates a branch from its consumers trading it for an extra call
if ktrace is enabled for curthread. Given that this is almost never true,
the tradeoff is worth it.
Most notably, we want to make sure we don't clobber any capabilities-related
errors. This is a regression from r357412 (O_SEARCH) that was picked up by
the capsicum tests.
PR: 243839
Reviewed by: kib (committed form recommended by)
Tested by: lwhsu
Differential Revision: https://reviews.freebsd.org/D23479
Instead of doing a 2 iteration loop (determined at runeimt), take advantage
of the fact that the size is already known.
While here provdie cap_check_inline so that fget_unlocked does not have to
do a function call.
Verified with the capsicum suite /usr/tests.
The code was using a hand-rolled fcmpset loop, while in other places the same
count is manipulated with the refcount API.
This transferred from a stylistic issue into a bug after the API got extended
to support flags. As a result the hand-rolled loop could bump the count high
enough to set the bit flag. Another bump + refcount_release would then free
the file prematurely.
The bug is only present in -CURRENT.
O_SEARCH is defined by POSIX [0] to open a directory for searching, skipping
permissions checks on the directory itself after the initial open(). This is
close to the semantics we've historically applied for O_EXEC on a directory,
which is UB according to POSIX. Conveniently, O_SEARCH on a file is also
explicitly undefined behavior according to POSIX, so O_EXEC would be a fine
choice. The spec goes on to state that O_SEARCH and O_EXEC need not be
distinct values, but they're not defined to be the same value.
This was pointed out as an incompatibility with other systems that had made
its way into libarchive, which had assumed that O_EXEC was an alias for
O_SEARCH.
This defines compatibility O_SEARCH/FSEARCH (equivalent to O_EXEC and FEXEC
respectively) and expands our UB for O_EXEC on a directory. O_EXEC on a
directory is checked in vn_open_vnode already, so for completeness we add a
NOEXECCHECK when O_SEARCH has been specified on the top-level fd and do not
re-check that when descending in namei.
[0] https://pubs.opengroup.org/onlinepubs/9699919799/
Reviewed by: kib
Differential Revision: https://reviews.freebsd.org/D23247
clang inlines fget -> _fget into kern_fstat and eliminates several checkes,
but prior to this change it would assume fget_unlocked was likely to fail
and consequently avoidable jumps got generated.
There are 2 back-to-back atomics on the vnode, but we can check upfront if one
is sufficient. Similarly we can handle relative lookups where current working
directory == root directory.
Reviewed by: kib
Differential Revision: https://reviews.freebsd.org/D23427
inspection and after a lengthy discussion with jhb and kib. They have not
produced test failures.
Don't pointer chase through cpu0's smr. Use cpu correct smr even when not
in a critical section to reduce the likelihood of false sharing.
After r355784 the td_oncpu field is no longer synchronized by the thread
lock, so the stack capture interrupt cannot be delievered precisely.
Fix this using a loop which drops the thread lock and restarts if the
wrong thread was sampled from the stack capture interrupt handler.
Change the implementation to use a regular interrupt instead of an NMI.
Now that we drop the thread lock, there is no advantage to the latter.
Simplify the KPIs. Remove stack_save_td_running() and add a return
value to stack_save_td(). On platforms that do not support stack
capture of running threads, stack_save_td() returns EOPNOTSUPP. If the
target thread is running in user mode, stack_save_td() returns EBUSY.
Reviewed by: kib
Reported by: mjg, pho
Tested by: pho
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D23355
The intent was to make it more likely to catch filesystems with custom
need_inactive routines which fail to call vn_need_pageq_flush (or do an
equivalent).
One immediate case which is missed is vgone from called by inactive itself.
A better assertion may land later. The routine is not added to vputx because
it is of no use to tmpfs et al.
Reported by: syzbot+5f697ec11f89b60941db@syzkaller.appspotmail.com
This is in the same family of algorithms as Epoch/QSBR/RCU/PARSEC but is
a unique algorithm. This has 3x the performance of epoch in a write heavy
workload with less than half of the read side cost. The memory overhead
is significantly lessened by limiting the free-to-use latency. A synthetic
test uses 1/20th of the memory vs Epoch. There is significant further
discussion in the comments and code review.
This code should be considered experimental. I will write a man page after
it has settled. After further validation the VM will begin using this
feature to permit lockless page lookups.
Both markj and cperciva tested on arm64 at large core counts to verify
fences on weaker ordering architectures. I will commit a stress testing
tool in a follow-up.
Reviewed by: mmacy, markj, rlibby, hselasky
Discussed with: sbahara
Differential Revision: https://reviews.freebsd.org/D22586
Otherwise we risk running into use-after-free.
In particular this codepath ends up dropping all protection before
suspending writes:
ufs_quotactl -> quotaoff_inchange -> vfs_write_suspend_umnt
Reported by: pho
ctx (and thus ctx.flags) is stack garbage at the start of this
function, so initialize ctx.flags to an explicit value instead of
using binary operations on the garbage.
Reported by: gcc9
Reviewed by: imp
Differential Revision: https://reviews.freebsd.org/D23368
With this change having the listmtx lock held postpones dooming the vnode.
Use this fact to simplify iteration over the lazy list. It also allows
filters to safely access ->v_data.
Reviewed by: kib (early version)
Differential Revision: https://reviews.freebsd.org/D23397
These were all introduced in the initial import of hwpstate_intel(4).
Reported by: Coverity
CIDs: 1413161, 1413164, 1413165, 1413167
X-MFC-With: r357002
In r110908 (2003) alfred added DFLAG_PASSABLE to tag those types of FD
that can be passed via unix pipes, but mqueuefs didn't exist
yet. Later, in r152825 (2005) davidxu neglected to include
DFLAG_PASSABLE since people don't normally pass these things via unix
sockets (it's a FreeBSD implementation detail that it's a file
descriptor, nobody noticed). Then r223866 (2011) by jonathan used the
new flag in fdcopy, which fork uses. Due to that, mqueuefs actually
broke mqueue objects being propagated by fork. No mention of mqueuefs
was made in r223866, so I think it was an unintended consequence.
Fix this by tagging mqueuefs as passable as well. They were prior to
alfred's change (and it's clear there's no intent in his change to
change this behavior), and POSIX requires this to be the case as well.
PR: 243103
Reviewed by: kib@, jiles@
Differential Revision: https://reviews.freebsd.org/D23038
These should not be any functional change. While the change in
emul10kx-pcm.c looks like a real bug fix (as opposed to inconsistent
whitespace), the extra statements were not harmful.
Reviewed by: kib
Sponsored by: DARPA
Differential Revision: https://reviews.freebsd.org/D23363
vdbatch_process leaves the critical section too early, openign a time
window where another thread can get scheduled and modify vd->freevnodes.
Once it the preempted thread gets back it overrides the value with 0.
Just move critical_exit to the end of the function.
The existing AF_UNIX socket garbage collector destroys any socket
which may potentially be in a cycle, as indicated by its file reference
count being equal to its enqueue count. However, this can produce false
positives for in-flight sockets which aren't part of a cycle but are
part of one or more SCM_RIGHTS mssages and which have been closed
on the sending side. If the garbage collector happens to run at
exactly the wrong time, destruction of these sockets will render them
unusable on the receiving side, such that no previously-written data
may be read.
This change rewrites the garbage collector to precisely detect cycles:
1. The existing check of msgcount==f_count is still used to determine
whether the socket is potentially in a cycle.
2. The socket is now placed on a local "dead list", which is used to
reduce iteration time (and therefore contention on the global
unp_link_rwlock).
3. The first pass through the dead list removes each potentially-dead
socket's outgoing references from the graph of potentially-dead
sockets, using a gc-specific copy of the original reference count.
4. The second series of passes through the dead list removes from the
list any socket whose remaining gc refcount is non-zero, as this
indicates the socket is actually accessible outside of any possible
cycle. Iteration is repeated until no further sockets are removed
from the dead list.
5. Sockets remaining in the dead list are destroyed as before.
PR: 227285
Submitted by: jan.kokemueller@gmail.com (prior version)
Reviewed by: markj
Differential Revision: https://reviews.freebsd.org/D23142
There is nothing to do but to bump the count even during said transition.
There are 2 places which can do it:
- vget only does this after locking the vnode, meaning there is no change in
contract versus inactive or reclamantion
- vref only ever did it with the interlock held which did not protect against
either (that is, it would always succeed)
VCHR vnodes retain special casing due to the need to maintain dev use count.
Reviewed by: jeff, kib
Tested by: pho (previous version)
Differential Revision: https://reviews.freebsd.org/D23185
vget is almost always called with LK_SHARED, meaning the flag (if present) is
almost guaranteed to get cleared. Stop handling it in the first place and
instead let the thread which wanted to do inactive handle the bumepd usecount.
Reviewed by: jeff
Tested by: pho
Differential Revision: https://reviews.freebsd.org/D23184
Doing so runs into races with filesystems which make half-constructed vnodes
visible to other users, while depending on the chain vput -> vinactive ->
vrecycle to be executed without dropping the vnode lock.
Impediments for making this work got cleared up (notably vop_unlock_post now
does not do anything and lockmgr stops touching the lock after the final
write). Stacked filesystems keep vhold/vdrop across unlock, which arguably can
now be eliminated.
Reviewed by: jeff
Differential Revision: https://reviews.freebsd.org/D23344
This evens it up with other locking primitives.
Note lock profiling still touches the lock, which again is in line with the
rest.
Reviewed by: jeff
Differential Revision: https://reviews.freebsd.org/D23343
After r355784 we no longer hold a thread's thread lock when switching it
out. Preserve the previous synchronization protocol for td_oncpu by
setting it together with td_state, before dropping the thread lock
during a switch.
Reported and tested by: pho
Reviewed by: kib
Discussed with: jeff
Differential Revision: https://reviews.freebsd.org/D23270
it. The introduction of lockless switch in r355784 created a race to
re-use the exiting thread that was only possible to hit on a hypervisor.
Reported/Tested by: rlibby
Discussed with: rlibby, jhb
Intel Speed Shift is Intel's technology to control frequency in hardware,
with hints from software.
Let's get a working version of this in the tree and we can refine it from
here.
Submitted by: bwidawsk, scottph
Reviewed by: bcr (manpages), myself
Discussed with: jhb, kib (earlier versions)
With feedback from: Greg V, gallatin, freebsdnewbie AT freenet.de
Relnotes: yes
Differential Revision: https://reviews.freebsd.org/D18028
The vnode pager does not want the object lock held. Moving this out allows
further object lock scope reduction in callers. While here add some missing
paging in progress calls and an assert. The object handle is now protected
explicitly with pip.
Reviewed by: kib, markj
Differential Revision: https://reviews.freebsd.org/D23033
Since r356672 ("vfs: rework vnode list management") there is nothing to do
apart from altering freevnodes count, but this much can be safely done based
on the result of atomic_fetchadd.
Reviewed by: kib
Tested by: pho
Differential Revision: https://reviews.freebsd.org/D23186
r355473 vastly improved the readability and cleanliness of these Makefiles.
Every single one of them follows the same pattern and duplicates the exact
same logic.
Now that we have GENERATED/SRCS, split SRCS up into the two parameters we'll
use for ${MAKESYSCALLS} rather than assuming a specific ordering of SRCS and
include a common sysent.mk to handle the rest. This makes it less tedious to
make sweeping changes.
Some default values are provided for GENERATED/SYSENT_*; almost all of these
just use a 'syscalls.master' and 'syscalls.conf' in cwd, and they all use
effectively the same filenames with an arbitrary prefix. Most ABIs will be
able to get away with just setting GENERATED_PREFIX and including
^/sys/conf/sysent.mk, while others only need light additions. kern/Makefile
is the notable exception, as it doesn't take a SYSENT_CONF and the generated
files are spread out between ^/sys/kern and ^/sys/sys, but it otherwise fits
the pattern enough to use the common version.
Reviewed by: brooks, imp
Nice!: emaste
Differential Revision: https://reviews.freebsd.org/D23197
It gets rolled up to the global when deferred requeueing is performed.
A dedicated read routine makes sure to return a value only off by a certain
amount.
This soothes a global serialisation point for all 0<->1 hold count transitions.
Reviewed by: jeff
Differential Revision: https://reviews.freebsd.org/D23235
Prior to introduction of this op libc's readdir would call fstatfs(2), in
effect unnecessarily copying kilobytes of data just to check fs name and a
mount flag.
Reviewed by: kib (previous version)
Differential Revision: https://reviews.freebsd.org/D23162
The vnode list lock is only needed to reclaim free vnodes or kick the vnlru
thread (or to block and not miss a wake up (but note the sleep has a timeout so
this would not be a correctness issue)). Try to get away without the lock by
just doing an atomic increment.
The lock is contended e.g., during poudriere -j 104 where about half of all
acquires come from vnode allocation code.
Note the entire scheme needs a rewrite, the above just reduces it's SMP impact.
Reviewed by: kib
Differential Revision: https://reviews.freebsd.org/D23140
Semantics are almost identical. Some code is deduplicated and there are
fewer memory accesses.
Reviewed by: kib, jeff
Differential Revision: https://reviews.freebsd.org/D23158
Take advantage of global ordering introduced in r356672.
Reviewed by: mckusick (previous version)
Differential Revision: https://reviews.freebsd.org/D23067
ordering to allocate early pages in the same way boot pages were but only
as needed. After the KVA allocator has started up we allocate the KVA that
we consumed during boot. This also makes the boot pages freeable since they
have vm_page structures allocated with the rest of memory.
Parts of this patch were written and tested by markj.
Reviewed by: glebius, markj
Differential Revision: https://reviews.freebsd.org/D23102
mount point while numerous tests are running that are writing to
files on that mount point cause the unmount(8) to hang forever.
The unmount(8) system call is handled in the kernel by the dounmount()
function. The cause of the hang is that prior to dounmount() calling
VFS_UNMOUNT() it is calling VFS_SYNC(mp, MNT_WAIT). The MNT_WAIT
flag indicates that VFS_SYNC() should not return until all the dirty
buffers associated with the mount point have been written to disk.
Because user processes are allowed to continue writing and can do
so faster than the data can be written to disk, the call to VFS_SYNC()
can never finish.
Unlike VFS_SYNC(), the VFS_UNMOUNT() routine can suspend all processes
when they request to do a write thus having a finite number of dirty
buffers to write that cannot be expanded. There is no need to call
VFS_SYNC() before calling VFS_UNMOUNT(), because VFS_UNMOUNT() needs
to flush everything again anyway after suspending writes, to catch
anything that was dirtied between the VFS_SYNC() and writes being
suspended.
The fix is to simply remove the unnecessary call to VFS_SYNC() from
dounmount().
Reported by: Peter Holm
Analysis by: Chuck Silvers
Tested by: Peter Holm
MFC after: 7 days
Sponsored by: Netflix
subsystems tend to need to know about it, and including if_var.h is
huge header pollution for them. Polluting possible non-network
users with single symbol seems much lesser evil.
- Remove non-preemptible network epoch. Not used yet, and unlikely
to get used in close future.
The only reason to vlazy there is to (overzealously) ensure all vnodes
which need to be visited by msync scan can be found there.
In particluar this is of no use zfs and tmpfs.
While here depessimize the check.
Remove assumptions about the minimum MINALLOCSIZE, in order to allow
testing of smaller MINALLOCSIZE. A following patch will lower the
MINALLOCSIZE, but not so much that the present patch is required for
correctness at these sites.
Reviewed by: jeff, markj
Sponsored by: Dell EMC Isilon
Constant requeuing adds significant lock contention in certain
workloads. Lessen the problem by batching it.
Per-cpu areas are locked in order to synchronize against UMA freeing
memory.
vnode's v_mflag is converted to short to prevent the struct from
growing.
Sample result from an incremental make -s -j 104 bzImage on tmpfs:
stock: 122.38s user 1780.45s system 6242% cpu 30.480 total
patched: 144.84s user 985.90s system 4856% cpu 23.282 total
Reviewed by: jeff
Tested by: pho (in a larger patch, previous version)
Differential Revision: https://reviews.freebsd.org/D22998
The current notion of an active vnode is eliminated.
Vnodes transition between 0<->1 hold counts all the time and the
associated traversal between different lists induces significant
scalability problems in certain workloads.
Introduce a global list containing all allocated vnodes. They get
unlinked only when UMA reclaims memory and are only requeued when
hold count reaches 0.
Sample result from an incremental make -s -j 104 bzImage on tmpfs:
stock: 118.55s user 3649.73s system 7479% cpu 50.382 total
patched: 122.38s user 1780.45s system 6242% cpu 30.480 total
Reviewed by: jeff
Tested by: pho (in a larger patch, previous version)
Differential Revision: https://reviews.freebsd.org/D22997
This obviates the need to scan the entire active list looking for vnodes
of interest.
msync is handled by adding all vnodes with write count to the lazy list.
deferred inactive directly adds vnodes as it sets the VI_DEFINACT flag.
Vnodes get dequeued from the list when their hold count reaches 0.
Newly added MNT_VNODE_FOREACH_LAZY* macros support filtering so that
spurious locking is avoided in the common case.
Reviewed by: jeff
Tested by: pho (in a larger patch, previous version)
Differential Revision: https://reviews.freebsd.org/D22995
Treat it as a synonym for GRND_NONBLOCK. The reasoning is this:
We have two choices for handling Linux's GRND_INSECURE API flag.
1. We could ignore it completely (like GRND_RANDOM). However, this might
produce the surprising result of GRND_INSECURE requests blocking, when the
Linux API does not block.
2. Alternatively, we could treat GRND_INSECURE requests as requests for
GRND_NONBLOCk. Here, the surprising result for Linux programs is that
invocations with unseeded random(4) will produce EAGAIN, rather than
garbage.
Honoring the flag in the way Linux does seems fraught. If we actually use
the output of a random(4) implementation prior to seeding, we leak some
entropy (in an information theory and also practical sense) from what will
be the initial seed to attackers (or allow attackers to arbitrary DoS
initial seeding, if we don't leak). This seems unacceptable -- it defeats
the purpose of blocking on initial seeding.
Secondary to that concern, before seeding we may have arbitrarily little
entropy collected; producing output from zero or a handful of entropy bits
does not seem particularly useful to userspace.
If userspace can accept garbage, insecure, non-random bytes, they can create
their own insecure garbage with srandom(time(NULL)) or similar. Any program
which would be satisfied with a 3-bit key CTR stream has no need for CSPRNG
bytes. So asking the kernel to produce such an output from the secure
getrandom(2) API seems inane.
For now, we've elected to emulate GRND_INSECURE as an alternative spelling
of GRND_NONBLOCK (2). Consider this API not-quite stable for now. We
guarantee it will never block. But we will attempt to monitor actual port
uptake of this bizarre API and may revise our plans for the unseeded
behavior (prior stable/13 branching).
Approved by: csprng(markm), manpages(bcr)
See also: https://lwn.net/ml/linux-kernel/cover.1577088521.git.luto@kernel.org/
See also: https://lwn.net/ml/linux-kernel/20200107204400.GH3619@mit.edu/
Differential Revision: https://reviews.freebsd.org/D23130
This creates a dedicated routine (vn_alloc) to allocate vnodes.
As a side effect code duplicationw with getnewvnode_reserve is eleminated.
Add vn_free for symmetry.
Having a reserved vnode count does not guarantee that getnewvnodes wont
block later. Said blocking partially defeats the purpose of reserving in
the first place.
Preallocate instaed. The only consumer was always passing "1" as count
and never nesting reservations.
When either makesyscalls.lua or syscalls.master changes, all of the
${GENERATED} targets are now out-of-date. With make jobs > 1, this means we
will run the makesyscalls script in parallel for the same ABI, generating
the same set of output files.
Prior to r356603 , there is a large window for interlacing output for some
of the generated files that we were generating in-place rather than staging
in a temp dir. After that, we still should't need to run the script more
than once per-ABI as the first invocation should update all of them. Add
.ORDER to do so cleanly.
Reviewed by: brooks
Discussed with: sjg
Differential Revision: https://reviews.freebsd.org/D23099
Otherwise the malloc type accounting in malloc_domainset(9) is wrong
after r355203.
Reviewed by: rlibby
Reported by: kaktus
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D23095
Other mechanisms that resize the shmfd grab a write lock from 0 to OFF_MAX
for safety, so we still get proper synchronization of shmfd->shm_size in
effect. There's no need to block readers/writers of earlier segments when
we're just reserving more space, so narrow the scope -- it would likely be
safe to narrow it completely to just the section of the range that extends
beyond our current size, but this likely isn't worth it since the size isn't
stable until the writelock is granted the first time.
Suggested by: cem (passing comment)
Linux expects to be able to use posix_fallocate(2) on a memfd. Other places
would use this with shm_open(2) to act as a smarter ftruncate(2).
Test has been added to go along with this.
Reviewed by: kib (earlier version)
Differential Revision: https://reviews.freebsd.org/D23042
This opens the door for other descriptor types to implement
posix_fallocate(2) as needed.
Reviewed by: kib, bcr (manpages)
Differential Revision: https://reviews.freebsd.org/D23042
vgone dooms the vnode while keeping VI_OWEINACT set and then drops the
interlock.
vputx can pick up the interlock and pass it to vdefer_inactive since the
flag is set.
The race is harmless, just don't defer anything as vgone will take care of it.
Reported by: pho
The previous behavior of leaving VI_OWEINACT vnodes on the active list without
a hold count is eliminated. Hold count is kept and inactive processing gets
explicitly deferred by setting the VI_DEFINACT flag. The syncer is then
responsible for vdrop.
Reviewed by: kib (previous version)
Tested by: pho (in a larger patch, previous version)
Differential Revision: https://reviews.freebsd.org/D23036
- use LK_NOWAIT instead of calling VOP_ISLOCKED before deciding to lock
- evaluate flags before looping over vnodes
Reviewed by: kib
Tested by: pho (in a larger patch, previous version)
Differential Revision: https://reviews.freebsd.org/D23035
Otherwise in code like this:
if (numvnodes > desiredvnodes)
vnlru_free_locked(numvnodes - desiredvnodes, NULL);
numvnodes can drop below desiredvnodes prior to the call and if the
compiler generated another read the subtraction would get a negative
value.
There was only one consumer and it was using it incorrectly.
It is given an equivalent hack.
Reviewed by: jeff
Differential Revision: https://reviews.freebsd.org/D23037
r136999 introduced SYSTCL_DEBUG but apparently "opt_sysctl.h" was never
included making the option ignored.
r322954 introduced sysctl.reuse_test with OID number equal to 0, effectively
shadowing the very special sysctl.debug one. Use OID_AUTO as it doesn't need
any special treatment.
Reviewed by: kib (mentor)
Approved by: kib (mentor)
Differential Revision: https://reviews.freebsd.org/D23056
When file sealing and shm_open2 were introduced, we should have grown a new
kern_shm_open2 helper that did the brunt of the work with the new interface
while kern_shm_open remains the same. Instead, more complexity was
introduced to kern_shm_open to handle the additional features and consumers
had to keep changing in somewhat awkward ways, and a kern_shm_open2 was
added to wrap kern_shm_open.
Backpedal on this and correct the situation- kern_shm_open returns to the
interface it had prior to file sealing being introduced, and neither
function needs an initial_seals argument anymore as it's handled in
kern_shm_open2 based on the shmflags.
If a write seal is set on a shared mapping, we must exclude VM_PROT_WRITE as
the fd is effectively read-only. This was discovered by running
devel/linux-ltp, which mmap's with acceptable protections specified then
attempts to raise to PROT_READ|PROT_WRITE with mprotect(2), which we
allowed.
Reviewed by: kib
Differential Revision: https://reviews.freebsd.org/D22978
between populating buckets from the slab layer and fetching full buckets
from the zone layer. Eliminate some nonsense locking patterns where
we lock to fetch a single variable.
Reviewed by: markj
Differential Revision: https://reviews.freebsd.org/D22828
Filesystems which want to use it in limited capacity can employ the
VOP_UNLOCK_FLAGS macro.
Reviewed by: kib (previous version)
Differential Revision: https://reviews.freebsd.org/D21427
This is a lock-based emulation of 64-bit atomics for kernel use, split off
from an earlier patch by jhibbits.
This is needed to unblock future improvements that reduce the need for
locking on 64-bit platforms by using atomic updates.
The implementation allows for future integration with userland atomic64,
but as that implies going through sysarch for every use, the current
status quo of userland doing its own locking may be for the best.
Submitted by: jhibbits (original patch), kevans (mips bits)
Reviewed by: jhibbits, jeff, kevans
Differential Revision: https://reviews.freebsd.org/D22976
and make it usable outside of kern_umtx.c. To be used in several
future changes.
Discussed with: markj
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
r23081 introduced kern.dummy oid as a semi ABI compat for kern.maxsockbuf
that was moved to a new namespace. It never functioned as an alias of any
kind and was just returning 0 unconditionally, hence it was probably
provided to keep some 3rd party programmes happy about sysctl(3) not
reporting an error because of non-existing oid.
After nearly 23 years it seems reasonable to just hide it from sysctl(8)
list not to cause unnecessary confusion as for its purpose.
Reported by: Antranig Vartanian <antranigv@freebsd.am>
Reviewed by: kib (mentor)
Approved by: kib (mentor)
Differential Revision: https://reviews.freebsd.org/D22982
Amount of changes to the original code has been intentionally minimised
to ease diffing.
The changes are mostly mechanical, with the following exceptions:
* lltable handler is now called directly based of RTF_LLINFO flag presense.
* "report" logic for updating rtm in RTM_GET/RTM_DELETE has been simplified,
fixing several potential use-after-free cases in rt_addrinfo.
* llable asserts has been replaced with error-returning, preventing kernel
crashes when lltable gw af family is invalid (root required).
MFC after: 2 weeks
Differential Revision: https://reviews.freebsd.org/D22864
Combined with earlier nstart/nend removal it allows to remove several locks
from request path of GEOM and few other places. It would be cool if we had
more SMP-friendly statistics, but this helps too.
Sponsored by: iXsystems, Inc.
With the previous reviews, the page lock is no longer required in order
to perform queue operations on a page. It is also no longer needed in
the page queue scans. This change effectively eliminates remaining uses
of the page lock and also the false sharing caused by multiple pages
sharing a page lock.
Reviewed by: jeff
Tested by: pho
Sponsored by: Netflix, Intel
Differential Revision: https://reviews.freebsd.org/D22885
Summary:
r356113 used an older patch, which predated the
freebsd_copyout_auxargs() addition. Fix this by using a private
powerpc_copyout_auxargs() instead, and keep it private to powerpc, not in MI
files.
Reviewed by: kib, bdragon
Differential Revision: https://reviews.freebsd.org/D22935
1. The only place in the tree which calls getnewvnode with mp == NULL does it
for vp_crossmp which will never execute this codepath. Any vnode which legally
has ->v_mount == NULL is also doomed, which once more wont execute this code.
2. Remove an assertion for v_holdcnt from production kernels. It gets taken care
of by refcount macros in debug kernels.
Any code which would want to pass NULL mp can construct a fake one instead.
Reviewed by: kib (previous version)
Differential Revision: https://reviews.freebsd.org/D22722
To be used when like rmlocks, except when sleeping for readers needs to be
allowed. See the manpage for more information.
Reviewed by: kib (previous version)
Differential Revision: https://reviews.freebsd.org/D22823
Summary:
As a transition aide, implement an alternative elfN_freebsd_fixup which
is called for old powerpc binaries. Similarly, add a translation to rtld to
convert old values to new ones (as expected by a new rtld).
Translation of old<->new values is incomplete, but sufficient to allow an
installworld of a new userspace from an old one when a new kernel is running.
Test Plan:
Someone needs to see how a new kernel/rtld/libc works with an old
binary. If if works we can probalby ship this. If not we probalby need
some more compat bits.
Submitted by: brooks
Reviewed by: jhibbits
Differential Revision: https://reviews.freebsd.org/D20799
srandom(9) is meaningless on SMP systems or any system with, say,
interrupts. One could never rely on random(9) to produce a reproducible
sequence of outputs on the basis of a specific srandom() seed because the
global state was shared by all kernel contexts. As such, removing it is
literally indistinguishable to random(9) consumers (as compared with
retaining it).
Mark random(9) as deprecated and slated for quick removal. This is not to
say we intend to remove all fast, non-cryptographic PRNG(s) in the kernel.
It/they just won't be random(9), as it exists today, in either name or
implementation.
Before random(9) is removed, a replacement will be provided and in-tree
consumers will be converted.
Note that despite the name, the random(9) interface does not bear any
resemblance to random(3). Instead, it is the same crummy 1988 Park-Miller
LCG used in libc rand(3).
A weak symbol here is decidedly cleaner than any #ifdef soup or relocating
kbdinit, the former leading to maintenance required on addition of any
console/keyboard drivers and the latter pushing kbd init bits away from
where they're used.
This leads to the revert of r355806; this reduces duplication in keyboard
registration and driver switch lookup and leaves us with one authoritative
source for currently registered drivers. The reduced duplication later is
nice as we have more procedure involved in keyboard setup.
keyboard_driver->flags is used to more quickly detect bogus adds/removes.
From KPI consumers' perspective, nothing changes- kbd_add_driver of an
already-registered driver will succeed, and a single kbd_delete_driver will
later remove it as expected. In contrast to historical behavior,
kbd_delete_driver on a driver registered via linker set will now actually
de-register the driver so that it may not be used -- e.g. if kbdmux's
MOD_LOAD handler fails somewhere.
Detection for already-registered drivers in kbd_add_driver has improved, as
the previous SLIST_NEXT(driver) != NULL check would not have caught a driver
that's at the tail end.
kbdinit is now called from cninit() rather than via SYSINIT so that keyboard
drivers are available as early as console drivers. This is particularly
important as cnprobe will, in both syscons and vt, attempt to do any early
configuration of keyboard drivers built-in (see: kbd_configure).
Reviewed by: imp (earlier version, pre-cninit change)
Differential Revision: https://reviews.freebsd.org/D22835
_sleep(9), wakeup(9), sleepqueue(9), et al do not dereference or modify the
channel pointers provided in any way; they are merely used as intptrs into a
dictionary structure to match waiters with wakers. Correctly annotate this
such that _sleep() and wakeup() may be used on const pointers without
invoking ugly patterns like __DECONST(). Plumb const through all of the
underlying sleepqueue bits.
No functional change.
Reviewed by: rlibby
Discussed with: kib, markj
Differential Revision: https://reviews.freebsd.org/D22914
Due to clang and LLD's tendency to use a PLT for builtins, and as they
don't have full support for EABI, we sometimes have to deal with a PLT in
.ko files in a clang-built kernel.
As such, augment the in-kernel linker to support jump table processing.
As there is no particular reason to support lazy binding in kernel modules,
only implement Secure-PLT immediate binding.
As part of these changes, add elf_cpu_parse_dynamic() to the MD API of the
in-kernel linker (except on platforms that use raw object files.)
The new function will allow MD code to act on MD tags in _DYNAMIC.
Use this new function in the PowerPC MD code to ensure BSS-PLT modules using
PLT will be rejected during insertion, and to poison the runtime resolver to
ensure we get a clear panic reason if a call is made to the resolver.
Reviewed by: jhibbits
Differential Revision: https://reviews.freebsd.org/D22608
It is UB to evaluate pointer comparisons when pointers do not point within
the same object. Instead, convert the pointers to numbers and compare the
numbers.
Reported by: kib
Discussed with: rlibby
Previously just ensuring that we do not sleep when clustering for
md(4) vnode was enough. Now, with the switch of the pbuf allocator to
uma and completely broken per-subsystem pbuf limits, it might cause
unbounded sleep even for non-md(4) vnodes.
Reported and tested by: pho
Reviewed by: markj
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
Differential revision: https://reviews.freebsd.org/D22899
If the devmap entry uses the upper 32 bits they wouldn't be printed in
devmap_dump_table(). This fixes that.
Submitted by: Nicholas O'Brien <nickisobrien_gmail.com>
Sponsored by: Axiado
We by definition cannot trace the stack of such a thread. Also remove a
redundant stack_zero() call in the SIGINFO handler, the stack structure
is cleared by the MD stack_capture().
Sponsored by: The FreeBSD Foundation
Now that it is not used after schedlock changes got merged.
Note the unlock routine temporarily still checks for it on account of just using
regular spin unlock.
This is a prelude towards a general clean up.
Don't hold the scheduler lock while doing context switches. Instead we
unlock after selecting the new thread and switch within a spinlock
section leaving interrupts and preemption disabled to prevent local
concurrency. This means that mi_switch() is entered with the thread
locked but returns without. This dramatically simplifies scheduler
locking because we will not hold the schedlock while spinning on
blocked lock in switch.
This change has not been made to 4BSD but in principle it would be
more straightforward.
Discussed with: markj
Reviewed by: kib
Tested by: pho
Differential Revision: https://reviews.freebsd.org/D22778
Eliminate lock recursion from turnstiles. This was simply used to avoid
tracking the top-level turnstile lock. explicitly check for it before
picking up and dropping locks.
Reviewed by: kib
Tested by: pho
Differential Revision: https://reviews.freebsd.org/D22746
Do all sleepqueue post-processing in sleepq_remove_thread() so that we
do not require the thread lock after a context switch.
Reviewed by: jhb, kib
Differential Revision: https://reviews.freebsd.org/D22745
The arm kernel stack unwinder has apparently never been able to unwind when
the path of execution leads through a kernel module. There was code that
tried to handle modules by looking for the unwind data in them, but it did
so by trying to find symbols which have never existed in arm kernel
modules. That caused the unwind code to panic, and because part of panic
handling calls into the unwind code, that just created a recursion loop.
Locating the unwind data in a loaded module requires accessing the Elf
section headers to find the SHT_ARM_EXIDX section. For preloaded modules
those headers are present in a metadata blob. For dynamically loaded
modules, the headers are present only while the loading is in progress; the
memory is freed once the module is ready to use. For that reason, there is
new code in kern/link_elf.c, wrapped in #ifdef __arm__, to extract the
unwind info while the headers are loaded. The values are saved into new
fields in the linker_file structure which are also conditional on __arm__.
In arm/unwind.c there is new code to locally cache the per-module info
needed to find the unwind tables. The local cache is crafted for lockless
read access, because the unwind code often needs to run in context where
sleeping is not allowed. A large comment block describes the local cache
list, so I won't repeat it all here.
Eliminate recursion from most thread_lock consumers. Return from
sched_add() without the thread_lock held. This eliminates unnecessary
atomics and lock word loads as well as reducing the hold time for
scheduler locks. This will eventually allow for lockless remote adds.
Discussed with: kib
Reviewed by: jhb
Tested by: pho
Differential Revision: https://reviews.freebsd.org/D22626
an exclusive object lock.
Previously swap space was freed on a best effort basis when a page that
had valid swap was dirtied, thus invalidating the swap copy. This may be
done inconsistently and requires the object lock which is not always
convenient.
Instead, track when swap space is present. The first dirty is responsible
for deleting space or setting PGA_SWAP_FREE which will trigger background
scans to free the swap space.
Simplify the locking in vm_fault_dirty() now that we can reliably identify
the first dirty.
Discussed with: alc, kib, markj
Differential Revision: https://reviews.freebsd.org/D22654
exec_map_first_page(). This will also enable pagein clustering for other
interested consumers (tmpfs, md, etc).
Discussed with: alc
Approved by: kib
Differential Revision: https://reviews.freebsd.org/D22731
bits, by storing and modifying the complement of the original leaf
mask, and by avoiding some unnecessary intermediate variables in
computing the shift amounts. The logic is similar to what has recently
been committed to sys/sys/bitstring.h.
Compute better hint updates for the case when the cursor starts in
mid-leaf, and eliminates some otherwise viable solutions. Assume the
worst case, that all the eliminated offsets could have been solutions,
and you can still compute a better hint than we use now.
Eliminate some unnecessary conditional control flow.
Approved by: alc
Tested by: pho
Differential Revision: https://reviews.freebsd.org/D22666
Delay the attachment of children, when requested, until after interrutps are
running. This is often needed to allow children to run transactions on i2c or
spi busses. It's a common enough idiom that it will be useful to have its own
wrapper.
Reviewed by: ian
Differential Revision: https://reviews.freebsd.org/D21465
Allocate the callout structure on-demand from
fail_point_use_timeout_path() since most fail points do not use
timeouts.
Reviewed by: markj (earlier version), cem
Differential Revision: https://reviews.freebsd.org/D22599
s/BIT_NAND/BIT_ANDNOT/, and for CPU and DOMAINSET too. The actual
implementation is "and not" (or "but not"), i.e. A but not B.
Fortunately this does appear to be what all existing callers want.
Don't supply a NAND (not (A and B)) operation at this time.
Discussed with: jeff
Reviewed by: cem
Sponsored by: Dell EMC Isilon
Differential Revision: https://reviews.freebsd.org/D22791
The simulation cannot be reproduced, so the value of using a deterministic PRNG
like random(3) is dubious. The number of repitions used in the sample isn't a
problem for the Chacha implementation of arc4random we have today. (Also, no
one actually runs this code; it was provided as an example of the work the
author did validating the implementation. It's not even test code.)
r355677 modified the NFS client so that it does lseek(SEEK_DATA/SEEK_HOLE)
for NFSv4.2, but calls vop_stdioctl() otherwise. As such, vop_stdioctl()
needs to be a global function.
Missed during the code merge for r355677.
via sys_sync(2). Minor cleanup, no functional changes.
Reviewed by: kib
MFC after: 2 weeks
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D19366
This fixes a regression after r355311. Specifically, sched_preempt()
may trigger a context switch by calling thread_lock(), since
thread_lock() calls critical_exit() in its slow path and the interrupted
thread may have already been marked for preemption. This would happen
before tdq_ipipending is cleared, blocking further preemption IPIs. The
CPU can be left in this state indefinitely if the interrupted thread
migrates.
Rename tdq_ipipending to tdq_owepreempt. Any switch satisfies a remote
preemption request, so clear tdq_owepreempt in sched_switch() instead of
sched_preempt() to avoid subtle problems of the sort described above.
Reviewed by: jeff, kib
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D22758
Both of these features are not needed by many consumers and result in avoidable
reads which in turn puts them on profiles due to cache-line ping ponging.
On top of that the current lockgmr entry point is slower than necessary
single-threaded. As an attempted clean up preparing for other changes,
provide new routines which don't support any of the aforementioned features.
With these patches in place vop_stdlock and vop_stdunlock disappear from
flamegraphs during -j 104 buildkernel.
Reviewed by: jeff (previous version)
Tested by: pho
Differential Revision: https://reviews.freebsd.org/D22665
A system call number should be at least reserved.
We do not expect an attempt to register a fixed number system call
when nothing at all is known about it.
MFC after: 3 weeks
Sponsored by: Panzura
This typedef is the same as timeout_t except that it is in the callout
namespace and header.
Use this typedef in various places of the callout implementation that
were either using the raw type or timeout_t.
While here, add <sys/callout.h> to the manpage.
Reviewed by: kib, imp
MFC after: 1 month
Differential Revision: https://reviews.freebsd.org/D22751
Partially revert r354741 and r354754 and go back to allocating a
fixed-size chunk of stack space for the auxiliary vector. Keep
sv_copyout_auxargs but change it to accept the address at the end of
the environment vector as an input stack address and no longer
allocate room on the stack. It is now called at the end of
copyout_strings after the argv and environment vectors have been
copied out.
This should fix a regression in r354754 that broke the stack alignment
for newer Linux amd64 binaries (and probably broke Linux arm64 as
well).
Reviewed by: kib
Tested on: amd64 (native, linux64 (only linux-base-c7), and i386)
Sponsored by: DARPA
Differential Revision: https://reviews.freebsd.org/D22695
The current vnode layout is not smp-friendly by having frequently read data
avoidably sharing cachelines with very frequently modified fields. In
particular v_iflag inspected for VI_DOOMED can be found in the same line with
v_usecount. Instead make it available in the same cacheline as the v_op, v_data
and v_type which all get read all the time.
v_type is avoidably 4 bytes while the necessary data will easily fit in 1.
Shrinking it frees up 3 bytes, 2 of which get used here to introduce a new
flag field with a new value: VIRF_DOOMED.
Reviewed by: kib, jeff
Differential Revision: https://reviews.freebsd.org/D22715
1. replace hand-rolled macros for operation type with enum
2. unlock the vnode in vput itself, there is no need to branch on it. existence
of VPUTX_VPUT remains significant in that the inactive variant adds LK_NOWAIT
to locking request.
3. remove the useless v_usecount assertion. few lines above the checks if
v_usecount > 0 and leaves. should the value be negative, refcount would fail.
4. the CTR return vnode %p to the freelist is incorrect as vdrop may find the
vnode with holdcnt > 1. if the like should exist, it should be moved there
5. no need to error = 0 for everyone
Reviewed by: kib, jeff (previous version)
Differential Revision: https://reviews.freebsd.org/D22718
As mandated by POSIX. Also clarify the kill(2) manpage.
While there, restructure the code in killpg1() to use helper which
keeps overall state of the process list iteration in the killpg1_ctx
structued, later used to infer the error returned.
Reported by: amdmi3
Reviewed by: jilles
Sponsored by: The FreeBSD Foundation
MFC after: 2 weeks
Differential revision: https://reviews.freebsd.org/D22621
Use the power of variable to avoid spelling out source and generated
files too many times. The previous Makefiles were hard to read, hard to
edit, and badly formatted.
Reviewed by: kevans, emaste
Sponsored by: DARPA, AFRL
Differential Revision: https://reviews.freebsd.org/D22714