The intent is to provide bsd-specific flags relevant to interpreter
and C runtime. I did not want to reuse AT_FLAGS which is common ELF
auxv entry.
Use bsdflags to report kernel support for sigfastblock(2). This
allows rtld and libthr to safely infer the syscall presence without
SIGSYS. The tunable kern.elf{32,64}.sigfastblock blocks reporting.
Tested by: pho
Disscussed with: cem, emaste, jilles
Sponsored by: The FreeBSD Foundation
Differential revision: https://reviews.freebsd.org/D12773
A new syscall sigfastblock(2) is added which registers a uint32_t
variable as containing the count of blocks for signal delivery. Its
content is read by kernel on each syscall entry and on AST processing,
non-zero count of blocks is interpreted same as the signal mask
blocking all signals.
The biggest downside of the feature that I see is that memory
corruption that affects the registered fast sigblock location, would
cause quite strange application misbehavior. For instance, the process
would be immune to ^C (but killable by SIGKILL).
With consumers (rtld and libthr added), benchmarks do not show a
slow-down of the syscalls in micro-measurements, and macro benchmarks
like buildworld do not demonstrate a difference. Part of the reason is
that buildworld time is dominated by compiler, and clang already links
to libthr. On the other hand, small utilities typically used by shell
scripts have the total number of syscalls cut by half.
The syscall is not exported from the stable libc version namespace on
purpose. It is intended to be used only by our C runtime
implementation internals.
Tested by: pho
Disscussed with: cem, emaste, jilles
Sponsored by: The FreeBSD Foundation
Differential revision: https://reviews.freebsd.org/D12773
This was relatively harmless but surprising to see in counters. The
race occurred when rd_seq was read after the goal was updated and we
incorrectly calculated the delta between them.
Reviewed by: rlibby
Differential Revision: https://reviews.freebsd.org/D23464
counters. In my stress test there is only one poll for every 15,000
frees. This means we are effectively amortizing the cache coherency
overhead even with very high write rates (3M/s/core).
Reviewed by: markj, rlibby
Differential Revision: https://reviews.freebsd.org/D23463
Add CTLFLAG_NEEDGIANT flag (modelled after D_NEEDGIANT) that will be used to
mark sysctls that still require locking Giant.
Rewrite sysctl_handle_string() to use internal locking instead of locking
Giant.
Mark SYSCTL_STRING, SYSCTL_OPAQUE and their variants as MPSAFE.
Add infrastructure support for enforcing proper use of CTLFLAG_NEEDGIANT
and CTLFLAG_MPSAFE flags with SYSCTL_PROC and SYSCTL_NODE, not enabled yet.
Reviewed by: kib (mentor)
Approved by: kib (mentor)
Differential Revision: https://reviews.freebsd.org/D23378
sendfile(2) optionally takes a set of headers that get prepended to the
file data. If the request length is less than that of the headers,
sendfile may not allocate an sfio structure, in which case its pointer
is null and we should be careful not to dereference. This was
introduced in r356902.
Reported by: syzkaller
Sponsored by: The FreeBSD Foundation
This change adds 2 new SYSCTLs, to retrieve the original and relocated KERNBASE
values. This provides an easy, architecture independent way to calculate the
running kernel displacement (current/load address minus original base address).
The initial goal for this change is to add a new libkvm function that returns
the kernel displacement, both for live kernels and crashdumps. This would in
turn be used by kgdb to find out how to relocate kernel symbols (if needed).
Reviewed by: jhb
Differential Revision: https://reviews.freebsd.org/D23284
Remove mbuf_jumbo_alloc and let large mbuf zones use the new uma default
contig allocator (a copy of mbuf_jumbo_alloc). Tag other zones which
require contiguous objects, even if they don't use the new default
contig allocator, so that uma knows about their constraints.
Reviewed by: jeff, markj
Sponsored by: Dell EMC Isilon
Differential Revision: https://reviews.freebsd.org/D23238
and get_thread_cputime() and add prototypes for it to <sys/syscallsubr.h>.
As both functions become a public interface add process lock assert
to ensure that the process is not exiting under it.
Fix whitespace nit while here.
Reviewed by: kib
Differential Revision: https://reviews.freebsd.org/D23340
MFC after 2 weeks
If the thread's lock is already that of the runqueue, don't recurse on
the queue lock.
Reviewed by: jeff, kib
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D23492
clang has the unfortunate property of paying little attention to prediction
hints when faced with a loop spanning the majority of the rotuine.
In particular fget_unlocked has an unlikely corner case where it starts almost
from scratch. Faced with this clang generates a maze of taken jumps, whereas
gcc produces jump-free code (in the expected case).
Work around the problem by providing a variant which only tries once and
resorts to calling the original code if anything goes wrong.
While here note that the 'seq' parameter is almost never passed, thus the
seldom users are redirected to call it directly.
This eliminates a branch from its consumers trading it for an extra call
if ktrace is enabled for curthread. Given that this is almost never true,
the tradeoff is worth it.
Most notably, we want to make sure we don't clobber any capabilities-related
errors. This is a regression from r357412 (O_SEARCH) that was picked up by
the capsicum tests.
PR: 243839
Reviewed by: kib (committed form recommended by)
Tested by: lwhsu
Differential Revision: https://reviews.freebsd.org/D23479
Instead of doing a 2 iteration loop (determined at runeimt), take advantage
of the fact that the size is already known.
While here provdie cap_check_inline so that fget_unlocked does not have to
do a function call.
Verified with the capsicum suite /usr/tests.
The code was using a hand-rolled fcmpset loop, while in other places the same
count is manipulated with the refcount API.
This transferred from a stylistic issue into a bug after the API got extended
to support flags. As a result the hand-rolled loop could bump the count high
enough to set the bit flag. Another bump + refcount_release would then free
the file prematurely.
The bug is only present in -CURRENT.
O_SEARCH is defined by POSIX [0] to open a directory for searching, skipping
permissions checks on the directory itself after the initial open(). This is
close to the semantics we've historically applied for O_EXEC on a directory,
which is UB according to POSIX. Conveniently, O_SEARCH on a file is also
explicitly undefined behavior according to POSIX, so O_EXEC would be a fine
choice. The spec goes on to state that O_SEARCH and O_EXEC need not be
distinct values, but they're not defined to be the same value.
This was pointed out as an incompatibility with other systems that had made
its way into libarchive, which had assumed that O_EXEC was an alias for
O_SEARCH.
This defines compatibility O_SEARCH/FSEARCH (equivalent to O_EXEC and FEXEC
respectively) and expands our UB for O_EXEC on a directory. O_EXEC on a
directory is checked in vn_open_vnode already, so for completeness we add a
NOEXECCHECK when O_SEARCH has been specified on the top-level fd and do not
re-check that when descending in namei.
[0] https://pubs.opengroup.org/onlinepubs/9699919799/
Reviewed by: kib
Differential Revision: https://reviews.freebsd.org/D23247
clang inlines fget -> _fget into kern_fstat and eliminates several checkes,
but prior to this change it would assume fget_unlocked was likely to fail
and consequently avoidable jumps got generated.
There are 2 back-to-back atomics on the vnode, but we can check upfront if one
is sufficient. Similarly we can handle relative lookups where current working
directory == root directory.
Reviewed by: kib
Differential Revision: https://reviews.freebsd.org/D23427
inspection and after a lengthy discussion with jhb and kib. They have not
produced test failures.
Don't pointer chase through cpu0's smr. Use cpu correct smr even when not
in a critical section to reduce the likelihood of false sharing.
After r355784 the td_oncpu field is no longer synchronized by the thread
lock, so the stack capture interrupt cannot be delievered precisely.
Fix this using a loop which drops the thread lock and restarts if the
wrong thread was sampled from the stack capture interrupt handler.
Change the implementation to use a regular interrupt instead of an NMI.
Now that we drop the thread lock, there is no advantage to the latter.
Simplify the KPIs. Remove stack_save_td_running() and add a return
value to stack_save_td(). On platforms that do not support stack
capture of running threads, stack_save_td() returns EOPNOTSUPP. If the
target thread is running in user mode, stack_save_td() returns EBUSY.
Reviewed by: kib
Reported by: mjg, pho
Tested by: pho
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D23355
The intent was to make it more likely to catch filesystems with custom
need_inactive routines which fail to call vn_need_pageq_flush (or do an
equivalent).
One immediate case which is missed is vgone from called by inactive itself.
A better assertion may land later. The routine is not added to vputx because
it is of no use to tmpfs et al.
Reported by: syzbot+5f697ec11f89b60941db@syzkaller.appspotmail.com
This is in the same family of algorithms as Epoch/QSBR/RCU/PARSEC but is
a unique algorithm. This has 3x the performance of epoch in a write heavy
workload with less than half of the read side cost. The memory overhead
is significantly lessened by limiting the free-to-use latency. A synthetic
test uses 1/20th of the memory vs Epoch. There is significant further
discussion in the comments and code review.
This code should be considered experimental. I will write a man page after
it has settled. After further validation the VM will begin using this
feature to permit lockless page lookups.
Both markj and cperciva tested on arm64 at large core counts to verify
fences on weaker ordering architectures. I will commit a stress testing
tool in a follow-up.
Reviewed by: mmacy, markj, rlibby, hselasky
Discussed with: sbahara
Differential Revision: https://reviews.freebsd.org/D22586
Otherwise we risk running into use-after-free.
In particular this codepath ends up dropping all protection before
suspending writes:
ufs_quotactl -> quotaoff_inchange -> vfs_write_suspend_umnt
Reported by: pho
ctx (and thus ctx.flags) is stack garbage at the start of this
function, so initialize ctx.flags to an explicit value instead of
using binary operations on the garbage.
Reported by: gcc9
Reviewed by: imp
Differential Revision: https://reviews.freebsd.org/D23368
With this change having the listmtx lock held postpones dooming the vnode.
Use this fact to simplify iteration over the lazy list. It also allows
filters to safely access ->v_data.
Reviewed by: kib (early version)
Differential Revision: https://reviews.freebsd.org/D23397
These were all introduced in the initial import of hwpstate_intel(4).
Reported by: Coverity
CIDs: 1413161, 1413164, 1413165, 1413167
X-MFC-With: r357002
In r110908 (2003) alfred added DFLAG_PASSABLE to tag those types of FD
that can be passed via unix pipes, but mqueuefs didn't exist
yet. Later, in r152825 (2005) davidxu neglected to include
DFLAG_PASSABLE since people don't normally pass these things via unix
sockets (it's a FreeBSD implementation detail that it's a file
descriptor, nobody noticed). Then r223866 (2011) by jonathan used the
new flag in fdcopy, which fork uses. Due to that, mqueuefs actually
broke mqueue objects being propagated by fork. No mention of mqueuefs
was made in r223866, so I think it was an unintended consequence.
Fix this by tagging mqueuefs as passable as well. They were prior to
alfred's change (and it's clear there's no intent in his change to
change this behavior), and POSIX requires this to be the case as well.
PR: 243103
Reviewed by: kib@, jiles@
Differential Revision: https://reviews.freebsd.org/D23038
These should not be any functional change. While the change in
emul10kx-pcm.c looks like a real bug fix (as opposed to inconsistent
whitespace), the extra statements were not harmful.
Reviewed by: kib
Sponsored by: DARPA
Differential Revision: https://reviews.freebsd.org/D23363
vdbatch_process leaves the critical section too early, openign a time
window where another thread can get scheduled and modify vd->freevnodes.
Once it the preempted thread gets back it overrides the value with 0.
Just move critical_exit to the end of the function.
The existing AF_UNIX socket garbage collector destroys any socket
which may potentially be in a cycle, as indicated by its file reference
count being equal to its enqueue count. However, this can produce false
positives for in-flight sockets which aren't part of a cycle but are
part of one or more SCM_RIGHTS mssages and which have been closed
on the sending side. If the garbage collector happens to run at
exactly the wrong time, destruction of these sockets will render them
unusable on the receiving side, such that no previously-written data
may be read.
This change rewrites the garbage collector to precisely detect cycles:
1. The existing check of msgcount==f_count is still used to determine
whether the socket is potentially in a cycle.
2. The socket is now placed on a local "dead list", which is used to
reduce iteration time (and therefore contention on the global
unp_link_rwlock).
3. The first pass through the dead list removes each potentially-dead
socket's outgoing references from the graph of potentially-dead
sockets, using a gc-specific copy of the original reference count.
4. The second series of passes through the dead list removes from the
list any socket whose remaining gc refcount is non-zero, as this
indicates the socket is actually accessible outside of any possible
cycle. Iteration is repeated until no further sockets are removed
from the dead list.
5. Sockets remaining in the dead list are destroyed as before.
PR: 227285
Submitted by: jan.kokemueller@gmail.com (prior version)
Reviewed by: markj
Differential Revision: https://reviews.freebsd.org/D23142
There is nothing to do but to bump the count even during said transition.
There are 2 places which can do it:
- vget only does this after locking the vnode, meaning there is no change in
contract versus inactive or reclamantion
- vref only ever did it with the interlock held which did not protect against
either (that is, it would always succeed)
VCHR vnodes retain special casing due to the need to maintain dev use count.
Reviewed by: jeff, kib
Tested by: pho (previous version)
Differential Revision: https://reviews.freebsd.org/D23185
vget is almost always called with LK_SHARED, meaning the flag (if present) is
almost guaranteed to get cleared. Stop handling it in the first place and
instead let the thread which wanted to do inactive handle the bumepd usecount.
Reviewed by: jeff
Tested by: pho
Differential Revision: https://reviews.freebsd.org/D23184
Doing so runs into races with filesystems which make half-constructed vnodes
visible to other users, while depending on the chain vput -> vinactive ->
vrecycle to be executed without dropping the vnode lock.
Impediments for making this work got cleared up (notably vop_unlock_post now
does not do anything and lockmgr stops touching the lock after the final
write). Stacked filesystems keep vhold/vdrop across unlock, which arguably can
now be eliminated.
Reviewed by: jeff
Differential Revision: https://reviews.freebsd.org/D23344
This evens it up with other locking primitives.
Note lock profiling still touches the lock, which again is in line with the
rest.
Reviewed by: jeff
Differential Revision: https://reviews.freebsd.org/D23343
After r355784 we no longer hold a thread's thread lock when switching it
out. Preserve the previous synchronization protocol for td_oncpu by
setting it together with td_state, before dropping the thread lock
during a switch.
Reported and tested by: pho
Reviewed by: kib
Discussed with: jeff
Differential Revision: https://reviews.freebsd.org/D23270
it. The introduction of lockless switch in r355784 created a race to
re-use the exiting thread that was only possible to hit on a hypervisor.
Reported/Tested by: rlibby
Discussed with: rlibby, jhb
Intel Speed Shift is Intel's technology to control frequency in hardware,
with hints from software.
Let's get a working version of this in the tree and we can refine it from
here.
Submitted by: bwidawsk, scottph
Reviewed by: bcr (manpages), myself
Discussed with: jhb, kib (earlier versions)
With feedback from: Greg V, gallatin, freebsdnewbie AT freenet.de
Relnotes: yes
Differential Revision: https://reviews.freebsd.org/D18028
The vnode pager does not want the object lock held. Moving this out allows
further object lock scope reduction in callers. While here add some missing
paging in progress calls and an assert. The object handle is now protected
explicitly with pip.
Reviewed by: kib, markj
Differential Revision: https://reviews.freebsd.org/D23033
Since r356672 ("vfs: rework vnode list management") there is nothing to do
apart from altering freevnodes count, but this much can be safely done based
on the result of atomic_fetchadd.
Reviewed by: kib
Tested by: pho
Differential Revision: https://reviews.freebsd.org/D23186
r355473 vastly improved the readability and cleanliness of these Makefiles.
Every single one of them follows the same pattern and duplicates the exact
same logic.
Now that we have GENERATED/SRCS, split SRCS up into the two parameters we'll
use for ${MAKESYSCALLS} rather than assuming a specific ordering of SRCS and
include a common sysent.mk to handle the rest. This makes it less tedious to
make sweeping changes.
Some default values are provided for GENERATED/SYSENT_*; almost all of these
just use a 'syscalls.master' and 'syscalls.conf' in cwd, and they all use
effectively the same filenames with an arbitrary prefix. Most ABIs will be
able to get away with just setting GENERATED_PREFIX and including
^/sys/conf/sysent.mk, while others only need light additions. kern/Makefile
is the notable exception, as it doesn't take a SYSENT_CONF and the generated
files are spread out between ^/sys/kern and ^/sys/sys, but it otherwise fits
the pattern enough to use the common version.
Reviewed by: brooks, imp
Nice!: emaste
Differential Revision: https://reviews.freebsd.org/D23197
It gets rolled up to the global when deferred requeueing is performed.
A dedicated read routine makes sure to return a value only off by a certain
amount.
This soothes a global serialisation point for all 0<->1 hold count transitions.
Reviewed by: jeff
Differential Revision: https://reviews.freebsd.org/D23235
Prior to introduction of this op libc's readdir would call fstatfs(2), in
effect unnecessarily copying kilobytes of data just to check fs name and a
mount flag.
Reviewed by: kib (previous version)
Differential Revision: https://reviews.freebsd.org/D23162
The vnode list lock is only needed to reclaim free vnodes or kick the vnlru
thread (or to block and not miss a wake up (but note the sleep has a timeout so
this would not be a correctness issue)). Try to get away without the lock by
just doing an atomic increment.
The lock is contended e.g., during poudriere -j 104 where about half of all
acquires come from vnode allocation code.
Note the entire scheme needs a rewrite, the above just reduces it's SMP impact.
Reviewed by: kib
Differential Revision: https://reviews.freebsd.org/D23140
Semantics are almost identical. Some code is deduplicated and there are
fewer memory accesses.
Reviewed by: kib, jeff
Differential Revision: https://reviews.freebsd.org/D23158
Take advantage of global ordering introduced in r356672.
Reviewed by: mckusick (previous version)
Differential Revision: https://reviews.freebsd.org/D23067
ordering to allocate early pages in the same way boot pages were but only
as needed. After the KVA allocator has started up we allocate the KVA that
we consumed during boot. This also makes the boot pages freeable since they
have vm_page structures allocated with the rest of memory.
Parts of this patch were written and tested by markj.
Reviewed by: glebius, markj
Differential Revision: https://reviews.freebsd.org/D23102
mount point while numerous tests are running that are writing to
files on that mount point cause the unmount(8) to hang forever.
The unmount(8) system call is handled in the kernel by the dounmount()
function. The cause of the hang is that prior to dounmount() calling
VFS_UNMOUNT() it is calling VFS_SYNC(mp, MNT_WAIT). The MNT_WAIT
flag indicates that VFS_SYNC() should not return until all the dirty
buffers associated with the mount point have been written to disk.
Because user processes are allowed to continue writing and can do
so faster than the data can be written to disk, the call to VFS_SYNC()
can never finish.
Unlike VFS_SYNC(), the VFS_UNMOUNT() routine can suspend all processes
when they request to do a write thus having a finite number of dirty
buffers to write that cannot be expanded. There is no need to call
VFS_SYNC() before calling VFS_UNMOUNT(), because VFS_UNMOUNT() needs
to flush everything again anyway after suspending writes, to catch
anything that was dirtied between the VFS_SYNC() and writes being
suspended.
The fix is to simply remove the unnecessary call to VFS_SYNC() from
dounmount().
Reported by: Peter Holm
Analysis by: Chuck Silvers
Tested by: Peter Holm
MFC after: 7 days
Sponsored by: Netflix
subsystems tend to need to know about it, and including if_var.h is
huge header pollution for them. Polluting possible non-network
users with single symbol seems much lesser evil.
- Remove non-preemptible network epoch. Not used yet, and unlikely
to get used in close future.
The only reason to vlazy there is to (overzealously) ensure all vnodes
which need to be visited by msync scan can be found there.
In particluar this is of no use zfs and tmpfs.
While here depessimize the check.
Remove assumptions about the minimum MINALLOCSIZE, in order to allow
testing of smaller MINALLOCSIZE. A following patch will lower the
MINALLOCSIZE, but not so much that the present patch is required for
correctness at these sites.
Reviewed by: jeff, markj
Sponsored by: Dell EMC Isilon
Constant requeuing adds significant lock contention in certain
workloads. Lessen the problem by batching it.
Per-cpu areas are locked in order to synchronize against UMA freeing
memory.
vnode's v_mflag is converted to short to prevent the struct from
growing.
Sample result from an incremental make -s -j 104 bzImage on tmpfs:
stock: 122.38s user 1780.45s system 6242% cpu 30.480 total
patched: 144.84s user 985.90s system 4856% cpu 23.282 total
Reviewed by: jeff
Tested by: pho (in a larger patch, previous version)
Differential Revision: https://reviews.freebsd.org/D22998
The current notion of an active vnode is eliminated.
Vnodes transition between 0<->1 hold counts all the time and the
associated traversal between different lists induces significant
scalability problems in certain workloads.
Introduce a global list containing all allocated vnodes. They get
unlinked only when UMA reclaims memory and are only requeued when
hold count reaches 0.
Sample result from an incremental make -s -j 104 bzImage on tmpfs:
stock: 118.55s user 3649.73s system 7479% cpu 50.382 total
patched: 122.38s user 1780.45s system 6242% cpu 30.480 total
Reviewed by: jeff
Tested by: pho (in a larger patch, previous version)
Differential Revision: https://reviews.freebsd.org/D22997
This obviates the need to scan the entire active list looking for vnodes
of interest.
msync is handled by adding all vnodes with write count to the lazy list.
deferred inactive directly adds vnodes as it sets the VI_DEFINACT flag.
Vnodes get dequeued from the list when their hold count reaches 0.
Newly added MNT_VNODE_FOREACH_LAZY* macros support filtering so that
spurious locking is avoided in the common case.
Reviewed by: jeff
Tested by: pho (in a larger patch, previous version)
Differential Revision: https://reviews.freebsd.org/D22995
Treat it as a synonym for GRND_NONBLOCK. The reasoning is this:
We have two choices for handling Linux's GRND_INSECURE API flag.
1. We could ignore it completely (like GRND_RANDOM). However, this might
produce the surprising result of GRND_INSECURE requests blocking, when the
Linux API does not block.
2. Alternatively, we could treat GRND_INSECURE requests as requests for
GRND_NONBLOCk. Here, the surprising result for Linux programs is that
invocations with unseeded random(4) will produce EAGAIN, rather than
garbage.
Honoring the flag in the way Linux does seems fraught. If we actually use
the output of a random(4) implementation prior to seeding, we leak some
entropy (in an information theory and also practical sense) from what will
be the initial seed to attackers (or allow attackers to arbitrary DoS
initial seeding, if we don't leak). This seems unacceptable -- it defeats
the purpose of blocking on initial seeding.
Secondary to that concern, before seeding we may have arbitrarily little
entropy collected; producing output from zero or a handful of entropy bits
does not seem particularly useful to userspace.
If userspace can accept garbage, insecure, non-random bytes, they can create
their own insecure garbage with srandom(time(NULL)) or similar. Any program
which would be satisfied with a 3-bit key CTR stream has no need for CSPRNG
bytes. So asking the kernel to produce such an output from the secure
getrandom(2) API seems inane.
For now, we've elected to emulate GRND_INSECURE as an alternative spelling
of GRND_NONBLOCK (2). Consider this API not-quite stable for now. We
guarantee it will never block. But we will attempt to monitor actual port
uptake of this bizarre API and may revise our plans for the unseeded
behavior (prior stable/13 branching).
Approved by: csprng(markm), manpages(bcr)
See also: https://lwn.net/ml/linux-kernel/cover.1577088521.git.luto@kernel.org/
See also: https://lwn.net/ml/linux-kernel/20200107204400.GH3619@mit.edu/
Differential Revision: https://reviews.freebsd.org/D23130
This creates a dedicated routine (vn_alloc) to allocate vnodes.
As a side effect code duplicationw with getnewvnode_reserve is eleminated.
Add vn_free for symmetry.
Having a reserved vnode count does not guarantee that getnewvnodes wont
block later. Said blocking partially defeats the purpose of reserving in
the first place.
Preallocate instaed. The only consumer was always passing "1" as count
and never nesting reservations.
When either makesyscalls.lua or syscalls.master changes, all of the
${GENERATED} targets are now out-of-date. With make jobs > 1, this means we
will run the makesyscalls script in parallel for the same ABI, generating
the same set of output files.
Prior to r356603 , there is a large window for interlacing output for some
of the generated files that we were generating in-place rather than staging
in a temp dir. After that, we still should't need to run the script more
than once per-ABI as the first invocation should update all of them. Add
.ORDER to do so cleanly.
Reviewed by: brooks
Discussed with: sjg
Differential Revision: https://reviews.freebsd.org/D23099
Otherwise the malloc type accounting in malloc_domainset(9) is wrong
after r355203.
Reviewed by: rlibby
Reported by: kaktus
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D23095
Other mechanisms that resize the shmfd grab a write lock from 0 to OFF_MAX
for safety, so we still get proper synchronization of shmfd->shm_size in
effect. There's no need to block readers/writers of earlier segments when
we're just reserving more space, so narrow the scope -- it would likely be
safe to narrow it completely to just the section of the range that extends
beyond our current size, but this likely isn't worth it since the size isn't
stable until the writelock is granted the first time.
Suggested by: cem (passing comment)
Linux expects to be able to use posix_fallocate(2) on a memfd. Other places
would use this with shm_open(2) to act as a smarter ftruncate(2).
Test has been added to go along with this.
Reviewed by: kib (earlier version)
Differential Revision: https://reviews.freebsd.org/D23042
This opens the door for other descriptor types to implement
posix_fallocate(2) as needed.
Reviewed by: kib, bcr (manpages)
Differential Revision: https://reviews.freebsd.org/D23042
vgone dooms the vnode while keeping VI_OWEINACT set and then drops the
interlock.
vputx can pick up the interlock and pass it to vdefer_inactive since the
flag is set.
The race is harmless, just don't defer anything as vgone will take care of it.
Reported by: pho
The previous behavior of leaving VI_OWEINACT vnodes on the active list without
a hold count is eliminated. Hold count is kept and inactive processing gets
explicitly deferred by setting the VI_DEFINACT flag. The syncer is then
responsible for vdrop.
Reviewed by: kib (previous version)
Tested by: pho (in a larger patch, previous version)
Differential Revision: https://reviews.freebsd.org/D23036
- use LK_NOWAIT instead of calling VOP_ISLOCKED before deciding to lock
- evaluate flags before looping over vnodes
Reviewed by: kib
Tested by: pho (in a larger patch, previous version)
Differential Revision: https://reviews.freebsd.org/D23035
Otherwise in code like this:
if (numvnodes > desiredvnodes)
vnlru_free_locked(numvnodes - desiredvnodes, NULL);
numvnodes can drop below desiredvnodes prior to the call and if the
compiler generated another read the subtraction would get a negative
value.
There was only one consumer and it was using it incorrectly.
It is given an equivalent hack.
Reviewed by: jeff
Differential Revision: https://reviews.freebsd.org/D23037
r136999 introduced SYSTCL_DEBUG but apparently "opt_sysctl.h" was never
included making the option ignored.
r322954 introduced sysctl.reuse_test with OID number equal to 0, effectively
shadowing the very special sysctl.debug one. Use OID_AUTO as it doesn't need
any special treatment.
Reviewed by: kib (mentor)
Approved by: kib (mentor)
Differential Revision: https://reviews.freebsd.org/D23056
When file sealing and shm_open2 were introduced, we should have grown a new
kern_shm_open2 helper that did the brunt of the work with the new interface
while kern_shm_open remains the same. Instead, more complexity was
introduced to kern_shm_open to handle the additional features and consumers
had to keep changing in somewhat awkward ways, and a kern_shm_open2 was
added to wrap kern_shm_open.
Backpedal on this and correct the situation- kern_shm_open returns to the
interface it had prior to file sealing being introduced, and neither
function needs an initial_seals argument anymore as it's handled in
kern_shm_open2 based on the shmflags.
If a write seal is set on a shared mapping, we must exclude VM_PROT_WRITE as
the fd is effectively read-only. This was discovered by running
devel/linux-ltp, which mmap's with acceptable protections specified then
attempts to raise to PROT_READ|PROT_WRITE with mprotect(2), which we
allowed.
Reviewed by: kib
Differential Revision: https://reviews.freebsd.org/D22978
between populating buckets from the slab layer and fetching full buckets
from the zone layer. Eliminate some nonsense locking patterns where
we lock to fetch a single variable.
Reviewed by: markj
Differential Revision: https://reviews.freebsd.org/D22828
Filesystems which want to use it in limited capacity can employ the
VOP_UNLOCK_FLAGS macro.
Reviewed by: kib (previous version)
Differential Revision: https://reviews.freebsd.org/D21427
This is a lock-based emulation of 64-bit atomics for kernel use, split off
from an earlier patch by jhibbits.
This is needed to unblock future improvements that reduce the need for
locking on 64-bit platforms by using atomic updates.
The implementation allows for future integration with userland atomic64,
but as that implies going through sysarch for every use, the current
status quo of userland doing its own locking may be for the best.
Submitted by: jhibbits (original patch), kevans (mips bits)
Reviewed by: jhibbits, jeff, kevans
Differential Revision: https://reviews.freebsd.org/D22976
and make it usable outside of kern_umtx.c. To be used in several
future changes.
Discussed with: markj
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
r23081 introduced kern.dummy oid as a semi ABI compat for kern.maxsockbuf
that was moved to a new namespace. It never functioned as an alias of any
kind and was just returning 0 unconditionally, hence it was probably
provided to keep some 3rd party programmes happy about sysctl(3) not
reporting an error because of non-existing oid.
After nearly 23 years it seems reasonable to just hide it from sysctl(8)
list not to cause unnecessary confusion as for its purpose.
Reported by: Antranig Vartanian <antranigv@freebsd.am>
Reviewed by: kib (mentor)
Approved by: kib (mentor)
Differential Revision: https://reviews.freebsd.org/D22982
Amount of changes to the original code has been intentionally minimised
to ease diffing.
The changes are mostly mechanical, with the following exceptions:
* lltable handler is now called directly based of RTF_LLINFO flag presense.
* "report" logic for updating rtm in RTM_GET/RTM_DELETE has been simplified,
fixing several potential use-after-free cases in rt_addrinfo.
* llable asserts has been replaced with error-returning, preventing kernel
crashes when lltable gw af family is invalid (root required).
MFC after: 2 weeks
Differential Revision: https://reviews.freebsd.org/D22864
Combined with earlier nstart/nend removal it allows to remove several locks
from request path of GEOM and few other places. It would be cool if we had
more SMP-friendly statistics, but this helps too.
Sponsored by: iXsystems, Inc.
With the previous reviews, the page lock is no longer required in order
to perform queue operations on a page. It is also no longer needed in
the page queue scans. This change effectively eliminates remaining uses
of the page lock and also the false sharing caused by multiple pages
sharing a page lock.
Reviewed by: jeff
Tested by: pho
Sponsored by: Netflix, Intel
Differential Revision: https://reviews.freebsd.org/D22885
Summary:
r356113 used an older patch, which predated the
freebsd_copyout_auxargs() addition. Fix this by using a private
powerpc_copyout_auxargs() instead, and keep it private to powerpc, not in MI
files.
Reviewed by: kib, bdragon
Differential Revision: https://reviews.freebsd.org/D22935
1. The only place in the tree which calls getnewvnode with mp == NULL does it
for vp_crossmp which will never execute this codepath. Any vnode which legally
has ->v_mount == NULL is also doomed, which once more wont execute this code.
2. Remove an assertion for v_holdcnt from production kernels. It gets taken care
of by refcount macros in debug kernels.
Any code which would want to pass NULL mp can construct a fake one instead.
Reviewed by: kib (previous version)
Differential Revision: https://reviews.freebsd.org/D22722
To be used when like rmlocks, except when sleeping for readers needs to be
allowed. See the manpage for more information.
Reviewed by: kib (previous version)
Differential Revision: https://reviews.freebsd.org/D22823
Summary:
As a transition aide, implement an alternative elfN_freebsd_fixup which
is called for old powerpc binaries. Similarly, add a translation to rtld to
convert old values to new ones (as expected by a new rtld).
Translation of old<->new values is incomplete, but sufficient to allow an
installworld of a new userspace from an old one when a new kernel is running.
Test Plan:
Someone needs to see how a new kernel/rtld/libc works with an old
binary. If if works we can probalby ship this. If not we probalby need
some more compat bits.
Submitted by: brooks
Reviewed by: jhibbits
Differential Revision: https://reviews.freebsd.org/D20799
srandom(9) is meaningless on SMP systems or any system with, say,
interrupts. One could never rely on random(9) to produce a reproducible
sequence of outputs on the basis of a specific srandom() seed because the
global state was shared by all kernel contexts. As such, removing it is
literally indistinguishable to random(9) consumers (as compared with
retaining it).
Mark random(9) as deprecated and slated for quick removal. This is not to
say we intend to remove all fast, non-cryptographic PRNG(s) in the kernel.
It/they just won't be random(9), as it exists today, in either name or
implementation.
Before random(9) is removed, a replacement will be provided and in-tree
consumers will be converted.
Note that despite the name, the random(9) interface does not bear any
resemblance to random(3). Instead, it is the same crummy 1988 Park-Miller
LCG used in libc rand(3).
A weak symbol here is decidedly cleaner than any #ifdef soup or relocating
kbdinit, the former leading to maintenance required on addition of any
console/keyboard drivers and the latter pushing kbd init bits away from
where they're used.
This leads to the revert of r355806; this reduces duplication in keyboard
registration and driver switch lookup and leaves us with one authoritative
source for currently registered drivers. The reduced duplication later is
nice as we have more procedure involved in keyboard setup.
keyboard_driver->flags is used to more quickly detect bogus adds/removes.
From KPI consumers' perspective, nothing changes- kbd_add_driver of an
already-registered driver will succeed, and a single kbd_delete_driver will
later remove it as expected. In contrast to historical behavior,
kbd_delete_driver on a driver registered via linker set will now actually
de-register the driver so that it may not be used -- e.g. if kbdmux's
MOD_LOAD handler fails somewhere.
Detection for already-registered drivers in kbd_add_driver has improved, as
the previous SLIST_NEXT(driver) != NULL check would not have caught a driver
that's at the tail end.
kbdinit is now called from cninit() rather than via SYSINIT so that keyboard
drivers are available as early as console drivers. This is particularly
important as cnprobe will, in both syscons and vt, attempt to do any early
configuration of keyboard drivers built-in (see: kbd_configure).
Reviewed by: imp (earlier version, pre-cninit change)
Differential Revision: https://reviews.freebsd.org/D22835
_sleep(9), wakeup(9), sleepqueue(9), et al do not dereference or modify the
channel pointers provided in any way; they are merely used as intptrs into a
dictionary structure to match waiters with wakers. Correctly annotate this
such that _sleep() and wakeup() may be used on const pointers without
invoking ugly patterns like __DECONST(). Plumb const through all of the
underlying sleepqueue bits.
No functional change.
Reviewed by: rlibby
Discussed with: kib, markj
Differential Revision: https://reviews.freebsd.org/D22914
Due to clang and LLD's tendency to use a PLT for builtins, and as they
don't have full support for EABI, we sometimes have to deal with a PLT in
.ko files in a clang-built kernel.
As such, augment the in-kernel linker to support jump table processing.
As there is no particular reason to support lazy binding in kernel modules,
only implement Secure-PLT immediate binding.
As part of these changes, add elf_cpu_parse_dynamic() to the MD API of the
in-kernel linker (except on platforms that use raw object files.)
The new function will allow MD code to act on MD tags in _DYNAMIC.
Use this new function in the PowerPC MD code to ensure BSS-PLT modules using
PLT will be rejected during insertion, and to poison the runtime resolver to
ensure we get a clear panic reason if a call is made to the resolver.
Reviewed by: jhibbits
Differential Revision: https://reviews.freebsd.org/D22608
It is UB to evaluate pointer comparisons when pointers do not point within
the same object. Instead, convert the pointers to numbers and compare the
numbers.
Reported by: kib
Discussed with: rlibby
Previously just ensuring that we do not sleep when clustering for
md(4) vnode was enough. Now, with the switch of the pbuf allocator to
uma and completely broken per-subsystem pbuf limits, it might cause
unbounded sleep even for non-md(4) vnodes.
Reported and tested by: pho
Reviewed by: markj
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
Differential revision: https://reviews.freebsd.org/D22899
If the devmap entry uses the upper 32 bits they wouldn't be printed in
devmap_dump_table(). This fixes that.
Submitted by: Nicholas O'Brien <nickisobrien_gmail.com>
Sponsored by: Axiado
We by definition cannot trace the stack of such a thread. Also remove a
redundant stack_zero() call in the SIGINFO handler, the stack structure
is cleared by the MD stack_capture().
Sponsored by: The FreeBSD Foundation
Now that it is not used after schedlock changes got merged.
Note the unlock routine temporarily still checks for it on account of just using
regular spin unlock.
This is a prelude towards a general clean up.
Don't hold the scheduler lock while doing context switches. Instead we
unlock after selecting the new thread and switch within a spinlock
section leaving interrupts and preemption disabled to prevent local
concurrency. This means that mi_switch() is entered with the thread
locked but returns without. This dramatically simplifies scheduler
locking because we will not hold the schedlock while spinning on
blocked lock in switch.
This change has not been made to 4BSD but in principle it would be
more straightforward.
Discussed with: markj
Reviewed by: kib
Tested by: pho
Differential Revision: https://reviews.freebsd.org/D22778
Eliminate lock recursion from turnstiles. This was simply used to avoid
tracking the top-level turnstile lock. explicitly check for it before
picking up and dropping locks.
Reviewed by: kib
Tested by: pho
Differential Revision: https://reviews.freebsd.org/D22746
Do all sleepqueue post-processing in sleepq_remove_thread() so that we
do not require the thread lock after a context switch.
Reviewed by: jhb, kib
Differential Revision: https://reviews.freebsd.org/D22745
The arm kernel stack unwinder has apparently never been able to unwind when
the path of execution leads through a kernel module. There was code that
tried to handle modules by looking for the unwind data in them, but it did
so by trying to find symbols which have never existed in arm kernel
modules. That caused the unwind code to panic, and because part of panic
handling calls into the unwind code, that just created a recursion loop.
Locating the unwind data in a loaded module requires accessing the Elf
section headers to find the SHT_ARM_EXIDX section. For preloaded modules
those headers are present in a metadata blob. For dynamically loaded
modules, the headers are present only while the loading is in progress; the
memory is freed once the module is ready to use. For that reason, there is
new code in kern/link_elf.c, wrapped in #ifdef __arm__, to extract the
unwind info while the headers are loaded. The values are saved into new
fields in the linker_file structure which are also conditional on __arm__.
In arm/unwind.c there is new code to locally cache the per-module info
needed to find the unwind tables. The local cache is crafted for lockless
read access, because the unwind code often needs to run in context where
sleeping is not allowed. A large comment block describes the local cache
list, so I won't repeat it all here.
Eliminate recursion from most thread_lock consumers. Return from
sched_add() without the thread_lock held. This eliminates unnecessary
atomics and lock word loads as well as reducing the hold time for
scheduler locks. This will eventually allow for lockless remote adds.
Discussed with: kib
Reviewed by: jhb
Tested by: pho
Differential Revision: https://reviews.freebsd.org/D22626
an exclusive object lock.
Previously swap space was freed on a best effort basis when a page that
had valid swap was dirtied, thus invalidating the swap copy. This may be
done inconsistently and requires the object lock which is not always
convenient.
Instead, track when swap space is present. The first dirty is responsible
for deleting space or setting PGA_SWAP_FREE which will trigger background
scans to free the swap space.
Simplify the locking in vm_fault_dirty() now that we can reliably identify
the first dirty.
Discussed with: alc, kib, markj
Differential Revision: https://reviews.freebsd.org/D22654
exec_map_first_page(). This will also enable pagein clustering for other
interested consumers (tmpfs, md, etc).
Discussed with: alc
Approved by: kib
Differential Revision: https://reviews.freebsd.org/D22731
bits, by storing and modifying the complement of the original leaf
mask, and by avoiding some unnecessary intermediate variables in
computing the shift amounts. The logic is similar to what has recently
been committed to sys/sys/bitstring.h.
Compute better hint updates for the case when the cursor starts in
mid-leaf, and eliminates some otherwise viable solutions. Assume the
worst case, that all the eliminated offsets could have been solutions,
and you can still compute a better hint than we use now.
Eliminate some unnecessary conditional control flow.
Approved by: alc
Tested by: pho
Differential Revision: https://reviews.freebsd.org/D22666
Delay the attachment of children, when requested, until after interrutps are
running. This is often needed to allow children to run transactions on i2c or
spi busses. It's a common enough idiom that it will be useful to have its own
wrapper.
Reviewed by: ian
Differential Revision: https://reviews.freebsd.org/D21465
Allocate the callout structure on-demand from
fail_point_use_timeout_path() since most fail points do not use
timeouts.
Reviewed by: markj (earlier version), cem
Differential Revision: https://reviews.freebsd.org/D22599
s/BIT_NAND/BIT_ANDNOT/, and for CPU and DOMAINSET too. The actual
implementation is "and not" (or "but not"), i.e. A but not B.
Fortunately this does appear to be what all existing callers want.
Don't supply a NAND (not (A and B)) operation at this time.
Discussed with: jeff
Reviewed by: cem
Sponsored by: Dell EMC Isilon
Differential Revision: https://reviews.freebsd.org/D22791
The simulation cannot be reproduced, so the value of using a deterministic PRNG
like random(3) is dubious. The number of repitions used in the sample isn't a
problem for the Chacha implementation of arc4random we have today. (Also, no
one actually runs this code; it was provided as an example of the work the
author did validating the implementation. It's not even test code.)
r355677 modified the NFS client so that it does lseek(SEEK_DATA/SEEK_HOLE)
for NFSv4.2, but calls vop_stdioctl() otherwise. As such, vop_stdioctl()
needs to be a global function.
Missed during the code merge for r355677.
via sys_sync(2). Minor cleanup, no functional changes.
Reviewed by: kib
MFC after: 2 weeks
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D19366
This fixes a regression after r355311. Specifically, sched_preempt()
may trigger a context switch by calling thread_lock(), since
thread_lock() calls critical_exit() in its slow path and the interrupted
thread may have already been marked for preemption. This would happen
before tdq_ipipending is cleared, blocking further preemption IPIs. The
CPU can be left in this state indefinitely if the interrupted thread
migrates.
Rename tdq_ipipending to tdq_owepreempt. Any switch satisfies a remote
preemption request, so clear tdq_owepreempt in sched_switch() instead of
sched_preempt() to avoid subtle problems of the sort described above.
Reviewed by: jeff, kib
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D22758
Both of these features are not needed by many consumers and result in avoidable
reads which in turn puts them on profiles due to cache-line ping ponging.
On top of that the current lockgmr entry point is slower than necessary
single-threaded. As an attempted clean up preparing for other changes,
provide new routines which don't support any of the aforementioned features.
With these patches in place vop_stdlock and vop_stdunlock disappear from
flamegraphs during -j 104 buildkernel.
Reviewed by: jeff (previous version)
Tested by: pho
Differential Revision: https://reviews.freebsd.org/D22665
A system call number should be at least reserved.
We do not expect an attempt to register a fixed number system call
when nothing at all is known about it.
MFC after: 3 weeks
Sponsored by: Panzura
This typedef is the same as timeout_t except that it is in the callout
namespace and header.
Use this typedef in various places of the callout implementation that
were either using the raw type or timeout_t.
While here, add <sys/callout.h> to the manpage.
Reviewed by: kib, imp
MFC after: 1 month
Differential Revision: https://reviews.freebsd.org/D22751
Partially revert r354741 and r354754 and go back to allocating a
fixed-size chunk of stack space for the auxiliary vector. Keep
sv_copyout_auxargs but change it to accept the address at the end of
the environment vector as an input stack address and no longer
allocate room on the stack. It is now called at the end of
copyout_strings after the argv and environment vectors have been
copied out.
This should fix a regression in r354754 that broke the stack alignment
for newer Linux amd64 binaries (and probably broke Linux arm64 as
well).
Reviewed by: kib
Tested on: amd64 (native, linux64 (only linux-base-c7), and i386)
Sponsored by: DARPA
Differential Revision: https://reviews.freebsd.org/D22695
The current vnode layout is not smp-friendly by having frequently read data
avoidably sharing cachelines with very frequently modified fields. In
particular v_iflag inspected for VI_DOOMED can be found in the same line with
v_usecount. Instead make it available in the same cacheline as the v_op, v_data
and v_type which all get read all the time.
v_type is avoidably 4 bytes while the necessary data will easily fit in 1.
Shrinking it frees up 3 bytes, 2 of which get used here to introduce a new
flag field with a new value: VIRF_DOOMED.
Reviewed by: kib, jeff
Differential Revision: https://reviews.freebsd.org/D22715
1. replace hand-rolled macros for operation type with enum
2. unlock the vnode in vput itself, there is no need to branch on it. existence
of VPUTX_VPUT remains significant in that the inactive variant adds LK_NOWAIT
to locking request.
3. remove the useless v_usecount assertion. few lines above the checks if
v_usecount > 0 and leaves. should the value be negative, refcount would fail.
4. the CTR return vnode %p to the freelist is incorrect as vdrop may find the
vnode with holdcnt > 1. if the like should exist, it should be moved there
5. no need to error = 0 for everyone
Reviewed by: kib, jeff (previous version)
Differential Revision: https://reviews.freebsd.org/D22718
As mandated by POSIX. Also clarify the kill(2) manpage.
While there, restructure the code in killpg1() to use helper which
keeps overall state of the process list iteration in the killpg1_ctx
structued, later used to infer the error returned.
Reported by: amdmi3
Reviewed by: jilles
Sponsored by: The FreeBSD Foundation
MFC after: 2 weeks
Differential revision: https://reviews.freebsd.org/D22621
Use the power of variable to avoid spelling out source and generated
files too many times. The previous Makefiles were hard to read, hard to
edit, and badly formatted.
Reviewed by: kevans, emaste
Sponsored by: DARPA, AFRL
Differential Revision: https://reviews.freebsd.org/D22714
Two changes to EPOCH_TRACE:
(1) add a sysctl to surpress the backtrace from epoch_trace_report().
Sometimes the log line for the recursion is enough and the
backtrace massively spams the console.
(2) In order to be able to go without the backtrace do not only
print where the previous occurance happened, but also where
the current one happens. That way we have file:line information
for both and can look at them without the need for getting line
numbers from backtrace and a debugging tool.
Reviewed by: glebius
Sponsored by: Netflix (originally)
Differential Revision: https://reviews.freebsd.org/D22641
First, this removes a spurious difference compared to rw locks.
More importantly though this avoids a trip through sleepq code if the lock
happens to be caught in this state.
The mbuf zones were explicitly specifying the uma trash procedures on
zcreate, conditionally on INVARIANTS, because that used to be necessary
in order to get use-after-free checking for uma zones with non-empty
constructors or destructors. After r355137 uma automatically invokes
the trash constructor and destructor as long as no init and fini are
specified. This now allows the mbuf zones to pass their constructors
and destructors without needing to add on the uma trash procedures
conditionally.
Reviewed by: cem, jhb, markj
Sponsored by: Dell EMC Isilon
Differential Revision: https://reviews.freebsd.org/D22583
- Use ustringp for the location of the argv and environment strings
and allow destp to travel further down the stack for the stackgap
and auxv regions.
- Update the Linux copyout_strings variants to move destp down the
stack as was done for the native ABIs in r263349.
- Stop allocating a space for a stack gap in the Linux ABIs. This
used to hold translated system call arguments, but hasn't been used
since r159992.
Reviewed by: kib
Tested on: md64 (amd64, i386, linux64), i386 (i386, linux)
Sponsored by: DARPA
Differential Revision: https://reviews.freebsd.org/D22501
the vnode, logical block number, and size of data block that is
being requested. They then use the VOP_BMAP function to calculate
the mapping from logical block number to physical block number from
which to access the data. This change expands the interface to also
pass the physical block number in cases where the VOP_MAP function
may no longer work, for example when a file is being truncated.
No functional change.
Reviewed by: kib
Tested by: Peter Holm
Sponsored by: Netflix
Generally, it's preferred that an application fork/setsid if it doesn't want
to keep its controlling TTY, but it could be that a debugger is trying to
steal it instead -- so it would hook in, drop the controlling TTY, then do
some magic to set things up again. In this case, TIOCNOTTY is quite handy
and still respected by at least OpenBSD, NetBSD, and Linux as far as I can
tell.
I've dropped the note about obsoletion, as I intend to support TIOCNOTTY as
long as it doesn't impose a major burden.
Reviewed by: bcr (manpages), kib
Differential Revision: https://reviews.freebsd.org/D22572
The previously used quiesce_all_cpus walks all CPUs and waits until curthread
can run on them. Even on contemporary machines this becomes a significant
problem under load when it can literally take minutes for the operation to
complete. With the patch the stall is normally less than 1 second.
Reviewed by: kib, jeff (previous version)
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D21740
A variant of this facility was already used by rmlocks where IPIs would
enforce ordering.
This allows to elide fences where they are rarely needed and the cost of
IPI (should it be necessary) is cheaper.
Reviewed by: kib, jeff (previous version)
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D21740
This allows bumping threadcount without taking the global devmtx lock.
In particular this eliminates contention on said lock while using bhyve
with multiple vms.
Reviewed by: kib
Tested by: markj
MFC after: 2 weeks
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D22548
We already assert the lock is held later during tty_rel_free(), but it is
arguably good form to clarify locking expectations here as well at the
top-level that other drivers use.
vn_open_cred() assumes that it is called from the top-level of a VFS
syscall. Writers must call bwillwrite() before locking any VFS
resource to wait for cleanup of dirty buffers.
ZFS getextattr() and setextattr() VOPs do call vn_open_cred(), which
results in wait for unrelated buffers while owning ZFS vnode lock (and
ZFS does not use buffer cache). VN_OPEN_INVFS allows caller to skip
bwillwrite.
Note that ZFS is still incorrect there, because it starts write on an
mp and locks a vnode while holding another vnode lock.
Reported by: Willem Jan Withagen <wjw@digiware.nl>
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
The use of the uma trash procedures is automatic, there's no need to
pass them explicitly here.
Reviewed by: markj
Sponsored by: Dell EMC Isilon
Differential Revision: https://reviews.freebsd.org/D22582
tty_pts.c relies on sys/tty.h for sys/mutex.h. Include it directly instead
of relying on this pollution to ease the diff for anyone that wants to try
converting the tty lock to anything other than a mutex.
union members in vm_page.h to store the zone and slab. Remove some nearby
dead code.
Reviewed by: markj
Differential Revision: https://reviews.freebsd.org/D22564
There are two classes of rm lock, one "sleepable" and one not. But even
a "sleepable" rm lock is only sleepable in write mode, and is
non-sleepable when taken in read mode.
Warn about sleepable rm locks in read mode as non-sleepable locks. Do
this by defining a new lock operation flag, LOP_NOSLEEP, to indicate
that a lock is non-sleepable despite what the LO_SLEEPABLE flag would
indicate, and defining a new witness lock instance flag, LI_SLEEPABLE,
to track the product of LO_SLEEPABLE and LOP_NOSLEEP on the lock
instance.
Reviewed by: markj
Sponsored by: Dell EMC Isilon
Differential Revision: https://reviews.freebsd.org/D22527
. entries are never created and .. can reuse existing entries,
meaning the early count bump is both spurious and leading to
overcounting in certain cases.
The debugger like truss(1) depends on the wait(2) syscall. This syscall
waits for ALL children. When it is waiting for ALL child's the children
created by process descriptors are not returned. This behavior was
introduced because we want to implement libraries which may pdfork(1).
The behavior of process descriptor brakes truss(1) because it will
not be able to collect the status of processes with process descriptors.
To address this problem the status is returned to parent when the
child is traced. While the process is traced the debugger is the new parent.
In case the original parent and debugger are the same process it means the
debugger explicitly used pdfork() to create the child. In that case the debugger
should be using kqueue()/pdwait() instead of wait().
Add test case to verify that. The test case was implemented by markj@.
Reviewed by: kib, markj
Discussed with: jhb
MFC after: 1 month
Differential Revision: https://reviews.freebsd.org/D20362
Several sysctl sysctls output to a user buffer while holding a
non-sleepable lock that protects the sysctl topology. They need to wire
the output buffer, or else they may try to sleep on a page fault.
Reviewed by: cem, markj
Sponsored by: Dell EMC Isilon
Differential Revision: https://reviews.freebsd.org/D22528
Record as much bits from curthread into busy_lock as fits. Low bits
for struct thread * representation are zero due to struct and zone
alignment, and they leave space for busy flags (perhaps except
statically allocated thread0). Upper bits are not very interesting
for assert, and in most practical situations recorded value should
allow to manually identify the owner with certainity.
Assert that unbusy is performed by the owner, except few places where
unbusy is done in io completion handler. For this case, add
_unchecked variants of asserts and unbusy primitives.
Reviewed by: markj (previous version)
Tested by: pho
Sponsored by: The FreeBSD Foundation
Differential revision: https://reviews.freebsd.org/D22298
Add a warning when a device registers with devfs and requests
D_NEEDGIANT. The warning says the device will go away before
13.0. This is needed to flush out the devices in the tree that are
still Giant locked. This warning, or some variant of it, should have
gone into the tree a long time ago...
The intention is to require all devices be converted to not use
automatic giant in this way, or remove any such devices that remain
that we don't have the hardware to test a conversion of.
kbd so far is the only device that can't leave the tree, yet needs
something sensible done to avoid the auto giant lock (even if it is
just doing the wrapping itself). There may be others added to this
list... Any discussions of this topic will take place on arch@.
Add explicit SI_SUB_EPOCH, after SI_SUB_TASKQ and before SI_SUB_SMP
(EARLY_AP_STARTUP). Rename existing "SI_SUB_TASKQ + 1" to SI_SUB_EPOCH.
epoch(9) consumers cannot epoch_alloc() before SI_SUB_EPOCH:SI_ORDER_SECOND,
but likely should allocate before SI_SUB_SMP. Prior to this change,
consumers (well, epoch itself, and net/if.c) just open-coded the
SI_SUB_TASKQ + 1 order to match epoch.c, but this was fragile.
Reviewed by: mmacy
Differential Revision: https://reviews.freebsd.org/D22503
Arm64 doesn't define the bus_space_set_multi_stream and
bus_space_set_region_stream functions. Don't try to define them there.
Sponsored by: DARPA, AFRL
Update the NetBSD Kernel Concurrency Sanitizer (KCSAN) runtime to work in
the FreeBSD kernel. It is a useful tool for finding data races between
threads executing on different CPUs.
This can be enabled by enabling KCSAN in the kernel config, or by using the
GENERIC-KCSAN amd64 kernel. It works on amd64 and arm64, however the later
needs a compiler change to allow -fsanitize=thread that KCSAN uses.
Sponsored by: DARPA, AFRL
Differential Revision: https://reviews.freebsd.org/D22315
KCSAN is a tool to find concurrent memory access that may race each other.
After a determined number of memory accesses a cell is created, this
describes the current access. It will then delay for a short period
to allow other CPUs a chance to race. If another CPU performs a memory
access to an overlapping region during this delay the race is reported.
This is a straight import of the NetBSD code, it will be adapted to
FreeBSD in a future commit.
Sponsored by: DARPA, AFRL
Currently si_usecount is effectively a sum of usecounts from all associated
vnodes. This is maintained by special-casing for VCHR every time usecount is
modified. Apart from complicating the code a little bit, it has a scalability
impact since it forces a read from a cacheline shared with said count.
There are no consumers of the feature in the ports tree. In head there are only
2: revoke and devfs_close. Both can get away with a weaker requirement than the
exact usecount, namely just the count of active vnodes. Changing the meaning to
the latter means we only need to modify it on 0<->1 transitions, avoiding the
check plenty of times (and entirely in something like vrefact).
Reviewed by: kib, jeff
Tested by: pho
Differential Revision: https://reviews.freebsd.org/D22202
reudundant complicated checks and additional locking required only for
anonymous memory. Introduce vm_object_allocate_anon() to create these
objects. DEFAULT and SWAP objects now have the correct settings for
non-anonymous consumers and so individual consumers need not modify the
default flags to create super-pages and avoid ONEMAPPING/NOSPLIT.
Reviewed by: alc, dougm, kib, markj
Tested by: pho
Differential Revision: https://reviews.freebsd.org/D22119
The lua-based makesyscalls produces slightly different output than its
makesyscalls.sh predecessor, all whitespace differences more closely
matching the source syscalls.master.
flua is bootstrapped as part of the build for those on older
versions/revisions that don't yet have flua installed. Once upgraded past
r354833, "make sysent" will again naturally work as expected.
Reviewed by: brooks
Differential Revision: https://reviews.freebsd.org/D21894
Co-mingling two things here:
* Addressing some feedback from Konstantin and Kyle re: jail,
capability mode, and a few other things
* Adding audit support as promised.
The audit support change includes a partial refresh of OpenBSM from
upstream, where the change to add shm_rename has already been
accepted. Matthew doesn't plan to work on refreshing anything else to
support audit for those new event types.
Submitted by: Matthew Bryan <matthew.bryan@isilon.com>
Reviewed by: kib
Relnotes: Yes
Sponsored by: Dell EMC Isilon
Differential Revision: https://reviews.freebsd.org/D22083
Zeroing of them is needed so that an image activator can update the
values as appropriate (or not set at all).
Reviewed by: markj
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
Differential revision: https://reviews.freebsd.org/D22379
knobs and indicators for code that mitigates functional and security issues
in the architecture/platform. Controls for regular operational policy should
still go into places security, hw, kern, etc.
The machdep root node is inherently architecture dependent, but mitigations
tend to be architecture dependent as well. Some cases like Spectre do cross
architectural boundaries, but the mitigation code for them tends to be
architecture dependent anyways, and multiple architectures won't be active
in the same image of the kernel.
Many mitigation knobs already exist in the system, and they will be moved
with compat naming in the future. Going forward, mitigations should collect
in machdep.mitigations.
Reviewed by: imp, brooks, rwatson, emaste, jhb
Sponsored by: Intel
Change the FreeBSD ELF ABIs to use this new hook to copyout ELF auxv
instead of doing it in the sv_fixup hook. In particular, this new
hook allows the stack space to be allocated at the same time the auxv
values are copied out to userland. This allows us to avoid wasting
space for unused auxv entries as well as not having to recalculate
where the auxv vector is by walking back up over the argv and
environment vectors.
Reviewed by: brooks, emaste
Tested on: amd64 (amd64 and i386 binaries), i386, mips, mips64
Sponsored by: DARPA
Differential Revision: https://reviews.freebsd.org/D22355
Pointer arguments should be of the form "<type> *..." and not "<type>* ...".
No functional change.
Reviewed by: kevans
Sponsored by: DARPA, AFRL
Differential Revision: https://reviews.freebsd.org/D22373
Suppose a writing thread has pinned its pages and gone to sleep with
pipe_map.cnt > 0. Suppose that the thread is woken up by a signal (so
error != 0) and the other end of the pipe has simultaneously been
closed. In this case, to satisfy the assertion about pipe_map.cnt in
pipe_destroy_write_buffer(), we must mark the buffer as empty.
Reported by: syzbot+5cce271bf2cb1b1e1876@syzkaller.appspotmail.com
Reviewed by: kib
Tested by: pho
MFC after: 1 week
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D22261
The current linux man page and testing done on a fairly recent linux5.n
kernel have identified two changes to the semantics of the linux
copy_file_range system call.
Since the copy_file_range(2) system call is intended to be linux compatible
and is only currently in head/current and not used by any commands,
it seems appropriate to update the system call to be compatible with
the current linux one.
The first of these semantic changes was changed to be compatible with
linux5.n by r354564.
For the second semantic change, the old linux man page stated that, if
infd and outfd referred to the same file, EBADF should be returned.
Now, the semantics is to allow infd and outfd to refer to the same file
so long as the byte ranges defined by the input file offset, output file offset
and len does not overlap. If the byte ranges do overlap, EINVAL should be
returned.
This patch modifies copy_file_range(2) to be linux5.n compatible for this
semantic change.
The current linux man page and testing done on a fairly recent linux5.n
kernel have identified two changes to the semantics of the linux
copy_file_range system call.
Since the copy_file_range(2) system call is intended to be linux compatible
and is only currently in head/current and not used by any commands,
it seems appropriate to update the system call to be compatible with
the current linux one.
The old linux man page stated that, if the
offset + len exceeded file_size for the input file, EINVAL should be returned.
Now, the semantics is to copy up to at most file_size bytes and return that
number of bytes copied. If the offset is at or beyond file_size, a return
of 0 bytes is done.
This patch modifies copy_file_range(2) to be linux compatible for this
semantic change.
A separate patch will change copy_file_range(2) for the other semantic
change, which allows the infd and outfd to refer to the same file, so
long as the byte ranges do not overlap.
and busy pages. Add code that would carefully cleanups the state in case
of synchronous error return. Cover a case when a first I/O went on
asynchronously, but second or N-th returned error synchronously.
In collaboration with: chs
Reviewed by: jtl, kib
If we pass in a NULL mbuf to m_pulldown() we are in a bad situation
already. There is no point in doing that check for production code.
Change the if () panic() into a KASSERT.
MFC after: 3 weeks
Sponsored by: Netflix
On platforms where pointers are larger than 64-bits, struct statsblob
may be harmlessly padded out such that opaque[] always has some included
space. Make the assertion more general by comparing to the offset of
opaque rather than the size of struct statsblob.
Discussed with: jhb, James Clarke
Reviewed by: trasz, lstewart
Obtained from: CheriBSD
Sponsored by: DARPA, AFRL
Differential Revision: https://reviews.freebsd.org/D22188
- Optimize enqueue for two task priority values by adding new tq_hint
field, pointing to the last task inserted into the middle of the list.
In case of more then two priority values it should halve average search.
- Move tq_active insert/remove out of the taskqueue_run_locked loop.
Instead of dirtying few shared cache lines per task introduce different
mechanism to drain active tasks, based on task sequence number counter,
that uses only cache lines already present in cache. Since the new
mechanism does not need ordering, switch tq_active from TAILQ to LIST.
- Move static and dynamic struct taskqueue fields into different cache
lines. Move lock into its own cache line, so that heavy lock spinning
by multiple waiting threads would not affect the running thread.
- While there, correct some TQ_SLEEP() wait messages.
This change fixes certain ZFS write workloads, causing huge congestion
on taskqueue lock. Those workloads combine some large block writes to
saturate the pool and trigger allocation throttling, which uses higher
priority tasks to requeue the delayed I/Os, with many small blocks to
generate deep queue of small tasks for taskqueue to sort.
MFC after: 1 week
Sponsored by: iXsystems, Inc.
bzero the entire thrmisc struct, not just the padding. Other core dump
notes are already done this way.
Reported by: Ilja Van Sprundel <ivansprundel@ioactive.com>
Reviewed by: markj
MFC after: 3 days
Sponsored by: The FreeBSD Foundation
flag and use the same system.
This enables further fault locking improvements by allowing more faults to
proceed with a shared lock.
Reviewed by: kib
Tested by: pho
Differential Revision: https://reviews.freebsd.org/D22116
Recent changes to busy/valid/dirty have enabled page based synchronization
and the object lock is no longer required in many cases.
Reviewed by: kib
Sponsored by: Netflix, Intel
Differential Revision: https://reviews.freebsd.org/D21597
Epoch itself doesn't rely on the counter and it is provided
merely for sleeping subsystems to check it.
- In functions that sleep use THREAD_CAN_SLEEP() to assert
correctness. With EPOCH_TRACE compiled print epoch info.
- _sleep() was a wrong place to put the assertion for epoch,
right place is sleepq_add(), as there ways to call the
latter bypassing _sleep().
- Do not increase td_no_sleeping in non-preemptible epochs.
The critical section would trigger all possible safeguards,
no sleeping counter is extraneous.
Reviewed by: kib
This saves 320 bytes of the precious stack space.
The only negative aspect of the change I can think of is that the
struct thread increased by 320 bytes obviously, and that 320 bytes are
not swapped out anymore. I believe the freed stack space is much more
important than that. Also, current struct thread size is 1392 bytes
on amd64, so UMA will allocate two thread structures per (4KB) slab,
which leaves a space for pcb without increasing zone memory use.
Reviewed by: alc, markj
Tested by: pho
Sponsored by: The FreeBSD Foundation
MFC after: 2 weeks
Differential revision: https://reviews.freebsd.org/D22138
This permits constructing the entire TLS header in ktls_frame() rather
than ktls_seq(). This also matches the approach used by OpenSSL which
uses an incrementing nonce as the explicit IV rather than the sequence
number.
Reviewed by: gallatin
Sponsored by: Netflix
Differential Revision: https://reviews.freebsd.org/D22117
Create a sequence point by ending a full expression for call to
vspace() and use of the globals which are modified by vspace().
Reported and reviewed by: imp
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
Differential revision: https://reviews.freebsd.org/D22126
With an upcoming change the amd64 kernel will map preloaded files RW
instead of RWX, so the kernel linker must adjust protections
appropriately using pmap_change_prot().
Reviewed by: kib
MFC after: 1 month
Sponsored by: Netflix
Differential Revision: https://reviews.freebsd.org/D21860
Use the section flags to derive mapping protections. When multiple
sections overlap within a page, the union of their protections must be
applied. With r353701 the .text and .rodata sections are padded to
ensure that this does not happen on amd64.
Reviewed by: kib
MFC after: 1 month
Sponsored by: Netflix
Differential Revision: https://reviews.freebsd.org/D21896
NetGDB(4) is a component of a system using a panic-time network stack to
remotely debug crashed FreeBSD kernels over the network, instead of
traditional serial interfaces.
There are three pieces in the complete NetGDB system.
First, a dedicated proxy server must be running to accept connections from
both NetGDB and gdb(1), and pass bidirectional traffic between the two
protocols.
Second, the NetGDB client is activated much like ordinary 'gdb' and
similarly to 'netdump' in ddb(4) after a panic. Like other debugnet(4)
clients (netdump(4)), the network interface on the route to the proxy server
must be online and support debugnet(4).
Finally, the remote (k)gdb(1) uses 'target remote <proxy>:<port>' (like any
other TCP remote) to connect to the proxy server.
The NetGDB v1 protocol speaks the literal GDB remote serial protocol, and
uses a 1:1 relationship between GDB packets and sequences of debugnet
packets (fragmented by MTU). There is no encryption utilized to keep
debugging sessions private, so this is only appropriate for local
segments or trusted networks.
Submitted by: John Reimer <john.reimer AT emc.com> (earlier version)
Discussed some with: emaste, markj
Relnotes: sure
Differential Revision: https://reviews.freebsd.org/D21568
- Remove a redundant assignment of ef->address.
- Don't return a Mach error number to the caller if vm_map_find() fails.
- Use ptoa() and fix style.
MFC after: 2 weeks
Sponsored by: Netflix
This allows ddb(4) commands to construct a static dumperinfo during
panic/debug and invoke doadump(false) using the provided dumper
configuration (always inserted first in the list).
The intended usecase is a ddb(4)-time netdump(4) command.
Reviewed by: markj (earlier version)
Differential Revision: https://reviews.freebsd.org/D21448
Debugnet is a simplistic and specialized panic- or debug-time reliable
datagram transport. It can drive a single connection at a time and is
currently unidirectional (debug/panic machine transmit to remote server
only).
It is mostly a verbatim code lift from netdump(4). Netdump(4) remains
the only consumer (until the rest of this patch series lands).
The INET-specific logic has been extracted somewhat more thoroughly than
previously in netdump(4), into debugnet_inet.c. UDP-layer logic and up, as
much as possible as is protocol-independent, remains in debugnet.c. The
separation is not perfect and future improvement is welcome. Supporting
INET6 is a long-term goal.
Much of the diff is "gratuitous" renaming from 'netdump_' or 'nd_' to
'debugnet_' or 'dn_' -- sorry. I thought keeping the netdump name on the
generic module would be more confusing than the refactoring.
The only functional change here is the mbuf allocation / tracking. Instead
of initiating solely on netdump-configured interface(s) at dumpon(8)
configuration time, we watch for any debugnet-enabled NIC for link
activation and query it for mbuf parameters at that time. If they exceed
the existing high-water mark allocation, we re-allocate and track the new
high-water mark. Otherwise, we leave the pre-panic mbuf allocation alone.
In a future patch in this series, this will allow initiating netdump from
panic ddb(4) without pre-panic configuration.
No other functional change intended.
Reviewed by: markj (earlier version)
Some discussion with: emaste, jhb
Objection from: marius
Differential Revision: https://reviews.freebsd.org/D21421
This can be used to group all threads belonging to a single logical
entity under a common kernel process.
I am planning to use the new interface for ZFS threads.
MFC after: 4 weeks
After r352110 the page lock no longer protects a page's identity, so
there is no purpose in locking the page in pmap_mincore(). Instead,
if vm.mincore_mapped is set to the non-default value of 0, re-lookup
the page after acquiring its object lock, which holds the page's
identity stable.
The change removes the last callers of vm_page_pa_tryrelock(), so
remove it.
Reviewed by: kib
Sponsored by: Netflix
Differential Revision: https://reviews.freebsd.org/D21823
The timespec struct holds a seconds value in a time_t and a nanoseconds
value in a long. On most architectures these are the same size, however
on 32-bit architectures other than i386 time_t is 8 bytes and long is
4 bytes.
Most ABIs will then pad a struct holding an 8 byte and 4 byte value to
16 bytes with 4 bytes of padding. When copying one of these structs the
compiler is free to copy the padding if it wishes.
In this case the padding may contain kernel data that is then leaked to
userspace. Fix this by copying the timespec elements rather than the
entire struct.
This doesn't affect Tier-1 architectures so no SA is expected.
admbugs: 651
MFC after: 1 week
Sponsored by: DARPA, AFRL
The comments in devmap are very ARM specific, this generalizes them for other
architectures.
Submitted by: Nicholas O'Brien <nickisobrien_gmail.com>
Reviewed by: manu, philip
Sponsored by: Axiado
Differential Revision: https://reviews.freebsd.org/D22035
Atomics are used for page busy and valid state when the shared busy is
held. The details of the locking protocol and valid and dirty
synchronization are in the updated vm_page.h comments.
Reviewed by: kib, markj
Tested by: pho
Sponsored by: Netflix, Intel
Differential Revision: https://reviews.freebsd.org/D21594
This is the first in a series of patches that promotes the page busy field
to a first class lock that no longer requires the object lock for
consistency.
Reviewed by: kib, markj
Tested by: pho
Sponsored by: Netflix, Intel
Differential Revision: https://reviews.freebsd.org/D21548
Based on POWER9BSD implementation, with all POWER9 specific code removed and
addition of new methods in PPC64 MMU interface, to isolate platform specific
code. Currently, the new methods are implemented on pseries and PowerNV
(D21643).
Reviewed by: jhibbits
Differential Revision: https://reviews.freebsd.org/D21551