If we hadn't been traced in the first place when syscallenter()
started executing, we can ignore TDB_USERWR. TDB_USERWR can get set,
sure, but if it does, it's because the debugger raced with the syscall,
and it cannot depend on winning that race.
Reviewed by: kib
MFC after: 2 weeks
Sponsored by: EPSRC
Differential Revision: https://reviews.freebsd.org/D26585
This module handles relatively few execs (initial qemu-user-static, then
qemu-user-static handles exec'ing itself for binaries it's already running),
but all execs pay the price of at least taking the relatively expensive
sx/slock to check for a match when this module is loaded. Future work will
almost certainly swap this out for another lock, perhaps an rmslock.
The RLOCK/WLOCK phrasing was chosen based on what the callers are really
wanting, rather than using the verbiage typically appropriate for an sx.
MFC after: 1 week
We may want to reserve bits in the future for kernel-only use, so start
rejecting any that aren't the two that we're currently expecting from
userland.
MFC after: 1 week
Previously, non-preemptible epochs could not check; in_epoch() would always
fail, usually because non-preemptible epochs don't imply THREAD_NO_SLEEPING.
For default epochs, it's easy enough to verify that we're in the given
epoch: if we're in a critical section and our record for the given epoch
is active, then we're in it.
This patch also adds some additional INVARIANTS bookkeeping. Notably, we set
and check the recorded thread in epoch_enter/epoch_exit to try and catch
some edge-cases for the caller. It also checks upon freeing that none of the
records had a thread in the epoch, which may make it a little easier to
diagnose some improper use if epoch_free() took place while some other
thread was inside.
This version differs slightly from what was just previously reviewed by the
below-listed, in that in_epoch() will assert that no CPU has this thread
recorded even if it *is* currently in a critical section. This is intended
to catch cases where the caller might have somehow messed up critical
section nesting, we can catch both if they exited the critical section or if
they exited, migrated, then re-entered (on the wrong CPU).
Reviewed by: kib, markj (both previous version)
MFC after: 1 week
Differential Revision: https://reviews.freebsd.org/D27098
Notably, streamline error paths through the existing 'done' label, making it
easier to quickly verify correct cleanup.
Future work might add a kernel-only flag to indicate that a interpreter uses
#a. Currently, all executions via imgact_binmisc pay the penalty of
constructing sname/fname, even if they will not use it. qemu-user-static
doesn't need it, the stock rc script for qemu-user-static certainly doesn't
use it, and I suspect these are the vast majority of (if not the only)
current users.
MFC after: 1 week
According to code comments the original motivation was to allow for
malloc_type_internal changes without ABI breakage. This can be trivially
accomplished by providing spare fields and versioning the struct, as
implemented in the patch below.
The upshots are one less memory indirection on each alloc and disappearance
of mt_zone.
Reviewed by: markj
Differential Revision: https://reviews.freebsd.org/D27104
This ensures that no writes are pending in memory, either metadata or
user data, but not including dirty pages not yet converted to fs writes.
Only filesystems declared local are suspended.
Note that this does not guarantee absence of the metadata errors or
leaks if resume is not done: for instance, on UFS unlinked but opened
inodes are leaked and require fsck to gc.
Reviewed by: markj
Discussed with: imp
Tested by: imp (previous version), pho
Sponsored by: The FreeBSD Foundation
MFC after: 2 weeks
Differential revision: https://reviews.freebsd.org/D27054
Sample usage: kernel modules can decide whether to stick to malloc or
create their own zone.
Reviewed by: markj
Differential Revision: https://reviews.freebsd.org/D27097
The 2 provided zones had inconsistent naming between each other
("int" and "64") and other allocator zones (which use bytes).
Follow malloc by naming them "pcpu-" + size in bytes.
This is a step towards replacing ad-hoc per-cpu zones with
general slabs.
On a sample box vmstat -z shows:
ITEM SIZE LIMIT USED FREE REQ
64: 64, 0, 1043784, 4367538,3698187229
selfd: 64, 0, 1520, 13726,182729008
But at the same time:
vm.uma.selfd.keg.domain.1.pages: 121
vm.uma.selfd.keg.domain.0.pages: 121
Thus 242 pages got pulled even though the malloc zone would likely accomodate
the load without using extra memory.
- Removed a bunch of redundant headers
- Don't explicitly initialize to 0
- The !error check prior to setting imgp->interpreter_name is redundant, all
error paths should and do return or go to 'done'. We have larger problems
otherwise.
Linux allows polling without any events specified and it happens to be the case
in FreeBSD as well. POLLHUP has to be delivered regardless of the event mask
and this works fine if the condition is already present. However, if it is
missing, selrecord is only called if the eventmask has relevant bits set. This
in particular leads to a conditon where pipe_poll can return 0 events and
neglect to selrecord, while kern_poll takes it as an indication it has to go to
sleep, but then there is nobody to wake it up.
While the problem seems systemic to *_poll handlers the least we can do is fix
it up for pipes.
Reported by: Jeremie Galarneau <jeremie.galarneau at efficios.com>
Reviewed by: kib
Differential Revision: https://reviews.freebsd.org/D27094
Previously the code had one wait channel for all pending writers.
This could result in a buggy scenario where after a writer switches
the lock mode form readers to writers goes off CPU, another writer
queues itself and then the last reader wakes up the latter instead
of the former.
Use a separate channel.
While here add features to reliably detect whether curthread has
the lock write-owned. This will be used by ZFS.
This is mostly mechanical except for vmspace_exit(). There, use the new
refcount_release_if_last() to avoid switching to vmspace0 unless other
processes are sharing the vmspace. In that case, upon switching to
vmspace0 we can unconditionally release the reference.
Remove the volatile qualifier from vm_refcnt now that accesses are
protected using refcount(9) KPIs.
Reviewed by: alc, kib, mmel
MFC after: 1 month
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D27057
Alter shmget_allocate_segment and shmget_existing to take the values
they want from struct shmget_args rather than passing the struct
around. In general, uap structures should only be the interface to
sys_<foo> functions.
This makes on small functional change and records the allocated space
rather than the requested space. If this turns out to be a problem (e.g.
if software tries to find undersized segments by exact size rather than
using keys), we can correct that easily.
Reviewed by: kib
Obtained from: CheriBSD
MFC after: 1 week
Sponsored by: DARPA
Differential Revision: https://reviews.freebsd.org/D27077
This option is intended to be semantically identical to Linux's
SOL_SOCKET:SO_PASSCRED. For now, it is mutually exclusive with the
pre-existing sockopt SOL_LOCAL:LOCAL_CREDS.
Reviewed by: markj (penultimate version)
Differential Revision: https://reviews.freebsd.org/D27011
This sysctl value had been provided as a read-only variable that is
compiled into the C library based on the value of _PATH_LOCALBASE in
paths.h.
After this change, the value is compiled into the kernel as an empty
string, which is translated to _PATH_LOCALBASE by the C library.
This empty string can be overridden at boot time or by a privileged
user at run time and will then be returned by sysctl.
When set to an empty string, the value returned by sysctl reverts to
_PATH_LOCALBASE.
This update does not change the behavior on any system that does
not modify the default value of user.localbase.
I consider this change as experimental and would prefer if the run-time
write permission was reconsidered and the sysctl variable defined with
CLFLAG_RDTUN instead to restrict it to be set at boot time.
MFC after: 1 month
It is almost never needed and adds an avoidable branch.
While here do minior clean ups in preparation for larger changes.
Reviewed by: markj
Differential Revision: https://reviews.freebsd.org/D27019
The value is provided by the C library as for other sysctl variables in
the user tree. It is compiled in and returns the value of _PATH_LOCALBASE
defined in paths.h.
Reviewed by: imp, scottl
Differential Revision: https://reviews.freebsd.org/D27009
This gives a more uniform API for send tag life cycle management.
Reviewed by: gallatin, hselasky
Sponsored by: Netflix
Differential Revision: https://reviews.freebsd.org/D27000
struct nameidata mixes caller arguments, internal state and output, which
can be quite error prone.
Recent addition of valdiating ni_resflags uncovered a caller which could
repeatedly call namei, effectively operating on partially populated state.
Add bare minimium validation this does not happen. The real fix would
decouple aforementioned state.
Reported by: pho
Tested by: pho (different variant)
- Add a new send tag type for a send tag that supports both rate
limiting (packet pacing) and TLS offload (mostly similar to D22669
but adds a separate structure when allocating the new tag type).
- When allocating a send tag for TLS offload, check to see if the
connection already has a pacing rate. If so, allocate a tag that
supports both rate limiting and TLS offload rather than a plain TLS
offload tag.
- When setting an initial rate on an existing ifnet KTLS connection,
set the rate in the TCP control block inp and then reset the TLS
send tag (via ktls_output_eagain) to reallocate a TLS + ratelimit
send tag. This allocates the TLS send tag asynchronously from a
task queue, so the TLS rate limit tag alloc is always sleepable.
- When modifying a rate on a connection using KTLS, look for a TLS
send tag. If the send tag is only a plain TLS send tag, assume we
failed to allocate a TLS ratelimit tag (either during the
TCP_TXTLS_ENABLE socket option, or during the send tag reset
triggered by ktls_output_eagain) and ignore the new rate. If the
send tag is a ratelimit TLS send tag, change the rate on the TLS tag
and leave the inp tag alone.
- Lock the inp lock when setting sb_tls_info for a socket send buffer
so that the routines in tcp_ratelimit can safely dereference the
pointer without needing to grab the socket buffer lock.
- Add an IFCAP_TXTLS_RTLMT capability flag and associated
administrative controls in ifconfig(8). TLS rate limit tags are
only allocated if this capability is enabled. Note that TLS offload
(whether unlimited or rate limited) always requires IFCAP_TXTLS[46].
Reviewed by: gallatin, hselasky
Relnotes: yes
Sponsored by: Netflix
Differential Revision: https://reviews.freebsd.org/D26691
The calling process's process group can change between PROC_UNLOCK(p)
and PGRP_LOCK(pg) in tty_wait_background(), e.g. by a setpgid() call
from another process. If that happens, the signal is not sent to the
calling process, even if the prior checks determine that one should be
sent. Re-check that the process group hasn't changed after acquiring
the pgrp lock, and if it has, redo the checks.
PR: 250701
Submitted by: Jakub Piecuch <j.piecuch96@gmail.com>
MFC after: 2 weeks
Foundation copyrights, approved by emaste@. It does not include
files which carry other people's copyrights; if you're one
of those people, feel free to make similar change.
Reviewed by: emaste, imp, gbe (manpages)
Differential Revision: https://reviews.freebsd.org/D26980
All vnodes allocated by UMA are present on the global list used by
vnlru. getnewvnode modifies the state of the vnode (most notably
altering v_holdcnt) but never locks it. Moreover filesystems also
modify it in arbitrary manners sometimes before taking the vnode
lock or adding any other indicator that the vnode can be used.
Picking up such a vnode by vnlru would be problematic.
To that end there are 2 fixes:
- vlrureclaim, not recycling v_holdcnt == 0 vnodes, takes the
interlock and verifies that v_mount has been set. It is an
invariant that the vnode lock is held by that point, providing
the necessary serialisation against locking after vhold.
- vnlru_free_locked, only wanting to free v_holdcnt == 0 vnodes,
now makes sure to only transition the count 0->1 and newly allocated
vnodes start with v_holdcnt == VHOLD_NO_SMR. getnewvnode will only
transition VHOLD_NO_SMR->1 once more making the hold fail
Tested by: pho
Without the 'car limit' enabled (before this), running sequential ZFS scrub
on HDD without command queuing support, I've measured latency on concurrent
random reads reaching 4 seconds (surprised that not more). Enabling this
reduced the latency to 65 milliseconds, while scrub still doing ~180MB/s.
For disks with command queuing this does not make much difference (if any),
since most time all the requests are queued down to the disk or HBA, leaving
nothing in the queue to sort. And even if something does not fit, staying on
the queue, it is likely not for long. To not limit sorting in such bursty
scenarios I've added batched counter zeroing when the queue is getting empty.
The internal scheduler of the SAS HDD I was testing seems to be even more
loyal to random I/O, reducing the scrub speed to ~120MB/s. So in case
somebody worried this is limit is too strict -- it actually looks relaxed.
MFC after: 2 weeks
Sponsored by: iXsystems, Inc.
Before this GEOM passed bio pointer to transaction start, but not end.
It was irrelevant until devstat(9) got DTrace hooks, that appeared to
provide bio pointer on I/O completion, but not on submission.
MFC after: 2 weeks
Sponsored by: iXsystems, Inc.
o Add iommu_unmap_msi() to release the msi GAS entry.
o Provide default implementations for iommu init/deinit methods.
Reviewed by: kib
Sponsored by: Innovate DSbD
Differential Revision: https://reviews.freebsd.org/D26906
Ensure we also skip descendants of SKIP nodes when iterating through children
of an explicitly specified node.
Reported by: np
Reviewed by: np
MFC after: 1 week
Sponsored by: iXsystems, Inc.
Differential Revision: https://reviews.freebsd.org/D26833
Remove unused oidpp parameter from sysctl_sysctl_next_ls and
add high level comments to describe how it works.
No functional change.
Reviewed by: imp
MFC after: 1 week
Sponsored by: iXsystems, Inc.
Differential Revision: https://reviews.freebsd.org/D26854
r326145 corrected do_execve() to return EJUSTRETURN upon success so that
important registers are not clobbered. This had the side effect of tapping
out 'failures' for all *execve(2) audit records, which is less than useful
for auditing purposes.
Audit exec returns earlier, where we can know for sure that EJUSTRETURN
translates to success. Note that this unsets TDP_AUDITREC as we commit the
audit record, so the usual audit in the syscall return path will do nothing.
PR: 249179
Reported by: Eirik Oeverby <ltning-freebsd anduin net>
Reviewed by: csjp, kib
MFC after: 1 week
Sponsored by: Klara, Inc.
Differential Revision: https://reviews.freebsd.org/D26922