There are workloads with very bursty tid allocation and since unr tries very
hard to have small-sized bitmaps it keeps reallocating memory. Just doing
buildkernel gives almost 150k calls to free coming from unr.
This also gets rid of the hack which tried to postpone TID reuse.
Reviewed by: kib, markj
Tested by: pho
Differential Revision: https://reviews.freebsd.org/D27101
The intent is to replace the current id allocation method and a known upper
bound will be useful.
Reviewed by: kib (previous version), markj (previous version)
Tested by: pho
Differential Revision: https://reviews.freebsd.org/D27100
imgact_binmisc matches magic/mask from imgp->image_header, which is only a
single page in size mapped from the first page of an image. One can specify
an interpreter that matches on, e.g., --offset 4096 --size 256 to read up to
256 bytes past the mapped first page.
The limitation is that we cannot specify a magic string that exceeds a
single page, and we can't allow offset + size to exceed a single page
either. A static assert has been added in case someone finds it useful to
try and expand the size, but it does seem a little unlikely.
While this looks kind of exploitable at a sideways squinty-glance, there are
a couple of mitigating factors:
1.) imgact_binmisc is not enabled by default,
2.) entries may only be added by the superuser,
3.) trying to exploit this information to read what's mapped past the end
would be worse than a root canal or some other relatably painful
experience, and
4.) there's no way one could pull this off without it being completely
obvious.
The first page is mapped out of an sf_buf, the implementation of which (or
lack thereof) depends on your platform.
MFC after: 1 week
access the socket send or receive buffer. This is not possible for
listening sockets since r319722.
Because send()/recv() calls fail on listening sockets, fail also ioctl()
indicating EINVAL.
PR: 250366
Reported by: Yong-Hao Zou
Reviewed by: glebius, rscheff
MFC after: 1 week
Sponsored by: Netflix, Inc.
Differential Revision: https://reviews.freebsd.org/D26897
The offset we need to account for in the interpreter string comes in two
variants:
1. Fixed - macros other than #a that will not vary from invocation to
invocation
2. Variable - #a, which is substitued with the argv0 that we're replacing
Note that we don't have a mechanism to modify an existing entry. By
recording both of these offset requirements when the interpreter is added,
we can avoid some unnecessary calculations in the exec path.
Most importantly, we can know up-front whether we need to grab
calculate/grab the the filename for this interpreter. We also get to avoid
walking the string a first time looking for macros. For most invocations,
it's a swift exit as they won't have any, but there's no point entering a
loop and searching for the macro indicator if we already know there will not
be one.
While we're here, go ahead and only calculate the argv0 name length once per
invocation. While it's unlikely that we'll have more than one #a, there's no
reason to recalculate it every time we encounter an #a when it will not
change.
I have not bothered trying to benchmark this at all, because it's arguably a
minor and straightforward/obvious improvement.
MFC after: 1 week
This adds a dedicated counter updated with atomics when INVARIANTS
is used. As a side effect one can reliably determine the lock is held
for reading by at least one thread, but it's still not possible to
find out whether curthread has the lock in said mode.
This should be good enough in practice.
Problem spotted by avg.
This doesn't change anything at the moment since the out-of-order elements
were a pair of uint32_t, but future additions may have caused unnecessary
padding by following the existing precedent.
MFC after: 1 week
If we hadn't been traced in the first place when syscallenter()
started executing, we can ignore TDB_USERWR. TDB_USERWR can get set,
sure, but if it does, it's because the debugger raced with the syscall,
and it cannot depend on winning that race.
Reviewed by: kib
MFC after: 2 weeks
Sponsored by: EPSRC
Differential Revision: https://reviews.freebsd.org/D26585
This module handles relatively few execs (initial qemu-user-static, then
qemu-user-static handles exec'ing itself for binaries it's already running),
but all execs pay the price of at least taking the relatively expensive
sx/slock to check for a match when this module is loaded. Future work will
almost certainly swap this out for another lock, perhaps an rmslock.
The RLOCK/WLOCK phrasing was chosen based on what the callers are really
wanting, rather than using the verbiage typically appropriate for an sx.
MFC after: 1 week
We may want to reserve bits in the future for kernel-only use, so start
rejecting any that aren't the two that we're currently expecting from
userland.
MFC after: 1 week
Previously, non-preemptible epochs could not check; in_epoch() would always
fail, usually because non-preemptible epochs don't imply THREAD_NO_SLEEPING.
For default epochs, it's easy enough to verify that we're in the given
epoch: if we're in a critical section and our record for the given epoch
is active, then we're in it.
This patch also adds some additional INVARIANTS bookkeeping. Notably, we set
and check the recorded thread in epoch_enter/epoch_exit to try and catch
some edge-cases for the caller. It also checks upon freeing that none of the
records had a thread in the epoch, which may make it a little easier to
diagnose some improper use if epoch_free() took place while some other
thread was inside.
This version differs slightly from what was just previously reviewed by the
below-listed, in that in_epoch() will assert that no CPU has this thread
recorded even if it *is* currently in a critical section. This is intended
to catch cases where the caller might have somehow messed up critical
section nesting, we can catch both if they exited the critical section or if
they exited, migrated, then re-entered (on the wrong CPU).
Reviewed by: kib, markj (both previous version)
MFC after: 1 week
Differential Revision: https://reviews.freebsd.org/D27098
Notably, streamline error paths through the existing 'done' label, making it
easier to quickly verify correct cleanup.
Future work might add a kernel-only flag to indicate that a interpreter uses
#a. Currently, all executions via imgact_binmisc pay the penalty of
constructing sname/fname, even if they will not use it. qemu-user-static
doesn't need it, the stock rc script for qemu-user-static certainly doesn't
use it, and I suspect these are the vast majority of (if not the only)
current users.
MFC after: 1 week
According to code comments the original motivation was to allow for
malloc_type_internal changes without ABI breakage. This can be trivially
accomplished by providing spare fields and versioning the struct, as
implemented in the patch below.
The upshots are one less memory indirection on each alloc and disappearance
of mt_zone.
Reviewed by: markj
Differential Revision: https://reviews.freebsd.org/D27104
This ensures that no writes are pending in memory, either metadata or
user data, but not including dirty pages not yet converted to fs writes.
Only filesystems declared local are suspended.
Note that this does not guarantee absence of the metadata errors or
leaks if resume is not done: for instance, on UFS unlinked but opened
inodes are leaked and require fsck to gc.
Reviewed by: markj
Discussed with: imp
Tested by: imp (previous version), pho
Sponsored by: The FreeBSD Foundation
MFC after: 2 weeks
Differential revision: https://reviews.freebsd.org/D27054
Sample usage: kernel modules can decide whether to stick to malloc or
create their own zone.
Reviewed by: markj
Differential Revision: https://reviews.freebsd.org/D27097
The 2 provided zones had inconsistent naming between each other
("int" and "64") and other allocator zones (which use bytes).
Follow malloc by naming them "pcpu-" + size in bytes.
This is a step towards replacing ad-hoc per-cpu zones with
general slabs.
On a sample box vmstat -z shows:
ITEM SIZE LIMIT USED FREE REQ
64: 64, 0, 1043784, 4367538,3698187229
selfd: 64, 0, 1520, 13726,182729008
But at the same time:
vm.uma.selfd.keg.domain.1.pages: 121
vm.uma.selfd.keg.domain.0.pages: 121
Thus 242 pages got pulled even though the malloc zone would likely accomodate
the load without using extra memory.
- Removed a bunch of redundant headers
- Don't explicitly initialize to 0
- The !error check prior to setting imgp->interpreter_name is redundant, all
error paths should and do return or go to 'done'. We have larger problems
otherwise.
Linux allows polling without any events specified and it happens to be the case
in FreeBSD as well. POLLHUP has to be delivered regardless of the event mask
and this works fine if the condition is already present. However, if it is
missing, selrecord is only called if the eventmask has relevant bits set. This
in particular leads to a conditon where pipe_poll can return 0 events and
neglect to selrecord, while kern_poll takes it as an indication it has to go to
sleep, but then there is nobody to wake it up.
While the problem seems systemic to *_poll handlers the least we can do is fix
it up for pipes.
Reported by: Jeremie Galarneau <jeremie.galarneau at efficios.com>
Reviewed by: kib
Differential Revision: https://reviews.freebsd.org/D27094
Previously the code had one wait channel for all pending writers.
This could result in a buggy scenario where after a writer switches
the lock mode form readers to writers goes off CPU, another writer
queues itself and then the last reader wakes up the latter instead
of the former.
Use a separate channel.
While here add features to reliably detect whether curthread has
the lock write-owned. This will be used by ZFS.
This is mostly mechanical except for vmspace_exit(). There, use the new
refcount_release_if_last() to avoid switching to vmspace0 unless other
processes are sharing the vmspace. In that case, upon switching to
vmspace0 we can unconditionally release the reference.
Remove the volatile qualifier from vm_refcnt now that accesses are
protected using refcount(9) KPIs.
Reviewed by: alc, kib, mmel
MFC after: 1 month
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D27057
Alter shmget_allocate_segment and shmget_existing to take the values
they want from struct shmget_args rather than passing the struct
around. In general, uap structures should only be the interface to
sys_<foo> functions.
This makes on small functional change and records the allocated space
rather than the requested space. If this turns out to be a problem (e.g.
if software tries to find undersized segments by exact size rather than
using keys), we can correct that easily.
Reviewed by: kib
Obtained from: CheriBSD
MFC after: 1 week
Sponsored by: DARPA
Differential Revision: https://reviews.freebsd.org/D27077
This option is intended to be semantically identical to Linux's
SOL_SOCKET:SO_PASSCRED. For now, it is mutually exclusive with the
pre-existing sockopt SOL_LOCAL:LOCAL_CREDS.
Reviewed by: markj (penultimate version)
Differential Revision: https://reviews.freebsd.org/D27011
This sysctl value had been provided as a read-only variable that is
compiled into the C library based on the value of _PATH_LOCALBASE in
paths.h.
After this change, the value is compiled into the kernel as an empty
string, which is translated to _PATH_LOCALBASE by the C library.
This empty string can be overridden at boot time or by a privileged
user at run time and will then be returned by sysctl.
When set to an empty string, the value returned by sysctl reverts to
_PATH_LOCALBASE.
This update does not change the behavior on any system that does
not modify the default value of user.localbase.
I consider this change as experimental and would prefer if the run-time
write permission was reconsidered and the sysctl variable defined with
CLFLAG_RDTUN instead to restrict it to be set at boot time.
MFC after: 1 month
It is almost never needed and adds an avoidable branch.
While here do minior clean ups in preparation for larger changes.
Reviewed by: markj
Differential Revision: https://reviews.freebsd.org/D27019
The value is provided by the C library as for other sysctl variables in
the user tree. It is compiled in and returns the value of _PATH_LOCALBASE
defined in paths.h.
Reviewed by: imp, scottl
Differential Revision: https://reviews.freebsd.org/D27009
This gives a more uniform API for send tag life cycle management.
Reviewed by: gallatin, hselasky
Sponsored by: Netflix
Differential Revision: https://reviews.freebsd.org/D27000
struct nameidata mixes caller arguments, internal state and output, which
can be quite error prone.
Recent addition of valdiating ni_resflags uncovered a caller which could
repeatedly call namei, effectively operating on partially populated state.
Add bare minimium validation this does not happen. The real fix would
decouple aforementioned state.
Reported by: pho
Tested by: pho (different variant)
- Add a new send tag type for a send tag that supports both rate
limiting (packet pacing) and TLS offload (mostly similar to D22669
but adds a separate structure when allocating the new tag type).
- When allocating a send tag for TLS offload, check to see if the
connection already has a pacing rate. If so, allocate a tag that
supports both rate limiting and TLS offload rather than a plain TLS
offload tag.
- When setting an initial rate on an existing ifnet KTLS connection,
set the rate in the TCP control block inp and then reset the TLS
send tag (via ktls_output_eagain) to reallocate a TLS + ratelimit
send tag. This allocates the TLS send tag asynchronously from a
task queue, so the TLS rate limit tag alloc is always sleepable.
- When modifying a rate on a connection using KTLS, look for a TLS
send tag. If the send tag is only a plain TLS send tag, assume we
failed to allocate a TLS ratelimit tag (either during the
TCP_TXTLS_ENABLE socket option, or during the send tag reset
triggered by ktls_output_eagain) and ignore the new rate. If the
send tag is a ratelimit TLS send tag, change the rate on the TLS tag
and leave the inp tag alone.
- Lock the inp lock when setting sb_tls_info for a socket send buffer
so that the routines in tcp_ratelimit can safely dereference the
pointer without needing to grab the socket buffer lock.
- Add an IFCAP_TXTLS_RTLMT capability flag and associated
administrative controls in ifconfig(8). TLS rate limit tags are
only allocated if this capability is enabled. Note that TLS offload
(whether unlimited or rate limited) always requires IFCAP_TXTLS[46].
Reviewed by: gallatin, hselasky
Relnotes: yes
Sponsored by: Netflix
Differential Revision: https://reviews.freebsd.org/D26691
The calling process's process group can change between PROC_UNLOCK(p)
and PGRP_LOCK(pg) in tty_wait_background(), e.g. by a setpgid() call
from another process. If that happens, the signal is not sent to the
calling process, even if the prior checks determine that one should be
sent. Re-check that the process group hasn't changed after acquiring
the pgrp lock, and if it has, redo the checks.
PR: 250701
Submitted by: Jakub Piecuch <j.piecuch96@gmail.com>
MFC after: 2 weeks
Foundation copyrights, approved by emaste@. It does not include
files which carry other people's copyrights; if you're one
of those people, feel free to make similar change.
Reviewed by: emaste, imp, gbe (manpages)
Differential Revision: https://reviews.freebsd.org/D26980
All vnodes allocated by UMA are present on the global list used by
vnlru. getnewvnode modifies the state of the vnode (most notably
altering v_holdcnt) but never locks it. Moreover filesystems also
modify it in arbitrary manners sometimes before taking the vnode
lock or adding any other indicator that the vnode can be used.
Picking up such a vnode by vnlru would be problematic.
To that end there are 2 fixes:
- vlrureclaim, not recycling v_holdcnt == 0 vnodes, takes the
interlock and verifies that v_mount has been set. It is an
invariant that the vnode lock is held by that point, providing
the necessary serialisation against locking after vhold.
- vnlru_free_locked, only wanting to free v_holdcnt == 0 vnodes,
now makes sure to only transition the count 0->1 and newly allocated
vnodes start with v_holdcnt == VHOLD_NO_SMR. getnewvnode will only
transition VHOLD_NO_SMR->1 once more making the hold fail
Tested by: pho
Without the 'car limit' enabled (before this), running sequential ZFS scrub
on HDD without command queuing support, I've measured latency on concurrent
random reads reaching 4 seconds (surprised that not more). Enabling this
reduced the latency to 65 milliseconds, while scrub still doing ~180MB/s.
For disks with command queuing this does not make much difference (if any),
since most time all the requests are queued down to the disk or HBA, leaving
nothing in the queue to sort. And even if something does not fit, staying on
the queue, it is likely not for long. To not limit sorting in such bursty
scenarios I've added batched counter zeroing when the queue is getting empty.
The internal scheduler of the SAS HDD I was testing seems to be even more
loyal to random I/O, reducing the scrub speed to ~120MB/s. So in case
somebody worried this is limit is too strict -- it actually looks relaxed.
MFC after: 2 weeks
Sponsored by: iXsystems, Inc.
Before this GEOM passed bio pointer to transaction start, but not end.
It was irrelevant until devstat(9) got DTrace hooks, that appeared to
provide bio pointer on I/O completion, but not on submission.
MFC after: 2 weeks
Sponsored by: iXsystems, Inc.
o Add iommu_unmap_msi() to release the msi GAS entry.
o Provide default implementations for iommu init/deinit methods.
Reviewed by: kib
Sponsored by: Innovate DSbD
Differential Revision: https://reviews.freebsd.org/D26906
Ensure we also skip descendants of SKIP nodes when iterating through children
of an explicitly specified node.
Reported by: np
Reviewed by: np
MFC after: 1 week
Sponsored by: iXsystems, Inc.
Differential Revision: https://reviews.freebsd.org/D26833
Remove unused oidpp parameter from sysctl_sysctl_next_ls and
add high level comments to describe how it works.
No functional change.
Reviewed by: imp
MFC after: 1 week
Sponsored by: iXsystems, Inc.
Differential Revision: https://reviews.freebsd.org/D26854
r326145 corrected do_execve() to return EJUSTRETURN upon success so that
important registers are not clobbered. This had the side effect of tapping
out 'failures' for all *execve(2) audit records, which is less than useful
for auditing purposes.
Audit exec returns earlier, where we can know for sure that EJUSTRETURN
translates to success. Note that this unsets TDP_AUDITREC as we commit the
audit record, so the usual audit in the syscall return path will do nothing.
PR: 249179
Reported by: Eirik Oeverby <ltning-freebsd anduin net>
Reviewed by: csjp, kib
MFC after: 1 week
Sponsored by: Klara, Inc.
Differential Revision: https://reviews.freebsd.org/D26922
platforms.
This allows to not depend on the IOMMU macro in AHCI driver.
Requested by: kib
Suggested by: andrew
Reviewed by: kib
Sponsored by: Innovate DSbD
Differential Revision: https://reviews.freebsd.org/D26887
The previous scheme for calculating the total size was doing sizeof
on the struct and then adding the wanted space for the buffer.
nc_name is at offset 58 while sizeof(struct namecache) is 64.
With CACHE_PATH_CUTOFF of 39 bytes and 1 byte of padding we were
allocating 104 bytes for the entry and never accounting for the 6
byte padding, wasting that space.
It no longer protects any of tested fields, keeping all the checks racy.
While here make vtryrecycle drop the vnode on its own. Avoids an additional
lock trip.
mkdir -p /foo/bar/baz will mkdir each path component and ignore EEXIST.
The NOCACHE lookup will make the namecache unnecessarily evict the existing entry,
and then fallback to the fs lookup routine eventually leading namei to return an
error as the directory is already there.
For invocations like mkdir -p /usr/obj/usr/src/sys/GENERIC/modules this triggers
fallbacks to the slowpath for concurrently executing lookups.
Tested by: pho
Discussed with: kib
Size of the per-process semaphore undo structure (semusz) depends on
the number of the per-process undos. If kern.ipc.semume is adjusted,
semusz must be adjusted as well, and it makes no sense to delegate
adjustment to user. Make it automatic.
Reported and tested by: Olef <o.vandestadt@gmail.com>
PR: 250361
Reviewed by: jhb, markj
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
Differential revision: https://reviews.freebsd.org/D26826
Instead, add arguments to vmapbuf. Since this argument is
always a pointer use a type of void * and cast to vm_offset_t in
vmapbuf. (In CheriBSD we've altered vm_fault_quick_hold_pages to
take a pointer and check its bounds.)
In no other situtation does b_data contain a user pointer and vmapbuf
replaces b_data with the actual mapping.
Suggested by: jhb
Reviewed by: imp, jhb
Obtained from: CheriBSD
MFC after: 1 week
Sponsored by: DARPA
Differential Revision: https://reviews.freebsd.org/D26784
It is a common pattern for filesystems' VOP_INACTIVE() implementation
to forcibly reclaim the vnode when its state is final. For instance,
UFS vnode with zero link count is removed, and since it is
inactivated, the last open reference on it is dropped.
On the other hand, vnode might get spurious usecount reference for
many reasons. If the spurious reference exists while vgonel() checks
for active state of the vnode, it would recurse into VOP_INACTIVE().
Fix it by checking and not doing inactivation when vgone() was called
from inactive VOP.
Reported and tested by: pho
Discussed with: mjg
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
During tinderbox and similar workloads negative entries get at least one
hit before they get evicted. In the current scheme this avoidably promotes
them.
Be conservative and stick to 2 hits for now.
The TF_TOE flag is the check used in the rest of the network stack to
determine if TOE is active on a socket. There is at least one path in
the cxgbe(4) TOE driver that can leave the tod pointer non-NULL on a
socket not using TOE.
Reported by: Sony Arpita Das <sonyarpitad@chelsio.com>
Reviewed by: np
Sponsored by: Chelsio Communications
Differential Revision: https://reviews.freebsd.org/D26803
This will cause the VM to back sufficiently large .text sections, such
as those in zfs.ko or amdgpu.ko on amd64, with superpage mappings when
possible.
Reviewed by: alc, kib
MFC after: 1 week
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D26802
BT_MAXALLOC (4) is the number of boundary tags required to complete an
allocation in the worst case: two to clip a free segment, and two to
import from a parent arena. vmem_xalloc() preallocates four boundary
tags before attempting a search to simplify the segment allocation code.
It implements a loop that:
1) ensures that BT_MAXALLOC boundary tags are available,
2) attempts to find and clip a free segment satisfying the allocation
constraints, and failing that,
3) attempts to import a segment.
On !UMA_MD_SMALL_ALLOC platforms the btag zone has to handle recusion:
it needs boundary tags to allocate boundary tags. Thus we reserve
2 * BT_MAXALLOC * mp_ncpus tags for use when recursing: the factor of 2
is because there are two layers of vmem arenas, the per-domain arena and
global arena. For a single thread, 2 * BT_MAXALLOC tags should be
sufficient.
Because of the way the loop is structured, BT_MAXALLOC tags are not
sufficient. The first bt_fill() call may allocate BT_MAXALLOC tags,
then import a segment (consuming two tags), then attempt to top up the
preallocation before carving into the imported free segment, thus
requiring up to six tags in the worst case. Because we don't
preallocate that many, this bug can cause deadlocks in rare scenarios.
Fix the problem by moving the preallocation out the loop. This assumes
that only a single import is ever required to satisfy an allocation
request.
Thanks to manu, emaste and lwhsu for helping test debug patches.
Reported by: Jenkins (hardware CI lab)
Reviewed by: alc, kib, rlibby
MFC after: 2 weeks
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D26770
This allows the interrupt controller driver only need a small change to
create a map for the page the device will write to raise an interrupt.
Submitted by: andrew
Reviewed by: kib
Sponsored by: Innovate DSbD
Differential Revision: https://reviews.freebsd.org/D26705
Add a show sysinit command to ddb (similar to show vnet_sysinit) which
proved to be helpful to debug some ordering issues on early-mid kernel
start panics.
The previous scheme only looked at negative entry count in relation to the
total count, leading to tons of spurious evictions if the cache is not
significantly populated.
Instead, only try the above if negative entry count goes beyond namecache
capacity.
Split everything into neg, debug, param and stat categories.
The legacy nchstats sysctl (queried e.g., by systat) remains untouched.
While here rename some vars to be easier on the eye.
- declutter sysctl vfs.cache by moving relevant entries into
vfs.cache.neg
- add a little more parallelism to eviction by replacing the
global lock with an atomically modified counter
- track more statistics
The code needs further effort.
This simplifies the code while allowing for concurrent negative eviction
down the road.
Cache misses increased slightly due to higher rate of evictions allowed by
the change.
The current algorithm remains too aggressive.
Only assign the address from the iovec to bio_data if it is a kernel
address. This was the single place where bio_data stored (however
briefly) a userspace pointer.
Reviewed by: imp, markj
Obtained from: CheriBSD
MFC after: 1 week
Sponsored by: DARPA
Differential Revision: https://reviews.freebsd.org/D26783
Move dump_avail[] extern declaration and inlines into a new header
vm/vm_dumpset.h. This fixes default gcc build for mips.
Reviewed by: alc, scottph
Tested by: kevans (previous version)
Sponsored by: The FreeBSD Foundation
Differential revision: https://reviews.freebsd.org/D26741
Due to a weakness in the TLS 1.0 protocol, OpenSSL will periodically
send empty TLS records ("empty fragments"). These TLS records have no
payload (and thus a page count of zero). m_uiotombuf_nomap() was
returning NULL instead of an empty mbuf, and a few places needed to be
updated to treat an empty TLS record as having a page count of "1" as
0 means "no work to do" (e.g. nothing to encrypt, or nothing to mark
ready via sbready()).
Reviewed by: gallatin
Sponsored by: Netflix
Differential Revision: https://reviews.freebsd.org/D26729
Both s_len and s_size are ssize_t, so their differece is also more
properly a ssize_t not a size_t. Also, assert that len is <= size when
we enter. This should always be the case. Ensure that we have that one
byte that we write to the end of the buffer before we do so, though
the error should already be set on the buffer if not, and the only
times we supply 'partial' buffers they should be plenty large.
Reviewed by: cem, jhb (prior version, I did cem's suggestion)
Differential Revsion: https://reviews.freebsd.org/D26752
Push the root seed version to userspace through the VDSO page, if
the RANDOM_FENESTRASX algorithm is enabled. Otherwise, there is no
functional change. The mechanism can be disabled with
debug.fxrng_vdso_enable=0.
arc4random(3) obtains a pointer to the root seed version published by
the kernel in the shared page at allocation time. Like arc4random(9),
it maintains its own per-process copy of the seed version corresponding
to the root seed version at the time it last rekeyed. On read requests,
the process seed version is compared with the version published in the
shared page; if they do not match, arc4random(3) reseeds from the
kernel before providing generated output.
This change does not implement the FenestrasX concept of PCPU userspace
generators seeded from a per-process base generator. That change is
left for future discussion/work.
Reviewed by: kib (previous version)
Approved by: csprng (me -- only touching FXRNG here)
Differential Revision: https://reviews.freebsd.org/D22839
Truncating requires an exclusive lock, but it was not taken if the
filesystem indicates support for shared writes. This only concerns
ZFS.
In particular fixes cp of files which have trailing holes.
Reported by: bdrewery
The module handler code invokes a MOD_UNLOAD event immediately if
MOD_LOAD fails. The result was that if seminit() failed, semunload()
was invoked twice. semunload() is not idempotent however and would
try to remove it's process_exit eventhandler twice resulting in a
panic.
Reviewed by: kib, markj
Obtained from: CheriBSD
MFC after: 1 month
Sponsored by: DARPA
Differential Revision: https://reviews.freebsd.org/D26696
Use of dead_vnodeops would result in a panic instead of returning the intended
EOPNOTSUPP error.
While here make sure to abort, not just try to return a partial result.
The former allows the regular lookup to restart from scratch, while the latter
makes it stuck with an unusable vnode.
Reported by: kevans
Without this patch, when vn_generic_copy_file_range() is
doing a large copy, it will remain in the function for a
considerable amount of time, delaying handling of any
outstanding signals until the copy completes.
This patch adds checks for signals that need to be
processed after each successful data copy cycle.
When sig_intr() returns non-zero, vn_generic_copy_file_range()
will return.
The check "if (len < savlen)" ensures that some data
has been copied, so that progress will be made.
Note that, since copy_file_range(2) is allowed to
return fewer bytes copied than requested, it
will never return EINTR/ERESTART when sig_intr()
returns non-zero.
Reviewed by: kib, asomers
Differential Revision: https://reviews.freebsd.org/D26620
Check td_flags for relevant AST requests lock-less. This opens the
race slightly wider where sig_intr() returns false negative, but might
be it is worth it.
Requested by: mjg
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
Specifically, if lookup() returned any error and the topping directory
was not latched, which means that (non-existent) path did not returned
to the topping location, give ENOTCAPABLE a priority over the lookup()
error.
PR: 249960
Reviewed by: emaste, ngie
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
Differential revision: https://reviews.freebsd.org/D26695
The correct way to identify the end of the metadata is two adjacent
entries set to zero/MODINFO_END. I made a typo and this was checking the
first entry twice.
Reported by: rpokala
Sponsored by: NetApp, Inc.
Sponsored by: Klara, Inc.
The boot metadata (also referred to as modinfo, or preload metadata)
provides information about the size and location of the kernel,
pre-loaded modules, and other metadata (e.g. the EFI framebuffer) to be
consumed during by the kernel during early boot. It is encoded as a
series of type-length-value entries and is usually constructed by
loader(8) and passed to the kernel. It is also faked on some
architectures when booted by other means.
Although much of the module information is available via kldstat(8),
there is no easy way to debug the metadata in its entirety. Add some
routines to parse this data and allow it to be printed to the console
during early boot or output via a sysctl.
Since the output can be lengthly, printing to the console is gated
behind the debug.dump_modinfo_at_boot kenv variable as well as the
BOOTVERBOSE flag. The sysctl to print the metadata is named
debug.dump_modinfo.
Reviewed by: tsoome
Sponsored by: NetApp, Inc.
Sponsored by: Klara, Inc.
Differential Revision: https://reviews.freebsd.org/D26687
It is possible for elf_reloc_local() to fail in the unlikely case of
an unsupported relocation type. If this occurs, do not continue to
process the file.
Reviewed by: kib, markj (earlier version)
MFC after: 1 week
Sponsored by: NetApp, Inc.
Sponsored by: Klara, Inc.
Differential Revision: https://reviews.freebsd.org/D26701
The kernel globals for kenv are confined to 2 files that need them and
a few that likely shouldn't (but as written the code does). Move them
from sys/systm.h to sys/kenv.h. This removed a XXX from systm.h and
cleans it up a little bit...
The prototype was added with the creation of kern_shutdown.c in r17658,
but it appears to have never been implemented. Remove it now.
Reviewed by: cem, kib
Differential Revision: https://reviews.freebsd.org/D26702
Both cxgbe(4) and mlx5(4) wrapped the existing send tag header with
their own identical headers that stored the type that the
type-specific tag structures inherited from, so in practice it seems
drivers need this in the tag anyway. This permits removing these
extra header indirections (struct cxgbe_snd_tag and struct
mlx5e_snd_tag).
In addition, this permits driver-independent code to query the type of
a tag, e.g. to know what type of tag is being queried via
if_snd_query.
Reviewed by: gallatin, hselasky, np, kib
Sponsored by: Netflix
Differential Revision: https://reviews.freebsd.org/D26689
Add an "nextnoskip" sysctl that allows for listing of sysctls intended to be
normally skipped for cost reasons.
This makes it so the names/descriptions of those sysctls can be discovered with
sysctl -aN/sysctl -ad/sysctl -at.
It also makes it so children are visited when a node flagged with CTLFLAG_SKIP
is explicitly requested.
The intended use case is to mark the root "kstat" node with CTLFLAG_SKIP so that
the extensive and expensive stats are skipped by default but may still be easily
obtained without having to know them all (which may not even be possible) and
request each one-by-one.
Reviewed by: jhb
MFC after: 2 weeks
Relnotes: yes
Sponsored by: iXsystems, Inc.
Differential Revision: https://reviews.freebsd.org/D26560
Since the code exits smr section prior to calling pwd_hold, the used
pwd can be freed and a new one allocated with the same address, making
the comparison erroneously true.
Note it is very unlikely anyone ran into it.
It gives the answer would the thread sleep according to current state
of signals and suspensions. Of course the answer is racy and allows
for false-negatives (no sleep when signal is delivered after process
lock is dropped). Also the answer might change due to signal
rescheduling among threads in multi-threaded process.
Still it is the best approximation I can provide, to answering the
question was the thread interrupted.
Reviewed by: markj
Tested by: pho, rmacklem
Sponsored by: The FreeBSD Foundation
MFC after: 2 weeks
Differential revision: https://reviews.freebsd.org/D26628
- Extract suspension check into sig_ast_checksusp() helper.
- Extract signal check and calculation of the interruption errno into
sig_ast_needsigchk() helper.
The helpers are moved to kern_sig.c which is the proper place for
signal-related code.
Improve control flow in sleepq_catch_signals(), to handle ret == 0
(can sleep) and ret != 0 (interrupted) only once, by separating
checking code into sleepq_check_ast_sq_locked(), which return value is
interpreted at single location.
Reviewed by: markj
Tested by: pho
Sponsored by: The FreeBSD Foundation
MFC after: 2 weeks
Differential revision: https://reviews.freebsd.org/D26628
detour - it writes ktrace entries to the filesystem - so the overhead
of ast() won't make any difference.
Reviewed by: kib
Sponsored by: DARPA
Differential Revision: https://reviews.freebsd.org/D26404
Currently we allocate and map zero-filled anonymous pages when dumping
core. This can result in lots of needless disk I/O and page
allocations. This change tries to make the core dumper more clever and
represent unbacked ranges of virtual memory by holes in the core dump
file.
Add a new page fault type, VM_FAULT_NOFILL, which causes vm_fault() to
clean up and return an error when it would otherwise map a zero-filled
page. Then, in the core dumper code, prefault all user pages and handle
errors by simply extending the size of the core file. This also fixes a
bug related to the fact that vn_io_fault1() does not attempt partial I/O
in the face of errors from vm_fault_quick_hold_pages(): if a truncated
file is mapped into a user process, an attempt to dump beyond the end of
the file results in an error, but this means that valid pages
immediately preceding the end of the file might not have been dumped
either.
The change reduces the core dump size of trivial programs by a factor of
ten simply by excluding unaccessed libc.so pages.
PR: 249067
Reviewed by: kib
Tested by: pho
MFC after: 1 month
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D26590
OBJT_DEFAULT, _SWAP, _VNODE and _PHYS is exactly the set of
non-fictitious object types, so just check for OBJ_FICTITIOUS. The
check no longer excludes dead objects, but such objects have to be
handled regardless.
No functional change intended.
Reviewed by: alc, dougm, kib
Tested by: pho
MFC after: 2 weeks
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D26589
hole size boundary.
By clipping the len argument of vn_generic_copy_file_range() to end at
an exact multiple of hole size, holes are more likely to be maintained
during the copy.
A hole can still straddle the boundary at the end of the
copy range, resulting in a block being allocated in the
output file as it is being grown in size, but this will reduce the
likelyhood of this happening.
While here, also modify setting of blksize to better handle the
case where _PC_MIN_HOLE_SIZE is returned as 1.
Reviewed by: asomers
Differential Revision: https://reviews.freebsd.org/D26570
A user pointer is not a suitable value for bio_data and the next block
of code always overwrites bio_data anyway. Just use cb->aio_buf
directly in the call to vm_fault_quick_hold_pages().
Reviewed by: kib
Obtained from: CheriBSD
MFC after: 1 month
Sponsored by: DARPA
Differential Revision: https://reviews.freebsd.org/D26595
Without this patch, if a call to copy_file_range(2) specifies an input file
offset + len that would wrap around, EINVAL is returned.
I thought that was the Linux behaviour, but recent testing showed that
Linux accepts this case and does the copy_file_range() to EOF.
This patch changes the FreeBSD code to exhibit the same behaviour as
Linux for this case.
Reviewed by: asomers, kib
Differential Revision: https://reviews.freebsd.org/D26569
Until we can do proper /etc/rc output on both consoles in multicons
boot (or all of them if we ever generalize), report when we are
booting multicons. Also report the primary console. This will be a big
hint why output stops after this line (though some slow USB discovery
still happens after mountroot / init starts).
Reviewed by: scottl@, tsoome@
Differential Revision: https://reviews.freebsd.org/D26574
It is only ever called for negative entries and for those it is
just a wrapper around cache_zap_negative_locked_vnode_kl which
always succeeds.
This also fixes a bug where cache_lookup_fallback should have been
calling cache_zap_locked_bucket instead. Note that in order to trigger
the bug NOCACHE must not be set, which currently only happens when
creating a new coredump (and then the coredump-to-be has to have a
negative entry).
and current address space is already destroyed, so kern_execve()
terminates the process.
While there, clean up some internals of post_execve() inlined in init_main.
Reported by: Peter <pmc@citylink.dinoex.sub.org>
Reviewed by: markj
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
Differential revision: https://reviews.freebsd.org/D26525
The entire cache scan was a leftover from the old implementation.
It is incredibly wasteful in presence of several mount points and does not
win much even for single ones.
It is like O_BENEATH, but disables to walk out of the subtree rooted
in the starting directory. O_BENEATH does not care if path walks out
if it returned.
Requested by: Dan Gohman <dev@sunfishcode.online>
PR: 248335
Reviewed by: markj
Tested by: pho
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
Differential revision: https://reviews.freebsd.org/D25886
Do not care if path walks out of the topping directory if it returns back.
Requested and reviewed by: markj
Tested by: pho
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
Differential revision: https://reviews.freebsd.org/D25886
not unconditionally on any dotdot component.
Reviewed by: markj
Tested by: pho
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
Differential revision: https://reviews.freebsd.org/D25886
the helper to calculate namei flags both for open(2) and creat(2).
Suggested and reviewed by: markj
Tested by: pho
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
Differential revision: https://reviews.freebsd.org/D25886
the helper to convert AT_ flags for *at() syscalls to namei flags.
Reviewed by: markj
Tested by: pho
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
Differential revision: https://reviews.freebsd.org/D25886
Stop abusing internal namei flag NI_LCF_STRICTRELATIVE as indicator of
cap-restricted lookup. Add designated returned flag NIRES_STRICTREL
to inform kern_openat() that lookup was restricted.
Reviewed by: markj
Tested by: pho
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
Differential revision: https://reviews.freebsd.org/D25886
This makes open("/a/../a", O_BENEATH) with cwd == "/a" work.
Reviewed by: markj
Reported and tested by: pho
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
Differential revision: https://reviews.freebsd.org/D25886
Explain why tracker is needed at all.
Reviewed by: markj
Tested by: pho
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
Differential revision: https://reviews.freebsd.org/D25886
A small example of how these functions can be used to simplify checks of
this nature.
Sponsored by: Klara, Inc.
Differential Revision: https://reviews.freebsd.org/D26271
This adds the getenv_bool() function, to parse a boolean value from a
kernel environment variable or tunable. This works for traditional
boolean values like "0" and "1", and also "true" and "false"
(case-insensitive). These semantics do not yet apply to sysctls declared
using SYSCTL_BOOL with CTLFLAG_TUN (they still only parse 1 and 0).
Also added are two wrapper functions, getenv_is_true() and
getenv_is_false(). These are slightly simpler for callers wishing to
perform a single check of a configuration variable.
Reviewed by: jhb (slightly earlier version)
Sponsored by: NetApp, Inc.
Sponsored by: Klara, Inc.
Differential Revision: https://reviews.freebsd.org/D26270
One problem with the bus_space_read_N() and bus_space_write_N() family of
functions is that they provide no protection against exceptions which can
occur when no physical hardware or device responds to the read or write
cycles. In such a situation, the system typically would panic due to a
kernel-mode bus error. The bus_space_peek_N() and bus_space_poke_N() family
of functions provide a mechanism to handle these exceptions gracefully
without the risk of crashing the system.
Typical example is access to PCI(e) configuration space in bus enumeration
function on badly implemented PCI(e) root complexes (RK3399 or Neoverse
N1 N1SDP and/or access to PCI(e) register when device is in deep sleep state.
This commit adds a real implementation for arm64 only. The remaining
architectures have bus_space_peek()/bus_space_poke() emulated by using
bus_space_read()/bus_space_write() (without exception handling).
MFC after: 1 month
Reviewed by: kib
Differential Revision: https://reviews.freebsd.org/D25371
vm_ooffset_t is now unsigned. Remove some tests for negative values,
or make other adjustments accordingly.
Reported by: Coverity
Reviewed by: kib markj
Sponsored by: Dell EMC Isilon
Differential Revision: https://reviews.freebsd.org/D26214
Change the zone setup:
- Allow slabs to be returned to the OS
- Set the number of slots to the max devctl will queue before discarding
- Reserve 2% of the max (capped at 100) for low memory allocations
- Disable per-cpu caching since we don't need it and we avoid some pathologies
Change the alloation strategiy a bit:
- If a normal allocation fails, try to get the reserve
- If a reserve allocation fails, re-use the oldest-queued entry for storage
- If there's a weird race/failure and nothing on the queue to steal, return NULL
This addresses two main issues in the old code:
- If devd had died, and we're generating a lot of messages, we have an
unbounded leak. This new scheme avoids the issue that lead to this.
- The MPASS that was 'sure' the allocation couldn't have failed turned out
to be wrong in some rare cases. The new code doesn't make this assumption.
Since we reserve only 2% of the space, we go from about 1MB of
allocation all the time to more like 50kB for the reserve.
Reviewed by: markj@
Differential Revision: https://reviews.freebsd.org/D26448
Both before and after job control adjustments.
Tested by: pho
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
Differential revision: https://reviews.freebsd.org/D26416
Orphans affect job control state, we must account for them when
changing pg_jobc.
Instead of p_pptr, use proc_realparent() to get parent relevant for
job control.
Use correct calculation of the parent for exiting process. For jobc
purposes, we must use realparent, but if it is also exiting, we should
fall to reaper, then recursively find non-exiting reaper.
Reported by: trasz
PR: 249257
Tested by: pho
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
Differential revision: https://reviews.freebsd.org/D26416
Use ddb pager.
Make lines more compact.
Eliminate unneeded casts.
Print more job-control related info when reporting process group.
Tested by: pho
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
Differential revision: https://reviews.freebsd.org/D26416
There are several negative side-effects of not calling into VOP layer
at all for page cache reads. The biggest is the missed activation of
EVFILT_READ knotes.
Also, it allows filesystem to make more fine grained decision to
refuse read from page cache.
Keep VIRF_PGREAD flag around, it is still useful for nullfs, and for
asserts.
Reviewed by: markj
Tested by: pho
Discussed with: mjg
Sponsored by: The FreeBSD Foundation
Differential revision: https://reviews.freebsd.org/D26346
The pointer to vnode is already stored into f_vnode, so f_data can be
reused. Fix all found users of f_data for DTYPE_VNODE.
Provide finit_vnode() helper to initialize file of DTYPE_VNODE type.
Reviewed by: markj (previous version)
Discussed with: freqlabs (openzfs chunk)
Tested by: pho (previous version)
Sponsored by: The FreeBSD Foundation
Differential revision: https://reviews.freebsd.org/D26346
This function wasn't converted to use the new locking protocol in
r333744. Make it use the PCB lock for synchronizing connection state.
Tested by: pho
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D26300
unp_pcb_owned_lock2() has some sharp edges and forces callers to deal
with a bunch of cases. Simplify it:
- Rename to unp_pcb_lock_peer().
- Return the connected peer instead of forcing callers to load it
beforehand.
- Handle self-connected sockets.
- In unp_connectat(), just lock the accept socket directly. It should
not be possible for the nascent socket to participate in any other
lock orders.
- Get rid of connect_internal(). It does not provide any useful
checking anymore.
- Block in unp_connectat() when a different thread is concurrently
attempting to lock both sides of a connection. This provides simpler
semantics for callers of unp_pcb_lock_peer().
- Make unp_connectat() return EISCONN if the socket is already
connected. This fixes a race[1] when multiple threads attempt to
connect() to different addresses using the same datagram socket.
Upper layers will disconnect a connected datagram socket before
calling the protocol connect's method, but there is no synchronization
between this and protocol-layer code.
Reported by: syzkaller [1]
Tested by: pho
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D26299
The allocated memory is only required for SOCK_STREAM and SOCK_SEQPACKET
sockets.
Reviewed by: kevans
Tested by: pho
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D26298
In all cases, PCBs are unlocked after unp_disconnect() returns. Since
unp_disconnect() may release the last PCB reference, callers may have to
bump the refcount before the call just so that they can release them
again.
Change unp_disconnect() to release PCB locks as well as connection
references; this lets us remove several refcount manipulations. Tighten
assertions.
Tested by: pho
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D26297
unp_pcb_lock_pair() seems like a better name. Also make it handle the
case where the two sockets are the same instead of making callers do it.
No functional change intended.
Reviewed by: glebius, kevans, kib
Tested by: pho
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D26296
- Use refcount_init().
- Define an INVARIANTS-only zone destructor to assert that various
bits of PCB state aren't left dangling.
- Annotate unp_pcb_rele() with __result_use_check.
- Simplify control flow.
Reviewed by: glebius, kevans, kib
Tested by: pho
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D26295
- Define a locking key for unpcb members.
- Rewrite some of the locking protocol description to make it less
verbose and avoid referencing some subroutines which will be renamed.
- Reorder includes.
Reviewed by: glebius, kevans, kib
Tested by: pho
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D26294
for it to sit in the syscall fast path.
Reviewed by: kib
MFC after: 2 weeks
Sponsored by: DARPA
Differential Revision: https://reviews.freebsd.org/D26368
that can be extended, but also ensure compile-time type checking. Refactor
common code out of arch-specific implementations. Move the mpr and mps
drivers to this new API. The template type remains visible to the consumer
so that it can be allocated on the stack, but should be considered opaque.
On write with SHM_GROW_ON_WRITE, use proper truncate.
Do not allow to grow largepage shm if F_SEAL_GROW is set. Note that
shrinks are not supported at all due to unmanaged mappings.
Call to vm_pager_update_writecount() is only valid for swap objects,
skip it for unmanaged largepages.
Largepages cannot support write sealing.
Do not writecnt largepage mappings.
Reported by: kevans
Reviewed by: kevans, markj
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
Differential revision: https://reviews.freebsd.org/D26394
Created with shm_open2(SHM_LARGEPAGE) and then configured with
FIOSSHMLPGCNF ioctl, largepages posix shared memory objects guarantee
that all userspace mappings of it are served by superpage non-managed
mappings.
Only amd64 for now, both 2M and 1G superpages can be requested, the
later requires CPU feature.
Reviewed by: markj
Tested by: pho
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
Differential revision: https://reviews.freebsd.org/D24652
purpose of epoch_trace() and for calling subsequent panic, but to keep
code fully under INVARIANTS, so don't use bare function call to panic().
However, at the last stage of review a true value slipped in, while
always false was assumed. I checked that in email archive with kib@.
Noticed by: trasz
Future changes would require additional initialization of OBJT_PHYS
objects, and vm_object_allocate() is not suitable for it.
Reviewed by: markj
Tested by: pho
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
Differential revision: https://reviews.freebsd.org/D24652
Similar to the userspace rtld check.
Reviewed by: dim, emaste (previous versions)
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
Differential revision: https://reviews.freebsd.org/D26339
If multiple threads race calling vfs_hash_insert() while creating vnodes
with the same identity, all of the vnodes which lose the race must be
destroyed before any other thread can see them. Previously this was
accomplished by the vput() in vfs_hash_insert() resulting in the vnode's
VOP_INACTIVE() method calling vgone() before the vnode lock was unlocked,
but at some point changes to the the vnode refcount/inactive logic have caused
that to no longer work, leading to crashes, so instead vfs_hash_insert()
must call vgone() itself before calling vput() on vnodes which lose the race.
Reviewed by: mjg, kib
Sponsored by: Netflix
Differential Revision: https://reviews.freebsd.org/D26291
m_segments() was added with r363464 but never used. Remove it to
avoid warnings when compiling kernels.
Reported by: rmacklem (also says jhb)
Reviewed by: gallatin, jhb
Differential Revision: https://reviews.freebsd.org/D26330
When using ifnet ktls, and when ktls_reset_send_tag()
fails to allocate a replacement tag, it leaves
the tls session's snd_tag pointer NULL. ktls_cleanup()
tries to release the send tag, and will trip over
this NULL pointer and panic unless NULL is checked for.
Reviewed by: jhb
Sponsored by: Netflix
While rare, encountering an unimplemented system call early in init is
catastrophic and difficult to debug. Even after a SIGSYS handler is
registered, such configurations are problematic. As such, always report
such events for pid 1 (following kern.lognosys if non-zero).
Reviewed by: kevans, imp
Obtained from: CheriBSD (plus suggestions from kevans)
MFC after: 1 week
Sponsored by: DARPA
Differential Revision: https://reviews.freebsd.org/D26288
- Change the type of hw.pagesizes to OPAQUE, since it returns an array.
- Modify the handler to only truncate the returned length if the caller
supplied an output buffer. This allows use of the trick of passing a
NULL output buffer to fetch the output size, while preserving
compatibility if MAXPAGESIZES is increased.
- Add a "S,pagesize" formatter to sysctl(8).
Reviewed by: alc, kib
MFC after: 2 weeks
Sponsored by: Juniper Networks, Inc.
Sponsored by: Klara, Inc.
Differential Revision: https://reviews.freebsd.org/D26239
The CS_DRAIN flag cannot be set at the same time like the async-drain function
pointer is set. These are orthogonal features. Assert this at the beginning
of the function.
Before:
if (flags & CS_DRAIN) {
/* FALLTHROUGH */
} else if (xxx) {
return yyy;
}
if (drain) {
zzz = drain;
}
After:
if (flags & CS_DRAIN) {
/* FALLTHROUGH */
} else if (xxx) {
return yyy;
} else {
if (drain) {
zzz = drain;
}
}
Reviewed by: markj@
Tested by: callout_test
Differential Revision: https://reviews.freebsd.org/D26285
MFC after: 1 week
Sponsored by: Mellanox Technologies // NVIDIA Networking
Noted in D24652, we currently set shmfd->shm_flags on every
shm_open()/shm_open2(). This wasn't properly thought out; one shouldn't be
able to specify incompatible flags on subsequent opens of non-anon shm.
Move setting of shm_flags explicitly to the two places shmfd are created, as
we do with seals, and validate when we're opening a pre-existing mapping
that we've either passed no flags or we've passed the exact same flags as
the first time.
Reviewed by: kib, markj
Differential Revision: https://reviews.freebsd.org/D26242
When the zone_jumbop is exhausted, most things using
using sosend* (like sshd) will eventually
fail or hang if allocations are limited to the
depleted jumbop zone. This makes it imossible to
communicate with a box which is under an attach which
exhausts the jumbop zone.
Rather than depending on the page size zone, also try cluster
allocations to satisfy larger requests. This allows me
to ssh to, and serve 100Gb/s of traffic from a server which
under attack and has had its page-sized zone exhausted.
Reviewed by: glebius, markj, rmacklem
Sponsored by: Netflix
Differential Revision: https://reviews.freebsd.org/D26150
In Linux, ksize() gets the actual amount of memory allocated for a given
object. This commit adds malloc_usable_size() to FreeBSD KPI which does
the same. It also maps LinuxKPI ksize() to newly created function.
ksize() function is used by drm-kmod.
Reviewed by: hselasky, kib
MFC after: 1 week
Differential Revision: https://reviews.freebsd.org/D26215
Convert two different sysctl to using sbuf. First, for all the default
sysctls we implement for each device driver that's attached. This is a
pure sbuf conversion.
Second, convert sysctl_devices to fill its buffer with sbuf rather
than a hand-rolled crappy thing I wrote years ago.
Reviewed by: cem, markj
Differential Revision: https://reviews.freebsd.org/D26206
devctl_notify_f isn't needed, so retire it. The flags argument is now
unused, so rather than keep it around, retire it. Convert all old
users of it to devctl_notify(). This path no longer sleeps, so is safe
to call from any context. Since it doesn't sleep, it doesn't need to
know if it is OK to sleep or not.
Reviewed by: markj@
Differential Revision: https://reviews.freebsd.org/D26140
Convert the memory management of devctl. Rewrite if to make better
use of memory. This eliminates several mallocs (5? worse case) needed
to send a message. It's now possible to always send a message, though
if things are really backed up the oldest message will be dropped to
free up space for the newest.
Add a static bus_child_{location,pnpinfo}_sb to start migrating to
sbuf instead of buffer + length. Use it in the new code. Other code
will be converted later (bus_child_*_str is only used inside of
subr_bus.c, though implemented in ~100 places in the tree).
Reviewed by: markj@
Differential Revision: https://reviews.freebsd.org/D26140
Previously any residual data in the final block of a compressed kernel
dump would be written unencrypted. Note, such a configuration already
does not work properly when using AES-CBC since the compressed data is
typically not a multiple of the AES block length in size and EKCD does
not implement any padding scheme. However, EKCD more recently gained
support for using the ChaCha20 cipher, which being a stream cipher does
not have this problem.
Submitted by: sigsys@gmail.com
Reviewed by: cem
MFC after: 1 week
Differential Revision: https://reviews.freebsd.org/D26188
Turn FLUSHO on/off with ^O (or whatever VDISCARD is). Honor that to
throw away output quickly. This tries to remain true to 4.4BSD
behavior (since that was the origin of this feature), with any
corrections NetBSD has done. Since the implemenations are a little
different, though, some edge conditions may be handled differently.
Reviewed by: kib, kevans
Differential Review: https://reviews.freebsd.org/D26148
r363210 introduced v_seqc_users to the vnodes. This change requires
a vn_seqc_write_end() to match the vn_seqc_write_begin() in
vfs_cache_root_clear().
mjg@ provided this patch which seems to fix the panic.
Tested for an NFS mount where the VFS_STATFS() call will fail.
Submitted by: mjg
Reviewed by: mjg
Differential Revision: https://reviews.freebsd.org/D26160
vmem uses span tags to delimit imported segments, so that they can be
released if the segment becomes free in the future. However, the
per-domain kernel KVA arenas never release resources, so the span tags
between imported ranges are unused when the ranges are contiguous.
Furthermore, such span tags prevent coalescing of free segments across
KVA_QUANTUM boundaries, resulting in internal fragmentation which
inhibits superpage promotion in the kernel map.
Stop allocating span tags in arenas that never release resources. This
saves a small amount of memory and allows free segements to coalesce
across import boundaries. This manifests as improved kernel superpage
usage during poudriere runs, which also helps to reduce physical memory
fragmentation by reducing the number of broken partially populated
reservations.
Tested by: pho
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D24548
crypto(9) functions can now be used on buffers composed of an array of
vm_page_t structures, such as those stored in an unmapped struct bio. It
requires the running to kernel to support the direct memory map, so not all
architectures can use it.
Reviewed by: markj, kib, jhb, mjg, mat, bcr (manpages)
MFC after: 1 week
Sponsored by: Axcient
Differential Revision: https://reviews.freebsd.org/D25671
r358097 introduced a problem for i386, where kernel builds will intermittently
get hung, typically with many processes sleeping on "btalloc".
I know nothing about VM, but received assistance from rlibby@ and markj@.
rlibby@ stated the following:
It looks like the problem is that
for systems that do not have UMA_MD_SMALL_ALLOC, we do
uma_zone_set_allocf(vmem_bt_zone, vmem_bt_alloc);
but we haven't set an appropriate free function. This is probably why
UMA_ZONE_NOFREE was originally there. When NOFREE was removed, it was
appropriate for systems with uma_small_alloc.
So by default we get page_free as our free function. That calls
kmem_free, which calls vmem_free ... but we do our allocs with
vmem_xalloc. I'm not positive, but I think the problem is that in
effect we vmem_xalloc -> vmem_free, not vmem_xfree.
Three possible fixes:
1: The one you tested, but this is not best for systems with
uma_small_alloc.
2: Pass UMA_ZONE_NOFREE conditional on UMA_MD_SMALL_ALLOC.
3: Actually provide an appropriate vmem_bt_free function.
I think we should just do option 2 with a comment, it's simple and it's
what we used to do. I'm not sure how much benefit we would see from
option 3, but it's more work.
This patch implements #2. I haven't done a comment, since I don't know
what the problem is.
markj@ noted the following:
I think the suggested patch is ok, but not for the reason stated.
On platforms without a direct map the problem is:
to allocate btags we need a slab,
and to allocate a slab we need to map a page, and to map a page we need
to allocate btags.
We handle this recursion using a custom slab allocator which specifies
M_USE_RESERVE, allowing it to dip into a reserve of free btags.
Because the returned slab can be used to keep the reserve populated,
this ensures that there are always enough free btags available to
handle the recursion.
UMA_ZONE_NOFREE ensures that we never reclaim free slabs from the zone.
However, when it was removed, an apparent bug in UMA was exposed:
keg_drain() ignores the reservation set by uma_zone_reserve()
in vmem_startup().
So under memory pressure we reclaim the free btags that are needed to
break the recursion.
That's why adding _NOFREE back fixes the problem: it disables the
reclamation.
We could perhaps fix it more cleverly, by modifying keg_drain() to always
leave uk_reserve slabs available.
markj@'s initial patch failed testing, so committing this patch was agreed
upon as the interim solution.
Either rlibby@ or markj@ might choose to add a comment to it.
PR: 248008
Reviewed by: rlibby, markj
rtentry lock traditionally served 2 purposed: first was protecting refcounts,
the second was assuring consistent field access/changes.
Since route nexthop introduction, the need for the former disappeared and
the need for the latter reduced.
To be more precise, the following rte field are mutable:
rt_nhop (nexthop pointer, updated with RIB_WLOCK, passed in rib_cmd_info)
rte_flags (only RTF_HOST and RTF_UP, where RTF_UP gets changed at rte removal)
rt_weight (relative weight, updated with RIB_WLOCK, passed in rib_cmd_info)
rt_expire (time when rte deletion is scheduled, updated with RIB_WLOCK)
rt_chain (deletion chain pointer, updated with RIB_WLOCK)
All of them are updated under RIB_WLOCK, so the only remaining concern is the reading.
rt_nhop and rt_weight (addressed in this review) are read under rib lock and
stored in the rib_cmd_info, so the caller has no problem with consitency.
rte_flags is currently read unlocked in rtsock reporting (however the scope
is only RTF_UP flag, which is pretty static).
rt_expire is currently read unlocked in rtsock reporting.
rt_chain accesses are safe, as this is only used at route deletion.
rt_expire and rte_flags reads will be dealt in a separate reviews soon.
Differential Revision: https://reviews.freebsd.org/D26162
We have both a system of 'kern' and of 'kernel'. Prefer the latter and
convert this notification to use 'kernel' instead of 'kern'. As a
transition period, continue to also generate the 'kern' notification
until sometime after FreeBSD 13 is branched.
MFC After: 3 days
While here change the field size from long to int and move it into the
gap next to cn_flags.
Shrinks struct componentname from 64 to 56 bytes on amd64.