purpose of epoch_trace() and for calling subsequent panic, but to keep
code fully under INVARIANTS, so don't use bare function call to panic().
However, at the last stage of review a true value slipped in, while
always false was assumed. I checked that in email archive with kib@.
Noticed by: trasz
Future changes would require additional initialization of OBJT_PHYS
objects, and vm_object_allocate() is not suitable for it.
Reviewed by: markj
Tested by: pho
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
Differential revision: https://reviews.freebsd.org/D24652
Similar to the userspace rtld check.
Reviewed by: dim, emaste (previous versions)
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
Differential revision: https://reviews.freebsd.org/D26339
If multiple threads race calling vfs_hash_insert() while creating vnodes
with the same identity, all of the vnodes which lose the race must be
destroyed before any other thread can see them. Previously this was
accomplished by the vput() in vfs_hash_insert() resulting in the vnode's
VOP_INACTIVE() method calling vgone() before the vnode lock was unlocked,
but at some point changes to the the vnode refcount/inactive logic have caused
that to no longer work, leading to crashes, so instead vfs_hash_insert()
must call vgone() itself before calling vput() on vnodes which lose the race.
Reviewed by: mjg, kib
Sponsored by: Netflix
Differential Revision: https://reviews.freebsd.org/D26291
m_segments() was added with r363464 but never used. Remove it to
avoid warnings when compiling kernels.
Reported by: rmacklem (also says jhb)
Reviewed by: gallatin, jhb
Differential Revision: https://reviews.freebsd.org/D26330
When using ifnet ktls, and when ktls_reset_send_tag()
fails to allocate a replacement tag, it leaves
the tls session's snd_tag pointer NULL. ktls_cleanup()
tries to release the send tag, and will trip over
this NULL pointer and panic unless NULL is checked for.
Reviewed by: jhb
Sponsored by: Netflix
While rare, encountering an unimplemented system call early in init is
catastrophic and difficult to debug. Even after a SIGSYS handler is
registered, such configurations are problematic. As such, always report
such events for pid 1 (following kern.lognosys if non-zero).
Reviewed by: kevans, imp
Obtained from: CheriBSD (plus suggestions from kevans)
MFC after: 1 week
Sponsored by: DARPA
Differential Revision: https://reviews.freebsd.org/D26288
- Change the type of hw.pagesizes to OPAQUE, since it returns an array.
- Modify the handler to only truncate the returned length if the caller
supplied an output buffer. This allows use of the trick of passing a
NULL output buffer to fetch the output size, while preserving
compatibility if MAXPAGESIZES is increased.
- Add a "S,pagesize" formatter to sysctl(8).
Reviewed by: alc, kib
MFC after: 2 weeks
Sponsored by: Juniper Networks, Inc.
Sponsored by: Klara, Inc.
Differential Revision: https://reviews.freebsd.org/D26239
The CS_DRAIN flag cannot be set at the same time like the async-drain function
pointer is set. These are orthogonal features. Assert this at the beginning
of the function.
Before:
if (flags & CS_DRAIN) {
/* FALLTHROUGH */
} else if (xxx) {
return yyy;
}
if (drain) {
zzz = drain;
}
After:
if (flags & CS_DRAIN) {
/* FALLTHROUGH */
} else if (xxx) {
return yyy;
} else {
if (drain) {
zzz = drain;
}
}
Reviewed by: markj@
Tested by: callout_test
Differential Revision: https://reviews.freebsd.org/D26285
MFC after: 1 week
Sponsored by: Mellanox Technologies // NVIDIA Networking
Noted in D24652, we currently set shmfd->shm_flags on every
shm_open()/shm_open2(). This wasn't properly thought out; one shouldn't be
able to specify incompatible flags on subsequent opens of non-anon shm.
Move setting of shm_flags explicitly to the two places shmfd are created, as
we do with seals, and validate when we're opening a pre-existing mapping
that we've either passed no flags or we've passed the exact same flags as
the first time.
Reviewed by: kib, markj
Differential Revision: https://reviews.freebsd.org/D26242
When the zone_jumbop is exhausted, most things using
using sosend* (like sshd) will eventually
fail or hang if allocations are limited to the
depleted jumbop zone. This makes it imossible to
communicate with a box which is under an attach which
exhausts the jumbop zone.
Rather than depending on the page size zone, also try cluster
allocations to satisfy larger requests. This allows me
to ssh to, and serve 100Gb/s of traffic from a server which
under attack and has had its page-sized zone exhausted.
Reviewed by: glebius, markj, rmacklem
Sponsored by: Netflix
Differential Revision: https://reviews.freebsd.org/D26150
In Linux, ksize() gets the actual amount of memory allocated for a given
object. This commit adds malloc_usable_size() to FreeBSD KPI which does
the same. It also maps LinuxKPI ksize() to newly created function.
ksize() function is used by drm-kmod.
Reviewed by: hselasky, kib
MFC after: 1 week
Differential Revision: https://reviews.freebsd.org/D26215
Convert two different sysctl to using sbuf. First, for all the default
sysctls we implement for each device driver that's attached. This is a
pure sbuf conversion.
Second, convert sysctl_devices to fill its buffer with sbuf rather
than a hand-rolled crappy thing I wrote years ago.
Reviewed by: cem, markj
Differential Revision: https://reviews.freebsd.org/D26206
devctl_notify_f isn't needed, so retire it. The flags argument is now
unused, so rather than keep it around, retire it. Convert all old
users of it to devctl_notify(). This path no longer sleeps, so is safe
to call from any context. Since it doesn't sleep, it doesn't need to
know if it is OK to sleep or not.
Reviewed by: markj@
Differential Revision: https://reviews.freebsd.org/D26140
Convert the memory management of devctl. Rewrite if to make better
use of memory. This eliminates several mallocs (5? worse case) needed
to send a message. It's now possible to always send a message, though
if things are really backed up the oldest message will be dropped to
free up space for the newest.
Add a static bus_child_{location,pnpinfo}_sb to start migrating to
sbuf instead of buffer + length. Use it in the new code. Other code
will be converted later (bus_child_*_str is only used inside of
subr_bus.c, though implemented in ~100 places in the tree).
Reviewed by: markj@
Differential Revision: https://reviews.freebsd.org/D26140
Previously any residual data in the final block of a compressed kernel
dump would be written unencrypted. Note, such a configuration already
does not work properly when using AES-CBC since the compressed data is
typically not a multiple of the AES block length in size and EKCD does
not implement any padding scheme. However, EKCD more recently gained
support for using the ChaCha20 cipher, which being a stream cipher does
not have this problem.
Submitted by: sigsys@gmail.com
Reviewed by: cem
MFC after: 1 week
Differential Revision: https://reviews.freebsd.org/D26188
Turn FLUSHO on/off with ^O (or whatever VDISCARD is). Honor that to
throw away output quickly. This tries to remain true to 4.4BSD
behavior (since that was the origin of this feature), with any
corrections NetBSD has done. Since the implemenations are a little
different, though, some edge conditions may be handled differently.
Reviewed by: kib, kevans
Differential Review: https://reviews.freebsd.org/D26148
r363210 introduced v_seqc_users to the vnodes. This change requires
a vn_seqc_write_end() to match the vn_seqc_write_begin() in
vfs_cache_root_clear().
mjg@ provided this patch which seems to fix the panic.
Tested for an NFS mount where the VFS_STATFS() call will fail.
Submitted by: mjg
Reviewed by: mjg
Differential Revision: https://reviews.freebsd.org/D26160
vmem uses span tags to delimit imported segments, so that they can be
released if the segment becomes free in the future. However, the
per-domain kernel KVA arenas never release resources, so the span tags
between imported ranges are unused when the ranges are contiguous.
Furthermore, such span tags prevent coalescing of free segments across
KVA_QUANTUM boundaries, resulting in internal fragmentation which
inhibits superpage promotion in the kernel map.
Stop allocating span tags in arenas that never release resources. This
saves a small amount of memory and allows free segements to coalesce
across import boundaries. This manifests as improved kernel superpage
usage during poudriere runs, which also helps to reduce physical memory
fragmentation by reducing the number of broken partially populated
reservations.
Tested by: pho
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D24548
crypto(9) functions can now be used on buffers composed of an array of
vm_page_t structures, such as those stored in an unmapped struct bio. It
requires the running to kernel to support the direct memory map, so not all
architectures can use it.
Reviewed by: markj, kib, jhb, mjg, mat, bcr (manpages)
MFC after: 1 week
Sponsored by: Axcient
Differential Revision: https://reviews.freebsd.org/D25671
r358097 introduced a problem for i386, where kernel builds will intermittently
get hung, typically with many processes sleeping on "btalloc".
I know nothing about VM, but received assistance from rlibby@ and markj@.
rlibby@ stated the following:
It looks like the problem is that
for systems that do not have UMA_MD_SMALL_ALLOC, we do
uma_zone_set_allocf(vmem_bt_zone, vmem_bt_alloc);
but we haven't set an appropriate free function. This is probably why
UMA_ZONE_NOFREE was originally there. When NOFREE was removed, it was
appropriate for systems with uma_small_alloc.
So by default we get page_free as our free function. That calls
kmem_free, which calls vmem_free ... but we do our allocs with
vmem_xalloc. I'm not positive, but I think the problem is that in
effect we vmem_xalloc -> vmem_free, not vmem_xfree.
Three possible fixes:
1: The one you tested, but this is not best for systems with
uma_small_alloc.
2: Pass UMA_ZONE_NOFREE conditional on UMA_MD_SMALL_ALLOC.
3: Actually provide an appropriate vmem_bt_free function.
I think we should just do option 2 with a comment, it's simple and it's
what we used to do. I'm not sure how much benefit we would see from
option 3, but it's more work.
This patch implements #2. I haven't done a comment, since I don't know
what the problem is.
markj@ noted the following:
I think the suggested patch is ok, but not for the reason stated.
On platforms without a direct map the problem is:
to allocate btags we need a slab,
and to allocate a slab we need to map a page, and to map a page we need
to allocate btags.
We handle this recursion using a custom slab allocator which specifies
M_USE_RESERVE, allowing it to dip into a reserve of free btags.
Because the returned slab can be used to keep the reserve populated,
this ensures that there are always enough free btags available to
handle the recursion.
UMA_ZONE_NOFREE ensures that we never reclaim free slabs from the zone.
However, when it was removed, an apparent bug in UMA was exposed:
keg_drain() ignores the reservation set by uma_zone_reserve()
in vmem_startup().
So under memory pressure we reclaim the free btags that are needed to
break the recursion.
That's why adding _NOFREE back fixes the problem: it disables the
reclamation.
We could perhaps fix it more cleverly, by modifying keg_drain() to always
leave uk_reserve slabs available.
markj@'s initial patch failed testing, so committing this patch was agreed
upon as the interim solution.
Either rlibby@ or markj@ might choose to add a comment to it.
PR: 248008
Reviewed by: rlibby, markj
rtentry lock traditionally served 2 purposed: first was protecting refcounts,
the second was assuring consistent field access/changes.
Since route nexthop introduction, the need for the former disappeared and
the need for the latter reduced.
To be more precise, the following rte field are mutable:
rt_nhop (nexthop pointer, updated with RIB_WLOCK, passed in rib_cmd_info)
rte_flags (only RTF_HOST and RTF_UP, where RTF_UP gets changed at rte removal)
rt_weight (relative weight, updated with RIB_WLOCK, passed in rib_cmd_info)
rt_expire (time when rte deletion is scheduled, updated with RIB_WLOCK)
rt_chain (deletion chain pointer, updated with RIB_WLOCK)
All of them are updated under RIB_WLOCK, so the only remaining concern is the reading.
rt_nhop and rt_weight (addressed in this review) are read under rib lock and
stored in the rib_cmd_info, so the caller has no problem with consitency.
rte_flags is currently read unlocked in rtsock reporting (however the scope
is only RTF_UP flag, which is pretty static).
rt_expire is currently read unlocked in rtsock reporting.
rt_chain accesses are safe, as this is only used at route deletion.
rt_expire and rte_flags reads will be dealt in a separate reviews soon.
Differential Revision: https://reviews.freebsd.org/D26162
We have both a system of 'kern' and of 'kernel'. Prefer the latter and
convert this notification to use 'kernel' instead of 'kern'. As a
transition period, continue to also generate the 'kern' notification
until sometime after FreeBSD 13 is branched.
MFC After: 3 days