"." and ".." names are not maintained in the mqueuefs dirent datastructure and
cannot be opened as mqueues. Creating or removing them is invalid; return
EINVAL instead of crashing.
PR: 236836
Submitted by: Torbjørn Birch Moltu <t.b.moltu AT lyse.net>
Discussed with: jilles (earlier version)
Similar to r348026, exhaustive search for uses of CTRn() and cross reference
ktr.h includes. Where it was obvious that an OS compat header of some kind
included ktr.h indirectly, .c files were left alone. Some of these files
clearly got ktr.h via header pollution in some scenarios, or tinderbox would
not be passing prior to this revision, but go ahead and explicitly include it
in files using it anyway.
Like r348026, these CUs did not show up in tinderbox as missing the include.
Reported by: peterj (arm64/mp_machdep.c)
X-MFC-With: r347984
Sponsored by: Dell EMC Isilon
NDFREE() calculates unlock_dvp after ndp->ni_vp is unlocked and zeroed
out. This makes the comparision of ni_dvp with ni_vp always fail.
Move the calculation of unlock_dvp right after unlock_vp, so that the
code sees correct ni_vp value.
Reproduced by
chdir("/usr");
open("/..", O_BENEATH | O_RDONLY);
Reported by: syzkaller
Reviewed by: markj, mckusick
Tested by: pho
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
Differential revision: https://reviews.freebsd.org/D20304
are able to determine some virtual machines, but the vm_guest variable was
still only being set to VM_GUEST_VM.
Since we do know what some of them specifically are, we can set vm_guest
appropriately.
Also, if we see the CPUID has the HV flag, but we were unable to find a
definitive vendor in the Hypervisor CPUID Information Leaf, fall back to
the older detection methods, as they may be able to determine a specific
HV type.
Add VM_GUEST_PARALLELS value to VM_GUEST for Parallels.
Approved by: cem
Differential Revision: https://reviews.freebsd.org/D20305
This allows replacing "sys/eventfilter.h" includes with "sys/_eventfilter.h"
in other header files (e.g., sys/{bus,conf,cpu}.h) and reduces header
pollution substantially.
EVENTHANDLER_DECLARE and EVENTHANDLER_LIST_DECLAREs were moved out of .c
files into appropriate headers (e.g., sys/proc.h, powernv/opal.h).
As a side effect of reduced header pollution, many .c files and headers no
longer contain needed definitions. The remainder of the patch addresses
adding appropriate includes to fix those files.
LOCK_DEBUG and LOCK_FILE_LINE_ARG are moved to sys/_lock.h, as required by
sys/mutex.h since r326106 (but silently protected by header pollution prior
to this change).
No functional change (intended). Of course, any out of tree modules that
relied on header pollution for sys/eventhandler.h, sys/lock.h, or
sys/mutex.h inclusion need to be fixed. __FreeBSD_version has been bumped.
This is described in the vmem paper: "directs vmem to use the next free
segment after the one previously allocated." The implementation adds a
new boundary tag type, M_CURSOR, which is linked into the segment list
and precedes the segment following the previous M_NEXTFIT allocation.
The cursor is used to locate the next free segment satisfying the
allocation constraints.
This implementation isn't O(1) since busy tags aren't coalesced, and we
may potentially scan the entire segment list during an M_NEXTFIT
allocation.
Reviewed by: alc
MFC after: 1 month
Differential Revision: https://reviews.freebsd.org/D17226
type, use a table to make it easier to add more in the future, if needed.
Add VirtualBox detection to the table ("VBoxVBoxVBox" is the hypervisor
vendor string to look for.) Also add VM_GUEST_VBOX to the VM_GUEST
enumeration to indicate VirtualBox.
Save the CPUID base for the hypervisor entry that we detected. Driver code
may need to know about it in order to obtain additional CPUID features.
Approved by: bryanv, jhb
Differential Revision: https://reviews.freebsd.org/D16305
For machines having cmpxcgh16b instruction, i.e. everything but very
early Athlons, provide lockless implementation of delayed
invalidation.
The implementation maintains lock-less single-linked list with the
trick from the T.L. Harris article about volatile mark of the elements
being removed. Double-CAS is used to atomically update both link and
generation. New thread starting DI appends itself to the end of the
queue, setting the generation to the generation of the last element
+1. On DI finish, thread donates its generation to the previous
element. The generation of the fake head of the list is the last
passed DI generation. Basically, the implementation is a queued
spinlock but without spinlock.
Many thanks both to Peter Holm and Mark Johnson for keeping with me
while I produced intermediate versions of the patch.
Reviewed by: markj
Tested by: pho
Sponsored by: The FreeBSD Foundation
MFC after: 1 month
MFC note: td_md.md_invl_gen should go to the end of struct thread
Differential revision: https://reviews.freebsd.org/D19630
Code walks the list of contested turnstiles to calculate the priority
to unlend.
Reviewed by: markj
Tested by: pho
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
We have a better, more comprehensive knob for this now:
kern.random.initial_seeding.bypass_before_seeding=1.
Requested by: delphij
Sponsored by: Dell EMC Isilon
If bumping over the counter goes over the limit we have to decrement it back.
Previous code would only bump the counter after adding the entry (thus allowing
the cache to go over the limit).
Sponsored by: The FreeBSD Foundation
the allocation request, so that the blocks allocated are from the next
set of free blocks big enough to satisfy the minimum requirements of
the request, and the number of blocks allocated are as many as
possible, up to the specified maximum. The implementation of
swp_pager_getswapspace uses this parameter to ask for a number of
blocks between the new halved request size and the previous failed
request size. Thus a request for 32 blocks may fail, but instead of
getting only 16 blocks instead, the caller asks for 16 to 31 next, and
might get 19 or 27, which is closer to what they originally wanted.
I expect this to lead to bigger block allocations and less block
fragmentation, at least in some cases.
Approved by: kib (mentor)
Differential Revision: https://reviews.freebsd.org/D20001
change the binary search so that it does not depend on a single bit
only being set in the bitmask. Use bitpos more generally, and avoid
some clearing of bits to accommodate its current behavior.
Approved by: kib (mentor)
Differential Revision: https://reviews.freebsd.org/D20237
change the binary search so that it does not depend on a single bit
only being set in the bitmask. Use bitpos more generally, and avoid
some clearing of bits to accommodate its current behavior.
Approved by: kib (mentor)
Differential Revision: https://reviews.freebsd.org/D20232
the same thing, but is commented so that it might be better
understood.
Approved by: kib (mentor)
Differential Revision: https://reviews.freebsd.org/D20231
and the next one, and if blocks are allocated from the next leaf, it
walks back toward where it started, as long as there are interleaving
meta-nodes to be updated on account of the last free blocks under
those meta-nodes being allocated. Only if the walk goes all the way
back to the starting point must we calculate the position of the
meta-node that is the least-comment parent of one leaf and the next,
and update a bit in that meta-node to indicate the allocation of its
last free block.
There's no need to start calculating the position of that least-common
parent until the walk back reaches the original starting point, and
there's no need for a calculation that updates 'radius' to tell us
when we've walked back to the beginning, since comparing scan to next
suffices for that.
Approved by: kib (mentor)
Differential Revision: https://reviews.freebsd.org/D20229
Bind the TCP pacer threads to NUMA domains and build per-domain
pacer-thread lookup tables. These tables allow us to use the
inpcb's NUMA domain information to match an inpcb with a pacer
thread on the same domain.
The motivation for this is to keep the TCP connection local to a
NUMA domain as much as possible.
Thanks to jhb for pre-reviewing an earlier version of the patch.
Reviewed by: rrs
Sponsored by: Netflix
Differential Revision: https://reviews.freebsd.org/D20134
- there is no need to take the process lock to iterate the thread
list after single-threading is enforced
- typically there are no mutexes to clean up (testable without taking
the global umtx lock)
- typically there is no need to adjust the priority (testable without
taking thread lock)
Reviewed by: kib
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D20160
device_printf does multiple calls to printf allowing other console messages to
be inserted between the device name, and the rest of the message. This change
uses sbuf to compose to two into a single buffer, and prints it all at once.
It exposes an sbuf drain function (drain-to-printf) for common use.
Update documentation to match; some unit tests included.
Submitted by: jmg
Sponsored by: Dell EMC Isilon
Differential Revision: https://reviews.freebsd.org/D16690
Multiple tools use @generated to identify generated files (for example,
in a review Phabricator will by default hide diffs in generated files).
Use the @generated tag in makesyscalls.sh as we've done for other
generated files.
Reviewed by: cem
MFC after: 1 month
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D20183
resume where the last search left off. Suppose that there are no free
blocks of size 32, but plenty of size 16. If we repeatedly request
size 32 blocks, fail, and retry with size 16 blocks, then the failures
all reset the cursor to the beginning of memory, making the 16 block
allocation use a first fit, rather than next fit, strategy.
This change has blist_alloc make a copy of the cursor for its own
decision making, and only updates the real blist cursor after a
successful allocation, making those 16 block searches behave like
next-fit searches.
Approved by: markj (mentor)
Differential Revision: https://reviews.freebsd.org/D20177
Allow users to specify multiple dump configurations in a prioritized list.
This enables fallback to secondary device(s) if primary dump fails. E.g.,
one might configure a preference for netdump, but fallback to disk dump as a
second choice if netdump is unavailable.
This change does not list-ify netdump configuration, which is tracked
separately from ordinary disk dumps internally; only one netdump
configuration can be made at a time, for now. It also does not implement
IPv6 netdump.
savecore(8) is already capable of scanning and iterating multiple devices
from /etc/fstab or passed on the command line.
This change doesn't update the rc or loader variables 'dumpdev' in any way;
it can still be set to configure a single dump device, and rc.d/savecore
still uses it as a single device. Only dumpon(8) is updated to be able to
configure the more complicated configurations for now.
As part of revving the ABI, unify netdump and disk dump configuration ioctl
/ structure, and leave room for ipv6 netdump as a future possibility.
Backwards-compatibility ioctls are added to smooth ABI transition,
especially for developers who may not keep kernel and userspace perfectly
synced.
Reviewed by: markj, scottl (earlier version)
Relnotes: maybe
Sponsored by: Dell EMC Isilon
Differential Revision: https://reviews.freebsd.org/D19996
kern_execve() locks text vnode exclusive to be able to set and clear
VV_TEXT flag. VV_TEXT is mutually exclusive with the v_writecount > 0
condition.
The change removes VV_TEXT, replacing it with the condition
v_writecount <= -1, and puts v_writecount under the vnode interlock.
Each text reference decrements v_writecount. To clear the text
reference when the segment is unmapped, it is recorded in the
vm_map_entry backed by the text file as MAP_ENTRY_VN_TEXT flag, and
v_writecount is incremented on the map entry removal
The operations like VOP_ADD_WRITECOUNT() and VOP_SET_TEXT() check that
v_writecount does not contradict the desired change. vn_writecheck()
is now racy and its use was eliminated everywhere except access.
Atomic check for writeability and increment of v_writecount is
performed by the VOP. vn_truncate() now increments v_writecount
around VOP_SETATTR() call, lack of which is arguably a bug on its own.
nullfs bypasses v_writecount to the lower vnode always, so nullfs
vnode has its own v_writecount correct, and lower vnode gets all
references, since object->handle is always lower vnode.
On the text vnode' vm object dealloc, the v_writecount value is reset
to zero, and deadfs vop_unset_text short-circuit the operation.
Reclamation of lowervp always reclaims all nullfs vnodes referencing
lowervp first, so no stray references are left.
Reviewed by: markj, trasz
Tested by: mjg, pho
Sponsored by: The FreeBSD Foundation
MFC after: 1 month
Differential revision: https://reviews.freebsd.org/D19923
We unlock the vnode around malloc(M_WAITOK), to make it possible for
pagedaemon to flush vnode pages for us. Instead of doing it
unconditionally, first try M_NOWAIT allocation, which typically
succeed. Only on failure, unlock the vnode and retry with M_WAITOK.
Reviewed by: markj, trasz
Tested by: mjg, pho
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
Differential revision: https://reviews.freebsd.org/D19923
IPI_STOP is used after panic or when ddb is entered manually. MONITOR/
MWAIT allows CPUs that support the feature to sleep in a low power way
instead of spinning. Something similar is already used at idle.
It is perhaps especially useful in oversubscribed VM environments, and is
safe to use even if the panic/ddb thread is not the BSP. (Except in the
presence of MWAIT errata, which are detected automatically on platforms with
known wakeup problems.)
It can be tuned/sysctled with "machdep.stop_mwait," which defaults to 0
(off). This commit also introduces the tunable
"machdep.mwait_cpustop_broken," which defaults to 0, unless the CPU has
known errata, but may be set to "1" in loader.conf to signal that mwait
wakeup is broken on CPUs FreeBSD does not yet know about.
Unfortunately, Bhyve doesn't yet support MONITOR extensions, so this doesn't
help bhyve hypervisors running FreeBSD guests.
Submitted by: Anton Rang <rang AT acm.org> (earlier version)
Reviewed by: kib
Sponsored by: Dell EMC Isilon
Differential Revision: https://reviews.freebsd.org/D20135
These predicates are vestigal and cannot be true today. For example,
idle threads are not allowed to acquire locks.
Also cache curthread in breada().
No functional change intended.
Reviewed by: kib, mckusick
MFC after: 1 week
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D20066
Contrary to the comments, it was never used by core dumps or
debuggers. Instead, it used to hold the signal code of a pending
signal, but that was replaced by the 'ksi_code' member of ksiginfo_t
when signal information was reworked in 7.0.
Reviewed by: kib
Differential Revision: https://reviews.freebsd.org/D20047
Drivers can now pass up numa domain information via the
mbuf numa domain field. This information is then used
by TCP syncache_socket() to associate that information
with the inpcb. The domain information is then fed back
into transmitted mbufs in ip{6}_output(). This mechanism
is nearly identical to what is done to track RSS hash values
in the inp_flowid.
Follow on changes will use this information for lacp egress
port selection, binding TCP pacers to the appropriate NUMA
domain, etc.
Reviewed by: markj, kib, slavash, bz, scottl, jtl, tuexen
Sponsored by: Netflix
Differential Revision: https://reviews.freebsd.org/D20028
This is a stopgap measure to unbreak installer/VM/embedded boot issues
introduced (or at least exposed by) in r346250.
Add the new tunable, "security.stack_protect.permit_nonrandom_cookies," in
order to continue boot with insecure non-random stack cookies if the random
device is unavailable.
For now, enable it by default. This is NOT safe. It will be disabled by
default in a future revision.
There is follow-on work planned to use fast random sources (e.g., RDRAND on
x86 and DARN on Power) to seed when the early entropy file cannot be
provided, for whatever reason. Please see D19928.
Some better hacks may be used to make the non-random __stack_chk_guard
slightly less predictable (from delphij@ and mjg@); those suggestions are
left for a future revision. I think it may also be plausible to move stack
guard initialization far later in the boot process; potentially it could be
moved all the way to just before userspace is started.
Reported by: many
Reviewed by: delphij, emaste, imp (all w/ caveat: this is a stopgap fix)
Security: yes
Sponsored by: Dell EMC Isilon
Differential Revision: https://reviews.freebsd.org/D19927
r176215 corrected readlink(2)'s return type and the type of the last
argument. readlink(2) was introduced in r177788 after being developed
as part of Google Summer of Code 2007; it appears to have inherited the
wrong return type.
Man pages and header files were already ssize_t; update syscalls.master
to match.
PR: 197915
Submitted by: Henning Petersen <henning.petersen@t-online.de>
MFC after: 2 weeks