This appears to have been an oversight in r213536.
Reviewed by: markj
MFC after: 1 week
Sponsored by: DARPA, AFRL
Differential Revision: https://reviews.freebsd.org/D11521
the IO type (Admin or NVM) using XPT op-codes XPT_NVME_ADMIN or
XPT_NVME_IO.
Submitted by: Chuck Tuffli <chuck@tuffli.net>
Differential Revision: https://reviews.freebsd.org/D10247
Using the https://github.com/google/capsicum-test/ suite, the
PosixMqueue.CapModeForked test was failing due to an ECAPMODE after
calling kmq_notify(). On further inspection, the dynamically
loaded syscall entry was initialized with sy_flags zeroed out, since
SYSCALL_INIT_HELPER() left sysent.sy_flags with the default value.
Add a new helper SYSCALL{,32}_INIT_HELPER_F() which takes an
additional argument to specify the sy_flags value.
Submitted by: Siva Mahadevan <smahadevan@freebsdfoundation.org>
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
Differential revision: https://reviews.freebsd.org/D11576
Make the %b formatter accept number formatting flags. It will now accept
alternate form, precision, and length modifiers. It also now partially
supports field width (but forces left justification).
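
For readers unfamiliar with %b, the following is a minimal userland sketch of the basic decoding it performs, following the description in printf(9): the first byte of the mask string is the output base, and each later entry is a 1-based bit number followed by a name. It does not model the new flag handling from this change. The names decode_bits() and put_char() are hypothetical, and the sketch assumes a 32-bit unsigned int.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

/* Append one character if there is room, keeping the buffer terminated. */
static void
put_char(char *out, size_t outsz, size_t *len, char c)
{
	if (*len + 1 < outsz) {
		out[(*len)++] = c;
		out[*len] = '\0';
	}
}

/*
 * Hypothetical helper: render `val` as "<number><NAME,NAME>" according to
 * a %b-style mask string.  Returns the number of characters stored.
 */
size_t
decode_bits(char *out, size_t outsz, unsigned int val, const char *mask)
{
	size_t len;
	int any = 0, base, n;

	assert(out != NULL && outsz > 0 && mask != NULL && mask[0] != '\0');
	base = (unsigned char)*mask++;	/* \10 selects octal, otherwise hex */
	n = snprintf(out, outsz, base == 8 ? "%o" : "%x", val);
	len = (n < 0) ? 0 : ((size_t)n < outsz ? (size_t)n : outsz - 1);
	out[len] = '\0';

	while (*mask != '\0') {
		unsigned int bit = (unsigned char)*mask++;	/* 1-based */
		int set = bit <= 32 && ((val >> (bit - 1)) & 1u) != 0;

		if (set) {
			put_char(out, outsz, &len, any ? ',' : '<');
			any = 1;
		}
		/* Name characters are all greater than ' '. */
		for (; (unsigned char)*mask > ' '; mask++)
			if (set)
				put_char(out, outsz, &len, *mask);
	}
	if (any)
		put_char(out, outsz, &len, '>');
	return (len);
}
```

Under these assumptions, decoding the value 3 against the mask "\10\2BITTWO\1BITONE" stores the string 3<BITTWO,BITONE>.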
Reviewed by: markj
Approved by: markj (mentor)
Sponsored by: Dell EMC Isilon
Differential Revision: https://reviews.freebsd.org/D11284
on clock drivers.
This tracks multiple concurrent realtime clock drivers in a list sorted by
clock resolution. When system time changes (and periodically) the
clock_settime() methods of all registered clocks are invoked.
To initialize system time, each driver is tried in turn from best to worst
resolution, until one successfully returns a valid time.
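
The ordering and fallback described above can be sketched in userland C as follows. This is an illustration only, not the subr_rtc code: the registry layout, the names rtc_register() and rtc_init_time(), the fixed-size array, and the two demo callbacks are all assumptions made for the example.

```c
#include <assert.h>
#include <stddef.h>
#include <time.h>

#define	MAX_CLOCKS	8

struct rtc_entry {
	long	resolution_us;			/* smaller is better */
	int	(*gettime)(struct timespec *);	/* returns 0 on success */
};

static struct rtc_entry rtc_list[MAX_CLOCKS];
static size_t rtc_count;

/* Insert a clock, keeping the list sorted from best to worst resolution. */
int
rtc_register(long resolution_us, int (*gettime)(struct timespec *))
{
	size_t i;

	assert(gettime != NULL);
	if (rtc_count == MAX_CLOCKS)
		return (-1);
	for (i = rtc_count; i > 0 &&
	    rtc_list[i - 1].resolution_us > resolution_us; i--)
		rtc_list[i] = rtc_list[i - 1];
	rtc_list[i].resolution_us = resolution_us;
	rtc_list[i].gettime = gettime;
	rtc_count++;
	return (0);
}

/* Try each clock in order until one returns a valid time. */
int
rtc_init_time(struct timespec *ts)
{
	size_t i;

	for (i = 0; i < rtc_count; i++)
		if (rtc_list[i].gettime(ts) == 0)
			return (0);
	return (-1);
}

/* Demo callbacks: a clock that always fails and one that succeeds. */
int
demo_broken_clock(struct timespec *ts)
{
	(void)ts;
	return (-1);
}

int
demo_good_clock(struct timespec *ts)
{
	ts->tv_sec = 1500000000;
	ts->tv_nsec = 0;
	return (0);
}
```

If the higher-resolution demo clock fails, rtc_init_time() falls through to the coarser one, which mirrors the best-to-worst probing order.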
The code no longer holds a mutex while calling the clock_settime() and
clock_gettime() methods of the registered clocks. This allows clock drivers
to do whatever kind of locking or sleeping is necessary (this is especially
important for i2c clock chips since i2c drivers often need to sleep).
A new clock_register_flags() function allows the clock driver to pass
flags. The flags currently defined help support drivers that use their own
techniques to avoid roundoff errors (prevents the 4/5 rounding done by the
subr_rtc code). A driver which may need to wait for resources (such as bus
ownership) may pass a flag to indicate that it will obtain system time for
itself after waiting for resources; this is merely an optimization to avoid
the common code retrieving a timespec that will never get used.
Relnotes: yes
Differential Revision: https://reviews.freebsd.org/D11484
Uiomove can only block when the segflag is UIO_USERSPACE,
otherwise we end up just doing a bcopy (or nothing) and
moving cursors. So only emit witness warnings and
set deadlock thread flags in the UIO_USERSPACE case.
Reviewed by: kib
Sponsored by: Netflix
Differential Revision: https://reviews.freebsd.org/D11489
socket structure into a listening socket. This resulted in an invalid
instruction fault for all 32-bit platforms.
When INVARIANTS is set the union where the two uninitialized fields
reside gets properly zeroed. This patch ensures the two uninitialized
fields are zeroed when INVARIANTS is undefined.
For 64-bit platforms this issue was not visible because so->sol_upcall,
which is uninitialized, overlaps with so->so_rcv.sb_state, which is
already zero during soalloc().
For 32-bit platforms this issue was visible and resulted in an invalid
instruction fault, because so->sol_upcall overlaps with
so->so_rcv.sb_sel which is always initialized to a valid data pointer
during soalloc().
Verifying the offset locations mentioned above are identical is left
as an exercise to the reader.
PR: 220452
PR: 220358
Reviewed by: ae (network), gallatin
Differential Revision: https://reviews.freebsd.org/D11475
Sponsored by: Mellanox Technologies
The vm_map_fixed() and vm_map_stack() VM functions return Mach error
codes. Convert them into errno values before returning result from
exec_new_vmspace().
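
The conversion pattern can be sketched as below. The status constants and function names are hypothetical stand-ins with an sk_ prefix, and the particular mapping is only modeled loosely on what a kernel helper might do; it is not a copy of the kernel code.

```c
#include <assert.h>
#include <errno.h>

/* Hypothetical Mach-style status codes; values are illustrative only. */
enum sk_vm_status {
	SK_KERN_SUCCESS = 0,
	SK_KERN_INVALID_ADDRESS,
	SK_KERN_PROTECTION_FAILURE,
	SK_KERN_NO_SPACE,
	SK_KERN_INVALID_ARGUMENT
};

/* Translate a Mach-style status into an errno value. */
int
sk_vm_status_to_errno(int rv)
{
	assert(rv >= 0);	/* status codes in this sketch are non-negative */
	switch (rv) {
	case SK_KERN_SUCCESS:
		return (0);
	case SK_KERN_INVALID_ADDRESS:
	case SK_KERN_NO_SPACE:
		return (ENOMEM);
	case SK_KERN_PROTECTION_FAILURE:
		return (EACCES);
	default:
		return (EINVAL);
	}
}

/*
 * Caller pattern: convert at the boundary so that callers only ever see
 * errno values, never raw VM status codes.
 */
int
sk_new_vmspace(int map_rv, int stack_rv)
{
	if (map_rv != SK_KERN_SUCCESS)
		return (sk_vm_status_to_errno(map_rv));
	if (stack_rv != SK_KERN_SUCCESS)
		return (sk_vm_status_to_errno(stack_rv));
	return (0);
}
```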
While there, modernize the comment and do minor style adjustments.
Reviewed by: alc
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
performance.
To find in the leaf bitmap all ranges of sufficient length, use a doubling
strategy with shift-and-and until each bit still set represents a bit
sequence of length 'count', or until the bitmask is zero. In the latter
case, update the hint based on the first bit sequence length not found to
be available. For example, seeking an interval of length 12, the set bits
of the bitmap would represent intervals of length 1, then 2, then 3, then
6, then 12. If no bits are set at the point when each bit represents an
interval of length 6, then the hint can be updated to 5 and the search
terminated.
If long-enough intervals are found, discard those before the cursor. If
any remain, use binary search to find the position of the first of them,
and allocate that interval.
Submitted by: Doug Moore <dougm@rice.edu>
Reviewed by: kib, markj
MFC after: 3 weeks
Differential Revision: https://reviews.freebsd.org/D11426
recycles the current vm space. Otherwise, an mlockall(MCL_FUTURE) could
still be in effect on the process after an execve(2), which violates the
specification for mlockall(2).
It's pointless for vm_map_stack() to check the MEMLOCK limit. It will
never be asked to wire the stack. Moreover, it doesn't even implement
wiring of the stack.
Reviewed by: kib, markj
MFC after: 1 week
Differential Revision: https://reviews.freebsd.org/D11421
Process core notes for a 32-bit process running on a 64-bit host need to
use 32-bit structures so that the note layout matches the layout of notes
of a core dump of a 32-bit process under a 32-bit kernel.
Reviewed by: kib
Differential Revision: https://reviews.freebsd.org/D11407
struct g_kevent_args.
On some architectures, e.g. PowerPC, there is additional padding in uap.
Reported and tested by: andreast
Sponsored by: The FreeBSD Foundation
and "next_skip" variables. The "skip" value in struct blist has long been
a 64-bit quantity but various functions have implicitly truncated this
value to 32 bits. Now, all arithmetic involving the "skip" value is 64
bits wide. (This should allow us to relax the size limit on a swap device
in the swap pager.)
Maintain the ability to test this allocator as a user-space application by
including <stdbool.h>.
Remove an unused variable from blst_radix_print().
Reviewed by: kib, markj
MFC after: 4 weeks
Differential Revision: https://reviews.freebsd.org/D11358
It distinguishes between data flow sockets and listening sockets, and
in case of the latter doesn't change resource limits, since listening
sockets don't hold any buffers, they only carry values to be inherited
by their children.
Most of the lock slowpaths assert that the calling thread isn't an idle
thread. However, this may not be true if the system has panicked, and in
some cases the assertion appears before a SCHEDULER_STOPPED() check.
MFC after: 3 days
Sponsored by: Dell EMC Isilon
device nodes.
Otherwise, the current check of aio_offset == -1LL makes it possible
to pass negative file offsets down to the filesystems. This trips
assertions and is even unsafe for e.g. FFS which keeps metadata at
negative offsets.
Reported and tested by: pho
Reviewed by: jhb
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
Differential revision: https://reviews.freebsd.org/D11266
that disk writes are more likely to be sequential. This change is
beneficial on both the solid state and mechanical disks that I've
tested. (A similar change in allocation policy was made by DragonFly
BSD in 2013 to speed up Poudriere with "stressful memory parameters".)
Increase the width of blst_meta_alloc()'s parameter "skip" and the local
variables whose values are derived from it to 64 bits. (This matches the
width of the field "skip" that is stored in the structure "blist" and
passed to blst_meta_alloc().)
Eliminate a pointless check for a NULL blist_t.
Simplify blst_meta_alloc()'s handling of the ALL-FREE case.
Address nearby style errors.
Reviewed by: kib, markj
MFC after: 5 weeks
Differential Revision: https://reviews.freebsd.org/D11247
By making MAXBCACHEBUF a tunable, it can be increased to allow for
larger read/write data sizes for the NFS client.
The tunable is limited to MAXPHYS, which is currently 128K.
Making MAXPHYS a tunable or increasing its value is being discussed,
since it would be nice to support a read/write data size of 1 Mbyte
for the NFS client when mounting the Amazon EFS file service.
Reviewed by: kib
MFC after: 2 weeks
Relnotes: yes
Differential Revision: https://reviews.freebsd.org/D10991
This change implements NOTE_ABSTIME flag for EVFILT_TIMER, which
specifies that the data field contains absolute time to fire the
event.
To make this useful, the data member of struct kevent must be extended
to 64 bits. Using the opportunity, I also added ext members. This
makes struct kevent almost identical to Apple's struct kevent64, except
that I did not change the types of ident and udata; the latter would
cause serious API incompatibilities.
The type of ident was kept uintptr_t since EVFILT_AIO returns a
pointer in this field, and e.g. CHERI is sensitive to the type
(discussed with brooks, jhb).
Unlike Apple kevent64, symbol versioning allows us to claim ABI
compatibility and still name the new syscall kevent(2). Compat shims
are provided for both host native and compat32.
Requested by: bapt
Reviewed by: bapt, brooks, ngie (previous version)
Sponsored by: The FreeBSD Foundation
Differential revision: https://reviews.freebsd.org/D11025
Display the mbuf/cluster count for a sockbuf and fix a couple whitespace
issues in the output.
Reviewed by: jhb, markj (both previous version)
Approved by: markj (mentor)
Sponsored by: Dell EMC Isilon
Differential Revision: https://reviews.freebsd.org/D11062
This makes ddb show files more descriptive and also adjusts the
whitespace to align the columns for non-32-bit architectures.
Reviewed by: cem (previous version), jhb
Approved by: markj (mentor)
Differential Revision: https://reviews.freebsd.org/D11061
additional allocation overhead. Previously, blst_meta_alloc() updated the
hint after every successful allocation. However, these "eager" hint
updates are of no actual benefit if, instead, the "lazy" hint update at
the start of blst_meta_alloc() is generalized to handle all cases where
the number of available blocks is less than the requested allocation.
Previously, the lazy hint update at the start of blst_meta_alloc() only
handled the ALL-FULL case. (I would also note that this change provides
consistency between blist_alloc() and blist_fill() in that their hint
maintenance is now entirely lazy.)
Eliminate unnecessary checks for terminators in blst_meta_alloc() and
blst_meta_fill() when handling ALL-FREE meta nodes.
Eliminate the field "bl_free" from struct blist. It is redundant. Unless
the entire radix tree is a single leaf, the count of free blocks is stored
in the root node. Instead, provide a function blist_avail() for obtaining
the number of free blocks.
In blst_meta_alloc(), perform a sanity check on the allocation once rather
than repeating it in a loop over the meta node's children.
In blst_leaf_fill(), use the optimized bitcount*() function instead of a
loop to count the blocks being allocated.
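
As a rough illustration of the difference, the sketch below contrasts a bit-by-bit loop with a population-count primitive. The kernel's bitcount*() functions are not available in userland, so the GCC/Clang builtin __builtin_popcountll() is swapped in here; leaf_fill_count() and the other names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Loop version: examine one bit per iteration. */
unsigned int
count_bits_loop(uint64_t mask)
{
	unsigned int n = 0;

	for (; mask != 0; mask >>= 1)
		n += (unsigned int)(mask & 1);
	return (n);
}

/* Optimized version using the compiler's population-count builtin. */
unsigned int
count_bits_fast(uint64_t mask)
{
	return ((unsigned int)__builtin_popcountll(mask));
}

/*
 * Number of blocks newly allocated when the bits in `fill` are cleared
 * from a leaf whose free blocks are the set bits of `free_bits`.
 */
unsigned int
leaf_fill_count(uint64_t free_bits, uint64_t fill)
{
	uint64_t taken = free_bits & fill;

	/* Both methods must agree; only the cost differs. */
	assert(count_bits_loop(taken) == count_bits_fast(taken));
	return (count_bits_fast(taken));
}
```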
Add or improve several comments.
Address some nearby style errors.
Reviewed by: kib
MFC after: 6 weeks
Differential Revision: https://reviews.freebsd.org/D11146
struct thread.
For all architectures, the syscall trap handlers have to allocate the
structure on the stack. The structure takes 88 bytes on 64bit arches
which is not negligible. Also, it cannot be easily found by other
code, which e.g. caused duplication of some members of the structure
to struct thread already. The change removes td_dbg_sc_code and
td_dbg_sc_nargs which were directly copied from syscall_args.
The structure is put into the copied-on-fork part of struct thread
so that the syscall argument information is correct in the child after
fork.
This move will also allow several more uses shortly.
Reviewed by: jhb (previous version)
Sponsored by: The FreeBSD Foundation
MFC after: 3 weeks
X-Differential revision: https://reviews.freebsd.org/D11080
what this field represented was also inaccurate.) Suggested by: kib
In r178792, blist_create() grew a malloc flag, allowing M_NOWAIT to be
specified. However, blist_create() was not modified to handle the
possibility that a malloc() call failed. Address this omission.
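
The general shape of such a fix can be sketched in userland C as below. The structure, the sk_ names, and the simulate_failure switch are all hypothetical; calloc(3) stands in for a kernel malloc() call that is allowed to fail.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical allocator state, far simpler than the real struct blist. */
struct sk_blist {
	uint64_t	nblocks;
	uint64_t	*bitmap;
};

/* Stand-in for an allocation that may fail, as with M_NOWAIT. */
static void *
sk_alloc(size_t size, bool fail)
{
	return (fail ? NULL : calloc(1, size));
}

/* Create an allocator; return NULL if any allocation fails. */
struct sk_blist *
sk_blist_create(uint64_t nblocks, bool simulate_failure)
{
	struct sk_blist *bl;

	assert(nblocks > 0 && nblocks <= (1u << 20));
	bl = sk_alloc(sizeof(*bl), false);
	if (bl == NULL)
		return (NULL);
	bl->nblocks = nblocks;
	bl->bitmap = sk_alloc((size_t)((nblocks + 63) / 64) *
	    sizeof(uint64_t), simulate_failure);
	if (bl->bitmap == NULL) {
		free(bl);	/* do not leak the partially built object */
		return (NULL);
	}
	return (bl);
}

void
sk_blist_destroy(struct sk_blist *bl)
{
	if (bl != NULL) {
		free(bl->bitmap);
		free(bl);
	}
}
```

Callers that request a non-sleeping allocation then have to check for a NULL return before using the allocator.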
Increase the width of the local variable "radix" to 64 bits. (This
matches the width of the corresponding field in struct blist.)
Reviewed by: kib
MFC after: 6 weeks
quantity as the size of the range to fill, but returns a 32-bit quantity
as the number of blocks that were allocated to fill that range. This
revision corrects that mismatch. Currently, swaponsomething() limits
the size of a swap area to prevent arithmetic overflow in
other parts of the blist allocator. That limit has also prevented this
type mismatch from causing problems.
Reviewed by: kib, markj
MFC after: 6 weeks
Differential Revision: https://reviews.freebsd.org/D11096
Provide a new mode "2" which returns a special overflow indicator in
the non-representable field instead of the silent truncation (mode
"0") or EOVERFLOW (mode "1").
In particular, the typical use of st_ino to detect hard links with
mode "2" reports false positives, which might be more suitable for
some uses.
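
The three modes can be sketched as a narrowing helper like the one below. The function name narrow_ino() is hypothetical, and the use of UINT32_MAX as the overflow indicator is an assumption made for this example; the text above does not say which value is used.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Narrow a 64-bit inode number to 32 bits.
 *   mode 0: silently truncate
 *   mode 1: fail with EOVERFLOW
 *   mode 2: store an overflow indicator (assumed here to be UINT32_MAX)
 * Returns 0 on success or an errno value.
 */
int
narrow_ino(uint64_t ino, int mode, uint32_t *out)
{
	assert(out != NULL);
	if (ino <= UINT32_MAX) {
		*out = (uint32_t)ino;	/* representable: all modes agree */
		return (0);
	}
	switch (mode) {
	case 1:
		return (EOVERFLOW);
	case 2:
		*out = UINT32_MAX;
		return (0);
	default:
		*out = (uint32_t)ino;	/* keeps only the low 32 bits */
		return (0);
	}
}
```

In this sketch every non-representable inode number maps to the same indicator in mode 2, so two unrelated files can compare equal on st_ino, which is the false positive mentioned above.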
Discussed with: bde
Sponsored by: The FreeBSD Foundation
o Separate fields of struct socket that belong to listening from
fields that belong to normal dataflow, and unionize them. This
shrinks the structure a bit.
- Take the selinfos out of the socket buffers and into the socket. The
first reason is to support the braindamaged scenario in which a socket
is added to kevent(2) and then listen(2) is called on it. The second
reason is that there is a future plan to make socket buffers pluggable,
so that for a dataflow socket a socket buffer can be changed, and
in this case we also want to keep the same selinfos through the lifetime
of a socket.
- Remove struct so_accf. Since the listening fields no longer
affect the size of struct socket, just move its fields into the
listening part of the union.
- Provide sol_upcall field and enforce that so_upcall_set() may be called
only on a dataflow socket, which has buffers, and for listening sockets
provide solisten_upcall_set().
o Remove ACCEPT_LOCK() global.
- Add a mutex to socket, to be used instead of socket buffer lock to lock
fields of struct socket that don't belong to a socket buffer.
- Allow acquiring two socket locks, but the first one must belong to a
listening socket.
- Make soref()/sorele() use atomic(9). This allows soref() to be done in
some situations without owning the socket lock. There is room for
improvement here: it is possible to make sorele() lock optionally as well.
- Most protocols aren't touched by this change, except UNIX local sockets.
See below for more information.
o Reduce copy-and-paste in kernel modules that accept connections from
listening sockets: provide function solisten_dequeue(), and use it in
the following modules: ctl(4), iscsi(4), ng_btsocket(4), ng_ksocket(4),
infiniband, rpc.
o UNIX local sockets.
- Removal of ACCEPT_LOCK() global uncovered several races in the UNIX
local sockets. Most races exist around spawning a new socket, when we
are connecting to a local listening socket. To cover them, we need to
hold locks on both PCBs when spawning a third one. This means holding
them across sonewconn(). This creates a LOR between pcb locks and
unp_list_lock.
- To fix the new LOR, abandon the global unp_list_lock in favor of global
unp_link_lock. Indeed, separating these two locks didn't provide us any
extra parallelism in the UNIX sockets.
- Calls into uipc_attach() may now happen with unp_link_lock held, if we
are accepting, or without unp_link_lock, if we are just creating
a socket.
- Another problem in UNIX sockets is that uipc_close() basically did nothing
for a listening socket. The vnode remained open for connections. This
is fixed by removing the vnode in uipc_close(). Maybe the right way would be
to do it for all sockets (not only listening ones) and simply move the vnode
teardown from uipc_detach() to uipc_close()?
Sponsored by: Netflix
Differential Revision: https://reviews.freebsd.org/D9770
subtree is already zero, then setting the "largest contiguous free block"
hint for that subtree to anything other than zero makes no sense. To be
clear, assigning a value to the hint that is too large is not a correctness
problem, only a pessimization.
DragonFly BSD has applied the same change to blst_meta_alloc() but not
blst_meta_fill().
MFC after: 6 weeks
This happens when closing a socket with an upcall; the trace is: soclose()->
... protocol ... -> soisdisconnected() -> socantrcvmore_locked() ->
sowakeup() -> soisconnected().
Right now this case is innocent for two reasons. First, soisconnected()
doesn't clear the SS_ISDISCONNECTED flag. Second, the mutex to lock the
socket is the socket receive buffer mutex, and soisdisconnected() first
disables the receive buffer. But in future code, the mutex that locks the
socket will differ from the buffer mutex, and we would get undesired mutex
recursion.
The fix is to check the SS_ISDISCONNECTED flag before calling the upcall.
testing purposes. However, over the years, various changes to the kernel
have broken this feature. This revision applies some fixes to get user-
space compilation working again. There are no changes in this revision
to code that is used by the kernel.
MFC after: 3 days
pager used a different scheme for striping the allocation of swap space
across multiple devices. And, although blist_fill() was intended to support
fill operations with large counts, the old striping scheme never performed a
fill larger than the stripe size. Consequently, the misplacement of a
sanity check in blst_meta_fill() went undetected. Now, moving forward in
time to r118390, a new scheme for striping was introduced that maintained a
blist allocator per device, but as noted in r318995, swapoff_one() was not
fully and correctly converted to the new scheme. This change completes what
was started in r318995 by fixing the underlying bug in blst_meta_fill() that
stops swapoff_one() from simply performing a single blist_fill() operation.
Reviewed by: kib
MFC after: 5 days
Differential Revision: https://reviews.freebsd.org/D11043