When sendmsg(2) sucessfully internalized one SCM_RIGHTS control
message, but failed to process some other control message later, both
file references and filedescent memory needs to be freed. This was not
done, only mbuf chain was freed.
Noted, test case written, reviewed by: markj
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
Differential revision: https://reviews.freebsd.org/D21000
I accidentally broke the main point of r349248 when making stylistic changes
in r349391. Restore the original behavior, and also fix an additional
overflow that was possible when uio->uio_resid was nearly SSIZE_MAX.
Reported by: cem
Reviewed by: bde
MFC after: 2 weeks
MFC-With: 349248
Sponsored by: The FreeBSD Foundation
Add format capability to core file names to include signal
that generated the core. This can help various validation workflows
where all cores should not be considered equally (SIGQUIT is often
intentional and not an error unlike SIGSEGV or SIGBUS)
Submitted by: David Leimbach (leimy2k@gmail.com)
Reviewed by: markj
MFC after: 1 week
Relnotes: sysctl kern.corefile can now include the signal number
Sponsored by: Dell EMC Isilon
Differential Revision: https://reviews.freebsd.org/D20970
This ptrace operation returns a structure containing the error and
return values from the current system call. It is only valid when a
thread is stopped during a system call exit (PL_FLAG_SCX is set).
The sr_error member holds the error value from the system call. Note
that this error value is the native FreeBSD error value that has _not_
been translated to an ABI-specific error value similar to the values
logged to ktrace.
If sr_error is zero, then the return values of the system call will be
set in sr_retval[0] and sr_retval[1].
Reviewed by: kib
MFC after: 1 month
Sponsored by: DARPA
Differential Revision: https://reviews.freebsd.org/D20901
syscallret() doesn't use error anymore. Fix a few other places to permit
removing the return value from syscallenter() entirely.
- Remove a duplicated assertion from arm's syscall().
- Use td_errno for amd64_syscall_ret_flush_l1d.
Reviewed by: kib
MFC after: 1 month
Sponsored by: DARPA
Differential Revision: https://reviews.freebsd.org/D2090
Early errors prior to a system call did not set td_errno. This commit
sets td_errno for all errors during syscallenter(). As a result,
syscallret() can now always use td_errno without checking TDP_NERRNO.
Reviewed by: kib
MFC after: 1 month
Sponsored by: DARPA
Differential Revision: https://reviews.freebsd.org/D20898
If umtxq_check_susp() indicates an exit, we should clean the resources
before returning. Do it by breaking out of the loop and relying on
post-loop cleanup.
Reviewed by: markj
Tested by: pho
Sponsored by: The FreeBSD Foundation
MFC after: 12 days
Differential revision: https://reviews.freebsd.org/D20949
After r349951, the return code must be checked instead of old == new
comparision.
Reviewed by: markj
Tested by: pho
Sponsored by: The FreeBSD Foundation
MFC after: 12 days
Differential revision: https://reviews.freebsd.org/D20949
When using the SOL_SOCKET level socket option SO_LINGER, the structure
struct linger is used as the option value. The component l_linger is of
type int, but internally copied to the field so_linger of the structure
struct socket. The type of so_linger is short, but it is assumed to be
non-negative and the value is used to compute ticks to be stored in a
variable of type int.
Therefore, perform input validation on l_linger similar to the one
performed by NetBSD and OpenBSD.
Thanks to syzkaller for making me aware of this issue.
Thanks to markj@ for pointing out that a similar check should be added
to so_linger_set().
Reviewed by: markj@
MFC after: 2 weeks
Differential Revision: https://reviews.freebsd.org/D20948
Casueword(9) on ll/sc architectures must be prepared for userspace
constantly modifying the same cache line as containing the CAS word,
and not loop infinitely. Otherwise, rogue userspace livelocks the
kernel.
To fix the issue, change casueword(9) interface to return new value 1
indicating that either comparision or store failed, instead of relying
on the oldval == *oldvalp comparison. The primitive no longer retries
the operation if it failed spuriously. Modify callers of
casueword(9), all in kern_umtx.c, to handle retries, and react to
stops and requests to terminate between retries.
On x86, despite cmpxchg should not return spurious failures, we can
take advantage of the new interface and just return PSL.ZF.
Reviewed by: andrew (arm64, previous version), markj
Tested by: pho
Reported by: https://xenbits.xen.org/xsa/advisory-295.txt
Sponsored by: The FreeBSD Foundation
MFC after: 2 weeks
Differential revision: https://reviews.freebsd.org/D20772
if a retry to allocate swap space, after a larger allocation attempt failed, allocated a smaller set of free blocks
that ended on a 32- or 64-block boundary.
Add tests to detect this kind of failure-to-extend-at-boundary and prevent the associated accounting screwup.
Reported by: pho
Tested by: pho
Reviewed by: alc
Approved by: markj (mentor)
Discussed with: kib
Differential Revision: https://reviews.freebsd.org/D20893
Thus, when using proccontrol(1) to disable implicit application of
PROT_MAX within a process, child processes will inherit this setting.
Discussed with: kib
MFC with: r349609
Sponsored by: The FreeBSD Foundation
This is more consistent with the rest of the function and lets us
unindent most of the function.
Reviewed by: kib
MFC after: 1 month
Sponsored by: DARPA
Differential Revision: https://reviews.freebsd.org/D20897
The hold_count and wire_count fields of struct vm_page are separate
reference counters with similar semantics. The remaining essential
differences are that holds are not counted as a reference with respect
to LRU, and holds have an implicit free-on-last unhold semantic whereas
vm_page_unwire() callers must explicitly determine whether to free the
page once the last reference to the page is released.
This change removes the KPIs which directly manipulate hold_count.
Functions such as vm_fault_quick_hold_pages() now return wired pages
instead. Since r328977 the overhead of maintaining LRU for wired pages
is lower, and in many cases vm_fault_quick_hold_pages() callers would
swap holds for wirings on the returned pages anyway, so with this change
we remove a number of page lock acquisitions.
No functional change is intended. __FreeBSD_version is bumped.
Reviewed by: alc, kib
Discussed with: jeff
Discussed with: jhb, np (cxgbe)
Tested by: pho (previous version)
Sponsored by: Netflix
Differential Revision: https://reviews.freebsd.org/D19247
after the one where the possible block allocation begins, and allocate
a larger number of blocks than the current limit. This does not affect
the limit on minimum allocation size, which still cannot exceed
BLIST_MAX_ALLOC.
Use this change to modify swp_pager_getswapspace and its callers, so
that they can allocate more than BLIST_MAX_ALLOC blocks if they are
available.
Tested by: pho
Approved by: markj (mentor)
Differential Revision: https://reviews.freebsd.org/D20579
We were otherwise failing to call funsetown() for some descriptors
associated with a tty, such as pts descriptors. Then, if the
descriptor is closed before the owner exits, we may get memory
corruption.
Reported by: syzbot+c9b6206303bf47bac87e@syzkaller.appspotmail.com
Reviewed by: ed
MFC after: 3 days
Sponsored by: The FreeBSD Foundation
Save the last callout function pointer (and its argument) executed
on each CPU for inspection by a debugger. Add a ddb `show callout_last`
command to show these pointers. Add a kernel module that I used
for testing that command.
Relocate `ce_migration_cpu` to reduce padding and therefore preserve
the size of `struct callout_cpu` (320 bytes on amd64) despite the
added members.
This should help diagnose reference-after-free bugs where the
callout's mutex has already been freed when `softclock_call_cc`
tries to unlock it.
You might hope that the pointer would still be available, but it
isn't. The argument to that function is on the stack (because
`softclock_call_cc` uses it later), and that might be enough in
some cases, but even then, it's very laborious. A pointer to the
callout is saved right before these newly added fields, but that
callout might have been freed. We still have the pointer to its
associated mutex, and the name within might be enough, but it might
also have been freed.
Reviewed by: markj jhb
MFC after: 2 weeks
Sponsored by: Dell EMC Isilon
Differential Revision: https://reviews.freebsd.org/D20794
Fix a mis-merge when extracting the unmapped mbuf changes from
Netflix's in-kernel TLS changes where the call to the function that
freed the backing pages from an unmapped mbuf was missed.
Sponsored by: Chelsio Communications
feature bit.
In particular, allocate the bit to opt-out the image from implicit
PROTMAX enablement. Provide procctl(2) verbs to set and query
implicit PROTMAX handling. The knobs mimic the same per-image flag
and per-process controls for ASLR.
Reviewed by: emaste, markj (previous version)
Discussed with: brooks
Sponsored by: The FreeBSD Foundation
Differential revision: https://reviews.freebsd.org/D20795
Previously we would attempt to unlock the socket buffer despite having
failed to lock it. Simply return an error instead: no resources need
to be released at this point, and doing so is consistent with
soreceive_generic().
PR: 238789
Submitted by: Greg Becker <greg@codeconcepts.com>
MFC after: 1 week
This patch factors the code in vn_truncate() that does the actual
VOP_SETATTR() of size into a separate function called vn_truncate_locked().
This will allow the NFS server and the patch that adds a
copy_file_range(2) syscall to call this function instead of duplicating
the code and carrying over changes, such as the recent r347151.
Reviewed by: kib
Differential Revision: https://reviews.freebsd.org/D20808
fget_mmap() translates rights on the descriptor to a VM protection
mask. It was doing so without holding any locks on the descriptor
table, so a writer could simultaneously be modifying those rights.
Such a situation would be detected using a sequence counter, but
not before an inconsistency could trigger assertion failures in
the capability code.
Fix the problem by copying the fd's rights to a structure on the stack,
and perform the translation only once we know that that snapshot is
consistent.
Reported by: syzbot+ae359438769fda1840f8@syzkaller.appspotmail.com
Reviewed by: brooks, mjg
MFC after: 1 week
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D20800
We use PIPE_DIRECTW as a semaphore for direct writes to a pipe, where
the reader copies data directly from pages mapped into the writer.
However, when a reader finishes such a copy, it previously cleared
PIPE_DIRECTW, allowing multiple writers to race and corrupt the state
used to track wired pages belonging to the writer.
Fix this by having the writer clear PIPE_DIRECTW and instead use the
count of unread bytes to determine whether a write is finished.
Reported by: syzbot+21811cc0a89b2a87a9e7@syzkaller.appspotmail.com
Reviewed by: kib, mjg
Tested by: pho
MFC after: 1 week
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D20784
Apply similar logic from sbcompress to pending data in the socket
buffer once it is marked ready via sbready. Normally sbcompress
merges small mbufs to reduce the length of mbuf chains in the socket
buffer. However, sbcompress cannot do this for mbufs marked
M_NOTREADY. sbcompress_ready is now called from sbready when mbufs
are marked ready to merge small mbuf chains once the data is available
to copy.
Submitted by: gallatin (earlier version)
Reviewed by: gallatin, hselasky, rrs
Sponsored by: Netflix
Differential Revision: https://reviews.freebsd.org/D20616
This can be enabled at runtime via the kern.ipc.mb_use_ext_pgs sysctl.
It is disabled by default.
Submitted by: gallatin (earlier version)
Reviewed by: gallatin, hselasky, rrs
Relnotes: yes
Sponsored by: Netflix
Differential Revision: https://reviews.freebsd.org/D20616
Unmapped mbufs allow sendfile to carry multiple pages of data in a
single mbuf, without mapping those pages. It is a requirement for
Netflix's in-kernel TLS, and provides a 5-10% CPU savings on heavy web
serving workloads when used by sendfile, due to effectively
compressing socket buffers by an order of magnitude, and hence
reducing cache misses.
For this new external mbuf buffer type (EXT_PGS), the ext_buf pointer
now points to a struct mbuf_ext_pgs structure instead of a data
buffer. This structure contains an array of physical addresses (this
reduces cache misses compared to an earlier version that stored an
array of vm_page_t pointers). It also stores additional fields needed
for in-kernel TLS such as the TLS header and trailer data that are
currently unused. To more easily detect these mbufs, the M_NOMAP flag
is set in m_flags in addition to M_EXT.
Various functions like m_copydata() have been updated to safely access
packet contents (using uiomove_fromphys()), to make things like BPF
safe.
NIC drivers advertise support for unmapped mbufs on transmit via a new
IFCAP_NOMAP capability. This capability can be toggled via the new
'nomap' and '-nomap' ifconfig(8) commands. For NIC drivers that only
transmit packet contents via DMA and use bus_dma, adding the
capability to if_capabilities and if_capenable should be all that is
required.
If a NIC does not support unmapped mbufs, they are converted to a
chain of mapped mbufs (using sf_bufs to provide the mapping) in
ip_output or ip6_output. If an unmapped mbuf requires software
checksums, it is also converted to a chain of mapped mbufs before
computing the checksum.
Submitted by: gallatin (earlier version)
Reviewed by: gallatin, hselasky, rrs
Discussed with: ae, kp (firewalls)
Relnotes: yes
Sponsored by: Netflix
Differential Revision: https://reviews.freebsd.org/D20616
The epoch_drain_callbacks() function is used to drain all pending
callbacks which have been invoked by prior epoch_call() function calls
on the same epoch. This function is useful when there are shared
memory structure(s) referred to by the epoch callback(s) which are not
refcounted and are rarely freed. The typical place for calling this
function is right before freeing or invalidating the shared
resource(s) used by the epoch callback(s). This function can sleep and
is not optimized for performance.
Differential Revision: https://reviews.freebsd.org/D20109
MFC after: 1 week
Sponsored by: Mellanox Technologies
A future patch that will add a Linux compatible copy_file_range(2) syscall
needs to be able to lock the byte ranges of two files concurrently.
To do this without a risk of deadlock, a non-blocking variant of
vn_rangelock_rlock() called vn_rangelock_tryrlock() was needed.
This patch adds this, along with vn_rangelock_trywlock(), in order to
do this.
The patch also adds a couple of comments, that I hope clarify how the
algorithm used in kern_rangelock.c works.
Reviewed by: kib, asomers (previous version)
Differential Revision: https://reviews.freebsd.org/D20645
r160875 added sbdestroy() as a wrapper around sbrelease_internal to be
called from sofree(), yet the comment added in the same revision to
sofree() still mentions sbrelease_internal().
Sponsored by: Netflix
Differential Revision: https://reviews.freebsd.org/D20488
into existing files for sugid processes.
Despite using real user id pronounces the intent, it actually breaks
suid coredumps, while not making any difference for non-sugid
processes. The reason for the breakage is that non-existent core file
is created with the effective uid (unless weird hacks like SUIDDIR are
configured).
Then, if user enabled kern.sugid_coredump, core dumping should not
overwrite core files owned by effective uid, but we cannot pretend to
use real uid for dumping.
PR: 68905
admbugs: 358
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
VOP_READ and VOP_WRITE take the seqcount in blocks in a 16-bit field.
However, fcntl allows you to set the seqcount in bytes to any nonnegative
31-bit value. The result can be a 16-bit overflow, which will be
sign-extended in functions like ffs_read. Fix this by sanitizing the
argument in kern_fcntl. As a matter of policy, limit to IO_SEQMAX rather
than INT16_MAX.
Also, fifos have overloaded the f_seqcount field for a completely different
purpose ever since r238936. Formalize that by using a union type.
Reviewed by: cem
MFC after: 2 weeks
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D20710
This ioctl exposes VOP_BMAP information to userland. It can be used by
programs like fragmentation analyzers and optimized cp implementations. But
I'm using it to test fusefs's VOP_BMAP implementation. The "2" in the name
distinguishes it from the similar but incompatible FIBMAP ioctls in NetBSD
and Linux. FIOBMAP2 differs from FIBMAP in that it uses a 64-bit block
number instead of 32-bit, and it also returns runp and runb.
Reviewed by: mckusick
MFC after: 2 weeks
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D20705
wakeup_one() and underlying sleepq_signal() spend additional time trying
to be fair, waking thread with highest priority, sleeping longest time.
But in case of taskqueue there are many absolutely identical threads, and
any fairness between them is quite pointless. It makes even worse, since
round-robin wakeups not only make previous CPU affinity in scheduler quite
useless, but also hide from user chance to see CPU bottlenecks, when
sequential workload with one request at a time looks evenly distributed
between multiple threads.
This change adds new SLEEPQ_UNFAIR flag to sleepq_signal(), making it wakeup
thread that went to sleep last, but no longer in context switch (to avoid
immediate spinning on the thread lock). On top of that new wakeup_any()
function is added, equivalent to wakeup_one(), but setting the flag.
On top of that taskqueue(9) is switchied to wakeup_any() to wakeup its
threads.
As result, on 72-core Xeon v4 machine sequential ZFS write to 12 ZVOLs
with 16KB block size spend 34% less time in wakeup_any() and descendants
then it was spending in wakeup_one(), and total write throughput increased
by ~10% with the same as before CPU usage.
Reviewed by: markj, mmacy
MFC after: 2 weeks
Sponsored by: iXsystems, Inc.
Differential Revision: https://reviews.freebsd.org/D20669
When it comes to megabytes of text, difference between sbuf_printf() and
sbuf_cat() becomes substantial.
MFC after: 2 weeks
Sponsored by: iXsystems, Inc.
On large systems those sysctls may generate megabytes of output. Before
this change sbuf(9) code was resizing buffer by 4KB each time many times,
generating tons of TLB shootdowns. Unfortunately in this case existing
sbuf_new_for_sysctl() mechanism, supposed to help with this issue, is not
applicable, since all the sbuf writes are done in different kernel thread.
This change improves situation in two ways:
- on first sysctl call, not providing any output buffer, it sets special
sbuf drain function, just counting the data and so not needing big buffer;
- on second sysctl call it uses as initial buffer size value saved on
previous call, so that in most cases there will be no reallocation, unless
GEOM topology changed significantly.
MFC after: 1 week
Sponsored by: iXsystems, Inc.
rename the source to gsb_crc32.c.
This is a prerequisite of unifying kernel zlib instances.
PR: 229763
Submitted by: Yoshihiro Ota <ota at j.email.ne.jp>
Differential Revision: https://reviews.freebsd.org/D20193
Normally td_runtime is updated on context switch, but there are some kernel
threads that due to high absolute priority may run for many seconds without
context switches (yes, that is bad, but that is true), which means their
td_runtime was not updated all that time, that made them invisible for top
other then as some general CPU usage.
MFC after: 1 week
Sponsored by: iXsystems, Inc.