Based on the description in Linux man page.
Reviewed by: markj, ngie (previous version)
Sponsored by: Mellanox Technologies
MFC after: 1 week
Differential revision: https://reviews.freebsd.org/D18837
o In vm_pager_bufferinit() create pbuf_zone and start accounting on how many
pbufs are we going to have set.
In various subsystems that are going to utilize pbufs create private zones
via call to pbuf_zsecond_create(). The latter calls uma_zsecond_create(),
and sets a limit on created zone. After startup preallocate pbufs according
to requirements of all pbuf zones.
Subsystems that used to have a private limit with old allocator now have
private pbuf zones: md(4), fusefs, NFS client, smbfs, VFS cluster, FFS,
swap, vnode pager.
The following subsystems use shared pbuf zone: cam(4), nvme(4), physio(9),
aio(4). They should have their private limits, but changing that is out of
scope of this commit.
o Fetch tunable value of kern.nswbuf from init_param2() and while here move
NSWBUF_MIN to opt_param.h and eliminate opt_swap.h, that was holding only
this option.
Default values aren't touched by this commit, but they probably should be
reviewed wrt to modern hardware.
This change removes a tight bottleneck from sendfile(2) operation, that
uses pbufs in vnode pager. Other pagers also would benefit from faster
allocation.
Together with: gallatin
Tested by: pho
vm_kmem_size is u_long, and it might be not capable of holding page
count times PAGE_SIZE, even when scaled down by VM_KMEM_SIZE_SCALE. As
bde reported, 12G PAE config ends up with zero for kmem size.
Explicitly check for overflow and clamp kmem size at vm_kmem_size_max.
If we end up at zero size because VM_KMEM_SIZE_MAX is not defined,
panic with clear explanation rather then failing in a way which is
hard to relate.
Reported by: bde, pho
Tested by: pho
Reviewed by: markj
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
Differential revision: https://reviews.freebsd.org/D18767
r319722 separated struct socket and parts of the socket I/O path into
listening-socket-specific and dataflow-socket-specific pieces. Listening
socket connection notifications are now handled by solisten_wakeup() instead
of sowakeup(), but solisten_wakeup() does not currently post SIGIO to the
owning process.
PR: 234258
Reported by: Kenneth Adelman
MFC after: 1 week
Differential Revision: https://reviews.freebsd.org/D18664
When building with KCOV enabled the compiler will insert function calls
to probes allowing us to trace the execution of the kernel from userspace.
These probes are on function entry (trace-pc) and on comparison operations
(trace-cmp).
Userspace can enable the use of these probes on a single kernel thread with
an ioctl interface. It can allocate space for the probe with KIOSETBUFSIZE,
then mmap the allocated buffer and enable tracing with KIOENABLE, with the
trace mode being passed in as the int argument. When complete KIODISABLE
is used to disable tracing.
The first item in the buffer is the number of trace event that have
happened. Userspace can write 0 to this to reset the tracing, and is
expected to do so on first use.
The format of the buffer depends on the trace mode. When in PC tracing just
the return address of the probe is stored. Under comparison tracing the
comparison type, the two arguments, and the return address are traced. The
former method uses on entry per trace event, while the later uses 4. As
such they are incompatible so only a single mode may be enabled.
KCOV is expected to help fuzzing the kernel, and while in development has
already found a number of issues. It is required for the syzkaller system
call fuzzer [1]. Other kernel fuzzers could also make use of it, either
with the current interface, or by extending it with new modes.
A man page is currently being worked on and is expected to be committed
soon, however having the code in the kernel now is useful for other
developers to use.
[1] https://github.com/google/syzkaller
Submitted by: Mitchell Horne <mhorne063@gmail.com> (Earlier version)
Reviewed by: kib
Testing by: tuexen
Sponsored by: DARPA, AFRL
Sponsored by: The FreeBSD Foundation (Mitchell Horne)
Differential Revision: https://reviews.freebsd.org/D14599
Using daddr_t instead of int avoids trunclbn to become negative when it
shouldn't.
This isssue was found by running syzkaller.
Reviewed by: mckusick, kib, markj
MFC after: 1 week
Differential Revision: https://reviews.freebsd.org/D18763
With new sysctls (to the best of our ability do detect them). Restructured
smp.4 slightly for clarity (keep relevant stuff closer to the top) while
documenting.
Reviewed by: markj, jhibbits (ppc parts)
MFC after: 3 days
Sponsored by: Dell EMC Isilon
Differential Revision: https://reviews.freebsd.org/D18322
As it does for recv*(2), MSG_DONTWAIT indicates that the call should
not block, returning EAGAIN instead. Linux and OpenBSD both implement
this, so the change makes porting easier, especially since we do not
return EINVAL or so when unrecognized flags are specified.
Submitted by: Greg V <greg@unrelenting.technology>
Reviewed by: tuexen
MFC after: 1 week
Differential Revision: https://reviews.freebsd.org/D18728
This merge brings in a couple new files, which needed to be attached to the
build; a new dependency on <limits.h>, which must be stubbed; and a name
change in the Context parameter constants, from ZSTD_p_foo to ZSTD_c_foo.
Significantly, it fixes a kernel build error with GCC where floating-point
functions were included in the kernel build, by hiding them under the same
compile-time #ifdef that already covered their invocation. That issue was
introduced to FreeBSD in the 1.3.7 update and tracked upstream here:
https://github.com/facebook/zstd/issues/1386
The full 1.3.8 release notes can be found on Github:
https://github.com/facebook/zstd/releases/tag/v1.3.8
Relnotes: yes
If the memory size does not fit into u_long, current code truncates
the returned value and returns complete nonsense. Make the result
slightly more useful by clamping it at ULONG_MAX.
Reported and tested : pho
MFC after: 1 week
Sponsored by: The FreeBSD Foundation
Kernel now includes jail ID when logging a process exit. jid is 0 for unjailed
processes.
Submitted by: Marie Helene Kvello-Aune <freebsd@mhka.no>
Relnotes: yes
Sponsored by: Modirum MDPay
Differential Revision: https://reviews.freebsd.org/D18618
SVN r340744 erroneously changed pfind() to return any process including
zombies and pfind_any() to return only non-zombie processes.
In particular, this caused kill() on a zombie process to fail with [ESRCH].
There is no direct test case for this but /usr/tests/bin/sh/builtins/kill1.0
occasionally triggers it (as reported by lwhsu).
Conversely, returning zombies from pfind() seems likely to violate
invariants and cause panics, but I have not looked at this.
PR: 233646
Reviewed by: mjg, kib, ngie
Differential Revision: https://reviews.freebsd.org/D18665
If invalid, return EINVAL. Note that inode check-hashes greatly
reduce the chance that these errors will go undetected.
Reported by: Christopher Krah <krah@protonmail.com>
Reported as: FS-5-UFS-2: Denial Of Service in nmount-3 (ffs_read)
Reviewed by: kib
MFC after: 1 week
Sponsored by: Netflix
M sys/fs/ext2fs/ext2_vnops.c
M sys/kern/vfs_subr.c
M sys/ufs/ffs/ffs_snapshot.c
M sys/ufs/ufs/ufs_vnops.c
This value remained unchanged for 15 years, and now this bump reduces
lock spinning in GEOM and BIO layers while doing ~1.6M IOPS to 4 NVMe
on 72-core system from ~25% to ~5% by the cost of additional 28KB RAM.
While there, align struct mtx_pool fields to cache lines.
MFC after: 1 month
The presence of allocated v_object does not imply that the buffer is
necessary VMIO kind. Buffer might has been allocated before the
object created, then the buffer is malloced. Although we try to avoid
such situation, it seems to be still legitimate.
Reported and tested by: pho
MFC after: 1 week
Sponsored by: The FreeBSD Foundation
POSIX_FADV_DONTNEED). The most broken case was for applications that
advise for the whole file and then do block-aligned i/o's 1 block at
a time. Then advice is sent to VOP_ADVISE() 1 block at a time, but
in vop_stdadvise() the 1-block advice was turned into 0-block advice
for the buffer cache part.
The bugs were caused partly by callers representing the region as
(a_start, a_end), where a_end is actually the maximum, and everything
else representing the region as (start, end) where 'end' is actually
the end (1 after the maximum). The maximum a_end must be rounded up,
but was rounded down. Also, rounding to page boundaries was inconsistent.
The bugs and fixes have no effect for zfs and other file systems that
don't use the buffer cache or the page cache. Most or all file systems
currently use the default VOP_FADVISE(), but it finds a null buffer cache
and a null page cache for file systems that don't use normal methods.
Reviewed by: kib
be called before VFS_ROOT() is called. Move the call for VFS_STATFS()
so that it is done after VFS_MOUNT(), but before VFS_ROOT().
This change actually improves the robustness of the mount system
call because it returns an error rather than failing silently
when VFS_STATFS() returns failure.
Reported by: Rebecca Cran <rebecca@bluestop.org>
Sponsored by: Netflix
The goal of this change is to fix a problem with PCI shared interrupts
during suspend and resume.
I have observed a couple of variations of the following scenario.
Devices A and B are on the same PCI bus and share the same interrupt.
Device A's driver is suspended first and the device is powered down.
Device B generates an interrupt. Interrupt handlers of both drivers are
called. Device A's interrupt handler accesses registers of the powered
down device and gets back bogus values (I assume all 0xff). That data is
interpreted as interrupt status bits, etc. So, the interrupt handler
gets confused and may produce some noise or enter an infinite loop, etc.
This change affects only PCI devices. The pci(4) bus driver marks a
child's interrupt handler as suspended after the child's suspend method
is called and before the device is powered down. This is done only for
traditional PCI interrupts, because only they can be shared.
At the moment the change is only for x86.
Notable changes in core subsystems / interfaces:
- BUS_SUSPEND_INTR and BUS_RESUME_INTR methods are added to bus
interface along with convenience functions bus_suspend_intr and
bus_resume_intr;
- rman_set_irq_cookie and rman_get_irq_cookie functions are added to
provide a way to associate an interrupt resource with an interrupt
cookie;
- intr_event_suspend_handler and intr_event_resume_handler functions
are added to the MI interrupt handler interface.
I added two new interrupt handler flags, IH_SUSP and IH_CHANGED, to
implement the new intr_event functions. IH_SUSP marks a suspended
interrupt handler. IH_CHANGED is used to implement a barrier that
ensures that a change to the interrupt handler's state is visible
to future interrupts.
While there, I fixed some whitespace issues in comments and changed a
couple of logically boolean variables to be bool.
MFC after: 1 month (maybe)
Differential Revision: https://reviews.freebsd.org/D15755
check-hash fails. Panic'ing is not an appropriate response. So, check
for an error return from VFS_ROOT() and when an error is reported,
unwind and return the error.
Reported by: Gary Jennejohn (gj)
Sponsored by: Netflix
1) filecaps_init was unnecesarily a function call
2) an asignment at the end was preventing tail calling of cap_rights_init
Sponsored by: The FreeBSD Foundation
It is a function call only to accomodate *some* ABIs which install a hook.
They only care for 3 types of limits: DATA, STACK, VMEM
Instead of always calling the func, see at compilation time if the requested
limit is something else and just do the read if so.
Sponsored by: The FreeBSD Foundation
Bootstacks are unused after APs executed sched_throw() in
init_secondary_tail() and started executing on proper idle thread
stack. Add sysinit that detects that the idle thread for each CPU was
scheduled at least once, and free corresponding bootstack.
Slight addition of the code (~200 bytes) is compensated by the saving,
because even on typical small modern desktop CPU we leak 128K of
memory otherwise (4 pages x 8 threads).
Reviewed by: jhb
MFC after: 1 week
Differential revision: https://reviews.freebsd.org/D18486
There is no reason for it to behave differently from openat(fd, NULL).
Also the handling did not worked because the substituted path was from
the system address space, causing EFAULT.
Submitted by: Jack Halford <jack@gandi.net>
MFC after: 1 week
Differential revision: https://reviews.freebsd.org/D18501
Once a signal's siginfo was copied to 'td_si' as part of the signal
exchange in issignal(), it was never cleared. This caused future
thread events that are reported as SIGTRAP events without signal
information to report the stale siginfo in 'td_si'. For example, if a
debugger created a new process and used SIGSTOP to stop it after
PT_ATTACH, future system call entry / exit events would set PL_FLAG_SI
with the SIGSTOP siginfo in pl_siginfo. This broke 'catch syscall' in
current versions of gdb as it assumed PL_FLAG_SI with SIGTRAP
indicates a breakpoint or single step trap.
Reviewed by: kib
MFC after: 1 week
Differential Revision: https://reviews.freebsd.org/D18487
and includes the last block represented by the leaf. The reasoning is that,
if the last block is included, then there must be no solution before that
one in the leaf, so the leaf cannot provide an allocation that big again;
indeed, the leaf cannot provide a solution bigger than range1.
Which is all correct, except that if the value of blk passed in did not
represent the first block of the leaf, because the cursor was pointing to
the middle of the leaf, then a possible solution before the cursor may have
been ignored, and bighint cannot be updated.
Consider the sequence allocate 63 (returning address 0), free 0,63 (freeing
that same block, and allocate 1 (returning 63). The result is that one
block is allocated from the first leaf, and the value of bighint is 0, so
that nothing can be allocated from that leaf until the only block allocated
from that leaf is freed. This change detects that skipped-over solution,
and when there is one it makes sure that the value of bighint is not changed
when the last block is allocated.
Submitted by: Doug Moore <dougm@rice.edu>
Tested by: pho
X-MFC with: r340402
Differential Revision: https://reviews.freebsd.org/D18474
If all IDs from trypid to pid_max were used as pids, the code would enter
a loop which would be infinite if none of the IDs could become free (e.g.
they all belong to processes which did not transitioned to zombie).
Fixes: r341684 ("Manage process-related IDs with bitmaps")
Sponsored by: The FreeBSD Foundation
kqueue would always relock immediately afterwards.
While here drop the NULL check for list itself. The list is
always allocated.
Sponsored by: The FreeBSD Foundation