Commit Graph

19592 Commits

Author SHA1 Message Date
Andrew Gallatin
1985585233 ktls: re-work alloc thread
When the ktls_buffer zone needs to expand, it may fail due
to a lack of physically contiguous memory.  We tried to rectify
that by introducing an alloc thread to provide a context where
it is harmless to sleep, and letting that thread repopulate
the ktls_buffer zone.

However, it turns out that M_WAITOK is not enough, and we
must call vm_page_reclaim_contig_domain() to reclaim contig
memory. Worse, M_WAITOK results in the allocation essentially
busy-looping around vm_domain_alloc_fail() returning EAGIN,
causing vm_page_alloc_noobj_contig_domain() to loop and resulting
in the alloc thread consuming 100% CPU.

To fix this, we change the alloc thread to call
vm_page_reclaim_contig_domain_ext()

In order to prevent the busy loop around vm_domain_alloc_fail(), we
must change the uma_zalloc flags to M_NORECLAIM | M_NOWAIT.  However,
once that is done, these allocations become no different than the
allocations done in the critical path in ktls_buffer_alloc(), so its
best to just eliminate them.

Since we're no longer doing allocations but just calling
vm_page_reclaim_contig_domain_ext(), the name has changed to the ktls
reclaim thread.

Reviewed by: jhb, markj
Sponsored by: Netflix
Differential Revision: https://reviews.freebsd.org/D39421
2023-05-09 13:09:34 -04:00
Jonathan T. Looney
8a16fb4730 locks: fix two potential overflows in the lock delay code
With large numbers of CPUs, the calculation of the maximum lock delay
could overflow, leading to an unexpectedly low delay. In fact, the
maximum delay would calculate to 0 on systems with between 128 and
255 cores (inclusive). Also, when calculating the new delay in
lock_delay(), the delay would overflow if the old delay was >= 32,768.

This commit fixes these two overflows. It also updates the maximum
delay from 32,678 to SHRT_MAX.

Reviewed by:	gallatin, jhb, mjg
Fixes:	6b8dd26e7c ("locks: convert delay times to u_short")
MFC after:	2 weeks
Sponsored by:	Netflix
Differential Revision:	https://reviews.freebsd.org/D39372
2023-05-09 16:20:49 +00:00
Konstantin Belousov
361c8f75a6 smp_topo(): correct allocation sizes for trivial topologies
This patch should not modify the correctness, only the clarity.

Requested and reviewed by:	markj
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
Differential revision:	https://reviews.freebsd.org/D39981
2023-05-09 18:30:07 +03:00
Konstantin Belousov
d0f67f9757 smp_topo(): make it idempotent
If more than one call to the function occurs, it currently allocates the
same amount from the group[] array, eventually leading to the memory
corruption.

Noted and reviewed by:	markj
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
Differential revision:	https://reviews.freebsd.org/D39981
2023-05-09 18:30:07 +03:00
Konstantin Belousov
9801e7c275 smp_topo: dynamically allocate group array
Limit its size to mp_maxid + 1 times MAX_CACHE_LEVELS instead MAXCPU.
Allocate the array on a first call into smp_topo(9) functions, where
the mp_maxid is already known.

Make the array private to smp_topo_alloc(), assuming that the callers
that allocate top-level group do it once.

Reviewed by:	markj
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
Differential revision:	https://reviews.freebsd.org/D39981
2023-05-09 18:30:07 +03:00
Konstantin Belousov
ccc6b87b38 quiesce_cpus(): do not overallocate generation array
Also switch to mallocarray().

Reviewed by:	markj
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
Differential revision:	https://reviews.freebsd.org/D39981
2023-05-09 18:30:07 +03:00
Mark Johnston
e8f6e5b2d9 unix: Fix locking in uipc_peeraddr()
After the locking protocol changed in commit 75a67bf3d0 ("AF_UNIX:
make unix socket locking finer grained"), uipc_peeraddr() was not
updated accordingly.

The link lock now only protects global socket lists.  The PCB lock is
used to protect the link between connected PCBs, so use that.  Remove an
old comment which appears to be noting that unp_conn is not set for
connected SOCK_DGRAM sockets (in one direction anyway).

Reviewed by:	glebius
MFC after:	1 week
Differential Revision:	https://reviews.freebsd.org/D39855
2023-05-03 11:56:46 -04:00
Mateusz Guzik
cf0fc64bc2 vfs: reduce audit branching in namei_setup 2023-05-03 06:56:10 +00:00
Konstantin Belousov
a1d71cebc0 fstatat(2): restore AT_EMPTY_PATH handling
Fixes:	cb858340dc
Reported by:	markj
Sponsored by:	The FreeBSD Foundation
2023-05-02 18:11:39 +03:00
Eugene Grosbein
4824d78872 listen(2): improve administrator control over logging
As documented in listen.2 manual page, the kernel emits a LOG_DEBUG
syslog message if a socket listen queue overflows. For some appliances,
it may be desirable to change the priority to some higher value
like LOG_INFO while keeping other debugging suppressed.

OTOH there are cases when such overflows are normal and expected.
Then it may be desirable to suppress overflow logging altogether,
so that dmesg buffer is not flooded over long run.

In addition to existing sysctl kern.ipc.sooverinterval,
introduce new sysctl kern.ipc.sooverprio that defaults to 7 (LOG_DEBUG)
to preserve current behavior. It may be changed to any value
in a range of 0..7 for corresponding priority or to -1 to suppress logging.
Document it in the listen.2 manual page.

MFC after:	1 month
2023-05-01 03:26:44 +07:00
Olivier Certner
2544b8e00c vfs: Rename vfs_emptydir() to vn_dir_check_empty()
No functional change.  While here, adapt comments to style(9).

Reviewed by:    kib
MFC after:      1 week
2023-04-28 22:37:35 +03:00
Olivier Certner
c21d87a88c vfs: vn_dir_next_dirent(): Adapt comments to style(9)
No functional change.

Reviewed by:    kib
MFC after:	1 week
2023-04-28 22:37:35 +03:00
Dmitry Chagin
cb858340dc linux(4): Add a dedicated statat() implementation
Get rid of calling Linux stat translation hook and specific to Linux
handling of non-vnode dirfd from kern_statat(),

Reviewed by:		kib, mjg
Differential revision:	https://reviews.freebsd.org/D35474
2023-04-28 11:55:04 +03:00
Olivier Certner
6450e7bbad vfs: Fix "emptydir" mount option
Fix vfs_emptydir(). It would consider directories containing directories
with name of the form 'X.' (X being any authorized byte) as empty. Also,
it would cause VOP_READDIR() to return an error on directories
containing enough whiteouts. While here, use a more decently sized
buffer as done elsewhere.

Remove ad-hoc iteration on the directory's content and instead use the
newly exported vn_dir_next_dirent() function (this is what fixes the
second problem mentioned above).

PR:	270988
Reviewed by:	kib
MFC after:	1 week
Differential Revision:	https://reviews.freebsd.org/D39775
2023-04-28 04:27:54 +03:00
Olivier Certner
3d8450db4c vfs: vn_dir_next_dirent(): Simplify interface and harden
Simplify the old interface (one less argument, simpler termination test)
and add documentation about it. Add more sanity checks (mostly under
INVARIANTS, but also in the general case to prevent infinite
loops). Drop the explicit test on minimum directory entry size (without
INVARIANTS).

Deal with the impacts in callers (dirent_exists() and vop_stdvptocnp()).
dirent_exists() has been simplified a bit, preserving the exact same
semantics but for the return code whose meaning has been reversed (0 now
means the entry exists, ENOENT that it doesn't and other values are
genuine errors). While here, suppress gratuitous casts of malloc return
values.

vn_dir_next_dirent() has been tested by a 'make -j4 buildkernel' with a
temporary modification to the VFS cache causing vn_vptocnp() to always
call VOP_VPTOCNP() and finally vop_stdvptocnp() (observed with temporary
debug counters).

Export new _GENERIC_MINDIRSIZ and _GENERIC_MAXDIRSIZ on __BSD_VISIBLE,
and GENERIC_MINDIRSIZ and GENERIC_MAXDIRSIZ on _KERNEL.

Reviewed by:	kib
MFC after:	1 week
Differential Revision:	https://reviews.freebsd.org/D39764
2023-04-28 04:27:54 +03:00
Olivier Certner
6bce3f23d0 vfs: Export get_next_dirent() as vn_dir_next_dirent()
Move internal-to-'vfs_default.c' get_next_dirent() to 'vfs_vnops.c' and
export it for use by other parts of the VFS. This is a preparatory
change for using it in vfs_emptydir().

No functional change.

Reviewed by:	kib
MFC after:	1 week
Differential Revision:	https://reviews.freebsd.org/D39755
2023-04-28 04:27:54 +03:00
Mark Johnston
ec45f952a2 sockbuf: Add KMSAN checks to sbappend*()
Otherwise KMSAN only detects uninitialized memory when the contents of
the buffer are copied out to userspace or transmitted to a network
interface.  At that point the KMSAN violation will be far removed from
its origin, so let's try to make debugging such problems a bit easier.

Reviewed by:	glebius
MFC after:	2 weeks
Sponsored by:	The FreeBSD Foundation
Differential Revision:	https://reviews.freebsd.org/D38101
2023-04-27 12:58:56 -04:00
Mark Johnston
2c0209a2ca callout: Remove an unneeded MTX_NEW
Reported by:	hselasky
Fixes:		78cfa762eb ("callout: Move per-CPU callout state into the dpcpu region")
2023-04-26 11:15:56 -04:00
Mark Johnston
e72f7ed43e buf: Dynamically allocate per-CPU buffer queues
To reduce static bloat.  No functional change intended.

PR:		269572
Reviewed by:	mjg, kib, emaste
MFC after:	2 weeks
Sponsored by:	The FreeBSD Foundation
Differential Revision:	https://reviews.freebsd.org/D39808
2023-04-26 10:09:31 -04:00
Mark Johnston
78cfa762eb callout: Move per-CPU callout state into the dpcpu region
This eliminates some static bloat in amd64 kernels and reduces the
penalty of increasing MAXCPU.  The structures now also maintain NUMA
affinity.  No functional change intended.

PR:		269572
Reviewed by:	mjg, kib
MFC after:	2 weeks
Sponsored by:	The FreeBSD Foundation
Differential Revision:	https://reviews.freebsd.org/D39807
2023-04-26 10:09:09 -04:00
Konstantin Belousov
37e6036b26 VOP_CLOSE(): MNTK_EXTENDED_SHARED filesystems do not need excl lock
All in-tree implementations of VOP_CLOSE() for filesystems proclaiming
MNTK_EXTENDED_SHARED, are fine with the shared lock for the closed
vnode.  I checked the following implementations:
	ffs
	ext2
	ufs
	null
	tmpfs
	devfs
	fdescfs
	cd9660
	zfs
It seems that initial addition of FWRITE check was due to necessity of
handling the VV_TEXT vnode vflag.  Since VOP_ADD_WRITECOUNT() only
requires shared lock, we can relax the locking requirement there.

Reviewed by:	markj, Olivier Certner <olce.freebsd@certner.fr>
Tested by:	Olivier Certner
Sponsored by:	The FreeBSD Foundation
Differential revision:	https://reviews.freebsd.org/D39784
2023-04-26 01:39:25 +03:00
Olivier Certner
6a5e614015 vn_open_vnode(): fix locking around VOP_CLOSE() on advisory lock error
In the case of a FIFO or if trying to open a file for writing, an
exclusive lock is necessary.

Reviewed by:	kib
MFC after:	1 week
2023-04-25 01:37:58 +03:00
Konstantin Belousov
a718431c30 lookup(): ensure that openat("/", "..", O_RESOLVE_BENEATH) fails
PR:	269780
Reported by:	Dan Gohman <dev@sunfishcode.online>
Reviewed by:	emaste, markj
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
Differential revision:	https://reviews.freebsd.org/D39773
2023-04-25 00:32:10 +03:00
Dimitry Andric
42162fb2fe kcsan: add __tsan_mem(cpy|move|set) aliases for clang >= 16
Summary:
After https://github.com/llvm/llvm-project/commit/b4257d3bf58c ("[tsan]
Replace mem intrinsics with calls to interceptors") intrinsic calls to
memcpy, memmove or memset will directly call sanitizer interceptors,
e.g. __tsan_memcpy, __tsan_memmove or __tsan_memset.

Building GENERIC-KCSAN with clang >= 16 would thus result in link errors
similar to:

  ld: error: undefined symbol: __tsan_memcpy
  >>> referenced by cam_compat.c:150 (/usr/src/sys/cam/cam_compat.c:150)
  >>>               cam_compat.o:(cam_compat_handle_0x17)
  >>> referenced by cam_compat.c:151 (/usr/src/sys/cam/cam_compat.c:151)
  >>>               cam_compat.o:(cam_compat_handle_0x17)
  >>> referenced by cam_compat.c:152 (/usr/src/sys/cam/cam_compat.c:152)
  >>>               cam_compat.o:(cam_compat_handle_0x17)
  >>> referenced 1692 more times

Similar to subr_msan.c, add aliases from the existing kcsan_* versions
of these functions to __tsan_* names.

Reviewed by:	markj
MFC after:	3 days
Differential Revision: https://reviews.freebsd.org/D39772
2023-04-23 20:59:06 +02:00
Warner Losh
9abba78acc syscalls: regenerate
The 4.2 sigreturn was a bit of a enima so the 4.2 was remove. Regenerate
to cope the very minor changes in comments and one string.

Sponsored by:		Netflix
2023-04-20 23:39:23 -06:00
Warner Losh
602b575a88 syscall.master: Remove stray 4.2
Back in 4.3BSD, the system call table wasn't generated, and there was an
entry:
        "4.2 sigreturn",        /* 139 = old 4.2 sigreturn */
This got converted to
139     OBSOL   0 4.2 sigreturn
in 4.3 RENO. Since it was obsolete, nothing bad happened. In fact,
there was code in makeyscalls.sh to cope:
        {       comment = $4
                for (i = 5; i <= NF; i++)
                        comment = comment " " $i
                if (NF < 5)
                        $5 = $4
        }
so the generated comment in syscalls.c was almost correct:
        "obs_4.2",                      /* 139 = obsolete 4.2 sigreturn */
a bug that we have to this very day, despite makesyscalls.sh being
rewritten in lua.

However, this historical wart is the only place in our current
syscalls.master file where we have an extra field for the 'not
generated' class of system calls. Remove the historical wart so that the
re-write of makesyscalls.lua can be simpler (so, I hope, qemu's bsd-user
can large swathes of code automatically generated too). This should help
make things more understandable (changes to simplify makesyscalls.lue
aren't quite debugged, so have to wait for another day).

There's 3 different obsolete sigreturns (but only 1 that was ever in
FreeBSD 2.x and newer).

Sponsored by:		Netflix
2023-04-20 23:39:23 -06:00
Warner Losh
559b94a122 syscall.master: Fix comments
Have more accruate comments. While #if, #else, etc are copied to the
header files, lines that don't start with # are not.  And #include files
are only output to sysinc (which winds up at the front of init_sysent.c
which seems a bit odd). This is all radically undocumented, and likely
has drifted somewhat from 4.4BSD and what other systems do (they've
drifted too, fwiw).

Sponsored by:		Netflix
2023-04-20 16:18:02 -06:00
Igor Ostapenko
0e0c47ecd6 vfs cache: fix vfs.cache.stats.* name typos
Two vfs.cache.stats names are fixed:
- s/.dotdothis/.dotdothits/
- s/.posszaps/.poszaps/

Signed-off-by: Igor Ostapenko <pm@igoro.pro>
[mjg: massaged the header a little bit]
2023-04-19 18:47:38 +00:00
Konstantin Belousov
93ca6ff295 umtx: allow to configure minimal timeout (in nanoseconds)
PR:	270785
Reviewed by:	markj, mav
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
Differential revision:	https://reviews.freebsd.org/D39584
2023-04-19 02:22:28 +03:00
Konstantin Belousov
7aeea73e30 syncer vnode: add VOP_GETWRITEMOUNT() definition explicitly
Since syncer vnode vector does not provide a fallback to the default
one, its VOP_GETWRITEMOUNT() implementation implicitly returned
EOPNOTSUPP, which means that syncer ignored suspension.

Reported and tested by:	pho
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
2023-04-19 02:21:40 +03:00
Konstantin Belousov
d8a096621b sync_vnode(): add assert to check vn_start_write() correctness
Tested by:	pho
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
2023-04-19 02:21:40 +03:00
Gordon Bergling
105e397eb6 kern_sysctl: Remove double words in source code comments
- s/on on/on/

MFC after:	5 days
2023-04-18 07:14:57 +02:00
Jason A. Harmening
0c01203e47 vfs_lookup(): re-check v_mountedhere on lock upgrade
The VV_CROSSLOCK handling logic may need to upgrade the covered
vnode lock depending upon the requirements of the filesystem into
which vfs_lookup() is walking.  This may involve transiently
dropping the lock, which can allow the target mount to be unmounted.

Tested by:	pho
Reviewed by:	markj
Differential Revision:	https://reviews.freebsd.org/D39272
2023-04-17 20:31:40 -05:00
Stephen J. Kiernan
4819e5aeda Add new privilege PRIV_KDB_SET_BACKEND
Summary:
Check for PRIV_KDB_SET_BACKEND before allowing a thread to change
the KDB backend.

Obtained from:	Juniper Networks, Inc.
Reviewers: sjg, emaste
Subscribers: imp

Differential Revision: https://reviews.freebsd.org/D39538
2023-04-16 14:37:58 -04:00
Val Packett
77f0e198d9 procctl: add state flags to PROC_REAP_GETPIDS reports
For a process supervisor using the reaper API to track process subtrees,
it is very useful to know the state of the processes on the list.

Sponsored by:   https://www.patreon.com/valpackett
Reviewed by:    kib
MFC after:	1 week
Differential Revision: https://reviews.freebsd.org/D39585
2023-04-16 13:48:20 +03:00
Bjoern A. Zeeb
42742fe725 KASAN: add bus_space*read*_8 for aarch64
Add the remaining bus_space*read*_8 functions conditionally for
only arm64 in order to not break KASAN builds with new code using
one of them.

Suggested by:	markj
Reviewed by:	markj
MFC after:	3 days
Differential Revision: https://reviews.freebsd.org/D39581
2023-04-15 16:13:56 +00:00
Gordon Bergling
c159f76713 kern: remove a double word in a KASSERT in subr_trap
- s/with with/with/

MFC after:	5 days
2023-04-13 20:03:37 +02:00
Ed Maste
2ef2c26f3f link_elf: fix SysV hash function overflow
Quoting from https://maskray.me/blog/2023-04-12-elf-hash-function:

The System V Application Binary Interface (generic ABI) specifies the
ELF object file format. When producing an output executable or shared
object needing a dynamic symbol table (.dynsym), a linker generates a
.hash section with type SHT_HASH to hold a symbol hash table. A DT_HASH
tag is produced to hold the address of .hash.

The function is supposed to return a value no larger than 0x0fffffff.
Unfortunately, there is a bug. When unsigned long consists of more than
32 bits, the return value may be larger than UINT32_MAX. For instance,
elf_hash((const unsigned char *)"\xff\x0f\x0f\x0f\x0f\x0f\x12") returns
0x100000002, which is clearly unintended, as the function should behave
the same way regardless of whether long represents a 32-bit integer or
a 64-bit integer.

Reviewed by:	kib, Fangrui Song
Sponsored by:	The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D39517
2023-04-12 15:33:55 -04:00
Konstantin Belousov
c53e990b8d DEBUG_VFS_LOCKS: restore diagnostic for the witness use case
Reviewed by:	jah, markj
Tested by:	pho
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
Differential revision:	https://reviews.freebsd.org/D39477
2023-04-11 15:59:55 +03:00
Konstantin Belousov
75fc6f86c3 Add witness_is_owned(9)
which returns an indicator if the current thread owns the specified
lock.

Reviewed by:	jah, markj
Tested by:	pho
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
Differential revision:	https://reviews.freebsd.org/D39477
2023-04-11 15:59:49 +03:00
Konstantin Belousov
afa8f8971b vn_start_write(): consistently set *mpp to NULL on error or after failed sleep
This ensures that *mpp != NULL iff vn_finished_write() should be
called, regardless of the returned error, except for V_NOWAIT.
The only exception that must be maintained is the case where
vn_start_write(V_NOWAIT) is called with the intent of later dropping
other locks and then doing vn_start_write(V_XSLEEP), which needs the mp
value calculated from the non-waitable call above it.

Also note that V_XSLEEP is not supported by vn_start_secondary_write().

Reviewed by:	markj, mjg (previous version), rmacklem (previous version)
Tested by:	pho
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
Differential revision:	https://reviews.freebsd.org/D39441
2023-04-11 15:59:46 +03:00
Konstantin Belousov
b2f3288747 vn_start_write(): minor style
Reviewed by:	markj
Tested by:	pho
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
Differential revision:	https://reviews.freebsd.org/D39441
2023-04-11 15:59:39 +03:00
Konstantin Belousov
7b6fe2428a DEBUG_VFS_LOCKS: use witness if available
The assert_vop_locked messages are ignored, and file/line information
is not too useful. Fixing this without changing both witness and VFS
asserts KPIs is not possible.

Reviewed by:	markj (previous version)
Tested by:	pho
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
Differential revision:	https://reviews.freebsd.org/D39464
2023-04-10 00:34:12 +03:00
Konstantin Belousov
bb24eaea49 vn_lock_pair(): allow to request shared locking
If either of vnodes is shared locked, lock must not be recursed.

Requested by:	rmacklem
Reviewed by:	markj, rmacklem
Tested by:	pho
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
Differential revision:	https://reviews.freebsd.org/D39444
2023-04-08 01:58:26 +03:00
Mateusz Guzik
02e6e8d218 vfs: extend vn_printf with vop vector 2023-04-07 20:39:06 +00:00
Mateusz Guzik
26b9648750 vfs: more informative panic for missing fplookup ops 2023-04-07 20:39:06 +00:00
Mateusz Guzik
f87a9f51ef vfs: validate that a mount point with FPLOOKUP has vop_fplookup ops 2023-04-06 15:20:41 +00:00
Mateusz Guzik
e237e2ba5f vfs: only allow doomed vnodes to return EOPNOTSUPP for fplookup vops
This helps asserting that they are provided by filesystems indicating
they do it.
2023-04-06 15:20:41 +00:00
Mateusz Guzik
5f6df17775 vfs: validate that vop vectors provide all or none fplookup vops
In order to prevent later susprises.
2023-04-06 15:20:41 +00:00
Mateusz Guzik
0baef43ed0 vfs: add missing vop_fplookup ops to syncer 2023-04-06 15:20:41 +00:00