As documented in listen.2 manual page, the kernel emits a LOG_DEBUG
syslog message if a socket listen queue overflows. For some appliances,
it may be desirable to change the priority to some higher value
like LOG_INFO while keeping other debugging suppressed.
OTOH there are cases when such overflows are normal and expected.
Then it may be desirable to suppress overflow logging altogether,
so that dmesg buffer is not flooded over long run.
In addition to existing sysctl kern.ipc.sooverinterval,
introduce new sysctl kern.ipc.sooverprio that defaults to 7 (LOG_DEBUG)
to preserve current behavior. It may be changed to any value
in a range of 0..7 for corresponding priority or to -1 to suppress logging.
Document it in the listen.2 manual page.
MFC after: 1 month
Get rid of calling Linux stat translation hook and specific to Linux
handling of non-vnode dirfd from kern_statat(),
Reviewed by: kib, mjg
Differential revision: https://reviews.freebsd.org/D35474
Fix vfs_emptydir(). It would consider directories containing directories
with name of the form 'X.' (X being any authorized byte) as empty. Also,
it would cause VOP_READDIR() to return an error on directories
containing enough whiteouts. While here, use a more decently sized
buffer as done elsewhere.
Remove ad-hoc iteration on the directory's content and instead use the
newly exported vn_dir_next_dirent() function (this is what fixes the
second problem mentioned above).
PR: 270988
Reviewed by: kib
MFC after: 1 week
Differential Revision: https://reviews.freebsd.org/D39775
Simplify the old interface (one less argument, simpler termination test)
and add documentation about it. Add more sanity checks (mostly under
INVARIANTS, but also in the general case to prevent infinite
loops). Drop the explicit test on minimum directory entry size (without
INVARIANTS).
Deal with the impacts in callers (dirent_exists() and vop_stdvptocnp()).
dirent_exists() has been simplified a bit, preserving the exact same
semantics but for the return code whose meaning has been reversed (0 now
means the entry exists, ENOENT that it doesn't and other values are
genuine errors). While here, suppress gratuitous casts of malloc return
values.
vn_dir_next_dirent() has been tested by a 'make -j4 buildkernel' with a
temporary modification to the VFS cache causing vn_vptocnp() to always
call VOP_VPTOCNP() and finally vop_stdvptocnp() (observed with temporary
debug counters).
Export new _GENERIC_MINDIRSIZ and _GENERIC_MAXDIRSIZ on __BSD_VISIBLE,
and GENERIC_MINDIRSIZ and GENERIC_MAXDIRSIZ on _KERNEL.
Reviewed by: kib
MFC after: 1 week
Differential Revision: https://reviews.freebsd.org/D39764
Move internal-to-'vfs_default.c' get_next_dirent() to 'vfs_vnops.c' and
export it for use by other parts of the VFS. This is a preparatory
change for using it in vfs_emptydir().
No functional change.
Reviewed by: kib
MFC after: 1 week
Differential Revision: https://reviews.freebsd.org/D39755
Otherwise KMSAN only detects uninitialized memory when the contents of
the buffer are copied out to userspace or transmitted to a network
interface. At that point the KMSAN violation will be far removed from
its origin, so let's try to make debugging such problems a bit easier.
Reviewed by: glebius
MFC after: 2 weeks
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D38101
This eliminates some static bloat in amd64 kernels and reduces the
penalty of increasing MAXCPU. The structures now also maintain NUMA
affinity. No functional change intended.
PR: 269572
Reviewed by: mjg, kib
MFC after: 2 weeks
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D39807
All in-tree implementations of VOP_CLOSE() for filesystems proclaiming
MNTK_EXTENDED_SHARED, are fine with the shared lock for the closed
vnode. I checked the following implementations:
ffs
ext2
ufs
null
tmpfs
devfs
fdescfs
cd9660
zfs
It seems that initial addition of FWRITE check was due to necessity of
handling the VV_TEXT vnode vflag. Since VOP_ADD_WRITECOUNT() only
requires shared lock, we can relax the locking requirement there.
Reviewed by: markj, Olivier Certner <olce.freebsd@certner.fr>
Tested by: Olivier Certner
Sponsored by: The FreeBSD Foundation
Differential revision: https://reviews.freebsd.org/D39784
Summary:
After https://github.com/llvm/llvm-project/commit/b4257d3bf58c ("[tsan]
Replace mem intrinsics with calls to interceptors") intrinsic calls to
memcpy, memmove or memset will directly call sanitizer interceptors,
e.g. __tsan_memcpy, __tsan_memmove or __tsan_memset.
Building GENERIC-KCSAN with clang >= 16 would thus result in link errors
similar to:
ld: error: undefined symbol: __tsan_memcpy
>>> referenced by cam_compat.c:150 (/usr/src/sys/cam/cam_compat.c:150)
>>> cam_compat.o:(cam_compat_handle_0x17)
>>> referenced by cam_compat.c:151 (/usr/src/sys/cam/cam_compat.c:151)
>>> cam_compat.o:(cam_compat_handle_0x17)
>>> referenced by cam_compat.c:152 (/usr/src/sys/cam/cam_compat.c:152)
>>> cam_compat.o:(cam_compat_handle_0x17)
>>> referenced 1692 more times
Similar to subr_msan.c, add aliases from the existing kcsan_* versions
of these functions to __tsan_* names.
Reviewed by: markj
MFC after: 3 days
Differential Revision: https://reviews.freebsd.org/D39772
The 4.2 sigreturn was a bit of a enima so the 4.2 was remove. Regenerate
to cope the very minor changes in comments and one string.
Sponsored by: Netflix
Back in 4.3BSD, the system call table wasn't generated, and there was an
entry:
"4.2 sigreturn", /* 139 = old 4.2 sigreturn */
This got converted to
139 OBSOL 0 4.2 sigreturn
in 4.3 RENO. Since it was obsolete, nothing bad happened. In fact,
there was code in makeyscalls.sh to cope:
{ comment = $4
for (i = 5; i <= NF; i++)
comment = comment " " $i
if (NF < 5)
$5 = $4
}
so the generated comment in syscalls.c was almost correct:
"obs_4.2", /* 139 = obsolete 4.2 sigreturn */
a bug that we have to this very day, despite makesyscalls.sh being
rewritten in lua.
However, this historical wart is the only place in our current
syscalls.master file where we have an extra field for the 'not
generated' class of system calls. Remove the historical wart so that the
re-write of makesyscalls.lua can be simpler (so, I hope, qemu's bsd-user
can large swathes of code automatically generated too). This should help
make things more understandable (changes to simplify makesyscalls.lue
aren't quite debugged, so have to wait for another day).
There's 3 different obsolete sigreturns (but only 1 that was ever in
FreeBSD 2.x and newer).
Sponsored by: Netflix
Have more accruate comments. While #if, #else, etc are copied to the
header files, lines that don't start with # are not. And #include files
are only output to sysinc (which winds up at the front of init_sysent.c
which seems a bit odd). This is all radically undocumented, and likely
has drifted somewhat from 4.4BSD and what other systems do (they've
drifted too, fwiw).
Sponsored by: Netflix
Two vfs.cache.stats names are fixed:
- s/.dotdothis/.dotdothits/
- s/.posszaps/.poszaps/
Signed-off-by: Igor Ostapenko <pm@igoro.pro>
[mjg: massaged the header a little bit]
Since syncer vnode vector does not provide a fallback to the default
one, its VOP_GETWRITEMOUNT() implementation implicitly returned
EOPNOTSUPP, which means that syncer ignored suspension.
Reported and tested by: pho
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
The VV_CROSSLOCK handling logic may need to upgrade the covered
vnode lock depending upon the requirements of the filesystem into
which vfs_lookup() is walking. This may involve transiently
dropping the lock, which can allow the target mount to be unmounted.
Tested by: pho
Reviewed by: markj
Differential Revision: https://reviews.freebsd.org/D39272
Summary:
Check for PRIV_KDB_SET_BACKEND before allowing a thread to change
the KDB backend.
Obtained from: Juniper Networks, Inc.
Reviewers: sjg, emaste
Subscribers: imp
Differential Revision: https://reviews.freebsd.org/D39538
For a process supervisor using the reaper API to track process subtrees,
it is very useful to know the state of the processes on the list.
Sponsored by: https://www.patreon.com/valpackett
Reviewed by: kib
MFC after: 1 week
Differential Revision: https://reviews.freebsd.org/D39585
Add the remaining bus_space*read*_8 functions conditionally for
only arm64 in order to not break KASAN builds with new code using
one of them.
Suggested by: markj
Reviewed by: markj
MFC after: 3 days
Differential Revision: https://reviews.freebsd.org/D39581
Quoting from https://maskray.me/blog/2023-04-12-elf-hash-function:
The System V Application Binary Interface (generic ABI) specifies the
ELF object file format. When producing an output executable or shared
object needing a dynamic symbol table (.dynsym), a linker generates a
.hash section with type SHT_HASH to hold a symbol hash table. A DT_HASH
tag is produced to hold the address of .hash.
The function is supposed to return a value no larger than 0x0fffffff.
Unfortunately, there is a bug. When unsigned long consists of more than
32 bits, the return value may be larger than UINT32_MAX. For instance,
elf_hash((const unsigned char *)"\xff\x0f\x0f\x0f\x0f\x0f\x12") returns
0x100000002, which is clearly unintended, as the function should behave
the same way regardless of whether long represents a 32-bit integer or
a 64-bit integer.
Reviewed by: kib, Fangrui Song
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D39517
which returns an indicator if the current thread owns the specified
lock.
Reviewed by: jah, markj
Tested by: pho
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
Differential revision: https://reviews.freebsd.org/D39477
This ensures that *mpp != NULL iff vn_finished_write() should be
called, regardless of the returned error, except for V_NOWAIT.
The only exception that must be maintained is the case where
vn_start_write(V_NOWAIT) is called with the intent of later dropping
other locks and then doing vn_start_write(V_XSLEEP), which needs the mp
value calculated from the non-waitable call above it.
Also note that V_XSLEEP is not supported by vn_start_secondary_write().
Reviewed by: markj, mjg (previous version), rmacklem (previous version)
Tested by: pho
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
Differential revision: https://reviews.freebsd.org/D39441
The assert_vop_locked messages are ignored, and file/line information
is not too useful. Fixing this without changing both witness and VFS
asserts KPIs is not possible.
Reviewed by: markj (previous version)
Tested by: pho
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
Differential revision: https://reviews.freebsd.org/D39464
If either of vnodes is shared locked, lock must not be recursed.
Requested by: rmacklem
Reviewed by: markj, rmacklem
Tested by: pho
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
Differential revision: https://reviews.freebsd.org/D39444
So stack switching as always been a bit of a issue. We currently use a break before make setup which means that
if something goes wrong you have to try to get back to a stack. This patch among a lot of other things changes that so
that it is a make before break. We also expand some of the function blocks in prep for new features in rack that will allow
more controlled pacing. We also add other abilities such as the pathway for a stack to query a previous stack to acquire from
it critical state information so things in flight don't get dropped or mis-handled when switching stacks. We also add the
concept of a timer granularity. This allows an alternate stack to change from the old ticks granularity to microseconds and
of course this even gives us a pathway to go to nanosecond timekeeping if we need to (something for the data center to consider
for sure).
Once all this lands I will then update rack to begin using all these new features.
Reviewed by: tuexen
Sponsored by: Netflix Inc
Differential Revision: https://reviews.freebsd.org/D39210
Currently, sysctls which enable KDB in some way are flagged with
CTLFLAG_SECURE, meaning that you can't modify them if securelevel > 0.
This is so that KDB cannot be used to lower a running system's
securelevel, see commit 3d7618d8bf. However, the newer mac_ddb(4)
restricts DDB operations which could be abused to lower securelevel
while retaining some ability to gather useful debugging information.
To enable the use of KDB (specifically, DDB) on systems with a raised
securelevel, change the KDB sysctl policy: rather than relying on
CTLFLAG_SECURE, add a check of the current securelevel to kdb_trap().
If the securelevel is raised, only pass control to the backend if MAC
specifically grants access; otherwise simply check to see if mac_ddb
vetoes the request, as before.
Add a new secure sysctl, debug.kdb.enter_securelevel, to override this
behaviour. That is, the sysctl lets one enter a KDB backend even with a
raised securelevel, so long as it is set before the securelevel is
raised.
Reviewed by: mhorne, stevek
MFC after: 1 month
Sponsored by: Juniper Networks
Sponsored by: Klara, Inc.
Differential Revision: https://reviews.freebsd.org/D37122
On 64-bit platforms this sorts out worries about mitigating bugs which
overflow the counter, all while not pessimizng anything -- most notably
it avoids whacking per-thread operation in favor of refcount(9) API.
The struct already had two instances of 4 byte padding with 256 bytes in
size, cr_flags gets moved around to avoid growing it.
32-bit platforms could also get the extended counter, but I did not do
it as one day(tm) the mutex protecting centralized operation should be
replaced with atomics and 64-bit ops on 32-bit platforms remain quite
penalizing.
While worries of counter overflow are addressed, the following is not
(just like it would not be with conversion to refcount(9)):
- counter *underflows*
- buffer overruns from adjacent allocations
- UAF due to stale cred pointer
- .. and other goodies
As such, while lipstick was placed, the pig should not be participating
in any beauty pageants.
Prodded by: emaste
Differential Revision: https://reviews.freebsd.org/D39220
It takes the flags argument. Immediate use is to provide the KQUEUE_CLOEXEC
flag for kqueue(2).
Reviewed by: emaste, jhb
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
Differential revision: https://reviews.freebsd.org/D39271