freebsd-dev

Author	SHA1	Message	Date
Gleb Smirnoff	8840ae2288	tcp: don't store VNET in every tcpcb, take it from the inpcbinfo Reviewed by: rscheff Differential revision: https://reviews.freebsd.org/D37125	2022-11-08 10:24:40 -08:00
Gleb Smirnoff	9eb0e8326d	tcp: provide macros to access inpcb and socket from a tcpcb There should be no functional changes with this commit. Reviewed by: rscheff Differential revision: https://reviews.freebsd.org/D37123	2022-11-08 10:24:40 -08:00
Mark Johnston	2c10be9e06	arm64: Handle translation faults for thread structures The break-before-make requirement poses a problem when promoting or demoting mappings containing thread structures: a CPU may raise a translation fault while accessing curthread, and data_abort() accesses the thread again before pmap_fault() can translate the address and return. Normally this isn't a problem because we have a hack to ensure that slabs used by the thread zone are always accessed via the direct map, where promotions and demotions are rare. However, this hack doesn't work properly with UMA_MD_SMALL_ALLOC disabled, as is the case with KASAN configured (since our KASAN implementation does not shadow the direct map and so tries to force the use of the kernel map wherever possible). Fix the problem by modifying data_abort() to handle translation faults in the kernel map without dereferencing "td", i.e., curthread, and without enabling interrupts. pmap_klookup() has special handling for translation faults which makes it safe to call in this context. Then, revert the aforementioned hack. Reviewed by: kevans, alc, kib, andrew MFC after: 1 month Sponsored by: Juniper Networks, Inc. Sponsored by: Klara, Inc. Differential Revision: https://reviews.freebsd.org/D37231	2022-11-02 13:46:25 -04:00
Andrew Gallatin	8b19898a78	Fix a panic on boot introduced by `555a861d68` First, an sbuf_new() in device_get_path() shadows the sb passed in by dev_wired_cache_add(), leaving its sb in an unfinished state, leading to a failed KASSERT(). Fixing this is as simple as removing the sbuf_new() from device_get_path(). Second, we cannot simply take a pointer to the sbuf memory and store it in the device location cache, because that sbuf is freed immediately after we add data to the cache, leading to a use-after-free and eventually a double-free. Fixing this requires allocating memory for the path. After a discussion with jhb, we decided that one malloc was better than two in dev_wired_cache_add, which is why it changed so much. Reviewed by: jhb Sponsored by: Netflix MFC after: 14 days	2022-11-01 13:44:39 -04:00
Mark Johnston	1f6b6cf177	atomic: Intercept atomic_(load|store)_bool for kernel sanitizers Fixes: `2bed73739a` ("atomic: Add plain atomic_load/store_bool()")	2022-10-29 11:10:58 -04:00
Konstantin Belousov	6b69465efb	vfs_domount(): ensure that v_mountedhere and VIRF_MOUNTPOINT are set under the vnode lock Fixes: `f7833196bd` Reported and tested by: pho Reviewed by: jah, markj (previous version) Sponsored by: The FreeBSD Foundation MFC after: 1 week Differential revision: https://reviews.freebsd.org/D37198	2022-10-29 14:29:55 +03:00
John Baldwin	744bfb2131	Import the WireGuard driver from zx2c4.com. This commit brings back the driver from FreeBSD commit `f187d6dfbf` plus subsequent fixes from upstream. Relative to upstream this commit includes a few other small fixes such as additional INET and INET6 #ifdef's, #include cleanups, and updates for recent API changes in main. Reviewed by: pauamma, gbe, kevans, emaste Obtained from: git@git.zx2c4.com:wireguard-freebsd @ 3cc22b2 Sponsored by: The FreeBSD Foundation Differential Revision: https://reviews.freebsd.org/D36909	2022-10-28 13:36:12 -07:00
Jason A. Harmening	f7833196bd	vfs_lookup(): Minor performance optimizations Refactor the symlink and mountpoint traversal logic to avoid repeatedly checking the vnode type; a symlink cannot be a mountpoint and vice versa. Avoid repeatedly checking cn_flags for NOCROSSMOUNT and simplify the check which determines whether the vnode is a mountpoint. Suggested by: mjg Reviewed by: kib Tested by: pho Differential Revision: https://reviews.freebsd.org/D35054	2022-10-26 19:33:33 -05:00
Jason A. Harmening	4390622c8d	vfs_busy(): fix wording in comment Reviewed by: kib Differential Revision: https://reviews.freebsd.org/D35054	2022-10-26 19:33:30 -05:00
Jason A. Harmening	706f15c5fa	Remove witness directives from crossmp locking VOPs These are of limited use since the crossmp vnode locking ops have not actually used a lock since commit `a2d3554542`. We in fact require that these operations are always issued with LK_SHARED. Additionally, these directives can produce a false positive in certain VV_CROSSLOCK cases which require upgrading of the covered vnode lock from shared to exclusive. While here, replace the runtime check of LK_SHARED with a KASSERT and expand the check to include LK_NOWAIT, which all callers pass. Reviewed by: kib Tested by: pho Differential Revision: https://reviews.freebsd.org/D35054	2022-10-26 19:33:18 -05:00
Jason A. Harmening	080ef8a418	Add VV_CROSSLOCK vnode flag to avoid cross-mount lookup LOR When a lookup operation crosses into a new mountpoint, the mountpoint must first be busied before the root vnode can be locked. When a filesystem is unmounted, the vnode covered by the mountpoint must first be locked, and then the busy count for the mountpoint drained. Ordinarily, these two operations work fine if executed concurrently, but with a stacked filesystem the root vnode may in fact use the same lock as the covered vnode. By design, this will always be the case for unionfs (with either the upper or lower root vnode depending on mount options), and can also be the case for nullfs if the target and mount point are the same (which admittedly is very unlikely in practice). In this case, we have LOR. The lookup path holds the mountpoint busy while waiting on what is effectively the covered vnode lock, while a concurrent unmount holds the covered vnode lock and waits for the mountpoint's busy count to drain. Attempt to resolve this LOR by allowing the stacked filesystem to specify a new flag, VV_CROSSLOCK, on a covered vnode as necessary. Upon observing this flag, the vfs_lookup() will leave the covered vnode lock held while crossing into the mountpoint. Employ this flag for unionfs with the caveat that it can't be used for '-o below' mounts until other unionfs locking issues are resolved. Reported by: pho Tested by: pho Reviewed by: kib Differential Revision: https://reviews.freebsd.org/D35054	2022-10-26 19:33:03 -05:00
Mateusz Guzik	d346e3ac33	vfs: use cache_assert_no_entries instead of open-coding it	2022-10-26 15:54:19 +00:00
Warner Losh	deb1e3b719	physmem: Add physmem_excluded to query if a region is excluded In order to safely reuse excluded memory when it's reserved for special purpose, we need to test whether or not the memory has been reserved early in boot. physmem_excluded will return true when the entire range is excluded, false otherwise. Sponsored by: Netflix	2022-10-25 09:32:49 -06:00
Mateusz Guzik	d653aaec7a	cache: add cache_assert_no_entries	2022-10-24 15:37:43 +00:00
Hans Petter Selasky	fdd9548333	time(3): Fix spelling. Noted by: Gary Jennejohn <garyj@gmx.de> MFC after: 1 week Sponsored by: NVIDIA Networking	2022-10-23 18:42:11 +02:00
Hans Petter Selasky	35a33d14b5	time(3): Optimize tvtohz() function. List of changes: - Use integer multiplication instead of long multiplication, because the result is an integer. - Remove multiple if-statements and predict new if-statements. - Rename local variable name, "ticks" into "retval" to avoid shadowing the system "ticks" global variable. Reviewed by: kib@ and imp@ MFC after: 1 week Sponsored by: NVIDIA Networking Differential Revision: https://reviews.freebsd.org/D36859	2022-10-23 10:04:50 +02:00
Hans Petter Selasky	ee29897fc3	time(3): Declare the minimum and maximum hz values supported. Reviewed by: kib@ and imp@ MFC after: 1 week Sponsored by: NVIDIA Networking Differential Revision: https://reviews.freebsd.org/D37072	2022-10-23 10:04:50 +02:00
Konstantin Belousov	33ce178835	vn_bmap_seekhole: check that passed offset is non-negative Reviewed by: markj Tested by: pho Sponsored by: The FreeBSD Foundation MFC after: 1 week Differential revision: https://reviews.freebsd.org/D37024	2022-10-19 20:24:07 +03:00
Konstantin Belousov	555a861d68	device_get_path(): take sbuf directly This allows to fix a bug where sbuf allocation done in the context of dev_wired_cache_match() must use non-sleepable allocations. Suggested by: jhb Reviewed by: jhb, takawata Discussed with: imp Sponsored by: The FreeBSD Foundation MFC after: 1 week Differential revision: https://reviews.freebsd.org/D36899	2022-10-19 19:39:40 +03:00
Konstantin Belousov	8cf783bde3	device_get_path(): handle case when dev is root PR: 266862 Based on submission by: takawata Reviewed by: jhb, takawata Discussed with: imp Sponsored by: The FreeBSD Foundation MFC after: 1 week Differential revision: https://reviews.freebsd.org/D36899	2022-10-19 19:39:33 +03:00
Konstantin Belousov	d9c5a9ea49	device_get_path(): do not drop the error from BUS_GET_DEVICE_PATH() Later it would always be silently converted to ENOMEM, because any error was reported as a NULL return path. Reviewed by: jhb, takawata Discussed with: imp Sponsored by: The FreeBSD Foundation MFC after: 1 week Differential revision: https://reviews.freebsd.org/D36899	2022-10-19 19:39:26 +03:00
Konstantin Belousov	23d2fcfbb2	subr_bus.c: some style Wrap long lines in devctl2_ioctl DEV_GET_PATH and dev_wired_cache_match() Reviewed by: jhb, takawata Discussed with: imp Sponsored by: The FreeBSD Foundation MFC after: 1 week Differential revision: https://reviews.freebsd.org/D36899	2022-10-19 19:39:17 +03:00
Colin Percival	c32bd97641	kern: Support duplicate variables in early kenv Some virtual machines pass virtio MMIO device parameters via the kernel command line as a series of virtio_mmio.device=<parameters> options. These get translated into FreeBSD kernel environment variables; but unfortunately they all use the same variable name, which resulted in all but the first such parameter being ignored when the dynamic kernel environment is set up from the initial environment buffers. With this commit, duplicate environment settings will instead be stored as ${name}_1, ${name}_2... ${name}_9999. In the unlikely event that the same variable is set over 10000 times before the dynamic kernel environment is set up, we panic. Variable settings after the dynamic environment is initialized continue to override the previously-set value; the change is limited to the very early kernel boot (prior to SI_SUB_KMEM + 1) and changes behaviour from "ignore" to "store with a different name" only. Reviewed by: imp Feedback from: kevans Sponsored by: https://patreon.com/cperciva Differential Revision: https://reviews.freebsd.org/D36187	2022-10-17 23:02:20 -07:00
Ali Abdallah	ba4782022a	ksched: correct return code for invalid priority By convention, EINVAL is returned when validating arguments, not EPERM. This matches the documented behaviour of sched_setscheduler(3), and that of SCHED_OTHER. PR: 227735 MFC after: 1 week Reviewed by: kib, markj Differential Revision: https://reviews.freebsd.org/D37021	2022-10-17 15:12:13 -03:00
Mitchell Horne	39888ed7a3	kern_intr: Check for NULL event in intr_destroy() It likely won't happen, but is consistent with the other functions of this KPI. Reviewed by: imp, jhb MFC after: 3 days Differential Revision: https://reviews.freebsd.org/D33479	2022-10-15 15:51:44 -03:00
Zhenlei Huang	43f8c763cd	if_me: Use dedicated network privilege Separate if_me privileges from if_gif. Reviewed by: kp Differential Revision: https://reviews.freebsd.org/D36691	2022-10-15 17:05:36 +02:00
Mitchell Horne	05b727fee5	Downgrade tty_intr_event from a global It can be static within uart_tty.c. It is an open question whether there remains any real benefit to having uart instances share a swi thread. Reviewed by: imp, markj, jhb MFC after: 1 week Sponsored by: The FreeBSD Foundation Differential Revision: https://reviews.freebsd.org/D36938	2022-10-12 13:46:12 -03:00
Michael Tuexen	bc0d407676	Revert "listen(): improve POSIX compliance" This reverts commit `76e6e4d72f`. Several programs in the tree use -1 instead of INT_MAX to use the maximum value. Thanks to Eugene Grosbein for pointing this out.	2022-10-12 04:33:00 +02:00
Michael Tuexen	76e6e4d72f	listen(): improve POSIX compliance Ensure that a negative backlog argument is handled as if it was 0. Reviewed by: markj@, glebius@ Sponsored by: Netflix, Inc. Differential Revision: https://reviews.freebsd.org/D31821	2022-10-11 22:46:51 +02:00
Bjoern A. Zeeb	99e6980fcf	device_get_property: add a HANDLE case This will resolve a reference and return the appropriate handle, a node on the simplebus or an ACPI_HANDLE for ACPI. For now we do not try to further abstract the return type. MFC after: 2 weeks Reviewed by: mw Differential Revision: https://reviews.freebsd.org/D36793	2022-10-09 21:51:25 +00:00
Mateusz Guzik	143942f992	unr: remove UNR64_LOCKED All platforms support 64-bit atomics now.	2022-10-08 10:41:21 +00:00
Gleb Smirnoff	53af690381	tcp: remove INP_TIMEWAIT flag Mechanically cleanup INP_TIMEWAIT from the kernel sources. After `0d7445193a`, this commit shall not cause any functional changes. Note: this flag was very often checked together with INP_DROPPED. If we modify in_pcblookup*() not to return INP_DROPPED pcbs, we will be able to remove most of these checks and turn them into assertions. Some of them can be turned into assertions right now, but that should be carefully done on a case by case basis. Differential revision: https://reviews.freebsd.org/D36400	2022-10-06 19:24:37 -07:00
Andrew Turner	9d4cff787e	Remove pre-armv6 support from devmap Remove an old code path that was used by Armv4/5 so is unused now. Sponsored by: The FreeBSD Foundation	2022-10-05 09:56:17 +01:00
Hans Petter Selasky	0def80f1a5	time(3): Align fast clock times to avoid firing multiple timers. In non-periodic mode absolute timers fire at exactly the time given. When specifying a fast clock, align the firing time so that fewer timer interrupt events are needed. Reviewed by: rrs @ Differential Revision: https://reviews.freebsd.org/D36858 MFC after: 1 week Sponsored by: NVIDIA Networking	2022-10-03 17:53:17 +02:00
Alfredo Dal'Ava Junior	db79bf75ac	powerpc: cpuset: add local functions for copyin/copyout Add local functions to work around an instruction segment trap (panic) when the indirect functions copyin and copyout are called by an external loadable kernel module (i.e. pfsync, zfs and linuxulator). The crash was triggered by change `47a57144af`, but kernel binary linked with LLD 9 works fine. LLVM bisect indicates that LLD behavior changed after dc06b0bc9ad055d06535462d91bfc2a744b2f589. This is known to affect powerpc targets only and the final fix is still being discussed with the LLVM community. PR: 266730 Reviewed by: luporl, jhibbits (on IRC, previous version) MFC after: 2 days Sponsored by: Instituto de Pesquisas Eldorado (eldorado.org.br) Differential Revision: https://reviews.freebsd.org/D36234	2022-10-03 12:03:09 -03:00
Doug Moore	e5f93d1078	show_sysctl_all: reduce copying, please coverity Modify db_show_sysctl_all so that it does not copy more than once the data of the input oid, and so that what it passes to db_show_oid does not alarm coverity. Reviewed by: markj Differential Revision: https://reviews.freebsd.org/D36847	2022-10-01 12:20:04 -05:00
Gleb Smirnoff	636420bde3	unix/dgram: don't leak file descriptors when socket write failed	2022-09-30 13:43:08 -07:00
Alexander V. Chernikov	7b660faa9e	sockbufs: add sbreserve_locked_limit() with custom maxsockbuf limit. Protocols such as netlink may need a large socket receive buffer, measured in tens of megabytes. This change allows netlink to set larger socket buffers (given the privs are in place), without requiring the user to manually bump maxsockbuf. Reviewed by: glebius Differential Revision: https://reviews.freebsd.org/D36747	2022-09-28 10:20:09 +00:00
Alexander V. Chernikov	f66968564d	protocols: make socket buffers ioctl handler changeable Allow to set custom per-protocol handlers for the socket buffers ioctls by introducing pr_setsbopt callback with the default value set to the currently-used sbsetopt(). Reviewed by: glebius Differential Revision: https://reviews.freebsd.org/D36746	2022-09-28 10:20:09 +00:00
Doug Moore	5294bfa751	sysctl_search_oid: remove all-NULL precondition The implementation of sysctl_search_oid no longer relies on the initial value of nodes to be all NULL, so remove the comment that demands it and let the caller stop enforcing it. Reviewed by: hselasky Differential Revision: https://reviews.freebsd.org/D36768	2022-09-28 04:30:11 -05:00
Doug Moore	9f6f9007b9	name2oid: use find_oidname In name2oid, use sysctl_find_oidname instead of re-implementing it. Reviewed by: hselasky Differential Revision: https://reviews.freebsd.org/D36765	2022-09-27 16:17:55 -05:00
Doug Moore	e96ae5cb05	sysctl_search_oid: remove useless tests sysctl_search_oid makes several tests in a loop that can be removed. The first test in the loop is only ever true on the first loop iteration, and is always true on that iteration, so its work can be done before the loop begins. The upper and lower bounds on the loop variable 'indx' are each tested on each iteration, but 'indx' is changed in one direction or the other only once within the loop, so only one bound needs to be checked. Two ways remain in the loop that nodes[indx] can change (after one of them is put before the loop start), and one of them applies exactly when indx has been incremented, so no separate test for that case requires testing. Restructure and add comments that make it clearer that this is a basic depth-first search. Reviewed by: hselasky Differential Revision: https://reviews.freebsd.org/D36741	2022-09-27 13:30:31 -05:00
Doug Moore	ed5183455e	register_oid: fix duplicate oid after `d3f96f6610` sysctl_register_oid must check the uniqueness of any newly computed oid_number in sysctl_register_oid. Reviewed by: asomers MFC with: `d3f96f6610` Differential Revision: https://reviews.freebsd.org/D36743	2022-09-27 12:24:01 -05:00
Hans Petter Selasky	c075ea46bc	sysctl(3): Implement SYSCTL_FOREACH() to iterate all OIDs in a sysctl list. To avoid using the sysctl list macros directly in external kernel modules. Reviewed by: asomers, manu and asiciliano Differential Revision: https://reviews.freebsd.org/D36748 MFC after: 1 week Sponsored by: NVIDIA Networking	2022-09-27 19:21:21 +02:00
Mitchell Horne	f2963b530e	kasan: disable kasan_mark() after a violation Specifically, when we receive a violation and we're configured to panic, kasan_enabled gets unset before we descend into panic(). At this point, there's no longer any reason to allow marking as kasan_shadow_check() is disabled -- we have some inherent risk of faulting or panicking if the system's in a bad enough state with no benefit. Reviewed by: markj Sponsored by: Juniper Networks, Inc. Sponsored by: Klara, Inc. Differential Revision: https://reviews.freebsd.org/D36742	2022-09-27 11:01:21 -05:00
Alan Somers	6622e299ac	Fix the build with SCHED_STATS after `d3f96f6610` MFC with: `d3f96f6610` Sponsored by: Axcient	2022-09-26 20:20:46 -06:00
Alan Somers	d3f96f6610	Fix O(n^2) behavior in sysctl Sysctl OIDs were internally stored in linked lists, triggering O(n^2) behavior when userland iterates over many of them. The slowdown is noticeable for MIBs that have > 100 children (for example, vm.uma). But it's unignorable for kstat.zfs when a pool has > 1000 datasets. Convert the linked lists into RB trees. This produces a ~25x speedup for listing kstat.zfs with 4100 datasets, and no measurable penalty for small dataset counts. Bump __FreeBSD_version for the KPI change. Sponsored by: Axcient Reviewed by: mjg Differential Revision: https://reviews.freebsd.org/D36500	2022-09-26 18:03:34 -06:00
Alan Somers	52360ca32f	copy_file_range: truncate write if it would exceed RLIMIT_FSIZE PR: 266611 MFC after: 2 weeks Reviewed by: kib Differential Revision: https://reviews.freebsd.org/D36706	2022-09-26 15:22:29 -06:00
Mitchell Horne	818cae0ff7	kasan: provide bus peek/poke definitions Reviewed by: andrew, markj Sponsored by: Juniper Networks, Inc. Sponsored by: Klara, Inc. Differential Revision: https://reviews.freebsd.org/D36700	2022-09-26 14:25:05 -05:00
Konstantin Belousov	1b4b75171e	Add vn_rlimit_fsizex() and vn_rlimit_fsizex_res() The vn_rlimit_fsizex() function: - checks that the write does not exceed RLIMIT_FSIZE limit and fs maximum supported file size - truncates write length if it exceeds the RLIMIT_FSIZE or max file size, but there are some bytes to write - sends SIGXFSZ if RLIMIT_FSIZE would be exceeded otherwise POSIX mandates the truncated write in case when some bytes can be written but whole write request fails the RLIMIT_FSIZE check. The function is supposed to be used from VOP_WRITE()s. Due to a peculiarity in the VFS generic write syscall layer, uio_resid must correctly reflect the written amount (noted by markj). Provide the dual vn_rlimit_fsizex_res() function to correct uio_resid after the clamp done in vn_rlimit_fsizex() on VOP_WRITE() return. PR: 164793 Reviewed by: asomers, jah, markj Tested by: pho Sponsored by: The FreeBSD Foundation MFC after: 2 weeks Differential revision: https://reviews.freebsd.org/D36625	2022-09-24 19:41:33 +03:00
Konstantin Belousov	2ac083f60f	Add vn_rlimit_trunc() Reviewed by: asomers, jah, markj Tested by: pho Sponsored by: The FreeBSD Foundation MFC after: 2 weeks Differential revision: https://reviews.freebsd.org/D36625	2022-09-24 19:41:18 +03:00
Konstantin Belousov	cc65a412ae	filesystems: return error from vn_rlimit_fsize() instead of EFBIG Reviewed by: asomers, jah, markj Tested by: pho Sponsored by: The FreeBSD Foundation MFC after: 2 weeks Differential revision: https://reviews.freebsd.org/D36625	2022-09-24 19:41:14 +03:00
Mark Johnston	c2d27b0ec7	sched_4bsd: Fix a racy thread state modification When a thread switching off-CPU is migrating to a remote CPU, sched_switch() may trigger a rescheduling of the thread currently running on that CPU. When doing so, it must ensure that that thread is locked before modifying thread state. If the thread's lock is not the scheduler lock, then the thread is in the process of switching off-CPU and no extra effort is needed, and the initiator does not hold the thread's lock and thus should not modify any thread state. Reported and tested by: Steve Kargl MFC after: 1 week	2022-09-23 20:09:06 -04:00
John Baldwin	f49fd63a6a	kmem_malloc/free: Use void * instead of vm_offset_t for kernel pointers. Reviewed by: kib, markj Sponsored by: DARPA Differential Revision: https://reviews.freebsd.org/D36549	2022-09-22 15:09:19 -07:00
John Baldwin	7ae99f80b6	pmap_unmapdev/bios: Accept a pointer instead of a vm_offset_t. This matches the return type of pmap_mapdev/bios. Reviewed by: kib, markj Sponsored by: DARPA Differential Revision: https://reviews.freebsd.org/D36548	2022-09-22 15:08:52 -07:00
Zhenlei Huang	440217b0af	debugnet: Fix parameter order in the calls to m_get() Reviewed by: markj MFC after: 1 week Differential Revision: https://reviews.freebsd.org/D36643	2022-09-21 06:55:20 -04:00
Mateusz Guzik	a3ab1102e3	vfs: silence a bogus LOR in freevnode Reported by: imp	2022-09-19 02:14:50 +00:00
Mateusz Guzik	a75d1ddd74	vfs: introduce V_PCATCH to stop abusing PCATCH	2022-09-17 15:41:37 +00:00
Mateusz Guzik	9e4f35ac25	vfs: deperl msleep flag calculation in vn_start_*write	2022-09-17 17:17:20 +02:00
Mateusz Guzik	1c7084fe56	vfs: clean up parse_mount_dev_present	2022-09-17 12:42:46 +00:00
Mateusz Guzik	b77bdfdb67	vfs: fix non-INVARIANTS build after `5b5b7e2ca2` Reported by: gj	2022-09-17 10:45:12 +00:00
Mateusz Guzik	aede6a9670	vfs: fixup parse_mount_dev_present after `5b5b7e2ca2` Reported by: kib	2022-09-17 10:35:00 +00:00
Mateusz Guzik	5b5b7e2ca2	vfs: always retain path buffer after lookup This removes some of the complexity needed to maintain HASBUF and allows for removing injecting SAVENAME by filesystems. Reviewed by: kib (previous version) Differential Revision: https://reviews.freebsd.org/D36542	2022-09-17 09:10:38 +00:00
Mateusz Guzik	3df3d88cc5	vfs: move cn_nameptr assignment out of namei_getpath	2022-09-17 09:08:34 +00:00
Mateusz Guzik	41a0a99f85	vfs: slightly reorganize error handling in chroot This avoids duplicated NDFREE_NOTHING which will be of importance later.	2022-09-17 09:08:34 +00:00
Warner Losh	7cd4984e67	SPDX: Not BSD-4-Clause This is not BSD-4-Clause. It's closer to a modified BSD-2-Clause with 2 added clauses (and the first one has added clauses). Remove SPDX-License-Identifier since this license doesn't match anything in SPDX.	2022-09-16 21:49:16 -06:00
Konstantin Belousov	ff41239f58	Add AT_USRSTACK{BASE, LIM} AT vectors, and ELF_BSDF_VMNOOVERCOMMIT flag Reviewed by: brooks, imp (previous version) Discussed with: markj Sponsored by: The FreeBSD Foundation MFC after: 2 weeks Differential revision: https://reviews.freebsd.org/D36540	2022-09-16 23:23:26 +03:00
Mateusz Guzik	50176b0296	locks: whack a failed experiment in form of restrict_starvation This was never enabled and only pollutes the code. The issue will be addressed later in a different manner. Sponsored by: Rubicon Communications, LLC ("Netgate")	2022-09-16 17:29:37 +00:00
Gordon Bergling	4771011b8f	kern_jail: Fix a typo in a source code comment - s/paramter/parameter/ MFC after: 3 days	2022-09-15 10:25:19 +02:00
Warner Losh	9baa0817ec	SPDX: Not BSD-4-Clause This license has 4 clauses, and shares some text with the BSD-4-Clause license. However, it omits the standard disclaimer and has 2 clauses all its own. Remove this tag, since it was made in error and this doesn't match the SPDX copy of the BSD-4-Clause license. Sponsored by: Netflix	2022-09-14 21:29:31 -06:00
Mateusz Guzik	d04c7f10d4	vfs: make delmntque return with the interlock held saves on relocking dance -- the lock is taken immediately afterwards anyway.	2022-09-14 23:30:19 +00:00
Mateusz Guzik	43fbd0e7a7	lockf: elide vnode interlock in the common case in lf_purgelocks The interlock was already taken and released when dooming, thus by API contract locking state cannot be legally installed. At the same time the state is almost never there to begin with.	2022-09-14 23:04:22 +00:00
Mateusz Guzik	a755fb921e	vfs: retire the V_MNTREF flag Reviewed by: kib, mckusick Differential Revision: https://reviews.freebsd.org/D36521	2022-09-14 18:16:36 +00:00
Mateusz Guzik	61a1d5dde2	vfs: stop using the V_MNTREF flag Reviewed by: kib, mckusick Differential Revision: https://reviews.freebsd.org/D36521	2022-09-14 18:16:23 +00:00
Mateusz Guzik	f7dc4a71da	vfs: plug spurious error checks in namei error is guaranteed 0 at that point	2022-09-13 23:18:30 +00:00
Mateusz Guzik	b4137c9ed1	vfs: make NDVALIDATE private to vfs_lookup.c it is not used elsewhere.	2022-09-12 22:50:48 +00:00
Allan Jude	b20ec58669	vfs.typenumhash: fix sysctl description a string continuation was missing a space, resulting in two words being smushed together. Sponsored by: Klara, Inc.	2022-09-10 22:47:51 +00:00
Mateusz Guzik	1760a6950a	Fixup build after recent getsock changes	2022-09-10 20:40:43 +00:00
Mateusz Guzik	3be2225fc8	Remove fflag argument from getsock_cap Interested callers can obtain it on their own easily enough and there is no reason to branch on it.	2022-09-10 19:47:47 +00:00
Mateusz Guzik	3212ad15ab	Add getsock All but one consumers of getsock_cap only pass 4 arguments. Take advantage of it.	2022-09-10 19:47:47 +00:00
Mateusz Guzik	a2ad70923f	Add branch prediction hints to getsock_cap	2022-09-10 19:41:52 +00:00
Gleb Smirnoff	e80062a2d4	tcp: avoid call to soisconnected() on transition to ESTABLISHED This call existed since pre-FreeBSD times, and it is hard to understand why it was there in the first place. After `6f3caa6d81` it definitely became necessary always and commit message from `f1ee30ccd6` confirms that. Now that `6f3caa6d81` is effectively backed out by `07285bb4c2`, the call appears to be useful only for sockets that landed on the incomplete queue, e.g. sockets that have accept_filter(9) enabled on them. Provide a new TCP flag to mark connections that are known to be on the incomplete queue, and call soisconnected() only for those connections. Reviewed by: rrs, tuexen Differential revision: https://reviews.freebsd.org/D36488	2022-09-08 09:16:04 -07:00
Doug Moore	d0354fa7b6	rb_tree: reduce duplication in balancing code Change RB_INSERT_COLOR and RB_REMOVE_COLOR so that the blocks of code that are identical except for left and right being exchanged are made only one block with a variable to indicate left- or right-handedness. Rename RB macros so that those not intended for external use begin with an underscore. Add comments to the balancing code so that another might understand it. Reviewed by: alc, kib MFC after: 3 weeks Differential Revision: https://reviews.freebsd.org/D36393	2022-09-07 23:46:19 -05:00
Mateusz Guzik	3e0b486886	vfs: flip a condition around in kern_statat error tends to be 0.	2022-09-07 20:06:24 +00:00
Hans Petter Selasky	0e391a3197	ktls: Add missing NULL pointer check for TLS RX hardware offload. The send tag pointer may be NULL when the ktls_reset_receive_tag() function is invoked. Add check for this. Reviewed by: gallatin @ Sponsored by: NVIDIA Networking	2022-09-06 13:49:23 +02:00
Mateusz Guzik	69413598d2	signal: use proc_iterate to save on work Most notably poudriere performs kill -9 -1 in jails for each port being built. This reduces the scan from hundreds of processes to literally 1. Reviewed by: jamie, markj Differential Revision: https://reviews.freebsd.org/D34522	2022-09-05 11:54:47 +00:00
Mateusz Guzik	5ecb5444aa	jail: add process linkage It allows iteration over processes belonging to given jail instead of having to walk the entire allproc list. Note the iteration can miss processes which remains bug-compatible with previous code. Reviewed by: jamie (previous version), markj (previous version) Differential Revision: https://reviews.freebsd.org/D34522	2022-09-05 11:54:47 +00:00
Gordon Bergling	d744e271eb	kern: Remove a double word in a source code comment - s/that that/that/ MFC after: 3 days	2022-09-04 17:32:10 +02:00
Gordon Bergling	49a033d8cf	kern: Correct some typos in source code comments - s/occured/occurred/ - s/the the/the/ MFC after: 3 days	2022-09-04 13:00:01 +02:00
Gordon Bergling	2b7d656f17	kern: Fix a typo in a source code comment - s/overriden/overridden/ MFC after: 3 days	2022-09-03 15:26:55 +02:00
Gleb Smirnoff	24af7808fa	protosw: repair protocol selection logic in socket(2) Pointy hat to: glebius Fixes: `61f7427f02`	2022-08-30 21:19:46 -07:00
Gleb Smirnoff	61f7427f02	protosw: cleanup protocols that existed merely to provide pr_input Since 4.4BSD the protosw was used to implement socket types created by socket(2) syscall and at the same time to demultiplex incoming IPv4 datagrams (later copied to IPv6). This story ended with `78b1fc05b2`. These entries (e.g. IPPROTO_ICMP) in inetsw that were added to catch packets in ip_input(), they would also be returned by pffindproto() if user says socket(AF_INET, SOCK_RAW, IPPROTO_ICMP). Thus, for raw sockets to work correctly, all the entries were pointing at raw_usrreq differentiating only in the value of pr_protocol. With `78b1fc05b2` all these entries are no longer needed, as ip_protox is independent of protosw. Any socket syscall requesting SOCK_RAW type would end up with rip_protosw. And this protosw has its pr_protocol set to 0, allowing to mark socket with any protocol. For IPv6 raw socket the change required two small fixes: o Validate user provided protocol value o Always use protocol number stored in inp in rip6_attach, instead of protosw value, which is now always 0. Differential revision: https://reviews.freebsd.org/D36380	2022-08-30 15:09:21 -07:00
Gleb Smirnoff	8624f4347e	divert: declare PF_DIVERT domain and stop abusing PF_INET The divert(4) is not a protocol of IPv4. It is a socket to intercept packets from ipfw(4) to userland and re-inject them back. It can divert and re-inject IPv4 and IPv6 packets today, but potentially it is not limited to these two protocols. The IPPROTO_DIVERT does not belong to known IP protocols, it doesn't even fit into u_char. I guess, the implementation of divert(4) was done the way it is done basically because it was easier to do it this way, back when protocols for sockets were intertwined with IP protocols and domains were statically compiled in. Moving divert(4) out of inetsw accomplished two important things: 1) IPDIVERT is getting much closer to be not dependent on INET. This will be finalized in following changes. 2) Now divert socket no longer aliases with raw IPv4 socket. Domain/proto selection code won't need a hack for SOCK_RAW and multiple entries in inetsw implementing different flavors of raw socket can merge into one without requirement of raw IPv4 being the last member of dom_protosw. Differential revision: https://reviews.freebsd.org/D36379	2022-08-30 15:09:21 -07:00
Gleb Smirnoff	244e1aeaec	domains: merge domain_init() into domain_add() domain_init() called at SI_SUB_PROTO_DOMAIN/SI_ORDER_SECOND is always called right after domain_add(), that had been called at SI_ORDER_FIRST. Note that protocols aren't initialized yet at this point, since they are usually scheduled to initialize at SI_ORDER_THIRD. After this merge it becomes clear that DOMF_SUPPORTED / DOMF_INITED can be garbage collected as they are set & checked in the same function. For initialization of the domain system itself it is now clear that domaininit() can be garbage collected and static initializer is enough.	2022-08-29 19:15:01 -07:00
Gleb Smirnoff	e18c5816ea	domains: use queue(9) SLIST for linked list of domains	2022-08-29 19:15:01 -07:00
Gleb Smirnoff	d7574c7432	domains: init pr_domain in pr_init()	2022-08-29 19:15:01 -07:00
Gleb Smirnoff	c414347bc5	mbufs: isolate max_linkhdr and max_protohdr handling in the mbuf code o Statically initialize max_linkhdr to default value without relying on domain(9) code doing that. o Statically initialize max_protohdr to a sane value, without relying on TCP being always compiled in. o Retire max_datalen. Set, but not used. o Don't make the domain(9) system responsible in validating these values and updating max_hdr. Instead provide KPI max_linkhdr_grow() and max_protohdr_grow(). o Call max_linkhdr_grow() from IEEE802.11 and max_protohdr_grow() from TCP. Those are the only protocols today that may want to grow. Reviewed by: tuexen Differential revision: https://reviews.freebsd.org/D36376	2022-08-29 19:14:25 -07:00
Mark Johnston	32faf071bd	devstat: Remove DTrace io probes lacking a BIO reference The io:::start and end probes trace individual I/O requests. Also remove the unimplemented wait-start and wait-done probes. PR: 266098 MFC after: 1 week	2022-08-29 13:22:36 -04:00
Doug Moore	5d91386826	rb_tree: avoid extra reads in rebalancing In RB_INSERT_COLOR and RB_REMOVE_COLOR, avoid reading a parent pointer from memory, and then reading the left-color bit from memory, and then reading the right-color bit from memory, since they're all in the same field. The compiler can't infer that only the first read is really necessary, so write the code in a way so that it doesn't have to. Drop RB_RED_LEFT and RB_RED_RIGHT macros that reach into memory to get those bits. Drop RB_COLOR, the only thing left using RB_RED_LEFT and RB_RED_RIGHT after the other changes, and go straight to DIAGNOSTIC code in subr_stats to implement RB_COLOR for its single, dubious use there. Reviewed by: alc MFC after: 3 weeks Differential Revision: https://reviews.freebsd.org/D36353	2022-08-29 11:11:31 -05:00
John Baldwin	e3885a7893	soo_stat: Ensure error is always initialized. In kernels without MAC, error is not set for sockets whose protocol layer does not implement the pr_sense hook. Reported by: Jenkins (powerpc kernel builds) Fixes: `7c04ca1fad` sockets: for stat(2) on a socket don't report hiwat as block size	2022-08-26 11:17:55 -07:00
Gleb Smirnoff	837b7203f0	domains: use struct domain as argument	2022-08-26 10:35:35 -07:00
firk	768f6373eb	Fix compat10 semaphore interface race Wrong has-waiters and missing unconditional _count==0 check may cause infinite waiting with already non-zero count. 1) properly clear _has_waiters flag when waiting failed to start 2) always check _count before start waiting PR: 265997 Reviewed by: kib MFC after: 1 week Differential revision: https://reviews.freebsd.org/D36272	2022-08-26 20:34:29 +03:00
Gleb Smirnoff	7c04ca1fad	sockets: for stat(2) on a socket don't report hiwat as block size The code appeared in `d8392c6c39` with no good explanation. It is very unlikely any software in the world needs that. Differential revision: https://reviews.freebsd.org/D36283	2022-08-26 08:16:15 -07:00
Mateusz Guzik	49afea1059	proc: read the pid prior to unlocking in report_alive_proc1 In principle another thread could have reaped the process by that time.	2022-08-25 17:26:49 +00:00
Konstantin Belousov	fce3b1c327	fork_exit(): style comment Reviewed by: jhb Sponsored by: The FreeBSD Foundation MFC after: 1 week Differential revision: https://reviews.freebsd.org/D36302	2022-08-24 22:12:53 +03:00
Brooks Davis	840327e5dd	mbuf: Don't support PAGE_SIZE < 4K The Vax supported such things, but FreeBSD does not. This further implies that MJUMPAGESIZE > MCLBYTES so assert this and remove code handling them being equal. Reviewed by: kp, imp, jhb Sponsored by: DARPA, AFRL Differential Revision: https://reviews.freebsd.org/D36320	2022-08-24 18:34:07 +01:00
Mateusz Guzik	9488262679	rms: add rms_assert_rlock_ok So that callers which opportunistically elide the lock can still assert that they can take it. Reviewed by: Differential Revision:	2022-08-23 19:15:48 +00:00
Robert Wing	3454a7caa0	kqueue: retire knlist_init_rw_reader() Last usage was removed in `afa85850e7`. Reviewed by: pauamma, melifaro, kib Differential Revision: https://reviews.freebsd.org/D36205	2022-08-20 21:17:39 -08:00
Konstantin Belousov	f829268bcc	Remove TDF_DOING_SA We cannot see a thread with the flag set in unsuspend, after we stopped doing SINGLE_ALLPROC from user processes. Reviewed by: markj Tested by: pho Sponsored by: The FreeBSD Foundation MFC after: 2 weeks Differential revision: https://reviews.freebsd.org/D36207	2022-08-20 20:34:30 +03:00
Konstantin Belousov	5e5675cb4b	Remove struct proc p_singlethr member It does not serve any purpose after we stopped doing thread_single(SINGLE_ALLPROC) from stoppable user processes. Reviewed by: markj Tested by: pho Sponsored by: The FreeBSD Foundation MFC after: 2 weeks Differential revision: https://reviews.freebsd.org/D36207	2022-08-20 20:34:30 +03:00
Konstantin Belousov	2842ec6d99	REAP_KILL_PROC: kill processes in the threaded taskqueue context There is a problem still left after the fixes to REAP_KILL_PROC. The handling of the stopping signals by sig_suspend_threads() can occur outside the stopping process context by tdsendsignal(), and it uses mostly the same mechanism of aborting sleeps as suspension. In other words, it badly interacts with thread_single(SINGLE_ALLPROC). But unlike single threading from the process context, we cannot wait by sleep for other single threading requests to pass, because we own spinlock(s). Fix this by moving both the thread_single(p2, SINGLE_ALLPROC), and the signalling, to the threaded taskqueue which cannot be single-threaded itself. Reported and tested by: pho Reviewed by: markj Sponsored by: The FreeBSD Foundation MFC after: 2 weeks Differential revision: https://reviews.freebsd.org/D36207	2022-08-20 20:34:11 +03:00
Konstantin Belousov	5e9bba94bd	fork_norfproc(): unlock p1 before retrying Reported and reviewed by: markj Tested by: pho Syzkaller: 647212368c3f32c6f13f Sponsored by: The FreeBSD Foundation MFC after: 2 weeks Differential revision: https://reviews.freebsd.org/D36207	2022-08-20 20:33:18 +03:00
Konstantin Belousov	0a4f2ac3b7	kern_sig.c: style Reviewed by: markj Sponsored by: The FreeBSD Foundation MFC after: 2 weeks Differential revision: https://reviews.freebsd.org/D36207	2022-08-20 20:33:18 +03:00
Konstantin Belousov	cdb58f9d04	ksiginfo_tryfree(): change return type to bool The function result is already used as bool. Reviewed by: markj Tested by: pho Sponsored by: The FreeBSD Foundation MFC after: 2 weeks Differential revision: https://reviews.freebsd.org/D36207	2022-08-20 20:33:18 +03:00
Konstantin Belousov	cc29f221aa	ksiginfo_alloc(): change to directly take M_WAITOK/NOWAIT flags Also style, and remove unneeded cast. Reviewed by: markj Tested by: pho Sponsored by: The FreeBSD Foundation MFC after: 2 weeks Differential revision: https://reviews.freebsd.org/D36207	2022-08-20 20:33:17 +03:00
Konstantin Belousov	5c78797e42	reap_kill_proc_locked(): remove outdated part of the comment Reviewed by: markj Sponsored by: The FreeBSD Foundation MFC after: 3 days Differential revision: https://reviews.freebsd.org/D36207	2022-08-20 20:33:17 +03:00
Konstantin Belousov	30b16a6bcf	exit1(): update comment about thread_single() We do not check single-threading conditions in trap, or when sleeping uninterruptible. Reviewed by: markj Sponsored by: The FreeBSD Foundation MFC after: 3 days Differential revision: https://reviews.freebsd.org/D36207	2022-08-20 20:33:17 +03:00
Konstantin Belousov	f835be5822	sleepq_set_timeout_sbt(): correct comment to not talk about ticks It is sbt now. Also, explain what flags are. Reviewed by: markj Sponsored by: The FreeBSD Foundation MFC after: 3 days Differential revision: https://reviews.freebsd.org/D36207	2022-08-20 20:33:17 +03:00
Konstantin Belousov	da39a100db	sleepq_check_ast_sc_locked(): update comment The relock order is important not only for a signal delivery, but also for the suspension requests. Reviewed by: markj Sponsored by: The FreeBSD Foundation MFC after: 3 days Differential revision: https://reviews.freebsd.org/D36207	2022-08-20 20:33:17 +03:00
Konstantin Belousov	bd76586bb7	fork_norfproc(): style Reviewed by: markj Sponsored by: The FreeBSD Foundation MFC after: 3 days Differential revision: https://reviews.freebsd.org/D36207	2022-08-20 20:33:17 +03:00
Mateusz Guzik	497240def8	Retire clone_drain_lock It is only ever xlocked in drain_dev_clone_events and the only consumer of that routine does not need it -- eventhandler code already makes sure the relevant callback is no longer running. Reviewed by: kib Differential Revision: https://reviews.freebsd.org/D36268	2022-08-20 09:44:05 +00:00
Gleb Smirnoff	820bafd0bc	unix/dgram: don't panic if socket buffer has negative space That's a legitimate scenario, although unlikely. Reported by: https://syzkaller.appspot.com/bug?extid=6e8be1ec8d77578a3df4	2022-08-19 12:15:38 -07:00
Mateusz Guzik	545db925c3	pipe: fix EOF case for non-blocking fds In particular unbreaks 'go build'.	2022-08-18 21:23:53 +00:00
Gleb Smirnoff	e7d02be19d	protosw: refactor protosw and domain static declaration and load o Assert that every protosw has pr_attach. Now this structure is only for socket protocols declarations and nothing else. o Merge struct pr_usrreqs into struct protosw. This was suggested in 1996 by wollman@ (see `7b187005d1`), and later reiterated in 2006 by rwatson@ (see `6fbb9cf860`). o Make struct domain hold a variable sized array of protosw pointers. For most protocols these pointers are initialized statically. Those domains that may have loadable protocols have spacers. IPv4 and IPv6 have 8 spacers each (andre@ `dff3237ee5`). o For inetsw and inet6sw leave a comment noting that many protosw entries very likely are dead code. o Refactor pf_proto_[un]register() into protosw_[un]register(). o Isolate pr_*_notsupp() methods into uipc_domain.c Reviewed by: melifaro Differential revision: https://reviews.freebsd.org/D36232	2022-08-17 11:50:32 -07:00
Gleb Smirnoff	f6dc5aa342	unix: use private enum as argument for unp_connect2() instead of using historic PRU_ flags that are now not used by anything other than TCP debugging.	2022-08-17 11:50:31 -07:00
Gleb Smirnoff	81a34d374e	protosw: retire pr_drain and use EVENTHANDLER(9) directly The method was called for two different conditions: 1) the VM layer is low on pages or 2) one of UMA zones of mbuf allocator exhausted. This change moves 2) into a new event handler, but all affected network subsystems are modified to subscribe to both, so this change shall not bring functional changes under different low memory situations. There were three subsystems still using pr_drain: TCP, SCTP and frag6. The latter had its protosw entry for the only reason to register its pr_drain method. Reviewed by: tuexen, melifaro Differential revision: https://reviews.freebsd.org/D36164	2022-08-17 11:50:31 -07:00
Gleb Smirnoff	1922eb3e9c	protosw: retire pr_slowtimo and pr_fasttimo They were useful many years ago, when the callwheel was not efficient, and the kernel tried to have as little callout entries scheduled as possible. Reviewed by: tuexen, melifaro Differential revision: https://reviews.freebsd.org/D36163	2022-08-17 11:50:31 -07:00
Gleb Smirnoff	78b1fc05b2	protosw: separate pr_input and pr_ctlinput out of protosw The protosw KPI historically has implemented two quite orthogonal things: protocols that implement a certain kind of socket, and protocols that are IPv4/IPv6 protocol. These two things do not make one-to-one correspondence. The pr_input and pr_ctlinput methods were utilized only in IP protocols. This strange duality required IP protocols that don't have a socket to declare protosw, e.g. carp(4). On the other hand developers of socket protocols thought that they need to define pr_input/pr_ctlinput always, which led to strange dead code, e.g. div_input() or sdp_ctlinput(). With this change pr_input and pr_ctlinput as part of protosw disappear and IPv4/IPv6 get their private single level protocol switch table ip_protox[] and ip6_protox[] respectively, pointing at array of ipproto_input_t functions. The pr_ctlinput that was used for control input coming from the network (ICMP, ICMPv6) is now represented by ip_ctlprotox[] and ip6_ctlprotox[]. ipproto_register() becomes the only official way to register in the table. Those protocols that were always static and unlikely anybody is interested in making them loadable, are now registered by ip_init(), ip6_init(). An IP protocol that considers itself unloadable shall register itself within its own private SYSINIT(). Reviewed by: tuexen, melifaro Differential revision: https://reviews.freebsd.org/D36157	2022-08-17 11:50:31 -07:00
Gleb Smirnoff	489482e276	ipsec: isolate knowledge about protocols that are last header Retire PR_LASTHDR protosw flag. Reviewed by: ae Differential revision: https://reviews.freebsd.org/D36155	2022-08-17 08:24:28 -07:00
Mateusz Guzik	9ac6eda6c6	pipe: try to skip locking the pipe if a non-blocking fd is used Reviewed by: markj (previous version) Differential Revision: https://reviews.freebsd.org/D36082	2022-08-17 14:23:34 +00:00
Ed Maste	fbafa98a94	Disallow invalid PT_GNU_STACK Stack must be at least readable and writable. PR: 242570 Reviewed by: kib, markj MFC after: 1 month Sponsored by: The FreeBSD Foundation Differential Revision: https://reviews.freebsd.org/D35867	2022-08-16 15:52:21 -04:00
Mateusz Guzik	84a0be4a23	vfs: plug a dead store in kern_linkat_vp Reported by: clang --analyze	2022-08-16 10:46:29 +00:00
Mateusz Guzik	eb9a1f9c68	vfs: plug dead store in vn_io_fault1 Reported by: clang --analyze	2022-08-16 10:46:20 +00:00
Dimitry Andric	9762d48b7f	Adjust function definition in kern_poll.c to avoid clang 15 warning With clang 15, the following -Werror warning is produced: sys/kern/kern_poll.c:374:16: error: a function declaration without a prototype is deprecated in all versions of C [-Werror,-Wstrict-prototypes] netisr_pollmore() ^ void This is because netisr_pollmore() is declared with a (void) argument list, but defined with an empty argument list. Make the definition match the declaration. MFC after: 3 days	2022-08-14 21:27:34 +02:00
Alexander V. Chernikov	9b967bd65d	domains: allow domains to be unloaded Add domain_remove() SYSUNINIT callback that removes the domain from the domain list if it has DOMF_UNLOADABLE flag set. This change is required to support netlink ( D36002 ). Reviewed by: glebius Differential Revision: https://reviews.freebsd.org/D36173	2022-08-14 09:22:33 +00:00
Gleb Smirnoff	f277746e13	protosw: change prototype for pr_control For some reason protosw.h is used during world compilation and userland is not aware of caddr_t, a relic from the first version of C. Broken buildworld is good reason to get rid of yet another caddr_t in kernel. Fixes: `886fc1e804`	2022-08-12 12:08:18 -07:00
Gleb Smirnoff	948f31d7b0	netinet: do not broadcast PRC_REDIRECT_HOST on ICMP redirect This is an expensive and useless call. It has been useless since Alexander melifaro@ moved the forwarding table to nexthops with passive invalidation. What happens now is that cached route in an inpcb would get invalidated on next ip_output(). These were the last users of pfctlinput(), so garbage collect it. Reviewed by: melifaro Differential revision: https://reviews.freebsd.org/D36156	2022-08-12 08:31:29 -07:00
Gleb Smirnoff	8c77967ecc	protosw: retire pr_output method The only place to execute this method was raw_usend(). Only those protocols that used raw socket were able to actually enter that method. All pr_output assignments being deleted by this commit were dead code for many years. Reviewed by: melifaro Differential revision: https://reviews.freebsd.org/D36126	2022-08-11 09:19:37 -07:00
Andrew Turner	8e9ca1379e	Adjust function definition in subr_devmap.c to avoid clang 15 warning With clang 15, the following -Werror warning is produced: sys/kern/subr_devmap.c:87:19: error: a function declaration without a prototype is deprecated in all versions of C [-Werror,-Wstrict-prototypes] devmap_print_table() ^ void This is because devmap_print_table() and devmap_lastaddr() are declared with a (void) argument list, but defined with an empty argument list. Make the definition match the declaration. Sponsored by: The FreeBSD Foundation	2022-08-11 14:30:32 +01:00
Alexander V. Chernikov	102e6817f0	devd: move all devd notification logic to a separate file. Currently, subr_bus.c shares logic for (a) maintaining all HW devices (e.g. discovery/attach/detach logic) and (b) generic devctl notification layer for devices/PMU/GEOM/interfaces/etc). These two subsystems share really tiny interaction interface, composed of 3 notification functions. With that in mind, move devctl layer to a separate file, establishing a clear notification interface between the sub.c bus layer and the provider (devctl). The primary driver of this change is netlink implementation (D36002). The idea is to propagate device-level events to netlink as well, so all netlink customers can subscribe to these changes. The long-term goal is to deprecate devctl and to use netlink as the kernel<> userland transport provided netlink gets enough traction. Reviewed by: imp, markj Differential Revision: https://reviews.freebsd.org/D36091 MFC after: 1 month	2022-08-10 18:56:01 +00:00
Gleb Smirnoff	07285bb4c2	tcp: utilize new solisten_clone() and solisten_enqueue() This streamlines cloning of a socket from a listener. Now we do not drop the inpcb lock during creation of a new socket, do not do useless state transitions, and put a fully initialized socket+inpcb+tcpcb into the listen queue. Before this change, first we would allocate the socket and inpcb+tcpcb via tcp_usr_attach() as TCPS_CLOSED, link them into global list of pcbs, unlock pcb and put this onto incomplete queue (see `6f3caa6d81`). Then, after sonewconn() we would lock it again, transition into TCPS_SYN_RECEIVED, insert into inpcb hash, finalize initialization of tcpcb. And then, in call into tcp_do_segment() and upon transition to TCPS_ESTABLISHED call soisconnected(). This call would lock the listening socket once again with a LOR protection sequence and then we would relocate the socket onto the complete queue and only now it is ready for accept(2). Reviewed by: rrs, tuexen Differential revision: https://reviews.freebsd.org/D36064	2022-08-10 11:09:34 -07:00
Gleb Smirnoff	8f5a0a2e4f	sockets: provide solisten_clone(), solisten_enqueue() as alternative KPI to sonewconn(). The latter has three stages: - check the listening socket queue limits - allocate a new socket - call into protocol attach method - link the new socket into the listen queue of the listening socket The attach method, originally designed for a creation of socket by the socket(2) syscall has slightly different semantics than attach of a socket cloned by listener. Make it possible for protocols to call into the first stage, then perform a different attach, and then call into the final stage. The first stage, that checks limits and clones a socket is called solisten_clone(), and the function that enqueues the socket is solisten_enqueue(). Reviewed by: tuexen Differential revision: https://reviews.freebsd.org/D36063	2022-08-10 11:09:34 -07:00
Konstantin Belousov	00d17cf342	elf_note_prpsinfo: handle more failures from proc_getargv() Resulting sbuf_len() from proc_getargv() might return 0 if user mangled ps_strings enough. Also, sbuf_len() API contract is to return -1 if the buffer overflowed. The latter should not occur because get_ps_strings() checks for catenated length, but check for this subtle detail explicitly as well to be more resilient. The end result is that p_comm is used in these situations. Approved by: so Security: FreeBSD-SA-22:09.elf Reported by: Josef 'Jeff' Sipek <jeffpc@josefsipek.net> Reviewed by: delphij, markj admbugs: 988 Sponsored by: The FreeBSD Foundation Differential Revision: https://reviews.freebsd.org/D35391	2022-08-09 15:44:45 -04:00
Konstantin Belousov	1b0a4974c5	thread_create(): call cpu_copy_thread() after td_pflags is zeroed By calling the function too early we might still have the td_pflags value cached from the previous struct thread use. cpu_copy_thread() depends on correct value for TDP_KTHREAD at least on x86. Reported, bisected, and tested by: pho Reviewed by: markj Sponsored by: The FreeBSD Foundation MFC after: 1 week Differential revision: https://reviews.freebsd.org/D36069	2022-08-08 19:44:17 +03:00
Gordon Bergling	fa1ac9693a	vnode(9): Fix a typo in a source code comment - s/paramater/parameter/ MFC after: 3 days	2022-08-07 16:08:43 +02:00
Ed Maste	f0687f3e0e	Clarify code comments on ASLR default settings Sponsored by: The FreeBSD Foundation	2022-08-05 10:01:16 -04:00
Mark Johnston	d07675a935	file: Move code to share fdtol structs into kern_descrip.c This ensures the filedesc-to-leader code is consistently encapsulated in kern_descrip.c. No functional change intended. Reviewed by: kib MFC after: 1 week Sponsored by: The FreeBSD Foundation Differential Revision: https://reviews.freebsd.org/D35988	2022-08-04 09:39:25 -04:00
Konstantin Belousov	c53fec7603	sig_suspend_threads(): remove 'sending' arg The TDA_AST flag is set on td2 unconditionally (as it was TDF_ASTPENDING before AST rework), so it is not used practically for some time. Reviewed by: markj Sponsored by: The FreeBSD Foundation MFC after: 1 week Differential revision: https://reviews.freebsd.org/D36033	2022-08-03 16:56:23 +03:00
Konstantin Belousov	f2fd7d8bfc	ast_sig(): add missed TDAI() Mask checked was completely wrong Reviewed by: markj Sponsored by: The FreeBSD Foundation MFC after: 1 week Differential revision: https://reviews.freebsd.org/D36033	2022-08-03 16:56:23 +03:00
Mark Johnston	852695416c	domain: Use designated constants for timeout periods No functional change intended. MFC after: 1 week Sponsored by: The FreeBSD Foundation	2022-08-02 20:31:29 -04:00
Konstantin Belousov	4a662c9064	ktrace: change AST handler to require AST flag set When it was inline it made sense to depend on the existing nested check in KTRUSERRET() rather than adding a new td_flags flag. However, since we now have a TDA_KTRACE flag anyway, we might as well check it and avoid the call. Suggested by: jhb Reviewed by: markj Tested by: pho Sponsored by: The FreeBSD Foundation MFC after: 1 week Differential revision: https://reviews.freebsd.org/D35888	2022-08-02 21:11:10 +03:00
Konstantin Belousov	c46771a7b7	kern/subr_trap.c: cleanup no longer needed headers Also bump the Foundation's copyright year Reviewed by: markj Tested by: pho Sponsored by: The FreeBSD Foundation MFC after: 1 week Differential revision: https://reviews.freebsd.org/D35888	2022-08-02 21:11:10 +03:00
Konstantin Belousov	cc1ec77231	Adjust g_waitidle() visibility and definition Explicitly pass the struct thread argument. Move the function prototype from sys/systm.h to geom/geom.h, we do not need almost each kernel source to see the prototype, it is now used only by kern/vfs_mountroot.c outside geom/geom_event.c, where the function is defined. Reviewed by: markj Tested by: pho Sponsored by: The FreeBSD Foundation MFC after: 1 week Differential revision: https://reviews.freebsd.org/D35888	2022-08-02 21:11:10 +03:00
Konstantin Belousov	4fced8642f	sigfastblock_setpend() and fastblock_mask can be static now Reviewed by: markj Tested by: pho Sponsored by: The FreeBSD Foundation MFC after: 1 week Differential revision: https://reviews.freebsd.org/D35888	2022-08-02 21:11:10 +03:00
Konstantin Belousov	c6d31b8306	AST: rework Make most AST handlers dynamically registered. This allows to have subsystem-specific handler source located in the subsystem files, instead of making subr_trap.c aware of it. For instance, signal delivery code on return to userspace is now moved to kern_sig.c. Also, it allows to have some handlers designated as the cleanup (kclear) type, which are called both at AST and on thread/process exit. For instance, ast(), exit1(), and NFS server no longer need to be aware about UFS softdep processing. The dynamic registration also allows third-party modules to register AST handlers if needed. There is one caveat with loadable modules: the code does not make any effort to ensure that the module is not unloaded before all threads processed through AST handler in it. In fact, this is already present behavior for hwpmc.ko and ufs.ko. I do not think it is worth the efforts and the runtime overhead to try to fix it. Reviewed by: markj Tested by: emaste (arm64), pho Discussed with: jhb Sponsored by: The FreeBSD Foundation MFC after: 1 week Differential revision: https://reviews.freebsd.org/D35888	2022-08-02 21:11:09 +03:00
Alexander V. Chernikov	be1f485d7d	sockets: add MSG_TRUNC flag handling for recvfrom()/recvmsg(). Implement Linux-variant of MSG_TRUNC input flag used in recv(), recvfrom() and recvmsg(). Posix defines MSG_TRUNC as an output flag, indicating packet/datagram truncation. Linux extended it a while (~15+ years) ago to act as input flag, resulting in returning the full packet size regardless of the input buffer size. It's a (relatively) popular pattern to do recvmsg(MSG_PEEK | MSG_TRUNC) to get the packet size, allocate the buffer and issue another call to fetch the packet. In particular, it's popular in userland netlink code, which is the primary driving factor of this change. This commit implements the MSG_TRUNC support for SOCK_DGRAM sockets (udp, unix and all soreceive_generic() users). PR: kern/176322 Reviewed by: pauamma(doc) Differential Revision: https://reviews.freebsd.org/D35909 MFC after: 1 month	2022-07-30 18:21:51 +00:00
John Baldwin	ea8f128c7c	pmap_mapdev: Consistently use vm_paddr_t for the first argument. The devmap variants used vm_offset_t for some reason, and a few places explicitly cast bus addresses to vm_offset_t. (Probably those casts along with similar casts for vm_size_t should just be removed and instead permit the compiler to DTRT.) Reviewed by: markj Sponsored by: DARPA Differential Revision: https://reviews.freebsd.org/D35961	2022-07-28 15:55:10 -07:00
Dimitry Andric	a387bd1b6a	Adjust function definition in vfs_bio.c to avoid clang 15 warnings With clang 15, the following -Werror warning is produced: sys/kern/vfs_bio.c:3430:11: error: a function declaration without a prototype is deprecated in all versions of C [-Werror,-Wstrict-prototypes] buf_daemon() ^ void This is because buf_daemon() is declared with a (void) argument list, but defined with an empty argument list. Make the definition match the declaration. MFC after: 3 days	2022-07-26 19:59:57 +02:00
Dimitry Andric	78cfed2de7	Adjust function definitions in sysv_msg.c to avoid clang 15 warnings With clang 15, the following -Werror warnings are produced: sys/kern/sysv_msg.c:213:8: error: a function declaration without a prototype is deprecated in all versions of C [-Werror,-Wstrict-prototypes] msginit() ^ void sys/kern/sysv_msg.c:316:10: error: a function declaration without a prototype is deprecated in all versions of C [-Werror,-Wstrict-prototypes] msgunload() ^ void This is because msginit() and msgunload() are declared with (void) argument lists, but defined with empty argument lists. Make the definitions match the declarations. MFC after: 3 days	2022-07-26 19:59:57 +02:00
Dimitry Andric	b54e962aca	Adjust function definition in subr_bus.c to avoid clang 15 warnings With clang 15, the following -Werror warning is produced: sys/kern/subr_bus.c:871:16: error: a function declaration without a prototype is deprecated in all versions of C [-Werror,-Wstrict-prototypes] bus_topo_assert() ^ void This is because bus_topo_assert() is declared with a (void) argument list, but defined with an empty argument list. Make the definition match the declaration. MFC after: 3 days	2022-07-26 19:59:57 +02:00
Dimitry Andric	3c8f0790dd	Adjust function definition in subr_autoconf.c to avoid clang 15 warnings With clang 15, the following -Werror warning is produced: sys/kern/subr_autoconf.c:119:34: error: a function declaration without a prototype is deprecated in all versions of C [-Werror,-Wstrict-prototypes] run_interrupt_driven_config_hooks() ^ void This is because run_interrupt_driven_config_hooks() is declared with a (void) argument list, but defined with an empty argument list. Make the definition match the declaration. MFC after: 3 days	2022-07-26 19:59:57 +02:00
Dimitry Andric	f2eb09b089	Adjust function definitions in kern_resource.c to avoid clang 15 warnings With clang 15, the following -Werror warnings are produced: sys/kern/kern_resource.c:1212:10: error: a function declaration without a prototype is deprecated in all versions of C [-Werror,-Wstrict-prototypes] lim_alloc() ^ void sys/kern/kern_resource.c:1365:11: error: a function declaration without a prototype is deprecated in all versions of C [-Werror,-Wstrict-prototypes] uihashinit() ^ void This is because lim_alloc() and uihashinit() are declared with (void) argument lists, but defined with empty argument lists. Make the definitions match the declarations. MFC after: 3 days	2022-07-26 19:59:57 +02:00
Dimitry Andric	db8ea61ae2	Adjust function definitions in kern_dtrace.c to avoid clang 15 warnings With clang 15, the following -Werror warnings are produced: sys/kern/kern_dtrace.c:64:18: error: a function declaration without a prototype is deprecated in all versions of C [-Werror,-Wstrict-prototypes] kdtrace_proc_size() ^ void sys/kern/kern_dtrace.c:87:20: error: a function declaration without a prototype is deprecated in all versions of C [-Werror,-Wstrict-prototypes] kdtrace_thread_size() ^ void This is because kdtrace_proc_size() and kdtrace_thread_size() are declared with (void) argument lists, but defined with empty argument lists. Make the definitions match the declarations. MFC after: 3 days	2022-07-26 19:59:57 +02:00
Dimitry Andric	9806e82a23	Adjust function definitions in kern_cons.c to avoid clang 15 warnings With clang 15, the following -Werror warnings are produced: sys/kern/kern_cons.c:201:14: error: a function declaration without a prototype is deprecated in all versions of C [-Werror,-Wstrict-prototypes] cninit_finish() ^ void sys/kern/kern_cons.c:376:7: error: a function declaration without a prototype is deprecated in all versions of C [-Werror,-Wstrict-prototypes] cngrab() ^ void sys/kern/kern_cons.c:389:9: error: a function declaration without a prototype is deprecated in all versions of C [-Werror,-Wstrict-prototypes] cnungrab() ^ void sys/kern/kern_cons.c:402:9: error: a function declaration without a prototype is deprecated in all versions of C [-Werror,-Wstrict-prototypes] cnresume() ^ void This is because cninit_finish(), cngrab(), cnungrab(), and cnresume() are declared with (void) argument lists, but defined with empty argument lists. Make the definitions match the declarations. MFC after: 3 days	2022-07-26 19:59:56 +02:00
Ka Ho Ng	8c9aa94b42	Convert runtime param checks to KASSERTs for fo_fspacectl Reviewed by: markj Differential Revision: https://reviews.freebsd.org/D35880	2022-07-23 15:16:23 -04:00
Colin Percival	84ec7df0d7	Add kern.reboot_wait_time sysctl Historic FreeBSD behaviour (dating back to 1994-04-02) when rebooting is to print "Rebooting..." and then /* wait 1 sec for printf's to complete and be read */ Prior to April 1994, there was a 100 ms delay (added 1993-11-12). Since (a) most users will already be aware that the system is rebooting and do not need to take time to read an additional message to that effect, and (b) most FreeBSD systems don't have anyone actively looking at the console anyway, this delay no longer serves much purpose. This commit adds a kern.reboot_wait_time sysctl which defaults to 0; historic behaviour can be regained by setting it to 1. Reviewed by: imp Relnotes: FreeBSD now reboots faster; to restore the traditional wait after printing "Rebooting..." to the console, set kern.reboot_wait_time=1 (or more). Sponsored by: https://www.patreon.com/cperciva Differential Revision: https://reviews.freebsd.org/D35796	2022-07-18 17:23:25 -07:00
Mitchell Horne	2449b9e5fe	mac: kdb/ddb framework hooks Add three simple hooks to the debugger allowing for a loaded MAC policy to intervene if desired: 1. Before invoking the kdb backend 2. Before ddb command registration 3. Before ddb command execution We extend struct db_command with a private pointer and two flag bits reserved for policy use. Reviewed by: markj Sponsored by: Juniper Networks, Inc. Sponsored by: Klara, Inc. Differential Revision: https://reviews.freebsd.org/D35370	2022-07-18 22:06:13 +00:00
Mitchell Horne	c84c5e00ac	ddb: annotate some commands with DB_CMD_MEMSAFE This is not completely exhaustive, but covers a large majority of commands in the tree. Reviewed by: markj Sponsored by: Juniper Networks, Inc. Sponsored by: Klara, Inc. Differential Revision: https://reviews.freebsd.org/D35583	2022-07-18 22:06:09 +00:00
Mark Johnston	bd980ca847	sched_ule: Ensure we hold the thread lock when modifying td_flags The load balancer may force a running thread to reschedule and pick a new CPU. To do this it sets some flags in the thread running on a loaded CPU. But the code assumed that a running thread's lock is the same as that of the corresponding runqueue, and there are small windows where this is not true. In this case, we can end up with non-atomic modifications to td_flags. Since this load balancing is best-effort, simply give up if the thread's lock doesn't match; in this case the thread is about to enter the scheduler anyway. Reviewed by: kib Reported by: glebius Fixes: `e745d729be` ("sched_ule(4): Improve long-term load balancer.") MFC after: 2 weeks Sponsored by: The FreeBSD Foundation Differential Revision: https://reviews.freebsd.org/D35821	2022-07-18 15:52:27 -04:00
Kornel Dulęba	939f0b6323	Implement shared page address randomization It used to be mapped at the top of the UVA. If the randomization is enabled any address above .data section will be randomly chosen and a guard page will be inserted in the shared page default location. The shared page is now mapped in exec_map_stack, instead of exec_new_vmspace. The latter function is called before image activator has a chance to parse ASLR related flags. The KERN_PROC_VM_LAYOUT sysctl was extended to provide shared page address. The feature is enabled by default for 64 bit applications on all architectures. It can be toggled with the kern.elf64.aslr.shared_page sysctl. Approved by: mw(mentor) Sponsored by: Stormshield Obtained from: Semihalf Reviewed by: kib Differential Revision: https://reviews.freebsd.org/D35349	2022-07-18 16:27:37 +02:00
Kornel Dulęba	361971fbca	Rework how shared page related data is stored Store the shared page address in struct vmspace. Also instead of storing absolute addresses of various shared page segments save their offsets with respect to the shared page address. This will be more useful when the shared page address is randomized. Approved by: mw(mentor) Sponsored by: Stormshield Obtained from: Semihalf Reviewed by: kib Differential Revision: https://reviews.freebsd.org/D35393	2022-07-18 16:27:32 +02:00
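The base-plus-offset idea in 361971fbca can be sketched as follows. This is an illustration under stated assumptions, not the kernel's layout: `struct example_vmspace`, `struct example_segment`, and `example_segment_addr()` are hypothetical names invented for the sketch.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical per-address-space record; not the kernel's struct vmspace. */
struct example_vmspace {
	uintptr_t shared_page_base;	/* where the shared page is mapped */
};

/* Hypothetical segment descriptor: position stored relative to the base. */
struct example_segment {
	uintptr_t offset;		/* offset from shared_page_base */
};

/* Resolve a segment's absolute address from the current base. */
uintptr_t
example_segment_addr(const struct example_vmspace *vm,
    const struct example_segment *seg)
{
	return (vm->shared_page_base + seg->offset);
}
```

Because only the base changes when the page is placed elsewhere, the stored offsets stay valid for every process regardless of where its shared page lands.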
Kornel Dulęba	f6ac79fb12	Introduce the PROC_SIGCODE() macro Use a getter macro instead of fetching the sigcode address directly from a sysent of a given process. It assumes that the sigcode is stored in the shared page, which is true in all cases, except for a.out binaries. This will be later useful when the shared page address randomization is introduced. No functional change intended. Approved by: mw(mentor) Sponsored by: Stormshield Obtained from: Semihalf Reviewed by: kib Differential Revision: https://reviews.freebsd.org/D35392	2022-07-18 16:27:26 +02:00
Mark Johnston	46eab86035	callout: Simplify the inner loop in callout_process() a bit - Use LIST_FOREACH_SAFE. - Simplify control flow. No functional change intended. MFC after: 1 week Sponsored by: The FreeBSD Foundation	2022-07-17 13:58:19 -04:00
Mark Johnston	aac7c7ac54	callout: Remove a redundant parameter to callout_cc_add() The passed cpuid is always equal to the one stored in the callout structure. No functional change intended. MFC after: 1 week Sponsored by: The FreeBSD Foundation	2022-07-17 13:58:19 -04:00
Mateusz Guzik	6eeba7dbd6	ule: unbreak UP builds Sponsored by: Rubicon Communications, LLC ("Netgate")	2022-07-16 12:45:09 +00:00
Dmitry Chagin	fc90f3a281	ktrace: Increase precision of timestamps. Replace struct timeval in header with struct timespec. To differentiate header formats, add a new KTR_VERSIONED flag set in the header type field similar to the existing KTRDROP flag. To make it easier to extend ktrace headers in the future, extend the existing header with a version field (version 0 is reserved for older records without KTR_VERSIONED) as well as new fields holding the thread ID and CPU ID. Reviewed by: jhb, pauamma Differential Revision: https://reviews.freebsd.org/D35774 MFC after: 2 weeks	2022-07-16 12:46:12 +03:00
John Baldwin	2cf7870864	Collapse interrupt thread priorities. Allow high priority hardware interrupts to run at PI_REALTIME via INTR_TYPE_CLK, but collapse all other hardware interrupt threads to the next priority level (PI_INTR). Collapse all SWI priorities to the same priority level (PI_SOFT) just below PI_INTR. Reviewed by: kib, markj Sponsored by: Netflix Differential Revision: https://reviews.freebsd.org/D35646	2022-07-14 13:14:33 -07:00
John Baldwin	40efe74352	4bsd: Simplistic time-sharing for interrupt threads. If an interrupt thread runs for a full quantum without yielding the CPU, demote its priority and schedule a preemption to give other ithreads a turn. Reviewed by: kib Sponsored by: Netflix Differential Revision: https://reviews.freebsd.org/D35645	2022-07-14 13:14:17 -07:00
John Baldwin	954cffe95d	ule: Simplistic time-sharing for interrupt threads. If an interrupt thread runs for a full quantum without yielding the CPU, demote its priority and schedule a preemption to give other ithreads a turn. Reviewed by: kib, markj Sponsored by: Netflix Differential Revision: https://reviews.freebsd.org/D35644	2022-07-14 13:13:57 -07:00
John Baldwin	ed998d1c24	ithreads: Support priority adjustment by schedulers. Use sched_wakeup instead of sched_add when marking an ithread runnable. This allows schedulers to reset their internal time slice tracking state and restore the base ithread priority when an ithread resumes from idle. Reviewed by: markj Sponsored by: Netflix Differential Revision: https://reviews.freebsd.org/D35643	2022-07-14 13:13:35 -07:00
John Baldwin	fea89a2804	Add sched_ithread_prio to set the base priority of an interrupt thread. Use it instead of sched_prio when setting the priority of an interrupt thread. Reviewed by: kib, markj Sponsored by: Netflix Differential Revision: https://reviews.freebsd.org/D35642	2022-07-14 13:13:10 -07:00
Mark Johnston	6cbc4ceb7a	sched_ule: Use the correct atomic_load variant for tdq_lowpri Reported by: tuexen Fixes: `11484ad8a2` ("sched_ule: Use explicit atomic accesses for tdq fields")	2022-07-14 15:34:02 -04:00
Mark Johnston	11484ad8a2	sched_ule: Use explicit atomic accesses for tdq fields Different fields in the tdq have different synchronization protocols. Some are constant, some are accessed only while holding the tdq lock, some are modified with the lock held but accessed without the lock, some are accessed only on the tdq's CPU, and some are not synchronized by the lock at all. Convert ULE to stop using volatile and instead use atomic_load_* and atomic_store_* to provide the desired semantics for lockless accesses. This makes the intent of the code more explicit, gives more freedom to the compiler when accesses do not need to be qualified, and lets KCSAN intercept unlocked accesses. Thus: - Introduce macros to provide unlocked accessors for certain fields. - Use atomic_load/store for all accesses of tdq_cpu_idle, which is not synchronized by the mutex. - Use atomic_load/store for accesses of the switch count, which is updated by sched_clock(). - Add some comments to fields of struct tdq describing how accesses are synchronized. No functional change intended. Reviewed by: mav, kib MFC after: 1 month Sponsored by: The FreeBSD Foundation Differential Revision: https://reviews.freebsd.org/D35737	2022-07-14 10:45:33 -04:00
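The accessor-macro approach in 11484ad8a2 can be approximated in user space with C11 atomics. This is only an analogue: the kernel uses its own atomic(9) primitives, and the structure, field names, macro names, and the choice of relaxed ordering below are assumptions made for the sketch.

```c
#include <assert.h>
#include <stdatomic.h>

/* Hypothetical queue state; field names are illustrative only. */
struct example_tdq {
	atomic_int cpu_idle;	/* read and written without a lock */
	int load;		/* only touched with the owner's lock held */
};

/*
 * Unlocked accessors.  Spelling the access out as an explicit atomic
 * load or store documents that it is intentionally lockless, instead
 * of marking the whole field volatile.
 */
#define	EXAMPLE_TDQ_IDLE(q) \
	atomic_load_explicit(&(q)->cpu_idle, memory_order_relaxed)
#define	EXAMPLE_TDQ_SET_IDLE(q, v) \
	atomic_store_explicit(&(q)->cpu_idle, (v), memory_order_relaxed)
```

Fields that are only accessed under a lock, such as `load` here, stay plain, which leaves the compiler free to optimize those accesses.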
Mark Johnston	0927ff7814	sched_ule: Enable preemption of curthread in the load balancer The load balancer executes from statclock and periodically tries to move threads among CPUs in order to balance load. It may move a thread to the current CPU (the load balancer always runs on CPU 0). When it does so, it may need to schedule preemption of the interrupted thread. Use sched_setpreempt() to do so, same as sched_add(). PR: 264867 Reviewed by: mav, kib, jhb MFC after: 1 month Sponsored by: The FreeBSD Foundation Differential Revision: https://reviews.freebsd.org/D35744	2022-07-14 10:27:58 -04:00
Mark Johnston	6d3f74a14a	sched_ule: Fix racy loads of pc_curthread Thread switching used to be atomic with respect to the current CPU's tdq lock. Since commit `686bcb5c14` that is no longer the case. Now sched_switch() does this: 1. lock tdq (might already be locked) 2. maybe put the current thread in the tdq, choose a new thread to run 2a. update tdq_lowpri 3. unlock tdq 4. switch CPU context, update curthread Some code paths in ULE will load pc_curthread from a remote CPU with that CPU's tdq lock held, usually to inspect its priority. But, as of the aforementioned commit this is racy. The problem I noticed is in tdq_notify(), which optionally sends an IPI to a remote CPU when a new thread is added to its runqueue. If the new thread's priority is higher (lower) than the currently running thread's priority, then we deliver an IPI. But inspecting pc_curthread->td_priority doesn't work, since pc_curthread might be between steps 3 and 4 above. If pc_curthread's priority is higher than that of the newly added thread, but pc_curthread is switching to a lower-priority thread, then tdq_notify() might fail to deliver an IPI, leaving a high priority thread stuck on the runqueue for longer than it should. This can cause multi-millisecond stalls in interactive/ithread/realtime threads. Fix this problem by modifying tdq_add() and tdq_move() to return the value of tdq_lowpri before the addition of the new thread. This ensures that tdq_notify() has the correct priority value to compare against. The other two uses of pc_curthread are susceptible to the same race. To fix the one in sched_rem()->tdq_setlowpri() we need to have an exact value for curthread. Thus, introduce a new tdq_curthread field to the tdq which gets updated any time a new thread is selected to run on the CPU. Because this field is synchronized by the thread lock, its priority reflects the correct lowpri value for the tdq. 
PR: 264867 Fixes: `686bcb5c14` ("schedlock 4/4") Reviewed by: mav, kib, jhb MFC after: 1 month Sponsored by: The FreeBSD Foundation Differential Revision: https://reviews.freebsd.org/D35736	2022-07-14 10:27:51 -04:00
Mark Johnston	ef221ff645	time: Make realitexpire() local to kern_time.c MFC after: 1 week Sponsored by: The FreeBSD Foundation	2022-07-13 09:57:28 -04:00
Mark Johnston	38e1d32dab	callout: Simplify cpuid validation in callout_reset_sbt_on() - Remove a flag variable. - Convert a runtime check of the passed cpuid to a KASSERT. - Remove the cc_inited flag. An attempt to schedule a callout before SI_SUB_CPU will crash anyway since the per-CPU mutexes won't have been initialized, and that flag was only checked in the case where a cpuid was explicitly specified by the caller. No functional change intended. MFC after: 2 weeks Sponsored by: The FreeBSD Foundation	2022-07-13 09:47:33 -04:00
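The "runtime check to KASSERT" conversion mentioned in 38e1d32dab can be illustrated in user space with assert(3) standing in for KASSERT. This is a sketch under assumptions, not the callout code: `EXAMPLE_MAXCPU`, `example_schedule_on()`, and `example_scheduled()` are hypothetical.

```c
#include <assert.h>

#define	EXAMPLE_MAXCPU	4	/* hypothetical CPU count for the sketch */

/* Hypothetical per-CPU slots recording the last scheduled value. */
static int example_target[EXAMPLE_MAXCPU];

/*
 * Caller contract: 0 <= cpu < EXAMPLE_MAXCPU.  A violation is treated
 * as a programming error, so it trips an assertion in debug builds
 * instead of being handled with a runtime error path.
 */
void
example_schedule_on(int cpu, int value)
{
	assert(cpu >= 0 && cpu < EXAMPLE_MAXCPU);
	example_target[cpu] = value;
}

/* Read back the value recorded for a CPU. */
int
example_scheduled(int cpu)
{
	assert(cpu >= 0 && cpu < EXAMPLE_MAXCPU);
	return (example_target[cpu]);
}
```

As with KASSERT in kernels built without INVARIANTS, the check disappears when assertions are compiled out (here, with `NDEBUG`), so it documents the contract without adding cost to production builds.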
Mark Johnston	ece453d5fa	eventtimer: Simplify KTR traces Stop including the current CPU in all event messages, since it's already saved in KTR log entries and thus is redundant. All eventtimer traces occur in a context where CPU migration is not possible. MFC after: 1 week Sponsored by: The FreeBSD Foundation	2022-07-11 15:58:43 -04:00
Mark Johnston	a889a65ba3	eventtimer: Fix several races in the timer reload code In handleevents(), lock the timer state before fetching the time for the next event. A concurrent callout_cc_add() call might be changing the next event time, and the race can cause handleevents() to program an out-of-date time, causing the callout to run later (by an unbounded period, up to the idle hardclock period of 1s) than requested. In cpu_idleclock(), call getnextcpuevent() with the timer state mutex held, for similar reasons. In particular, cpu_idleclock() runs with interrupts enabled, so an untimely timer interrupt can result in a stale next event time being programmed. Further, an interrupt can cause cpu_idleclock() to use a stale value for "now". In cpu_activeclock(), disable interrupts before loading "now", so as to avoid going backwards in time when calling handleevents(). It's ok to leave interrupts enabled when checking "state->idle", since the race at worst will cause handleevents() to be called unnecessarily. But use an atomic load to indicate that the test is racy. PR: 264867 Reviewed by: mav, jhb, kib MFC after: 2 weeks Sponsored by: The FreeBSD Foundation Differential Revision: https://reviews.freebsd.org/D35735	2022-07-11 15:58:43 -04:00
Mark Johnston	ebb3cb6195	eventtimer: Pass a pcpu state pointer to getnext(cpu)event() Callers have already loaded the pointer, so these functions don't need to fetch it again. No functional change intended. MFC after: 1 week Sponsored by: The FreeBSD Foundation	2022-07-11 15:58:43 -04:00
Mark Johnston	ba71333f60	sched_ule: Fix a typo in a comment PR: 226107 MFC after: 1 week	2022-07-11 15:58:43 -04:00
Mark Johnston	ef80894c9d	sched_ule: Purge an obsolete comment The referenced bitmask was removed in commit `62fa74d95a`. MFC after: 1 week Sponsored by: The FreeBSD Foundation	2022-07-11 15:58:43 -04:00
Mark Johnston	35dd6d6cb5	sched_ule: Eliminate a superfluous local variable in tdq_move() No functional change intended. MFC after: 1 week Sponsored by: The FreeBSD Foundation	2022-07-11 15:58:43 -04:00
Gleb Smirnoff	c261510ef5	sockets: fix setsockopt(SO_RCVTIMEO) on a listening socket MFC after: 3 weeks	2022-07-08 11:33:24 -07:00
Mitchell Horne	258958b3c7	ddb: use _FLAGS command macros where appropriate Some command definitions were forced to use DB_FUNC in order to specify their required flags, CS_OWN or CS_MORE. Use the new macros to simplify these. Reviewed by: markj, jhb MFC after: 3 days Sponsored by: Juniper Networks, Inc. Sponsored by: Klara, Inc. Differential Revision: https://reviews.freebsd.org/D35582	2022-07-05 11:56:55 -03:00
Gleb Smirnoff	d8596171c5	sockets: use only soref()/sorele() as socket reference count o Retire SS_FDREF as it is basically a debug flag on top of already existing soref()/sorele(). o Convert SS_PROTOREF into soref()/sorele(). o Change reference model for the listen queues, see below. o Make sofree() private. The correct KPI to use is only sorele(). o Make soabort() respect the model and sorele() instead of sofree(). Note on listening queues. Until now the sockets on a queue had zero reference count, and a reference was given only upon accept(2). The assumption was that there is no way to see the queued socket from anywhere except its head. This is not true, since queued sockets already have pcbs, which are linked at least into the global pcb lists. With this change we put the reference right in the sonewconn() and on accept(2) path we just hand the existing reference to the file descriptor. Differential revision: https://reviews.freebsd.org/D35679	2022-07-04 12:40:51 -07:00
Gleb Smirnoff	bc7605647c	sockets: use positive flag for file descriptor socket reference Rename SS_NOFDREF to SS_FDREF and flip all bitwise operations. Mark sockets created by socreate() with SS_FDREF. This change is mostly illustrative. With it we see that SS_FDREF is a debugging flag, since: * socreate() takes a reference with soref(). * on accept path solisten_dequeue() takes a reference with soref() and then soaccept() sets SS_FDREF. * soclose() checks SS_FDREF, removes it and does sorele(). Reviewed by: tuexen Differential revision: https://reviews.freebsd.org/D35678	2022-07-04 12:40:51 -07:00
Warner Losh	b69996d1d5	tty: Default to printing kernel stack traceback only on INVARIANT kernels Change the default from printing a brief kernel thread stack information back to omitting it for non-invariant kernels in response to SIGINFO/^T. Full and brief stack support can be selected with the kern.tty_info_kstacks sysctl. MFC After: 2 weeks Sponsored by: Netflix Reviewed by: grembo, jhb Differential Revision: https://reviews.freebsd.org/D35576	2022-07-02 08:02:12 -06:00
John Baldwin	0bd73da206	busdma_bounce: Use PRI_ITHD scheduling class for worker thread. Reviewed by: kib Sponsored by: Netflix Differential Revision: https://reviews.freebsd.org/D35641	2022-06-30 10:06:04 -07:00
John Baldwin	0288d4277f	Add register sets for NT_THRMISC and NT_PTLWPINFO. For the kernel this is mostly a non-functional change. However, this will be useful for simplifying gcore(1). Reviewed by: markj MFC after: 2 weeks Sponsored by: DARPA Differential Revision: https://reviews.freebsd.org/D35666	2022-06-30 10:04:56 -07:00
Gleb Smirnoff	66c8e3fccf	socket: fix listen(2) on an already listening socket Reviewed by: markj Differential revision: https://reviews.freebsd.org/D35669 Fixes: `141fe2dcee`	2022-06-30 07:50:29 -07:00
Konstantin Belousov	ad175a107b	vfs_mount.c: convert explicit panics and KASSERTs to MPASSERT/MPPASS Reviewed by: imp, mjg Tested by: pho Sponsored by: The FreeBSD Foundation MFC after: 1 week Differential revision: https://reviews.freebsd.org/D35652	2022-06-29 21:31:47 +03:00
Konstantin Belousov	1e54362824	vfs_op_exit(): assert that mnt_vfs_ops stays non-zero for unmount or suspend Reviewed by: mjg Tested by: pho Sponsored by: The FreeBSD Foundation MFC after: 1 week Differential revision: https://reviews.freebsd.org/D35639	2022-06-29 21:31:47 +03:00
Jamie Gritton	7060da62ff	jail: Remove a prison's shared memory when it dies Add shm_remove_prison(), that removes all POSIX shared memory segments belonging to a prison. Call it from prison_cleanup() so a prison won't be stuck in a dying state due to the resources still held. PR: 257555 Reported by: grembo	2022-06-29 10:47:39 -07:00
Jamie Gritton	a9f7455c38	jail: add prison_cleanup() to release resources held by a dying jail Currently, when a jail starts dying, either by losing its last user reference or by being explicitly killed, osd_jail_call(...PR_METHOD_REMOVE...) is called. Encapsulate this into a function prison_cleanup() that can then do other cleanup.	2022-06-29 10:33:05 -07:00
Gleb Smirnoff	48a55bbfe9	unix: change error code for recvmsg() failed due to RLIMIT_NOFILE Instead of returning EMSGSIZE pass the error code from fdallocn() directly to userland. That would be EMFILE, which makes much more sense. This error code is not listed in the specification[1], but the specification doesn't cover such an edge case at all. Meanwhile the specification lists EMSGSIZE as the error code for invalid value of msg_iovlen, and FreeBSD follows that, see sys_recvmsg(). Differentiating these two cases will make a developer/admin life much easier when debugging. [1] https://pubs.opengroup.org/onlinepubs/9699919799/functions/recvmsg.html Reviewed by: markj Differential revision: https://reviews.freebsd.org/D35640	2022-06-29 09:42:58 -07:00
Kristof Provost	ab91feabcc	ovpn: Introduce OpenVPN DCO support OpenVPN Data Channel Offload (DCO) moves OpenVPN data plane processing (i.e. tunneling and cryptography) into the kernel, rather than using tap devices. This avoids significant copying and context switching overhead between kernel and user space and improves OpenVPN throughput. In my test setup throughput improved from around 660Mbit/s to around 2Gbit/s. Sponsored by: Rubicon Communications, LLC ("Netgate") Differential Revision: https://reviews.freebsd.org/D34340	2022-06-28 11:33:10 +02:00
Mateusz Guzik	7388fb714a	cache: drop the vfs.cache_rename_add tunable The functionality has been in use since Jan 2021 -- long enough(tm).	2022-06-27 09:56:20 +02:00
Gleb Smirnoff	458f475df8	unix/dgram: smart socket buffers for one-to-many sockets A one-to-many unix/dgram socket is a socket that has been bound with bind(2) and can get multiple connections. A typical example is /var/run/log bound by syslogd(8) and receiving multiple connections from libc syslog(3) API. Until now all of these connections shared the same receive socket buffer of the bound socket. This made the socket vulnerable to overflow attack. See `240d5a9b1c` for a historical attempt to work around the problem. This commit creates a per-connection socket buffer for every single connected socket and eliminates the problem. The new behavior will optimize seldom writers over frequent writers. See added test case scenarios and code comments for more detailed description of the new behavior. Reviewed by: markj Differential revision: https://reviews.freebsd.org/D35303	2022-06-24 09:09:11 -07:00
Gleb Smirnoff	1093f16487	unix/dgram: reduce mbuf chain traversals in send(2) and recv(2) o Use m_pkthdr.memlen from m_uiotombuf() o Modify unp_internalize() to keep track of allocated space and memory as well as pointer to the last buffer. o Modify unp_addsockcred() to keep track of allocated space and memory as well as pointer to the last buffer. o Record the datagram len/memlen/ctllen in the first (from) mbuf of the chain in uipc_sosend_dgram() and reuse it in uipc_soreceive_dgram(). Reviewed by: markj Differential revision: https://reviews.freebsd.org/D35302	2022-06-24 09:09:11 -07:00
Gleb Smirnoff	9b841b0e23	m_uiotombuf: write total memory length of the allocated chain in pkthdr Data allocated by m_uiotombuf() usually goes into a socket buffer. We are interested in the length of useful data to be added to sb_acc, as well as total memory used by mbufs. The latter would be added to sb_mbcnt. Calculating this value at allocation time saves an extra traversal of the mbuf chain. Reviewed by: markj Differential revision: https://reviews.freebsd.org/D35301	2022-06-24 09:09:11 -07:00
Gleb Smirnoff	a7444f807e	unix/dgram: use minimal possible socket buffer for PF_UNIX/SOCK_DGRAM This change fully splits away PF_UNIX/SOCK_DGRAM from other socket buffer implementations, without any behavior changes. Generic socket implementation is reduced down to one STAILQ and very little code. Reviewed by: markj Differential revision: https://reviews.freebsd.org/D35300	2022-06-24 09:09:11 -07:00
Gleb Smirnoff	a4fc41423f	sockets: enable protocol specific socket buffers Split struct sockbuf into common shared fields and protocol specific union, where protocols are free to implement whatever buffer they want. Such protocols should mark themselves with PR_SOCKBUF and are expected to initialize their buffers in their pr_attach and tear them down in pr_detach. Reviewed by: markj Differential revision: https://reviews.freebsd.org/D35299	2022-06-24 09:09:10 -07:00
Gleb Smirnoff	315167c0de	unix: provide an option to return locked from unp_connectat() Use this new version in unix/dgram socket when sending to a target address. This removes extra lock release/acquisition and possible counter-intuitive ENOTCONN. Reviewed by: markj Differential revision: https://reviews.freebsd.org/D35298	2022-06-24 09:09:10 -07:00
Gleb Smirnoff	5dc8dd5f3a	unix/dgram: inline sbappendaddr_locked() into uipc_sosend_dgram() This allows to remove one M_NOWAIT allocation and also makes it more clear what's going on. Reviewed by: markj Differential revision: https://reviews.freebsd.org/D35297	2022-06-24 09:09:10 -07:00
Gleb Smirnoff	e3fbbf965e	unix/dgram: add a specific receive method - uipc_soreceive_dgram With this second step PF_UNIX/SOCK_DGRAM has protocol specific implementation. This opens up some possible performance optimizations. However, it still operates on the same struct socket as all other sockets do. Reviewed by: markj Differential revision: https://reviews.freebsd.org/D35296	2022-06-24 09:09:10 -07:00
Gleb Smirnoff	f384a97c83	unix/dgram: cleanup uipc_send of PF_UNIX/SOCK_DGRAM, step 2 Just remove one level of indentation as the case clause always matches. Reviewed by: markj Differential revision: https://reviews.freebsd.org/D35295	2022-06-24 09:09:10 -07:00
Gleb Smirnoff	7e5b6b391e	unix/dgram: cleanup uipc_send of PF_UNIX/SOCK_DGRAM, step 1 Remove the dead code. The new uipc_sosend_dgram() handles send() on PF_UNIX/SOCK_DGRAM in full. Reviewed by: markj Differential revision: https://reviews.freebsd.org/D35294	2022-06-24 09:09:10 -07:00
Gleb Smirnoff	3464958246	unix/dgram: add a specific send method - uipc_sosend_dgram() This is first step towards splitting classic BSD socket implementation into separate classes. The first to be split is PF_UNIX/SOCK_DGRAM as it has most differences to SOCK_STREAM sockets and to PF_INET sockets. Historically a protocol shall provide two methods for sendmsg(2): pru_sosend and pru_send. The former is a generic send method, e.g. sosend_generic() which would internally call the latter, uipc_send() in our case. There is one important exception, though, the sendfile(2) code will call pru_send directly. But sendfile doesn't work on SOCK_DGRAM, so we can do the trick. We will create socket class specific uipc_sosend_dgram() which will carry only important bits from sosend_generic() and uipc_send(). Reviewed by: markj Differential revision: https://reviews.freebsd.org/D35293	2022-06-24 09:09:10 -07:00
Mitchell Horne	29afffb942	subr_bus: restore bus_null_rescan() Partially revert the previous change; we need to keep this method as a specific override for pci_driver subclasses which should not use pci_rescan_method() -- cardbus and ofw_pcibus. However, change the return value to ENODEV for the same reasoning given in the original commit, and use this as the default rescan method in bus_if.m. Reported by: jhb Fixes: `36a8572ee8` ("bus_if: provide a default null rescan method") MFC with: `36a8572ee8`	2022-06-23 16:07:00 -03:00
Mitchell Horne	8701571df9	set_cputicker: use a bool The third argument to this function indicates whether the supplied ticker is fixed or variable, i.e. requiring calibration. Give this argument a type and name that better conveys this purpose. Reviewed by: kib, markj MFC after: 1 week Sponsored by: The FreeBSD Foundation Differential Revision: https://reviews.freebsd.org/D35459	2022-06-23 15:15:11 -03:00
Mitchell Horne	36a8572ee8	bus_if: provide a default null rescan method There is an existing helper method in subr_bus.c, but almost no drivers know to use it. It also returns the same error as an empty method, making it not very useful. Move this to bus_if.m and return a more sensible error code. This gives a slightly more meaningful error message when attempting 'devctl rescan' on buses and devices alike: "Device not configured" --> "Operation not supported by device" Reviewed by: imp MFC after: 1 week Sponsored by: The FreeBSD Foundation Differential Revision: https://reviews.freebsd.org/D35501	2022-06-23 15:15:10 -03:00
Chuck Silvers	5bd21cbbd1	vfs: fix vfs_bio_clrbuf() for PAGE_SIZE > block size Calculate the desired page valid mask using math that will not overflow the types used. Sponsored by: Netflix Reviewed by: mckusick, kib, markj Differential Revision: https://reviews.freebsd.org/D34837	2022-06-21 17:58:52 -07:00
Mark Johnston	9553bc89db	aio: Improve UMA usage - Remove the AIO proc zone. This zone gets one allocation per AIO daemon process, which isn't enough to warrant a dedicated zone. Plus, unlike other AIO structures, aiops are small (32 bytes with LP64), so UMA doesn't provide better space efficiency than malloc(9). Change one of the malloc types in vfs_aio.c to make it more general. - Don't set the NOFREE flag on the other AIO zones. This flag means that memory allocated to the AIO subsystem is never freed back to the VM, so it's always preferable to avoid using it when possible. NOFREE was set without explanation when AIO was converted to use UMA 20 years ago, but it does not appear to be required; all of the structures allocated from UMA (per-process kaioinfo, kaiocb, and aioliojob) keep track of references and get freed only when none exist. Plus, these structures will contain dangling pointers after they're freed (e.g., the "cred", "fd_file" and "uiop" fields of struct kaiocb), so use-after-frees are dangerous even when the structures themselves are type-stable. Reviewed by: asomers MFC after: 1 month Sponsored by: The FreeBSD Foundation Differential Revision: https://reviews.freebsd.org/D35493	2022-06-20 12:48:13 -04:00
Damjan Jovanovic	8c309d48aa	struct kinfo_file changes needed for lsof to work using only usermode APIs Add kf_pipe_buffer_[in/out/size] fields to kf_pipe, and populate them. Add a kf_kqueue struct to the kf_un union, to allow querying kqueue state, and populate it. Populate the kf_sock_rcv_sb_state and kf_sock_snd_sb_state fields in kf_sock for INET/INET6 sockets, and populate all other fields for all transport layer protocols, not just TCP. Bump __FreeBSD_version. Differential revision: https://reviews.freebsd.org/D34184 Reviewed by: jhb, kib, se MFC after: 1 week	2022-06-18 12:34:25 +03:00
Damjan Jovanovic	8ae7694913	KERN_LOCKF: report kl_file_fsid consistently with stat(2) PR: 264723 Reviewed by: kib Discussed with: markj MFC after: 1 week	2022-06-18 12:34:17 +03:00
Mark Johnston	f6379f7fde	socket: Fix a race between kevent(2) and listen(2) When locking the knote list for a socket, we check whether the socket is a listening socket in order to select the appropriate mutex; a listening socket uses the socket lock, while data sockets use socket buffer mutexes. If SOLISTENING(so) is false and the knote lock routine locks a socket buffer, then it must re-check whether the socket is a listening socket since solisten_proto() could have changed the socket's identity while we were blocked on the socket buffer lock. Reported by: syzkaller Reviewed by: glebius MFC after: 2 weeks Sponsored by: The FreeBSD Foundation Differential Revision: https://reviews.freebsd.org/D35492	2022-06-16 10:20:04 -04:00
Mark Johnston	756bc3adc5	kasan: Create a shadow for the bootstack prior to hammer_time() When the kernel is compiled with -asan-stack=true, the address sanitizer will emit inline accesses to the shadow map. In other words, some shadow map accesses are not intercepted by the KASAN runtime, so they cannot be disabled even if the runtime is not yet initialized by kasan_init() at the end of hammer_time(). This went unnoticed because the loader will initialize all PML4 entries of the bootstrap page table to point to the same PDP page, so early shadow map accesses do not raise a page fault, though they are silently corrupting memory. In fact, when the loader does not copy the staging area, we do get a page fault since in that case only the first and last PML4Es are populated by the loader. But due to another bug, the loader always treated KASAN kernels as non-relocatable and thus always copied the staging area. It is not really practical to annotate hammer_time() and all callees with __nosanitizeaddress, so instead add some early initialization which creates a shadow for the boot stack used by hammer_time(). This is only needed by KASAN, not by KMSAN, but the shared pmap code handles both. Reported by: mhorne Reviewed by: kib MFC after: 1 month Sponsored by: The FreeBSD Foundation Differential Revision: https://reviews.freebsd.org/D35449	2022-06-15 11:39:10 -04:00
Doug Ambrisko	ce00b11940	mount: revert the active vnode reporting feature Revert the computing of active vnode reporting since statfs is used by a lot of tools. Only report the vnodes used. Reported by: mjg	2022-06-15 07:24:55 -07:00
Mark Johnston	7565431f30	mount: Fix an incorrect assertion in kernel_mount() The pointer to the mount values may be null if an error occurred while copying them in, so fix the assertion condition to reflect that possibility. While here, move some initialization code into the error == 0 block. No functional change intended. Reported by: syzkaller MFC after: 2 weeks Sponsored by: The FreeBSD Foundation	2022-06-14 12:00:59 -04:00
Mark Johnston	630f633f2a	vm_object: Use the vm_object_(set\|clear)_flag() helpers ... rather than setting and clearing flags inline. No functional change intended. Reviewed by: alc, kib MFC after: 1 week Sponsored by: The FreeBSD Foundation Differential Revision: https://reviews.freebsd.org/D35469	2022-06-14 12:00:59 -04:00
Mark Johnston	e8955bd643	pipe: Use a distinct wait channel for I/O serialization Suppose a thread tries to read from an empty pipe. pipe_read() does the following: 1. pipelock(), possibly sleeping 2. check for buffered data 3. pipeunlock() 4. set PIPE_WANTR and sleep 5. goto 1 pipelock() is an open-coded mutex; if a thread blocks in pipelock(), it sleeps until the lock holder calls pipeunlock(). Both sleeps use the same wait channel. So if there are multiple threads in pipe_read(), a thread T1 in step 3 can wake up a thread T2 sleeping in step 4. Then T1 goes to sleep in step 4, and T2 acquires and releases the pipelock, waking up T1 again. This can go on indefinitely, livelocking the process (and potentially starving a would-be writer). Fix the problem by using a separate wait channel for pipelock(). Reported by: Paul Floyd <paulf2718@gmail.com> Reviewed by: mjg, kib PR: 264441 MFC after: 1 month Sponsored by: The FreeBSD Foundation Differential Revision: https://reviews.freebsd.org/D35415	2022-06-14 12:00:59 -04:00
Cy Schubert	d781401512	kern_thread.c: Fix i386 build Chase `4493a13e3b` by updating static assertions of struct proc.	2022-06-13 19:35:33 -07:00
Konstantin Belousov	1575804961	reap_kill_proc(): avoid singlethreading any other process if we are exiting This is racy because curproc process lock is not used, but allows the process to exit faster. It is a userspace issue to create such a race anyway, and not fulfilling the guarantee that all reaper descendants are signalled should be fine. In collaboration with: pho Reviewed by: markj Sponsored by: The FreeBSD Foundation MFC after: 2 weeks Differential revision: https://reviews.freebsd.org/D35310	2022-06-13 22:30:03 +03:00
Konstantin Belousov	e0343eacf3	reap_kill_subtree(): hold the reaper when entering it into the queue to handle later We drop proctree_lock, which allows the process to exit while memoized in the list to proceed. Reported and tested by: pho Reviewed by: markj Sponsored by: The FreeBSD Foundation MFC after: 2 weeks Differential revision: https://reviews.freebsd.org/D35310	2022-06-13 22:30:03 +03:00
Konstantin Belousov	1d4abf2cfa	reap_kill_subtree_once(): handle proctree_lock unlock in reap_kill_proc() Recorded reaper might lose its reaper status, so we should not assert it, but check and avoid signalling if this happens. Reported and tested by: pho Reviewed by: markj Sponsored by: The FreeBSD Foundation MFC after: 2 weeks Differential revision: https://reviews.freebsd.org/D35310	2022-06-13 22:30:03 +03:00
Konstantin Belousov	addf103ce6	reap_kill_proc: do not retry on thread_single() failure The failure means that the process does single-threading itself, which makes our action not needed. Reported and tested by: pho Reviewed by: markj Sponsored by: The FreeBSD Foundation MFC after: 2 weeks Differential revision: https://reviews.freebsd.org/D35310	2022-06-13 22:30:03 +03:00
Konstantin Belousov	008b2e6544	Make stop_all_proc_block interruptible to avoid deadlock with parallel suspension If we try to single-thread a process which thread entered procctl(REAP_KILL_SUBTREE), and sleeping waiting for us unlocking stop_all_proc_blocker, we must be able to finish single-threading. This requires the sleep to be interruptible. Reported and tested by: pho Reviewed by: markj Sponsored by: The FreeBSD Foundation MFC after: 2 weeks Differential revision: https://reviews.freebsd.org/D35310	2022-06-13 22:30:03 +03:00
Mark Johnston	2d5ef216b6	thread_single_end(): consistently maintain p_boundary_count for ALLPROC mode Tested by: pho Sponsored by: The FreeBSD Foundation MFC after: 2 weeks Differential revision: https://reviews.freebsd.org/D35310	2022-06-13 22:30:03 +03:00
Konstantin Belousov	1b4701fe1e	thread_unsuspend(): do not unsuspend the suspended leader thread doing SINGLE_ALLPROC markj wrote: tdsendsignal() may unsuspend a target thread. I think there is at least one bug there: suppose thread T is suspended in thread_single(SINGLE_ALLPROC) when trying to kill another process with REAP_KILL. Suppose a different thread sends SIGKILL to T->td_proc. Then, tdsendsignal() calls thread_unsuspend(T, T->td_proc). thread_unsuspend() incorrectly decrements T->td_proc->p_suspcount to -1. Later, when T->td_proc exits, it will wait forever in thread_single(SINGLE_EXIT) since T->td_proc->p_suspcount never reaches 1. Since the thread suspension is bounded by time needed to do thread_single(), skipping the thread_unsuspend_one() call there should not affect signal delivery if this thread is selected as target. Reported by: markj Tested by: pho Sponsored by: The FreeBSD Foundation MFC after: 2 weeks Differential revision: https://reviews.freebsd.org/D35310	2022-06-13 22:30:03 +03:00
Konstantin Belousov	b9009b1789	thread_single(): remove already checked conditional expression Reviewed by: markj Tested by: pho Sponsored by: The FreeBSD Foundation MFC after: 2 weeks Differential revision: https://reviews.freebsd.org/D35310	2022-06-13 22:30:03 +03:00
Konstantin Belousov	4493a13e3b	Do not single-thread itself when the process single-threaded another process Since both self single-threading and remote single-threading rely on suspending the thread doing thread_single(), they cannot be mixed: a thread doing thread_suspend_switch() might be subject to thread_suspend_one() and vice versa. In collaboration with: pho Reviewed by: markj Sponsored by: The FreeBSD Foundation MFC after: 2 weeks Differential revision: https://reviews.freebsd.org/D35310	2022-06-13 22:30:03 +03:00
Konstantin Belousov	dd883e9a7e	weed_inhib(): correct the condition to re-suspend a thread suspended for SINGLE_ALLPROC mode. There is no need to check for boundary state. It is only required to see that the suspension comes from the ALLPROC mode. In collaboration with: pho Reviewed by: markj Sponsored by: The FreeBSD Foundation MFC after: 2 weeks Differential revision: https://reviews.freebsd.org/D35310	2022-06-13 22:30:03 +03:00
Konstantin Belousov	b9893b3533	weed_inhib(): do not double-suspend already suspended thread if the loop reiterates In collaboration with: pho Reviewed by: markj Sponsored by: The FreeBSD Foundation MFC after: 2 weeks Differential revision: https://reviews.freebsd.org/D35310	2022-06-13 22:30:02 +03:00
Konstantin Belousov	d7a9e6e740	thread_single: wait for P_STOPPED_SINGLE to pass to avoid ALLPROC mode to try to race with any other single-threading mode. In collaboration with: pho Reviewed by: markj Sponsored by: The FreeBSD Foundation MFC after: 2 weeks Differential revision: https://reviews.freebsd.org/D35310	2022-06-13 22:30:02 +03:00
Konstantin Belousov	02a2aacbe2	issignal(): ignore signals when process is single-threading for exit Places that will wait for curproc->p_singlethr to become zero (in the next commit, the counter of number of external single-threading is to be introduced), must wait for it interruptible, otherwise we deadlock. On the other hand, a signal delivered during this window, if directed to the waiting thread, would cause the wait loop to become a busy loop. Since we are exiting, it is safe to ignore the signals. Reviewed by: markj Tested by: pho Sponsored by: The FreeBSD Foundation MFC after: 2 weeks Differential revision: https://reviews.freebsd.org/D35310	2022-06-13 22:30:02 +03:00
Konstantin Belousov	d3000939c7	P2_WEXIT: avoid thread_single() for exiting process earlier before the process itself does thread_single(SINGLE_EXIT). We cannot single-thread such process in ALLPROC (external) mode, and properly detect and report the failure to do so due to the process becoming zombie is easier to prevent than handle. In collaboration with: pho Reviewed by: markj Sponsored by: The FreeBSD Foundation MFC after: 2 weeks Differential revision: https://reviews.freebsd.org/D35310	2022-06-13 22:30:02 +03:00
Doug Ambrisko	6468cd8e0e	mount: add vnode usage per file system with mount -v This avoids the need to drop into the ddb to figure out vnode usage per file system. It helps to see if they are or are not being freed. Suggestion to report active vnode count was from kib@ Reviewed by: kib Differential Revision: https://reviews.freebsd.org/D35436	2022-06-13 07:56:38 -07:00
Hans Petter Selasky	b8394039dc	mbuf(9): Fix size of mbuf for all 32-bit platforms (i386, ARM, PowerPC and RISCV) Do this by reducing the size of the MBUF_PEXT_MAX_PGS, causing "struct mbuf" to be bigger than M_SIZE, and also add a missing padding field to ensure 64-bit alignment. Reviewed by: gallatin@ Reported by: Elliott Mitchell Differential revision: https://reviews.freebsd.org/D35339 MFC after: 1 week Sponsored by: NVIDIA Networking	2022-06-07 22:09:10 +02:00
Hans Petter Selasky	fe8c78f0d2	ktls: Add full support for TLS RX offloading via network interface. Basic TLS RX offloading uses the "csum_flags" field in the mbuf packet header to figure out if an incoming mbuf has been fully offloaded or not. This information follows the packet stream via the LRO engine, IP stack and finally to the TCP stack. The TCP stack preserves the mbuf packet header also when re-assembling packets after packet loss. When the mbuf goes into the socket buffer the packet header is demoted and the offload information is transferred to "m_flags" . Later on a worker thread will analyze the mbuf flags and decide if the mbufs making up a TLS record indicate a fully-, partially- or not decrypted TLS record. Based on these three cases the worker thread will either pass the packet on as-is or recrypt the decrypted bits, if any, or decrypt the packet as usual. During packet loss the kernel TLS code will call back into the network driver using the send tag, informing about the TCP starting sequence number of every TLS record that is not fully decrypted by the network interface. The network interface then stores this information in a compressed table and starts asking the hardware if it has found a valid TLS header in the TCP data payload. If the hardware has found a valid TLS header and the referred TLS header is at a valid TCP sequence number according to the TCP sequence numbers provided by the kernel TLS code, the network driver then informs the hardware that it can resume decryption. Care has been taken to not merge encrypted and decrypted mbuf chains, in the LRO engine and when appending mbufs to the socket buffer. The mbuf's leaf network interface pointer is used to figure out from which network interface the offloading rule should be allocated. Also this pointer is used to track route changes. 
Currently mbuf send tags are used in both transmit and receive direction, due to convenience, but may get a new name in the future to better reflect their usage. Reviewed by: jhb@ and gallatin@ Differential revision: https://reviews.freebsd.org/D32356 Sponsored by: NVIDIA Networking	2022-06-07 12:58:09 +02:00
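
The three-way decision described in `fe8c78f0d2` (fully, partially, or not decrypted TLS record) can be sketched roughly as follows. This is only an illustrative model, not the kernel code: `RecordState`, `classify_record`, `action_for`, and the per-mbuf boolean flags are hypothetical names invented for this sketch, and the mapping of states to actions follows the order given in the commit message.

```python
from enum import Enum


class RecordState(Enum):
    """Hypothetical labels for the three cases named in the commit message."""
    FULLY_DECRYPTED = "fully"
    PARTIALLY_DECRYPTED = "partially"
    NOT_DECRYPTED = "not"


def classify_record(mbuf_decrypted):
    """Classify one TLS record from per-mbuf 'decrypted by NIC' flags.

    mbuf_decrypted is a non-empty list of booleans, one per mbuf in the
    record (an assumption of this sketch; the kernel reads mbuf flags).
    """
    if not mbuf_decrypted:
        raise ValueError("a TLS record needs at least one mbuf")
    if all(mbuf_decrypted):
        return RecordState.FULLY_DECRYPTED
    if any(mbuf_decrypted):
        return RecordState.PARTIALLY_DECRYPTED
    return RecordState.NOT_DECRYPTED


# Assumed mapping, following the order of cases in the commit message.
ACTIONS = {
    RecordState.FULLY_DECRYPTED: "pass as-is",
    RecordState.PARTIALLY_DECRYPTED: "recrypt decrypted bits",
    RecordState.NOT_DECRYPTED: "decrypt as usual",
}


def action_for(mbuf_decrypted):
    """Return the worker-thread action for a record's mbuf flags."""
    return ACTIONS[classify_record(mbuf_decrypted)]
```

For example, `action_for([True, False, True])` falls into the partially decrypted case, which is the one that arises when packet loss interrupts hardware decryption partway through a record.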

19588 Commits