freebsd-dev

Author	SHA1	Message	Date
Kyle Evans	29e400e994	domain: make it safer to add domains post-domainfinalize I can see two concerns for adding domains after domainfinalize: 1.) The slow/fast callouts have already been set up. 2.) Userland could create a socket while we're in the middle of initialization. We can address #1 fairly easily by tracking whether the domain's been initialized for at least the default vnet. There are still some concerns about the callbacks being invoked while a vnet is in the process of being created/destroyed, but this is a pre-existing issue that the callbacks must coordinate anyways. We should also address #2, but technically this has been an issue anyways because we don't assert on post-domainfinalize additions; we don't seem to hit it in practice. Future work can fix that up to make sure we don't find partially constructed domains, but care must be taken to make sure that at least, e.g., the usages of pffindproto in ip_input.c can still find them. Differential Revision: https://reviews.freebsd.org/D25459	2021-08-16 00:59:56 -05:00
Kyle Evans	239aebee61	domain: give domains a chance to probe for availability This gives any given domain a chance to indicate that it's not actually supported on the current system. If dom_probe isn't supplied, we assume the domain is universally applicable as most of them are. Keeping fully-initialized and registered domains around that physically can't work on a large majority of FreeBSD deployments is sub-optimal and leads to errors that aren't consistent with the reality of why the socket can't be created (e.g. ESOCKTNOSUPPORT) because such a scenario has to be caught upon pru_attach, at which point kicking back the more-appropriate EAFNOSUPPORT would seem weird. The initial consumer of this will be hvsock, which is only available on HyperV guests. Reviewed by: cem (earlier version), bcr (manpages) Differential Revision: https://reviews.freebsd.org/D25062	2021-08-16 00:59:56 -05:00
Konstantin Belousov	9446d9e88f	fstatat(2): handle non-vnode file descriptors for AT_EMPTY_PATH Set NIRES_EMPTYPATH earlier, so that the use of EMPTYPATH is recorded even if we are going to return an error. When namei_setup() refuses to accept a dirfd that is not of the vnode type, as indicated by an ENOTDIR error return, fall back to kern_fstat(dirfd). Reported by: dchagin Reviewed by: markj Sponsored by: The FreeBSD Foundation MFC after: 1 week Differential revision: https://reviews.freebsd.org/D31530	2021-08-14 00:17:18 +03:00
Ka Ho Ng	454bc887f2	uipc_shm: Implements fspacectl(2) support This implements fspacectl(2) support on shared memory objects. The semantics of SPACECTL_DEALLOC are equivalent to clearing the backing store and freeing the pages within the affected range. If the call succeeds, subsequent reads on the affected range return all zeros. tests/sys/posixshm/posixshm_tests.c is expanded to include a fspacectl(2) functional test. Sponsored by: The FreeBSD Foundation Reviewed by: kevans, kib Differential Revision: https://reviews.freebsd.org/D31490	2021-08-12 23:04:18 +08:00
Ka Ho Ng	a638dc4ebc	vfs: Add ioflag to VOP_DEALLOCATE(9) The addition of ioflag allows callers to pass IO_SYNC/IO_DATASYNC/IO_DIRECT down to the file system implementation. The vop_stddeallocate fallback implementation is updated to pass the ioflag to the file system implementation. vn_deallocate(9) internally is also changed to pass ioflag to the VOP_DEALLOCATE call. Sponsored by: The FreeBSD Foundation Reviewed by: kib, markj Differential Revision: https://reviews.freebsd.org/D31500	2021-08-12 23:03:49 +08:00
Ka Ho Ng	c15384f896	vfs: Add get_write_ioflag helper to calculate ioflag Converted vn_write to use this helper. Sponsored by: The FreeBSD Foundation MFC after: 3 days Reviewed by: kib Differential Revision: https://reviews.freebsd.org/D31513	2021-08-12 17:35:34 +08:00
Dmitry Chagin	71854d9b2b	fork: Remove the unnecessary spaces. MFC after: 2 weeks	2021-08-12 11:58:17 +03:00
Dmitry Chagin	de8374df28	fork: Allow ABI to specify fork return values for child. At least the Linux x86 ABI does not use the carry bit and expects that the dx register is preserved. For this, add a new sv_set_fork_retval hook and call it from cpu_fork(). Add a short comment about touching dx in x86_set_fork_retval(); for more details see phab comments from kib@ and imp@. Reviewed by: kib Differential revision: https://reviews.freebsd.org/D31472 MFC after: 2 weeks	2021-08-12 11:45:25 +03:00
Eric van Gyzen	13a58148de	netdump: send key before dump, in case dump fails Previously, if an encrypted netdump failed, such as due to a timeout or network failure, the key was not saved, so a partial dump was completely useless. Send the key first, so the partial dump can be decrypted, because even a partial dump can be useful. Reviewed by: bdrewery, markj MFC after: 1 week Sponsored by: Dell EMC Isilon Differential Revision: https://reviews.freebsd.org/D31453	2021-08-11 10:54:56 -05:00
Mark Johnston	10a8e93da1	kmsan: Export kmsan_mark_mbuf() and kmsan_mark_bio() Sponsored by: The FreeBSD Foundation	2021-08-11 16:33:41 -04:00
Andrew Gallatin	95c51fafa4	ktls: Init reset tag task for cloned sessions When cloning a ktls session (which is needed when we need to switch output NICs for a NIC TLS session), we need to also init the reset task, like we do when creating a new tls session. Reviewed by: jhb Sponsored by: Netflix	2021-08-11 14:06:43 -04:00
Mitchell Horne	4ccaa87f69	kdb: Handle process enumeration before procinit() Make kdb_thr_first() and kdb_thr_next() return sane values if the allproc list and pidhashtbl haven't been initialized yet. This can happen if the debugger is entered very early on, for example with the '-d' boot flag. This allows remote gdb to attach at such a time, and fixes some ddb commands like 'show threads'. Be explicit about the static initialization of these variables. This part has no functional change. Reviewed by: markj, imp (previous version) MFC after: 3 days Differential Revision: https://reviews.freebsd.org/D31495	2021-08-11 14:44:22 -03:00
Ka Ho Ng	4a9b832a2a	vfs: Rename ioflg to ioflag in vn_deallocate This includes a style fix around ioflag checking as well. Sponsored by: The FreeBSD Foundation Reviewed by: kib, bcr Differential Revision: https://reviews.freebsd.org/D31505	2021-08-11 17:45:47 +08:00
Alexander Motin	67f508db84	Mark some sysctls as CTLFLAG_MPSAFE. MFC after: 2 weeks	2021-08-10 22:18:26 -04:00
Mark Johnston	100949103a	uma: Add KMSAN hooks For now, just hook the allocation path: upon allocation, items are marked as initialized (absent M_ZERO). Some zones are exempted from this when it would otherwise raise false positives. Use kmsan_orig() to update the origin map for UMA and malloc(9) allocations. This allows KMSAN to print the return address when an uninitialized UMA item is implicated in a report. For example: panic: MSan: Uninitialized UMA memory from m_getm2+0x7fe Sponsored by: The FreeBSD Foundation	2021-08-10 21:27:54 -04:00
Mark Johnston	693c9516fa	busdma: Add KMSAN integration Sanitizer instrumentation of course cannot automatically update shadow state when devices write to host memory. KMSAN thus hooks into busdma, both to update shadow state after a device write, and to verify that the kernel does not publish uninitialized bytes to devices. To implement this, when KMSAN is configured, each dmamap embeds a memory descriptor describing the region currently loaded into the map. bus_dmamap_sync() uses the operation flags to determine whether to validate the loaded region or to mark it as initialized in the shadow map. Note that in cases where the amount of data written is less than the buffer size, the entire buffer is marked initialized even when it is not. For example, if a NIC writes a 128B packet into a 2KB buffer, the entire buffer will be marked initialized, but subsequent accesses past the first 128 bytes are likely caused by bugs. Reviewed by: kib Sponsored by: The FreeBSD Foundation Differential Revision: https://reviews.freebsd.org/D31338	2021-08-10 21:27:54 -04:00
Mark Johnston	b0f71f1bc5	amd64: Add MD bits for KMSAN Interrupt and exception handlers must call kmsan_intr_enter() prior to calling any C code. This is because the KMSAN runtime maintains some TLS in order to track initialization state of function parameters and return values across function calls. Then, to ensure that this state is kept consistent in the face of asynchronous kernel-mode exceptions, the runtime uses a stack of TLS blocks, and kmsan_intr_enter() and kmsan_intr_leave() push and pop that stack, respectively. Use these functions in amd64 interrupt and exception handlers. Note that handlers for user->kernel transitions need not be annotated. Also ensure that trap frames pushed by the CPU and by handlers are marked as initialized before they are used. Reviewed by: kib Sponsored by: The FreeBSD Foundation Differential Revision: https://reviews.freebsd.org/D31467	2021-08-10 21:27:53 -04:00
Mark Johnston	8978608832	amd64: Populate the KMSAN shadow maps and integrate with the VM - During boot, allocate PDP pages for the shadow maps. The region above KERNBASE is currently not shadowed. - Create a dummy shadow for the vm page array. For now, this array is not protected by the shadow map to help reduce kernel memory usage. - Grow shadows when growing the kernel map. - Increase the default kernel stack size when KMSAN is enabled. As with KASAN, sanitizer instrumentation appears to create stack frames large enough that the default value is not sufficient. - Disable UMA's use of the direct map when KMSAN is configured. KMSAN cannot validate the direct map. - Disable unmapped I/O when KMSAN is configured. - Lower the limit on paging buffers when KMSAN is configured. Each buffer has a static MAXPHYS-sized allocation of KVA, which in turn eats 2*MAXPHYS of space in the shadow map. Reviewed by: alc, kib Sponsored by: The FreeBSD Foundation Differential Revision: https://reviews.freebsd.org/D31295	2021-08-10 21:27:53 -04:00
Mark Johnston	5dda15adbc	kern: Ensure that thread-local KMSAN state is available Sponsored by: The FreeBSD Foundation	2021-08-10 21:27:53 -04:00
Mark Johnston	a422084abb	Add the KMSAN runtime KMSAN enables the use of LLVM's MemorySanitizer in the kernel. This enables precise detection of uses of uninitialized memory. As with KASAN, this feature has substantial runtime overhead and is intended to be used as part of some automated testing regime. The runtime maintains a pair of shadow maps. One is used to track the state of memory in the kernel map at bit-granularity: a bit in the kernel map is initialized when the corresponding shadow bit is clear, and is uninitialized otherwise. The second shadow map stores information about the origin of uninitialized regions of the kernel map, simplifying debugging. KMSAN relies on being able to intercept certain functions which cannot be instrumented by the compiler. KMSAN thus implements interceptors which manually update shadow state and in some cases explicitly check for uninitialized bytes. For instance, all calls to copyout() are subject to such checks. The runtime exports several functions which can be used to verify the shadow map for a given buffer. Helpers provide the same functionality for a few structures commonly used for I/O, such as CAM CCBs, BIOs and mbufs. These are handy when debugging a KMSAN report whose proximate and root causes are far away from each other. Obtained from: NetBSD Sponsored by: The FreeBSD Foundation	2021-08-10 21:27:53 -04:00
Mark Johnston	eca9ac5a32	vfs: Avoid a comparison with an uninitialized field in setutimes() Some filesystems, e.g., devfs, do not populate va_birthtime in their GETATTR implementations. To handle this, make sure that va_birthtime is initialized to the quasi-standard value of { VNOVAL, 0 } before calling VOP_GETATTR. Reported by: KMSAN Reviewed by: kib MFC after: 1 week Sponsored by: The FreeBSD Foundation Differential Revision: https://reviews.freebsd.org/D31468	2021-08-09 13:27:20 -04:00
Alexander Motin	696fca3fd4	Optimize res_find(). When the device name is provided, we can simply run strncmp() for each line to quickly skip unrelated ones, that is much faster than sscanf() and only then strcmp(). MFC after: 2 weeks	2021-08-08 21:54:49 -04:00
Ed Maste	9feff969a0	Remove "All Rights Reserved" from FreeBSD Foundation sys/ copyrights These ones were unambiguous cases where the Foundation was the only listed copyright holder (in the associated license block). Sponsored by: The FreeBSD Foundation	2021-08-08 10:42:24 -04:00
Mateusz Guzik	b30e7cb7fa	cache: add OPENREAD and OPENWRITE to fast path lookup	2021-08-07 13:02:38 +02:00
Rick Macklem	c18c74a87c	namei: Add cn_flags bits for OPENREAD and OPENWRITE VOP_LOOKUP() is called with cn_flags bits ISLASTCN and ISOPEN to indicate that the lookup is for the last component of a pathname when doing open. If the cn_flags also indicates if the open is for Reading, Writing or Both, the NFSv4 client can do an NFSv4 Open operation in the same compound RPC as Lookup, often avoiding the additional Open RPC now done when VOP_OPEN() is called. This patch defines two new cn_flags bits called OPENREAD and OPENWRITE and sets these in open2nameif() based on FREAD, FWRITE flag bits. This will allow a subsequent patch to the NFSv4 client to do the Open operation in the same RPC as Lookup. Reviewed by: kib Differential Revision: https://reviews.freebsd.org/D31431	2021-08-06 18:41:11 -07:00
Andrew Gallatin	09066b9866	ktls: Use the new PNOLOCK flag Use the new PNOLOCK flag to tsleep() to indicate that we are managing potential races, and don't need to sleep with a lock, or have a backstop timeout. Reviewed by: jhb Sponsored by: Netflix	2021-08-05 17:19:12 -04:00
Andrew Gallatin	1b97a054f3	tsleep: Add a PNOLOCK flag Add a PNOLOCK flag so that, in the rare circumstance where wakeup races are externally mitigated, tsleep() can be called with a sleep time of 0 without triggering an assertion. Reviewed by: jhb Sponsored by: Netflix	2021-08-05 17:16:30 -04:00
Andrew Gallatin	2694c869ff	ktls: fix a panic with INVARIANTS `98215005b7` introduced a new thread that uses tsleep(..0) to sleep forever. This hit an assert due to sleeping with a 0 timeout. So spell "forever" using SBT_MAX instead, which does not trigger the assert. Pointy hat to: gallatin Pointed out by: emaste Sponsored by: Netflix	2021-08-05 13:09:06 -04:00
Ka Ho Ng	da9fe3529b	Regen after `0dc332bff2`	2021-08-05 23:22:02 +08:00
Ka Ho Ng	0dc332bff2	Add fspacectl(2), vn_deallocate(9) and VOP_DEALLOCATE(9). fspacectl(2) is a system call to provide space management support to userspace applications. VOP_DEALLOCATE(9) is a VOP call to perform the deallocation. vn_deallocate(9) is a public KPI for kmods' use. The purpose of proposing a new system call, a KPI and a VOP call is to allow bhyve or other hypervisor monitors to emulate the behavior of SCSI UNMAP/NVMe DEALLOCATE on a plain file. fspacectl(2) takes cmd and flags parameters to specify the space management operation to be performed. Currently cmd has to be SPACECTL_DEALLOC, and flags has to be 0. fo_fspacectl is added to fileops. VOP_DEALLOCATE(9) is added as a new VOP call. A trivial implementation of VOP_DEALLOCATE(9) is provided. Sponsored by: The FreeBSD Foundation Reviewed by: kib Differential Revision: https://reviews.freebsd.org/D28347	2021-08-05 23:20:42 +08:00
Ka Ho Ng	abbb57d5a6	vfs: Introduce vn_bmap_seekhole_locked() vn_bmap_seekhole_locked() is a factored-out version of vn_bmap_seekhole(). This variant requires a shared vnode lock to be held around the call. Sponsored by: The FreeBSD Foundation Reviewed by: kib, markj Differential Revision: https://reviews.freebsd.org/D31404	2021-08-05 22:52:26 +08:00
Andrew Gallatin	98215005b7	ktls: start a thread to keep the 16k ktls buffer zone populated Ktls recently received an optimization where we allocate 16k physically contiguous crypto destination buffers. This provides a large (more than 5%) reduction in CPU use in our workload. However, after several days of uptime, the performance benefit disappears because we have frequent allocation failures from the ktls buffer zone. It turns out that when load drops off, the ktls buffer zone is trimmed, and some 16k buffers are freed back to the OS. When load picks back up again, re-allocating those 16k buffers fails after some number of days of uptime because physical memory has become fragmented. This causes allocations to fail, because they are intentionally done with M_NORECLAIM, so as to avoid pausing the ktls crypto work thread while the VM system defragments memory. To work around this, this change starts one thread per VM domain to allocate ktls buffers without M_NORECLAIM, as we don't care if this thread is paused while memory is defragged. The thread then frees the buffers back into the ktls buffer zone, thus allowing future allocations to succeed. Note that waking up the thread is intentionally racy, but neither of the races really matter. In the worst case, we could have either spurious wakeups or we could have to wait 1 second until the next rate-limited allocation failure to wake up the thread. This patch has been in use at Netflix on a handful of servers, and seems to fix the issue. Differential Revision: https://reviews.freebsd.org/D31260 Reviewed by: jhb, markj, (jtl, rrs, and dhw reviewed earlier version) Sponsored by: Netflix	2021-08-05 10:19:12 -04:00
John Baldwin	c51e4962a3	Document kern.log_wakeups_per_second. PR: 148680 MFC after: 2 weeks	2021-08-04 11:50:34 -07:00
Konstantin Belousov	0ef5eee9d9	Add vn_lktype_write() and remove repetitive code that calculates vnode locking type for write. Reviewed by: khng, markj Sponsored by: The FreeBSD Foundation MFC after: 1 week Differential revision: https://reviews.freebsd.org/D31405	2021-08-04 19:40:13 +03:00
Kyle Evans	04cc0c393c	malloc(9): provide missing malloc_aligned implementation Pointy hat: kevans Fixes: `6162cf885c` ("malloc(9): Document/complete aligned variants")	2021-08-02 21:12:39 -05:00
Eric van Gyzen	428624130a	Fix lockstat:::thread-spin dtrace probe with LOCK_PROFILING The spinning start time is missing from the calculation due to a misplaced #endif. Return the #endif where it's supposed to be. Submitted by: Alexander Alexeev <aalexeev@isilon.com> Reviewed by: bdrewery, mjg MFC after: 1 week Sponsored by: Dell EMC Isilon Differential Revision: https://reviews.freebsd.org/D31384	2021-08-02 14:44:23 -05:00
Adam Fenn	8ca384eb1d	devclass_alloc_unit: move "at" hint test to after device-in-use test Only perform this expensive operation when the unit number is a potential candidate (i.e. not already in use), thereby reducing device scan time on systems with many devices, unit numbers, and drivers. Sponsored by: NetApp, Inc. Sponsored by: Klara, Inc. X-NetApp-PR: #61 Differential Revision: https://reviews.freebsd.org/D31381	2021-08-02 11:27:17 -05:00
Alexander Motin	ca34553b6f	sched_ule(4): Pre-seed sched_random(). I don't think it changes anything, but why not. While there, make cpu_search_highest() use all 8 lower load bits for noise, since it does not use cs_prefer and the code is not shared with cpu_search_lowest() any more. MFC after: 1 month	2021-08-02 10:55:28 -04:00
Alexander Motin	8bb173fb5b	sched_ule(4): Use trylock when stealing load. On some load patterns it is possible for several CPUs to try to steal a thread from the same CPU despite the randomization introduced. It may cause significant lock contention when an idle thread holding one queue lock tries to acquire another one. Use of trylock on the remote queue both reduces the contention and makes lock ordering easier to handle. If we can't get the lock inside tdq_trysteal() we just return, allowing tdq_idled() to handle it. If it happens in tdq_idled(), then we repeat the search for load, skipping this CPU. On a 2-socket 80-thread Xeon system I am observing a dramatic reduction of the lock spinning time when doing random uncached 4KB reads from 12 ZVOLs, while IOPS increase from 327K to 403K. MFC after: 1 month	2021-08-01 22:42:01 -04:00
Alexander Motin	2668bb2add	sched_ule(4): Reduce duplicate search for load. When sched_highest() called for some CPU group returns nothing, the idle thread calls it for the parent CPU group. But the parent CPU group also includes the CPU group we've just searched, and unless there is a race going on, it is unlikely we find anything new this time. Avoid the double search in the case of a parent group having only two subgroups (the most prominent case). Instead of escalating to the parent group, run the next search over the sibling subgroup and escalate two levels up if that fails too. In the case of more than two siblings the difference is less significant, while searching the parent group can result in a better decision if we find several candidate CPUs. On a 2-socket 40-core Xeon system I am measuring ~25% reduction of CPU time spent inside cpu_search_highest() in both SMT (2x20x2) and non-SMT (2x20) cases. MFC after: 1 month	2021-08-01 22:07:51 -04:00
Mark Johnston	6f179693c5	Add interceptors for atomic operations on userspace memory Implement them for KASAN. KCSAN interceptors are left unimplemented for now. MFC after: 2 weeks Sponsored by: The FreeBSD Foundation	2021-07-29 21:14:36 -04:00
Mark Johnston	a90d053b84	Simplify kernel sanitizer interceptors KASAN and KCSAN implement interceptors for various primitive operations that are not instrumented by the compiler. KMSAN requires them as well. Rather than adding new cases for each sanitizer which requires interceptors, implement the following protocol: - When interceptor definitions are required, define SAN_NEEDS_INTERCEPTORS and SANITIZER_INTERCEPTOR_PREFIX. - In headers that declare functions which need to be intercepted by a sanitizer runtime, use SANITIZER_INTERCEPTOR_PREFIX to provide declarations. - When SAN_RUNTIME is defined, do not redefine the names of intercepted functions. This is typically the case in files which implement sanitizer runtimes but is also needed in, for example, files which define ifunc selectors for intercepted operations. MFC after: 2 weeks Sponsored by: The FreeBSD Foundation	2021-07-29 21:13:32 -04:00
Mark Johnston	9e575fadf4	link_elf_obj: Invoke fini callbacks This is required for KASAN: when a module is unloaded, poisoned regions (e.g., pad areas between global variables) are left as such, so if they are reused as KLDs are loaded, false positives can arise. Reported by: pho, Jenkins Reviewed by: kib MFC after: 2 weeks Sponsored by: The FreeBSD Foundation Differential Revision: https://reviews.freebsd.org/D31339	2021-07-29 09:46:25 -04:00
Dmitry Chagin	9e32efa79b	umtx: Split do_unlock_pi into two counterparts. The umtx_pi_drop() will be used by the Linux emulation layer. Reviewed by: kib Differential Revision: https://reviews.freebsd.org/D31238 MFC after: 2 weeks	2021-07-29 12:47:39 +03:00
Dmitry Chagin	09f55e6002	umtx: Expose some of the pi umtx structures and API to the rest of the kernel. Differential Revision: https://reviews.freebsd.org/D31237 MFC after: 2 weeks	2021-07-29 12:46:58 +03:00
Dmitry Chagin	8e4d22c01d	umtx: Add umtxq_requeue Linux emulation layer extension. Reviewed by: kib Differential Revision: https://reviews.freebsd.org/D31235 MFC after: 2 weeks	2021-07-29 12:43:07 +03:00
Dmitry Chagin	7caa29115b	umtx: Add bitset conditional wakeup functionality. The bitset is a Linux emulation layer extension. This 32-bit mask, in which at least one bit must be set, is used to select which threads should be woken up. The bitset is stored in the umtx_q structure, which is used to enqueue the waiter into the umtx waitqueue. Put the bitset into the hole, that appeared on LP64 due to data alignment, to prevent the growth of the struct umtx_q. Reviewed by: kib Differential Revision: https://reviews.freebsd.org/D31234 MFC after: 2 weeks	2021-07-29 12:42:49 +03:00
Dmitry Chagin	1fdcc87cfd	umtx: Expose some of the umtx structures and API to the rest of the kernel. Differential Revision: https://reviews.freebsd.org/D31233 MFC after: 2 weeks	2021-07-29 12:42:17 +03:00
Dmitry Chagin	307a3dd35c	umtx: Expose struct abs_timeout to the rest of the kernel. Add the umtx_ prefix to the whole abs_timeout facility and add declarations for it. For consistency with the other abs_timeout functions, mark abs_timeout_init2 inline. Reviewed by: kib Differential Revision: https://reviews.freebsd.org/D31249 MFC after: 2 weeks	2021-07-29 12:41:58 +03:00
Dmitry Chagin	af29f39958	umtx: Split umtx.h into two counterparts. To prevent umtx.h from being polluted by future changes, split it into two headers: umtx.h - ABI header for userspace; umtxvar.h - the kernel stuff. While here, fix umtx_key_match style. Reviewed by: kib Differential Revision: https://reviews.freebsd.org/D31248 MFC after: 2 weeks	2021-07-29 12:41:29 +03:00
Kyle Evans	e3707726c1	kern: remove deprecated makesyscalls.sh makesyscalls was rewritten in Lua and introduced in `d3276301ab`. In the time since, no objections have arisen and a warning was introduced long ago on invocation of makesyscalls.sh that it would be removed before FreeBSD 13. Belatedly follow through on that.	2021-07-28 22:22:23 -05:00
Alexander Motin	aefe0a8c32	Refactor/optimize cpu_search_(). Remove cpu_search_both(), unused for many years. Without it there is less sense in the trick of compiling the common cpu_search() into separate cpu_search_lowest() and cpu_search_highest(), so split them completely, making the code more readable. While there, split iteration over children groups and CPUs, complicating code for very small deduplication. Stop passing cpuset_t arguments by value and avoid some manipulations. Since the MAXCPU bump from 64 to 256, what was a single register turned into a 32-byte memory array, requiring memory allocation and accesses. Splitting struct cpu_search into parameter and result parts allows reducing stack usage even further, since the first can be passed through on recursion. Remove CPU_FFS() from the hot paths, precalculating the first and last CPU for each CPU group in advance during initialization. Again, it was not a problem for 64 CPUs before, but for 256 FFS needs much more code. With these changes on an 80-thread system doing ~260K uncached ZFS reads per second I observe ~30% reduction of time spent in cpu_search_(). MFC after: 1 month	2021-07-28 22:00:29 -04:00
Warner Losh	824897a3ae	genoffset: simplify and rewrite in sh genoffset used the fully generic ASSYM macro to generate the offsets needed for the thread_lite structure. However, since these are offsets into a structure, they will always be necessarily small and positive. As such, just create a simple character array of the right size and use a naming convention such that we can recover the field name, structure name and type. Use nm -t d and sort -n to sort these into order, then loop over the results to generate the thread_lite structure. MFC After: 2 weeks Reviewed by: kib, markj (earlier versions) Sponsored by: Netflix Differential Revision: https://reviews.freebsd.org/D31203	2021-07-28 13:50:09 -06:00
Warner Losh	46dd3ef033	genassym.sh: Fix two minor issues found by shellcheck o Remove redundant $ in $(( )) expression. o Quote arg passed to work so paths with spaces, etc will work. MFC After: 2 weeks Reviewed by: kib Sponsored by: Netflix Differential Revision: https://reviews.freebsd.org/D31335	2021-07-28 13:49:16 -06:00
Roy Marples	7045b1603b	socket: Implement SO_RERROR SO_RERROR indicates that receive buffer overflows should be handled as errors. Historically receive buffer overflows have been ignored and programs could not tell if they missed messages or messages had been truncated because of overflows. Since programs historically do not expect to get receive overflow errors, this behavior is not the default. This is really really important for programs that use route(4) to keep in sync with the system. If we lose a message then we need to reload the full system state, otherwise the behaviour from that point is undefined and can lead to chasing bogus bug reports. Reviewed by: philip (network), kbowling (transport), gbe (manpages) MFC after: 2 weeks Differential Revision: https://reviews.freebsd.org/D26652	2021-07-28 09:35:09 -07:00
Konstantin Belousov	273728b125	Regen	2021-07-28 13:21:22 +03:00
Konstantin Belousov	9b6b793bd7	Revert most of `ce42e79310` to restore ABI compatibility for pre-10.x binaries. It restores _umtx_lock() and _umtx_unlock() syscalls, and UMTX_OP_LOCK/ UMTX_OP_UNLOCK umtx_op(2) operations. UMUTEX_ERROR_CHECK flag is left out for now, I do not think it makes a difference. PR: 218571 Reviewed by: brooks (previous version) Sponsored by: The FreeBSD Foundation MFC after: 1 week Differential revision: https://reviews.freebsd.org/D31220	2021-07-28 13:21:12 +03:00
John Baldwin	be79f30d6c	m_dup: Handle unmapped mbufs as an input mbuf. Use m_copydata() instead of a direct bcopy() when copying data out of a source mbuf into a newly-allocated mbuf. PR: 256610 Reported by: Niels Bakker <niels=freebsd@bakker.net> Reviewed by: markj MFC after: 2 weeks	2021-07-26 14:09:16 -07:00
Jason A. Harmening	2bc16e8aaf	VFS: remove MNTK_MARKER We no longer allow upper filesystems to be unregistered from the base mount while vfs_notify_upper() or any other upper operation is pending. New upper mounts can still be registered during this period, but they will be added at the end of the upper mount tailq. We therefore no longer need to allocate marker nodes during vfs_notify_upper() to keep our place in the iteration. Reviewed by: kib, mckusick Tested by: pho Differential Revision: https://reviews.freebsd.org/D31016	2021-07-24 12:52:32 -07:00
Jason A. Harmening	c746ed724d	Allow stacked filesystems to be recursively unmounted In certain emergency cases such as media failure or removal, UFS will initiate a forced unmount in order to prevent dirty buffers from accumulating against the no-longer-usable filesystem. The presence of a stacked filesystem such as nullfs or unionfs above the UFS mount will prevent this forced unmount from succeeding. This change addresses the situation by allowing stacked filesystems to be recursively unmounted on a taskqueue thread when the MNT_RECURSE flag is specified to dounmount(). This call will block until all upper mounts have been removed unless the caller specifies the MNT_DEFERRED flag to indicate the base filesystem should also be unmounted from the taskqueue. To achieve this, the recently-added vfs_pin_from_vp()/vfs_unpin() KPIs have been combined with the existing 'mnt_uppers' list used by nullfs and renamed to vfs_register_upper_from_vp()/vfs_unregister_upper(). The format of the mnt_uppers list has also been changed to accommodate filesystems such as unionfs in which a given mount may be stacked atop more than one lower mount. Additionally, management of lower FS reclaim/unlink notifications has been split into a separate list managed by a separate set of KPIs, as registration of an upper FS no longer implies interest in these notifications. Reviewed by: kib, mckusick Tested by: pho Differential Revision: https://reviews.freebsd.org/D31016	2021-07-24 12:52:00 -07:00
Warner Losh	6475667f7b	devctl: don't publish the mount options Mount options aren't solely ASCII strings. In addition, experience to date suggests that the mount options are much less useful than was originally supposed and the mount flags suffice to make decisions. Drop the reporting of options for the mount/remount/unmount events. Reviewed by: markj Reported by: KASAN Sponsored by: Netflix Differential Revision: https://reviews.freebsd.org/D31287	2021-07-24 09:03:53 -06:00
Mark Johnston	ebf9886654	imgact_elf: Avoid redefining suword() Otherwise this interferes with the definition for sanitizer interceptors. MFC after: 1 week Sponsored by: The FreeBSD Foundation	2021-07-23 15:40:54 -04:00
Mark Johnston	048cd371f3	vfs: Initialize "lastfail" in vfs_mountroot_wait() This variable is only used to rate-limit "Root mount waiting for: ..." messages using ppsratecheck(). Reported by: KMSAN MFC after: 1 week Sponsored by: The FreeBSD Foundation	2021-07-23 12:04:02 -04:00
Mark Johnston	ea3fbe0707	KASAN: Disable checking before triggering a panic KASAN hooks will not generate reports if panicstr != NULL, but then there is a window after the initial panic() call where another report may be raised. This can happen if a false positive occurs; to simplify debugging of such problems, avoid recursing. Sponsored by: The FreeBSD Foundation	2021-07-23 10:47:14 -04:00
Mark Johnston	0dcef81de9	Add required sysctl name length checks to various handlers Reported by: KMSAN MFC after: 1 week Sponsored by: The FreeBSD Foundation	2021-07-23 10:47:13 -04:00
Mark Johnston	cae3f9dd01	select: Define select_flags[] as const MFC after: 1 week Sponsored by: The FreeBSD Foundation	2021-07-23 10:47:13 -04:00
Mark Johnston	90959dd1e5	acct: Zero pad bytes in accounting records Reported by: KMSAN MFC after: 1 week Sponsored by: The FreeBSD Foundation	2021-07-23 10:29:57 -04:00
Mark Johnston	5c18bf9d5f	ktrace: Zero request structures when populating the pool Otherwise uninitialized pad bytes may be copied into the ktrace log file. Reported by: KMSAN MFC after: 1 week Sponsored by: The FreeBSD Foundation	2021-07-23 10:29:53 -04:00
Alan Somers	6c95065590	Escape any '.' characters in sysctl node names ZFS creates some sysctl nodes that include a pool name, and '.' is an allowed character in pool names. But it's the separator in the sysctl tree, so it can't be included in a sysctl name. Replace it with "%25". Handily, "%" is illegal in ZFS pool names, so there's no ambiguity there. PR: 257316 MFC after: 3 weeks Sponsored by: Axcient Reviewed by: freqlabs Differential Revision: https://reviews.freebsd.org/D31265	2021-07-22 10:22:48 -06:00
Kyle Evans	23ecfa9d5b	kern: mountroot: avoid fd leak in .md parsing parse_dir_md() opens /dev/mdctl but only closes the resulting fd on success, not upon failure of the ioctl or when we exceed the md unit max. Reviewed by: kib (slightly previous version) Sponsored by: NetApp, Inc. Sponsored by: Klara, Inc. X-NetApp-PR: #62 Differential Revision: https://reviews.freebsd.org/D31229	2021-07-21 10:18:09 -05:00
Edward Tomasz Napierala	a40cf4175c	Implement unprivileged chroot This builds on recently introduced NO_NEW_PRIVS flag to implement unprivileged chroot, enabled by `security.bsd.unprivileged_chroot`. It allows non-root processes to chroot(2), provided they have the NO_NEW_PRIVS flag set. The chroot(8) utility gets a new flag, -n, which sets NO_NEW_PRIVS before chrooting. Reviewed By: kib Sponsored By: EPSRC Relnotes: yes Differential Revision: https://reviews.freebsd.org/D30130	2021-07-20 08:57:53 +00:00
Dmitry Chagin	1ca6b15bbd	Drop "All rights reserved" from my copyright statements. Add email and fixup years while here. Reviewed by: imp Differential Revision: https://reviews.freebsd.org/D30912 MFC after: 2 weeks	2021-07-20 10:05:50 +03:00
Dmitry Chagin	5fd9cd53d2	linux(4): Modify sv_onexec hook to return an error. Temporarily add stubs to the Linux emulation layer which call the existing hook. Reviewed by: kib Differential Revision: https://reviews.freebsd.org/D30911 MFC after: 2 weeks	2021-07-20 09:56:25 +03:00
Dmitry Chagin	62ba4cd340	Call sv_onexec hook after the process VA is created. For future use in the Linux emulation layer call sv_onexec hook right after the new process address space is created. It's safe, as sv_onexec used only by Linux abi and linux_on_exec() does not depend on a state of process VA. Reviewed by: kib Differential revision: https://reviews.freebsd.org/D30899 MFC after: 2 weeks	2021-07-20 09:55:14 +03:00
Dmitry Chagin	b39fa4770d	Remove bogus cast from exec_sysvec_init(). Reviewed by: kib Differential Revision: https://reviews.freebsd.org/D30910 MFC after: 2 weeks	2021-07-20 09:54:09 +03:00
Dmitry Chagin	21629e2a45	Modify exec_sysvec_init() to allow non-native abi to setup their sysentvecs. For future use in the Linux emulation layer modify the exec_sysvec_init() to allow non-native abi to fill sv_timekeep_base and sv_shared_page_obj. Reviewed by: kib Differential revision: https://reviews.freebsd.org/D30898 MFC after: 2 weeks	2021-07-20 09:53:21 +03:00
Kyle Evans	db0f264393	kenv: allow listing of static kernel environments The early environment is typically cleared, so these new options need the PRESERVE_EARLY_KENV kernel config(8) option. These environments are reported as missing by kenv(1) if the option is not present in the running kernel. Reviewed by: imp Differential Revision: https://reviews.freebsd.org/D30835	2021-07-18 23:06:19 -05:00
Kyle Evans	7a129c973b	kern: add an option for preserving the early kenv Some downstream configurations do not store secrets in the early (loader/static) environments and desire a way to preserve these for diagnostic reasons. Provide an option to do so. Reviewed by: imp, jhb (earlier version) Differential Revision: https://reviews.freebsd.org/D30834	2021-07-18 23:05:48 -05:00
David Chisnall	cf98bc28d3	Pass the syscall number to capsicum permission-denied signals The syscall number is stored in the same register as the syscall return on amd64 (and possibly other architectures) and so it is impossible to recover in the signal handler after the call has returned. This small tweak delivers it in the `si_value` field of the signal, which is sufficient to catch capability violations and emulate them with a call to a more-privileged process in the signal handler. This reapplies `3a522ba1bc` with a fix for the static assertion failure on i386. Approved by: markj (mentor) Reviewed by: kib, bcr (manpages) Differential Revision: https://reviews.freebsd.org/D29185	2021-07-16 18:06:44 +01:00
Mark Johnston	c1aff72cfa	callout: Make cc_cpu local to kern_timeout.c No functional change intended. MFC after: 1 week Sponsored by: The FreeBSD Foundation	2021-07-15 22:41:10 -04:00
Mark Johnston	2e5f615295	lio_listio: Don't post a completion notification if none was requested One is allowed to use LIO_NOWAIT without specifying a sigevent. In this case, lj->lioj_signal is left uninitialized, but several code paths examine liov_signal.sigev_notify to figure out which notification to post. Unconditionally initialize that field to SIGEV_NONE. Add a dumb test case which triggers the bug. Reported by: KMSAN+syzkaller Reviewed by: asomers MFC after: 2 weeks Sponsored by: The FreeBSD Foundation Differential Revision: https://reviews.freebsd.org/D31197	2021-07-15 22:41:10 -04:00
Konstantin Belousov	0bdb2cbf9d	procctl(PROC_ASLR_STATUS): fix vmspace leak Reported by: jhb Sponsored by: The FreeBSD Foundation MFC after: 3 days	2021-07-15 03:02:50 +03:00
Mark Johnston	2783335cae	blist: Correct the node count computed in blist_create() Commit `bb4a27f927` added the ability to allocate a span of blocks crossing a meta node boundary. To ensure that blst_next_leaf_alloc() does not walk past the end of the tree, an extra all-zero meta node needs to be present at the end of the allocation, and blst_next_leaf_alloc() is implemented such that the presence of this node terminates the search. blist_create() computes the number of nodes required. It had two problems: 1. When the size of the blist is a power of BLIST_RADIX, we would unnecessarily allocate an extra level in the tree. 2. When the size of the blist is a multiple of BLIST_RADIX, we would fail to allocate a terminator node. In this case, blst_next_leaf_alloc() could scan beyond the bounds of the allocation. This was found using KASAN. Modify blist_create() to handle these cases correctly. Reported by: pho Reviewed by: dougm MFC after: 2 weeks Differential Revision: https://reviews.freebsd.org/D31158	2021-07-13 17:47:27 -04:00
Mark Johnston	45e2357113	malloc: Pass the allocation size to malloc_large() by value Its callers do not make use of the modified size that malloc_large() was returning, so there's no need to pass a pointer. No functional change intended. MFC after: 2 weeks Sponsored by: The FreeBSD Foundation	2021-07-13 17:47:02 -04:00
Mateusz Guzik	844aa31c6d	cache: add cache_enter_time_flags	2021-07-12 07:03:14 +02:00
David Chisnall	d2b558281a	Revert "Pass the syscall number to capsicum permission-denied signals" This broke the i386 build. This reverts commit `3a522ba1bc`.	2021-07-10 20:26:01 +01:00
David Chisnall	3a522ba1bc	Pass the syscall number to capsicum permission-denied signals The syscall number is stored in the same register as the syscall return on amd64 (and possibly other architectures) and so it is impossible to recover in the signal handler after the call has returned. This small tweak delivers it in the `si_value` field of the signal, which is sufficient to catch capability violations and emulate them with a call to a more-privileged process in the signal handler. Approved by: markj (mentor) Reviewed by: kib, bcr (manpages) Differential Revision: https://reviews.freebsd.org/D29185	2021-07-10 17:19:52 +01:00
Alexander Motin	63ca9ea4f3	Use sleepq_signal(SLEEPQ_DROP) in cv_signal(). Same as wakeup_one()/wakeup_any() commit before it reduces the lock hold time and so contention. MFC after: 1 week	2021-07-09 20:57:58 -04:00
Mark Johnston	588c7a06df	KASAN: Implement __asan_unregister_globals() It will be called during KLD unload to unpoison the redzones following global variables. Otherwise, virtual address ranges previously used for a KLD may be left tainted, triggering false positives when they are recycled. Reported by: pho Sponsored by: The FreeBSD Foundation	2021-07-09 20:38:50 -04:00
Michal Meloun	e88c3b1b02	intrng: remove now redundant shadow variable. Should not be a functional change. Submitted by: ehem_freebsd@m5p.com Discussed in: https://reviews.freebsd.org/D29310 MFC after: 4 weeks	2021-07-08 08:46:41 +02:00
Michal Meloun	a49f208d94	intrng: Releasing interrupt source should clear interrupt table full state. The first release of an interrupt in a situation where the interrupt table is full should schedule a full table check the next time an interrupt is allocated. A full check is necessary to ensure maximum separation between the order of allocation and the order of release. Submitted by: ehem_freebsd@m5p.com (initial version) Discussed in: https://reviews.freebsd.org/D29310 MFC after: 4 weeks	2021-07-08 08:16:46 +02:00
Andrew Gallatin	4150a5a87e	ktls: fix NOINET build Reported by: mjguzik Sponsored by: Netflix	2021-07-07 10:40:02 -04:00
Randall Stewart	d7955cc0ff	tcp: HPTS performance enhancements HPTS drives both rack and bbr, and yet there have been many complaints about performance. This bit of work restructures hpts to help reduce CPU overhead. Instead of relying on the timer/callout to drive it, hpts is now driven by the user return from a system call as well as by lro flushes. The timer becomes a backstop that dynamically adjusts based on how "late" we are. Reviewed by: tuexen, glebius Sponsored by: Netflix Inc. Differential Revision: https://reviews.freebsd.org/D31083	2021-07-07 07:22:35 -04:00
Konstantin Belousov	28a66fc3da	Do not call FreeBSD-ABI specific code for all ABIs Use sysentvec hooks to only call umtx_thread_exit/umtx_exec, which handle robust mutexes, for native FreeBSD ABI. Similarly, there is no sense in calling sigfastblock_clear() for non-native ABIs. Requested by: dchagin Reviewed by: dchagin, markj (previous version) Sponsored by: The FreeBSD Foundation MFC after: 2 weeks Differential revision: https://reviews.freebsd.org/D30987	2021-07-07 14:12:07 +03:00
Konstantin Belousov	55976ce11a	Move sv_onexit() sysentvec hook slightly later after itimers are stopped. This makes it more usable for e.g. native FreeBSD ABI sysentvecs. Reviewed by: dchagin, markj Sponsored by: The FreeBSD Foundation MFC after: 2 weeks Differential revision: https://reviews.freebsd.org/D30987	2021-07-07 14:12:07 +03:00
Konstantin Belousov	71ab344524	Add sv_onexec_old() sysent hook for exec event Unlike sv_onexec(), it is called from the old (pre-exec) sysentvec structure. The old vmspace for the process is still intact during the call. Reviewed by: dchagin, markj Sponsored by: The FreeBSD Foundation MFC after: 2 weeks Differential revision: https://reviews.freebsd.org/D30987	2021-07-07 14:12:07 +03:00
Mateusz Guzik	c2c34ee540	mbuf: add m_get_raw and m_gethdr_raw The intent is to eliminate the MT_NOINIT flag and consequently a branch from the constructor. Reviewed by: gallatin Sponsored by: Rubicon Communications, LLC ("Netgate") Differential Revision: https://reviews.freebsd.org/D31080	2021-07-07 11:05:46 +00:00
Mateusz Guzik	0a718a6e6e	mbuf: replace all direct uma_zfree(zone_mbuf) calls with m_free_raw Reviewed by: donner Sponsored by: Rubicon Communications, LLC ("Netgate") Differential Revision: https://reviews.freebsd.org/D31082	2021-07-07 11:05:46 +00:00
Andrew Gallatin	28d0a740dd	ktls: auto-disable ifnet (inline hw) kTLS Ifnet (inline) hw kTLS NICs typically keep state within a TLS record, so that when transmitting in-order, they can continue encryption on each segment sent without DMA'ing extra state from the host. This breaks down when transmits are out of order (eg, TCP retransmits). In this case, the NIC must re-DMA the entire TLS record up to and including the segment being retransmitted. This means that when re-transmitting the last 1448 byte segment of a TLS record, the NIC will have to re-DMA the entire 16KB TLS record. This can lead to the NIC running out of PCIe bus bandwidth well before it saturates the network link if a lot of TCP connections have a high retransmit rate. This change introduces a new sysctl (kern.ipc.tls.ifnet_max_rexmit_pct), where TCP connections with higher retransmit rate will be switched to SW kTLS so as to conserve PCIe bandwidth. Reviewed by: hselasky, markj, rrs Sponsored by: Netflix Differential Revision: https://reviews.freebsd.org/D30908	2021-07-06 10:28:32 -04:00
Jessica Clarke	55c57a7811	rman: Remove an outdated comment that no longer applies Since commit `2dd1bdf183` in 2016 the r_start and r_end fields have been rman_res_t, which was briefly unsigned long, but commit `da1b038af9` changed the typedef to be uintmax_t instead. C99 is also something we assume these days. Reviewed by: imp Differential Revision: https://reviews.freebsd.org/D30808	2021-07-05 16:15:03 +01:00
Mateusz Guzik	904a08f342	ktls: switch bare zone_mbuf use to m_free_raw Reviewed by: gallatin Sponsored by: Rubicon Communications, LLC ("Netgate") Differential Revision: https://reviews.freebsd.org/D30955	2021-07-02 08:30:22 +00:00
Mateusz Guzik	05462babd4	mbuf: add m_free_raw to be used instead of directly calling uma_zfree The intent is to remove all direct zone_mbuf consumers so that ctor/dtor from that zone can be reimplemented as wrappers around uma, avoiding an indirect function call. Reviewed by: kbowling Discussed with: gallatin Sponsored by: Rubicon Communications, LLC ("Netgate") Differential Revision: https://reviews.freebsd.org/D30959	2021-07-02 08:30:22 +00:00
Mateusz Guzik	fb32c8dbeb	iflib: retire MB_DTOR_SKIP The flag was added in 2016 but remains unused. Reviewed by: kbowling Sponsored by: Rubicon Communications, LLC ("Netgate") Differential Revision: https://reviews.freebsd.org/D30958	2021-07-02 08:30:22 +00:00
Edward Tomasz Napierala	db8d680ebe	procctl(2): add PROC_NO_NEW_PRIVS_CTL, PROC_NO_NEW_PRIVS_STATUS This introduces a new, per-process flag, "NO_NEW_PRIVS", which is inherited, preserved on exec, and cannot be cleared. The flag, when set, makes subsequent execs ignore any SUID and SGID bits, instead executing those binaries as if they were not set. The main purpose of the flag is implementation of Linux PROC_SET_NO_NEW_PRIVS prctl(2), and possibly also unprivileged chroot. Reviewed By: kib Sponsored By: EPSRC Differential Revision: https://reviews.freebsd.org/D30939	2021-07-01 09:42:07 +01:00
Dmitry Chagin	5d9f790191	Eliminate p_elf_machine from struct proc. Instead of p_elf_machine use machine member of the Elf_Brandinfo which is now cached in the struct proc at p_elf_brandinfo member. Note to MFC: D30918, KBI Reviewed by: kib, markj Differential Revision: https://reviews.freebsd.org/D30926 MFC after: 2 weeks	2021-06-29 20:18:29 +03:00
Dmitry Chagin	615f22b2fb	Add a link to the Elf_Brandinfo into the struct proc. To allow the ABI to make a decision based on the Brandinfo add a link to the Elf_Brandinfo into the struct proc. Add a note that the high 8 bits of Elf_Brandinfo flags are private to the ABI. Note to MFC: it breaks KBI. Reviewed by: kib, markj Differential Revision: https://reviews.freebsd.org/D30918 MFC after: 2 weeks	2021-06-29 20:15:08 +03:00
Edward Tomasz Napierala	435754a59e	Add infrastructure required for Linux coredump support This adds `sv_elf_core_osabi`, `sv_elf_core_abi_vendor`, and `sv_elf_core_prepare_notes` fields to `struct sysentvec`, and modifies imgact_elf.c to make use of them instead of hardcoding FreeBSD-specific values. It also updates all of the ABI definitions to preserve current behaviour. This makes it possible to implement non-native ELF coredump support without unnecessary code duplication. It will be used for Linux coredumps. Reviewed By: kib Sponsored By: EPSRC Differential Revision: https://reviews.freebsd.org/D30921	2021-06-29 08:49:12 +01:00
Edward Tomasz Napierala	61b4c62718	imgact_elf.c: style, remove unnecessary casts Remove unnecessary type casts and redundant brackets. No functional changes. Suggested By: kib Reviewed By: kib Sponsored By: EPSRC Differential Revision: https://reviews.freebsd.org/D30841	2021-06-27 17:05:59 +01:00
Alexander Motin	6df35af4d8	Allow sleepq_signal() to drop the lock. Introduce SLEEPQ_DROP sleepq_signal() flag, allowing one to drop the sleep queue chain lock before returning. The reduced lock scope significantly reduces lock contention inside taskqueue_enqueue() for ZFS worker threads doing ~350K disk reads/s on a 40-thread system. MFC after: 2 weeks Sponsored by: iXsystems, Inc.	2021-06-25 14:12:21 -04:00
Konstantin Belousov	802cf4ab0e	namei: add NDPREINIT() macro Its intent is to do the initialization of the future part of struct nameidata which should be used across several namei() and VOPs. Right now it is NOP. Reviewed by: mckusick Discussed with: markj Tested by: pho Sponsored by: The FreeBSD Foundation MFC after: 2 weeks Differential revision: https://reviews.freebsd.org/D30041	2021-06-23 23:46:15 +03:00
Warner Losh	ddfc9c4c59	newbus: Move from bus_child_{pnpinfo,location}_src to bus_child_{pnpinfo,location} with sbuf The upper layers now all go through a layer to tie into these information functions that translates an sbuf into char * and len. The current interface suffers issues of what to do in cases of truncation, etc. Instead, migrate all these functions to using struct sbuf and these issues go away. The caller is also in charge of any memory allocation and/or expansion that's needed during this process. Create a bus_generic_child_{pnpinfo,location} and make it default. It just returns success. This is for those busses that have no information for these items. Migrate the now-empty routines to using this as appropriate. Document these new interfaces with man pages, an oversight from before. Reviewed by: jhb, bcr Sponsored by: Netflix Differential Revision: https://reviews.freebsd.org/D29937	2021-06-22 20:52:06 -06:00
Edward Tomasz Napierala	06250515cf	imgact_elf: compute auxv buffer size instead of using magic value The new buffer is somewhat larger, but there should be no functional changes. Reviewed By: kib, imp Sponsored By: EPSRC Differential Revision: https://reviews.freebsd.org/D30821	2021-06-21 17:07:07 +01:00
Colin Percival	fe51b5a76d	kern_tslog: Include tslog data from loader The i386 loader (and hopefully others to come) now passes tslog data as a "preloaded module". Include this in the data returned by the debug.tslog sysctl. Reviewed by: kevans	2021-06-20 20:09:47 -07:00
Warner Losh	0a99422970	Move mips and arm to 1000Hz by default. armv6 and armv7 systems already were 1000Hz. The other armv5 were a mix of 100 and 1000. This changes them to 1000. Should there be issues, we can add options HZ=100 to the systems that have bad performance at the drop of a hat. mips is a lot more complicated. But most of the systems are already 1000HZ. The hardware exceptions are all fast enough to run at 1000Hz. MALTA is our primary emulator, and history has shown emulators tend to like 100Hz better, so run those systems at 100Hz. As with arm, any system that shows a huge performance regression can be reverted to 100Hz easily. This was going to be committed well in advance of the 13 branch, but it was delayed and forgotten til now. Discussed on: #bsdmips ages ago Sponsored by: Netflix	2021-06-16 20:00:14 -06:00
John Baldwin	faf0224ff2	ktls: Don't mark existing received mbufs notready for TOE TLS. The TOE driver might receive decrypted TLS records that are enqueued to the socket buffer after ktls_try_toe() returns and before ktls_enable_rx() locks the receive buffer to call sb_mark_notready(). In that case, sb_mark_notready() would incorrectly treat the decrypted TLS record as an encrypted record and schedule it for decryption. This always resulted in the connection being dropped as the data in the control message did not look like a valid TLS header. To fix, don't try to handle software decryption of existing buffers in the socket buffer for TOE TLS in ktls_enable_rx(). If a TOE TLS driver needs to decrypt existing data in the socket buffer, the driver will need to manage that in its tod_alloc_tls_session method. Sponsored by: Chelsio Communications	2021-06-15 17:45:21 -07:00
Konstantin Belousov	a12e901a5a	Add a knob to disable dequeueing SIGCHLD on waiting for live process It seems that Linux does not dequeue siginfo for SIGCHLD when wait*(2) reports status of the running process. In particular, sigwaitinfo(2) and other signal querying syscalls can observe the siginfo after wait. FreeBSD dequeued siginfo from the beginning, so we cannot change the default ABI to be more compatible. Still, add a knob to enable to change to the other behavior for debugging purposes. Reported by: dchagin Reviewed by: dchagin, markj Sponsored by: The FreeBSD Foundation MFC after: 1 week Differential revision: https://reviews.freebsd.org/D30675	2021-06-16 02:00:19 +03:00
Konstantin Belousov	bc38762474	Add a knob to not drop signal with default ignored or ignored actions Traditionally, BSD drops signals with the default action during send, not even putting them to the destination process queue. This semantic is not shared with other operating systems (Linux), which do queue such signals. In particular, sigtimedwait(2) and related syscalls can observe the delivery. Add a global knob kern.sig_discard_ign which can be set to false to force enqueuing of the signals with default action. Also add an ABI flag to indicate that signals should be queued. Note that it is not practical to run with the knob turned on, because almost all software that care about the delivery of such signals, is aware of the difference, and misbehaves if the signals are actually queued. The purpose of the knob as is is to allow for easier diagnostic of the programs that need the adjustments, to confirm the cause of problem. Reported by: dchagin Reviewed by: dchagin, markj Sponsored by: The FreeBSD Foundation MFC after: 1 week Differential revision: https://reviews.freebsd.org/D30675	2021-06-16 02:00:19 +03:00
Konstantin Belousov	acced8b043	sigwait: add comment explaining EINTR/ERESTART details Reviewed by: dchagin, markj Sponsored by: The FreeBSD Foundation MFC after: 1 week Differential revision: https://reviews.freebsd.org/D30675	2021-06-16 02:00:19 +03:00
Konstantin Belousov	afb36e289c	sigwait(2) and sigtimedwait(2) must not be restarted. Reported by: dchagin Reviewed by: dchagin, markj Sponsored by: The FreeBSD Foundation MFC after: 1 week Differential revision: https://reviews.freebsd.org/D30675	2021-06-16 02:00:18 +03:00
Mark Johnston	a100217489	Consistently use the SOCKBUF_MTX() and SOCK_MTX() macros This makes it easier to change the socket locking protocols. No functional change intended. MFC after: 1 week Sponsored by: The FreeBSD Foundation	2021-06-14 17:32:32 -04:00
Mark Johnston	f4bb1869dd	Consistently use the SOLISTENING() macro Some code was using it already, but in many places we were testing SO_ACCEPTCONN directly. As a small step towards fixing some bugs involving synchronization with listen(2), make the kernel consistently use SOLISTENING(). No functional change intended. MFC after: 1 week Sponsored by: The FreeBSD Foundation	2021-06-14 17:32:27 -04:00
Andrew Gallatin	ed5e13cfc2	ktls: Fix interaction with RATELIMIT uipc_ktls.c was missing opt_ratelimit.h, so it was never noticing that RATELIMIT was enabled. Once it was enabled, it failed to compile as ktls_modify_txrtlmt() had accrued a compilation error when it was not being compiled in. Sponsored by: Netflix	2021-06-14 10:51:16 -04:00
Dmitry Chagin	e884512ad1	Split kern_poll() into two counterparts. The kern_poll_kfds() operates on clear kernel data, kfds points to an array in the kernel, while kern_poll() operates on user supplied pollfd. Move nfds check to kern_poll_maxfds(). No functional changes, it's for future use in the Linux emulation layer. Reviewed by: kib Differential Revision: https://reviews.freebsd.org/D30690 MFC after: 2 weeks	2021-06-10 15:11:25 +03:00
Dmitry Chagin	f570a6723e	Fix copyright, remove "all rights reserved". The eventfd code was written by me, rdivacky@ copyright is applicable only to the epoll part of the Linuxulator code. Roman is ok to retire his copyright from sys/kern/sys_eventfd.c and 'All rights reserved.' lines from sys/compat/linux/linux_event.[c|h] and sys/kern/sys_eventfd.c files. Reviewed by: kib, emaste Approved by: rdivacky Differential Revision: https://reviews.freebsd.org/D30677 MFC after: 2 weeks	2021-06-08 08:18:00 +03:00
Mark Johnston	887c753c9f	Fix handling of D_GIANTOK It was meant to suppress only the printf(), not the subsequent injection of Giant-protected thunks for various file operations. Fixes: `fbeb4ccac9` Reported by: pho Tested by: pho MFC after: 6 days Pointy hat: markj	2021-06-07 16:45:50 -04:00
Mark Johnston	fbeb4ccac9	Suppress D_NEEDGIANT warnings for some drivers During boot we warn that the kbd and openfirm drivers are Giant-locked and may be deleted. Generally, the warning helps signal that certain old drivers are not being maintained and are subject to removal, but this doesn't really apply to certain drivers which are harder to detangle from Giant. Add a flag, D_GIANTOK, that devices can specify to suppress the misleading warning. Use it in the kbd and openfirm drivers. Reviewed by: imp, jhb MFC after: 1 week Sponsored by: The FreeBSD Foundation Differential Revision: https://reviews.freebsd.org/D30649	2021-06-06 16:44:46 -04:00
Konstantin Belousov	2d423f7671	sysent: allow ABI to disable setid on exec. Reviewed by: dchagin Tested by: trasz MFC after: 1 week Sponsored by: The FreeBSD Foundation Differential Revision: https://reviews.freebsd.org/D28154	2021-06-06 21:42:52 +03:00
Konstantin Belousov	19e6043a44	kern_exec.c: Add execve_nosetid() helper Reviewed by: dchagin Tested by: trasz MFC after: 1 week Sponsored by: The FreeBSD Foundation Differential Revision: https://reviews.freebsd.org/D28154	2021-06-06 21:42:41 +03:00
Jason A. Harmening	59409cb90f	Add a generic mechanism for preventing forced unmount This is aimed at preventing stacked filesystems like nullfs and unionfs from "losing" their lower mounts due to forced unmount. Otherwise, VFS operations that are passed through to the lower filesystem(s) may crash or otherwise cause unpredictable behavior. Introduce two new functions: vfs_pin_from_vp() and vfs_unpin(). which are intended to be called on the lower mount(s) when the stacked filesystem is mounted and unmounted, respectively. Much as registration in the mnt_uppers list previously did, pinning will prevent even forced unmount of the lower FS and will allow the stacked FS to freely operate on the lower mount either by direct use of the struct mount* or indirect use through a properly-referenced vnode's v_mount field. vfs_pin_from_vp() is modeled after vfs_ref_from_vp() in that it uses the mount interlock coupled with re-checking vp->v_mount to ensure that it will fail in the face of a pending unmount request, even if the concurrent unmount fully completes. Adopt these new functions in both nullfs and unionfs. Reviewed By: kib, markj Differential Revision: https://reviews.freebsd.org/D30401	2021-06-05 18:20:36 -07:00
wiklam	43521b46fc	Correcting comment about "sched_interact_score". Reviewed by: jrtc@, imp@ Pull Request: https://github.com/freebsd/freebsd-src/pull/431 Sponsored by: Netflix	2021-06-02 21:50:57 -06:00
Warner Losh	9f3d1a98dd	regen after tweaks to getgroups and setgroups Sponsored by: Netflix	2021-06-02 13:24:50 -06:00
Moritz Buhl	4bc2174a1b	kern: fail getgroup and setgroup with negative int Found using https://github.com/NetBSD/src/blob/trunk/tests/lib/libc/sys/t_getgroups.c getgroups/setgroups want an int and therefore casting it to u_int resulted in `getgroups(-1, ...)` not returning -1 / errno = EINVAL. imp@ updated syscall.master and made changes markj@ suggested PR: 189941 Tested by: imp@ Reviewed by: markj@ Pull Request: https://github.com/freebsd/freebsd-src/pull/407 Differential Revision: https://reviews.freebsd.org/D30617	2021-06-02 13:22:57 -06:00
Mateusz Guzik	c9f8dcda85	kqueue: replace kq_ncallouts loop with atomic_fetchadd	2021-06-02 15:14:58 +00:00
Rich Ercolani	a19ae1b099	vfs: fix MNT_SYNCHRONOUS check in vn_write `ca1ce50b2b` ("vfs: add more safety against concurrent forced unmount to vn_write") has a side effect of only checking MNT_SYNCHRONOUS if O_FSYNC is set. Reviewed By: mjg Differential Revision: https://reviews.freebsd.org/D30610	2021-06-02 13:42:02 +00:00
Kyle Evans	2d741f33bd	kern: ether_gen_addr: randomize on default hostuuid, too Currently, this will still hash the default (all zero) hostuuid and potentially arrive at a MAC address that has a high chance of collision if another interface of the same name appears in the same broadcast domain on another host without a hostuuid, e.g., some virtual machine setups. Instead of using the default hostuuid, just treat it as a failure and generate a random LA unicast MAC address. Reviewed by: bz, gbe, imp, kbowling, kp MFC after: 1 week Differential Revision: https://reviews.freebsd.org/D29788	2021-06-01 22:59:21 -05:00
Mark Johnston	283e60fb31	ktrace: Fix an inverted comparison added in commit `f3851b235` Fixes: `f3851b235` ("ktrace: Fix a race with fork()") Reported by: dchagin, phk	2021-06-01 09:15:35 -04:00
Konstantin Belousov	d3f7975fcb	thread_reap_barrier(): remove unused variable Noted by: alc Sponsored by: Mellanox Technologies/NVidia Networking MFC after: 1 week	2021-05-31 23:03:42 +03:00
Konstantin Belousov	f62c7e54e9	Add thread_reap_barrier() Reviewed by: hselasky,markj Sponsored by: Mellanox Technologies/NVidia Networking MFC after: 1 week Differential revision: https://reviews.freebsd.org/D30468	2021-05-31 18:09:22 +03:00
Konstantin Belousov	3a68546d23	quiesce_cpus(): add special handling for PDROP Currently passing PDROP to the quiesce_cpus() function does not make sense. Add special meaning for it, by not waiting for the idle thread to schedule. Also avoid allocating u_int[MAXCPU] on the stack. Reviewed by: hselasky, markj Sponsored by: Mellanox Technologies/NVidia Networking MFC after: 1 week Differential revision: https://reviews.freebsd.org/D30468	2021-05-31 18:09:22 +03:00
Konstantin Belousov	845d77974b	kern_thread.c: wrap too long lines Reviewed by: hselasky, markj Sponsored by: Mellanox Technologies/NVidia Networking MFC after: 1 week Differential revision: https://reviews.freebsd.org/D30468	2021-05-31 18:09:22 +03:00
Konstantin Belousov	e266a0f7f0	kern linker: do not allow more than one kldload and kldunload syscalls simultaneously kld_sx is dropped e.g. for executing sysinits, which allows user to initiate kldunload while module is not yet fully initialized. Reviewed by: markj Differential revision: https://reviews.freebsd.org/D30456 Sponsored by: The FreeBSD Foundation MFC after: 1 week	2021-05-31 18:09:22 +03:00
Konstantin Belousov	27006229f7	vinvalbuf: do not panic if we were unable to flush dirty buffers Return EBUSY instead and let the caller handle the issue. For vgone()/vnode reclamation, caller first does vinvalbuf(V_SAVE), which returns EBUSY in case dirty buffers were not flushed. Then caller calls vinvalbuf(0) due to non-zero return, which gets rid of all dirty buffers without dependencies. PR: 238565 Reviewed by: asomers, mckusick Sponsored by: The FreeBSD Foundation MFC after: 1 week Differential revision: https://reviews.freebsd.org/D30555	2021-05-31 01:20:53 +03:00
Jason A. Harmening	a4b07a2701	VFS_QUOTACTL(9): allow implementation to indicate busy state changes Instead of requiring all implementations of vfs_quotactl to unbusy the mount for Q_QUOTAON and Q_QUOTAOFF, add an "mp_busy" in/out param to VFS_QUOTACTL(9). The implementation may then indicate to the caller whether it needed to unbusy the mount. Also, add stdbool.h to libprocstat modules which #define _KERNEL before including sys/mount.h. Otherwise they'll pull in sys/types.h before defining _KERNEL and therefore won't have the bool definition they need for mp_busy. Reviewed By: kib, markj Differential Revision: https://reviews.freebsd.org/D30556	2021-05-30 14:53:47 -07:00
Jason A. Harmening	271fcf1c28	Revert commits `6d3e78ad6c` and `54256e7954` Parts of libprocstat like to pretend they're kernel components for the sake of including mount.h, and including sys/types.h in the _KERNEL case doesn't fix the build for some reason. Revert both the VFS_QUOTACTL() change and the follow-up "fix" for now.	2021-05-29 17:48:02 -07:00
Mateusz Guzik	3cf75ca220	vfs: retire unused vn_seqc_write_begin_unheld*	2021-05-29 22:04:09 +00:00
Mateusz Guzik	d81aefa8b7	vfs: use the sentinel trick in locked lookup path parsing	2021-05-29 22:04:09 +00:00
Mateusz Guzik	478c52f1e3	vfs: slightly rework vn_rlimit_fsize	2021-05-29 22:04:09 +00:00
Mateusz Guzik	9bfddb3ac4	fd: use PROC_WAIT_UNLOCKED when clearing p_fd/p_pd	2021-05-29 22:04:09 +00:00
Jason A. Harmening	6d3e78ad6c	VFS_QUOTACTL(9): allow implementation to indicate busy state changes Instead of requiring all implementations of vfs_quotactl to unbusy the mount for Q_QUOTAON and Q_QUOTAOFF, add an "mp_busy" in/out param to VFS_QUOTACTL(9). The implementation may then indicate to the caller whether it needed to unbusy the mount. Reviewed By: kib, markj Differential Revision: https://reviews.freebsd.org/D30218	2021-05-29 14:05:39 -07:00
Mark Johnston	f3851b235b	ktrace: Fix a race with fork() ktrace(2) may toggle trace points in any of 1. a single process 2. all members of a process group 3. all descendents of the processes in 1 or 2 In the first two cases, we do not permit the operation if the process is being forked or not visible. However, in case 3 we did not enforce this restriction for descendents. As a result, the assertions about the child in ktrprocfork() may be violated. Move these checks into ktrops() so that they are applied consistently. Allow KTROP_CLEAR for nascent processes. Otherwise, there is a window where we cannot clear trace points for a nascent child if they are inherited from the parent. Reported by: syzbot+d96676592978f137e05c@syzkaller.appspotmail.com Reported by: syzbot+7c98fcf84a4439f2817f@syzkaller.appspotmail.com Reviewed by: kib MFC after: 1 week Sponsored by: The FreeBSD Foundation Differential Revision: https://reviews.freebsd.org/D30481	2021-05-27 15:52:20 -04:00
Mark Johnston	e00bae5c18	kevent: Prohibit negative change and event list lengths Previously, a negative change list length would be treated the same as an empty change list. A negative event list length would result in bogus copyouts. Make kevent(2) return EINVAL for both cases so that application bugs are more easily found, and to be more robust against future changes to kevent internals. Reviewed by: imp, kib MFC after: 1 week Sponsored by: The FreeBSD Foundation Differential Revision: https://reviews.freebsd.org/D30480	2021-05-27 15:52:20 -04:00
Mark Johnston	f885100773	ktrace: Handle negative array sizes in ktrstructarray ktrstructarray() may be used to create copies of kevent(2) change and event arrays. It is called before parameter validation is done and so should check for bogus array lengths before allocating a copy. Reported by: syzkaller Reviewed by: kib MFC after: 1 week Sponsored by: The FreeBSD Foundation Differential Revision: https://reviews.freebsd.org/D30479	2021-05-27 15:52:20 -04:00
Edward Tomasz Napierala	905d192d6f	Unstaticize parts of coredumping code This makes it possible to call __elfN(size_segments) and __elfN(puthdr) from Linux coredump code. Reviewed By: kib Sponsored By: EPSRC Differential Revision: https://reviews.freebsd.org/D30455	2021-05-26 11:51:57 +01:00
John Baldwin	6b313a3a60	Include the trailer in the original dst_iov. This avoids creating a duplicate copy on the stack just to append the trailer. Reviewed by: gallatin, markj Sponsored by: Netflix Differential Revision: https://reviews.freebsd.org/D30139	2021-05-25 16:59:19 -07:00
John Baldwin	21e3c1fbe2	Assume OCF is the only KTLS software backend. This removes support for loadable software backends. The KTLS OCF support is now always included in kernels with KERN_TLS and the ktls_ocf.ko module has been removed. The software encryption routines now take an mbuf directly and use the TLS mbuf as the crypto buffer when possible. Bump __FreeBSD_version for software backends in ports. Reviewed by: gallatin, markj Sponsored by: Netflix Differential Revision: https://reviews.freebsd.org/D30138	2021-05-25 16:59:19 -07:00
John Baldwin	883a0196b6	crypto: Add a new type of crypto buffer for a single mbuf. This is intended for use in KTLS transmit where each TLS record is described by a single mbuf that is itself queued in the socket buffer. Using the existing CRYPTO_BUF_MBUF would result in bus_dmamap_load_crp() walking additional mbufs in the socket buffer that are not relevant, but generating a S/G list that potentially exceeds the limit of the tag (while also wasting CPU cycles). Reviewed by: markj Sponsored by: Netflix Differential Revision: https://reviews.freebsd.org/D30136	2021-05-25 16:59:18 -07:00
John Baldwin	6663f8a23e	sglist: Add sglist_append_single_mbuf(). This function appends the contents of a single mbuf to an sglist rather than an entire mbuf chain. Reviewed by: gallatin, markj Sponsored by: Netflix Differential Revision: https://reviews.freebsd.org/D30135	2021-05-25 16:59:18 -07:00
John Baldwin	aa341db39b	Rename m_unmappedtouio() to m_unmapped_uiomove(). This function doesn't only copy data into a uio but instead is a variant of uiomove() similar to uiomove_fromphys(). Reviewed by: gallatin, markj Sponsored by: Netflix Differential Revision: https://reviews.freebsd.org/D30444	2021-05-25 16:59:18 -07:00
John Baldwin	3f9dac85cc	Extend m_copyback() to support unmapped mbufs. Reviewed by: gallatin, markj Sponsored by: Netflix Differential Revision: https://reviews.freebsd.org/D30133	2021-05-25 16:59:18 -07:00
John Baldwin	3c7a01d773	Extend m_apply() to support unmapped mbufs. m_apply() invokes the callback function separately on each segment of an unmapped mbuf: the TLS header, individual pages, and the TLS trailer. Reviewed by: markj Sponsored by: Netflix Differential Revision: https://reviews.freebsd.org/D30132	2021-05-25 16:59:18 -07:00
Edward Tomasz Napierala	3b9971c8da	Clean up some of the core dumping code. No functional changes. Reviewed By: kib Sponsored By: EPSRC Differential Revision: https://reviews.freebsd.org/D30397	2021-05-25 16:30:32 +01:00
Konstantin Belousov	fd3ac06f45	ptrace: add an option to not kill debuggees on debugger exit Requested by: markj Reviewed by: jhb (previous version) Tested by: pho Sponsored by: The FreeBSD Foundation MFC after: 1 week Differential revision: https://reviews.freebsd.org/D30351	2021-05-25 18:22:34 +03:00
Konstantin Belousov	d7a7ea5be6	sys_process.c: extract ptrace_unsuspend() Reviewed by: jhb Tested by: pho Sponsored by: The FreeBSD Foundation MFC after: 1 week Differential revision: https://reviews.freebsd.org/D30351	2021-05-25 18:22:27 +03:00
Mateusz Guzik	a269183875	vfs: elide vnode locking when it is only needed for audit if possible	2021-05-23 19:37:16 +00:00
Mark Johnston	6f6cd1e8e8	ktrace: Remove vrele() at the end of ktr_writerequest() As of commit `fc369a353` we no longer ref the vnode when writing a record. Drop the corresponding vrele() call in the error case. Fixes: `fc369a353` ("ktrace: fix a race between writes and close") Reported by: syzbot+9b96ea7a5ff8917d3fe4@syzkaller.appspotmail.com Reported by: syzbot+6120ebbb354cd52e5107@syzkaller.appspotmail.com Reviewed by: kib MFC after: 6 days Differential Revision: https://reviews.freebsd.org/D30404	2021-05-23 14:13:01 -04:00
Mateusz Guzik	e2ab16b1a6	lockprof: move panic check after inspecting the state	2021-05-23 17:55:27 +00:00
Mateusz Guzik	6a467cc5e1	lockprof: pass lock type as an argument instead of reading the spin flag	2021-05-23 17:55:27 +00:00
Hans Petter Selasky	ef0f7ae934	The old thread priority must be stored as part of the EPOCH(9) tracker. Else recursive use of EPOCH(9) may cause the wrong priority to be restored. Bump the __FreeBSD_version due to changing the thread and epoch tracker structure. Differential Revision: https://reviews.freebsd.org/D30375 Reviewed by: markj@ MFC after: 1 week Sponsored by: Mellanox Technologies // NVIDIA Networking	2021-05-23 10:53:25 +02:00
Mateusz Guzik	138f78e94b	umtx: convert umtxq_lock to a macro Then LOCK_PROFILING starts reporting callers instead of the inline.	2021-05-22 21:01:05 +00:00
Mateusz Guzik	e71d5c7331	Fix limit testing after `1762f674cc` ktrace commit. The previous: if ((uoff_t)uio->uio_offset + uio->uio_resid > lim) signal(....); was replaced with: if ((uoff_t)uio->uio_offset + uio->uio_resid < lim) return; signal(....); Making (uoff_t)uio->uio_offset + uio->uio_resid == lim trip over the limit, when it did not previously. Unbreaks running 13.0 buildworld.	2021-05-22 20:18:21 +00:00
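The boundary case described in the commit above can be reproduced in isolation. The following is a minimal sketch, not the kernel code: the helper names (`end_offset`, `exceeds_limit_old`, `exceeds_limit_inverted`, `exceeds_limit_fixed`) are hypothetical, plain `uint64_t` stands in for `uoff_t`, and the `<=` form is only an assumption about what a correct inversion of the original `>` test looks like.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical helper: end offset of the write; the sketch assumes no wraparound. */
static uint64_t
end_offset(uint64_t offset, uint64_t resid)
{
	assert(offset <= UINT64_MAX - resid);
	return (offset + resid);
}

/* Original shape: signal only when the write ends strictly past the limit. */
static bool
exceeds_limit_old(uint64_t offset, uint64_t resid, uint64_t lim)
{
	return (end_offset(offset, resid) > lim);
}

/* Early-return shape with '<': a write ending exactly at lim now trips the limit. */
static bool
exceeds_limit_inverted(uint64_t offset, uint64_t resid, uint64_t lim)
{
	if (end_offset(offset, resid) < lim)
		return (false);
	return (true);
}

/* Early-return shape with '<=' (assumed fix): same behavior as the original test. */
static bool
exceeds_limit_fixed(uint64_t offset, uint64_t resid, uint64_t lim)
{
	if (end_offset(offset, resid) <= lim)
		return (false);
	return (true);
}
```

With `offset = 90`, `resid = 10` and `lim = 100`, only the `<` variant reports the limit as exceeded, which matches the regression the message describes.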
Konstantin Belousov	fc369a353b	ktrace: fix a race between writes and close It was possible that termination of ktrace session occurred during some record write, in which case write occurred after the close of the vnode. Use ktr_io_params refcounting to avoid this situation, by taking the reference on the structure instead of vnode. Reviewed by: markj Sponsored by: The FreeBSD Foundation MFC after: 1 week Differential revision: https://reviews.freebsd.org/D30400	2021-05-22 23:14:13 +03:00
Mateusz Guzik	48235c377f	Fix a braino in previous. Instead of trying to partially ifdef out ktrace handling, define the missing identifier to 0. Without this fix lack of ktrace in the kernel also means there is no SIGXFSZ signal delivery.	2021-05-22 19:53:40 +00:00
Mateusz Guzik	154f0ecc10	Fix tinderbox build after `1762f674cc` ktrace commit.	2021-05-22 19:41:19 +00:00
Mateusz Guzik	a0842e69aa	lockprof: add contested-only profiling This allows tracking all wait times with much smaller runtime impact. For example when doing -j 104 buildkernel on tmpfs: no profiling: 2921.70s user 282.72s system 6598% cpu 48.562 total all acquires: 2926.87s user 350.53s system 6656% cpu 49.237 total contested only: 2919.64s user 290.31s system 6583% cpu 48.756 total	2021-05-22 19:28:37 +00:00
Mateusz Guzik	fca5cfd584	lockprof: retire lock_prof_skipcount The implementation uses a global variable for ALL calls, defeating the point of sampling in the first place. Remove it as it clearly remains unused.	2021-05-22 19:28:37 +00:00
Mateusz Guzik	cf74b2be53	vfs: retire the now unused vnlru_free routine	2021-05-22 18:42:30 +00:00
Mark Johnston	e4b16f2fb1	ktrace: Avoid recursion in namei() sys_ktrace() calls namei(), which may call ktrnamei(). But sys_ktrace() also calls ktrace_enter() first, so if the caller is itself being traced, the assertion in ktrace_enter() is triggered. And, ktrnamei() does not check for recursion like most other ktrace ops do. Fix the bug by simply deferring the ktrace_enter() call. Also make the parameter to ktrnamei() const and convert to ANSI. Reported by: syzbot+d0a4de45e58d3c08af4b@syzkaller.appspotmail.com Reviewed by: kib MFC after: 1 week Sponsored by: The FreeBSD Foundation Differential Revision: https://reviews.freebsd.org/D30340	2021-05-22 12:07:32 -04:00
Konstantin Belousov	f784da883f	Move mnt_maxsymlinklen into appropriate fs mount data structures Reviewed by: mckusick Tested by: pho Sponsored by: The FreeBSD Foundation MFC after: 1 week X-MFC-Note: struct mount layout Differential revision: https://reviews.freebsd.org/D30325	2021-05-22 15:16:09 +03:00
Konstantin Belousov	ea2b64c241	ktrace: add a kern.ktrace.filesize_limit_signal knob When enabled, writes to ktrace.out that exceed the max file size limit cause SIGXFSZ as it should be, but note that the limit is taken from the process that initiated ktrace. When disabled, write is blocked, but signal is not sent. Note that in either case ktrace for the affected process is stopped. Requested and reviewed by: markj Tested by: pho Sponsored by: The FreeBSD Foundation MFC after: 1 week Differential revision: https://reviews.freebsd.org/D30257	2021-05-22 15:16:09 +03:00
Konstantin Belousov	02645b886b	ktrace: use the limit of the trace initiator for file size limit on writes Reviewed by: markj Tested by: pho Sponsored by: The FreeBSD Foundation MFC after: 1 week Differential revision: https://reviews.freebsd.org/D30257	2021-05-22 15:16:09 +03:00
Konstantin Belousov	1762f674cc	ktrace: pack all ktrace parameters into allocated structure ktr_io_params Ref-count the ktr_io_params structure instead of vnode/cred. Reviewed by: markj Tested by: pho Sponsored by: The FreeBSD Foundation MFC after: 1 week Differential revision: https://reviews.freebsd.org/D30257	2021-05-22 15:16:08 +03:00
Konstantin Belousov	a6144f713c	ktrace: do not stop tracing other processes if ours cannot write to this vnode Other processes might still be able to write, make the decision to stop based on the per-process situation. Reviewed by: markj Tested by: pho Sponsored by: The FreeBSD Foundation MFC after: 1 week Differential revision: https://reviews.freebsd.org/D30257	2021-05-22 15:16:08 +03:00
Konstantin Belousov	9bb84c23e7	accounting: explicitly mark the exiting thread as doing accounting and use the mark to stop applying file size limits on the write of the accounting record. This allows removing the hack that clears process limits in acct_process(), and avoids the bug with the clearing being ineffective because limits are also cached in the thread structure. Reported and reviewed by: markj Tested by: pho Sponsored by: The FreeBSD Foundation MFC after: 1 week Differential revision: https://reviews.freebsd.org/D30257	2021-05-22 15:16:08 +03:00
Konstantin Belousov	70c05850e2	kern_descrip.c: Style Wrap too long lines. Reviewed by: markj Sponsored by: The FreeBSD Foundation MFC after: 1 week Differential revision: https://reviews.freebsd.org/D30257	2021-05-22 15:16:08 +03:00
Konstantin Belousov	d713bf7927	vn_need_pageq_flush(): simplify There is no need to own vnode interlock, since v_object is type stable and can only change to/from NULL, and no other checks in the function access fields protected by the interlock. Remove the need variable, the result of the test is directly usable as return value. Tested by: mav, pho Sponsored by: The FreeBSD Foundation MFC after: 1 week	2021-05-22 12:29:44 +03:00
Edward Tomasz Napierala	33621dfc19	Refactor core dumping code a bit This makes it possible to use core_write(), core_output(), and sbuf_drain_core_output(), in Linux coredump code. Moving them out of imgact_elf.c is necessary because of the weird way it's being built. Reviewed By: kib Sponsored By: EPSRC Differential Revision: https://reviews.freebsd.org/D30369	2021-05-22 09:59:00 +01:00
Mark Johnston	916c61a5ed	Fix handling of errors from pru_send(PRUS_NOTREADY) PRUS_NOTREADY indicates that the caller has not yet populated the chain with data, and so it is not ready for transmission. This is used by sendfile (for async I/O) and KTLS (for encryption). In particular, if pru_send returns an error, the caller is responsible for freeing the chain since other implicit references to the data buffers exist. For async sendfile, it happens that an error will only be returned if the connection was dropped, in which case tcp_usr_ready() will handle freeing the chain. But since KTLS can be used in conjunction with the regular socket I/O system calls, many more error cases - which do not result in the connection being dropped - are reachable. In these cases, KTLS was effectively assuming success. So: - Change sosend_generic() to free the mbuf chain if pru_send(PRUS_NOTREADY) fails. Nothing else owns a reference to the chain at that point. - Similarly, in vn_sendfile() change the !async I/O && KTLS case to free the chain. - If async I/O is still outstanding when pru_send fails in vn_sendfile(), set an error in the sfio structure so that the connection is aborted and the mbuf chain is freed. Reviewed by: gallatin, tuexen Discussed with: jhb MFC after: 2 weeks Sponsored by: The FreeBSD Foundation Differential Revision: https://reviews.freebsd.org/D30349	2021-05-21 17:45:19 -04:00
Hans Petter Selasky	c82c200622	Accessing the epoch structure should happen after the INIT_CHECK(). Else the epoch pointer may be NULL. MFC after: 1 week Sponsored by: Mellanox Technologies // NVIDIA Networking	2021-05-21 11:21:32 +02:00
Hans Petter Selasky	f33168351b	Properly define EPOCH(9) function macro. No functional change intended. MFC after: 1 week Sponsored by: Mellanox Technologies // NVIDIA Networking	2021-05-21 11:21:32 +02:00
Hans Petter Selasky	cc9bb7a9b8	Rework for-loop in EPOCH(9) to reduce indentation level. No functional change intended. MFC after: 1 week Sponsored by: Mellanox Technologies // NVIDIA Networking	2021-05-21 11:21:32 +02:00
Lv Yunlong	b295c5ddce	socket: Release cred reference later in sodealloc() We dereference so->so_cred to update the per-uid socket buffer accounting, so the crfree() call must be deferred until after that point. PR: 255869 MFC after: 1 week	2021-05-18 15:25:40 -04:00
Konstantin Belousov	8cf912b017	ttydev_write: prevent stops while terminal is busied Since busy state is checked by all blocked writes, stopping a process which waits in ttydisc_write() causes cascade. Utilize sigdeferstop() to avoid the issue. Submitted by: Jakub Piecuch <j.piecuch96@gmail.com> PR: 255816 MFC after: 1 week	2021-05-18 20:52:03 +03:00
Mateusz Guzik	cc6f46ac2f	vfs: refactor vdrop In particular move vunlazy into its own routine.	2021-05-18 15:30:28 +00:00
Mateusz Guzik	715fcc0d34	vfs: change vn_freevnodes_* prefix to idiomatic vfs_freevnodes_*	2021-05-18 15:30:28 +00:00
Colin Percival	b6be9566d2	Fix buffer overflow in preloaded hostuuid cleaning When a module of type "hostuuid" is provided by the loader, prison0_init strips any trailing whitespace and ASCII control characters by (a) adjusting the buffer length, and (b) zeroing out the characters in question, before storing it as the system's hostuuid. The buffer length adjustment was correct, but the zeroing overwrote one byte higher in memory than intended -- in the typical case, zeroing one byte past the end of the hostuuid buffer. Due to the layout of buffers passed by the boot loader to the kernel, this will be the first byte of a subsequent buffer. This was probably harmless; prison0_init runs after preloaded kernel modules have been linked and after the preloaded /boot/entropy cache has been processed, so in both cases having the first byte overwritten will not cause problems. We cannot however rule out the possibility that other objects which are preloaded by the loader could suffer from having the first byte overwritten. Since the zeroing does not in fact serve any purpose, remove it and trim trailing whitespace and ASCII control characters by adjusting the buffer length alone. Fixes: `c3188289` Preload hostuuid for early-boot use Reviewed by: kevans, markj MFC after: 3 days	2021-05-17 20:07:49 -07:00
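The approach the commit above settles on, trimming by shrinking the length without writing to the buffer, can be sketched as follows. This is an illustration only: `trimmed_length` is a hypothetical name, and treating every byte at or below 0x20 as "whitespace or ASCII control character" is an assumption of the sketch, not a statement about the kernel's exact test.

```c
#include <stddef.h>

/*
 * Hypothetical helper: return the length of buf once trailing whitespace
 * and ASCII control characters (assumed here: bytes <= 0x20) are dropped.
 * The buffer is never written, so nothing past its end can be clobbered.
 */
static size_t
trimmed_length(const char *buf, size_t len)
{
	while (len > 0 && (unsigned char)buf[len - 1] <= 0x20)
		len--;
	return (len);
}
```

Because the function takes a `const` pointer, the off-by-one store described in the message cannot be reintroduced without changing the signature.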
Colin Percival	330f110bf1	Fix 'hostuuid: preload data malformed' warning If the preloaded hostuuid value is invalid and verbose booting is enabled, a warning is printed. This printf had two bugs: 1. It was missing a trailing \n character. 2. The malformed UUID is printed with %s even though it is not known to be NUL-terminated. This commit adds the missing \n and uses %.*s with the (already known) length of the preloaded UUID to ensure that we don't read past the end of the buffer. Reported by: kevans Fixes: `c3188289` Preload hostuuid for early-boot use MFC after: 3 days	2021-05-17 20:07:49 -07:00
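The `%.*s` technique named in the commit above bounds how many bytes `printf`-style functions read from a buffer that may lack a terminating NUL. A minimal userland sketch, assuming a hypothetical helper `format_malformed_warning` and an illustrative message format (the kernel's actual format string is not shown in the log):

```c
#include <assert.h>
#include <limits.h>
#include <stddef.h>
#include <stdio.h>

/*
 * Hypothetical helper: format a warning about a length-delimited buffer.
 * The precision given via '*' caps the read at buflen bytes, so buf does
 * not need to be NUL-terminated. The trailing "\n" is part of the format.
 */
static int
format_malformed_warning(char *out, size_t outlen, const char *buf,
    size_t buflen)
{
	assert(buflen <= INT_MAX);	/* the precision argument is an int */
	return (snprintf(out, outlen,
	    "hostuuid: preload data malformed: '%.*s'\n", (int)buflen, buf));
}
```

Passing a five-byte array with a length of 3 prints only the first three bytes, regardless of what follows them in memory.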
Kirk McKusick	9a2fac6ba6	Fix handling of embedded symbolic links (and history lesson). The original filesystem release (4.2BSD) had no embedded symlinks. Historically symbolic links were just a different type of file, so the content of the symbolic link was contained in a single disk block fragment. We observed that most symbolic links were short enough that they could fit in the area of the inode that normally holds the block pointers. So we created embedded symlinks where the content of the link was held in the inode's pointer area thus avoiding the need to seek and read a data fragment and reducing the pressure on the block cache. At the time we had only UFS1 with 32-bit block pointers, so the test for a fastlink was: di_size < (NDADDR + NIADDR) * sizeof(daddr_t) (where daddr_t would be ufs1_daddr_t today). When embedded symlinks were added, a spare field in the superblock with a known zero value became fs_maxsymlinklen. New filesystems set this field to (NDADDR + NIADDR) * sizeof(daddr_t). Embedded symlinks were assumed when di_size < fs->fs_maxsymlinklen. Thus filesystems that preceded this change always read from blocks (since fs->fs_maxsymlinklen == 0) and newer ones used embedded symlinks if they fit. Similarly symlinks created on pre-embedded symlink filesystems always spill into blocks while newer ones will embed if they fit. At the same time that the embedded symbolic links were added, the on-disk directory structure was changed splitting the former u_int16_t d_namlen into u_int8_t d_type and u_int8_t d_namlen. Thus fs_maxsymlinklen <= 0 (as used by the OFSFMT() macro) can be used to distinguish old directory formats. In retrospect that should have just been an added flag, but we did not realize we needed to know about that change until it was already in production. Code was split into ufs/ffs so that the log structured filesystem could use ufs functionality while doing its own disk layout. 
This meant that no ffs superblock fields could be used in the ufs code. Thus ffs superblock fields that were needed in ufs code had to be copied to fields in the mount structure. Since ufs_readlink needed to know if a link was embedded, fs_maxsymlinklen gets copied to mnt_maxsymlinklen. The kernel panic that led to this fix was triggered when a disk error created an inode of type symlink with no allocated data blocks but a large size. When readlink was called the uiomove was attempted which segment faulted. static int ufs_readlink(ap) struct vop_readlink_args /* { struct vnode *a_vp; struct uio *a_uio; struct ucred *a_cred; } */ *ap; { struct vnode *vp = ap->a_vp; struct inode *ip = VTOI(vp); doff_t isize; isize = ip->i_size; if ((isize < vp->v_mount->mnt_maxsymlinklen) || DIP(ip, i_blocks) == 0) { /* XXX - for old fastlink support */ return (uiomove(SHORTLINK(ip), isize, ap->a_uio)); } return (VOP_READ(vp, ap->a_uio, 0, ap->a_cred)); } The second part of the "if" statement that adds DIP(ip, i_blocks) == 0) { /* XXX - for old fastlink support */ is problematic. It never appeared in BSD released by Berkeley because as noted above mnt_maxsymlinklen is 0 for old format filesystems, so will always fall through to the VOP_READ as it should. I had to dig back through `git blame' to find that Rodney Grimes added it as part of ``The big 4.4BSD Lite to FreeBSD 2.0.0 (Development) patch.'' He must have brought it across from an earlier FreeBSD. Unfortunately the source-control logs for FreeBSD up to the merger with the AT&T-blessed 4.4BSD-Lite conversion were destroyed as part of the agreement to let FreeBSD remain unencumbered, so I cannot pin-point where that line got added on the FreeBSD side. The one change needed here is that mnt_maxsymlinklen is declared as an `int' and should be changed to be `u_int64_t'. This discovery led us to check out the code that deletes symbolic links. 
Specifically if (vp->v_type == VLNK && (ip->i_size < vp->v_mount->mnt_maxsymlinklen || datablocks == 0)) { if (length != 0) panic("ffs_truncate: partial truncate of symlink"); bzero(SHORTLINK(ip), (u_int)ip->i_size); ip->i_size = 0; DIP_SET(ip, i_size, 0); UFS_INODE_SET_FLAG(ip, IN_SIZEMOD | IN_CHANGE | IN_UPDATE); if (needextclean) goto extclean; return (ffs_update(vp, waitforupdate)); } Here too our broken symlink inode with no data blocks allocated and a large size will segment fault as we are incorrectly using the test that we have no data blocks to decide that it is an embedded symbolic link and attempting to bzero past the end of the inode. The test for datablocks == 0 is unnecessary as the test for ip->i_size < vp->v_mount->mnt_maxsymlinklen will do the right thing in all cases. The test for datablocks == 0 was added by David Greenman in this commit: Author: David Greenman <dg@FreeBSD.org> Date: Tue Aug 2 13:51:05 1994 +0000 Completed (hopefully) the kernel support for old style "fastlinks". Notes: svn path=/head/; revision=1821 I am guessing that he likely earlier added the incorrect test in the ufs_readlink code. I asked David if he had any recollection of why he made this change. Amazingly, he still had a recollection of why he had made a one-line change more than twenty years ago. And unsurprisingly it was because he had been stuck between a rock and a hard place. FreeBSD was up to 1.1.5 before the switch to the 4.4BSD-Lite code base. Prior to that, there were three years of development in all areas of the kernel, including the filesystem code, from the combined set of people including Bill Jolitz, Patchkit contributors, and FreeBSD Project members. The compatibility issue at hand was caused by the FASTLINKS patches from Curt Mayer. In merging in the 4.4BSD-Lite changes David had to find a way to provide compatibility with both the changes that had been made in FreeBSD 1.1.5 and with 4.4BSD-Lite. 
He felt that these changes would provide compatibility with both systems. In his words: ``My recollection is that the 'FASTLINKS' symlinks support in FreeBSD-1.x, as implemented by Curt Mayer, worked differently than 4.4BSD. He used a spare field in the inode to duplicately store the length. When the 4.4BSD-Lite merge was done, the optimized symlinks support for existing filesystems (those that were initialized in FreeBSD-1.x) were broken due to the FFS on-disk structure of 4.4BSD-Lite differing from FreeBSD-1.x. My commit was needed to restore the backward compatibility with FreeBSD-1.x filesystems. I think it was the best that could be done in the somewhat urgent circumstances of the post Berkeley-USL settlement. Also, regarding Rod's massive commit with little explanation, some context: John Dyson and I did the initial re-port of the 4.4BSD-Lite kernel to the 386 platform in just 10 days. It was by far the most intense hacking effort of my life. In addition to the porting of tons of FreeBSD-1 code, I think we wrote more than 30,000 lines of new code in that time to deal with the missing pieces and architectural changes of 4.4BSD-Lite. We didn't make many notes along the way. There was a lot of pressure to get something out to the rest of the developer community as fast as possible, so detailed discrete commits didn't happen - it all came as a giant wad, which is why Rod's commit message was worded the way it was.'' Reported by: Chuck Silvers Tested by: Chuck Silvers History by: David Greenman Lawrence MFC after: 1 week Sponsored by: Netflix	2021-05-16 17:04:11 -07:00
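The difference between the two embedded-symlink tests discussed in the message above can be shown with a simplified model. This is a sketch under stated assumptions: `struct fake_inode`, `is_embedded_old` and `is_embedded_fixed` are hypothetical stand-ins, not the UFS structures or functions, and the limit of 60 used in the note below assumes 12 direct plus 3 indirect 32-bit block pointers.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical, simplified stand-in for the inode fields under discussion. */
struct fake_inode {
	uint64_t size;		/* models i_size */
	uint64_t blocks;	/* models i_blocks */
};

/* Old test: "no data blocks" alone is taken to mean the link is embedded. */
static bool
is_embedded_old(const struct fake_inode *ip, uint64_t maxsymlinklen)
{
	return (ip->size < maxsymlinklen || ip->blocks == 0);
}

/* Size-only test: the link is embedded only if it fits in the pointer area. */
static bool
is_embedded_fixed(const struct fake_inode *ip, uint64_t maxsymlinklen)
{
	return (ip->size < maxsymlinklen);
}
```

For a corrupted inode with `size = 4096` and `blocks = 0` against a limit of 60, the old test answers "embedded" and a caller would then copy 4096 bytes out of a 60-byte area, while the size-only test answers "not embedded". A genuine short link of 10 bytes is classified as embedded by both.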
Mateusz Guzik	852088f6af	vfs: add missing atomic conversion to writecount adjustment Fixes: ("vfs: lockless writecount adjustment in set/unset text")	2021-05-14 17:42:05 +02:00
Mateusz Guzik	ca1ce50b2b	vfs: add more safety against concurrent forced unmount to vn_write 1. stop re-reading ->v_mount (can become NULL) 2. stop re-reading ->v_type (can change to VBAD)	2021-05-14 14:22:22 +00:00
Mateusz Guzik	b5fb9ae687	vfs: lockless writecount adjustment in set/unset text ... for cases where this is not the first/last exec.	2021-05-14 14:22:21 +00:00
Mark Johnston	2cca77ee01	kqueue timer: Remove detached knotes from the process stop queue There are some scenarios where a timer event may be detached when it is on the process' kqueue timer stop queue. If kqtimer_proc_continue() is called after that point, it will iterate over the queue and access freed timer structures. It is also possible, at least in a multithreaded program, for a stopped timer event to be scheduled without removing it from the process' stop queue. Ensure that we do not doubly enqueue the event structure in this case. Reported by: syzbot+cea0931bb4e34cd728bd@syzkaller.appspotmail.com Reported by: syzbot+9e1a2f3734652015998c@syzkaller.appspotmail.com Reviewed by: kib MFC after: 1 week Sponsored by: The FreeBSD Foundation Differential Revision: https://reviews.freebsd.org/D30251	2021-05-14 10:08:14 -04:00
Ed Maste	2c9764f36b	regen syscall files after d51198d63b63	2021-05-13 14:09:58 -04:00
Mark Johnston	8b3c4231ab	posix timers: Check for overflow when converting to ns Disallow a time or timer period value when the conversion to nanoseconds would overflow. Otherwise it is possible to trigger a division by zero in realtime_expire_l(), where we compute the number of overruns by dividing by the timer interval. Fixes: `7995dae9` ("posix timers: Improve the overrun calculation") Reported by: syzbot+5ab360bd3d3e3c5a6e0e@syzkaller.appspotmail.com Reported by: syzbot+157b74ff493140d86eac@syzkaller.appspotmail.com Reviewed by: kib MFC after: 1 week Sponsored by: The FreeBSD Foundation Differential Revision: https://reviews.freebsd.org/D30233	2021-05-13 08:34:03 -04:00
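One way to reject a time value whose nanosecond conversion would overflow is to compare against the limit before multiplying. The sketch below is illustrative only: `timespec_to_ns_checked` and `NS_PER_SEC` are hypothetical names, a signed 64-bit result and a 64-bit `time_t` are assumed, and the kernel's actual check may be written differently.

```c
#include <stdbool.h>
#include <stdint.h>
#include <time.h>

#define NS_PER_SEC	INT64_C(1000000000)

/*
 * Hypothetical helper: convert ts to nanoseconds. Returns false, leaving
 * *ns untouched, if ts is malformed or the result would not fit in int64_t.
 */
static bool
timespec_to_ns_checked(const struct timespec *ts, int64_t *ns)
{
	if (ts->tv_sec < 0 || ts->tv_nsec < 0 || ts->tv_nsec >= NS_PER_SEC)
		return (false);
	/* sec * 1e9 + nsec <= INT64_MAX, rearranged so nothing can overflow. */
	if ((int64_t)ts->tv_sec > (INT64_MAX - ts->tv_nsec) / NS_PER_SEC)
		return (false);
	*ns = (int64_t)ts->tv_sec * NS_PER_SEC + ts->tv_nsec;
	return (true);
}
```

Doing the comparison on the division side keeps every intermediate value in range, so the check itself cannot wrap.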
Mark Johnston	9246b3090c	fork: Suspend other threads if both RFPROC and RFMEM are not set Otherwise, a multithreaded parent process may trigger races in vm_forkproc() if one thread calls rfork() with RFMEM set and another calls rfork() without RFMEM. Also simplify vm_forkproc() a bit, vmspace_unshare() already checks to see if the address space is shared. Reported by: syzbot+0aa7c2bec74c4066c36f@syzkaller.appspotmail.com Reported by: syzbot+ea84cb06937afeae609d@syzkaller.appspotmail.com Reviewed by: kib MFC after: 1 week Sponsored by: The FreeBSD Foundation Differential Revision: https://reviews.freebsd.org/D30220	2021-05-13 08:33:23 -04:00
Mateusz Guzik	cef8a95acb	vfs: fix vnode use count leak in O_EMPTY_PATH support The vnode returned by namei_setup is already referenced. Reported by: pho	2021-05-13 09:39:27 +00:00
Konstantin Belousov	6de3cf14c4	vn_open_cred(): disallow O_CREAT | O_EMPTY_PATH This combination does not make sense, and cannot be satisfied by lookup. In particular, lookup cannot supply dvp, it only can directly return vp. Reported and reviewed by: markj using syzkaller Sponsored by: The FreeBSD Foundation MFC after: 3 days	2021-05-13 02:32:04 +03:00
Mark Johnston	d8acd2681b	Fix mbuf leaks in various pru_send implementations The various protocol implementations are not very consistent about freeing mbufs in error paths. In general, all protocols must free both "m" and "control" upon an error, except if PRUS_NOTREADY is specified (this is only implemented by TCP and unix(4) and requires further work not handled in this diff), in which case "control" still must be freed. This diff plugs various leaks in the pru_send implementations. Reviewed by: tuexen MFC after: 2 weeks Sponsored by: The FreeBSD Foundation Differential Revision: https://reviews.freebsd.org/D30151	2021-05-12 13:00:09 -04:00
Mateusz Guzik	12288bd999	cache: fix lockless absolute symlink traversal to non-fp mounts Said lookups would incorrectly fail with EOPNOTSUP. Reported by: kib	2021-05-11 04:30:12 +00:00
Mark Johnston	c8bbb1272c	vfs: Fix error handling in vn_fullpath_hardlink() vn_fullpath_any_smr() will return a positive error number if the caller-supplied buffer isn't big enough. In this case the error must be propagated up, otherwise we may copy out uninitialized bytes. Reported by: syzkaller+KMSAN Reviewed by: mjg, kib MFC after: 3 days Sponsored by: The FreeBSD Foundation Differential Revision: https://reviews.freebsd.org/D30198	2021-05-10 20:22:27 -04:00
Konstantin Belousov	5e7cdf1817	openat(2): add O_EMPTY_PATH It reopens the passed file descriptor, checking the file backing vnode's current access rights against open mode. In particular, this flag allows to convert file descriptor opened with O_PATH, into operable file descriptor, assuming permissions allow that. Reviewed by: markj Tested by: Andrew Walker <awalker@ixsystems.com> Sponsored by: The FreeBSD Foundation MFC after: 1 week Differential revision: https://reviews.freebsd.org/D30148	2021-05-11 02:39:24 +03:00
Konstantin Belousov	d474440ab3	Constify vm_pager-related virtual tables. Reviewed by: markj Tested by: pho Sponsored by: The FreeBSD Foundation MFC after: 1 week Differential revision: https://reviews.freebsd.org/D30070	2021-05-07 17:08:03 +03:00
Mark Johnston	9a7c2de364	realloc: Fix KASAN(9) shadow map updates When copying from the old buffer to the new buffer, we don't know the requested size of the old allocation, but only the size of the allocation provided by UMA. This value is "alloc". Because the copy may access bytes in the old allocation's red zone, we must mark the full allocation valid in the shadow map. Do so using the correct size. Reported by: kp Tested by: kp Sponsored by: The FreeBSD Foundation	2021-05-05 17:12:51 -04:00
Warner Losh	a512d0ab00	kern: clarify boot time In FreeBSD, the current time is computed from uptime + boottime. Uptime is a continuous, smooth function that's monotonically increasing. To effect changes to the current time, boottime is adjusted. boottime is mutable and shouldn't be cached against future need. Document the current implementation, with the caveat that we may stop stepping boottime on resume in the future and will step uptime instead (noted in the commit message, but not in the code). Sponsored by: Netflix Reviewed by: phk, rpokala Differential Revision: https://reviews.freebsd.org/D30116	2021-05-05 12:32:13 -06:00
Elliott Mitchell	a3c7da3d08	kern/intr: declare interrupt vectors unsigned These should never get values large enough for sign to matter, but one of them becoming negative could cause problems. MFC after: 1 week Reviewed by: markj Differential Revision: https://reviews.freebsd.org/D29327	2021-05-03 13:24:30 -04:00
Mark Johnston	2b2d77e720	VOP_STAT: Provide a default value for va_gen Some filesystems, e.g., pseudofs and the NFSv3 client, do not provide one. Reviewed by: kib Reported by: KMSAN Sponsored by: The FreeBSD Foundation Differential Revision: https://reviews.freebsd.org/D30091	2021-05-03 13:24:30 -04:00
Mark Johnston	cdfcfc607a	smp: Initialize arg->cpus sooner in smp_rendezvous_cpus_retry() Otherwise, if !smp_started is true, then smp_rendezvous_cpus_done() will harmlessly perform an atomic RMW on an uninitialized variable. Reported by: KMSAN MFC after: 1 week Sponsored by: The FreeBSD Foundation	2021-05-03 13:24:30 -04:00
Konstantin Belousov	7cb40543e9	filt_timerexpire: do not iterate over the interval User-supplied data might make this loop too time-consuming. Divide directly, and handle both the possibility that we were woken up earlier, and arithmetic overflows/underflows from the calculation. Reported and tested by: pho (previous version) Reviewed by: markj Sponsored by: The FreeBSD Foundation MFC after: 1 week Differential revision: https://reviews.freebsd.org/D30069	2021-05-03 19:49:54 +03:00
Konstantin Belousov	87a64872cd	Add ptrace(PT_COREDUMP) It writes the core of live stopped process to the file descriptor provided as an argument. Based on the initial version from https://reviews.freebsd.org/D29691, submitted by Michał Górny <mgorny@gentoo.org>. Reviewed by: markj Tested by: pho Sponsored by: The FreeBSD Foundation MFC after: 1 week Differential revision: https://reviews.freebsd.org/D29955	2021-05-03 19:18:26 +03:00
Konstantin Belousov	68d311b666	ptracestop: mark threads suspended there with the new TDB_SSWITCH flag This way threads in ptracestop can be discovered by debugger Reviewed by: markj Tested by: pho Sponsored by: The FreeBSD Foundation MFC after: 1 week Differential revision: https://reviews.freebsd.org/D29955	2021-05-03 19:18:25 +03:00
Konstantin Belousov	9ebf9100ba	ptrace: do not allow for parallel ptrace requests Set a new P2_PTRACEREQ flag around the request. Wait for the target process P2_PTRACEREQ flag to clear before setting ours. Otherwise, we rely on the moment that the process lock is not dropped until the stopped target state is important. This is going to be no longer true after some future change. Reviewed by: markj Tested by: pho Sponsored by: The FreeBSD Foundation MFC after: 1 week Differential revision: https://reviews.freebsd.org/D29955	2021-05-03 19:16:30 +03:00
Konstantin Belousov	54c8baa021	kern_ptrace(): extract code to determine ptrace eligibility into helper Reviewed by: markj Tested by: pho Sponsored by: The FreeBSD Foundation MFC after: 1 week Differential revision: https://reviews.freebsd.org/D29955	2021-05-03 19:13:48 +03:00
Konstantin Belousov	2bd0506c8d	kern_ptrace: change type of proctree_locked to bool Tested by: pho Sponsored by: The FreeBSD Foundation MFC after: 1 week Differential revision: https://reviews.freebsd.org/D29955	2021-05-03 19:13:48 +03:00
Konstantin Belousov	af928fded0	Add thread_run_flash() helper It unsuspends single suspended thread, passed as the argument. It is up to the caller to arrange the target thread to suspend later, since the state of the process is not changed from stopped. In particular, the unsuspended thread must not leave to userspace, since boundary code is not prepared to this situation. Reviewed by: markj Tested by: pho Sponsored by: The FreeBSD Foundation MFC after: 1 week Differential revision: https://reviews.freebsd.org/D29955	2021-05-03 19:13:47 +03:00
Konstantin Belousov	15465a2c25	Add sleepq_remove_nested() The helper removes the thread from a sleep queue, assuming that it would need to sleep. The sleepq_remove_nested() function is intended for quite special case, where suspended thread from traced stopped process is temporarily unsuspended to do some work on behalf of the debugger in the target context, and this work might require sleep. Reviewed by: markj Tested by: pho Sponsored by: The FreeBSD Foundation MFC after: 1 week Differential revision: https://reviews.freebsd.org/D29955	2021-05-03 19:13:47 +03:00
Konstantin Belousov	86ffb3d1a0	ELF coredump: define several useful flags for the coredump operations - SVC_ALL request dumping all map entries, including those marked as non-dumpable - SVC_NOCOMPRESS disallows compressing the dump regardless of the sysctl policy - SVC_PC_COREDUMP is provided for future use by userspace core dump request Reviewed by: markj Tested by: pho Sponsored by: The FreeBSD Foundation MFC after: 1 week Differential revision: https://reviews.freebsd.org/D29955	2021-05-03 19:13:47 +03:00
Konstantin Belousov	5bc3c61780	imgact_elf: consistently pass flags from coredump down to helper functions Reviewed by: markj Tested by: pho Sponsored by: The FreeBSD Foundation MFC after: 1 week Differential revision: https://reviews.freebsd.org/D29955	2021-05-03 19:13:47 +03:00
Rick Macklem	4f592683c3	copy_file_range(2): improve copying of a large hole to EOF PR#255523 reported that a file copy for a file with a large hole to EOF on ZFS ran slowly over NFSv4.2. The problem was that vn_generic_copy_file_range() would loop around reading the hole's data and then see it is all 0s. It was coded this way since UFS always allocates a data block near the end of the file, such that a hole to EOF never exists. This patch modifies vn_generic_copy_file_range() to check for a ENXIO returned from VOP_IOCTL(..FIOSEEKDATA..) and handle that case as a hole to EOF. asomers@ confirms that it works for his ZFS test case. PR: 255523 Tested by: asomers Reviewed by: asomers MFC after: 2 weeks Differential Revision: https://reviews.freebsd.org/D30076	2021-05-02 16:04:27 -07:00
Konstantin Belousov	2082565798	O_PATH: disable kqfilter for fifos Filter on fifos is real filter for the object, and not a filesystem events filter like EVFILT_VNODE. Reported by: markj using syzkaller Reviewed by: markj Sponsored by: The FreeBSD Foundation MFC after: 3 days	2021-04-30 17:43:45 +03:00
Mark Johnston	20e3b9d8bd	kasan: Use vm_offset_t for the first parameter to kasan_shadow_map() No functional change intended. Sponsored by: The FreeBSD Foundation	2021-04-29 11:39:02 -04:00
Mateusz Guzik	074abaccfa	cache: remove incomplete lockless lockout support during resize This is already properly handled thanks to 2 step hash replacement.	2021-04-28 19:53:25 +00:00
Mark Johnston	d1e9441583	pipe: Avoid calling selrecord() on a closing pipe pipe_poll() may add the calling thread to the selinfo lists of both ends of a pipe. It is ok to do this for the local end, since we know we hold a reference on the file and so the local end is not closed. It is not ok to do this for the remote end, which may already be closed and have called seldrain(). In this scenario, when the polling thread wakes up, it may end up referencing a freed selinfo. Guard the selrecord() call appropriately. Reviewed by: kib Reported by: syzkaller+KASAN MFC after: 1 week Differential Revision: https://reviews.freebsd.org/D30016	2021-04-28 10:43:29 -04:00
Thomas Munro	3aaaa2efde	poll(2): Add POLLRDHUP. Teach poll(2) to support Linux-style POLLRDHUP events for sockets, if requested. Triggered when the remote peer shuts down writing or closes its end. Reviewed by: kib MFC after: 1 month Differential Revision: https://reviews.freebsd.org/D29757	2021-04-28 23:00:31 +12:00
Mark Johnston	409ab7e109	imgact_elf: Ensure that the return value in parse_notes is initialized parse_notes relies on the caller-supplied callback to initialize "res". Two callbacks are used in practice, brandnote_cb and note_fctl_cb, and the latter fails to initialize res. Fix it. In the worst case, the bug would cause the inner loop of check_note to examine more program headers than necessary, and the note header usually comes last anyway. Reviewed by: kib Reported by: KMSAN MFC after: 1 week Sponsored by: The FreeBSD Foundation Differential Revision: https://reviews.freebsd.org/D29986	2021-04-26 14:53:16 -04:00
Edward Tomasz Napierala	5d1d844a77	kern_linkat: modify to accept AT_ flags instead of FOLLOW/NOFOLLOW This makes this API match other kern_xxxat() functions. Reviewed By: kib Sponsored By: EPSRC Differential Revision: https://reviews.freebsd.org/D29776	2021-04-25 14:13:12 +01:00
Robert Watson	af14713d49	Support run-time configuration of the PIPE_MINDIRECT threshold. PIPE_MINDIRECT determines at what (blocking) write size one-copy optimizations are applied in pipe(2) I/O. That threshold hasn't been tuned since the 1990s when this code was originally committed, and allowing run-time reconfiguration will make it easier to assess whether contemporary microarchitectures would prefer a different threshold. (On our local RPi4 boards, the 8k default would ideally be at least 32k, but it's not clear how generalizable that observation is.) MFC after: 3 weeks Reviewers: jrtc27, arichardson Differential Revision: https://reviews.freebsd.org/D29819	2021-04-24 20:04:28 +01:00
Mark Johnston	8e8f1cc9bb	Re-enable network ioctls in capability mode This reverts a portion of `274579831b` ("capsicum: Limit socket operations in capability mode") as at least rtsol and dhcpcd rely on being able to configure network interfaces while in capability mode. Reported by: bapt, Greg V Sponsored by: The FreeBSD Foundation	2021-04-23 09:22:49 -04:00
Warner Losh	df456a1fcf	newbus: style nit (align comments) Sponsored by: Netflix	2021-04-21 15:37:24 -06:00
Warner Losh	1eebd6158c	newbus: Optimize/Simplify kobj_class_compile_common a little "i" is not used in this loop at all. There's no need to initialize and increment it. Reviewed by: markj@ Sponsored by: Netflix Differential Revision: https://reviews.freebsd.org/D29898	2021-04-21 15:37:24 -06:00
Konstantin Belousov	54f98c4dbf	vn_open_vnode(): handle error when fp == NULL If VOP_ADD_WRITECOUNT() or adv locking failed, so VOP_CLOSE() needs to be called, we cannot use fp fo_close() when there is no fp. This occurs when e.g. kernel code directly calls vn_open() instead of the open(2) syscall. In this case, VOP_CLOSE() can be called directly, after possible lock upgrade. Reported by: nvass@gmx.com PR: 255119 Tested by: pho Sponsored by: The FreeBSD Foundation MFC after: 1 week Differential Revision: https://reviews.freebsd.org/D29830	2021-04-21 18:06:51 +03:00
Konstantin Belousov	ecfbddf0cd	sysctl vm.objects: report backing object and swap use For anonymous objects, provide a handle kvo_me naming the object, and report the handle of the backing object. This allows userspace to deconstruct the shadow chain. Right now the handle is the address of the object in KVA, but this is not guaranteed. For the same anonymous objects, report the swap space used for actually swapped out pages, in kvo_swapped field. I do not believe that it is useful to report full 64bit counter there, so only uint32_t value is returned, clamped to the max. For kinfo_vmentry, report anonymous object handle backing the entry, so that the shadow chain for the specific mapping can be deconstructed. Reviewed by: markj Sponsored by: The FreeBSD Foundation MFC after: 1 week Differential revision: https://reviews.freebsd.org/D29771	2021-04-19 21:32:01 +03:00
Konstantin Belousov	4342ba184c	sysctl_handle_string: do not malloc when SYSCTL_IN cannot fault In particular, this avoids malloc(9) calls when called from early tunable handling, with no working malloc yet. Reported and tested by: mav Sponsored by: The FreeBSD Foundation MFC after: 1 week	2021-04-19 21:32:01 +03:00
Konstantin Belousov	578c26f31c	linkat(2): check NIRES_EMPTYPATH on the first fd arg Reported by: arichardson Reviewed by: markj MFC after: 1 week Differential revision: https://reviews.freebsd.org/D29834	2021-04-19 21:32:01 +03:00
Warner Losh	571a1a64b1	Minor style tidy: if( -> if ( Fix a few 'if(' to be 'if (' in a few places, per style(9) and overwhelming usage in the rest of the kernel / tree. MFC After: 3 days Sponsored by: Netflix	2021-04-18 11:19:15 -06:00
Warner Losh	f1f9870668	Minor style cleanup We prefer 'while (0)' to 'while(0)' according to grep and style(9)'s space after keyword rule. Remove a few stragglers of the latter. Many of these usages were inconsistent within the file. MFC After: 3 days Sponsored by: Netflix	2021-04-18 11:14:17 -06:00
Konstantin Belousov	bbf7a4e878	O_PATH: allow vnode kevent filter on such files if VREAD access is checked as allowed during open Requested by: wulf Reviewed by: markj Tested by: pho Sponsored by: The FreeBSD Foundation MFC after: 1 week Differential revision: https://reviews.freebsd.org/D29323	2021-04-15 12:49:18 +03:00
Konstantin Belousov	f9b923af34	O_PATH: Allow to open symlink When O_NOFOLLOW is specified, namei() returns the symlink itself. In this case, open(O_PATH) should be allowed, to denote the location of symlink itself. Prevent O_EXEC in this case, execve(2) code is not ready to try to execute symlinks. Reported by: wulf Reviewed by: markj Tested by: pho Sponsored by: The FreeBSD Foundation MFC after: 1 week Differential revision: https://reviews.freebsd.org/D29323	2021-04-15 12:49:09 +03:00
Konstantin Belousov	a5970a529c	Make files opened with O_PATH to not block non-forced unmount by only keeping hold count on the vnode, instead of the use count. Reviewed by: markj Tested by: pho Sponsored by: The FreeBSD Foundation MFC after: 1 week Differential revision: https://reviews.freebsd.org/D29323	2021-04-15 12:48:27 +03:00
Konstantin Belousov	8d9ed174f3	open(2): Implement O_PATH Reviewed by: markj Tested by: pho Discussed with: walker.aj325_gmail.com, wulf Sponsored by: The FreeBSD Foundation MFC after: 1 week Differential revision: https://reviews.freebsd.org/D29323	2021-04-15 12:48:24 +03:00
Konstantin Belousov	509124b626	Add AT_EMPTY_PATH for several *at(2) syscalls It is currently allowed to fchownat(2), fchmodat(2), fchflagsat(2), utimensat(2), fstatat(2), and linkat(2). For linkat(2), PRIV_VFS_FHOPEN privilege is required to exercise the flag. It allows to link any open file. Requested by: trasz Tested by: pho, trasz Reviewed by: markj Sponsored by: The FreeBSD Foundation MFC after: 2 weeks Differential revision: https://reviews.freebsd.org/D29111	2021-04-15 12:48:11 +03:00
Konstantin Belousov	437c241d0c	vfs_vnops.c: Make vn_statfile() non-static Reviewed by: markj Tested by: pho Sponsored by: The FreeBSD Foundation MFC after: 1 week Differential revision: https://reviews.freebsd.org/D29323	2021-04-15 12:47:56 +03:00

18717 Commits