freebsd-dev

Author	SHA1	Message	Date
Mateusz Guzik	f1e2cc1c66	vfs: drop dedicated sysinit for mountlist_mtx Sponsored by: Rubicon Communications, LLC ("Netgate")	2021-08-26 20:52:03 +02:00
Mateusz Guzik	0d28d014c8	vfs: refactor kern_unmount Split unmounting by path and id in preparation for other changes. Sponsored by: Rubicon Communications, LLC ("Netgate")	2021-08-26 13:58:28 +02:00
Mateusz Guzik	7b2561b46b	vfs: stop open-coding vfs_getvfs in kern_unmount Sponsored by: Rubicon Communications, LLC ("Netgate")	2021-08-26 11:38:31 +00:00
Mark Johnston	a507a40f3b	fsetown: Simplify error handling No functional change intended. Suggested by: kib Reviewed by: kib MFC after: 1 week Sponsored by: The FreeBSD Foundation Differential Revision: https://reviews.freebsd.org/D31671	2021-08-25 16:20:07 -04:00
Mark Johnston	1d874ba4f8	fsetown: Fix process lookup bugs - pget()/pfind() will acquire the PID hash bucket locks, which are sleepable sx locks, but this means that the sigio mutex cannot be held while calling these functions. Instead, use pget() to hold the process, after which we lock the sigio and proc locks, respectively. - funsetownlst() assumes that processes cannot be registered for SIGIO once they have P_WEXIT set. However, pfind() will happily return exiting processes, breaking the invariant. Add an explicit check for P_WEXIT in fsetown() to fix this. [1] Fixes: `f52979098d` ("Fix a pair of races in SIGIO registration") Reported by: syzkaller [1] Reviewed by: kib MFC after: 1 week Sponsored by: The FreeBSD Foundation Differential Revision: https://reviews.freebsd.org/D31661	2021-08-25 16:18:10 -04:00
Ka Ho Ng	9e202d036d	fspacectl(2): Changes on rmsr.r_offset's minimum value returned rmsr.r_offset now is set to rqsr.r_offset plus the number of bytes zeroed before hitting the end-of-file. After this change rmsr.r_offset no longer contains the EOF when the requested operation range is completely beyond the end-of-file. Instead in such case rmsr.r_offset is equal to rqsr.r_offset. Callers can obtain the number of bytes zeroed by subtracting rqsr.r_offset from rmsr.r_offset. Sponsored by: The FreeBSD Foundation Reviewed by: kib Differential Revision: https://reviews.freebsd.org/D31677	2021-08-26 00:03:37 +08:00
Ka Ho Ng	5c1428d2c4	uipc_shm: Handle offset on shm_size as if it is beyond shm_size This avoids any unnecessary work in such case. Sponsored by: The FreeBSD Foundation Reviewed by: markj, kib Differential Revision: https://reviews.freebsd.org/D31655	2021-08-24 23:49:18 +08:00
Ka Ho Ng	1eaa36523c	fspacectl(2): Clarifies the return values rmacklem@ spotted two things in the system call: - Upon returning from a successful operation, vop_stddeallocate can update rmsr.r_offset to a value greater than file size. This behavior, although being harmless, can be confusing. - The EINVAL return value for rqsr.r_offset + rqsr.r_len > OFF_MAX is undocumented. This commit has the following changes: - vop_stddeallocate and shm_deallocate to bound the affected area further by the file size. - The EINVAL case for rqsr.r_offset + rqsr.r_len > OFF_MAX is documented. - The fspacectl(2), vn_deallocate(9) and VOP_DEALLOCATE(9)'s return len is explicitly documented to be the value 0, and the return offset is restricted to be the smallest of off + len and current file size suggested by kib@. This semantic allows callers to interact better with potential file size growth after the call. Sponsored by: The FreeBSD Foundation Reviewed by: imp, kib Differential Revision: https://reviews.freebsd.org/D31604	2021-08-24 17:08:28 +08:00
Mateusz Guzik	b65ad70195	cache: retire cache_fast_revlookup sysctl Sponsored by: Rubicon Communications, LLC ("Netgate")	2021-08-23 15:31:44 +02:00
Mateusz Guzik	7fd856ba07	vfs: s/__unused/__diagused in crossmp_* Sponsored by: Rubicon Communications, LLC ("Netgate")	2021-08-23 15:23:42 +02:00
Mateusz Guzik	614faa3269	vfs: fix cache-related LOR introduced in the previous change Reported by: kib Sponsored by: Rubicon Communications, LLC ("Netgate")	2021-08-22 16:20:07 +00:00
Thomas Munro	f30a1ae8d5	lio_listio(2): Allow LIO_READV and LIO_WRITEV. Allow multiple vector IOs to be started with one system call. aio_readv() and aio_writev() already used these opcodes under the covers. This commit makes them available to user space. Being non-standard extensions, they're only visible if __BSD_VISIBLE is defined, like the functions. Reviewed by: asomers, kib MFC after: 2 weeks Differential Revision: https://reviews.freebsd.org/D31627	2021-08-22 23:00:42 +12:00
Jason A. Harmening	e81e71b0e9	Use interruptible wait for blocking recursive unmounts Now that we allow recursive unmount attempts to be abandoned upon exceeding the retry limit, we should avoid leaving an unkillable thread when a synchronous unmount request was issued against the base filesystem. Reviewed by: kib (earlier revision), mkusick Differential Revision: https://reviews.freebsd.org/D31450	2021-08-20 13:21:56 -07:00
Jason A. Harmening	a8c732f4e5	VFS: add retry limit and delay for failed recursive unmounts A forcible unmount attempt may fail due to a transient condition, but it may also fail due to some issue in the filesystem implementation that will indefinitely prevent successful unmount. In such a case, the retry logic in the recursive unmount facility will cause the deferred unmount taskqueue to execute constantly. Avoid this scenario by imposing a retry limit, with a default value of 10, beyond which the recursive unmount facility will emit a log message and give up. Additionally, introduce a grace period, with a default value of 1s, between successive unmount retries on the same mount. Create a new sysctl node, vfs.deferred_unmount, to export the total number of failed recursive unmount attempts since boot, and to allow the retry limit and retry grace period to be tuned. Reviewed by: kib (earlier revision), mkusick Differential Revision: https://reviews.freebsd.org/D31450	2021-08-20 13:20:50 -07:00
Mateusz Guzik	5d75ffdd0c	vfs: remove an unused variable from nameicap_tracker_add Reported by cc --analyze Sponsored by: Rubicon Communications, LLC ("Netgate")	2021-08-20 17:52:24 +00:00
Mateusz Guzik	dbc689cdef	vfs: use vn_lock_pair to avoid establishing an ordering on mount This fixes some of the LORs seen on mount/unmount. Complete fix will require taking care of unmount as well. Reviewed by: kib Tested by: pho (previous version) Sponsored by: Rubicon Communications, LLC ("Netgate") Differential Revision: https://reviews.freebsd.org/D31611	2021-08-20 17:52:24 +00:00
Kyle Evans	d7e1bdfeba	uipc: avoid circular pr_{slow,fast}timos domain_init() gets reinvoked for each vnet on a system, so we must not alter global state. Practically speaking, we were creating circular lists and tying up a softclock thread into an infinite loop. The breakage here was most easily observed by simply creating a jail in a new vnet and watching the system suddenly become erratic. Reported by: markj Fixes: `e0a17c3f06` ("uipc: create dedicated lists for fast ...") Pointy hat: kevans	2021-08-18 12:46:54 -05:00
Kristof Provost	07edc89c39	witness: remove ifnet_rw This lock no longer exists. It was removed in `a60100fdfc` (if: Remove ifnet_rwlock, 2020-11-25) Reviewed by: mjg Pointed out by: Dheeraj Kandula <dheerajk@netapp.com> Differential Revision: https://reviews.freebsd.org/D31585	2021-08-18 08:51:26 +02:00
Kristof Provost	a051ca72e2	Introduce m_get3() Introduce m_get3() which is similar to m_get2(), but can allocate up to MJUM16BYTES bytes (m_get2() can only allocate up to MJUMPAGESIZE). This simplifies the bpf improvement in `f13da24715`. Suggested by: glebius Differential Revision: https://reviews.freebsd.org/D31455	2021-08-18 08:48:27 +02:00
Mateusz Guzik	e0a17c3f06	uipc: create dedicated lists for fast and slow timeout callbacks This avoids having to walk all possible protocols only to check if they have one (vast majority does not). Original patch by kevans@. Reviewed by: kevans Sponsored by: Rubicon Communications, LLC ("Netgate")	2021-08-17 21:56:05 +02:00
Mark Johnston	c4feb1ab0a	sigtimedwait: Use a unique wait channel for sleeping When a sigtimedwait(2) caller goes to sleep, it uses a wait channel of p->p_sigacts with the proc lock as the interlock. However, p_sigacts can be shared between processes if a child is created with rfork(RFSIGSHARE \| RFPROC). Thus we can end up with two threads sleeping on the same wait channel using different locks, which is not permitted. Fix the problem simply by using a process-unique wait channel, following the example of sigsuspend. The actual wait channel value is irrelevant here, sleeping threads are awoken using sleepq_abort(). Reported by: syzbot+8c417afabadb50bb8827@syzkaller.appspotmail.com Reported by: syzbot+1d89fc2a9ef92ef64fa8@syzkaller.appspotmail.com Reviewed by: kib MFC after: 1 week Sponsored by: The FreeBSD Foundation Differential Revision: https://reviews.freebsd.org/D31563	2021-08-16 15:11:15 -04:00
John Baldwin	d16cb228c1	ktls: Fix accounting for TLS 1.0 empty fragments. TLS 1.0 empty fragment mbufs have no payload and thus m_epg_npgs is zero. However, these mbufs need to occupy a "unit" of space for the purposes of M_NOTREADY tracking similar to regular mbufs. Previously this was done for the page count returned from ktls_frame() and passed to ktls_enqueue() as well as the page count passed to pru_ready(). However, sbready() and mb_free_notready() only use m_epg_nrdy to determine the number of "units" of space in an M_EXT mbuf, so when a TLS 1.0 fragment was marked ready it would mark one unit of the next mbuf in the socket buffer as ready as well. To fix, set m_epg_nrdy to 1 for empty fragments. This actually simplifies the code as now only ktls_frame() has to handle TLS 1.0 fragments explicitly and the rest of the KTLS functions can just use m_epg_nrdy. Reviewed by: gallatin MFC after: 2 weeks Sponsored by: Netflix Differential Revision: https://reviews.freebsd.org/D31536	2021-08-16 10:42:46 -07:00
Konstantin Belousov	81b895a95b	pipe_paircreate(): do not leak pipepair memory on error Reported and tested by: pho Sponsored by: The FreeBSD Foundation MFC after: 3 days	2021-08-16 17:08:44 +03:00
Kyle Evans	29e400e994	domain: make it safer to add domains post-domainfinalize I can see two concerns for adding domains after domainfinalize: 1.) The slow/fast callouts have already been setup. 2.) Userland could create a socket while we're in the middle of initialization. We can address #1 fairly easily by tracking whether the domain's been initialized for at least the default vnet. There are still some concerns about the callbacks being invoked while a vnet is in the process of being created/destroyed, but this is a pre-existing issue that the callbacks must coordinate anyways. We should also address #2, but technically this has been an issue anyways because we don't assert on post-domainfinalize additions; we don't seem to hit it in practice. Future work can fix that up to make sure we don't find partially constructed domains, but care must be taken to make sure that at least, e.g., the usages of pffindproto in ip_input.c can still find them. Differential Revision: https://reviews.freebsd.org/D25459	2021-08-16 00:59:56 -05:00
Kyle Evans	239aebee61	domain: give domains a chance to probe for availability This gives any given domain a chance to indicate that it's not actually supported on the current system. If dom_probe isn't supplied, we assume the domain is universally applicable as most of them are. Keeping fully-initialized and registered domains around that physically can't work on a large majority of FreeBSD deployments is sub-optimal and leads to errors that aren't consistent with the reality of why the socket can't be created (e.g. ESOCKTNOSUPPORT) because such scenario has to be caught upon pru_attach, at which point kicking back the more-appropriate EAFNOSUPPORT would seem weird. The initial consumer of this will be hvsock, which is only available on HyperV guests. Reviewed by: cem (earlier version), bcr (manpages) Differential Revision: https://reviews.freebsd.org/D25062	2021-08-16 00:59:56 -05:00
Konstantin Belousov	9446d9e88f	fstatat(2): handle non-vnode file descriptors for AT_EMPTY_PATH Set NIRES_EMPTYPATH earlier, to have use of EMPTYPATH recorded even if we are going to return error. When namei_setup() refused to accept dirfd, which is not of the vnode type, and indicated by ENOTDIR error return, fall back to kern_fstat(dirfd). Reported by: dchagin Reviewed by: markj Sponsored by: The FreeBSD Foundation MFC after: 1 week Differential revision: https://reviews.freebsd.org/D31530	2021-08-14 00:17:18 +03:00
Ka Ho Ng	454bc887f2	uipc_shm: Implements fspacectl(2) support This implements fspacectl(2) support on shared memory objects. The semantic of SPACECTL_DEALLOC is equivalent to clearing the backing store and free the pages within the affected range. If the call succeeds, subsequent reads on the affected range return all zero. tests/sys/posixshm/posixshm_tests.c is expanded to include a fspacectl(2) functional test. Sponsored by: The FreeBSD Foundation Reviewed by: kevans, kib Differential Revision: https://reviews.freebsd.org/D31490	2021-08-12 23:04:18 +08:00
Ka Ho Ng	a638dc4ebc	vfs: Add ioflag to VOP_DEALLOCATE(9) The addition of ioflag allows callers passing IO_SYNC/IO_DATASYNC/IO_DIRECT down to the file system implementation. The vop_stddeallocate fallback implementation is updated to pass the ioflag to the file system implementation. vn_deallocate(9) internally is also changed to pass ioflag to the VOP_DEALLOCATE call. Sponsored by: The FreeBSD Foundation Reviewed by: kib, markj Differential Revision: https://reviews.freebsd.org/D31500	2021-08-12 23:03:49 +08:00
Ka Ho Ng	c15384f896	vfs: Add get_write_ioflag helper to calculate ioflag Converted vn_write to use this helper. Sponsored by: The FreeBSD Foundation MFC after: 3 days Reviewed by: kib Differential Revision: https://reviews.freebsd.org/D31513	2021-08-12 17:35:34 +08:00
Dmitry Chagin	71854d9b2b	fork: Remove the unnecessary spaces. MFC after: 2 weeks	2021-08-12 11:58:17 +03:00
Dmitry Chagin	de8374df28	fork: Allow ABI to specify fork return values for child. At least the Linux x86 ABI does not use the carry bit and expects the dx register to be preserved. For this, add a new sv_set_fork_retval hook and call it from cpu_fork(). Add a short comment about touching dx in x86_set_fork_retval(), for more details see phab comments from kib@ and imp@. Reviewed by: kib Differential revision: https://reviews.freebsd.org/D31472 MFC after: 2 weeks	2021-08-12 11:45:25 +03:00
Eric van Gyzen	13a58148de	netdump: send key before dump, in case dump fails Previously, if an encrypted netdump failed, such as due to a timeout or network failure, the key was not saved, so a partial dump was completely useless. Send the key first, so the partial dump can be decrypted, because even a partial dump can be useful. Reviewed by: bdrewery, markj MFC after: 1 week Sponsored by: Dell EMC Isilon Differential Revision: https://reviews.freebsd.org/D31453	2021-08-11 10:54:56 -05:00
Mark Johnston	10a8e93da1	kmsan: Export kmsan_mark_mbuf() and kmsan_mark_bio() Sponsored by: The FreeBSD Foundation	2021-08-11 16:33:41 -04:00
Andrew Gallatin	95c51fafa4	ktls: Init reset tag task for cloned sessions When cloning a ktls session (which is needed when we need to switch output NICs for a NIC TLS session), we need to also init the reset task, like we do when creating a new tls session. Reviewed by: jhb Sponsored by: Netflix	2021-08-11 14:06:43 -04:00
Mitchell Horne	4ccaa87f69	kdb: Handle process enumeration before procinit() Make kdb_thr_first() and kdb_thr_next() return sane values if the allproc list and pidhashtbl haven't been initialized yet. This can happen if the debugger is entered very early on, for example with the '-d' boot flag. This allows remote gdb to attach at such a time, and fixes some ddb commands like 'show threads'. Be explicit about the static initialization of these variables. This part has no functional change. Reviewed by: markj, imp (previous version) MFC after: 3 days Differential Revision: https://reviews.freebsd.org/D31495	2021-08-11 14:44:22 -03:00
Ka Ho Ng	4a9b832a2a	vfs: Rename ioflg to ioflag in vn_deallocate This includes a style fix around ioflag checking as well. Sponsored by: The FreeBSD Foundation Reviewed by: kib, bcr Differential Revision: https://reviews.freebsd.org/D31505	2021-08-11 17:45:47 +08:00
Alexander Motin	67f508db84	Mark some sysctls as CTLFLAG_MPSAFE. MFC after: 2 weeks	2021-08-10 22:18:26 -04:00
Mark Johnston	100949103a	uma: Add KMSAN hooks For now, just hook the allocation path: upon allocation, items are marked as initialized (absent M_ZERO). Some zones are exempted from this when it would otherwise raise false positives. Use kmsan_orig() to update the origin map for UMA and malloc(9) allocations. This allows KMSAN to print the return address when an uninitialized UMA item is implicated in a report. For example: panic: MSan: Uninitialized UMA memory from m_getm2+0x7fe Sponsored by: The FreeBSD Foundation	2021-08-10 21:27:54 -04:00
Mark Johnston	693c9516fa	busdma: Add KMSAN integration Sanitizer instrumentation of course cannot automatically update shadow state when devices write to host memory. KMSAN thus hooks into busdma, both to update shadow state after a device write, and to verify that the kernel does not publish uninitialized bytes to devices. To implement this, when KMSAN is configured, each dmamap embeds a memory descriptor describing the region currently loaded into the map. bus_dmamap_sync() uses the operation flags to determine whether to validate the loaded region or to mark it as initialized in the shadow map. Note that in cases where the amount of data written is less than the buffer size, the entire buffer is marked initialized even when it is not. For example, if a NIC writes a 128B packet into a 2KB buffer, the entire buffer will be marked initialized, but subsequent accesses past the first 128 bytes are likely caused by bugs. Reviewed by: kib Sponsored by: The FreeBSD Foundation Differential Revision: https://reviews.freebsd.org/D31338	2021-08-10 21:27:54 -04:00
Mark Johnston	b0f71f1bc5	amd64: Add MD bits for KMSAN Interrupt and exception handlers must call kmsan_intr_enter() prior to calling any C code. This is because the KMSAN runtime maintains some TLS in order to track initialization state of function parameters and return values across function calls. Then, to ensure that this state is kept consistent in the face of asynchronous kernel-mode exceptions, the runtime uses a stack of TLS blocks, and kmsan_intr_enter() and kmsan_intr_leave() push and pop that stack, respectively. Use these functions in amd64 interrupt and exception handlers. Note that handlers for user->kernel transitions need not be annotated. Also ensure that trap frames pushed by the CPU and by handlers are marked as initialized before they are used. Reviewed by: kib Sponsored by: The FreeBSD Foundation Differential Revision: https://reviews.freebsd.org/D31467	2021-08-10 21:27:53 -04:00
Mark Johnston	8978608832	amd64: Populate the KMSAN shadow maps and integrate with the VM - During boot, allocate PDP pages for the shadow maps. The region above KERNBASE is currently not shadowed. - Create a dummy shadow for the vm page array. For now, this array is not protected by the shadow map to help reduce kernel memory usage. - Grow shadows when growing the kernel map. - Increase the default kernel stack size when KMSAN is enabled. As with KASAN, sanitizer instrumentation appears to create stack frames large enough that the default value is not sufficient. - Disable UMA's use of the direct map when KMSAN is configured. KMSAN cannot validate the direct map. - Disable unmapped I/O when KMSAN is configured. - Lower the limit on paging buffers when KMSAN is configured. Each buffer has a static MAXPHYS-sized allocation of KVA, which in turn eats 2*MAXPHYS of space in the shadow map. Reviewed by: alc, kib Sponsored by: The FreeBSD Foundation Differential Revision: https://reviews.freebsd.org/D31295	2021-08-10 21:27:53 -04:00
Mark Johnston	5dda15adbc	kern: Ensure that thread-local KMSAN state is available Sponsored by: The FreeBSD Foundation	2021-08-10 21:27:53 -04:00
Mark Johnston	a422084abb	Add the KMSAN runtime KMSAN enables the use of LLVM's MemorySanitizer in the kernel. This enables precise detection of uses of uninitialized memory. As with KASAN, this feature has substantial runtime overhead and is intended to be used as part of some automated testing regime. The runtime maintains a pair of shadow maps. One is used to track the state of memory in the kernel map at bit-granularity: a bit in the kernel map is initialized when the corresponding shadow bit is clear, and is uninitialized otherwise. The second shadow map stores information about the origin of uninitialized regions of the kernel map, simplifying debugging. KMSAN relies on being able to intercept certain functions which cannot be instrumented by the compiler. KMSAN thus implements interceptors which manually update shadow state and in some cases explicitly check for uninitialized bytes. For instance, all calls to copyout() are subject to such checks. The runtime exports several functions which can be used to verify the shadow map for a given buffer. Helpers provide the same functionality for a few structures commonly used for I/O, such as CAM CCBs, BIOs and mbufs. These are handy when debugging a KMSAN report whose proximate and root causes are far away from each other. Obtained from: NetBSD Sponsored by: The FreeBSD Foundation	2021-08-10 21:27:53 -04:00
Mark Johnston	eca9ac5a32	vfs: Avoid a comparison with an uninitialized field in setutimes() Some filesystems, e.g., devfs, do not populate va_birthtime in their GETATTR implementations. To handle this, make sure that va_birthtime is initialized to the quasi-standard value of { VNOVAL, 0 } before calling VOP_GETATTR. Reported by: KMSAN Reviewed by: kib MFC after: 1 week Sponsored by: The FreeBSD Foundation Differential Revision: https://reviews.freebsd.org/D31468	2021-08-09 13:27:20 -04:00
Alexander Motin	696fca3fd4	Optimize res_find(). When the device name is provided, we can simply run strncmp() for each line to quickly skip unrelated ones, that is much faster than sscanf() and only then strcmp(). MFC after: 2 weeks	2021-08-08 21:54:49 -04:00
Ed Maste	9feff969a0	Remove "All Rights Reserved" from FreeBSD Foundation sys/ copyrights These ones were unambiguous cases where the Foundation was the only listed copyright holder (in the associated license block). Sponsored by: The FreeBSD Foundation	2021-08-08 10:42:24 -04:00
Mateusz Guzik	b30e7cb7fa	cache: add OPENREAD and OPENWRITE to fast path lookup	2021-08-07 13:02:38 +02:00
Rick Macklem	c18c74a87c	namei: Add cn_flags bits for OPENREAD and OPENWRITE VOP_LOOKUP() is called with cn_flags bits ISLASTCN and ISOPEN to indicate that the lookup is for the last component of a pathname when doing open. If the cn_flags also indicates if the open is for Reading, Writing or Both, the NFSv4 client can do an NFSv4 Open operation in the same compound RPC as Lookup, often avoiding the additional Open RPC now done when VOP_OPEN() is called. This patch defines two new cn_flags bits called OPENREAD and OPENWRITE and sets these in open2nameif() based on FREAD, FWRITE flag bits. This will allow a subsequent patch to the NFSv4 client to do the Open operation in the same RPC as Lookup. Reviewed by: kib Differential Revision: https://reviews.freebsd.org/D31431	2021-08-06 18:41:11 -07:00
Andrew Gallatin	09066b9866	ktls: Use the new PNOLOCK flag Use the new PNOLOCK flag to tsleep() to indicate that we are managing potential races, and don't need to sleep with a lock, or have a backstop timeout. Reviewed by: jhb Sponsored by: Netflix	2021-08-05 17:19:12 -04:00
Andrew Gallatin	1b97a054f3	tsleep: Add a PNOLOCK flag Add a PNOLOCK flag so that, in the rare circumstance where wakeup races are externally mitigated, tsleep() can be called with a sleep time of 0 without triggering an assertion. Reviewed by: jhb Sponsored by: Netflix	2021-08-05 17:16:30 -04:00
Andrew Gallatin	2694c869ff	ktls: fix a panic with INVARIANTS `98215005b7` introduced a new thread that uses tsleep(..0) to sleep forever. This hit an assert due to sleeping with a 0 timeout. So spell "forever" using SBT_MAX instead, which does not trigger the assert. Pointy hat to: gallatin Pointed out by: emaste Sponsored by: Netflix	2021-08-05 13:09:06 -04:00
Ka Ho Ng	da9fe3529b	Regen after `0dc332bff2`	2021-08-05 23:22:02 +08:00
Ka Ho Ng	0dc332bff2	Add fspacectl(2), vn_deallocate(9) and VOP_DEALLOCATE(9). fspacectl(2) is a system call to provide space management support to userspace applications. VOP_DEALLOCATE(9) is a VOP call to perform the deallocation. vn_deallocate(9) is a public KPI for kmods' use. The purpose of proposing a new system call, a KPI and a VOP call is to allow bhyve or other hypervisor monitors to emulate the behavior of SCSI UNMAP/NVMe DEALLOCATE on a plain file. fspacectl(2) comprises of cmd and flags parameters to specify the space management operation to be performed. Currently cmd has to be SPACECTL_DEALLOC, and flags has to be 0. fo_fspacectl is added to fileops. VOP_DEALLOCATE(9) is added as a new VOP call. A trivial implementation of VOP_DEALLOCATE(9) is provided. Sponsored by: The FreeBSD Foundation Reviewed by: kib Differential Revision: https://reviews.freebsd.org/D28347	2021-08-05 23:20:42 +08:00
Ka Ho Ng	abbb57d5a6	vfs: Introduce vn_bmap_seekhole_locked() vn_bmap_seekhole_locked() is factored out version of vn_bmap_seekhole(). This variant requires shared vnode lock being held around the call. Sponsored by: The FreeBSD Foundation Reviewed by: kib, markj Differential Revision: https://reviews.freebsd.org/D31404	2021-08-05 22:52:26 +08:00
Andrew Gallatin	98215005b7	ktls: start a thread to keep the 16k ktls buffer zone populated Ktls recently received an optimization where we allocate 16k physically contiguous crypto destination buffers. This provides a large (more than 5%) reduction in CPU use in our workload. However, after several days of uptime, the performance benefit disappears because we have frequent allocation failures from the ktls buffer zone. It turns out that when load drops off, the ktls buffer zone is trimmed, and some 16k buffers are freed back to the OS. When load picks back up again, re-allocating those 16k buffers fails after some number of days of uptime because physical memory has become fragmented. This causes allocations to fail, because they are intentionally done without M_NORECLAIM, so as to avoid pausing the ktls crypto work thread while the VM system defragments memory. To work around this, this change starts one thread per VM domain to allocate ktls buffers with M_NORECLAIM, as we don't care if this thread is paused while memory is defragged. The thread then frees the buffers back into the ktls buffer zone, thus allowing future allocations to succeed. Note that waking up the thread is intentionally racy, but neither of the races really matter. In the worst case, we could have either spurious wakeups or we could have to wait 1 second until the next rate-limited allocation failure to wake up the thread. This patch has been in use at Netflix on a handful of servers, and seems to fix the issue. Differential Revision: https://reviews.freebsd.org/D31260 Reviewed by: jhb, markj, (jtl, rrs, and dhw reviewed earlier version) Sponsored by: Netflix	2021-08-05 10:19:12 -04:00
John Baldwin	c51e4962a3	Document kern.log_wakeups_per_second. PR: 148680 MFC after: 2 weeks	2021-08-04 11:50:34 -07:00
Konstantin Belousov	0ef5eee9d9	Add vn_lktype_write() and remove repetitive code that calculates vnode locking type for write. Reviewed by: khng, markj Sponsored by: The FreeBSD Foundation MFC after: 1 week Differential revision: https://reviews.freebsd.org/D31405	2021-08-04 19:40:13 +03:00
Kyle Evans	04cc0c393c	malloc(9): provide missing malloc_aligned implementation Pointy hat: kevans Fixes: `6162cf885c` ("malloc(9): Document/complete aligned variants")	2021-08-02 21:12:39 -05:00
Eric van Gyzen	428624130a	Fix lockstat:::thread-spin dtrace probe with LOCK_PROFILING The spinning start time is missing from the calculation due to a misplaced #endif. Return the #endif where it's supposed to be. Submitted by: Alexander Alexeev <aalexeev@isilon.com> Reviewed by: bdrewery, mjg MFC after: 1 week Sponsored by: Dell EMC Isilon Differential Revision: https://reviews.freebsd.org/D31384	2021-08-02 14:44:23 -05:00
Adam Fenn	8ca384eb1d	devclass_alloc_unit: move "at" hint test to after device-in-use test Only perform this expensive operation when the unit number is a potential candidate (i.e. not already in use), thereby reducing device scan time on systems with many devices, unit numbers, and drivers. Sponsored by: NetApp, Inc. Sponsored by: Klara, Inc. X-NetApp-PR: #61 Differential Revision: https://reviews.freebsd.org/D31381	2021-08-02 11:27:17 -05:00
Alexander Motin	ca34553b6f	sched_ule(4): Pre-seed sched_random(). I don't think it changes anything, but why not. While there, make cpu_search_highest() use all 8 lower load bits for noise, since it does not use cs_prefer and the code is not shared with cpu_search_lowest() any more. MFC after: 1 month	2021-08-02 10:55:28 -04:00
Alexander Motin	8bb173fb5b	sched_ule(4): Use trylock when stealing load. On some load patterns it is possible for several CPUs to try steal thread from the same CPU despite randomization introduced. It may cause significant lock contention when holding one queue lock idle thread tries to acquire another one. Use of trylock on the remote queue allows both reduce the contention and handle lock ordering easier. If we can't get lock inside tdq_trysteal() we just return, allowing tdq_idled() handle it. If it happens in tdq_idled(), then we repeat search for load skipping this CPU. On 2-socket 80-thread Xeon system I am observing dramatic reduction of the lock spinning time when doing random uncached 4KB reads from 12 ZVOLs, while IOPS increase from 327K to 403K. MFC after: 1 month	2021-08-01 22:42:01 -04:00
Alexander Motin	2668bb2add	sched_ule(4): Reduce duplicate search for load. When sched_highest() called for some CPU group returns nothing, idle thread calls it for the parent CPU group. But the parent CPU group also includes the CPU group we've just searched, and unless there is a race going on, it is unlikely we find anything new this time. Avoid the double search in case of parent group having only two sub- groups (the most prominent case). Instead of escalating to the parent group run the next search over the sibling subgroup and escalate two levels up after if that fail too. In case of more than two siblings the difference is less significant, while searching the parent group can result in better decision if we find several candidate CPUs. On 2-socket 40-core Xeon system I am measuring ~25% reduction of CPU time spent inside cpu_search_highest() in both SMT (2x20x2) and non- SMT (2x20) cases. MFC after: 1 month	2021-08-01 22:07:51 -04:00
Mark Johnston	6f179693c5	Add interceptors for atomic operations on userspace memory Implement them for KASAN. KCSAN interceptors are left unimplemented for now. MFC after: 2 weeks Sponsored by: The FreeBSD Foundation	2021-07-29 21:14:36 -04:00
Mark Johnston	a90d053b84	Simplify kernel sanitizer interceptors KASAN and KCSAN implement interceptors for various primitive operations that are not instrumented by the compiler. KMSAN requires them as well. Rather than adding new cases for each sanitizer which requires interceptors, implement the following protocol: - When interceptor definitions are required, define SAN_NEEDS_INTERCEPTORS and SANITIZER_INTERCEPTOR_PREFIX. - In headers that declare functions which need to be intercepted by a sanitizer runtime, use SANITIZER_INTERCEPTOR_PREFIX to provide declarations. - When SAN_RUNTIME is defined, do not redefine the names of intercepted functions. This is typically the case in files which implement sanitizer runtimes but is also needed in, for example, files which define ifunc selectors for intercepted operations. MFC after: 2 weeks Sponsored by: The FreeBSD Foundation	2021-07-29 21:13:32 -04:00
Mark Johnston	9e575fadf4	link_elf_obj: Invoke fini callbacks This is required for KASAN: when a module is unloaded, poisoned regions (e.g., pad areas between global variables) are left as such, so if they are reused as KLDs are loaded, false positives can arise. Reported by: pho, Jenkins Reviewed by: kib MFC after: 2 weeks Sponsored by: The FreeBSD Foundation Differential Revision: https://reviews.freebsd.org/D31339	2021-07-29 09:46:25 -04:00
Dmitry Chagin	9e32efa79b	umtx: Split do_unlock_pi on two counterparts. The umtx_pi_drop() will be used by Linux emulation layer. Reviewed by: kib Differential Revision: https://reviews.freebsd.org/D31238 MFC after: 2 weeks	2021-07-29 12:47:39 +03:00
Dmitry Chagin	09f55e6002	umtx: Expose some of the pi umtx structures and API to the rest of the kernel. Differential Revision: https://reviews.freebsd.org/D31237 MFC after: 2 weeks	2021-07-29 12:46:58 +03:00
Dmitry Chagin	8e4d22c01d	umtx: Add umtxq_requeue Linux emulation layer extension. Reviewed by: kib Differential Revision: https://reviews.freebsd.org/D31235 MFC after: 2 weeks	2021-07-29 12:43:07 +03:00
Dmitry Chagin	7caa29115b	umtx: Add bitset conditional wakeup functionality. The bitset is a Linux emulation layer extension. This 32-bit mask, in which at least one bit must be set, is used to select which threads should be woken up. The bitset is stored in the umtx_q structure, which is used to enqueue the waiter into the umtx waitqueue. Put the bitset into the hole, that appeared on LP64 due to data alignment, to prevent the growth of the struct umtx_q. Reviewed by: kib Differential Revision: https://reviews.freebsd.org/D31234 MFC after: 2 weeks	2021-07-29 12:42:49 +03:00
Dmitry Chagin	1fdcc87cfd	umtx: Expose some of the umtx structures and API to the rest of the kernel. Differential Revision: https://reviews.freebsd.org/D31233 MFC after: 2 weeks	2021-07-29 12:42:17 +03:00
Dmitry Chagin	307a3dd35c	umtx: Expose struct abs_timeout to the rest of the kernel. Add umtx_ prefix to all abs_timeout facility and add declaration for it. For consistency with others abs_timeout mark inline abs_timeout_init2. Reviewed by: kib Differential Revision: https://reviews.freebsd.org/D31249 MFC after: 2 weeks	2021-07-29 12:41:58 +03:00
Dmitry Chagin	af29f39958	umtx: Split umtx.h on two counterparts. To prevent umtx.h polluting by future changes split it on two headers: umtx.h - ABI header for userspace; umtxvar.h - the kernel stuff. While here fix umtx_key_match style. Reviewed by: kib Differential Revision: https://reviews.freebsd.org/D31248 MFC after: 2 weeks	2021-07-29 12:41:29 +03:00
Kyle Evans	e3707726c1	kern: remove deprecated makesyscalls.sh makesyscalls was rewritten in Lua and introduced in `d3276301ab`. In the time since, no objections have risen and a warning was introduced long ago on invocation of makesyscalls.sh that it would be removed before FreeBSD 13. Belatedly follow through on that.	2021-07-28 22:22:23 -05:00
Alexander Motin	aefe0a8c32	Refactor/optimize cpu_search_(). Remove cpu_search_both(), unused for many years. Without it there is less sense for the trick of compiling common cpu_search() into separate cpu_search_lowest() and cpu_search_highest(), so split them completely, making code more readable. While there, split iteration over children groups and CPUs, complicating code for very small deduplication. Stop passing cpuset_t arguments by value and avoid some manipulations. Since MAXCPU bump from 64 to 256, what was a single register turned into 32-byte memory array, requiring memory allocation and accesses. Splitting struct cpu_search into parameter and result parts allows to even more reduce stack usage, since the first can be passed through on recursion. Remove CPU_FFS() from the hot paths, precalculating first and last CPU for each CPU group in advance during initialization. Again, it was not a problem for 64 CPUs before, but for 256 FFS needs much more code. With these changes on 80-thread system doing ~260K uncached ZFS reads per second I observe ~30% reduction of time spent in cpu_search_(). MFC after: 1 month	2021-07-28 22:00:29 -04:00
Warner Losh	824897a3ae	genoffset: simplify and rewrite in sh genoffset used the fully generic ASSYM macro to generate the offsets needed for the thread_lite structure. However, since these are offsets into a structure, they will always be necessarily small and positive. As such, just create a simple character array of the right size and use a naming convention such that we can recover the field name, structure name and type. Use nm -t d and sort -n to sort these into order, then loop over the results to generate the thread_lite structure. MFC After: 2 weeks Reviewed by: kib, markj (earlier versions) Sponsored by: Netflix Differential Revision: https://reviews.freebsd.org/D31203	2021-07-28 13:50:09 -06:00
Warner Losh	46dd3ef033	genassym.sh: Fix two minor issues found by shellcheck o Remove redundant $ in $(( )) expression. o Quote arg passed to work so paths with spaces, etc will work. MFC After: 2 weeks Reviewed by: kib Sponsored by: Netflix Differential Revision: https://reviews.freebsd.org/D31335	2021-07-28 13:49:16 -06:00
Roy Marples	7045b1603b	socket: Implement SO_RERROR SO_RERROR indicates that receive buffer overflows should be handled as errors. Historically receive buffer overflows have been ignored and programs could not tell if they missed messages or messages had been truncated because of overflows. Since programs historically do not expect to get receive overflow errors, this behavior is not the default. This is really really important for programs that use route(4) to keep in sync with the system. If we lose a message then we need to reload the full system state, otherwise the behaviour from that point is undefined and can lead to chasing bogus bug reports. Reviewed by: philip (network), kbowling (transport), gbe (manpages) MFC after: 2 weeks Differential Revision: https://reviews.freebsd.org/D26652	2021-07-28 09:35:09 -07:00
Konstantin Belousov	273728b125	Regen	2021-07-28 13:21:22 +03:00
Konstantin Belousov	9b6b793bd7	Revert most of `ce42e79310` to restore ABI compatibility for pre-10.x binaries. It restores _umtx_lock() and _umtx_unlock() syscalls, and UMTX_OP_LOCK/UMTX_OP_UNLOCK umtx_op(2) operations. UMUTEX_ERROR_CHECK flag is left out for now, I do not think it makes a difference. PR: 218571 Reviewed by: brooks (previous version) Sponsored by: The FreeBSD Foundation MFC after: 1 week Differential revision: https://reviews.freebsd.org/D31220	2021-07-28 13:21:12 +03:00
John Baldwin	be79f30d6c	m_dup: Handle unmapped mbufs as an input mbuf. Use m_copydata() instead of a direct bcopy() when copying data out of a source mbuf into a newly-allocated mbuf. PR: 256610 Reported by: Niels Bakker <niels=freebsd@bakker.net> Reviewed by: markj MFC after: 2 weeks	2021-07-26 14:09:16 -07:00
Jason A. Harmening	2bc16e8aaf	VFS: remove MNTK_MARKER We no longer allow upper filesystems to be unregistered from the base mount while vfs_notify_upper() or any other upper operation is pending. New upper mounts can still be registered during this period, but they will be added at the end of the upper mount tailq. We therefore no longer need to allocate marker nodes during vfs_notify_upper() to keep our place in the iteration. Reviewed by: kib, mckusick Tested by: pho Differential Revision: https://reviews.freebsd.org/D31016	2021-07-24 12:52:32 -07:00
Jason A. Harmening	c746ed724d	Allow stacked filesystems to be recursively unmounted In certain emergency cases such as media failure or removal, UFS will initiate a forced unmount in order to prevent dirty buffers from accumulating against the no-longer-usable filesystem. The presence of a stacked filesystem such as nullfs or unionfs above the UFS mount will prevent this forced unmount from succeeding. This change addresses the situation by allowing stacked filesystems to be recursively unmounted on a taskqueue thread when the MNT_RECURSE flag is specified to dounmount(). This call will block until all upper mounts have been removed unless the caller specifies the MNT_DEFERRED flag to indicate the base filesystem should also be unmounted from the taskqueue. To achieve this, the recently-added vfs_pin_from_vp()/vfs_unpin() KPIs have been combined with the existing 'mnt_uppers' list used by nullfs and renamed to vfs_register_upper_from_vp()/vfs_unregister_upper(). The format of the mnt_uppers list has also been changed to accommodate filesystems such as unionfs in which a given mount may be stacked atop more than one lower mount. Additionally, management of lower FS reclaim/unlink notifications has been split into a separate list managed by a separate set of KPIs, as registration of an upper FS no longer implies interest in these notifications. Reviewed by: kib, mckusick Tested by: pho Differential Revision: https://reviews.freebsd.org/D31016	2021-07-24 12:52:00 -07:00
Warner Losh	6475667f7b	devctl: don't publish the mount options Mount options aren't solely ASCII strings. In addition, experience to date suggests that the mount options are much less useful than was originally supposed and the mount flags suffice to make decisions. Drop the reporting of options for the mount/remount/unmount events. Reviewed by: markj Reported by: KASAN Sponsored by: Netflix Differential Revision: https://reviews.freebsd.org/D31287	2021-07-24 09:03:53 -06:00
Mark Johnston	ebf9886654	imgact_elf: Avoid redefining suword() Otherwise this interferes with the definition for sanitizer interceptors. MFC after: 1 week Sponsored by: The FreeBSD Foundation	2021-07-23 15:40:54 -04:00
Mark Johnston	048cd371f3	vfs: Initialize "lastfail" in vfs_mountroot_wait() This variable is only used to rate-limit "Root mount waiting for: ..." messages using ppsratecheck(). Reported by: KMSAN MFC after: 1 week Sponsored by: The FreeBSD Foundation	2021-07-23 12:04:02 -04:00
Mark Johnston	ea3fbe0707	KASAN: Disable checking before triggering a panic KASAN hooks will not generate reports if panicstr != NULL, but then there is a window after the initial panic() call where another report may be raised. This can happen if a false positive occurs; to simplify debugging of such problems, avoid recursing. Sponsored by: The FreeBSD Foundation	2021-07-23 10:47:14 -04:00
Mark Johnston	0dcef81de9	Add required sysctl name length checks to various handlers Reported by: KMSAN MFC after: 1 week Sponsored by: The FreeBSD Foundation	2021-07-23 10:47:13 -04:00
Mark Johnston	cae3f9dd01	select: Define select_flags[] as const MFC after: 1 week Sponsored by: The FreeBSD Foundation	2021-07-23 10:47:13 -04:00
Mark Johnston	90959dd1e5	acct: Zero pad bytes in accounting records Reported by: KMSAN MFC after: 1 week Sponsored by: The FreeBSD Foundation	2021-07-23 10:29:57 -04:00
Mark Johnston	5c18bf9d5f	ktrace: Zero request structures when populating the pool Otherwise uninitialized pad bytes may be copied into the ktrace log file. Reported by: KMSAN MFC after: 1 week Sponsored by: The FreeBSD Foundation	2021-07-23 10:29:53 -04:00
Alan Somers	6c95065590	Escape any '.' characters in sysctl node names ZFS creates some sysctl nodes that include a pool name, and '.' is an allowed character in pool names. But it's the separator in the sysctl tree, so it can't be included in a sysctl name. Replace it with "%25". Handily, "%" is illegal in ZFS pool names, so there's no ambiguity there. PR: 257316 MFC after: 3 weeks Sponsored by: Axcient Reviewed by: freqlabs Differential Revision: https://reviews.freebsd.org/D31265	2021-07-22 10:22:48 -06:00
Kyle Evans	23ecfa9d5b	kern: mountroot: avoid fd leak in .md parsing parse_dir_md() opens /dev/mdctl but only closes the resulting fd on success, not upon failure of the ioctl or when we exceed the md unit max. Reviewed by: kib (slightly previous version) Sponsored by: NetApp, Inc. Sponsored by: Klara, Inc. X-NetApp-PR: #62 Differential Revision: https://reviews.freebsd.org/D31229	2021-07-21 10:18:09 -05:00
Edward Tomasz Napierala	a40cf4175c	Implement unprivileged chroot This builds on recently introduced NO_NEW_PRIVS flag to implement unprivileged chroot, enabled by `security.bsd.unprivileged_chroot`. It allows non-root processes to chroot(2), provided they have the NO_NEW_PRIVS flag set. The chroot(8) utility gets a new flag, -n, which sets NO_NEW_PRIVS before chrooting. Reviewed By: kib Sponsored By: EPSRC Relnotes: yes Differential Revision: https://reviews.freebsd.org/D30130	2021-07-20 08:57:53 +00:00
Dmitry Chagin	1ca6b15bbd	Drop "All rights reserved" from my copyright statements. Add email and fixup years while here. Reviewed by: imp Differential Revision: https://reviews.freebsd.org/D30912 MFC after: 2 weeks	2021-07-20 10:05:50 +03:00
Dmitry Chagin	5fd9cd53d2	linux(4): Modify sv_onexec hook to return an error. Temporarily add stubs to the Linux emulation layer which call the existing hook. Reviewed by: kib Differential Revision: https://reviews.freebsd.org/D30911 MFC after: 2 weeks	2021-07-20 09:56:25 +03:00
Dmitry Chagin	62ba4cd340	Call sv_onexec hook after the process VA is created. For future use in the Linux emulation layer call sv_onexec hook right after the new process address space is created. It's safe, as sv_onexec used only by Linux abi and linux_on_exec() does not depend on a state of process VA. Reviewed by: kib Differential revision: https://reviews.freebsd.org/D30899 MFC after: 2 weeks	2021-07-20 09:55:14 +03:00
Dmitry Chagin	b39fa4770d	Remove bogus cast from exec_sysvec_init(). Reviewed by: kib Differential Revision: https://reviews.freebsd.org/D30910 MFC after: 2 weeks	2021-07-20 09:54:09 +03:00
Dmitry Chagin	21629e2a45	Modify exec_sysvec_init() to allow non-native abi to setup their sysentvecs. For future use in the Linux emulation layer modify the exec_sysvec_init() to allow non-native abi to fill sv_timekeep_base and sv_shared_page_obj. Reviewed by: kib Differential revision: https://reviews.freebsd.org/D30898 MFC after: 2 weeks	2021-07-20 09:53:21 +03:00
Kyle Evans	db0f264393	kenv: allow listing of static kernel environments The early environment is typically cleared, so these new options need the PRESERVE_EARLY_KENV kernel config(8) option. These environments are reported as missing by kenv(1) if the option is not present in the running kernel. Reviewed by: imp Differential Revision: https://reviews.freebsd.org/D30835	2021-07-18 23:06:19 -05:00
Kyle Evans	7a129c973b	kern: add an option for preserving the early kenv Some downstream configurations do not store secrets in the early (loader/static) environments and desire a way to preserve these for diagnostic reasons. Provide an option to do so. Reviewed by: imp, jhb (earlier version) Differential Revision: https://reviews.freebsd.org/D30834	2021-07-18 23:05:48 -05:00
David Chisnall	cf98bc28d3	Pass the syscall number to capsicum permission-denied signals The syscall number is stored in the same register as the syscall return on amd64 (and possibly other architectures) and so it is impossible to recover in the signal handler after the call has returned. This small tweak delivers it in the `si_value` field of the signal, which is sufficient to catch capability violations and emulate them with a call to a more-privileged process in the signal handler. This reapplies `3a522ba1bc` with a fix for the static assertion failure on i386. Approved by: markj (mentor) Reviewed by: kib, bcr (manpages) Differential Revision: https://reviews.freebsd.org/D29185	2021-07-16 18:06:44 +01:00
Mark Johnston	c1aff72cfa	callout: Make cc_cpu local to kern_timeout.c No functional change intended. MFC after: 1 week Sponsored by: The FreeBSD Foundation	2021-07-15 22:41:10 -04:00
Mark Johnston	2e5f615295	lio_listio: Don't post a completion notification if none was requested One is allowed to use LIO_NOWAIT without specifying a sigevent. In this case, lj->lioj_signal is left uninitialized, but several code paths examine liov_signal.sigev_notify to figure out which notification to post. Unconditionally initialize that field to SIGEV_NONE. Add a dumb test case which triggers the bug. Reported by: KMSAN+syzkaller Reviewed by: asomers MFC after: 2 weeks Sponsored by: The FreeBSD Foundation Differential Revision: https://reviews.freebsd.org/D31197	2021-07-15 22:41:10 -04:00
Konstantin Belousov	0bdb2cbf9d	procctl(PROC_ASLR_STATUS): fix vmspace leak Reported by: jhb Sponsored by: The FreeBSD Foundation MFC after: 3 days	2021-07-15 03:02:50 +03:00
Mark Johnston	2783335cae	blist: Correct the node count computed in blist_create() Commit `bb4a27f927` added the ability to allocate a span of blocks crossing a meta node boundary. To ensure that blst_next_leaf_alloc() does not walk past the end of the tree, an extra all-zero meta node needs to be present at the end of the allocation, and blst_next_leaf_alloc() is implemented such that the presence of this node terminates the search. blist_create() computes the number of nodes required. It had two problems: 1. When the size of the blist is a power of BLIST_RADIX, we would unnecessarily allocate an extra level in the tree. 2. When the size of the blist is a multiple of BLIST_RADIX, we would fail to allocate a terminator node. In this case, blst_next_leaf_alloc() could scan beyond the bounds of the allocation. This was found using KASAN. Modify blist_create() to handle these cases correctly. Reported by: pho Reviewed by: dougm MFC after: 2 weeks Differential Revision: https://reviews.freebsd.org/D31158	2021-07-13 17:47:27 -04:00
Mark Johnston	45e2357113	malloc: Pass the allocation size to malloc_large() by value Its callers do not make use of the modified size that malloc_large() was returning, so there's no need to pass a pointer. No functional change intended. MFC after: 2 weeks Sponsored by: The FreeBSD Foundation	2021-07-13 17:47:02 -04:00
Mateusz Guzik	844aa31c6d	cache: add cache_enter_time_flags	2021-07-12 07:03:14 +02:00
David Chisnall	d2b558281a	Revert "Pass the syscall number to capsicum permission-denied signals" This broke the i386 build. This reverts commit `3a522ba1bc`.	2021-07-10 20:26:01 +01:00
David Chisnall	3a522ba1bc	Pass the syscall number to capsicum permission-denied signals The syscall number is stored in the same register as the syscall return on amd64 (and possibly other architectures) and so it is impossible to recover in the signal handler after the call has returned. This small tweak delivers it in the `si_value` field of the signal, which is sufficient to catch capability violations and emulate them with a call to a more-privileged process in the signal handler. Approved by: markj (mentor) Reviewed by: kib, bcr (manpages) Differential Revision: https://reviews.freebsd.org/D29185	2021-07-10 17:19:52 +01:00
Alexander Motin	63ca9ea4f3	Use sleepq_signal(SLEEPQ_DROP) in cv_signal(). Same as wakeup_one()/wakeup_any() commit before it reduces the lock hold time and so contention. MFC after: 1 week	2021-07-09 20:57:58 -04:00
Mark Johnston	588c7a06df	KASAN: Implement __asan_unregister_globals() It will be called during KLD unload to unpoison the redzones following global variables. Otherwise, virtual address ranges previously used for a KLD may be left tainted, triggering false positives when they are recycled. Reported by: pho Sponsored by: The FreeBSD Foundation	2021-07-09 20:38:50 -04:00
Michal Meloun	e88c3b1b02	intrng: remove now redundant shadow variable. Should not be a functional change. Submitted by: ehem_freebsd@m5p.com Discussed in: https://reviews.freebsd.org/D29310 MFC after: 4 weeks	2021-07-08 08:46:41 +02:00
Michal Meloun	a49f208d94	intrng: Releasing interrupt source should clear interrupt table full state. The first release of an interrupt in a situation where the interrupt table is full should schedule a full table check the next time an interrupt is allocated. A full check is necessary to ensure maximum separation between the order of allocation and the order of release. Submitted by: ehem_freebsd@m5p.com (initial version) Discussed in: https://reviews.freebsd.org/D29310 MFC after: 4 weeks	2021-07-08 08:16:46 +02:00
Andrew Gallatin	4150a5a87e	ktls: fix NOINET build Reported by: mjguzik Sponsored by: Netflix	2021-07-07 10:40:02 -04:00
Randall Stewart	d7955cc0ff	tcp: HPTS performance enhancements HPTS drives both rack and bbr, and yet there have been many complaints about performance. This bit of work restructures hpts to help reduce CPU overhead. It does this by now instead of relying on the timer/callout to drive it instead use user return from a system call as well as lro flushes to drive hpts. The timer becomes a backstop that dynamically adjusts based on how "late" we are. Reviewed by: tuexen, glebius Sponsored by: Netflix Inc. Differential Revision: https://reviews.freebsd.org/D31083	2021-07-07 07:22:35 -04:00
Konstantin Belousov	28a66fc3da	Do not call FreeBSD-ABI specific code for all ABIs Use sysentvec hooks to only call umtx_thread_exit/umtx_exec, which handle robust mutexes, for native FreeBSD ABI. Similarly, there is no sense in calling sigfastblock_clear() for non-native ABIs. Requested by: dchagin Reviewed by: dchagin, markj (previous version) Sponsored by: The FreeBSD Foundation MFC after: 2 weeks Differential revision: https://reviews.freebsd.org/D30987	2021-07-07 14:12:07 +03:00
Konstantin Belousov	55976ce11a	Move sv_onexit() sysentvec hook slightly later after itimers are stopped. This makes it more usable for e.g. native FreeBSD ABI sysentvecs. Reviewed by: dchagin, markj Sponsored by: The FreeBSD Foundation MFC after: 2 weeks Differential revision: https://reviews.freebsd.org/D30987	2021-07-07 14:12:07 +03:00
Konstantin Belousov	71ab344524	Add sv_onexec_old() sysent hook for exec event Unlike sv_onexec(), it is called from the old (pre-exec) sysentvec structure. The old vmspace for the process is still intact during the call. Reviewed by: dchagin, markj Sponsored by: The FreeBSD Foundation MFC after: 2 weeks Differential revision: https://reviews.freebsd.org/D30987	2021-07-07 14:12:07 +03:00
Mateusz Guzik	c2c34ee540	mbuf: add m_get_raw and m_gethdr_raw The intent is to eliminate the MT_NOINIT flag and consequently a branch from the constructor. Reviewed by: gallatin Sponsored by: Rubicon Communications, LLC ("Netgate") Differential Revision: https://reviews.freebsd.org/D31080	2021-07-07 11:05:46 +00:00
Mateusz Guzik	0a718a6e6e	mbuf: replace all direct uma_zfree(zone_mbuf) calls with m_free_raw Reviewed by: donner Sponsored by: Rubicon Communications, LLC ("Netgate") Differential Revision: https://reviews.freebsd.org/D31082	2021-07-07 11:05:46 +00:00
Andrew Gallatin	28d0a740dd	ktls: auto-disable ifnet (inline hw) kTLS Ifnet (inline) hw kTLS NICs typically keep state within a TLS record, so that when transmitting in-order, they can continue encryption on each segment sent without DMA'ing extra state from the host. This breaks down when transmits are out of order (eg, TCP retransmits). In this case, the NIC must re-DMA the entire TLS record up to and including the segment being retransmitted. This means that when re-transmitting the last 1448 byte segment of a TLS record, the NIC will have to re-DMA the entire 16KB TLS record. This can lead to the NIC running out of PCIe bus bandwidth well before it saturates the network link if a lot of TCP connections have a high retransmit rate. This change introduces a new sysctl (kern.ipc.tls.ifnet_max_rexmit_pct), where TCP connections with higher retransmit rate will be switched to SW kTLS so as to conserve PCIe bandwidth. Reviewed by: hselasky, markj, rrs Sponsored by: Netflix Differential Revision: https://reviews.freebsd.org/D30908	2021-07-06 10:28:32 -04:00
Jessica Clarke	55c57a7811	rman: Remove an outdated comment that no longer applies Since commit `2dd1bdf183` in 2016 the r_start and r_end fields have been rman_res_t, which was briefly unsigned long, but commit `da1b038af9` changed the typedef to be uintmax_t instead. C99 is also something we assume these days. Reviewed by: imp Differential Revision: https://reviews.freebsd.org/D30808	2021-07-05 16:15:03 +01:00
Mateusz Guzik	904a08f342	ktls: switch bare zone_mbuf use to m_free_raw Reviewed by: gallatin Sponsored by: Rubicon Communications, LLC ("Netgate") Differential Revision: https://reviews.freebsd.org/D30955	2021-07-02 08:30:22 +00:00
Mateusz Guzik	05462babd4	mbuf: add m_free_raw to be used instead of directly calling uma_zfree The intent is to remove all direct zone_mbuf consumers so that ctor/dtor from that zone can be reimplemented as wrappers around uma, avoiding an indirect function call. Reviewed by: kbowling Discussed with: gallatin Sponsored by: Rubicon Communications, LLC ("Netgate") Differential Revision: https://reviews.freebsd.org/D30959	2021-07-02 08:30:22 +00:00
Mateusz Guzik	fb32c8dbeb	iflib: retire MB_DTOR_SKIP The flag was added in 2016 but remains unused. Reviewed by: kbowling Sponsored by: Rubicon Communications, LLC ("Netgate") Differential Revision: https://reviews.freebsd.org/D30958	2021-07-02 08:30:22 +00:00
Edward Tomasz Napierala	db8d680ebe	procctl(2): add PROC_NO_NEW_PRIVS_CTL, PROC_NO_NEW_PRIVS_STATUS This introduces a new, per-process flag, "NO_NEW_PRIVS", which is inherited, preserved on exec, and cannot be cleared. The flag, when set, makes subsequent execs ignore any SUID and SGID bits, instead executing those binaries as if they were not set. The main purpose of the flag is implementation of Linux PROC_SET_NO_NEW_PRIVS prctl(2), and possibly also unprivileged chroot. Reviewed By: kib Sponsored By: EPSRC Differential Revision: https://reviews.freebsd.org/D30939	2021-07-01 09:42:07 +01:00
Dmitry Chagin	5d9f790191	Eliminate p_elf_machine from struct proc. Instead of p_elf_machine use machine member of the Elf_Brandinfo which is now cached in the struct proc at p_elf_brandinfo member. Note to MFC: D30918, KBI Reviewed by: kib, markj Differential Revision: https://reviews.freebsd.org/D30926 MFC after: 2 weeks	2021-06-29 20:18:29 +03:00
Dmitry Chagin	615f22b2fb	Add a link to the Elf_Brandinfo into the struct proc. To allow the ABI to make a decision based on the Brandinfo add a link to the Elf_Brandinfo into the struct proc. Add a note that the high 8 bits of Elf_Brandinfo flags is private to the ABI. Note to MFC: it breaks KBI. Reviewed by: kib, markj Differential Revision: https://reviews.freebsd.org/D30918 MFC after: 2 weeks	2021-06-29 20:15:08 +03:00
Edward Tomasz Napierala	435754a59e	Add infrastructure required for Linux coredump support This adds `sv_elf_core_osabi`, `sv_elf_core_abi_vendor`, and `sv_elf_core_prepare_notes` fields to `struct sysentvec`, and modifies imgact_elf.c to make use of them instead of hardcoding FreeBSD-specific values. It also updates all of the ABI definitions to preserve current behaviour. This makes it possible to implement non-native ELF coredump support without unnecessary code duplication. It will be used for Linux coredumps. Reviewed By: kib Sponsored By: EPSRC Differential Revision: https://reviews.freebsd.org/D30921	2021-06-29 08:49:12 +01:00
Edward Tomasz Napierala	61b4c62718	imgact_elf.c: style, remove unnecessary casts Remove unnecessary type casts and redundant brackets. No functional changes. Suggested By: kib Reviewed By: kib Sponsored By: EPSRC Differential Revision: https://reviews.freebsd.org/D30841	2021-06-27 17:05:59 +01:00
Alexander Motin	6df35af4d8	Allow sleepq_signal() to drop the lock. Introduce SLEEPQ_DROP sleepq_signal() flag, allowing one to drop the sleep queue chain lock before returning. Reduced lock scope significantly reduces lock contention inside taskqueue_enqueue() for ZFS worker threads doing ~350K disk reads/s on 40-thread system. MFC after: 2 weeks Sponsored by: iXsystems, Inc.	2021-06-25 14:12:21 -04:00
Konstantin Belousov	802cf4ab0e	namei: add NDPREINIT() macro Its intent is to do the initialization of the future part of struct nameidata which should be used across several namei() and VOPs. Right now it is NOP. Reviewed by: mckusick Discussed with: markj Tested by: pho Sponsored by: The FreeBSD Foundation MFC after: 2 weeks Differential revision: https://reviews.freebsd.org/D30041	2021-06-23 23:46:15 +03:00
Warner Losh	ddfc9c4c59	newbus: Move from bus_child_{pnpinfo,location}_src to bus_child_{pnpinfo,location} with sbuf Now that the upper layers all go through a layer to tie into these information functions that translates an sbuf into char * and len. The current interface suffers issues of what to do in cases of truncation, etc. Instead, migrate all these functions to using struct sbuf and these issues go away. The caller is also in charge of any memory allocation and/or expansion that's needed during this process. Create a bus_generic_child_{pnpinfo,location} and make it default. It just returns success. This is for those busses that have no information for these items. Migrate the now-empty routines to using this as appropriate. Document these new interfaces with man pages, an oversight from before. Reviewed by: jhb, bcr Sponsored by: Netflix Differential Revision: https://reviews.freebsd.org/D29937	2021-06-22 20:52:06 -06:00
Edward Tomasz Napierala	06250515cf	imgact_elf: compute auxv buffer size instead of using magic value The new buffer is somewhat larger, but there should be no functional changes. Reviewed By: kib, imp Sponsored By: EPSRC Differential Revision: https://reviews.freebsd.org/D30821	2021-06-21 17:07:07 +01:00
Colin Percival	fe51b5a76d	kern_tslog: Include tslog data from loader The i386 loader (and hopefully others to come) now passes tslog data as a "preloaded module". Include this in the data returned by the debug.tslog sysctl. Reviewed by: kevans	2021-06-20 20:09:47 -07:00
Warner Losh	0a99422970	Move mips and arm to 1000Hz by default. armv6 and armv7 systems already were 1000Hz. The other armv5 were a mix of 100 and 1000. This changes them to 1000. Should there be issues, we can add options HZ=100 to the systems that have bad performance at the drop of a hat. mips is a lot more complicated. But most of the systems are already 1000HZ. The hardware exceptions are all fast enough to run at 1000Hz. MALTA is our primary emulator, and history has shown emulators tend to like 100Hz better, so run those systems at 100Hz. As with arm, any system that shows a huge performance regression can be reverted to 100Hz easily. This was going to be committed well in advance of the 13 branch, but it was delayed and forgotten til now. Discussed on: #bsdmips ages ago Sponsored by: Netflix	2021-06-16 20:00:14 -06:00
John Baldwin	faf0224ff2	ktls: Don't mark existing received mbufs notready for TOE TLS. The TOE driver might receive decrypted TLS records that are enqueued to the socket buffer after ktls_try_toe() returns and before ktls_enable_rx() locks the receive buffer to call sb_mark_notready(). In that case, sb_mark_notready() would incorrectly treat the decrypted TLS record as an encrypted record and schedule it for decryption. This always resulted in the connection being dropped as the data in the control message did not look like a valid TLS header. To fix, don't try to handle software decryption of existing buffers in the socket buffer for TOE TLS in ktls_enable_rx(). If a TOE TLS driver needs to decrypt existing data in the socket buffer, the driver will need to manage that in its tod_alloc_tls_session method. Sponsored by: Chelsio Communications	2021-06-15 17:45:21 -07:00
Konstantin Belousov	a12e901a5a	Add a knob to disable dequeueing SIGCHLD on waiting for live process It seems that Linux does not dequeue siginfo for SIGCHLD when wait*(2) reports status of the running process. In particular, sigwaitinfo(2) and other signal querying syscalls can observe the siginfo after wait. FreeBSD dequeued siginfo from the beginning, so we cannot change the default ABI to be more compatible. Still, add a knob to enable to change to the other behavior for debugging purposes. Reported by: dchagin Reviewed by: dchagin, markj Sponsored by: The FreeBSD Foundation MFC after: 1 week Differential revision: https://reviews.freebsd.org/D30675	2021-06-16 02:00:19 +03:00
Konstantin Belousov	bc38762474	Add a knob to not drop signal with default ignored or ignored actions Traditionally, BSD drops signals with the default action during send, not even putting them to the destination process queue. This semantic is not shared with other operating systems (Linux), which do queue such signals. In particular, sigtimedwait(2) and related syscalls can observe the delivery. Add a global knob kern.sig_discard_ign which can be set to false to force enqueuing of the signals with default action. Also add an ABI flag to indicate that signals should be queued. Note that it is not practical to run with the knob turned on, because almost all software that cares about the delivery of such signals is aware of the difference, and misbehaves if the signals are actually queued. The purpose of the knob, as is, is to allow for easier diagnosis of the programs that need the adjustments, to confirm the cause of a problem. Reported by: dchagin Reviewed by: dchagin, markj Sponsored by: The FreeBSD Foundation MFC after: 1 week Differential revision: https://reviews.freebsd.org/D30675	2021-06-16 02:00:19 +03:00
Konstantin Belousov	acced8b043	sigwait: add comment explaining EINTR/ERESTART details Reviewed by: dchagin, markj Sponsored by: The FreeBSD Foundation MFC after: 1 week Differential revision: https://reviews.freebsd.org/D30675	2021-06-16 02:00:19 +03:00
Konstantin Belousov	afb36e289c	sigwait(2) and sigtimedwait(2) must not be restarted. Reported by: dchagin Reviewed by: dchagin, markj Sponsored by: The FreeBSD Foundation MFC after: 1 week Differential revision: https://reviews.freebsd.org/D30675	2021-06-16 02:00:18 +03:00
Mark Johnston	a100217489	Consistently use the SOCKBUF_MTX() and SOCK_MTX() macros This makes it easier to change the socket locking protocols. No functional change intended. MFC after: 1 week Sponsored by: The FreeBSD Foundation	2021-06-14 17:32:32 -04:00
Mark Johnston	f4bb1869dd	Consistently use the SOLISTENING() macro Some code was using it already, but in many places we were testing SO_ACCEPTCONN directly. As a small step towards fixing some bugs involving synchronization with listen(2), make the kernel consistently use SOLISTENING(). No functional change intended. MFC after: 1 week Sponsored by: The FreeBSD Foundation	2021-06-14 17:32:27 -04:00
Andrew Gallatin	ed5e13cfc2	ktls: Fix interaction with RATELIMIT uipc_ktls.c was missing opt_ratelimit.h, so it was never noticing that RATELIMIT was enabled. Once it was enabled, it failed to compile as ktls_modify_txrtlmt() had accrued a compilation error when it was not being compiled in. Sponsored by: Netflix	2021-06-14 10:51:16 -04:00
Dmitry Chagin	e884512ad1	Split kern_poll() into two counterparts. The kern_poll_kfds() operates on clear kernel data, kfds points to an array in the kernel, while kern_poll() operates on user-supplied pollfd. Move nfds check to kern_poll_maxfds(). No functional changes, it's for future use in the Linux emulation layer. Reviewed by: kib Differential Revision: https://reviews.freebsd.org/D30690 MFC after: 2 weeks	2021-06-10 15:11:25 +03:00
Dmitry Chagin	f570a6723e	Fix copyright, remove "all rights reserved". The eventfd code was written by me, rdivacky@ copyright is applicable only to the epoll part of the Linuxulator code. Roman is ok to retire his copyright from sys/kern/sys_eventfd.c and 'All rights reserved.' lines from sys/compat/linux/linux_event.[c|h] and sys/kern/sys_eventfd.c files. Reviewed by: kib, emaste Approved by: rdivacky Differential Revision: https://reviews.freebsd.org/D30677 MFC after: 2 weeks	2021-06-08 08:18:00 +03:00
Mark Johnston	887c753c9f	Fix handling of D_GIANTOK It was meant to suppress only the printf(), not the subsequent injection of Giant-protected thunks for various file operations. Fixes: `fbeb4ccac9` Reported by: pho Tested by: pho MFC after: 6 days Pointy hat: markj	2021-06-07 16:45:50 -04:00
Mark Johnston	fbeb4ccac9	Suppress D_NEEDGIANT warnings for some drivers During boot we warn that the kbd and openfirm drivers are Giant-locked and may be deleted. Generally, the warning helps signal that certain old drivers are not being maintained and are subject to removal, but this doesn't really apply to certain drivers which are harder to detangle from Giant. Add a flag, D_GIANTOK, that devices can specify to suppress the misleading warning. Use it in the kbd and openfirm drivers. Reviewed by: imp, jhb MFC after: 1 week Sponsored by: The FreeBSD Foundation Differential Revision: https://reviews.freebsd.org/D30649	2021-06-06 16:44:46 -04:00
Konstantin Belousov	2d423f7671	sysent: allow ABI to disable setid on exec. Reviewed by: dchagin Tested by: trasz MFC after: 1 week Sponsored by: The FreeBSD Foundation Differential Revision: https://reviews.freebsd.org/D28154	2021-06-06 21:42:52 +03:00

18640 Commits