freebsd-dev

Author	SHA1	Message	Date
Kyle Evans	7a129c973b	kern: add an option for preserving the early kenv Some downstream configurations do not store secrets in the early (loader/static) environments and desire a way to preserve these for diagnostic reasons. Provide an option to do so. Reviewed by: imp, jhb (earlier version) Differential Revision: https://reviews.freebsd.org/D30834	2021-07-18 23:05:48 -05:00
David Chisnall	cf98bc28d3	Pass the syscall number to capsicum permission-denied signals The syscall number is stored in the same register as the syscall return on amd64 (and possibly other architectures) and so it is impossible to recover in the signal handler after the call has returned. This small tweak delivers it in the `si_value` field of the signal, which is sufficient to catch capability violations and emulate them with a call to a more-privileged process in the signal handler. This reapplies `3a522ba1bc` with a fix for the static assertion failure on i386. Approved by: markj (mentor) Reviewed by: kib, bcr (manpages) Differential Revision: https://reviews.freebsd.org/D29185	2021-07-16 18:06:44 +01:00
Mark Johnston	c1aff72cfa	callout: Make cc_cpu local to kern_timeout.c No functional change intended. MFC after: 1 week Sponsored by: The FreeBSD Foundation	2021-07-15 22:41:10 -04:00
Mark Johnston	2e5f615295	lio_listio: Don't post a completion notification if none was requested One is allowed to use LIO_NOWAIT without specifying a sigevent. In this case, lj->lioj_signal is left uninitialized, but several code paths examine liov_signal.sigev_notify to figure out which notification to post. Unconditionally initialize that field to SIGEV_NONE. Add a dumb test case which triggers the bug. Reported by: KMSAN+syzkaller Reviewed by: asomers MFC after: 2 weeks Sponsored by: The FreeBSD Foundation Differential Revision: https://reviews.freebsd.org/D31197	2021-07-15 22:41:10 -04:00
Konstantin Belousov	0bdb2cbf9d	procctl(PROC_ASLR_STATUS): fix vmspace leak Reported by: jhb Sponsored by: The FreeBSD Foundation MFC after: 3 days	2021-07-15 03:02:50 +03:00
Mark Johnston	2783335cae	blist: Correct the node count computed in blist_create() Commit `bb4a27f927` added the ability to allocate a span of blocks crossing a meta node boundary. To ensure that blst_next_leaf_alloc() does not walk past the end of the tree, an extra all-zero meta node needs to be present at the end of the allocation, and blst_next_leaf_alloc() is implemented such that the presence of this node terminates the search. blist_create() computes the number of nodes required. It had two problems: 1. When the size of the blist is a power of BLIST_RADIX, we would unnecessarily allocate an extra level in the tree. 2. When the size of the blist is a multiple of BLIST_RADIX, we would fail to allocate a terminator node. In this case, blst_next_leaf_alloc() could scan beyond the bounds of the allocation. This was found using KASAN. Modify blist_create() to handle these cases correctly. Reported by: pho Reviewed by: dougm MFC after: 2 weeks Differential Revision: https://reviews.freebsd.org/D31158	2021-07-13 17:47:27 -04:00
Mark Johnston	45e2357113	malloc: Pass the allocation size to malloc_large() by value Its callers do not make use the modified size that malloc_large() was returning, so there's no need to pass a pointer. No functional change intended. MFC after: 2 weeks Sponsored by: The FreeBSD Foundation	2021-07-13 17:47:02 -04:00
Mateusz Guzik	844aa31c6d	cache: add cache_enter_time_flags	2021-07-12 07:03:14 +02:00
David Chisnall	d2b558281a	Revert "Pass the syscall number to capsicum permission-denied signals" This broke the i386 build. This reverts commit `3a522ba1bc`.	2021-07-10 20:26:01 +01:00
David Chisnall	3a522ba1bc	Pass the syscall number to capsicum permission-denied signals The syscall number is stored in the same register as the syscall return on amd64 (and possibly other architectures) and so it is impossible to recover in the signal handler after the call has returned. This small tweak delivers it in the `si_value` field of the signal, which is sufficient to catch capability violations and emulate them with a call to a more-privileged process in the signal handler. Approved by: markj (mentor) Reviewed by: kib, bcr (manpages) Differential Revision: https://reviews.freebsd.org/D29185	2021-07-10 17:19:52 +01:00
Alexander Motin	63ca9ea4f3	Use sleepq_signal(SLEEPQ_DROP) in cv_signal(). Same as wakeup_one()/wakeup_any() commit before it reduces the lock hold time and so contention. MFC after: 1 week	2021-07-09 20:57:58 -04:00
Mark Johnston	588c7a06df	KASAN: Implement __asan_unregister_globals() It will be called during KLD unload to unpoison the redzones following global variables. Otherwise, virtual address ranges previously used for a KLD may be left tainted, triggering false positives when they are recycled. Reported by: pho Sponsored by: The FreeBSD Foundation	2021-07-09 20:38:50 -04:00
Michal Meloun	e88c3b1b02	intrng: remove now redundant shadow variable. Should not be a functional change. Submitted by: ehem_freebsd@m5p.com Discussed in: https://reviews.freebsd.org/D29310 MFC after: 4 weeks	2021-07-08 08:46:41 +02:00
Michal Meloun	a49f208d94	intrng: Releasing interrupt source should clear interrupt table full state. The first release of an interrupt in a situation where the interrupt table is full should schedule a full table check the next time an interrupt is allocated. A full check is necessary to ensure maximum separation between the order of allocation and the order of release. Submitted by: ehem_freebsd@m5p.com (initial version) Discussed in: https://reviews.freebsd.org/D29310 MFC after: 4 weeks	2021-07-08 08:16:46 +02:00
Andrew Gallatin	4150a5a87e	ktls: fix NOINET build Reported by: mjguzik Sponsored by: Netflix	2021-07-07 10:40:02 -04:00
Randall Stewart	d7955cc0ff	tcp: HPTS performance enhancements HPTS drives both rack and bbr, and yet there have been many complaints about performance. This bit of work restructures hpts to help reduce CPU overhead. It does this by now instead of relying on the timer/callout to drive it instead use user return from a system call as well as lro flushes to drive hpts. The timer becomes a backstop that dynamically adjusts based on how "late" we are. Reviewed by: tuexen, glebius Sponsored by: Netflix Inc. Differential Revision: https://reviews.freebsd.org/D31083	2021-07-07 07:22:35 -04:00
Konstantin Belousov	28a66fc3da	Do not call FreeBSD-ABI specific code for all ABIs Use sysentvec hooks to only call umtx_thread_exit/umtx_exec, which handle robust mutexes, for native FreeBSD ABI. Similarly, there is no sense in calling sigfastblock_clear() for non-native ABIs. Requested by: dchagin Reviewed by: dchagin, markj (previous version) Sponsored by: The FreeBSD Foundation MFC after: 2 weeks Differential revision: https://reviews.freebsd.org/D30987	2021-07-07 14:12:07 +03:00
Konstantin Belousov	55976ce11a	Move sv_onexit() sysentvec hook slightly later after itimers are stopped. This makes it more usable for e.g. native FreeBSD ABI sysentvecs. Reviewed by: dchagin, markj Sponsored by: The FreeBSD Foundation MFC after: 2 weeks Differential revision: https://reviews.freebsd.org/D30987	2021-07-07 14:12:07 +03:00
Konstantin Belousov	71ab344524	Add sv_onexec_old() sysent hook for exec event Unlike sv_onexec(), it is called from the old (pre-exec) sysentvec structure. The old vmspace for the process is still intact during the call. Reviewed by: dchagin, markj Sponsored by: The FreeBSD Foundation MFC after: 2 weeks Differential revision: https://reviews.freebsd.org/D30987	2021-07-07 14:12:07 +03:00
Mateusz Guzik	c2c34ee540	mbuf: add m_get_raw and m_gethdr_raw The intent is to eliminate the MT_NOINIT flag and consequently a branch from the constructor. Reviewed by: gallatin Sponsored by: Rubicon Communications, LLC ("Netgate") Differential Revision: https://reviews.freebsd.org/D31080	2021-07-07 11:05:46 +00:00
Mateusz Guzik	0a718a6e6e	mbuf: replace all direct uma_zfree(zone_mbuf) calls with m_free_raw Reviewed by: donner Sponsored by: Rubicon Communications, LLC ("Netgate") Differential Revision: https://reviews.freebsd.org/D31082	2021-07-07 11:05:46 +00:00
Andrew Gallatin	28d0a740dd	ktls: auto-disable ifnet (inline hw) kTLS Ifnet (inline) hw kTLS NICs typically keep state within a TLS record, so that when transmitting in-order, they can continue encryption on each segment sent without DMA'ing extra state from the host. This breaks down when transmits are out of order (eg, TCP retransmits). In this case, the NIC must re-DMA the entire TLS record up to and including the segment being retransmitted. This means that when re-transmitting the last 1448 byte segment of a TLS record, the NIC will have to re-DMA the entire 16KB TLS record. This can lead to the NIC running out of PCIe bus bandwidth well before it saturates the network link if a lot of TCP connections have a high retransmoit rate. This change introduces a new sysctl (kern.ipc.tls.ifnet_max_rexmit_pct), where TCP connections with higher retransmit rate will be switched to SW kTLS so as to conserve PCIe bandwidth. Reviewed by: hselasky, markj, rrs Sponsored by: Netflix Differential Revision: https://reviews.freebsd.org/D30908	2021-07-06 10:28:32 -04:00
Jessica Clarke	55c57a7811	rman: Remove an outdated comment that no longer applies Since commit `2dd1bdf183` in 2016 the r_start and r_end fields have been rman_res_t, which was briefly unsigned long, but commit `da1b038af9` changed the typedef to be uintmax_t instead. C99 is also something we assume these days. Reviewed by: imp Differential Revision: https://reviews.freebsd.org/D30808	2021-07-05 16:15:03 +01:00
Mateusz Guzik	904a08f342	ktls: switch bare zone_mbuf use to m_free_raw Reviewed by: gallatin Sponsored by: Rubicon Communications, LLC ("Netgate") Differential Revision: https://reviews.freebsd.org/D30955	2021-07-02 08:30:22 +00:00
Mateusz Guzik	05462babd4	mbuf: add m_free_raw to be used instead of directly calling uma_zfree The intent is to remove all direct zone_mbuf consumers so that ctor/dtor from that zone can be reimplemented as wrappers around uma, avoiding an indirect function call. Reviewed by: kbowling Discussed with: gallatin Sponsored by: Rubicon Communications, LLC ("Netgate") Differential Revision: https://reviews.freebsd.org/D30959	2021-07-02 08:30:22 +00:00
Mateusz Guzik	fb32c8dbeb	iflib: retire MB_DTOR_SKIP The flag was added in 2016 but remains unused. Reviewed by: kbowling Sponsored by: Rubicon Communications, LLC ("Netgate") Differential Revision: https://reviews.freebsd.org/D30958	2021-07-02 08:30:22 +00:00
Edward Tomasz Napierala	db8d680ebe	procctl(2): add PROC_NO_NEW_PRIVS_CTL, PROC_NO_NEW_PRIVS_STATUS This introduces a new, per-process flag, "NO_NEW_PRIVS", which is inherited, preserved on exec, and cannot be cleared. The flag, when set, makes subsequent execs ignore any SUID and SGID bits, instead executing those binaries as if they not set. The main purpose of the flag is implementation of Linux PROC_SET_NO_NEW_PRIVS prctl(2), and possibly also unpriviledged chroot. Reviewed By: kib Sponsored By: EPSRC Differential Revision: https://reviews.freebsd.org/D30939	2021-07-01 09:42:07 +01:00
Dmitry Chagin	5d9f790191	Eliminate p_elf_machine from struct proc. Instead of p_elf_machine use machine member of the Elf_Brandinfo which is now cached in the struct proc at p_elf_brandinfo member. Note to MFC: D30918, KBI Reviewed by: kib, markj Differential Revision: https://reviews.freebsd.org/D30926 MFC after: 2 weeks	2021-06-29 20:18:29 +03:00
Dmitry Chagin	615f22b2fb	Add a link to the Elf_Brandinfo into the struc proc. To allow the ABI to make a dicision based on the Brandinfo add a link to the Elf_Brandinfo into the struct proc. Add a note that the high 8 bits of Elf_Brandinfo flags is private to the ABI. Note to MFC: it breaks KBI. Reviewed by: kib, markj Differential Revision: https://reviews.freebsd.org/D30918 MFC after: 2 weeks	2021-06-29 20:15:08 +03:00
Edward Tomasz Napierala	435754a59e	Add infrastructure required for Linux coredump support This adds `sv_elf_core_osabi`, `sv_elf_core_abi_vendor`, and `sv_elf_core_prepare_notes` fields to `struct sysentvec`, and modifies imgact_elf.c to make use of them instead of hardcoding FreeBSD-specific values. It also updates all of the ABI definitions to preserve current behaviour. This makes it possible to implement non-native ELF coredump support without unnecessary code duplication. It will be used for Linux coredumps. Reviewed By: kib Sponsored By: EPSRC Differential Revision: https://reviews.freebsd.org/D30921	2021-06-29 08:49:12 +01:00
Edward Tomasz Napierala	61b4c62718	imgact_elf.c: style, remove unnecessary casts Remove unnecessary type casts and redundant brackets. No functional changes. Suggested By: kib Reviewed By: kib Sponsored By: EPSRC Differential Revision: https://reviews.freebsd.org/D30841	2021-06-27 17:05:59 +01:00
Alexander Motin	6df35af4d8	Allow sleepq_signal() to drop the lock. Introduce SLEEPQ_DROP sleepq_signal() flag, allowing one to drop the sleep queue chain lock before returning. Reduced lock scope allows significantly reduce lock contention inside taskqueue_enqueue() for ZFS worker threads doing ~350K disk reads/s on 40-thread system. MFC after: 2 weeks Sponsored by: iXsystems, Inc.	2021-06-25 14:12:21 -04:00
Konstantin Belousov	802cf4ab0e	namei: add NDPREINIT() macro Its intent is to do the initialization of the future part of struct nameidata which should be used across several namei() and VOPs. Right now it is NOP. Reviewed by: mckusick Discussed with: markj Tested by: pho Sponsored by: The FreeBSD Foundation MFC after: 2 weeks Differential revision: https://reviews.freebsd.org/D30041	2021-06-23 23:46:15 +03:00
Warner Losh	ddfc9c4c59	newbus: Move from bus_child_{pnpinfo,location}_src to bus_child_{pnpinfo,location} with sbuf Now that the upper layers all go through a layer to tie into these information functions that translates an sbuf into char * and len. The current interface suffers issues of what to do in cases of truncation, etc. Instead, migrate all these functions to using struct sbuf and these issues go away. The caller is also in charge of any memory allocation and/or expansion that's needed during this process. Create a bus_generic_child_{pnpinfo,location} and make it default. It just returns success. This is for those busses that have no information for these items. Migrate the now-empty routines to using this as appropriate. Document these new interfaces with man pages, and oversight from before. Reviewed by: jhb, bcr Sponsored by: Netflix Differential Revision: https://reviews.freebsd.org/D29937	2021-06-22 20:52:06 -06:00
Edward Tomasz Napierala	06250515cf	imgact_elf: compute auxv buffer size instead of using magic value The new buffer is somewhat larger, but there should be no functional changes. Reviewed By: kib, imp Sponsored By: EPSRC Differential Revision: https://reviews.freebsd.org/D30821	2021-06-21 17:07:07 +01:00
Colin Percival	fe51b5a76d	kern_tslog: Include tslog data from loader The i386 loader (and hopefully others to come) now passes tslog data as a "preloaded module". Include this in the data returned by the debug.tslog sysctl. Reviewed by: kevans	2021-06-20 20:09:47 -07:00
Warner Losh	0a99422970	Move mips and arm to 1000Hz by default. armv6 and armv7 systems already were 1000Hz. The other armv5 were a mix of 100 and 1000. This changes them to 1000. Should there be issues, we can add options HZ=100 to the systems that have bad performance at the drop of a hat. mips is a lot more complicated. But most of the systems are already 1000HZ. The hardware exceptions are all fast enough to run at 1000Hz. MALTA is our primary emulator, and history has shown emulators tend to like 100Hz better, so run those systems at 100Hz. As with arm, any system that shows a huge performance regression can reverted to 100Hz easily. This was going to be committed well in advance of the 13 branch, but it was delayed and forgotten til now. Discussed on: #bsdmips ages ago Sponsored by: Netflix	2021-06-16 20:00:14 -06:00
John Baldwin	faf0224ff2	ktls: Don't mark existing received mbufs notready for TOE TLS. The TOE driver might receive decrypted TLS records that are enqueued to the socket buffer after ktls_try_toe() returns and before ktls_enable_rx() locks the receive buffer to call sb_mark_notready(). In that case, sb_mark_notready() would incorrectly treat the decrypted TLS record as an encrypted record and schedule it for decryption. This always resulted in the connection being dropped as the data in the control message did not look like a valid TLS header. To fix, don't try to handle software decryption of existing buffers in the socket buffer for TOE TLS in ktls_enable_rx(). If a TOE TLS driver needs to decrypt existing data in the socket buffer, the driver will need to manage that in its tod_alloc_tls_session method. Sponsored by: Chelsio Communications	2021-06-15 17:45:21 -07:00
Konstantin Belousov	a12e901a5a	Add a knob to disable dequeueing SIGCHLD on waiting for live process It seems that Linux does not dequeue siginfo for SIGCHLD when wait*(2) reports status of the running process. In particular, sigwaitinfo(2) and other signal querying syscalls can observe the siginfo after wait. FreeBSD dequeued siginfo from the beginning, so we cannot change the default ABI to be more compatible. Still, add a knob to enable to change to the other behavior for debugging purposes. Reported by: dchagin Reviewed by: dchagin, markj Sponsored by: The FreeBSD Foundation MFC after: 1 week Differential revision: https://reviews.freebsd.org/D30675	2021-06-16 02:00:19 +03:00
Konstantin Belousov	bc38762474	Add a knob to not drop signal with default ignored or ignored actions Traditionally, BSD drops signals with the default action during send, not even putting them to the destination process queue. This semantic is not shared with other operating systems (Linux), which do queue such signals. In particular, sigtimedwait(2) and related syscalls can observe the delivery. Add a global knob kern.sig_discard_ign which can be set to false to force enqueuing of the signals with default action. Also add an ABI flag to indicate that signals should be queued. Note that it is not practical to run with the knob turned on, because almost all software that care about the delivery of such signals, is aware of the difference, and misbehaves if the signals are actually queued. The purpose of the knob as is is to allow for easier diagnostic of the programs that need the adjustments, to confirm the cause of problem. Reported by: dchagin Reviewed by: dchagin, markj Sponsored by: The FreeBSD Foundation MFC after: 1 week Differential revision: https://reviews.freebsd.org/D30675	2021-06-16 02:00:19 +03:00
Konstantin Belousov	acced8b043	sigwait: add comment explaining EINTR/ERESTART details Reviewed by: dchagin, markj Sponsored by: The FreeBSD Foundation MFC after: 1 week Differential revision: https://reviews.freebsd.org/D30675	2021-06-16 02:00:19 +03:00
Konstantin Belousov	afb36e289c	sigwait(2) and sigtimedwait(2) must not be restarted. Reported by: dchagin Reviewed by: dchagin, markj Sponsored by: The FreeBSD Foundation MFC after: 1 week Differential revision: https://reviews.freebsd.org/D30675	2021-06-16 02:00:18 +03:00
Mark Johnston	a100217489	Consistently use the SOCKBUF_MTX() and SOCK_MTX() macros This makes it easier to change the socket locking protocols. No functional change intended. MFC after: 1 week Sponsored by: The FreeBSD Foundation	2021-06-14 17:32:32 -04:00
Mark Johnston	f4bb1869dd	Consistently use the SOLISTENING() macro Some code was using it already, but in many places we were testing SO_ACCEPTCONN directly. As a small step towards fixing some bugs involving synchronization with listen(2), make the kernel consistently use SOLISTENING(). No functional change intended. MFC after: 1 week Sponsored by: The FreeBSD Foundation	2021-06-14 17:32:27 -04:00
Andrew Gallatin	ed5e13cfc2	ktls: Fix interaction with RATELIMIT uipc_ktls.c was missing opt_ratelimit.h, so it was never noticing that RATELIMIT was enabled. Once it was enabled, it failed to compile as ktls_modify_txrtlmt() had accrued a compilation error when it was not being compiled in. Sponsored by: Netflix	2021-06-14 10:51:16 -04:00
Dmitry Chagin	e884512ad1	Split kern_poll() on two counterparts. The kern_poll_kfds() operates on clear kernel data, kfds points to an array in the kernel, while kern_poll() operates on user supplied pollfd. Move nfds check to kern_poll_maxfds(). No functional changes, it's for future use in the Linux emulation layer. Reviewd by: kib Differential Revision: https://reviews.freebsd.org/D30690 MFC after: 2 weeks	2021-06-10 15:11:25 +03:00
Dmitry Chagin	f570a6723e	Fix copyright, remove "all rights reserved". The eventfd code was written by me, rdivacky@ copyrigth applicable only to epoll part of the Linuxulator code. Roman is ok to retire his copyright from sys/kern/sys_eventfd.c and 'All rights reserved.' lines from sys/compat/linux/linux_event.[c\|h] and sys/kern/sys_eventfd.c files. Reviewed by: kib, emaste Approved by: rdivacky Differential Revision: https://reviews.freebsd.org/D30677 MFC after: 2 weeks	2021-06-08 08:18:00 +03:00
Mark Johnston	887c753c9f	Fix handling of D_GIANTOK It was meant to suppress only the printf(), not the subsequent injection of Giant-protected thunks for various file operations. Fixes: `fbeb4ccac9` Reported by: pho Tested by: pho MFC after: 6 days Pointy hat: markj	2021-06-07 16:45:50 -04:00
Mark Johnston	fbeb4ccac9	Suppress D_NEEDGIANT warnings for some drivers During boot we warn that the kbd and openfirm drivers are Giant-locked and may be deleted. Generally, the warning helps signal that certain old drivers are not being maintained and are subject to removal, but this doesn't really apply to certain drivers which are harder to detangle from Giant. Add a flag, D_GIANTOK, that devices can specify to suppress the misleading warning. Use it in the kbd and openfirm drivers. Reviewed by: imp, jhb MFC after: 1 week Sponsored by: The FreeBSD Foundation Differential Revision: https://reviews.freebsd.org/D30649	2021-06-06 16:44:46 -04:00
Konstantin Belousov	2d423f7671	sysent: allow ABI to disable setid on exec. Reviewed by: dchagin Tested by: trasz MFC after: 1 week Sponsored by: The FreeBSD Foundation Differential Revision: https://reviews.freebsd.org/D28154	2021-06-06 21:42:52 +03:00
Konstantin Belousov	19e6043a44	kern_exec.c: Add execve_nosetid() helper Reviewed by: dchagin Tested by: trasz MFC after: 1 week Sponsored by: The FreeBSD Foundation Differential Revision: https://reviews.freebsd.org/D28154	2021-06-06 21:42:41 +03:00
Jason A. Harmening	59409cb90f	Add a generic mechanism for preventing forced unmount This is aimed at preventing stacked filesystems like nullfs and unionfs from "losing" their lower mounts due to forced unmount. Otherwise, VFS operations that are passed through to the lower filesystem(s) may crash or otherwise cause unpredictable behavior. Introduce two new functions: vfs_pin_from_vp() and vfs_unpin(). which are intended to be called on the lower mount(s) when the stacked filesystem is mounted and unmounted, respectively. Much as registration in the mnt_uppers list previously did, pinning will prevent even forced unmount of the lower FS and will allow the stacked FS to freely operate on the lower mount either by direct use of the struct mount* or indirect use through a properly-referenced vnode's v_mount field. vfs_pin_from_vp() is modeled after vfs_ref_from_vp() in that it uses the mount interlock coupled with re-checking vp->v_mount to ensure that it will fail in the face of a pending unmount request, even if the concurrent unmount fully completes. Adopt these new functions in both nullfs and unionfs. Reviewed By: kib, markj Differential Revision: https://reviews.freebsd.org/D30401	2021-06-05 18:20:36 -07:00
wiklam	43521b46fc	Correcting comment about "sched_interact_score". Reviewed by: jrtc@, imp@ Pull Request: https://github.com/freebsd/freebsd-src/pull/431 Sponsored by: Netflix	2021-06-02 21:50:57 -06:00
Warner Losh	9f3d1a98dd	regen after tweaks to getgroups and setgroups Sponsored by: Netflix	2021-06-02 13:24:50 -06:00
Moritz Buhl	4bc2174a1b	kern: fail getgroup and setgroup with negative int Found using https://github.com/NetBSD/src/blob/trunk/tests/lib/libc/sys/t_getgroups.c getgroups/setgroups want an int and therefore casting it to u_int resulted in `getgroups(-1, ...)` not returning -1 / errno = EINVAL. imp@ updated syscall.master and made changes markj@ suggested PR: 189941 Tested by: imp@ Reviewed by: markj@ Pull Request: https://github.com/freebsd/freebsd-src/pull/407 Differential Revision: https://reviews.freebsd.org/D30617	2021-06-02 13:22:57 -06:00
Mateusz Guzik	c9f8dcda85	kqueue: replace kq_ncallouts loop with atomic_fetchadd	2021-06-02 15:14:58 +00:00
Rich Ercolani	a19ae1b099	vfs: fix MNT_SYNCHRONOUS check in vn_write `ca1ce50b2b` ("vfs: add more safety against concurrent forced unmount to vn_write") has a side effect of only checking MNT_SYNCHRONOUS if O_FSYNC is set. Reviewed By: mjg Differential Revision: https://reviews.freebsd.org/D30610	2021-06-02 13:42:02 +00:00
Kyle Evans	2d741f33bd	kern: ether_gen_addr: randomize on default hostuuid, too Currently, this will still hash the default (all zero) hostuuid and potentially arrive at a MAC address that has a high chance of collision if another interface of the same name appears in the same broadcast domain on another host without a hostuuid, e.g., some virtual machine setups. Instead of using the default hostuuid, just treat it as a failure and generate a random LA unicast MAC address. Reviewed by: bz, gbe, imp, kbowling, kp MFC after: 1 week Differential Revision: https://reviews.freebsd.org/D29788	2021-06-01 22:59:21 -05:00
Mark Johnston	283e60fb31	ktrace: Fix an inverted comparison added in commit `f3851b235` Fixes: `f3851b235` ("ktrace: Fix a race with fork()") Reported by: dchagin, phk	2021-06-01 09:15:35 -04:00
Konstantin Belousov	d3f7975fcb	thread_reap_barrier(): remove unused variable Noted by: alc Sponsored by: Mellanox Technologies/NVidia Networking MFC after: 1 week	2021-05-31 23:03:42 +03:00
Konstantin Belousov	f62c7e54e9	Add thread_reap_barrier() Reviewed by: hselasky,markj Sponsored by: Mellanox Technologies/NVidia Networking MFC after: 1 week Differential revision: https://reviews.freebsd.org/D30468	2021-05-31 18:09:22 +03:00
Konstantin Belousov	3a68546d23	quisce_cpus(): add special handling for PDROP Currently passing PDROP to the quisce_cpus() function does not make sense. Add special meaning for it, by not waiting for the idle thread to schedule. Also avoid allocating u_int[MAXCPU] on the stack. Reviewed by: hselasky, markj Sponsored by: Mellanox Technologies/NVidia Networking MFC after: 1 week Differential revision: https://reviews.freebsd.org/D30468	2021-05-31 18:09:22 +03:00
Konstantin Belousov	845d77974b	kern_thread.c: wrap too long lines Reviewed by: hselasky, markj Sponsored by: Mellanox Technologies/NVidia Networking MFC after: 1 week Differential revision: https://reviews.freebsd.org/D30468	2021-05-31 18:09:22 +03:00
Konstantin Belousov	e266a0f7f0	kern linker: do not allow more than one kldload and kldunload syscalls simultaneously kld_sx is dropped e.g. for executing sysinits, which allows user to initiate kldunload while module is not yet fully initialized. Reviewed by: markj Differential revision: https://reviews.freebsd.org/D30456 Sponsored by: The FreeBSD Foundation MFC after: 1 week	2021-05-31 18:09:22 +03:00
Konstantin Belousov	27006229f7	vinvalbuf: do not panic if we were unable to flush dirty buffers Return EBUSY instead and let caller to handle the issue. For vgone()/vnode reclamation, caller first does vinvalbuf(V_SAVE), which return EBUSY in case dirty buffers where not flushed. Then caller calls vinvalbuf(0) due to non-zero return, which gets rid of all dirty buffers without dependencies. PR: 238565 Reviewed by: asomers, mckusick Sponsored by: The FreeBSD Foundation MFC after: 1 week Differential revision: https://reviews.freebsd.org/D30555	2021-05-31 01:20:53 +03:00
Jason A. Harmening	a4b07a2701	VFS_QUOTACTL(9): allow implementation to indicate busy state changes Instead of requiring all implementations of vfs_quotactl to unbusy the mount for Q_QUOTAON and Q_QUOTAOFF, add an "mp_busy" in/out param to VFS_QUOTACTL(9). The implementation may then indicate to the caller whether it needed to unbusy the mount. Also, add stbool.h to libprocstat modules which #define _KERNEL before including sys/mount.h. Otherwise they'll pull in sys/types.h before defining _KERNEL and therefore won't have the bool definition they need for mp_busy. Reviewed By: kib, markj Differential Revision: https://reviews.freebsd.org/D30556	2021-05-30 14:53:47 -07:00
Jason A. Harmening	271fcf1c28	Revert commits `6d3e78ad6c` and `54256e7954` Parts of libprocstat like to pretend they're kernel components for the sake of including mount.h, and including sys/types.h in the _KERNEL case doesn't fix the build for some reason. Revert both the VFS_QUOTACTL() change and the follow-up "fix" for now.	2021-05-29 17:48:02 -07:00
Mateusz Guzik	3cf75ca220	vfs: retire unused vn_seqc_write_begin_unheld*	2021-05-29 22:04:09 +00:00
Mateusz Guzik	d81aefa8b7	vfs: use the sentinel trick in locked lookup path parsing	2021-05-29 22:04:09 +00:00
Mateusz Guzik	478c52f1e3	vfs: slightly rework vn_rlimit_fsize	2021-05-29 22:04:09 +00:00
Mateusz Guzik	9bfddb3ac4	fd: use PROC_WAIT_UNLOCKED when clearing p_fd/p_pd	2021-05-29 22:04:09 +00:00
Jason A. Harmening	6d3e78ad6c	VFS_QUOTACTL(9): allow implementation to indicate busy state changes Instead of requiring all implementations of vfs_quotactl to unbusy the mount for Q_QUOTAON and Q_QUOTAOFF, add an "mp_busy" in/out param to VFS_QUOTACTL(9). The implementation may then indicate to the caller whether it needed to unbusy the mount. Reviewed By: kib, markj Differential Revision: https://reviews.freebsd.org/D30218	2021-05-29 14:05:39 -07:00
Mark Johnston	f3851b235b	ktrace: Fix a race with fork() ktrace(2) may toggle trace points in any of 1. a single process 2. all members of a process group 3. all descendents of the processes in 1 or 2 In the first two cases, we do not permit the operation if the process is being forked or not visible. However, in case 3 we did not enforce this restriction for descendents. As a result, the assertions about the child in ktrprocfork() may be violated. Move these checks into ktrops() so that they are applied consistently. Allow KTROP_CLEAR for nascent processes. Otherwise, there is a window where we cannot clear trace points for a nascent child if they are inherited from the parent. Reported by: syzbot+d96676592978f137e05c@syzkaller.appspotmail.com Reported by: syzbot+7c98fcf84a4439f2817f@syzkaller.appspotmail.com Reviewed by: kib MFC after: 1 week Sponsored by: The FreeBSD Foundation Differential Revision: https://reviews.freebsd.org/D30481	2021-05-27 15:52:20 -04:00
Mark Johnston	e00bae5c18	kevent: Prohibit negative change and event list lengths Previously, a negative change list length would be treated the same as an empty change list. A negative event list length would result in bogus copyouts. Make kevent(2) return EINVAL for both cases so that application bugs are more easily found, and to be more robust against future changes to kevent internals. Reviewed by: imp, kib MFC after: 1 week Sponsored by: The FreeBSD Foundation Differential Revision: https://reviews.freebsd.org/D30480	2021-05-27 15:52:20 -04:00
Mark Johnston	f885100773	ktrace: Handle negative array sizes in ktrstructarray ktrstructarray() may be used to create copies of kevent(2) change and event arrays. It is called before parameter validation is done and so should check for bogus array lengths before allocating a copy. Reported by: syzkaller Reviewed by: kib MFC after: 1 week Sponsored by: The FreeBSD Foundation Differential Revision: https://reviews.freebsd.org/D30479	2021-05-27 15:52:20 -04:00
Edward Tomasz Napierala	905d192d6f	Unstaticize parts of coredumping code This makes it possible to call __elfN(size_segments) and __elfN(puthdr) from Linux coredump code. Reviewed By: kib Sponsored By: EPSRC Differential Revision: https://reviews.freebsd.org/D30455	2021-05-26 11:51:57 +01:00
John Baldwin	6b313a3a60	Include the trailer in the original dst_iov. This avoids creating a duplicate copy on the stack just to append the trailer. Reviewed by: gallatin, markj Sponsored by: Netflix Differential Revision: https://reviews.freebsd.org/D30139	2021-05-25 16:59:19 -07:00
John Baldwin	21e3c1fbe2	Assume OCF is the only KTLS software backend. This removes support for loadable software backends. The KTLS OCF support is now always included in kernels with KERN_TLS and the ktls_ocf.ko module has been removed. The software encryption routines now take an mbuf directly and use the TLS mbuf as the crypto buffer when possible. Bump __FreeBSD_version for software backends in ports. Reviewed by: gallatin, markj Sponsored by: Netflix Differential Revision: https://reviews.freebsd.org/D30138	2021-05-25 16:59:19 -07:00
John Baldwin	883a0196b6	crypto: Add a new type of crypto buffer for a single mbuf. This is intended for use in KTLS transmit where each TLS record is described by a single mbuf that is itself queued in the socket buffer. Using the existing CRYPTO_BUF_MBUF would result in bus_dmamap_load_crp() walking additional mbufs in the socket buffer that are not relevant, but generating a S/G list that potentially exceeds the limit of the tag (while also wasting CPU cycles). Reviewed by: markj Sponsored by: Netflix Differential Revision: https://reviews.freebsd.org/D30136	2021-05-25 16:59:18 -07:00
John Baldwin	6663f8a23e	sglist: Add sglist_append_single_mbuf(). This function appends the contents of a single mbuf to an sglist rather than an entire mbuf chain. Reviewed by: gallatin, markj Sponsored by: Netflix Differential Revision: https://reviews.freebsd.org/D30135	2021-05-25 16:59:18 -07:00
John Baldwin	aa341db39b	Rename m_unmappedtouio() to m_unmapped_uiomove(). This function doesn't only copy data into a uio but instead is a variant of uiomove() similar to uiomove_fromphys(). Reviewed by: gallatin, markj Sponsored by: Netflix Differential Revision: https://reviews.freebsd.org/D30444	2021-05-25 16:59:18 -07:00
John Baldwin	3f9dac85cc	Extend m_copyback() to support unmapped mbufs. Reviewed by: gallatin, markj Sponsored by: Netflix Differential Revision: https://reviews.freebsd.org/D30133	2021-05-25 16:59:18 -07:00
John Baldwin	3c7a01d773	Extend m_apply() to support unmapped mbufs. m_apply() invokes the callback function separately on each segment of an unmapped mbuf: the TLS header, individual pages, and the TLS trailer. Reviewed by: markj Sponsored by: Netflix Differential Revision: https://reviews.freebsd.org/D30132	2021-05-25 16:59:18 -07:00
Edward Tomasz Napierala	3b9971c8da	Clean up some of the core dumping code. No functional changes. Reviewed By: kib Sponsored By: EPSRC Differential Revision: https://reviews.freebsd.org/D30397	2021-05-25 16:30:32 +01:00
Konstantin Belousov	fd3ac06f45	ptrace: add an option to not kill debuggees on debugger exit Requested by: markj Reviewed by: jhb (previous version) Tested by: pho Sponsored by: The FreeBSD Foundation MFC after: 1 week Differrential revision: https://reviews.freebsd.org/D30351	2021-05-25 18:22:34 +03:00
Konstantin Belousov	d7a7ea5be6	sys_process.c: extract ptrace_unsuspend() Reviewed by: jhb Tested by: pho Sponsored by: The FreeBSD Foundation MFC after: 1 week Differrential revision: https://reviews.freebsd.org/D30351	2021-05-25 18:22:27 +03:00
Mateusz Guzik	a269183875	vfs: elide vnode locking when it is only needed for audit if possible	2021-05-23 19:37:16 +00:00
Mark Johnston	6f6cd1e8e8	ktrace: Remove vrele() at the end of ktr_writerequest() As of commit `fc369a353` we no longer ref the vnode when writing a record. Drop the corresponding vrele() call in the error case. Fixes: `fc369a353` ("ktrace: fix a race between writes and close") Reported by: syzbot+9b96ea7a5ff8917d3fe4@syzkaller.appspotmail.com Reported by: syzbot+6120ebbb354cd52e5107@syzkaller.appspotmail.com Reviewed by: kib MFC after: 6 days Differential Revision: https://reviews.freebsd.org/D30404	2021-05-23 14:13:01 -04:00
Mateusz Guzik	e2ab16b1a6	lockprof: move panic check after inspecting the state	2021-05-23 17:55:27 +00:00
Mateusz Guzik	6a467cc5e1	lockprof: pass lock type as an argument instead of reading the spin flag	2021-05-23 17:55:27 +00:00
Hans Petter Selasky	ef0f7ae934	The old thread priority must be stored as part of the EPOCH(9) tracker. Else recursive use of EPOCH(9) may cause the wrong priority to be restored. Bump the __FreeBSD_version due to changing the thread and epoch tracker structure. Differential Revision: https://reviews.freebsd.org/D30375 Reviewed by: markj@ MFC after: 1 week Sponsored by: Mellanox Technologies // NVIDIA Networking	2021-05-23 10:53:25 +02:00
Mateusz Guzik	138f78e94b	umtx: convert umtxq_lock to a macro Then LOCK_PROFILING starts reporting callers instead of the inline.	2021-05-22 21:01:05 +00:00
Mateusz Guzik	e71d5c7331	Fix limit testing after `1762f674cc` ktrace commit. The previous: if ((uoff_t)uio->uio_offset + uio->uio_resid > lim) signal(....); was replaced with: if ((uoff_t)uio->uio_offset + uio->uio_resid < lim) return; signal(....); Making (uoff_t)uio->uio_offset + uio->uio_resid == lim trip over the limit, when it did not previously. Unbreaks running 13.0 buildworld.	2021-05-22 20:18:21 +00:00
Konstantin Belousov	fc369a353b	ktrace: fix a race between writes and close It was possible that termination of ktrace session occured during some record write, in which case write occured after the close of the vnode. Use ktr_io_params refcounting to avoid this situation, by taking the reference on the structure instead of vnode. Reviewed by: markj Sponsored by: The FreeBSD Foundation MFC after: 1 week Differential revision: https://reviews.freebsd.org/D30400	2021-05-22 23:14:13 +03:00
Mateusz Guzik	48235c377f	Fix a braino in previous. Instead of trying to partially ifdef out ktrace handling, define the missing identifier to 0. Without this fix lack of ktrace in the kernel also means there is no SIGXFSZ signal delivery.	2021-05-22 19:53:40 +00:00
Mateusz Guzik	154f0ecc10	Fix tinderbox build after `1762f674cc` ktrace commit.	2021-05-22 19:41:19 +00:00
Mateusz Guzik	a0842e69aa	lockprof: add contested-only profiling This allows tracking all wait times with much smaller runtime impact. For example when doing -j 104 buildkernel on tmpfs: no profiling: 2921.70s user 282.72s system 6598% cpu 48.562 total all acquires: 2926.87s user 350.53s system 6656% cpu 49.237 total contested only: 2919.64s user 290.31s system 6583% cpu 48.756 total	2021-05-22 19:28:37 +00:00
Mateusz Guzik	fca5cfd584	lockprof: retire lock_prof_skipcount The implementation uses a global variable for ALL calls, defeating the point of sampling in the first place. Remove it as it clearly remains unused.	2021-05-22 19:28:37 +00:00
Mateusz Guzik	cf74b2be53	vfs: retire the now unused vnlru_free routine	2021-05-22 18:42:30 +00:00
Mark Johnston	e4b16f2fb1	ktrace: Avoid recursion in namei() sys_ktrace() calls namei(), which may call ktrnamei(). But sys_ktrace() also calls ktrace_enter() first, so if the caller is itself being traced, the assertion in ktrace_enter() is triggered. And, ktrnamei() does not check for recursion like most other ktrace ops do. Fix the bug by simply deferring the ktrace_enter() call. Also make the parameter to ktrnamei() const and convert to ANSI. Reported by: syzbot+d0a4de45e58d3c08af4b@syzkaller.appspotmail.com Reviewed by: kib MFC after: 1 week Sponsored by: The FreeBSD Foundation Differential Revision: https://reviews.freebsd.org/D30340	2021-05-22 12:07:32 -04:00
Konstantin Belousov	f784da883f	Move mnt_maxsymlinklen into appropriate fs mount data structures Reviewed by: mckusick Tested by: pho Sponsored by: The FreeBSD Foundation MFC after: 1 week X-MFC-Note: struct mount layout Differential revision: https://reviews.freebsd.org/D30325	2021-05-22 15:16:09 +03:00
Konstantin Belousov	ea2b64c241	ktrace: add a kern.ktrace.filesize_limit_signal knob When enabled, writes to ktrace.out that exceed the max file size limit cause SIGXFSZ as it should be, but note that the limit is taken from the process that initiated ktrace. When disabled, write is blocked, but signal is not send. Note that in either case ktrace for the affected process is stopped. Requested and reviewed by: markj Tested by: pho Sponsored by: The FreeBSD Foundation MFC after: 1 week Differential revision: https://reviews.freebsd.org/D30257	2021-05-22 15:16:09 +03:00
Konstantin Belousov	02645b886b	ktrace: use the limit of the trace initiator for file size limit on writes Reviewed by: markj Tested by: pho Sponsored by: The FreeBSD Foundation MFC after: 1 week Differential revision: https://reviews.freebsd.org/D30257	2021-05-22 15:16:09 +03:00
Konstantin Belousov	1762f674cc	ktrace: pack all ktrace parameters into allocated structure ktr_io_params Ref-count the ktr_io_params structure instead of vnode/cred. Reviewed by: markj Tested by: pho Sponsored by: The FreeBSD Foundation MFC after: 1 week Differential revision: https://reviews.freebsd.org/D30257	2021-05-22 15:16:08 +03:00
Konstantin Belousov	a6144f713c	ktrace: do not stop tracing other processes if our cannot write to this vnode Other processes might still be able to write, make the decision to stop based on the per-process situation. Reviewed by: markj Tested by: pho Sponsored by: The FreeBSD Foundation MFC after: 1 week Differential revision: https://reviews.freebsd.org/D30257	2021-05-22 15:16:08 +03:00
Konstantin Belousov	9bb84c23e7	accounting: explicitly mark the exiting thread as doing accounting and use the mark to stop applying file size limits on the write of the accounting record. This allows to remove hack to clear process limits in acct_process(), and avoids the bug with the clearing being ineffective because limits are also cached in the thread structure. Reported and reviewed by: markj Tested by: pho Sponsored by: The FreeBSD Foundation MFC after: 1 week Differential revision: https://reviews.freebsd.org/D30257	2021-05-22 15:16:08 +03:00
Konstantin Belousov	70c05850e2	kern_descrip.c: Style Wrap too long lines. Reviewed by: markj Sponsored by: The FreeBSD Foundation MFC after: 1 week Differential revision: https://reviews.freebsd.org/D30257	2021-05-22 15:16:08 +03:00
Konstantin Belousov	d713bf7927	vn_need_pageq_flush(): simplify There is no need to own vnode interlock, since v_object is type stable and can only change to/from NULL, and no other checks in the function access fields protected by the interlock. Remove the need variable, the result of the test is directly usable as return value. Tested by: mav, pho Sponsored by: The FreeBSD Foundation MFC after: 1 week	2021-05-22 12:29:44 +03:00
Edward Tomasz Napierala	33621dfc19	Refactor core dumping code a bit This makes it possible to use core_write(), core_output(), and sbuf_drain_core_output(), in Linux coredump code. Moving them out of imgact_elf.c is necessary because of the weird way it's being built. Reviewed By: kib Sponsored By: EPSRC Differential Revision: https://reviews.freebsd.org/D30369	2021-05-22 09:59:00 +01:00
Mark Johnston	916c61a5ed	Fix handling of errors from pru_send(PRUS_NOTREADY) PRUS_NOTREADY indicates that the caller has not yet populated the chain with data, and so it is not ready for transmission. This is used by sendfile (for async I/O) and KTLS (for encryption). In particular, if pru_send returns an error, the caller is responsible for freeing the chain since other implicit references to the data buffers exist. For async sendfile, it happens that an error will only be returned if the connection was dropped, in which case tcp_usr_ready() will handle freeing the chain. But since KTLS can be used in conjunction with the regular socket I/O system calls, many more error cases - which do not result in the connection being dropped - are reachable. In these cases, KTLS was effectively assuming success. So: - Change sosend_generic() to free the mbuf chain if pru_send(PRUS_NOTREADY) fails. Nothing else owns a reference to the chain at that point. - Similarly, in vn_sendfile() change the !async I/O && KTLS case to free the chain. - If async I/O is still outstanding when pru_send fails in vn_sendfile(), set an error in the sfio structure so that the connection is aborted and the mbuf chain is freed. Reviewed by: gallatin, tuexen Discussed with: jhb MFC after: 2 weeks Sponsored by: The FreeBSD Foundation Differential Revision: https://reviews.freebsd.org/D30349	2021-05-21 17:45:19 -04:00
Hans Petter Selasky	c82c200622	Accessing the epoch structure should happen after the INIT_CHECK(). Else the epoch pointer may be NULL. MFC after: 1 week Sponsored by: Mellanox Technologies // NVIDIA Networking	2021-05-21 11:21:32 +02:00
Hans Petter Selasky	f33168351b	Properly define EPOCH(9) function macro. No functional change intended. MFC after: 1 week Sponsored by: Mellanox Technologies // NVIDIA Networking	2021-05-21 11:21:32 +02:00
Hans Petter Selasky	cc9bb7a9b8	Rework for-loop in EPOCH(9) to reduce indentation level. No functional change intended. MFC after: 1 week Sponsored by: Mellanox Technologies // NVIDIA Networking	2021-05-21 11:21:32 +02:00
Lv Yunlong	b295c5ddce	socket: Release cred reference later in sodealloc() We dereference so->so_cred to update the per-uid socket buffer accounting, so the crfree() call must be deferred until after that point. PR: 255869 MFC after: 1 week	2021-05-18 15:25:40 -04:00
Konstantin Belousov	8cf912b017	ttydev_write: prevent stops while terminal is busied Since busy state is checked by all blocked writes, stopping a process which waits in ttydisc_write() causes cascade. Utilize sigdeferstop() to avoid the issue. Submitted by: Jakub Piecuch <j.piecuch96@gmail.com> PR: 255816 MFC after: 1 week	2021-05-18 20:52:03 +03:00
Mateusz Guzik	cc6f46ac2f	vfs: refactor vdrop In particular move vunlazy into its own routine.	2021-05-18 15:30:28 +00:00
Mateusz Guzik	715fcc0d34	vfs: change vn_freevnodes_* prefix to idiomatic vfs_freevnodes_*	2021-05-18 15:30:28 +00:00
Colin Percival	b6be9566d2	Fix buffer overflow in preloaded hostuuid cleaning When a module of type "hostuuid" is provided by the loader, prison0_init strips any trailing whitespace and ASCII control characters by (a) adjusting the buffer length, and (b) zeroing out the characters in question, before storing it as the system's hostuuid. The buffer length adjustment was correct, but the zeroing overwrote one byte higher in memory than intended -- in the typical case, zeroing one byte past the end of the hostuuid buffer. Due to the layout of buffers passed by the boot loader to the kernel, this will be the first byte of a subsequent buffer. This was probably harmless; prison0_init runs after preloaded kernel modules have been linked and after the preloaded /boot/entropy cache has been processed, so in both cases having the first byte overwritten will not cause problems. We cannot however rule out the possibility that other objects which are preloaded by the loader could suffer from having the first byte overwritten. Since the zeroing does not in fact serve any purpose, remove it and trim trailing whitespace and ASCII control characters by adjusting the buffer length alone. Fixes: `c3188289` Preload hostuuid for early-boot use Reviewed by: kevans, markj MFC after: 3 days	2021-05-17 20:07:49 -07:00
Colin Percival	330f110bf1	Fix 'hostuuid: preload data malformed' warning If the preloaded hostuuid value is invalid and verbose booting is enabled, a warning is printed. This printf had two bugs: 1. It was missing a trailing \n character. 2. The malformed UUID is printed with %s even though it is not known to be NUL-terminated. This commit adds the missing \n and uses %.*s with the (already known) length of the preloaded UUID to ensure that we don't read past the end of the buffer. Reported by: kevans Fixes: `c3188289` Preload hostuuid for early-boot use MFC after: 3 days	2021-05-17 20:07:49 -07:00
Kirk McKusick	9a2fac6ba6	Fix handling of embedded symbolic links (and history lesson). The original filesystem release (4.2BSD) had no embedded sysmlinks. Historically symbolic links were just a different type of file, so the content of the symbolic link was contained in a single disk block fragment. We observed that most symbolic links were short enough that they could fit in the area of the inode that normally holds the block pointers. So we created embedded symlinks where the content of the link was held in the inode's pointer area thus avoiding the need to seek and read a data fragment and reducing the pressure on the block cache. At the time we had only UFS1 with 32-bit block pointers, so the test for a fastlink was: di_size < (NDADDR + NIADDR) * sizeof(daddr_t) (where daddr_t would be ufs1_daddr_t today). When embedded symlinks were added, a spare field in the superblock with a known zero value became fs_maxsymlinklen. New filesystems set this field to (NDADDR + NIADDR) * sizeof(daddr_t). Embedded symlinks were assumed when di_size < fs->fs_maxsymlinklen. Thus filesystems that preceeded this change always read from blocks (since fs->fs_maxsymlinklen == 0) and newer ones used embedded symlinks if they fit. Similarly symlinks created on pre-embedded symlink filesystems always spill into blocks while newer ones will embed if they fit. At the same time that the embedded symbolic links were added, the on-disk directory structure was changed splitting the former u_int16_t d_namlen into u_int8_t d_type and u_int8_t d_namlen. Thus fs_maxsymlinklen <= 0 (as used by the OFSFMT() macro) can be used to distinguish old directory formats. In retrospect that should have just been an added flag, but we did not realize we needed to know about that change until it was already in production. Code was split into ufs/ffs so that the log structured filesystem could use ufs functionality while doing its own disk layout. This meant that no ffs superblock fields could be used in the ufs code. Thus ffs superblock fields that were needed in ufs code had to be copied to fields in the mount structure. Since ufs_readlink needed to know if a link was embedded, fs_maxlinklen gets copied to mnt_maxsymlinklen. The kernel panic that arose to making this fix was triggered when a disk error created an inode of type symlink with no allocated data blocks but a large size. When readlink was called the uiomove was attempted which segment faulted. static int ufs_readlink(ap) struct vop_readlink_args /* { struct vnode a_vp; struct uio a_uio; struct ucred a_cred; } / ap; { struct vnode vp = ap->a_vp; struct inode ip = VTOI(vp); doff_t isize; isize = ip->i_size; if ((isize < vp->v_mount->mnt_maxsymlinklen) \|\| DIP(ip, i_blocks) == 0) { / XXX - for old fastlink support / return (uiomove(SHORTLINK(ip), isize, ap->a_uio)); } return (VOP_READ(vp, ap->a_uio, 0, ap->a_cred)); } The second part of the "if" statement that adds DIP(ip, i_blocks) == 0) { / XXX - for old fastlink support */ is problematic. It never appeared in BSD released by Berkeley because as noted above mnt_maxsymlinklen is 0 for old format filesystems, so will always fall through to the VOP_READ as it should. I had to dig back through `git blame' to find that Rodney Grimes added it as part of ``The big 4.4BSD Lite to FreeBSD 2.0.0 (Development) patch.'' He must have brought it across from an earlier FreeBSD. Unfortunately the source-control logs for FreeBSD up to the merger with the AT&T-blessed 4.4BSD-Lite conversion were destroyed as part of the agreement to let FreeBSD remain unencumbered, so I cannot pin-point where that line got added on the FreeBSD side. The one change needed here is that mnt_maxsymlinklen is declared as an `int' and should be changed to be `u_int64_t'. This discovery led us to check out the code that deletes symbolic links. Specifically if (vp->v_type == VLNK && (ip->i_size < vp->v_mount->mnt_maxsymlinklen \|\| datablocks == 0)) { if (length != 0) panic("ffs_truncate: partial truncate of symlink"); bzero(SHORTLINK(ip), (u_int)ip->i_size); ip->i_size = 0; DIP_SET(ip, i_size, 0); UFS_INODE_SET_FLAG(ip, IN_SIZEMOD \| IN_CHANGE \| IN_UPDATE); if (needextclean) goto extclean; return (ffs_update(vp, waitforupdate)); } Here too our broken symlink inode with no data blocks allocated and a large size will segment fault as we are incorrectly using the test that we have no data blocks to decide that it is an embdedded symbolic link and attempting to bzero past the end of the inode. The test for datablocks == 0 is unnecessary as the test for ip->i_size < vp->v_mount->mnt_maxsymlinklen will do the right thing in all cases. The test for datablocks == 0 was added by David Greenman in this commit: Author: David Greenman <dg@FreeBSD.org> Date: Tue Aug 2 13:51:05 1994 +0000 Completed (hopefully) the kernel support for old style "fastlinks". Notes: svn path=/head/; revision=1821 I am guessing that he likely earlier added the incorrect test in the ufs_readlink code. I asked David if he had any recollection of why he made this change. Amazingly, he still had a recollection of why he had made a one-line change more than twenty years ago. And unsurpisingly it was because he had been stuck between a rock and a hard place. FreeBSD was up to 1.1.5 before the switch to the 4.4BSD-Lite code base. Prior to that, there were three years of development in all areas of the kernel, including the filesystem code, from the combined set of people including Bill Jolitz, Patchkit contributors, and FreeBSD Project members. The compatibility issue at hand was caused by the FASTLINKS patches from Curt Mayer. In merging in the 4.4BSD-Lite changes David had to find a way to provide compatibility with both the changes that had been made in FreeBSD 1.1.5 and with 4.4BSD-Lite. He felt that these changes would provide compatibility with both systems. In his words: ``My recollection is that the 'FASTLINKS' symlinks support in FreeBSD-1.x, as implemented by Curt Mayer, worked differently than 4.4BSD. He used a spare field in the inode to duplicately store the length. When the 4.4BSD-Lite merge was done, the optimized symlinks support for existing filesystems (those that were initialized in FreeBSD-1.x) were broken due to the FFS on-disk structure of 4.4BSD-Lite differing from FreeBSD-1.x. My commit was needed to restore the backward compatibility with FreeBSD-1.x filesystems. I think it was the best that could be done in the somewhat urgent circumstances of the post Berkeley-USL settlement. Also, regarding Rod's massive commit with little explanation, some context: John Dyson and I did the initial re-port of the 4.4BSD-Lite kernel to the 386 platform in just 10 days. It was by far the most intense hacking effort of my life. In addition to the porting of tons of FreeBSD-1 code, I think we wrote more than 30,000 lines of new code in that time to deal with the missing pieces and architectural changes of 4.4BSD-Lite. We didn't make many notes along the way. There was a lot of pressure to get something out to the rest of the developer community as fast as possible, so detailed discrete commits didn't happen - it all came as a giant wad, which is why Rod's commit message was worded the way it was.'' Reported by: Chuck Silvers Tested by: Chuck Silvers History by: David Greenman Lawrence MFC after: 1 week Sponsored by: Netflix	2021-05-16 17:04:11 -07:00
Mateusz Guzik	852088f6af	vfs: add missing atomic conversion to writecount adjustment Fixes: ("vfs: lockless writecount adjustment in set/unset text")	2021-05-14 17:42:05 +02:00
Mateusz Guzik	ca1ce50b2b	vfs: add more safety against concurrent forced unmount to vn_write 1. stop re-reading ->v_mount (can become NULL) 2. stop re-reading ->v_type (can change to VBAD)	2021-05-14 14:22:22 +00:00
Mateusz Guzik	b5fb9ae687	vfs: lockless writecount adjustment in set/unset text ... for cases where this is not the first/last exec.	2021-05-14 14:22:21 +00:00
Mark Johnston	2cca77ee01	kqueue timer: Remove detached knotes from the process stop queue There are some scenarios where a timer event may be detached when it is on the process' kqueue timer stop queue. If kqtimer_proc_continue() is called after that point, it will iterate over the queue and access freed timer structures. It is also possible, at least in a multithreaded program, for a stopped timer event to be scheduled without removing it from the process' stop queue. Ensure that we do not doubly enqueue the event structure in this case. Reported by: syzbot+cea0931bb4e34cd728bd@syzkaller.appspotmail.com Reported by: syzbot+9e1a2f3734652015998c@syzkaller.appspotmail.com Reviewed by: kib MFC after: 1 week Sponsored by: The FreeBSD Foundation Differential Revision: https://reviews.freebsd.org/D30251	2021-05-14 10:08:14 -04:00
Ed Maste	2c9764f36b	regen syscall files after d51198d63b63	2021-05-13 14:09:58 -04:00
Mark Johnston	8b3c4231ab	posix timers: Check for overflow when converting to ns Disallow a time or timer period value when the conversion to nanoseconds would overflow. Otherwise it is possible to trigger a divison by zero in realtime_expire_l(), where we compute the number of overruns by dividing by the timer interval. Fixes: `7995dae9` ("posix timers: Improve the overrun calculation") Reported by: syzbot+5ab360bd3d3e3c5a6e0e@syzkaller.appspotmail.com Reported by: syzbot+157b74ff493140d86eac@syzkaller.appspotmail.com Reviewed by: kib MFC after: 1 week Sponsored by: The FreeBSD Foundation Differential Revision: https://reviews.freebsd.org/D30233	2021-05-13 08:34:03 -04:00
Mark Johnston	9246b3090c	fork: Suspend other threads if both RFPROC and RFMEM are not set Otherwise, a multithreaded parent process may trigger races in vm_forkproc() if one thread calls rfork() with RFMEM set and another calls rfork() without RFMEM. Also simplify vm_forkproc() a bit, vmspace_unshare() already checks to see if the address space is shared. Reported by: syzbot+0aa7c2bec74c4066c36f@syzkaller.appspotmail.com Reported by: syzbot+ea84cb06937afeae609d@syzkaller.appspotmail.com Reviewed by: kib MFC after: 1 week Sponsored by: The FreeBSD Foundation Differential Revision: https://reviews.freebsd.org/D30220	2021-05-13 08:33:23 -04:00
Mateusz Guzik	cef8a95acb	vfs: fix vnode use count leak in O_EMPTY_PATH support The vnode returned by namei_setup is already referenced. Reported by: pho	2021-05-13 09:39:27 +00:00
Konstantin Belousov	6de3cf14c4	vn_open_cred(): disallow O_CREAT \| O_EMPTY_PATH This combination does not make sense, and cannot be satisfied by lookup. In particular, lookup cannot supply dvp, it only can directly return vp. Reported and reviewed by: markj using syzkaller Sponsored by: The FreeBSD Foundation MFC after: 3 days	2021-05-13 02:32:04 +03:00
Mark Johnston	d8acd2681b	Fix mbuf leaks in various pru_send implementations The various protocol implementations are not very consistent about freeing mbufs in error paths. In general, all protocols must free both "m" and "control" upon an error, except if PRUS_NOTREADY is specified (this is only implemented by TCP and unix(4) and requires further work not handled in this diff), in which case "control" still must be freed. This diff plugs various leaks in the pru_send implementations. Reviewed by: tuexen MFC after: 2 weeks Sponsored by: The FreeBSD Foundation Differential Revision: https://reviews.freebsd.org/D30151	2021-05-12 13:00:09 -04:00
Mateusz Guzik	12288bd999	cache: fix lockless absolute symlink traversal to non-fp mounts Said lookups would incorrectly fail with EOPNOTSUP. Reported by: kib	2021-05-11 04:30:12 +00:00
Mark Johnston	c8bbb1272c	vfs: Fix error handling in vn_fullpath_hardlink() vn_fullpath_any_smr() will return a positive error number if the caller-supplied buffer isn't big enough. In this case the error must be propagated up, otherwise we may copy out uninitialized bytes. Reported by: syzkaller+KMSAN Reviewed by: mjg, kib MFC aftr: 3 days Sponsored by: The FreeBSD Foundation Differential Revision: https://reviews.freebsd.org/D30198	2021-05-10 20:22:27 -04:00
Konstantin Belousov	5e7cdf1817	openat(2): add O_EMPTY_PATH It reopens the passed file descriptor, checking the file backing vnode' current access rights against open mode. In particular, this flag allows to convert file descriptor opened with O_PATH, into operable file descriptor, assuming permissions allow that. Reviewed by: markj Tested by: Andrew Walker <awalker@ixsystems.com> Sponsored by: The FreeBSD Foundation MFC after: 1 week Differential revision: https://reviews.freebsd.org/D30148	2021-05-11 02:39:24 +03:00
Konstantin Belousov	d474440ab3	Constify vm_pager-related virtual tables. Reviewed by: markj Tested by: pho Sponsored by: The FreeBSD Foundation MFC after: 1 week Differential revision: https://reviews.freebsd.org/D30070	2021-05-07 17:08:03 +03:00
Mark Johnston	9a7c2de364	realloc: Fix KASAN(9) shadow map updates When copying from the old buffer to the new buffer, we don't know the requested size of the old allocation, but only the size of the allocation provided by UMA. This value is "alloc". Because the copy may access bytes in the old allocation's red zone, we must mark the full allocation valid in the shadow map. Do so using the correct size. Reported by: kp Tested by: kp Sponsored by: The FreeBSD Foundation	2021-05-05 17:12:51 -04:00
Warner Losh	a512d0ab00	kern: clarify boot time In FreeBSD, the current time is computed from uptime + boottime. Uptime is a continuous, smooth function that's monotonically increasing. To effect changes to the current time, boottime is adjusted. boottime is mutable and shouldn't be cached against future need. Document the current implementation, with the caveat that we may stop stepping boottime on resume in the future and will step uptime instead (noted in the commit message, but not in the code). Sponsored by: Netflix Reviewed by: phk, rpokala Differential Revision: https://reviews.freebsd.org/D30116	2021-05-05 12:32:13 -06:00
Elliott Mitchell	a3c7da3d08	kern/intr: declare interrupt vectors unsigned These should never get values large enough for sign to matter, but one of them becoming negative could cause problems. MFC after: 1 week Reviewed by: markj Differential Revision: https://reviews.freebsd.org/D29327	2021-05-03 13:24:30 -04:00
Mark Johnston	2b2d77e720	VOP_STAT: Provide a default value for va_gen Some filesystems, e.g., pseudofs and the NFSv3 client, do not provide one. Reviewed by: kib Reported by: KMSAN Sponsored by: The FreeBSD Foundation Differential Revision: https://reviews.freebsd.org/D30091	2021-05-03 13:24:30 -04:00
Mark Johnston	cdfcfc607a	smp: Initialize arg->cpus sooner in smp_rendezvous_cpus_retry() Otherwise, if !smp_started is true, then smp_rendezvous_cpus_done() will harmlessly perform an atomic RMW on an uninitialized variable. Reported by: KMSAN MFC after: 1 week Sponsored by: The FreeBSD Foundation	2021-05-03 13:24:30 -04:00
Konstantin Belousov	7cb40543e9	filt_timerexpire: do not iterate over the interval User-supplied data might make this loop too time-consuming. Divide directly, and handle both the possibility that we were woken up earlier, and arithmetic overflows/underflows from the calculation. Reported and tested by: pho (previous version) Reviewed by: markj Sponsored by: The FreeBSD Foundation MFC after: 1 week Differential revision: https://reviews.freebsd.org/D30069	2021-05-03 19:49:54 +03:00
Konstantin Belousov	87a64872cd	Add ptrace(PT_COREDUMP) It writes the core of live stopped process to the file descriptor provided as an argument. Based on the initial version from https://reviews.freebsd.org/D29691, submitted by Michał Górny <mgorny@gentoo.org>. Reviewed by: markj Tested by: pho Sponsored by: The FreeBSD Foundation MFC after: 1 week Differential revision: https://reviews.freebsd.org/D29955	2021-05-03 19:18:26 +03:00
Konstantin Belousov	68d311b666	ptracestop: mark threads suspended there with the new TDB_SSWITCH flag This way threads in ptracestop can be discovered by debugger Reviewed by: markj Tested by: pho Sponsored by: The FreeBSD Foundation MFC after: 1 week Differential revision: https://reviews.freebsd.org/D29955	2021-05-03 19:18:25 +03:00
Konstantin Belousov	9ebf9100ba	ptrace: do not allow for parallel ptrace requests Set a new P2_PTRACEREQ flag around the request Wait for the target . process P2_PTRACEREQ flag to clear before setting ours . Otherwise, we rely on the moment that the process lock is not dropped until the stopped target state is important. This is going to be no longer true after some future change. Reviewed by: markj Tested by: pho Sponsored by: The FreeBSD Foundation MFC after: 1 week Differential revision: https://reviews.freebsd.org/D29955	2021-05-03 19:16:30 +03:00
Konstantin Belousov	54c8baa021	kern_ptrace(): extract code to determine ptrace eligibility into helper Reviewed by: markj Tested by: pho Sponsored by: The FreeBSD Foundation MFC after: 1 week Differential revision: https://reviews.freebsd.org/D29955	2021-05-03 19:13:48 +03:00
Konstantin Belousov	2bd0506c8d	kern_ptrace: change type of proctree_locked to bool Tested by: pho Sponsored by: The FreeBSD Foundation MFC after: 1 week Differential revision: https://reviews.freebsd.org/D29955	2021-05-03 19:13:48 +03:00
Konstantin Belousov	af928fded0	Add thread_run_flash() helper It unsuspends single suspended thread, passed as the argument. It is up to the caller to arrange the target thread to suspend later, since the state of the process is not changed from stopped. In particular, the unsuspended thread must not leave to userspace, since boundary code is not prepared to this situation. Reviewed by: markj Tested by: pho Sponsored by: The FreeBSD Foundation MFC after: 1 week Differential revision: https://reviews.freebsd.org/D29955	2021-05-03 19:13:47 +03:00
Konstantin Belousov	15465a2c25	Add sleepq_remove_nested() The helper removes the thread from a sleep queue, assuming that it would need to sleep. The sleepq_remove_nested() function is intended for quite special case, where suspended thread from traced stopped process is temporary unsuspended to do some work on behalf of the debugger in the target context, and this work might require sleep. Reviewed by: markj Tested by: pho Sponsored by: The FreeBSD Foundation MFC after: 1 week Differential revision: https://reviews.freebsd.org/D29955	2021-05-03 19:13:47 +03:00
Konstantin Belousov	86ffb3d1a0	ELF coredump: define several useful flags for the coredump operations - SVC_ALL request dumping all map entries, including those marked as non-dumpable - SVC_NOCOMPRESS disallows compressing the dump regardless of the sysctl policy - SVC_PC_COREDUMP is provided for future use by userspace core dump request Reviewed by: markj Tested by: pho Sponsored by: The FreeBSD Foundation MFC after: 1 week Differential revision: https://reviews.freebsd.org/D29955	2021-05-03 19:13:47 +03:00
Konstantin Belousov	5bc3c61780	imgact_elf: consistently pass flags from coredump down to helper functions Reviewed by: markj Tested by: pho Sponsored by: The FreeBSD Foundation MFC after: 1 week Differential revision: https://reviews.freebsd.org/D29955	2021-05-03 19:13:47 +03:00
Rick Macklem	4f592683c3	copy_file_range(2): improve copying of a large hole to EOF PR#255523 reported that a file copy for a file with a large hole to EOF on ZFS ran slowly over NFSv4.2. The problem was that vn_generic_copy_file_range() would loop around reading the hole's data and then see it is all 0s. It was coded this way since UFS always allocates a data block near the end of the file, such that a hole to EOF never exists. This patch modifies vn_generic_copy_file_range() to check for a ENXIO returned from VOP_IOCTL(..FIOSEEKDATA..) and handle that case as a hole to EOF. asomers@ confirms that it works for his ZFS test case. PR: 255523 Tested by: asomers Reviewed by: asomers MFC after: 2 weeks Differential Revision: https://reviews.freebsd.org/D30076	2021-05-02 16:04:27 -07:00
Konstantin Belousov	2082565798	O_PATH: disable kqfilter for fifos Filter on fifos is real filter for the object, and not a filesystem events filter like EVFILT_VNODE. Reported by: markj using syzkaller Reviewed by: markj Sponsored by: The FreeBSD Foundation MFC after: 3 days	2021-04-30 17:43:45 +03:00
Mark Johnston	20e3b9d8bd	kasan: Use vm_offset_t for the first parameter to kasan_shadow_map() No functional change intended. Sponsored by: The FreeBSD Foundation	2021-04-29 11:39:02 -04:00
Mateusz Guzik	074abaccfa	cache: remove incomplete lockless lockout support during resize This is already properly handled thanks to 2 step hash replacement.	2021-04-28 19:53:25 +00:00
Mark Johnston	d1e9441583	pipe: Avoid calling selrecord() on a closing pipe pipe_poll() may add the calling thread to the selinfo lists of both ends of a pipe. It is ok to do this for the local end, since we know we hold a reference on the file and so the local end is not closed. It is not ok to do this for the remote end, which may already be closed and have called seldrain(). In this scenario, when the polling thread wakes up, it may end up referencing a freed selinfo. Guard the selrecord() call appropriately. Reviewed by: kib Reported by: syzkaller+KASAN MFC after: 1 week Differential Revision: https://reviews.freebsd.org/D30016	2021-04-28 10:43:29 -04:00
Thomas Munro	3aaaa2efde	poll(2): Add POLLRDHUP. Teach poll(2) to support Linux-style POLLRDHUP events for sockets, if requested. Triggered when the remote peer shuts down writing or closes its end. Reviewed by: kib MFC after: 1 month Differential Revision: https://reviews.freebsd.org/D29757	2021-04-28 23:00:31 +12:00
Mark Johnston	409ab7e109	imgact_elf: Ensure that the return value in parse_notes is initialized parse_notes relies on the caller-supplied callback to initialize "res". Two callbacks are used in practice, brandnote_cb and note_fctl_cb, and the latter fails to initialize res. Fix it. In the worst case, the bug would cause the inner loop of check_note to examine more program headers than necessary, and the note header usually comes last anyway. Reviewed by: kib Reported by: KMSAN MFC after: 1 week Sponsored by: The FreeBSD Foundation Differential Revision: https://reviews.freebsd.org/D29986	2021-04-26 14:53:16 -04:00
Edward Tomasz Napierala	5d1d844a77	kern_linkat: modify to accept AT_ flags instead of FOLLOW/NOFOLLOW This makes this API match other kern_xxxat() functions. Reviewed By: kib Sponsored By: EPSRC Differential Revision: https://reviews.freebsd.org/D29776	2021-04-25 14:13:12 +01:00
Robert Watson	af14713d49	Support run-time configuration of the PIPE_MINDIRECT threshold. PIPE_MINDIRECT determines at what (blocking) write size one-copy optimizations are applied in pipe(2) I/O. That threshold hasn't been tuned since the 1990s when this code was originally committed, and allowing run-time reconfiguration will make it easier to assess whether contemporary microarchitectures would prefer a different threshold. (On our local RPi4 baords, the 8k default would ideally be at least 32k, but it's not clear how generalizable that observation is.) MFC after: 3 weeks Reviewers: jrtc27, arichardson Differential Revision: https://reviews.freebsd.org/D29819	2021-04-24 20:04:28 +01:00
Mark Johnston	8e8f1cc9bb	Re-enable network ioctls in capability mode This reverts a portion of `274579831b` ("capsicum: Limit socket operations in capability mode") as at least rtsol and dhcpcd rely on being able to configure network interfaces while in capability mode. Reported by: bapt, Greg V Sponsored by: The FreeBSD Foundation	2021-04-23 09:22:49 -04:00
Warner Losh	df456a1fcf	newbus: style nit (align comments) Sponsored by: Netflix	2021-04-21 15:37:24 -06:00
Warner Losh	1eebd6158c	newbus: Optimize/Simplify kobj_class_compile_common a little "i" is not used in this loop at all. There's no need to initialize and increment it. Reviewed by: markj@ Sponsored by: Netflix Differential Revision: https://reviews.freebsd.org/D29898	2021-04-21 15:37:24 -06:00
Konstantin Belousov	54f98c4dbf	vn_open_vnode(): handle error when fp == NULL If VOP_ADD_WRITECOUNT() or adv locking failed, so VOP_CLOSE() needs to be called, we cannot use fp fo_close() when there is no fp. This occurs when e.g. kernel code directly calls vn_open() instead of the open(2) syscall. In this case, VOP_CLOSE() can be called directly, after possible lock upgrade. Reported by: nvass@gmx.com PR: 255119 Tested by: pho Sponsored by: The FreeBSD Foundation MFC after: 1 week Differential Revision: https://reviews.freebsd.org/D29830	2021-04-21 18:06:51 +03:00
Konstantin Belousov	ecfbddf0cd	sysctl vm.objects: report backing object and swap use For anonymous objects, provide a handle kvo_me naming the object, and report the handle of the backing object. This allows userspace to deconstruct the shadow chain. Right now the handle is the address of the object in KVA, but this is not guaranteed. For the same anonymous objects, report the swap space used for actually swapped out pages, in kvo_swapped field. I do not believe that it is useful to report full 64bit counter there, so only uint32_t value is returned, clamped to the max. For kinfo_vmentry, report anonymous object handle backing the entry, so that the shadow chain for the specific mapping can be deconstructed. Reviewed by: markj Sponsored by: The FreeBSD Foundation MFC after: 1 week Differential revision: https://reviews.freebsd.org/D29771	2021-04-19 21:32:01 +03:00
Konstantin Belousov	4342ba184c	sysctl_handle_string: do not malloc when SYSCTL_IN cannot fault In particular, this avoids malloc(9) calls when from early tunable handling, with no working malloc yet. Reported and tested by: mav Sponsored by: The FreeBSD Foundation MFC after: 1 week	2021-04-19 21:32:01 +03:00
Konstantin Belousov	578c26f31c	linkat(2): check NIRES_EMPTYPATH on the first fd arg Reported by: arichardson Reviewed by: markj MFC after: 1 week Differential revision: https://reviews.freebsd.org/D29834	2021-04-19 21:32:01 +03:00
Warner Losh	571a1a64b1	Minor style tidy: if( -> if ( Fix a few 'if(' to be 'if (' in a few places, per style(9) and overwhelming usage in the rest of the kernel / tree. MFC After: 3 days Sponsored by: Netflix	2021-04-18 11:19:15 -06:00
Warner Losh	f1f9870668	Minor style cleanup We prefer 'while (0)' to 'while(0)' according to grep and stlye(9)'s space after keyword rule. Remove a few stragglers of the latter. Many of these usages were inconsistent within the file. MFC After: 3 days Sponsored by: Netflix	2021-04-18 11:14:17 -06:00
Konstantin Belousov	bbf7a4e878	O_PATH: allow vnode kevent filter on such files if VREAD access is checked as allowed during open Requested by: wulf Reviewed by: markj Tested by: pho Sponsored by: The FreeBSD Foundation MFC after: 1 week Differential revision: https://reviews.freebsd.org/D29323	2021-04-15 12:49:18 +03:00
Konstantin Belousov	f9b923af34	O_PATH: Allow to open symlink When O_NOFOLLOW is specified, namei() returns the symlink itself. In this case, open(O_PATH) should be allowed, to denote the location of symlink itself. Prevent O_EXEC in this case, execve(2) code is not ready to try to execute symlinks. Reported by: wulf Reviewed by: markj Tested by: pho Sponsored by: The FreeBSD Foundation MFC after: 1 week Differential revision: https://reviews.freebsd.org/D29323	2021-04-15 12:49:09 +03:00
Konstantin Belousov	a5970a529c	Make files opened with O_PATH to not block non-forced unmount by only keeping hold count on the vnode, instead of the use count. Reviewed by: markj Tested by: pho Sponsored by: The FreeBSD Foundation MFC after: 1 week Differential revision: https://reviews.freebsd.org/D29323	2021-04-15 12:48:27 +03:00
Konstantin Belousov	8d9ed174f3	open(2): Implement O_PATH Reviewed by: markj Tested by: pho Discussed with: walker.aj325_gmail.com, wulf Sponsored by: The FreeBSD Foundation MFC after: 1 week Differential revision: https://reviews.freebsd.org/D29323	2021-04-15 12:48:24 +03:00
Konstantin Belousov	509124b626	Add AT_EMPTY_PATH for several *at(2) syscalls It is currently allowed to fchownat(2), fchmodat(2), fchflagsat(2), utimensat(2), fstatat(2), and linkat(2). For linkat(2), PRIV_VFS_FHOPEN privilege is required to exercise the flag. It allows to link any open file. Requested by: trasz Tested by: pho, trasz Reviewed by: markj Sponsored by: The FreeBSD Foundation MFC after: 2 weeks Differential revision: https://reviews.freebsd.org/D29111	2021-04-15 12:48:11 +03:00
Konstantin Belousov	437c241d0c	vfs_vnops.c: Make vn_statfile() non-static Reviewed by: markj Tested by: pho Sponsored by: The FreeBSD Foundation MFC after: 1 week Differential revision: https://reviews.freebsd.org/D29323	2021-04-15 12:47:56 +03:00
Konstantin Belousov	42be0a7b10	Style. Add missed spaces, wrap long lines. Reviewed by: markj Tested by: pho Sponsored by: The FreeBSD Foundation MFC after: 1 week Differential revision: https://reviews.freebsd.org/D29323	2021-04-15 12:47:46 +03:00
Mateusz Guzik	4f0279e064	cache: extend mismatch vnode assert print to include the name	2021-04-15 07:55:43 +00:00
Mark Johnston	29bb6c19f0	domainset: Define additional global policies Add global definitions for first-touch and interleave policies. The former may be useful for UMA, which implements a similar policy without using domainset iterators. No functional change intended. Reviewed by: mav MFC after: 2 weeks Sponsored by: The FreeBSD Foundation Differential Revision: https://reviews.freebsd.org/D29104	2021-04-14 13:03:33 -04:00
Konstantin Belousov	75c5cf7a72	filt_timerexpire: avoid process lock recursion Found by: syzkaller Reported and reviewed by: markj Sponsored by: The FreeBSD Foundation MFC after: 1 week Differential revision: https://reviews.freebsd.org/D29746	2021-04-14 10:53:28 +03:00
Konstantin Belousov	5cc1d19941	realtimer_expire: avoid proc lock recursion when called from itimer_proc_continue() It is fine to drop the process lock there, process cannot exit until its timers are cleared. Found by: syzkaller Reported and reviewed by: markj Sponsored by: The FreeBSD Foundation MFC after: 1 week Differential revision: https://reviews.freebsd.org/D29746	2021-04-14 10:53:19 +03:00
Konstantin Belousov	116f26f947	sbuf_uionew(): sbuf_new() takes int as length and length should be not less than SBUF_MINSIZE Reported and tested by: pho Noted and reviewed by: markj Sponsored by: The FreeBSD Foundation MFC after: 1 week Differential revision: https://reviews.freebsd.org/D29752	2021-04-14 10:23:20 +03:00
Mark Johnston	06a53ecf24	malloc: Add state transitions for KASAN - Reuse some REDZONE bits to keep track of the requested and allocated sizes, and use that to provide red zones. - As in UMA, disable memory trashing to avoid unnecessary CPU overhead. MFC after: 2 weeks Sponsored by: The FreeBSD Foundation Differential Revision: https://reviews.freebsd.org/D29461	2021-04-13 17:42:21 -04:00
Mark Johnston	f1c3adefd9	execve: Mark exec argument buffers We cache mapped execve argument buffers to avoid the overhead of TLB shootdowns. Mark them invalid when they are freed to the cache. MFC after: 2 weeks Sponsored by: The FreeBSD Foundation Differential Revision: https://reviews.freebsd.org/D29460	2021-04-13 17:42:21 -04:00
Mark Johnston	b261bb4057	vfs: Add KASAN state transitions for vnodes vnodes are a bit special in that they may exist on per-CPU lists even while free. Add a KASAN-only destructor that poisons regions of each vnode that are not expected to be accessed after a free. MFC after: 2 weeks Sponsored by: The FreeBSD Foundation Differential Revision: https://reviews.freebsd.org/D29459	2021-04-13 17:42:21 -04:00
Mark Johnston	6faf45b34b	amd64: Implement a KASAN shadow map The idea behind KASAN is to use a region of memory to track the validity of buffers in the kernel map. This region is the shadow map. The compiler inserts calls to the KASAN runtime for every emitted load and store, and the runtime uses the shadow map to decide whether the access is valid. Various kernel allocators call kasan_mark() to update the shadow map. Since the shadow map tracks only accesses to the kernel map, accesses to other kernel maps are not validated by KASAN. UMA_MD_SMALL_ALLOC is disabled when KASAN is configured to reduce usage of the direct map. Currently we have no mechanism to completely eliminate uses of the direct map, so KASAN's coverage is not comprehensive. The shadow map uses one byte per eight bytes in the kernel map. In pmap_bootstrap() we create an initial set of page tables for the kernel and preloaded data. When pmap_growkernel() is called, we call kasan_shadow_map() to extend the shadow map. kasan_shadow_map() uses pmap_kasan_enter() to allocate memory for the shadow region and map it. Reviewed by: kib MFC after: 2 weeks Sponsored by: The FreeBSD Foundation Differential Revision: https://reviews.freebsd.org/D29417	2021-04-13 17:42:20 -04:00
Mark Johnston	38da497a4d	Add the KASAN runtime KASAN enables the use of LLVM's AddressSanitizer in the kernel. This feature makes use of compiler instrumentation to validate memory accesses in the kernel and detect several types of bugs, including use-after-frees and out-of-bounds accesses. It is particularly effective when combined with test suites or syzkaller. KASAN has high CPU and memory usage overhead and so is not suited for production environments. The runtime and pmap maintain a shadow of the kernel map to store information about the validity of memory mapped at a given kernel address. The runtime implements a number of functions defined by the compiler ABI. These are prefixed by __asan. The compiler emits calls to __asan_load() and __asan_store() around memory accesses, and the runtime consults the shadow map to determine whether a given access is valid. kasan_mark() is called by various kernel allocators to update state in the shadow map. Updates to those allocators will come in subsequent commits. The runtime also defines various interceptors. Some low-level routines are implemented in assembly and are thus not amenable to compiler instrumentation. To handle this, the runtime implements these routines on behalf of the rest of the kernel. The sanitizer implementation validates memory accesses manually before handing off to the real implementation. The sanitizer in a KASAN-configured kernel can be disabled by setting the loader tunable debug.kasan.disable=1. Obtained from: NetBSD MFC after: 2 weeks Sponsored by: The FreeBSD Foundation Differential Revision: https://reviews.freebsd.org/D29416	2021-04-13 17:42:20 -04:00
Mitchell Horne	2816bd8442	rmlock(9): add an RM_DUPOK flag Allows for duplicate locks to be acquired without witness complaining. Similar flags exists already for rwlock(9) and sx(9). Reviewed by: markj MFC after: 3 days Sponsored by: NetApp, Inc. Sponsored by: Klara, Inc. NetApp PR: 52 Differential Revision: https://reviews.freebsd.org/D29683n	2021-04-12 11:42:21 -03:00
Mark Johnston	dfff37765c	Rename struct device to struct _device types.h defines device_t as a typedef of struct device *. struct device is defined in subr_bus.c and almost all of the kernel uses device_t. The LinuxKPI also defines a struct device, so type confusion can occur. This causes bugs and ambiguity for debugging tools. Rename the FreeBSD struct device to struct _device. Reviewed by: gbe (man pages) Reviewed by: rpokala, imp, jhb MFC after: 1 week Sponsored by: The FreeBSD Foundation Differential Revision: https://reviews.freebsd.org/D29676	2021-04-12 09:32:30 -04:00
Andrew Turner	5d2d599d3f	Create VM_MEMATTR_DEVICE on all architectures This is intended to be used with memory mapped IO, e.g. from bus_space_map with no flags, or pmap_mapdev. Use this new memory type in the map request configured by resource_init_map_request, and in pciconf. Reviewed by: kib Differential Revision: https://reviews.freebsd.org/D29692	2021-04-12 06:15:31 +00:00
Konstantin Belousov	a091c35323	ptrace: restructure comments around reparenting on PT_DETACH style code, and use {} for both branches. Sponsored by: The FreeBSD Foundation MFC after: 1 week	2021-04-11 14:44:30 +03:00
Konstantin Belousov	9d7e450b64	ptrace: remove dead call to FIX_SSTEP() It was an alias for procfs_fix_sstep() long time ago. Sponsored by: The FreeBSD Foundation MFC after: 1 week	2021-04-11 14:44:30 +03:00
Piotr Pawel Stefaniak	a212f56d10	Balance parentheses in sysctl descriptions	2021-04-11 10:30:55 +02:00
Konstantin Belousov	2fd1ffefaa	Stop arming kqueue timers on knote owner suspend or terminate This way, even if the process specified very tight reschedule intervals, it should be stoppable/killable. Reported and reviewed by: markj Tested by: markj, pho Sponsored by: The FreeBSD Foundation MFC after: 2 weeks Differential revision: https://reviews.freebsd.org/D29106	2021-04-09 23:43:51 +03:00
Konstantin Belousov	533e5057ed	Add helper for kqueue timers callout scheduling Reviewed by: markj Tested by: markj, pho Sponsored by: The FreeBSD Foundation MFC after: 2 weeks Differential revision: https://reviews.freebsd.org/D29106	2021-04-09 23:42:56 +03:00
Konstantin Belousov	4d27d8d2f3	Stop arming realtime posix process timers on suspend or terminate Reported and reviewed by: markj Tested by: markj, pho Sponsored by: The FreeBSD Foundation MFC after: 2 weeks Differential revision: https://reviews.freebsd.org/D29106	2021-04-09 23:42:51 +03:00
Konstantin Belousov	dc47fdf131	Stop arming periodic process timers on suspend or terminate Reported and reviewed by: markj Tested by: markj, pho Sponsored by: The FreeBSD Foundation MFC after: 2 weeks Differential revision: https://reviews.freebsd.org/D29106	2021-04-09 23:42:44 +03:00
Mateusz Guzik	72b3b5a941	vfs: replace vfs_smr_quiesce with vfs_smr_synchronize This ends up using a smr specific method. Suggested by: markj Tested by: pho	2021-04-08 11:14:45 +00:00
Mark Johnston	0f07c234ca	Remove more remnants of sio(4) Reviewed by: imp MFC after: 1 week Sponsored by: The FreeBSD Foundation Differential Revision: https://reviews.freebsd.org/D29626	2021-04-07 14:33:02 -04:00
Mark Johnston	274579831b	capsicum: Limit socket operations in capability mode Capsicum did not prevent certain privileged networking operations, specifically creation of raw sockets and network configuration ioctls. However, these facilities can be used to circumvent some of the restrictions that capability mode is supposed to enforce. Add capability mode checks to disallow network configuration ioctls and creation of sockets other than PF_LOCAL and SOCK_DGRAM/STREAM/SEQPACKET internet sockets. Reviewed by: oshogbo Discussed with: emaste Reported by: manu Sponsored by: The FreeBSD Foundation MFC after: 2 weeks Differential Revision: https://reviews.freebsd.org/D29423	2021-04-07 14:32:56 -04:00
Mateusz Guzik	13b3862ee8	cache: update an assert on CACHE_FPL_STATUS_ABORTED Since symlink support it can get upgraded to CACHE_FPL_STATUS_DESTROYED. Reported by: bdrewery	2021-04-06 22:31:58 +02:00
Mark Johnston	2425f5e912	mount: Disallow mounting over a jail root Discussed with: jamie Approved by: so Security: CVE-2020-25584 Security: FreeBSD-SA-21:10.jail_mount	2021-04-06 14:49:36 -04:00
Edward Tomasz Napierala	7f6157f7fd	lock_delay(9): improve interaction with restrict_starvation After `e7a5b3bd05`, the la->delay value was adjusted after being set by the starvation_limit code block, which is wrong. Reported By: avg Reviewed By: avg Fixes: `e7a5b3bd05` Sponsored By: NetApp, Inc. Sponsored By: Klara, Inc. Differential Revision: https://reviews.freebsd.org/D29513	2021-04-03 13:08:53 +01:00
Mark Johnston	52a99c72b5	sendfile: Fix error initialization in sendfile_getobj() Reviewed by: chs, kib Reported by: jhb Fixes: `faa998f6ff` MFC after: 1 day Differential Revision: https://reviews.freebsd.org/D29540	2021-04-02 17:42:38 -04:00
Richard Scheffenegger	cad4fd0365	Make sbuf_drain safe for external use While sbuf_drain was an internal function, two KASSERTS checked the sanity of it being called. However, an external caller may be ignorant if there is any data to drain, or if an error has already accumulated. Be nice and return immediately with the accumulated error. MFC after: 2 weeks Reviewed By: tuexen, #transport Sponsored by: NetApp, Inc. Differential Revision: https://reviews.freebsd.org/D29544	2021-04-02 20:12:11 +02:00
Konstantin Belousov	aa3ea612be	x86: remove gcov kernel support Reviewed by: jhb Sponsored by: The FreeBSD Foundation Differential revision: https://reviews.freebsd.org/D29529	2021-04-02 15:41:51 +03:00
Mateusz Guzik	f79bd71def	cache: add high level overview Differential Revision: https://reviews.freebsd.org/D28675	2021-04-02 05:11:05 +02:00
Mateusz Guzik	dc532884d5	cache: fix resizing in face of lockless lookup Reported by: pho Tested by: pho	2021-04-02 05:11:05 +02:00
Lawrence Stewart	1eb402e47a	stats(3): Improve t-digest merging of samples which result in mu adjustment underflow. Allow the calculation of the mu adjustment factor to underflow instead of rejecting the VOI sample from the digest and logging an error. This trades off some (currently unquantified) additional centroid error in exchange for better fidelity of the distribution's density, which is the right trade off at the moment until follow up work to better handle and track accumulated error can be undertaken. Obtained from: Netflix MFC after: immediately	2021-04-02 13:17:53 +11:00
Richard Scheffenegger	c804c8f2c5	Export sbuf_drain to orchestrate lock and drain action While exporting large amounts of data to a sysctl request, datastructures may need to be locked. Exporting the sbuf_drain function allows the coordination between drain events and held locks, to avoid stalls. PR: 254333 Reviewed By: jhb MFC after: 2 weeks Sponsored by: NetApp, Inc. Differential Revision: https://reviews.freebsd.org/D29481	2021-03-31 19:17:37 +02:00
Konstantin Belousov	e243367b64	mbuf: add a way to mark flowid as calculated from the internal headers In some settings offload might calculate hash from decapsulated packet. Reserve a bit in packet header rsstype to indicate that. Add m_adj_decap() that acts similarly to m_adj, but also either clear flowid if it is not marked as inner, or transfer it to the decapsulated header, clearing inner indicator. It depends on the internals of m_adj() that reuses the argument packet header for the result. Use m_adj_decap() for decapsulating vxlan(4) and gif(4) input packets. Reviewed by: ae, hselasky, np Sponsored by: Nvidia Networking / Mellanox Technologies MFC after: 1 week Differential Revision: https://reviews.freebsd.org/D28773	2021-03-31 14:38:26 +03:00
Mark Johnston	3428b6c050	Fix several dev_clone callbacks to avoid out-of-bounds reads Use strncmp() instead of bcmp(), so that we don't have to find the minimum of the string lengths before comparing. Reviewed by: kib Reported by: KASAN MFC after: 3 days Sponsored by: The FreeBSD Foundation Differential Revision: https://reviews.freebsd.org/D29463	2021-03-28 11:08:36 -04:00
Mark Johnston	71c160a8f6	vfs: Add an assertion around name length limits Some filesystems assume that they can copy a name component, with length bounded by NAME_MAX, into a dirent buffer of size MAXNAMLEN. These constants have the same value; add a compile-time assertion to that effect. Reported by: Alexey Kulaev <alex.qart@gmail.com> Reviewed by: kib MFC after: 1 week Sponsored by: The FreeBSD Foundation Differential Revision: https://reviews.freebsd.org/D29431	2021-03-27 13:45:19 -04:00
Mark Johnston	653a437c04	accept_filter: Fix filter parameter handling For filters which implement accf_create, the setsockopt(2) handler caches the filter name in the socket, but it also incorrectly frees the buffer containing the copy, leaving a dangling pointer. Note that no accept filters provided in the base system are susceptible to this, as they don't implement accf_create. Reported by: Alexey Kulaev <alex.qart@gmail.com> Discussed with: emaste Security: kernel use-after-free MFC after: 3 days Sponsored by: The FreeBSD Foundation	2021-03-25 17:55:46 -04:00
Mark Johnston	ec8f1ea8d5	Generalize sanitizer interceptors for memory and string routines Similar to commit `3ead60236f` ("Generalize bus_space(9) and atomic(9) sanitizer interceptors"), use a more generic scheme for interposing sanitizer implementations of routines like memcpy(). No functional change intended. MFC after: 1 month Sponsored by: The FreeBSD Foundation	2021-03-24 19:46:22 -04:00
Mark Johnston	3ead60236f	Generalize bus_space(9) and atomic(9) sanitizer interceptors Make it easy to define interceptors for new sanitizer runtimes, rather than assuming KCSAN. Lay a bit of groundwork for KASAN and KMSAN. When a sanitizer is compiled in, atomic(9) and bus_space(9) definitions in atomic_san.h are used by default instead of the inline implementations in the platform's atomic.h. These definitions are implemented in the sanitizer runtime, which includes machine/{atomic,bus}.h with SAN_RUNTIME defined to pull in the actual implementations. No functional change intended. MFC after: 1 month Sponsored by: The FreeBSD Foundation	2021-03-22 22:21:53 -04:00
Adrian Chadd	25bfa44860	Add device and ifnet logging methods, similar to device_printf / if_printf * device_printf() is effectively a printf * if_printf() is effectively a LOG_INFO This allows subsystems to log device/netif stuff using different log levels, rather than having to invent their own way to prefix unit/netif names. Differential Revision: https://reviews.freebsd.org/D29320 Reviewed by: imp	2021-03-22 00:02:34 +00:00
Alex Richardson	6ceacebdf5	Unbreak MSG_CMSG_CLOEXEC MSG_CMSG_CLOEXEC has not been working since 2015 (SVN r284380) because _finstall expects O_CLOEXEC and not UF_EXCLOSE as the flags argument. This was probably not noticed because we don't have a test for this flag so this commit adds one. I found this problem because one of the libwayland tests was failing. Fixes: `ea31808c3b` ("fd: move out actual fp installation to _finstall") MFC after: 3 days Reviewed By: mjg, kib Differential Revision: https://reviews.freebsd.org/D29328	2021-03-18 20:52:20 +00:00
Mateusz Guzik	e9272225e6	vfs: fix vnlru marker handling for filtered/unfiltered cases The global list has a marker with an invariant that free vnodes are placed somewhere past that. A caller which performs filtering (like ZFS) can move said marker all the way to the end, across free vnodes which don't match. Then a caller which does not perform filtering will fail to find them. This makes vn_alloc_hard sleep for 1 second instead of reclaiming, resulting in significant stalls. Fix the problem by requiring an explicit marker by callers which do filtering. As a temporary measure extend vnlru_free to restart if it fails to reclaim anything. Big thanks go to the reporter for testing several iterations of the patch. Reported by: Yamagi <lists yamagi.org> Tested by: Yamagi <lists yamagi.org> Reviewed by: kib Differential Revision: https://reviews.freebsd.org/D29324	2021-03-18 14:59:03 +00:00
Kyle Evans	f187d6dfbf	base: remove if_wg(4) and associated utilities, manpage After length decisions, we've decided that the if_wg(4) driver and related work is not yet ready to live in the tree. This driver has larger security implications than many, and thus will be held to more scrutiny than other drivers. Please also see the related message sent to the freebsd-hackers@ and freebsd-arch@ lists by Kyle Evans <kevans@FreeBSD.org> on 2021/03/16, with the subject line "Removing WireGuard Support From Base" for additional context.	2021-03-17 09:14:48 -05:00
Mark Johnston	4aa157dd5b	link_elf_obj: Add a case missing from `5e6989ba4f` Fixes: `5e6989ba4f` MFC after: 3 days Sponsored by: The FreeBSD Foundation	2021-03-16 15:01:41 -04:00
Kyle Evans	74ae3f3e33	if_wg: import latest fixup work from the wireguard-freebsd project This is the culmination of about a week of work from three developers to fix a number of functional and security issues. This patch consists of work done by the following folks: - Jason A. Donenfeld <Jason@zx2c4.com> - Matt Dunwoodie <ncon@noconroy.net> - Kyle Evans <kevans@FreeBSD.org> Notable changes include: - Packets are now correctly staged for processing once the handshake has completed, resulting in less packet loss in the interim. - Various race conditions have been resolved, particularly w.r.t. socket and packet lifetime (panics) - Various tests have been added to assure correct functionality and tooling conformance - Many security issues have been addressed - if_wg now maintains jail-friendly semantics: sockets are created in the interface's home vnet so that it can act as the sole network connection for a jail - if_wg no longer fails to remove peer allowed-ips of 0.0.0.0/0 - if_wg now exports via ioctl a format that is future proof and complete. It is additionally supported by the upstream wireguard-tools (which we plan to merge in to base soon) - if_wg now conforms to the WireGuard protocol and is more closely aligned with security auditing guidelines Note that the driver has been rebased away from using iflib. iflib poses a number of challenges for a cloned device trying to operate in a vnet that are non-trivial to solve and adds complexity to the implementation for little gain. The crypto implementation that was previously added to the tree was a super complex integration of what previously appeared in an old out of tree Linux module, which has been reduced to crypto.c containing simple boring reference implementations. This is part of a near-to-mid term goal to work with FreeBSD kernel crypto folks and take advantage of or improve accelerated crypto already offered elsewhere. There's additional test suite effort underway out-of-tree taking advantage of the aforementioned jail-friendly semantics to test a number of real-world topologies, based on netns.sh. Also note that this is still a work in progress; work going further will be much smaller in nature. MFC after: 1 month (maybe)	2021-03-14 23:52:04 -05:00
Jonathan T. Looney	dbec10e088	Fetch the sigfastblock value in syscalls that wait for signals We have seen several cases of processes which have become "stuck" in kern_sigsuspend(). When this occurs, the kernel's td_sigblock_val is set to 0x10 (one block outstanding) and the userspace copy of the word is set to 0 (unblocked). Because the kernel's cached value shows that signals are blocked, kern_sigsuspend() blocks almost all signals, which means the process hangs indefinitely in sigsuspend(). It is not entirely clear what is causing this condition to occur. However, it seems to make sense to add some protection against this case by fetching the latest sigfastblock value from userspace for syscalls which will sleep waiting for signals. Here, the change is applied to kern_sigsuspend() and kern_sigtimedwait(). Reviewed by: kib Sponsored by: Netflix Differential Revision: https://reviews.freebsd.org/D29225	2021-03-12 18:14:17 +00:00
John Baldwin	640d54045b	Set TDP_KTHREAD before calling cpu_fork() and cpu_copy_thread(). This permits these routines to use special logic for initializing MD kthread state. For the kproc case, this required moving the logic to set these flags from kproc_create() into do_fork(). Reviewed by: kib MFC after: 1 week Sponsored by: Netflix Differential Revision: https://reviews.freebsd.org/D29207	2021-03-12 09:48:20 -08:00
Konstantin Belousov	44691b33cc	vlrureclaim: only skip vnode with resident pages if it own the pages Nullfs vnode which shares vm_object and pages with the lower vnode should not be exempt from the reclaim just because lower vnode cached a lot. Their reclamation is actually very cheap and should be preferred over real fs vnodes, but this change is already useful. Reported and tested by: pho Reviewed by: mckusick Sponsored by: The FreeBSD Foundation MFC after: 2 weeks Differential revision: https://reviews.freebsd.org/D29178	2021-03-12 13:31:08 +02:00
Warner Losh	e52368365d	config_intrhook: provide config_intrhook_drain config_intrhook_drain will remove the hook from the list as config_intrhook_disestablish does if the hook hasn't been called. If it has, config_intrhook_drain will wait for the hook to be disestablished in the normal course (or expedited, it's up to the driver to decide how and when to call config_intrhook_disestablish). This is intended for removable devices that use config_intrhook and might be attached early in boot, but that may be removed before the kernel can call the config_intrhook or before it ends. To prevent all races, the detach routine will need to call config_intrhook_train. Sponsored by: Netflix, Inc Reviewed by: jhb, mav, gde (in D29006 for man page) Differential Revision: https://reviews.freebsd.org/D29005	2021-03-11 09:45:10 -07:00
Hans Petter Selasky	6eb60f5b7f	Use the word "LinuxKPI" instead of "Linux compatibility", to not confuse with user-space Linux compatibility support. No functional change. MFC after: 1 week Sponsored by: Mellanox Technologies // NVIDIA Networking	2021-03-10 12:35:16 +01:00
Kyle Evans	1ae20f7c70	kern: malloc: fix panic on M_WAITOK during THREAD_NO_SLEEPING() Simple condition flip; we wanted to panic here after epoch_trace_list(). Reviewed by: glebius, markj MFC after: 3 days Differential Revision: https://reviews.freebsd.org/D29125	2021-03-09 05:16:39 -06:00
Warner Losh	88a5591203	config_intrhook: Move from TAILQ to STAILQ and padding config_intrhook doesn't need to be a two-pointer TAILQ. We rarely add/delete from this and so those need not be optimized. Instaed, use the one-pointer STAILQ plus a uintptr_t to be used as a flags word. This will allow these changes to be MFC'd to 12 and 13 to fix a race in removable devices. Feedback from: jhb Reviewed by: mav Differential Revision: https://reviews.freebsd.org/D29004	2021-03-08 15:59:00 -07:00
Mark Johnston	435c7cfb24	Rename _cscan_atomic.h and _cscan_bus.h to atomic_san.h and bus_san.h Other kernel sanitizers (KMSAN, KASAN) require interceptors as well, so put these in a more generic place as a step towards importing the other sanitizers. No functional change intended. MFC after: 1 week Sponsored by: The FreeBSD Foundation Differential Revision: https://reviews.freebsd.org/D29103	2021-03-08 12:39:06 -05:00
Mark Johnston	7995dae9d3	posix timers: Improve the overrun calculation timer_settime(2) may be used to configure a timeout in the past. If the timer is also periodic, we also try to compute the number of timer overruns that occurred between the initial timeout and the time at which the timer fired. This is done in a loop which iterates once per period between the initial timeout and now. If the period is small and the initial timeout was a long time ago, this loop can take forever to run, so the system is effectively DOSed. Replace the loop with a more direct calculation of (now - initial timeout) / period to compute the number of overruns. Reported by: syzkaller Reviewed by: kib MFC after: 1 week Sponsored by: The FreeBSD Foundation Differential Revision: https://reviews.freebsd.org/D29093	2021-03-08 12:39:06 -05:00
Mark Johnston	60d12ef952	posix timers: Sprinkle some style fixes MFC after: 1 week Sponsored by: The FreeBSD Foundation	2021-03-08 12:39:06 -05:00
Mark Johnston	8ff2b41c05	posix timers: Declare unexported functions as static MFC after: 1 week Sponsored by: The FreeBSD Foundation	2021-03-08 12:39:06 -05:00
Konstantin Belousov	56b9bee63a	Make kern.timecounter.hardware tunable Noted and reviewed by: kevans MFC after: 1 week Sponsored by: The FreeBSD Foundation Differential revision: https://reviews.freebsd.org/D29122	2021-03-08 03:48:21 +02:00
Hans Petter Selasky	c743a6bd4f	Implement mallocarray_domainset(9) variant of mallocarray(9). Reviewed by: kib @ MFC after: 1 week Sponsored by: Mellanox Technologies // NVIDIA Networking	2021-03-06 11:38:55 +01:00
Konstantin Belousov	ead7697f04	Restore AT_RESOLVE_BENEATH support for funlinkat(2)/unlinkat(2). MFC after: 1 week Sponsored by: The FreeBSD Foundation	2021-03-06 07:24:18 +02:00
Mark Johnston	89b650872b	ktls: Hide initialization message behind bootverbose We don't typically print anything when a subsystem initializes itself, and KTLS is currently disabled by default anyway. Reviewed by: jhb MFC after: 1 week Sponsored by: The FreeBSD Foundation Differential Revision: https://reviews.freebsd.org/D29097	2021-03-05 13:11:02 -05:00
Mark Johnston	5e6989ba4f	link_elf_obj: Handle init_array sections in KLDs Reuse existing handling for .ctors, print a warning if multiple constructor sections are present. Destructors are not handled as of yet. This is required for KASAN. Reviewed by: kib MFC after: 2 weeks Sponsored by: The FreeBSD Foundation Differential Revision: https://reviews.freebsd.org/D29049	2021-03-04 10:07:10 -05:00
Mark Johnston	49f6925ca3	ktls: Cache output buffers for software encryption Maintain a cache of physically contiguous runs of pages for use as output buffers when software encryption is configured and in-place encryption is not possible. This makes allocation and free cheaper since in the common case we avoid touching the vm_page structures for the buffer, and fewer calls into UMA are needed. gallatin@ reports a ~10% absolute decrease in CPU usage with sendfile/KTLS on a Xeon after this change. It is possible that we will not be able to allocate these buffers if physical memory is fragmented. To avoid frequently calling into the physical memory allocator in this scenario, rate-limit allocation attempts after a failure. In the failure case we fall back to the old behaviour of allocating a page at a time. N.B.: this scheme could be simplified, either by simply using malloc() and looking up the PAs of the pages backing the buffer, or by falling back to page by page allocation and creating a mapping in the cache zone. This requires some way to save a mapping of an M_EXTPG page array in the mbuf, though. m_data is not really appropriate. The second approach may be possible by saving the mapping in the plinks union of the first vm_page structure of the array, but this would force a vm_page access when freeing an mbuf. Reviewed by: gallatin, jhb Tested by: gallatin Sponsored by: Ampere Computing Submitted by: Klara, Inc. Differential Revision: https://reviews.freebsd.org/D28556	2021-03-03 17:34:01 -05:00
Alan Somers	ab63da3564	Speed up geom_stats_resync in the presence of many devices The old code had a O(n) loop, where n is the size of /dev/devstat. Multiply that by another O(n) loop in devstat_mmap for a total of O(n^2). This change adds DIOCGMEDIASIZE support to /dev/devstat so userland can quickly determine the right amount of memory to map, eliminating the O(n) loop in userland. This change decreases the time to run "gstat -bI0.001" with 16,384 md devices from 29.7s to 4.2s. Also, fix a memory leak first reported as PR 203097. Sponsored by: Axcient Reviewed by: mav, imp MFC after: 2 weeks Differential Revision: https://reviews.freebsd.org/D28968	2021-03-02 18:33:45 -07:00
Robert Wing	0cc746f193	filt_fsevent: only record interested events Respect filter-specific flags for the EVFILT_FS filter. When a kevent is registered with the EVFILT_FS filter, it is always triggered when an EVFILT_FS event occurs, regardless of the filter-specific flags used. Fix that. Reviewed by: kib Differential Revision: https://reviews.freebsd.org/D28974	2021-03-02 14:19:22 -09:00
Konstantin Belousov	28cd3a673e	O_RELATIVE_BENEATH: return ENOTCAPABLE instead of EINVAL for abs path Requested and reviewed by: markj Tested by: arichardson, pho Sponsored by: The FreeBSD Foundation MFC after: 1 week Differential Revision: https://reviews.freebsd.org/D28907	2021-03-02 20:21:40 +02:00
Konstantin Belousov	49c98a4bf3	nameicap_check_dotdot: trim tracker on check Tracker should contain exactly the path from the starting directory to the current lookup point. Otherwise we might not detect some cases of dotdot escape. Consequently, if we are walking up the tree by dotdot lookup, we must remove an entries below the walked directory. Reviewed by: markj Tested by: arichardson, pho Sponsored by: The FreeBSD Foundation MFC after: 1 week Differential revision: https://reviews.freebsd.org/D28907	2021-03-02 20:21:35 +02:00
Konstantin Belousov	e8a2862aa0	Add nameicap_cleanup_from(), to clean tracker list starting from some element Reviewed by: markj Tested by: arichardson, pho Sponsored by: The FreeBSD Foundation MFC after: 1 week Differential revision: https://reviews.freebsd.org/D28907	2021-03-02 20:21:30 +02:00
Konstantin Belousov	2388ad7c29	nameicap_tracker_add: avoid duplicates in the tracker list Reviewed by: markj Tested by: arichardson, pho Sponsored by: The FreeBSD Foundation MFC after: 1 week Differential revision: https://reviews.freebsd.org/D28907	2021-03-02 20:21:23 +02:00
Konstantin Belousov	59e7494281	Do not call nameicap_tracker_add() for dotdot case. Reviewed by: markj Tested by: arichardson, pho Sponsored by: The FreeBSD Foundation MFC after: 1 week Differential revision: https://reviews.freebsd.org/D28907	2021-03-02 20:21:14 +02:00
Konstantin Belousov	20e91ca36a	open(2): Remove O_BENEATH and AT_BENEATH with the reasoning that the flags did not worked properly, and were not shipped in a release. O_RESOLVE_BENEATH is kept as useful. Reviewed by: markj Tested by: arichardson, pho Sponsored by: The FreeBSD Foundation MFC after: 1 week Differential revision: https://reviews.freebsd.org/D28907	2021-03-02 20:16:55 +02:00
Kyle Evans	60c4ec806d	jail: allow root to implicitly widen its cpuset to attach The default behavior for attaching processes to jails is that the jail's cpuset augments the attaching processes, so that it cannot be used to escalate a user's ability to take advantage of more CPUs than the administrator wanted them to. This is problematic when root needs to manage jails that have disjoint sets with whatever process is attaching, as this would otherwise result in a deadlock. Therefore, if we did not have an appropriate common subset of cpus/domains for our new policy, we now allow the process to simply take on the jail set if it has the privilege to widen its mask anyways. With the new logic, root can still usefully cpuset a process that attaches to a jail with the desire of maintaining the set it was given pre-attachment while still retaining the ability to manage child jails without jumping through hoops. A test has been added to demonstrate the issue; cpuset of a process down to just the first CPU and attempting to attach to a jail without access to any of the same CPUs previously resulted in EDEADLK and now results in taking on the jail's mask for privileged users. PR: 253724 Reviewed by: jamie (also discussed with) MFC after: 3 days Differential Revision: https://reviews.freebsd.org/D28952	2021-03-01 12:38:31 -06:00
Rick Macklem	a5f9fe2bab	copy_file_range(2): Fix for small values of input file offset and len r366302 broke copy_file_range(2) for small values of input file offset and len. It was possible for rem to be greater than len and then "len - rem" was a large value, since both variables are unsigned. Reported by: koobs, Pablo <pablogsal gmail com> (Python) Reviewed by: asomers, koobs MFC after: 3 days Differential Revision: https://reviews.freebsd.org/D28981	2021-03-01 06:31:10 -08:00
Konstantin Belousov	b5449c92b4	Use atomic_interrupt_fence() instead of bare __compiler_membar() for the which which definitely use membar to sync with interrupt handlers. libc and rtld uses of __compiler_membar() seems to want compiler barriers proper. The barrier in sched_unpin_lite() after td_pinned decrement seems to be not needed and removed, instead of convertion. Reviewed by: markj MFC after: 1 week Sponsored by: The FreeBSD Foundation Differential Revision: https://reviews.freebsd.org/D28956	2021-02-28 01:27:29 +02:00
Mateusz Guzik	1239a72221	cache: temporarily drop the assert that dvp != vp when adding an entry Historically it was allowed for any names, but arguably should never be even attempted. Allow it again since there is a release pending and allowing it is bug-compatible with previous behavior. Reported by: otis	2021-02-27 22:29:50 +00:00
Jamie Gritton	589e4c1df4	jail: Add safety around prison_deref() flags. do_jail_attach() now only uses the PD_XXX flags that refer to lock status, so make sure that something else like PD_KILL doesn't slip through. Add a KASSERT() in prison_deref() to catch any further PD_KILL misuse.	2021-02-25 20:10:42 -08:00
Jamie Gritton	108a9384e9	jail: Fix locking on an early jail_set error. I had locked allprison_lock without immediately setting PD_LIST_LOCKED.	2021-02-25 19:52:58 -08:00

... 3 4 5 6 7 ...

18640 Commits