freebsd-dev

Author	SHA1	Message	Date
Mateusz Guzik	074abaccfa	cache: remove incomplete lockless lockout support during resize This is already properly handled thanks to 2 step hash replacement.	2021-04-28 19:53:25 +00:00
Mark Johnston	d1e9441583	pipe: Avoid calling selrecord() on a closing pipe pipe_poll() may add the calling thread to the selinfo lists of both ends of a pipe. It is ok to do this for the local end, since we know we hold a reference on the file and so the local end is not closed. It is not ok to do this for the remote end, which may already be closed and have called seldrain(). In this scenario, when the polling thread wakes up, it may end up referencing a freed selinfo. Guard the selrecord() call appropriately. Reviewed by: kib Reported by: syzkaller+KASAN MFC after: 1 week Differential Revision: https://reviews.freebsd.org/D30016	2021-04-28 10:43:29 -04:00
Thomas Munro	3aaaa2efde	poll(2): Add POLLRDHUP. Teach poll(2) to support Linux-style POLLRDHUP events for sockets, if requested. Triggered when the remote peer shuts down writing or closes its end. Reviewed by: kib MFC after: 1 month Differential Revision: https://reviews.freebsd.org/D29757	2021-04-28 23:00:31 +12:00
Mark Johnston	409ab7e109	imgact_elf: Ensure that the return value in parse_notes is initialized parse_notes relies on the caller-supplied callback to initialize "res". Two callbacks are used in practice, brandnote_cb and note_fctl_cb, and the latter fails to initialize res. Fix it. In the worst case, the bug would cause the inner loop of check_note to examine more program headers than necessary, and the note header usually comes last anyway. Reviewed by: kib Reported by: KMSAN MFC after: 1 week Sponsored by: The FreeBSD Foundation Differential Revision: https://reviews.freebsd.org/D29986	2021-04-26 14:53:16 -04:00
Edward Tomasz Napierala	5d1d844a77	kern_linkat: modify to accept AT_ flags instead of FOLLOW/NOFOLLOW This makes this API match other kern_xxxat() functions. Reviewed By: kib Sponsored By: EPSRC Differential Revision: https://reviews.freebsd.org/D29776	2021-04-25 14:13:12 +01:00
Robert Watson	af14713d49	Support run-time configuration of the PIPE_MINDIRECT threshold. PIPE_MINDIRECT determines at what (blocking) write size one-copy optimizations are applied in pipe(2) I/O. That threshold hasn't been tuned since the 1990s when this code was originally committed, and allowing run-time reconfiguration will make it easier to assess whether contemporary microarchitectures would prefer a different threshold. (On our local RPi4 baords, the 8k default would ideally be at least 32k, but it's not clear how generalizable that observation is.) MFC after: 3 weeks Reviewers: jrtc27, arichardson Differential Revision: https://reviews.freebsd.org/D29819	2021-04-24 20:04:28 +01:00
Mark Johnston	8e8f1cc9bb	Re-enable network ioctls in capability mode This reverts a portion of `274579831b` ("capsicum: Limit socket operations in capability mode") as at least rtsol and dhcpcd rely on being able to configure network interfaces while in capability mode. Reported by: bapt, Greg V Sponsored by: The FreeBSD Foundation	2021-04-23 09:22:49 -04:00
Warner Losh	df456a1fcf	newbus: style nit (align comments) Sponsored by: Netflix	2021-04-21 15:37:24 -06:00
Warner Losh	1eebd6158c	newbus: Optimize/Simplify kobj_class_compile_common a little "i" is not used in this loop at all. There's no need to initialize and increment it. Reviewed by: markj@ Sponsored by: Netflix Differential Revision: https://reviews.freebsd.org/D29898	2021-04-21 15:37:24 -06:00
Konstantin Belousov	54f98c4dbf	vn_open_vnode(): handle error when fp == NULL If VOP_ADD_WRITECOUNT() or adv locking failed, so VOP_CLOSE() needs to be called, we cannot use fp fo_close() when there is no fp. This occurs when e.g. kernel code directly calls vn_open() instead of the open(2) syscall. In this case, VOP_CLOSE() can be called directly, after possible lock upgrade. Reported by: nvass@gmx.com PR: 255119 Tested by: pho Sponsored by: The FreeBSD Foundation MFC after: 1 week Differential Revision: https://reviews.freebsd.org/D29830	2021-04-21 18:06:51 +03:00
Konstantin Belousov	ecfbddf0cd	sysctl vm.objects: report backing object and swap use For anonymous objects, provide a handle kvo_me naming the object, and report the handle of the backing object. This allows userspace to deconstruct the shadow chain. Right now the handle is the address of the object in KVA, but this is not guaranteed. For the same anonymous objects, report the swap space used for actually swapped out pages, in kvo_swapped field. I do not believe that it is useful to report full 64bit counter there, so only uint32_t value is returned, clamped to the max. For kinfo_vmentry, report anonymous object handle backing the entry, so that the shadow chain for the specific mapping can be deconstructed. Reviewed by: markj Sponsored by: The FreeBSD Foundation MFC after: 1 week Differential revision: https://reviews.freebsd.org/D29771	2021-04-19 21:32:01 +03:00
Konstantin Belousov	4342ba184c	sysctl_handle_string: do not malloc when SYSCTL_IN cannot fault In particular, this avoids malloc(9) calls when from early tunable handling, with no working malloc yet. Reported and tested by: mav Sponsored by: The FreeBSD Foundation MFC after: 1 week	2021-04-19 21:32:01 +03:00
Konstantin Belousov	578c26f31c	linkat(2): check NIRES_EMPTYPATH on the first fd arg Reported by: arichardson Reviewed by: markj MFC after: 1 week Differential revision: https://reviews.freebsd.org/D29834	2021-04-19 21:32:01 +03:00
Warner Losh	571a1a64b1	Minor style tidy: if( -> if ( Fix a few 'if(' to be 'if (' in a few places, per style(9) and overwhelming usage in the rest of the kernel / tree. MFC After: 3 days Sponsored by: Netflix	2021-04-18 11:19:15 -06:00
Warner Losh	f1f9870668	Minor style cleanup We prefer 'while (0)' to 'while(0)' according to grep and stlye(9)'s space after keyword rule. Remove a few stragglers of the latter. Many of these usages were inconsistent within the file. MFC After: 3 days Sponsored by: Netflix	2021-04-18 11:14:17 -06:00
Konstantin Belousov	bbf7a4e878	O_PATH: allow vnode kevent filter on such files if VREAD access is checked as allowed during open Requested by: wulf Reviewed by: markj Tested by: pho Sponsored by: The FreeBSD Foundation MFC after: 1 week Differential revision: https://reviews.freebsd.org/D29323	2021-04-15 12:49:18 +03:00
Konstantin Belousov	f9b923af34	O_PATH: Allow to open symlink When O_NOFOLLOW is specified, namei() returns the symlink itself. In this case, open(O_PATH) should be allowed, to denote the location of symlink itself. Prevent O_EXEC in this case, execve(2) code is not ready to try to execute symlinks. Reported by: wulf Reviewed by: markj Tested by: pho Sponsored by: The FreeBSD Foundation MFC after: 1 week Differential revision: https://reviews.freebsd.org/D29323	2021-04-15 12:49:09 +03:00
Konstantin Belousov	a5970a529c	Make files opened with O_PATH to not block non-forced unmount by only keeping hold count on the vnode, instead of the use count. Reviewed by: markj Tested by: pho Sponsored by: The FreeBSD Foundation MFC after: 1 week Differential revision: https://reviews.freebsd.org/D29323	2021-04-15 12:48:27 +03:00
Konstantin Belousov	8d9ed174f3	open(2): Implement O_PATH Reviewed by: markj Tested by: pho Discussed with: walker.aj325_gmail.com, wulf Sponsored by: The FreeBSD Foundation MFC after: 1 week Differential revision: https://reviews.freebsd.org/D29323	2021-04-15 12:48:24 +03:00
Konstantin Belousov	509124b626	Add AT_EMPTY_PATH for several *at(2) syscalls It is currently allowed to fchownat(2), fchmodat(2), fchflagsat(2), utimensat(2), fstatat(2), and linkat(2). For linkat(2), PRIV_VFS_FHOPEN privilege is required to exercise the flag. It allows to link any open file. Requested by: trasz Tested by: pho, trasz Reviewed by: markj Sponsored by: The FreeBSD Foundation MFC after: 2 weeks Differential revision: https://reviews.freebsd.org/D29111	2021-04-15 12:48:11 +03:00
Konstantin Belousov	437c241d0c	vfs_vnops.c: Make vn_statfile() non-static Reviewed by: markj Tested by: pho Sponsored by: The FreeBSD Foundation MFC after: 1 week Differential revision: https://reviews.freebsd.org/D29323	2021-04-15 12:47:56 +03:00
Konstantin Belousov	42be0a7b10	Style. Add missed spaces, wrap long lines. Reviewed by: markj Tested by: pho Sponsored by: The FreeBSD Foundation MFC after: 1 week Differential revision: https://reviews.freebsd.org/D29323	2021-04-15 12:47:46 +03:00
Mateusz Guzik	4f0279e064	cache: extend mismatch vnode assert print to include the name	2021-04-15 07:55:43 +00:00
Mark Johnston	29bb6c19f0	domainset: Define additional global policies Add global definitions for first-touch and interleave policies. The former may be useful for UMA, which implements a similar policy without using domainset iterators. No functional change intended. Reviewed by: mav MFC after: 2 weeks Sponsored by: The FreeBSD Foundation Differential Revision: https://reviews.freebsd.org/D29104	2021-04-14 13:03:33 -04:00
Konstantin Belousov	75c5cf7a72	filt_timerexpire: avoid process lock recursion Found by: syzkaller Reported and reviewed by: markj Sponsored by: The FreeBSD Foundation MFC after: 1 week Differential revision: https://reviews.freebsd.org/D29746	2021-04-14 10:53:28 +03:00
Konstantin Belousov	5cc1d19941	realtimer_expire: avoid proc lock recursion when called from itimer_proc_continue() It is fine to drop the process lock there, process cannot exit until its timers are cleared. Found by: syzkaller Reported and reviewed by: markj Sponsored by: The FreeBSD Foundation MFC after: 1 week Differential revision: https://reviews.freebsd.org/D29746	2021-04-14 10:53:19 +03:00
Konstantin Belousov	116f26f947	sbuf_uionew(): sbuf_new() takes int as length and length should be not less than SBUF_MINSIZE Reported and tested by: pho Noted and reviewed by: markj Sponsored by: The FreeBSD Foundation MFC after: 1 week Differential revision: https://reviews.freebsd.org/D29752	2021-04-14 10:23:20 +03:00
Mark Johnston	06a53ecf24	malloc: Add state transitions for KASAN - Reuse some REDZONE bits to keep track of the requested and allocated sizes, and use that to provide red zones. - As in UMA, disable memory trashing to avoid unnecessary CPU overhead. MFC after: 2 weeks Sponsored by: The FreeBSD Foundation Differential Revision: https://reviews.freebsd.org/D29461	2021-04-13 17:42:21 -04:00
Mark Johnston	f1c3adefd9	execve: Mark exec argument buffers We cache mapped execve argument buffers to avoid the overhead of TLB shootdowns. Mark them invalid when they are freed to the cache. MFC after: 2 weeks Sponsored by: The FreeBSD Foundation Differential Revision: https://reviews.freebsd.org/D29460	2021-04-13 17:42:21 -04:00
Mark Johnston	b261bb4057	vfs: Add KASAN state transitions for vnodes vnodes are a bit special in that they may exist on per-CPU lists even while free. Add a KASAN-only destructor that poisons regions of each vnode that are not expected to be accessed after a free. MFC after: 2 weeks Sponsored by: The FreeBSD Foundation Differential Revision: https://reviews.freebsd.org/D29459	2021-04-13 17:42:21 -04:00
Mark Johnston	6faf45b34b	amd64: Implement a KASAN shadow map The idea behind KASAN is to use a region of memory to track the validity of buffers in the kernel map. This region is the shadow map. The compiler inserts calls to the KASAN runtime for every emitted load and store, and the runtime uses the shadow map to decide whether the access is valid. Various kernel allocators call kasan_mark() to update the shadow map. Since the shadow map tracks only accesses to the kernel map, accesses to other kernel maps are not validated by KASAN. UMA_MD_SMALL_ALLOC is disabled when KASAN is configured to reduce usage of the direct map. Currently we have no mechanism to completely eliminate uses of the direct map, so KASAN's coverage is not comprehensive. The shadow map uses one byte per eight bytes in the kernel map. In pmap_bootstrap() we create an initial set of page tables for the kernel and preloaded data. When pmap_growkernel() is called, we call kasan_shadow_map() to extend the shadow map. kasan_shadow_map() uses pmap_kasan_enter() to allocate memory for the shadow region and map it. Reviewed by: kib MFC after: 2 weeks Sponsored by: The FreeBSD Foundation Differential Revision: https://reviews.freebsd.org/D29417	2021-04-13 17:42:20 -04:00
Mark Johnston	38da497a4d	Add the KASAN runtime KASAN enables the use of LLVM's AddressSanitizer in the kernel. This feature makes use of compiler instrumentation to validate memory accesses in the kernel and detect several types of bugs, including use-after-frees and out-of-bounds accesses. It is particularly effective when combined with test suites or syzkaller. KASAN has high CPU and memory usage overhead and so is not suited for production environments. The runtime and pmap maintain a shadow of the kernel map to store information about the validity of memory mapped at a given kernel address. The runtime implements a number of functions defined by the compiler ABI. These are prefixed by __asan. The compiler emits calls to __asan_load() and __asan_store() around memory accesses, and the runtime consults the shadow map to determine whether a given access is valid. kasan_mark() is called by various kernel allocators to update state in the shadow map. Updates to those allocators will come in subsequent commits. The runtime also defines various interceptors. Some low-level routines are implemented in assembly and are thus not amenable to compiler instrumentation. To handle this, the runtime implements these routines on behalf of the rest of the kernel. The sanitizer implementation validates memory accesses manually before handing off to the real implementation. The sanitizer in a KASAN-configured kernel can be disabled by setting the loader tunable debug.kasan.disable=1. Obtained from: NetBSD MFC after: 2 weeks Sponsored by: The FreeBSD Foundation Differential Revision: https://reviews.freebsd.org/D29416	2021-04-13 17:42:20 -04:00
Mitchell Horne	2816bd8442	rmlock(9): add an RM_DUPOK flag Allows for duplicate locks to be acquired without witness complaining. Similar flags exists already for rwlock(9) and sx(9). Reviewed by: markj MFC after: 3 days Sponsored by: NetApp, Inc. Sponsored by: Klara, Inc. NetApp PR: 52 Differential Revision: https://reviews.freebsd.org/D29683n	2021-04-12 11:42:21 -03:00
Mark Johnston	dfff37765c	Rename struct device to struct _device types.h defines device_t as a typedef of struct device *. struct device is defined in subr_bus.c and almost all of the kernel uses device_t. The LinuxKPI also defines a struct device, so type confusion can occur. This causes bugs and ambiguity for debugging tools. Rename the FreeBSD struct device to struct _device. Reviewed by: gbe (man pages) Reviewed by: rpokala, imp, jhb MFC after: 1 week Sponsored by: The FreeBSD Foundation Differential Revision: https://reviews.freebsd.org/D29676	2021-04-12 09:32:30 -04:00
Andrew Turner	5d2d599d3f	Create VM_MEMATTR_DEVICE on all architectures This is intended to be used with memory mapped IO, e.g. from bus_space_map with no flags, or pmap_mapdev. Use this new memory type in the map request configured by resource_init_map_request, and in pciconf. Reviewed by: kib Differential Revision: https://reviews.freebsd.org/D29692	2021-04-12 06:15:31 +00:00
Konstantin Belousov	a091c35323	ptrace: restructure comments around reparenting on PT_DETACH style code, and use {} for both branches. Sponsored by: The FreeBSD Foundation MFC after: 1 week	2021-04-11 14:44:30 +03:00
Konstantin Belousov	9d7e450b64	ptrace: remove dead call to FIX_SSTEP() It was an alias for procfs_fix_sstep() long time ago. Sponsored by: The FreeBSD Foundation MFC after: 1 week	2021-04-11 14:44:30 +03:00
Piotr Pawel Stefaniak	a212f56d10	Balance parentheses in sysctl descriptions	2021-04-11 10:30:55 +02:00
Konstantin Belousov	2fd1ffefaa	Stop arming kqueue timers on knote owner suspend or terminate This way, even if the process specified very tight reschedule intervals, it should be stoppable/killable. Reported and reviewed by: markj Tested by: markj, pho Sponsored by: The FreeBSD Foundation MFC after: 2 weeks Differential revision: https://reviews.freebsd.org/D29106	2021-04-09 23:43:51 +03:00
Konstantin Belousov	533e5057ed	Add helper for kqueue timers callout scheduling Reviewed by: markj Tested by: markj, pho Sponsored by: The FreeBSD Foundation MFC after: 2 weeks Differential revision: https://reviews.freebsd.org/D29106	2021-04-09 23:42:56 +03:00
Konstantin Belousov	4d27d8d2f3	Stop arming realtime posix process timers on suspend or terminate Reported and reviewed by: markj Tested by: markj, pho Sponsored by: The FreeBSD Foundation MFC after: 2 weeks Differential revision: https://reviews.freebsd.org/D29106	2021-04-09 23:42:51 +03:00
Konstantin Belousov	dc47fdf131	Stop arming periodic process timers on suspend or terminate Reported and reviewed by: markj Tested by: markj, pho Sponsored by: The FreeBSD Foundation MFC after: 2 weeks Differential revision: https://reviews.freebsd.org/D29106	2021-04-09 23:42:44 +03:00
Mateusz Guzik	72b3b5a941	vfs: replace vfs_smr_quiesce with vfs_smr_synchronize This ends up using a smr specific method. Suggested by: markj Tested by: pho	2021-04-08 11:14:45 +00:00
Mark Johnston	0f07c234ca	Remove more remnants of sio(4) Reviewed by: imp MFC after: 1 week Sponsored by: The FreeBSD Foundation Differential Revision: https://reviews.freebsd.org/D29626	2021-04-07 14:33:02 -04:00
Mark Johnston	274579831b	capsicum: Limit socket operations in capability mode Capsicum did not prevent certain privileged networking operations, specifically creation of raw sockets and network configuration ioctls. However, these facilities can be used to circumvent some of the restrictions that capability mode is supposed to enforce. Add capability mode checks to disallow network configuration ioctls and creation of sockets other than PF_LOCAL and SOCK_DGRAM/STREAM/SEQPACKET internet sockets. Reviewed by: oshogbo Discussed with: emaste Reported by: manu Sponsored by: The FreeBSD Foundation MFC after: 2 weeks Differential Revision: https://reviews.freebsd.org/D29423	2021-04-07 14:32:56 -04:00
Mateusz Guzik	13b3862ee8	cache: update an assert on CACHE_FPL_STATUS_ABORTED Since symlink support it can get upgraded to CACHE_FPL_STATUS_DESTROYED. Reported by: bdrewery	2021-04-06 22:31:58 +02:00
Mark Johnston	2425f5e912	mount: Disallow mounting over a jail root Discussed with: jamie Approved by: so Security: CVE-2020-25584 Security: FreeBSD-SA-21:10.jail_mount	2021-04-06 14:49:36 -04:00
Edward Tomasz Napierala	7f6157f7fd	lock_delay(9): improve interaction with restrict_starvation After `e7a5b3bd05`, the la->delay value was adjusted after being set by the starvation_limit code block, which is wrong. Reported By: avg Reviewed By: avg Fixes: `e7a5b3bd05` Sponsored By: NetApp, Inc. Sponsored By: Klara, Inc. Differential Revision: https://reviews.freebsd.org/D29513	2021-04-03 13:08:53 +01:00
Mark Johnston	52a99c72b5	sendfile: Fix error initialization in sendfile_getobj() Reviewed by: chs, kib Reported by: jhb Fixes: `faa998f6ff` MFC after: 1 day Differential Revision: https://reviews.freebsd.org/D29540	2021-04-02 17:42:38 -04:00
Richard Scheffenegger	cad4fd0365	Make sbuf_drain safe for external use While sbuf_drain was an internal function, two KASSERTS checked the sanity of it being called. However, an external caller may be ignorant if there is any data to drain, or if an error has already accumulated. Be nice and return immediately with the accumulated error. MFC after: 2 weeks Reviewed By: tuexen, #transport Sponsored by: NetApp, Inc. Differential Revision: https://reviews.freebsd.org/D29544	2021-04-02 20:12:11 +02:00
Konstantin Belousov	aa3ea612be	x86: remove gcov kernel support Reviewed by: jhb Sponsored by: The FreeBSD Foundation Differential revision: https://reviews.freebsd.org/D29529	2021-04-02 15:41:51 +03:00
Mateusz Guzik	f79bd71def	cache: add high level overview Differential Revision: https://reviews.freebsd.org/D28675	2021-04-02 05:11:05 +02:00
Mateusz Guzik	dc532884d5	cache: fix resizing in face of lockless lookup Reported by: pho Tested by: pho	2021-04-02 05:11:05 +02:00
Lawrence Stewart	1eb402e47a	stats(3): Improve t-digest merging of samples which result in mu adjustment underflow. Allow the calculation of the mu adjustment factor to underflow instead of rejecting the VOI sample from the digest and logging an error. This trades off some (currently unquantified) additional centroid error in exchange for better fidelity of the distribution's density, which is the right trade off at the moment until follow up work to better handle and track accumulated error can be undertaken. Obtained from: Netflix MFC after: immediately	2021-04-02 13:17:53 +11:00
Richard Scheffenegger	c804c8f2c5	Export sbuf_drain to orchestrate lock and drain action While exporting large amounts of data to a sysctl request, datastructures may need to be locked. Exporting the sbuf_drain function allows the coordination between drain events and held locks, to avoid stalls. PR: 254333 Reviewed By: jhb MFC after: 2 weeks Sponsored by: NetApp, Inc. Differential Revision: https://reviews.freebsd.org/D29481	2021-03-31 19:17:37 +02:00
Konstantin Belousov	e243367b64	mbuf: add a way to mark flowid as calculated from the internal headers In some settings offload might calculate hash from decapsulated packet. Reserve a bit in packet header rsstype to indicate that. Add m_adj_decap() that acts similarly to m_adj, but also either clear flowid if it is not marked as inner, or transfer it to the decapsulated header, clearing inner indicator. It depends on the internals of m_adj() that reuses the argument packet header for the result. Use m_adj_decap() for decapsulating vxlan(4) and gif(4) input packets. Reviewed by: ae, hselasky, np Sponsored by: Nvidia Networking / Mellanox Technologies MFC after: 1 week Differential Revision: https://reviews.freebsd.org/D28773	2021-03-31 14:38:26 +03:00
Mark Johnston	3428b6c050	Fix several dev_clone callbacks to avoid out-of-bounds reads Use strncmp() instead of bcmp(), so that we don't have to find the minimum of the string lengths before comparing. Reviewed by: kib Reported by: KASAN MFC after: 3 days Sponsored by: The FreeBSD Foundation Differential Revision: https://reviews.freebsd.org/D29463	2021-03-28 11:08:36 -04:00
Mark Johnston	71c160a8f6	vfs: Add an assertion around name length limits Some filesystems assume that they can copy a name component, with length bounded by NAME_MAX, into a dirent buffer of size MAXNAMLEN. These constants have the same value; add a compile-time assertion to that effect. Reported by: Alexey Kulaev <alex.qart@gmail.com> Reviewed by: kib MFC after: 1 week Sponsored by: The FreeBSD Foundation Differential Revision: https://reviews.freebsd.org/D29431	2021-03-27 13:45:19 -04:00
Mark Johnston	653a437c04	accept_filter: Fix filter parameter handling For filters which implement accf_create, the setsockopt(2) handler caches the filter name in the socket, but it also incorrectly frees the buffer containing the copy, leaving a dangling pointer. Note that no accept filters provided in the base system are susceptible to this, as they don't implement accf_create. Reported by: Alexey Kulaev <alex.qart@gmail.com> Discussed with: emaste Security: kernel use-after-free MFC after: 3 days Sponsored by: The FreeBSD Foundation	2021-03-25 17:55:46 -04:00
Mark Johnston	ec8f1ea8d5	Generalize sanitizer interceptors for memory and string routines Similar to commit `3ead60236f` ("Generalize bus_space(9) and atomic(9) sanitizer interceptors"), use a more generic scheme for interposing sanitizer implementations of routines like memcpy(). No functional change intended. MFC after: 1 month Sponsored by: The FreeBSD Foundation	2021-03-24 19:46:22 -04:00
Mark Johnston	3ead60236f	Generalize bus_space(9) and atomic(9) sanitizer interceptors Make it easy to define interceptors for new sanitizer runtimes, rather than assuming KCSAN. Lay a bit of groundwork for KASAN and KMSAN. When a sanitizer is compiled in, atomic(9) and bus_space(9) definitions in atomic_san.h are used by default instead of the inline implementations in the platform's atomic.h. These definitions are implemented in the sanitizer runtime, which includes machine/{atomic,bus}.h with SAN_RUNTIME defined to pull in the actual implementations. No functional change intended. MFC after: 1 month Sponsored by: The FreeBSD Foundation	2021-03-22 22:21:53 -04:00
Adrian Chadd	25bfa44860	Add device and ifnet logging methods, similar to device_printf / if_printf * device_printf() is effectively a printf * if_printf() is effectively a LOG_INFO This allows subsystems to log device/netif stuff using different log levels, rather than having to invent their own way to prefix unit/netif names. Differential Revision: https://reviews.freebsd.org/D29320 Reviewed by: imp	2021-03-22 00:02:34 +00:00
Alex Richardson	6ceacebdf5	Unbreak MSG_CMSG_CLOEXEC MSG_CMSG_CLOEXEC has not been working since 2015 (SVN r284380) because _finstall expects O_CLOEXEC and not UF_EXCLOSE as the flags argument. This was probably not noticed because we don't have a test for this flag so this commit adds one. I found this problem because one of the libwayland tests was failing. Fixes: `ea31808c3b` ("fd: move out actual fp installation to _finstall") MFC after: 3 days Reviewed By: mjg, kib Differential Revision: https://reviews.freebsd.org/D29328	2021-03-18 20:52:20 +00:00
Mateusz Guzik	e9272225e6	vfs: fix vnlru marker handling for filtered/unfiltered cases The global list has a marker with an invariant that free vnodes are placed somewhere past that. A caller which performs filtering (like ZFS) can move said marker all the way to the end, across free vnodes which don't match. Then a caller which does not perform filtering will fail to find them. This makes vn_alloc_hard sleep for 1 second instead of reclaiming, resulting in significant stalls. Fix the problem by requiring an explicit marker by callers which do filtering. As a temporary measure extend vnlru_free to restart if it fails to reclaim anything. Big thanks go to the reporter for testing several iterations of the patch. Reported by: Yamagi <lists yamagi.org> Tested by: Yamagi <lists yamagi.org> Reviewed by: kib Differential Revision: https://reviews.freebsd.org/D29324	2021-03-18 14:59:03 +00:00
Kyle Evans	f187d6dfbf	base: remove if_wg(4) and associated utilities, manpage After length decisions, we've decided that the if_wg(4) driver and related work is not yet ready to live in the tree. This driver has larger security implications than many, and thus will be held to more scrutiny than other drivers. Please also see the related message sent to the freebsd-hackers@ and freebsd-arch@ lists by Kyle Evans <kevans@FreeBSD.org> on 2021/03/16, with the subject line "Removing WireGuard Support From Base" for additional context.	2021-03-17 09:14:48 -05:00
Mark Johnston	4aa157dd5b	link_elf_obj: Add a case missing from `5e6989ba4f` Fixes: `5e6989ba4f` MFC after: 3 days Sponsored by: The FreeBSD Foundation	2021-03-16 15:01:41 -04:00
Kyle Evans	74ae3f3e33	if_wg: import latest fixup work from the wireguard-freebsd project This is the culmination of about a week of work from three developers to fix a number of functional and security issues. This patch consists of work done by the following folks: - Jason A. Donenfeld <Jason@zx2c4.com> - Matt Dunwoodie <ncon@noconroy.net> - Kyle Evans <kevans@FreeBSD.org> Notable changes include: - Packets are now correctly staged for processing once the handshake has completed, resulting in less packet loss in the interim. - Various race conditions have been resolved, particularly w.r.t. socket and packet lifetime (panics) - Various tests have been added to assure correct functionality and tooling conformance - Many security issues have been addressed - if_wg now maintains jail-friendly semantics: sockets are created in the interface's home vnet so that it can act as the sole network connection for a jail - if_wg no longer fails to remove peer allowed-ips of 0.0.0.0/0 - if_wg now exports via ioctl a format that is future proof and complete. It is additionally supported by the upstream wireguard-tools (which we plan to merge in to base soon) - if_wg now conforms to the WireGuard protocol and is more closely aligned with security auditing guidelines Note that the driver has been rebased away from using iflib. iflib poses a number of challenges for a cloned device trying to operate in a vnet that are non-trivial to solve and adds complexity to the implementation for little gain. The crypto implementation that was previously added to the tree was a super complex integration of what previously appeared in an old out of tree Linux module, which has been reduced to crypto.c containing simple boring reference implementations. This is part of a near-to-mid term goal to work with FreeBSD kernel crypto folks and take advantage of or improve accelerated crypto already offered elsewhere. There's additional test suite effort underway out-of-tree taking advantage of the aforementioned jail-friendly semantics to test a number of real-world topologies, based on netns.sh. Also note that this is still a work in progress; work going further will be much smaller in nature. MFC after: 1 month (maybe)	2021-03-14 23:52:04 -05:00
Jonathan T. Looney	dbec10e088	Fetch the sigfastblock value in syscalls that wait for signals We have seen several cases of processes which have become "stuck" in kern_sigsuspend(). When this occurs, the kernel's td_sigblock_val is set to 0x10 (one block outstanding) and the userspace copy of the word is set to 0 (unblocked). Because the kernel's cached value shows that signals are blocked, kern_sigsuspend() blocks almost all signals, which means the process hangs indefinitely in sigsuspend(). It is not entirely clear what is causing this condition to occur. However, it seems to make sense to add some protection against this case by fetching the latest sigfastblock value from userspace for syscalls which will sleep waiting for signals. Here, the change is applied to kern_sigsuspend() and kern_sigtimedwait(). Reviewed by: kib Sponsored by: Netflix Differential Revision: https://reviews.freebsd.org/D29225	2021-03-12 18:14:17 +00:00
John Baldwin	640d54045b	Set TDP_KTHREAD before calling cpu_fork() and cpu_copy_thread(). This permits these routines to use special logic for initializing MD kthread state. For the kproc case, this required moving the logic to set these flags from kproc_create() into do_fork(). Reviewed by: kib MFC after: 1 week Sponsored by: Netflix Differential Revision: https://reviews.freebsd.org/D29207	2021-03-12 09:48:20 -08:00
Konstantin Belousov	44691b33cc	vlrureclaim: only skip vnode with resident pages if it own the pages Nullfs vnode which shares vm_object and pages with the lower vnode should not be exempt from the reclaim just because lower vnode cached a lot. Their reclamation is actually very cheap and should be preferred over real fs vnodes, but this change is already useful. Reported and tested by: pho Reviewed by: mckusick Sponsored by: The FreeBSD Foundation MFC after: 2 weeks Differential revision: https://reviews.freebsd.org/D29178	2021-03-12 13:31:08 +02:00
Warner Losh	e52368365d	config_intrhook: provide config_intrhook_drain config_intrhook_drain will remove the hook from the list as config_intrhook_disestablish does if the hook hasn't been called. If it has, config_intrhook_drain will wait for the hook to be disestablished in the normal course (or expedited, it's up to the driver to decide how and when to call config_intrhook_disestablish). This is intended for removable devices that use config_intrhook and might be attached early in boot, but that may be removed before the kernel can call the config_intrhook or before it ends. To prevent all races, the detach routine will need to call config_intrhook_train. Sponsored by: Netflix, Inc Reviewed by: jhb, mav, gde (in D29006 for man page) Differential Revision: https://reviews.freebsd.org/D29005	2021-03-11 09:45:10 -07:00
Hans Petter Selasky	6eb60f5b7f	Use the word "LinuxKPI" instead of "Linux compatibility", to not confuse with user-space Linux compatibility support. No functional change. MFC after: 1 week Sponsored by: Mellanox Technologies // NVIDIA Networking	2021-03-10 12:35:16 +01:00
Kyle Evans	1ae20f7c70	kern: malloc: fix panic on M_WAITOK during THREAD_NO_SLEEPING() Simple condition flip; we wanted to panic here after epoch_trace_list(). Reviewed by: glebius, markj MFC after: 3 days Differential Revision: https://reviews.freebsd.org/D29125	2021-03-09 05:16:39 -06:00
Warner Losh	88a5591203	config_intrhook: Move from TAILQ to STAILQ and padding config_intrhook doesn't need to be a two-pointer TAILQ. We rarely add/delete from this and so those need not be optimized. Instaed, use the one-pointer STAILQ plus a uintptr_t to be used as a flags word. This will allow these changes to be MFC'd to 12 and 13 to fix a race in removable devices. Feedback from: jhb Reviewed by: mav Differential Revision: https://reviews.freebsd.org/D29004	2021-03-08 15:59:00 -07:00
Mark Johnston	435c7cfb24	Rename _cscan_atomic.h and _cscan_bus.h to atomic_san.h and bus_san.h Other kernel sanitizers (KMSAN, KASAN) require interceptors as well, so put these in a more generic place as a step towards importing the other sanitizers. No functional change intended. MFC after: 1 week Sponsored by: The FreeBSD Foundation Differential Revision: https://reviews.freebsd.org/D29103	2021-03-08 12:39:06 -05:00
Mark Johnston	7995dae9d3	posix timers: Improve the overrun calculation timer_settime(2) may be used to configure a timeout in the past. If the timer is also periodic, we also try to compute the number of timer overruns that occurred between the initial timeout and the time at which the timer fired. This is done in a loop which iterates once per period between the initial timeout and now. If the period is small and the initial timeout was a long time ago, this loop can take forever to run, so the system is effectively DOSed. Replace the loop with a more direct calculation of (now - initial timeout) / period to compute the number of overruns. Reported by: syzkaller Reviewed by: kib MFC after: 1 week Sponsored by: The FreeBSD Foundation Differential Revision: https://reviews.freebsd.org/D29093	2021-03-08 12:39:06 -05:00
Mark Johnston	60d12ef952	posix timers: Sprinkle some style fixes MFC after: 1 week Sponsored by: The FreeBSD Foundation	2021-03-08 12:39:06 -05:00
Mark Johnston	8ff2b41c05	posix timers: Declare unexported functions as static MFC after: 1 week Sponsored by: The FreeBSD Foundation	2021-03-08 12:39:06 -05:00
Konstantin Belousov	56b9bee63a	Make kern.timecounter.hardware tunable Noted and reviewed by: kevans MFC after: 1 week Sponsored by: The FreeBSD Foundation Differential revision: https://reviews.freebsd.org/D29122	2021-03-08 03:48:21 +02:00
Hans Petter Selasky	c743a6bd4f	Implement mallocarray_domainset(9) variant of mallocarray(9). Reviewed by: kib @ MFC after: 1 week Sponsored by: Mellanox Technologies // NVIDIA Networking	2021-03-06 11:38:55 +01:00
Konstantin Belousov	ead7697f04	Restore AT_RESOLVE_BENEATH support for funlinkat(2)/unlinkat(2). MFC after: 1 week Sponsored by: The FreeBSD Foundation	2021-03-06 07:24:18 +02:00
Mark Johnston	89b650872b	ktls: Hide initialization message behind bootverbose We don't typically print anything when a subsystem initializes itself, and KTLS is currently disabled by default anyway. Reviewed by: jhb MFC after: 1 week Sponsored by: The FreeBSD Foundation Differential Revision: https://reviews.freebsd.org/D29097	2021-03-05 13:11:02 -05:00
Mark Johnston	5e6989ba4f	link_elf_obj: Handle init_array sections in KLDs Reuse existing handling for .ctors, print a warning if multiple constructor sections are present. Destructors are not handled as of yet. This is required for KASAN. Reviewed by: kib MFC after: 2 weeks Sponsored by: The FreeBSD Foundation Differential Revision: https://reviews.freebsd.org/D29049	2021-03-04 10:07:10 -05:00
Mark Johnston	49f6925ca3	ktls: Cache output buffers for software encryption Maintain a cache of physically contiguous runs of pages for use as output buffers when software encryption is configured and in-place encryption is not possible. This makes allocation and free cheaper since in the common case we avoid touching the vm_page structures for the buffer, and fewer calls into UMA are needed. gallatin@ reports a ~10% absolute decrease in CPU usage with sendfile/KTLS on a Xeon after this change. It is possible that we will not be able to allocate these buffers if physical memory is fragmented. To avoid frequently calling into the physical memory allocator in this scenario, rate-limit allocation attempts after a failure. In the failure case we fall back to the old behaviour of allocating a page at a time. N.B.: this scheme could be simplified, either by simply using malloc() and looking up the PAs of the pages backing the buffer, or by falling back to page by page allocation and creating a mapping in the cache zone. This requires some way to save a mapping of an M_EXTPG page array in the mbuf, though. m_data is not really appropriate. The second approach may be possible by saving the mapping in the plinks union of the first vm_page structure of the array, but this would force a vm_page access when freeing an mbuf. Reviewed by: gallatin, jhb Tested by: gallatin Sponsored by: Ampere Computing Submitted by: Klara, Inc. Differential Revision: https://reviews.freebsd.org/D28556	2021-03-03 17:34:01 -05:00
Alan Somers	ab63da3564	Speed up geom_stats_resync in the presence of many devices The old code had a O(n) loop, where n is the size of /dev/devstat. Multiply that by another O(n) loop in devstat_mmap for a total of O(n^2). This change adds DIOCGMEDIASIZE support to /dev/devstat so userland can quickly determine the right amount of memory to map, eliminating the O(n) loop in userland. This change decreases the time to run "gstat -bI0.001" with 16,384 md devices from 29.7s to 4.2s. Also, fix a memory leak first reported as PR 203097. Sponsored by: Axcient Reviewed by: mav, imp MFC after: 2 weeks Differential Revision: https://reviews.freebsd.org/D28968	2021-03-02 18:33:45 -07:00
Robert Wing	0cc746f193	filt_fsevent: only record interested events Respect filter-specific flags for the EVFILT_FS filter. When a kevent is registered with the EVFILT_FS filter, it is always triggered when an EVFILT_FS event occurs, regardless of the filter-specific flags used. Fix that. Reviewed by: kib Differential Revision: https://reviews.freebsd.org/D28974	2021-03-02 14:19:22 -09:00
Konstantin Belousov	28cd3a673e	O_RELATIVE_BENEATH: return ENOTCAPABLE instead of EINVAL for abs path Requested and reviewed by: markj Tested by: arichardson, pho Sponsored by: The FreeBSD Foundation MFC after: 1 week Differential Revision: https://reviews.freebsd.org/D28907	2021-03-02 20:21:40 +02:00
Konstantin Belousov	49c98a4bf3	nameicap_check_dotdot: trim tracker on check Tracker should contain exactly the path from the starting directory to the current lookup point. Otherwise we might not detect some cases of dotdot escape. Consequently, if we are walking up the tree by dotdot lookup, we must remove an entries below the walked directory. Reviewed by: markj Tested by: arichardson, pho Sponsored by: The FreeBSD Foundation MFC after: 1 week Differential revision: https://reviews.freebsd.org/D28907	2021-03-02 20:21:35 +02:00
Konstantin Belousov	e8a2862aa0	Add nameicap_cleanup_from(), to clean tracker list starting from some element Reviewed by: markj Tested by: arichardson, pho Sponsored by: The FreeBSD Foundation MFC after: 1 week Differential revision: https://reviews.freebsd.org/D28907	2021-03-02 20:21:30 +02:00
Konstantin Belousov	2388ad7c29	nameicap_tracker_add: avoid duplicates in the tracker list Reviewed by: markj Tested by: arichardson, pho Sponsored by: The FreeBSD Foundation MFC after: 1 week Differential revision: https://reviews.freebsd.org/D28907	2021-03-02 20:21:23 +02:00
Konstantin Belousov	59e7494281	Do not call nameicap_tracker_add() for dotdot case. Reviewed by: markj Tested by: arichardson, pho Sponsored by: The FreeBSD Foundation MFC after: 1 week Differential revision: https://reviews.freebsd.org/D28907	2021-03-02 20:21:14 +02:00
Konstantin Belousov	20e91ca36a	open(2): Remove O_BENEATH and AT_BENEATH with the reasoning that the flags did not worked properly, and were not shipped in a release. O_RESOLVE_BENEATH is kept as useful. Reviewed by: markj Tested by: arichardson, pho Sponsored by: The FreeBSD Foundation MFC after: 1 week Differential revision: https://reviews.freebsd.org/D28907	2021-03-02 20:16:55 +02:00
Kyle Evans	60c4ec806d	jail: allow root to implicitly widen its cpuset to attach The default behavior for attaching processes to jails is that the jail's cpuset augments the attaching processes, so that it cannot be used to escalate a user's ability to take advantage of more CPUs than the administrator wanted them to. This is problematic when root needs to manage jails that have disjoint sets with whatever process is attaching, as this would otherwise result in a deadlock. Therefore, if we did not have an appropriate common subset of cpus/domains for our new policy, we now allow the process to simply take on the jail set if it has the privilege to widen its mask anyways. With the new logic, root can still usefully cpuset a process that attaches to a jail with the desire of maintaining the set it was given pre-attachment while still retaining the ability to manage child jails without jumping through hoops. A test has been added to demonstrate the issue; cpuset of a process down to just the first CPU and attempting to attach to a jail without access to any of the same CPUs previously resulted in EDEADLK and now results in taking on the jail's mask for privileged users. PR: 253724 Reviewed by: jamie (also discussed with) MFC after: 3 days Differential Revision: https://reviews.freebsd.org/D28952	2021-03-01 12:38:31 -06:00
Rick Macklem	a5f9fe2bab	copy_file_range(2): Fix for small values of input file offset and len r366302 broke copy_file_range(2) for small values of input file offset and len. It was possible for rem to be greater than len and then "len - rem" was a large value, since both variables are unsigned. Reported by: koobs, Pablo <pablogsal gmail com> (Python) Reviewed by: asomers, koobs MFC after: 3 days Differential Revision: https://reviews.freebsd.org/D28981	2021-03-01 06:31:10 -08:00
Konstantin Belousov	b5449c92b4	Use atomic_interrupt_fence() instead of bare __compiler_membar() for the which which definitely use membar to sync with interrupt handlers. libc and rtld uses of __compiler_membar() seems to want compiler barriers proper. The barrier in sched_unpin_lite() after td_pinned decrement seems to be not needed and removed, instead of convertion. Reviewed by: markj MFC after: 1 week Sponsored by: The FreeBSD Foundation Differential Revision: https://reviews.freebsd.org/D28956	2021-02-28 01:27:29 +02:00
Mateusz Guzik	1239a72221	cache: temporarily drop the assert that dvp != vp when adding an entry Historically it was allowed for any names, but arguably should never be even attempted. Allow it again since there is a release pending and allowing it is bug-compatible with previous behavior. Reported by: otis	2021-02-27 22:29:50 +00:00
Jamie Gritton	589e4c1df4	jail: Add safety around prison_deref() flags. do_jail_attach() now only uses the PD_XXX flags that refer to lock status, so make sure that something else like PD_KILL doesn't slip through. Add a KASSERT() in prison_deref() to catch any further PD_KILL misuse.	2021-02-25 20:10:42 -08:00
Jamie Gritton	108a9384e9	jail: Fix locking on an early jail_set error. I had locked allprison_lock without immediately setting PD_LIST_LOCKED.	2021-02-25 19:52:58 -08:00
John Baldwin	90972f0402	ktls: Use COUNTER_U64_DEFINE_EARLY for the ktls_toe_chacha20 counter. I missed updating this counter when rebasing the changes in `9c64fc4029` after the switch to COUNTER_U64_DEFINE_EARLY in `1755b2b989`. Fixes: `9c64fc4029` Add Chacha20-Poly1305 as a KTLS cipher suite. Sponsored by: Netflix	2021-02-25 15:00:13 -08:00
Ryan Libby	d7671ad8d6	Close races in vm object chain traversal for unlock We were unlocking the vm object before reading the backing_object field. In the meantime, the object could be freed and reused. This could cause us to go off the rails in the object chain traversal, failing to unlock the rest of the objects in the original chain and corrupting the lock state of the victim chain. Reviewed by: bdrewery, kib, markj, vangyzen MFC after: 3 days Sponsored by: Dell EMC Isilon Differential Revision: https://reviews.freebsd.org/D28926	2021-02-25 12:11:19 -08:00
Edward Tomasz Napierala	e7a5b3bd05	Modify lock_delay() to increase the delay time after spinning Modify lock_delay() to increase the delay time after spinning, not before. Previously we would spin at least twice instead of once. In NetApp's benchmarks this fixes a performance regression compared to FreeBSD 10, which called cpu_spinwait() directly. Reviewed By: mjg Sponsored by: NetApp, Inc. Sponsored by: Klara, Inc. Differential Revision: https://reviews.freebsd.org/D27331	2021-02-25 18:55:26 +00:00
Mark Johnston	369706a6f8	buf: Fix the dirtybufthresh check dirtybufthresh is a watermark, slightly below the high watermark for dirty buffers. When a delayed write is issued, the dirtying thread will start flushing buffers if the dirtybufthresh watermark is reached. This helps ensure that the high watermark is not reached, otherwise performance will degrade as clustering and other optimizations are disabled (see buf_dirty_count_severe()). When the buffer cache was partitioned into "domains", the dirtybufthresh threshold checks were not updated. Fix this. Reported by: Shrikanth R Kamath <kshrikanth@juniper.net> Reviewed by: rlibby, mckusick, kib, bdrewery Sponsored by: Juniper Networks, Inc., Klara, Inc. Fixes: `3cec5c77d6` MFC after: 1 week Differential Revision: https://reviews.freebsd.org/D28901	2021-02-25 10:04:44 -05:00
Mark Johnston	faa998f6ff	sendfile: Use the pager size to determine the file extent when possible Previously sendfile would issue a VOP_GETATTR and use the returned size, i.e., the file size. When paging in file data, sendfile_swapin() will use the pager to determine whether it needs to zero-fill, most often because of a hole in a sparse file. An attempt to page in beyond the end of a file is treated this way, and occurs when the requested page is past the end of the pager. In other words, both the file size and pager size were used interchangeably. With ZFS, updates to the pager and file sizes are not synchronized by the exclusive vnode lock, at least partially due to its use of MNTK_SHARED_WRITES. In particular, the pager size is updated after the file size, so in the presence of a writer concurrently extending the file, sendfile could incorrectly instantiate "holes" in the page cache pages backing the file, which manifests as data corruption when reading the file back from the page cache. The on-disk copy is unaffected. Fix this by consistently using the pager size when available. Reported by: dumbbell Reviewed by: chs, kib Tested by: dumbbell, pho MFC after: 1 week Sponsored by: The FreeBSD Foundation Differential Revision: https://reviews.freebsd.org/D28811	2021-02-25 10:04:44 -05:00
Jamie Gritton	c861373bdf	jail: re-commit `811e27fa3c` with fixes Make sure PD_KILL isn't passed to do_jail_attach, where it might end up trying to kill the caller's prison (even prison0). Fix the child jail loop in prison_deref_kill, which was doing the post-order part during the pre-order part. That's not a system- killer, but make jails not always die correctly.	2021-02-24 21:54:49 -08:00
Jamie Gritton	ddfffb41a2	jail: back out `811e27fa3c` until it doesn't break Jenkins Reported by: arichardson	2021-02-24 21:10:47 -08:00
Mark Johnston	1d44514fcd	rmlock: Add a required compiler membar to the rlock slow path The tracker flags need to be loaded only after the tracker is removed from its per-CPU queue. Otherwise, readers may fail to synchronize with pending writers attempting to propagate priority to active readers, and readers and writers deadlock on each other. This was observed in a stable/12-based armv7 kernel where the compiler had reordered the load of rmp_flags to before the stores updating the queue. Reviewed by: rlibby, scottl Discussed with: kib Sponsored by: Rubicon Communications, LLC ("Netgate") MFC after: 1 week Differential Revision: https://reviews.freebsd.org/D28821	2021-02-23 21:17:12 -05:00
Alex Richardson	fa32350347	close_range: add audit support This fixes the closefrom test in sys/audit. Includes cherry-picks of the following commits from openbsm: `4dfc628aaf` `99ff6fe32a` `da48a0399e` Reviewed By: kevans Differential Revision: https://reviews.freebsd.org/D28388	2021-02-23 17:47:07 +00:00
Jamie Gritton	0a2a96f35a	jail: Don't allow jails under dying parents If a jail is created with jail_set(...JAIL_DYING), and it has a parent currently in a dying state, that will bring the parent jail back to life. Restrict that to require that the parent itself be explicitly brought back first, and not implicitly created along with the new child jail. Differential Revision: https://reviews.freebsd.org/D28515	2021-02-22 17:04:06 -08:00
Jamie Gritton	701d6b50ae	jail: Fix a LOR introduced in `1158508a80`	2021-02-22 15:51:10 -08:00
Jamie Gritton	811e27fa3c	jail: Add PD_KILL to remove a prison in prison_deref(). Add the PD_KILL flag that instructs prison_deref() to take steps to actively kill a prison and its descendents, namely marking it PRISON_STATE_DYING, clearing its PR_PERSIST flag, and killing any attached processes. This replaces a similar loop in sys_jail_remove(), bringing the operation under the same single hold on allprison_lock that it already has. It is also used to clean up failed jail (re-)creations in kern_jail_set(), which didn't generally take all the proper steps. Differential Revision: https://reviews.freebsd.org/D28473	2021-02-22 12:27:44 -08:00
Mark Johnston	608c44f96e	m_uiotombuf_nomap(): Stop clearing PG_ZERO in newly allocated pages The caller should not be passing M_ZERO in the first place, so PG_ZERO will not be preserved by the page allocator and clearing it accomplishes nothing. Reviewed by: gallatin, jhb MFC after: 1 week Sponsored by: The FreeBSD Foundation Differential Revision: https://reviews.freebsd.org/D28808	2021-02-22 10:04:46 -05:00
Jamie Gritton	1158508a80	jail: Add pr_state to struct prison Rather that using references (pr_ref and pr_uref) to deduce the state of a prison, keep track of its state explicitly. A prison is either "invalid" (pr_ref == 0), "alive" (pr_uref > 0) or "dying" (pr_uref == 0). State transitions are generally tied to the reference counts, but with some flexibility: a new prison is "invalid" even though it now starts with a reference, and jail_remove(2) sets the state to "dying" before the user reference count drops to zero (which was prviously accomplished via the PR_REMOVE flag). pr_state is protected by both the prison mutex and allprison_lock, so it has the same availablity guarantees as the reference counts do. Differential Revision: https://reviews.freebsd.org/D27876	2021-02-21 13:24:47 -08:00
Mateusz Guzik	ee9b37ae5c	jail: fix build after the previous commit Noted by: Michael Butler <imb protected-networks.net>	2021-02-21 21:05:25 +00:00
Jamie Gritton	f7496dcab0	jail: Change the locking around pr_ref and pr_uref Require both the prison mutex and allprison_lock when pr_ref or pr_uref go to/from zero. Adding a non-first or removing a non-last reference remain lock-free. This means that a shared hold on allprison_lock is sufficient for prison_isalive() to be useful, which removes a number of cases of lock/check/unlock on the prison mutex. Expand the locking in kern_jail_set() to keep allprison_lock held exclusive until the new prison is valid, thus making invalid prisons invisible to any thread holding allprison_lock (except of course the one creating or destroying the prison). This renders prison_isvalid() nearly redundant, now used only in asserts. Differential Revision: https://reviews.freebsd.org/D28419 Differential Revision: https://reviews.freebsd.org/D28458	2021-02-21 10:55:44 -08:00
Konstantin Belousov	2bfd8992c7	vnode: move write cluster support data to inodes. The data is only needed by filesystems that 1. use buffer cache 2. utilize clustering write support. Requested by: mjg Reviewed by: asomers (previous version), fsu (ext2 parts), mckusick Tested by: pho Sponsored by: The FreeBSD Foundation Differential Revision: https://reviews.freebsd.org/D28679	2021-02-21 11:38:21 +02:00
Konstantin Belousov	750ea20d3f	Delete dead CLUSTERDEBUG config option. Reviewed by: mckusick Tested by: pho MFC after: 1 week Sponsored by: The FreeBSD Foundation Differential Revision: https://reviews.freebsd.org/D28679	2021-02-21 11:38:21 +02:00
Mateusz Guzik	81174cd8e2	vfs: employ vfs_ref_from_vp in statfs and fstatfs Avoids locking and unlocking the vnode. Reviewed by: kib Differential Revision: https://reviews.freebsd.org/D28695	2021-02-21 00:43:05 +00:00
Mateusz Guzik	a15f787adb	vfs: add vfs_ref_from_vp This generalizes what vop_stdgetwritemount used to be doing. Reviewed by: kib Differential Revision: https://reviews.freebsd.org/D28695	2021-02-21 00:43:05 +00:00
Jamie Gritton	6e1d1bfcac	jail: Improve locking when removing prisons Change the flow of prison_deref() so it doesn't let go of allprison_lock until it's completely done using it (except for a possible drop as part of an upgrade on its first try). Differential Revision: https://reviews.freebsd.org/D28458 MFC after: 3 days	2021-02-20 14:38:58 -08:00
Jamie Gritton	d4380c0cdd	jail: Change both root and working directories in jail_attach(2) jail_attach(2) performs an internal chroot operation, leaving it up to the calling process to assure the working directory is inside the jail. Add a matching internal chdir operation to the jail's root. Also ignore kern.chroot_allow_open_directories, and always disallow the operation if there are any directory descriptors open. Reported by: mjg Approved by: markj, kib MFC after: 3 days	2021-02-19 14:13:35 -08:00
John Baldwin	9c64fc4029	Add Chacha20-Poly1305 as a KTLS cipher suite. Chacha20-Poly1305 for TLS is an AEAD cipher suite for both TLS 1.2 and TLS 1.3 (RFCs 7905 and 8446). For both versions, Chacha20 uses the server and client IVs as implicit nonces xored with the record sequence number to generate the per-record nonce matching the construction used with AES-GCM for TLS 1.3. Reviewed by: gallatin Sponsored by: Netflix Differential Revision: https://reviews.freebsd.org/D27839	2021-02-18 09:26:32 -08:00
Thomas Skibo	9976b42b69	ddb: fix show devmap output on 32-bit arm The output has been broken since `1b6dd6d772`. Casting to uintmax_t before the call to printf is necessary to ensure that 32-bit addresses are interpreted correctly. PR: 243236 MFC after: 3 days	2021-02-18 11:53:14 -04:00
Konstantin Belousov	662283b108	vn_printf: handle VI_FOPENING Noted by: mjg Sponsored by: The FreeBSD Foundation MFC after: 6 days Fixes: `fa3bd463ce`	2021-02-18 16:28:28 +02:00
Alex Richardson	fa2528ac64	Use atomic loads/stores when updating td->td_state KCSAN complains about racy accesses in the locking code. Those races are fine since they are inside a TD_SET_RUNNING() loop that expects the value to be changed by another CPU. Use relaxed atomic stores/loads to indicate that this variable can be written/read by multiple CPUs at the same time. This will also prevent the compiler from doing unexpected re-ordering. Reported by: GENERIC-KCSAN Test Plan: KCSAN no longer complains, kernel still runs fine. Reviewed By: markj, mjg (earlier version) Differential Revision: https://reviews.freebsd.org/D28569	2021-02-18 14:02:48 +00:00
Konstantin Belousov	fa3bd463ce	lockf: ensure atomicity of lockf for open(O_CREAT\|O_EXCL\|O_EXLOCK) or EX_SHLOCK. Do it by setting a vnode iflag indicating that the locking exclusive open is in progress, and not allowing F_LOCK request to make a progress until the first open finishes. Requested by: mckusick Reviewed by: markj, mckusick Tested by: pho Sponsored by: The FreeBSD Foundation MFC after: 1 week Differential Revision: https://reviews.freebsd.org/D28697	2021-02-18 01:22:05 +02:00
Warner Losh	00065c7630	Giant: move back Giant removal until 14 Update the Giant Lock warning message to FreeBSD 14. It's growing increasling clear that this won't be done before 13.0. MFC: Insta (re@'s request)	2021-02-17 14:33:09 -07:00
Jamie Gritton	cc7b730653	jail: Handle a possible race between jail_remove(2) and fork(2) jail_remove(2) includes a loop that sends SIGKILL to all processes in a jail, but skips processes in PRS_NEW state. Thus it is possible the a process in mid-fork(2) during jail removal can survive the jail being removed. Add a prison flag PR_REMOVE, which is checked before the new process returns. If the jail is being removed, the process will then exit. Also check this flag in jail_attach(2) which has a similar issue. Reported by: trasz Approved by: kib MFC after: 3 days	2021-02-16 11:19:13 -08:00
Konstantin Belousov	c61fae1475	pgcache read: protect against reads past end of the vm object size If uio_offset is past end of the object size, calculated resid is negative. Delegate handling this case to the locked read, as any other non-trivial situation. PR: 253158 Reported by: Harald Schmalzbauer <bugzilla.freebsd@omnilan.de> Tested by: cy Sponsored by: The FreeBSD Foundation MFC after: 1 week	2021-02-16 07:09:37 +02:00
Alex Richardson	0482d7c9e9	Fix fget_only_user() to return ENOTCAPABLE on a failed capsicum check After `eaad8d1303` four additional capsicum-test tests started failing. It turns out this is because fget_only_user() was returning EBADF on a failed capsicum check instead of forwarding the return value of cap_check_inline() like fget_unlocked_seq(). capsicum-test failures before this: ``` [ FAILED ] 7 tests, listed below: [ FAILED ] Capability.OperationsForked [ FAILED ] Capability.NoBypassDAC [ FAILED ] Pdfork.OtherUserForked [ FAILED ] PipePdfork.WildcardWait [ FAILED ] OpenatTest.WithFlag [ FAILED ] ForkedOpenatTest_WithFlagInCapabilityMode._ [ FAILED ] Select.LotsOFileDescriptorsForked ``` After: ``` [ FAILED ] 3 tests, listed below: [ FAILED ] Capability.NoBypassDAC [ FAILED ] Pdfork.OtherUserForked [ FAILED ] PipePdfork.WildcardWait ``` Reviewed By: mjg MFC after: 1 week Differential Revision: https://reviews.freebsd.org/D28691	2021-02-15 22:55:12 +00:00
Jason A. Harmening	41032835dc	Fix divide-by-zero panic when ASLR is enabled and superpages disabled When locating the anonymous memory region for a vm_map with ASLR enabled, we try to keep the slid base address aligned on a superpage boundary to minimize pagetable fragmentation and maximize the potential usage of superpage mappings. We can't (portably) do this if superpages have been disabled by loader tunable and pagesizes[1] is 0, and it would be less beneficial in that case anyway. PR: 253511 Reported by: johannes@jo-t.de MFC after: 1 week Reviewed by: kib Differential Revision: https://reviews.freebsd.org/D28678	2021-02-15 10:38:04 -08:00
Mateusz Guzik	eac22dd480	lockmgr: shrink struct lock by 8 bytes on LP64 Currently the struct has a 4 byte padding stemming from 3 ints. 1. prio comfortably fits in short, unfortunately there is no dedicated type for it and plumbing it throughout the codebase is not worth it right now, instead an assert is added which covers also flags for safety 2. lk_exslpfail can in principle exceed u_short, but the count is already not considered reliable and it only ever gets modified straight to 0. In other words it can be incrementing with an upper bound of USHRT_MAX With these in place struct lock shrinks from 48 to 40 bytes. Reviewed by: kib (previous version) Differential Revision: https://reviews.freebsd.org/D28680	2021-02-15 13:57:25 +00:00
Konstantin Belousov	25c6318c79	procstat: distinguish vm map guards in procstat vm output. Requested and reviewed by: rwatson (previous version) Sponsored by: The FreeBSD Foundation MFC after: 1 week Differential Revision: https://reviews.freebsd.org/D28658	2021-02-14 03:24:58 +02:00
Konstantin Belousov	adf28ab456	fifo: minor comment and assert improvements. In particular, replace a note that reload through vget() is obsoleted, with explanation why this code is required. Reviewed by: chs, mckusick Tested by: pho MFC after: 2 weeks Sponsored by: The FreeBSD Foundation	2021-02-12 03:02:22 +02:00
Konstantin Belousov	b59a8e63d6	Stop ignoring ERELOOKUP from VOP_INACTIVE() When possible, relock the vnode and retry inactivation. Only vunref() is required not to drop the vnode lock, so handle it specially by not retrying. This is a part of the efforts to ensure that unlinked not referenced vnode does not prevent inode from reusing. Reviewed by: chs, mckusick Tested by: pho MFC after: 2 weeks Sponsored by: The FreeBSD Foundation	2021-02-12 03:02:21 +02:00
Konstantin Belousov	3b2aa36024	Use VOP_VPUT_PAIR() for eligible VFS syscalls. The current list is limited to the cases where UFS needs to handle vput(dvp) specially. Which means VOP_CREATE(), VOP_MKDIR(), VOP_MKNOD(), VOP_LINK(), and VOP_SYMLINK(). Reviewed by: chs, mkcusick Tested by: pho MFC after: 2 weeks Sponsored by: The FreeBSD Foundation	2021-02-12 03:02:20 +02:00
Konstantin Belousov	49c117c193	Add VOP_VPUT_PAIR() with trivial default implementation. The VOP is intended to be used in situations where VFS has two referenced locked vnodes, typically a directory vnode dvp and a vnode vp that is linked from the directory, and at least dvp is vput(9)ed. The child vnode can be also vput-ed, but optionally left referenced and locked. There, at least UFS may need to do some actions with dvp which cannot be done while vp is also locked, so its lock might be dropped temporary. For instance, in some cases UFS needs to sync dvp to avoid filesystem state that is currently not handled by either kernel nor fsck. Having such VOP provides the neccessary context for filesystem which can do correct locking and handle potential reclamation of vp after relock. Trivial implementation does vput(dvp) and optionally vput(vp). Reviewed by: chs, mckusick Tested by: pho MFC after: 2 weeks Sponsored by: The FreeBSD Foundation	2021-02-12 03:02:20 +02:00
Konstantin Belousov	ee965dfa64	vn_open(): If the vnode is reclaimed during open(2), do not return error. Most future operations on the returned file descriptor will fail anyway, and application should be ready to handle that failures. Not forcing it to understand the transient failure mode on open, which is implementation-specific, should make us less special without loss of reporting of errors. Suggested by: chs Reviewed by: chs, mckusick Tested by: pho MFC after: 2 weeks Sponsored by: The FreeBSD Foundation	2021-02-12 03:02:20 +02:00
Konstantin Belousov	bf0db19339	buf SU hooks: track buf_start() calls with B_IOSTARTED flag and only call buf_complete() if previously started. Some error paths, like CoW failire, might skip buf_start() and do bufdone(), which itself call buf_complete(). Various SU handle_written_XXX() functions check that io was started and incomplete parts of the buffer data reverted before restoring them. This is a useful invariant that B_IO_STARTED on buffer layer allows to keep instead of changing check and panic into check and return. Reported by: pho Reviewed by: chs, mckusick Tested by: pho MFC after: 2 weeks Sponsored by: The FreeBSD Foundations	2021-02-12 03:02:19 +02:00
Mateusz Guzik	39e0c3f686	cache: assorted comment fixups	2021-02-09 17:09:44 +01:00
Kyle Evans	504ebd612e	kern: sonewconn: set so_options before pru_attach() Protocol attachment has historically been able to observe and modify so->so_options as needed, and it still can for newly created sockets. `779f106aa1` moved this to after pru_attach() when we re-acquire the lock on the listening socket. Restore the historical behavior so that pru_attach implementations can consistently use it. Note that some pru_attach() do currently rely on this, though that may change in the future. D28265 contains a change to remove the use in TCP and IB/SDP bits, as resetting the requested linger time on incoming connections seems questionable at best. This does move the assignment out from under the head's listen lock, but glebius notes that head won't be going away and applications cannot assume any specific ordering with a race between a connection coming in and the application changing socket options anyways. Discussed-with: glebius MFC-after: 1 week	2021-02-08 21:44:43 -06:00
Alexander V. Chernikov	924d1c9a05	Revert "SO_RERROR indicates that receive buffer overflows should be handled as errors." Wrong version of the change was pushed inadvertenly. This reverts commit `4a01b854ca`.	2021-02-08 22:32:32 +00:00
Alexander V. Chernikov	4a01b854ca	SO_RERROR indicates that receive buffer overflows should be handled as errors. Historically receive buffer overflows have been ignored and programs could not tell if they missed messages or messages had been truncated because of overflows. Since programs historically do not expect to get receive overflow errors, this behavior is not the default. This is really really important for programs that use route(4) to keep in sync with the system. If we loose a message then we need to reload the full system state, otherwise the behaviour from that point is undefined and can lead to chasing bogus bug reports.	2021-02-08 21:42:20 +00:00
Mark Johnston	b5aa9ad43a	ktls: Make configuration sysctls available as tunables Reviewed by: gallatin, jhb Sponsored by: Ampere Computing Submitted by: Klara, Inc. MFC after: 1 week Differential Revision: https://reviews.freebsd.org/D28499	2021-02-08 09:19:02 -05:00
Mark Johnston	1755b2b989	ktls: Use COUNTER_U64_DEFINE_EARLY This makes it a bit more straightforward to add new counters when debugging. No functional change intended. Reviewed by: jhb Sponsored by: Ampere Computing Submitted by: Klara, Inc. MFC after: 1 week Differential Revision: https://reviews.freebsd.org/D28498	2021-02-08 09:18:51 -05:00
Mateusz Guzik	2f8a844635	cache: remove the largely obsolete general description Examples of inconsistencies with the current state: - references LRU of all entries, removed years ago - references a non-existent lock (neglist) - claims negative entries have a NULL target It will be replaced with a more accurate and more informative description. In the meantime take it out so it stops misleading.	2021-02-06 00:28:40 +01:00
Mateusz Guzik	0e1594e60e	cache: fix vfs:namecache:lookup:miss probe call sites	2021-02-06 00:28:40 +01:00
Mateusz Guzik	2e96132a7d	cache: drop spurious arg from panic in cache_validate vp is already reported when noting mismatch	2021-02-06 00:28:39 +01:00
Mateusz Guzik	b54ed778fe	cache: comment on FNV	2021-02-06 00:13:57 +01:00
Ed Maste	edc374e7c4	Correct description for kern.proc.proc_td kern.proc.proc_td returns the process table with an entry for each thread. Previously the description included "no threads", presumably a cut-and-pasteo in `2648efa621`. Description suggested by PauAmma. PR: 253146 MFC after: 3 days Sponsored by: The FreeBSD Foundation	2021-02-02 17:00:05 -05:00
Mateusz Guzik	45456abc4c	cache: fix trailing slash support in face of permission problems Reported by: Johan Hendriks <joh.hendriks gmail.com> Tested by: kevans	2021-02-02 18:13:51 +00:00

1 2 3 4 5 ...

18388 Commits