A nullfs vnode which shares its vm_object and pages with the lower vnode should
not be exempt from reclamation just because the lower vnode has cached a lot.
Reclaiming such vnodes is actually very cheap and should be preferred over
reclaiming real fs vnodes, but this change is already useful on its own.
Reported and tested by: pho
Reviewed by: mckusick
Sponsored by: The FreeBSD Foundation
MFC after: 2 weeks
Differential revision: https://reviews.freebsd.org/D29178
config_intrhook_drain will remove the hook from the list, as
config_intrhook_disestablish does, if the hook hasn't been called. If it has,
config_intrhook_drain will wait for the hook to be disestablished in the normal
course (or expedited; it's up to the driver to decide how and when
to call config_intrhook_disestablish).
This is intended for removable devices that use config_intrhook and might be
attached early in boot, but that may be removed before the kernel calls the
hook or before the hook finishes. To prevent all races, the detach routine will
need to call config_intrhook_drain.
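A minimal sketch (the driver, softc, and function names here are hypothetical)
of how a removable device's detach routine might use config_intrhook_drain()
to close the race described above:

#include <sys/param.h>
#include <sys/bus.h>
#include <sys/kernel.h>

struct mydev_softc {
    struct intr_config_hook sc_hook;    /* established in mydev_attach() */
};

static int
mydev_detach(device_t dev)
{
    struct mydev_softc *sc = device_get_softc(dev);

    /*
     * Either removes the hook (if it never ran) or waits until the hook
     * function has disestablished itself, so the hook cannot run
     * concurrently with, or after, the teardown that follows.
     */
    config_intrhook_drain(&sc->sc_hook);

    /* ... release the rest of the driver's resources ... */
    return (0);
}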
Sponsored by: Netflix, Inc
Reviewed by: jhb, mav, gde (in D29006 for man page)
Differential Revision: https://reviews.freebsd.org/D29005
Simple condition flip; we wanted to panic here after epoch_trace_list().
Reviewed by: glebius, markj
MFC after: 3 days
Differential Revision: https://reviews.freebsd.org/D29125
config_intrhook doesn't need to be a two-pointer TAILQ. We rarely add to or
delete from this list, so those operations need not be optimized. Instead, use
the one-pointer STAILQ plus a uintptr_t to be used as a flags word. This will
allow these changes to be MFC'd to 12 and 13 to fix a race in removable
devices.
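Roughly, the hook entry then has this shape (a sketch of the layout, not
necessarily the exact field names):

#include <sys/types.h>
#include <sys/queue.h>

struct intr_config_hook {
    STAILQ_ENTRY(intr_config_hook) ich_links;   /* one pointer */
    uintptr_t ich_state;                        /* flags word; together with the
                                                   STAILQ entry this matches the
                                                   size of the old TAILQ entry */
    void (*ich_func)(void *arg);
    void *ich_arg;
};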
Feedback from: jhb
Reviewed by: mav
Differential Revision: https://reviews.freebsd.org/D29004
Other kernel sanitizers (KMSAN, KASAN) require interceptors as well, so
put these in a more generic place as a step towards importing the other
sanitizers.
No functional change intended.
MFC after: 1 week
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D29103
timer_settime(2) may be used to configure a timeout in the past. If
the timer is also periodic, we try to compute the number of timer
overruns that occurred between the initial timeout and the time at which
the timer fired. This is done in a loop which iterates once per period
between the initial timeout and now. If the period is small and the
initial timeout was a long time ago, this loop can take a very long time
to run, so the system is effectively DoSed.
Replace the loop with a more direct calculation of
(now - initial timeout) / period to compute the number of overruns.
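In outline (the names and the use of sbintime_t here are illustrative; the
actual code may use a different time representation):

#include <sys/types.h>
#include <sys/time.h>

static int
timer_overruns_sketch(sbintime_t now, sbintime_t expiration, sbintime_t period)
{
    sbintime_t late = now - expiration;    /* how long ago it should have fired */

    if (late <= 0 || period <= 0)
        return (0);
    /* One division instead of one loop iteration per elapsed period. */
    return ((int)(late / period));
}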
Reported by: syzkaller
Reviewed by: kib
MFC after: 1 week
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D29093
We don't typically print anything when a subsystem initializes itself,
and KTLS is currently disabled by default anyway.
Reviewed by: jhb
MFC after: 1 week
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D29097
Reuse the existing handling for .ctors and print a warning if multiple
constructor sections are present. Destructors are not handled yet.
This is required for KASAN.
Reviewed by: kib
MFC after: 2 weeks
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D29049
Maintain a cache of physically contiguous runs of pages for use as
output buffers when software encryption is configured and in-place
encryption is not possible. This makes allocation and free cheaper
since in the common case we avoid touching the vm_page structures for
the buffer, and fewer calls into UMA are needed. gallatin@ reports a
~10% absolute decrease in CPU usage with sendfile/KTLS on a Xeon after
this change.
It is possible that we will not be able to allocate these buffers if
physical memory is fragmented. To avoid frequently calling into the
physical memory allocator in this scenario, rate-limit allocation
attempts after a failure. In the failure case we fall back to the old
behaviour of allocating a page at a time.
N.B.: this scheme could be simplified, either by simply using malloc()
and looking up the PAs of the pages backing the buffer, or by falling
back to page by page allocation and creating a mapping in the cache
zone. This requires some way to save a mapping of an M_EXTPG page array
in the mbuf, though. m_data is not really appropriate. The second
approach may be possible by saving the mapping in the plinks union of
the first vm_page structure of the array, but this would force a vm_page
access when freeing an mbuf.
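The rate-limiting idea, as a rough sketch (the zone, variable, and backoff
values below are stand-ins, not the actual ktls code):

#include <sys/param.h>
#include <sys/kernel.h>
#include <sys/malloc.h>
#include <vm/uma.h>

static uma_zone_t ktls_contig_zone;     /* cache of contiguous buffers */
static int ktls_contig_fail_ticks;      /* ticks value at the last failure */

static void *
ktls_contig_alloc_sketch(void)
{
    void *buf;

    /* After a failure, skip the contiguous-buffer zone for a while;
       the caller then falls back to page-at-a-time allocation. */
    if (ktls_contig_fail_ticks != 0 &&
        ticks - ktls_contig_fail_ticks < hz)
        return (NULL);

    buf = uma_zalloc(ktls_contig_zone, M_NOWAIT);
    if (buf == NULL)
        ktls_contig_fail_ticks = ticks;
    else
        ktls_contig_fail_ticks = 0;
    return (buf);
}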
Reviewed by: gallatin, jhb
Tested by: gallatin
Sponsored by: Ampere Computing
Submitted by: Klara, Inc.
Differential Revision: https://reviews.freebsd.org/D28556
The old code had an O(n) loop, where n is the size of /dev/devstat.
Multiply that by another O(n) loop in devstat_mmap for a total of
O(n^2).
This change adds DIOCGMEDIASIZE support to /dev/devstat so userland can
quickly determine the right amount of memory to map, eliminating the
O(n) loop in userland.
This change decreases the time to run "gstat -bI0.001" with 16,384 md
devices from 29.7s to 4.2s.
Also, fix a memory leak first reported as PR 203097.
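The userland side then becomes roughly this (a sketch with minimal error
handling; the function name is hypothetical, and whether libdevstat/gstat do
exactly this is not shown here):

#include <sys/types.h>
#include <sys/disk.h>
#include <sys/ioctl.h>
#include <sys/mman.h>
#include <fcntl.h>
#include <stddef.h>

static void *
map_devstat_sketch(void)
{
    off_t sz;
    int fd;

    fd = open("/dev/devstat", O_RDONLY);
    if (fd < 0 || ioctl(fd, DIOCGMEDIASIZE, &sz) != 0)
        return (NULL);
    /* Map exactly the reported size instead of probing for it with an
       O(n) loop; callers should check for MAP_FAILED. */
    return (mmap(NULL, (size_t)sz, PROT_READ, MAP_SHARED, fd, 0));
}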
Sponsored by: Axcient
Reviewed by: mav, imp
MFC after: 2 weeks
Differential Revision: https://reviews.freebsd.org/D28968
Respect filter-specific flags for the EVFILT_FS filter.
When a kevent is registered with the EVFILT_FS filter, it is always
triggered when an EVFILT_FS event occurs, regardless of the
filter-specific flags used. Fix that.
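For example (the VQ_* constants come from <sys/mount.h>; with the fix, only
the requested events trigger the kevent):

#include <sys/types.h>
#include <sys/event.h>
#include <sys/mount.h>

static int
watch_mounts_sketch(void)
{
    struct kevent kev;
    int kq;

    kq = kqueue();
    if (kq < 0)
        return (-1);
    /* Only interested in mount/unmount activity. */
    EV_SET(&kev, 0, EVFILT_FS, EV_ADD | EV_CLEAR,
        VQ_MOUNT | VQ_UNMOUNT, 0, NULL);
    (void)kevent(kq, &kev, 1, NULL, 0, NULL);
    return (kq);
}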
Reviewed by: kib
Differential Revision: https://reviews.freebsd.org/D28974
The tracker should contain exactly the path from the starting directory to
the current lookup point. Otherwise we might not detect some cases of
dotdot escape. Consequently, if we are walking up the tree by dotdot
lookup, we must remove any entries below the walked directory.
Reviewed by: markj
Tested by: arichardson, pho
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
Differential revision: https://reviews.freebsd.org/D28907
with the reasoning that the flags did not work properly and were never
shipped in a release.
O_RESOLVE_BENEATH is kept, as it is useful.
Reviewed by: markj
Tested by: arichardson, pho
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
Differential revision: https://reviews.freebsd.org/D28907
The default behavior for attaching processes to jails is that the jail's
cpuset augments that of the attaching process, so that attachment cannot be
used to escalate a user's ability to take advantage of more CPUs than the
administrator wanted them to have.
This is problematic when root needs to manage jails whose sets are disjoint
from that of the attaching process, as the attachment would otherwise fail
with EDEADLK. Therefore, if there is no appropriate common subset of
cpus/domains to form the new policy, we now allow the process to simply take
on the jail's set *if* it has the privilege to widen its mask anyway.
With the new logic, root can still usefully cpuset a process that
attaches to a jail with the desire of maintaining the set it was given
pre-attachment while still retaining the ability to manage child jails
without jumping through hoops.
A test has been added to demonstrate the issue: restricting a process's cpuset
to just the first CPU and then attempting to attach to a jail without access
to any of the same CPUs previously resulted in EDEADLK, and now results in
taking on the jail's mask for privileged users.
PR: 253724
Reviewed by: jamie (also discussed with)
MFC after: 3 days
Differential Revision: https://reviews.freebsd.org/D28952
r366302 broke copy_file_range(2) for small values of
input file offset and len.
It was possible for rem to be greater than len, in which case
"len - rem" wrapped around to a very large value, since both
variables are unsigned.
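The hazard, reduced to its essentials (illustration only, not the actual
kernel code):

#include <sys/types.h>

static size_t
remaining_sketch(size_t len, size_t rem)
{
    /* Buggy form: "len - rem" wraps to a huge value when rem > len,
       because both operands are unsigned:
       return (len - rem); */

    /* Fixed form: clamp so the result never exceeds len. */
    return (rem < len ? len - rem : 0);
}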
Reported by: koobs, Pablo <pablogsal gmail com> (Python)
Reviewed by: asomers, koobs
MFC after: 3 days
Differential Revision: https://reviews.freebsd.org/D28981
for those uses which definitely rely on membar to sync with interrupt handlers.
The libc and rtld uses of __compiler_membar() seem to want proper compiler
barriers.
The barrier in sched_unpin_lite() after the td_pinned decrement seems to be
unneeded, so it was removed instead of converted.
Reviewed by: markj
MFC after: 1 week
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D28956
Historically this was allowed for any names, though arguably it should never
even be attempted. Allow it again since there is a release pending and
allowing it is bug-compatible with previous behavior.
Reported by: otis
do_jail_attach() now only uses the PD_XXX flags that refer to lock
status, so make sure that something else like PD_KILL doesn't slip
through.
Add a KASSERT() in prison_deref() to catch any further PD_KILL misuse.
I missed updating this counter when rebasing the changes in
9c64fc4029 after the switch to
COUNTER_U64_DEFINE_EARLY in 1755b2b989.
Fixes: 9c64fc4029 Add Chacha20-Poly1305 as a KTLS cipher suite.
Sponsored by: Netflix
We were unlocking the vm object before reading the backing_object field.
In the meantime, the object could be freed and reused. This could cause
us to go off the rails in the object chain traversal, failing to unlock
the rest of the objects in the original chain and corrupting the lock
state of the victim chain.
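The shape of the fix, as an illustration (not the exact code that was changed;
the caller is assumed to hold the object's read lock):

#include <sys/param.h>
#include <sys/lock.h>
#include <sys/rwlock.h>
#include <vm/vm.h>
#include <vm/vm_object.h>

/* Returns the next object in the chain.  The important part is that
   backing_object is read while the current object is still locked;
   after the unlock, "object" may be freed and reused. */
static vm_object_t
next_object_sketch(vm_object_t object)
{
    vm_object_t backing;

    backing = object->backing_object;
    VM_OBJECT_RUNLOCK(object);
    return (backing);
}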
Reviewed by: bdrewery, kib, markj, vangyzen
MFC after: 3 days
Sponsored by: Dell EMC Isilon
Differential Revision: https://reviews.freebsd.org/D28926
Modify lock_delay() to increase the delay time after spinning,
not before. Previously we would spin at least twice instead of once.
In NetApp's benchmarks this fixes a performance regression compared
to FreeBSD 10, which called cpu_spinwait() directly.
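In outline (not the literal lock_delay() source):

#include <sys/param.h>
#include <machine/cpu.h>

/* Spin for the current backoff first, then grow it; previously the
   delay was increased before the first round of spinning as well. */
static u_int
lock_delay_sketch(u_int delay, u_int max_delay)
{
    u_int i;

    for (i = 0; i < delay; i++)
        cpu_spinwait();
    return (MIN(delay << 1, max_delay));
}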
Reviewed By: mjg
Sponsored by: NetApp, Inc.
Sponsored by: Klara, Inc.
Differential Revision: https://reviews.freebsd.org/D27331
dirtybufthresh is a watermark, slightly below the high watermark for
dirty buffers. When a delayed write is issued, the dirtying thread will
start flushing buffers if the dirtybufthresh watermark is reached. This
helps ensure that the high watermark is not reached, otherwise
performance will degrade as clustering and other optimizations are
disabled (see buf_dirty_count_severe()).
When the buffer cache was partitioned into "domains", the dirtybufthresh
threshold checks were not updated. Fix this.
Reported by: Shrikanth R Kamath <kshrikanth@juniper.net>
Reviewed by: rlibby, mckusick, kib, bdrewery
Sponsored by: Juniper Networks, Inc., Klara, Inc.
Fixes: 3cec5c77d6
MFC after: 1 week
Differential Revision: https://reviews.freebsd.org/D28901
Previously sendfile would issue a VOP_GETATTR and use the returned size,
i.e., the file size. When paging in file data, sendfile_swapin() will
use the pager to determine whether it needs to zero-fill, most often
because of a hole in a sparse file. An attempt to page in beyond the
end of a file is treated this way, and occurs when the requested page is
past the end of the pager. In other words, both the file size and pager
size were used interchangeably.
With ZFS, updates to the pager and file sizes are not synchronized by
the exclusive vnode lock, at least partially due to its use of
MNTK_SHARED_WRITES. In particular, the pager size is updated after the
file size, so in the presence of a writer concurrently extending the
file, sendfile could incorrectly instantiate "holes" in the page cache
pages backing the file, which manifests as data corruption when reading
the file back from the page cache. The on-disk copy is unaffected.
Fix this by consistently using the pager size when available.
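Conceptually the change is to take the size from the vnode's VM object when
one is available, e.g. (a sketch, not the exact sendfile code; va_size stands
for the VOP_GETATTR result):

#include <sys/param.h>
#include <sys/lock.h>
#include <sys/rwlock.h>
#include <vm/vm.h>
#include <vm/vm_object.h>

static off_t
sendfile_obj_size_sketch(vm_object_t obj, off_t va_size)
{
    off_t size;

    if (obj == NULL)
        return (va_size);   /* no pager: fall back to the attribute size */
    VM_OBJECT_RLOCK(obj);
    size = obj->un_pager.vnp.vnp_size;  /* the pager's view of the size */
    VM_OBJECT_RUNLOCK(obj);
    return (size);
}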
Reported by: dumbbell
Reviewed by: chs, kib
Tested by: dumbbell, pho
MFC after: 1 week
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D28811
Make sure PD_KILL isn't passed to do_jail_attach, where it might end
up trying to kill the caller's prison (even prison0).
Fix the child jail loop in prison_deref_kill, which was doing the
post-order part during the pre-order part. That's not a system-
killer, but it made jails not always die correctly.
The tracker flags need to be loaded only after the tracker is removed
from its per-CPU queue. Otherwise, readers may fail to synchronize with
pending writers attempting to propagate priority to active readers, and
readers and writers deadlock on each other. This was observed in a
stable/12-based armv7 kernel where the compiler had reordered the load
of rmp_flags to before the stores updating the queue.
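The problematic pattern, reduced to its essentials (the structure and helper
below are hypothetical, and this only illustrates the compiler-reordering
aspect that was observed; the real tracker and queue manipulation are more
involved):

#include <sys/cdefs.h>

struct tracker_sketch {                 /* stand-in for the real tracker */
    int rmp_flags;
};

static void remove_from_pcpu_queue(struct tracker_sketch *);    /* hypothetical */

static int
reader_unlock_sketch(struct tracker_sketch *tracker)
{
    int flags;

    remove_from_pcpu_queue(tracker);
    __compiler_membar();        /* keep the flags load below the removal */
    flags = tracker->rmp_flags;
    return (flags);
}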
Reviewed by: rlibby, scottl
Discussed with: kib
Sponsored by: Rubicon Communications, LLC ("Netgate")
MFC after: 1 week
Differential Revision: https://reviews.freebsd.org/D28821
If a jail is created with jail_set(...JAIL_DYING), and it has a parent
currently in a dying state, that will bring the parent jail back to
life. Restrict that to require that the parent itself be explicitly
brought back first, rather than implicitly revived along with the new
child jail.
Differential Revision: https://reviews.freebsd.org/D28515
Add the PD_KILL flag that instructs prison_deref() to take steps
to actively kill a prison and its descendants, namely marking it
PRISON_STATE_DYING, clearing its PR_PERSIST flag, and killing any
attached processes.
This replaces a similar loop in sys_jail_remove(), bringing the
operation under the same single hold on allprison_lock that it already
has. It is also used to clean up failed jail (re-)creations in
kern_jail_set(), which didn't generally take all the proper steps.
Differential Revision: https://reviews.freebsd.org/D28473
The caller should not be passing M_ZERO in the first place, so PG_ZERO
will not be preserved by the page allocator and clearing it accomplishes
nothing.
Reviewed by: gallatin, jhb
MFC after: 1 week
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D28808
Rather than using references (pr_ref and pr_uref) to deduce the state
of a prison, keep track of its state explicitly. A prison is either
"invalid" (pr_ref == 0), "alive" (pr_uref > 0) or "dying"
(pr_uref == 0).
State transitions are generally tied to the reference counts, but with
some flexibility: a new prison is "invalid" even though it now starts
with a reference, and jail_remove(2) sets the state to "dying" before
the user reference count drops to zero (which was previously
accomplished via the PR_REMOVE flag).
pr_state is protected by both the prison mutex and allprison_lock, so
it has the same availability guarantees as the reference counts do.
Differential Revision: https://reviews.freebsd.org/D27876
Require both the prison mutex and allprison_lock when pr_ref or
pr_uref goes to/from zero. Adding a non-first or removing a non-last
reference remains lock-free. This means that a shared hold on
allprison_lock is sufficient for prison_isalive() to be useful, which
removes a number of cases of lock/check/unlock on the prison mutex.
Expand the locking in kern_jail_set() to keep allprison_lock held
exclusive until the new prison is valid, thus making invalid prisons
invisible to any thread holding allprison_lock (except of course the
one creating or destroying the prison). This renders prison_isvalid()
nearly redundant, now used only in asserts.
Differential Revision: https://reviews.freebsd.org/D28419
The data is only needed by filesystems that
1. use buffer cache
2. utilize clustering write support.
Requested by: mjg
Reviewed by: asomers (previous version), fsu (ext2 parts), mckusick
Tested by: pho
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D28679
Change the flow of prison_deref() so it doesn't let go of allprison_lock
until it's completely done using it (except for a possible drop as part
of an upgrade on its first try).
Differential Revision: https://reviews.freebsd.org/D28458
MFC after: 3 days