Commit Graph

18219 Commits

Author SHA1 Message Date
Konstantin Belousov
44691b33cc vlrureclaim: only skip vnode with resident pages if it own the pages
Nullfs vnode which shares vm_object and pages with the lower vnode should
not be exempt from the reclaim just because lower vnode cached a lot.
Their reclamation is actually very cheap and should be preferred over
real fs vnodes, but this change is already useful.

Reported and tested by:	pho
Reviewed by:	mckusick
Sponsored by:	The FreeBSD Foundation
MFC after:	2 weeks
Differential revision:	https://reviews.freebsd.org/D29178
2021-03-12 13:31:08 +02:00
Warner Losh
e52368365d config_intrhook: provide config_intrhook_drain
config_intrhook_drain will remove the hook from the list as
config_intrhook_disestablish does if the hook hasn't been called.  If it has,
config_intrhook_drain will wait for the hook to be disestablished in the normal
course (or expedited, it's up to the driver to decide how and when
to call config_intrhook_disestablish).

This is intended for removable devices that use config_intrhook and might be
attached early in boot, but that may be removed before the kernel can call the
config_intrhook or before it ends. To prevent all races, the detach routine will
need to call config_intrhook_train.

Sponsored by:		Netflix, Inc
Reviewed by:		jhb, mav, gde (in D29006 for man page)
Differential Revision:	https://reviews.freebsd.org/D29005
2021-03-11 09:45:10 -07:00
Hans Petter Selasky
6eb60f5b7f Use the word "LinuxKPI" instead of "Linux compatibility", to not confuse with
user-space Linux compatibility support. No functional change.

MFC after:	1 week
Sponsored by:	Mellanox Technologies // NVIDIA Networking
2021-03-10 12:35:16 +01:00
Kyle Evans
1ae20f7c70 kern: malloc: fix panic on M_WAITOK during THREAD_NO_SLEEPING()
Simple condition flip; we wanted to panic here after epoch_trace_list().

Reviewed by:	glebius, markj
MFC after:	3 days
Differential Revision:	https://reviews.freebsd.org/D29125
2021-03-09 05:16:39 -06:00
Warner Losh
88a5591203 config_intrhook: Move from TAILQ to STAILQ and padding
config_intrhook doesn't need to be a two-pointer TAILQ. We rarely add/delete
from this and so those need not be optimized. Instaed, use the one-pointer
STAILQ plus a uintptr_t to be used as a flags word. This will allow these
changes to be MFC'd to 12 and 13 to fix a race in removable devices.

Feedback from: jhb
Reviewed by: mav
Differential Revision:	https://reviews.freebsd.org/D29004
2021-03-08 15:59:00 -07:00
Mark Johnston
435c7cfb24 Rename _cscan_atomic.h and _cscan_bus.h to atomic_san.h and bus_san.h
Other kernel sanitizers (KMSAN, KASAN) require interceptors as well, so
put these in a more generic place as a step towards importing the other
sanitizers.

No functional change intended.

MFC after:	1 week
Sponsored by:	The FreeBSD Foundation
Differential Revision:	https://reviews.freebsd.org/D29103
2021-03-08 12:39:06 -05:00
Mark Johnston
7995dae9d3 posix timers: Improve the overrun calculation
timer_settime(2) may be used to configure a timeout in the past.  If
the timer is also periodic, we also try to compute the number of timer
overruns that occurred between the initial timeout and the time at which
the timer fired.  This is done in a loop which iterates once per period
between the initial timeout and now.  If the period is small and the
initial timeout was a long time ago, this loop can take forever to run,
so the system is effectively DOSed.

Replace the loop with a more direct calculation of
(now - initial timeout) / period to compute the number of overruns.

Reported by:	syzkaller
Reviewed by:	kib
MFC after:	1 week
Sponsored by:	The FreeBSD Foundation
Differential Revision:	https://reviews.freebsd.org/D29093
2021-03-08 12:39:06 -05:00
Mark Johnston
60d12ef952 posix timers: Sprinkle some style fixes
MFC after:	1 week
Sponsored by:	The FreeBSD Foundation
2021-03-08 12:39:06 -05:00
Mark Johnston
8ff2b41c05 posix timers: Declare unexported functions as static
MFC after:	1 week
Sponsored by:	The FreeBSD Foundation
2021-03-08 12:39:06 -05:00
Konstantin Belousov
56b9bee63a Make kern.timecounter.hardware tunable
Noted and reviewed by:	kevans
MFC after:	1 week
Sponsored by:	The FreeBSD Foundation
Differential revision:	https://reviews.freebsd.org/D29122
2021-03-08 03:48:21 +02:00
Hans Petter Selasky
c743a6bd4f Implement mallocarray_domainset(9) variant of mallocarray(9).
Reviewed by:	kib @
MFC after:	1 week
Sponsored by:	Mellanox Technologies // NVIDIA Networking
2021-03-06 11:38:55 +01:00
Konstantin Belousov
ead7697f04 Restore AT_RESOLVE_BENEATH support for funlinkat(2)/unlinkat(2).
MFC after:	1 week
Sponsored by:	The FreeBSD Foundation
2021-03-06 07:24:18 +02:00
Mark Johnston
89b650872b ktls: Hide initialization message behind bootverbose
We don't typically print anything when a subsystem initializes itself,
and KTLS is currently disabled by default anyway.

Reviewed by:	jhb
MFC after:	1 week
Sponsored by:	The FreeBSD Foundation
Differential Revision:	https://reviews.freebsd.org/D29097
2021-03-05 13:11:02 -05:00
Mark Johnston
5e6989ba4f link_elf_obj: Handle init_array sections in KLDs
Reuse existing handling for .ctors, print a warning if multiple
constructor sections are present.   Destructors are not handled as of
yet.

This is required for KASAN.

Reviewed by:	kib
MFC after:	2 weeks
Sponsored by:	The FreeBSD Foundation
Differential Revision:	https://reviews.freebsd.org/D29049
2021-03-04 10:07:10 -05:00
Mark Johnston
49f6925ca3 ktls: Cache output buffers for software encryption
Maintain a cache of physically contiguous runs of pages for use as
output buffers when software encryption is configured and in-place
encryption is not possible.  This makes allocation and free cheaper
since in the common case we avoid touching the vm_page structures for
the buffer, and fewer calls into UMA are needed.  gallatin@ reports a
~10% absolute decrease in CPU usage with sendfile/KTLS on a Xeon after
this change.

It is possible that we will not be able to allocate these buffers if
physical memory is fragmented.  To avoid frequently calling into the
physical memory allocator in this scenario, rate-limit allocation
attempts after a failure.  In the failure case we fall back to the old
behaviour of allocating a page at a time.

N.B.: this scheme could be simplified, either by simply using malloc()
and looking up the PAs of the pages backing the buffer, or by falling
back to page by page allocation and creating a mapping in the cache
zone.  This requires some way to save a mapping of an M_EXTPG page array
in the mbuf, though.  m_data is not really appropriate.  The second
approach may be possible by saving the mapping in the plinks union of
the first vm_page structure of the array, but this would force a vm_page
access when freeing an mbuf.

Reviewed by:	gallatin, jhb
Tested by:	gallatin
Sponsored by:	Ampere Computing
Submitted by:	Klara, Inc.
Differential Revision:	https://reviews.freebsd.org/D28556
2021-03-03 17:34:01 -05:00
Alan Somers
ab63da3564 Speed up geom_stats_resync in the presence of many devices
The old code had a O(n) loop, where n is the size of /dev/devstat.
Multiply that by another O(n) loop in devstat_mmap for a total of
O(n^2).

This change adds DIOCGMEDIASIZE support to /dev/devstat so userland can
quickly determine the right amount of memory to map, eliminating the
O(n) loop in userland.

This change decreases the time to run "gstat -bI0.001" with 16,384 md
devices from 29.7s to 4.2s.

Also, fix a memory leak first reported as PR 203097.

Sponsored by:	Axcient
Reviewed by:	mav, imp
MFC after:	2 weeks
Differential Revision:	https://reviews.freebsd.org/D28968
2021-03-02 18:33:45 -07:00
Robert Wing
0cc746f193 filt_fsevent: only record interested events
Respect filter-specific flags for the EVFILT_FS filter.

When a kevent is registered with the EVFILT_FS filter, it is always
triggered when an EVFILT_FS event occurs, regardless of the
filter-specific flags used. Fix that.

Reviewed by:	kib
Differential Revision:	https://reviews.freebsd.org/D28974
2021-03-02 14:19:22 -09:00
Konstantin Belousov
28cd3a673e O_RELATIVE_BENEATH: return ENOTCAPABLE instead of EINVAL for abs path
Requested and reviewed by:	markj
Tested by:	arichardson,  pho
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
Differential Revision:	https://reviews.freebsd.org/D28907
2021-03-02 20:21:40 +02:00
Konstantin Belousov
49c98a4bf3 nameicap_check_dotdot: trim tracker on check
Tracker should contain exactly the path from the starting directory to
the current lookup point. Otherwise we might not detect some cases of
dotdot escape. Consequently, if we are walking up the tree by dotdot
lookup, we must remove an entries below the walked directory.

Reviewed by:	markj
Tested by:	arichardson, pho
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
Differential revision:	https://reviews.freebsd.org/D28907
2021-03-02 20:21:35 +02:00
Konstantin Belousov
e8a2862aa0 Add nameicap_cleanup_from(), to clean tracker list starting from some element
Reviewed by:	markj
Tested by:	arichardson, pho
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
Differential revision:	https://reviews.freebsd.org/D28907
2021-03-02 20:21:30 +02:00
Konstantin Belousov
2388ad7c29 nameicap_tracker_add: avoid duplicates in the tracker list
Reviewed by:	markj
Tested by:	arichardson, pho
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
Differential revision:	https://reviews.freebsd.org/D28907
2021-03-02 20:21:23 +02:00
Konstantin Belousov
59e7494281 Do not call nameicap_tracker_add() for dotdot case.
Reviewed by:	markj
Tested by:	arichardson, pho
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
Differential revision:	https://reviews.freebsd.org/D28907
2021-03-02 20:21:14 +02:00
Konstantin Belousov
20e91ca36a open(2): Remove O_BENEATH and AT_BENEATH
with the reasoning that the flags did not worked properly, and were not
shipped in a release.

O_RESOLVE_BENEATH is kept as useful.

Reviewed by:	markj
Tested by:	arichardson, pho
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
Differential revision:	https://reviews.freebsd.org/D28907
2021-03-02 20:16:55 +02:00
Kyle Evans
60c4ec806d jail: allow root to implicitly widen its cpuset to attach
The default behavior for attaching processes to jails is that the jail's
cpuset augments the attaching processes, so that it cannot be used to
escalate a user's ability to take advantage of more CPUs than the
administrator wanted them to.

This is problematic when root needs to manage jails that have disjoint
sets with whatever process is attaching, as this would otherwise result
in a deadlock. Therefore, if we did not have an appropriate common
subset of cpus/domains for our new policy, we now allow the process to
simply take on the jail set *if* it has the privilege to widen its mask
anyways.

With the new logic, root can still usefully cpuset a process that
attaches to a jail with the desire of maintaining the set it was given
pre-attachment while still retaining the ability to manage child jails
without jumping through hoops.

A test has been added to demonstrate the issue; cpuset of a process
down to just the first CPU and attempting to attach to a jail without
access to any of the same CPUs previously resulted in EDEADLK and now
results in taking on the jail's mask for privileged users.

PR:		253724
Reviewed by:	jamie (also discussed with)
MFC after:	3 days
Differential Revision:	https://reviews.freebsd.org/D28952
2021-03-01 12:38:31 -06:00
Rick Macklem
a5f9fe2bab copy_file_range(2): Fix for small values of input file offset and len
r366302 broke copy_file_range(2) for small values of
input file offset and len.

It was possible for rem to be greater than len and then
"len - rem" was a large value, since both variables are
unsigned.

Reported by: koobs, Pablo <pablogsal gmail com> (Python)
Reviewed by:	asomers, koobs
MFC after:	3 days
Differential Revision:	https://reviews.freebsd.org/D28981
2021-03-01 06:31:10 -08:00
Konstantin Belousov
b5449c92b4 Use atomic_interrupt_fence() instead of bare __compiler_membar()
for the which which definitely use membar to sync with interrupt handlers.

libc and rtld uses of __compiler_membar() seems to want compiler barriers
proper.

The barrier in sched_unpin_lite() after td_pinned decrement seems to be not
needed and removed, instead of convertion.

Reviewed by:	markj
MFC after:	1 week
Sponsored by:	The FreeBSD Foundation
Differential Revision:	https://reviews.freebsd.org/D28956
2021-02-28 01:27:29 +02:00
Mateusz Guzik
1239a72221 cache: temporarily drop the assert that dvp != vp when adding an entry
Historically it was allowed for any names, but arguably should never be
even attempted. Allow it again since there is a release pending and
allowing it is bug-compatible with previous behavior.

Reported by:	otis
2021-02-27 22:29:50 +00:00
Jamie Gritton
589e4c1df4 jail: Add safety around prison_deref() flags.
do_jail_attach() now only uses the PD_XXX flags that refer to lock
status, so make sure that something else like PD_KILL doesn't slip
through.

Add a KASSERT() in prison_deref() to catch any further PD_KILL misuse.
2021-02-25 20:10:42 -08:00
Jamie Gritton
108a9384e9 jail: Fix locking on an early jail_set error.
I had locked allprison_lock without immediately setting PD_LIST_LOCKED.
2021-02-25 19:52:58 -08:00
John Baldwin
90972f0402 ktls: Use COUNTER_U64_DEFINE_EARLY for the ktls_toe_chacha20 counter.
I missed updating this counter when rebasing the changes in
9c64fc4029 after the switch to
COUNTER_U64_DEFINE_EARLY in 1755b2b989.

Fixes:		9c64fc4029 Add Chacha20-Poly1305 as a KTLS cipher suite.
Sponsored by:	Netflix
2021-02-25 15:00:13 -08:00
Ryan Libby
d7671ad8d6 Close races in vm object chain traversal for unlock
We were unlocking the vm object before reading the backing_object field.
In the meantime, the object could be freed and reused.  This could cause
us to go off the rails in the object chain traversal, failing to unlock
the rest of the objects in the original chain and corrupting the lock
state of the victim chain.

Reviewed by:	bdrewery, kib, markj, vangyzen
MFC after:	3 days
Sponsored by:	Dell EMC Isilon
Differential Revision:	https://reviews.freebsd.org/D28926
2021-02-25 12:11:19 -08:00
Edward Tomasz Napierala
e7a5b3bd05 Modify lock_delay() to increase the delay time after spinning
Modify lock_delay() to increase the delay time after spinning,
not before.  Previously we would spin at least twice instead of once.
In NetApp's benchmarks this fixes a performance regression compared
to FreeBSD 10, which called cpu_spinwait() directly.

Reviewed By:	mjg
Sponsored by:	NetApp, Inc.
Sponsored by:	Klara, Inc.
Differential Revision:	https://reviews.freebsd.org/D27331
2021-02-25 18:55:26 +00:00
Mark Johnston
369706a6f8 buf: Fix the dirtybufthresh check
dirtybufthresh is a watermark, slightly below the high watermark for
dirty buffers.  When a delayed write is issued, the dirtying thread will
start flushing buffers if the dirtybufthresh watermark is reached.  This
helps ensure that the high watermark is not reached, otherwise
performance will degrade as clustering and other optimizations are
disabled (see buf_dirty_count_severe()).

When the buffer cache was partitioned into "domains", the dirtybufthresh
threshold checks were not updated.  Fix this.

Reported by:	Shrikanth R Kamath <kshrikanth@juniper.net>
Reviewed by:	rlibby, mckusick, kib, bdrewery
Sponsored by:	Juniper Networks, Inc., Klara, Inc.
Fixes:		3cec5c77d6
MFC after:	1 week
Differential Revision:	https://reviews.freebsd.org/D28901
2021-02-25 10:04:44 -05:00
Mark Johnston
faa998f6ff sendfile: Use the pager size to determine the file extent when possible
Previously sendfile would issue a VOP_GETATTR and use the returned size,
i.e., the file size.  When paging in file data, sendfile_swapin() will
use the pager to determine whether it needs to zero-fill, most often
because of a hole in a sparse file.  An attempt to page in beyond the
end of a file is treated this way, and occurs when the requested page is
past the end of the pager.  In other words, both the file size and pager
size were used interchangeably.

With ZFS, updates to the pager and file sizes are not synchronized by
the exclusive vnode lock, at least partially due to its use of
MNTK_SHARED_WRITES.  In particular, the pager size is updated after the
file size, so in the presence of a writer concurrently extending the
file, sendfile could incorrectly instantiate "holes" in the page cache
pages backing the file, which manifests as data corruption when reading
the file back from the page cache.  The on-disk copy is unaffected.

Fix this by consistently using the pager size when available.

Reported by:	dumbbell
Reviewed by:	chs, kib
Tested by:	dumbbell, pho
MFC after:	1 week
Sponsored by:	The FreeBSD Foundation
Differential Revision:	https://reviews.freebsd.org/D28811
2021-02-25 10:04:44 -05:00
Jamie Gritton
c861373bdf jail: re-commit 811e27fa3c with fixes
Make sure PD_KILL isn't passed to do_jail_attach, where it might end
up trying to kill the caller's prison (even prison0).

Fix the child jail loop in prison_deref_kill, which was doing the
post-order part during the pre-order part.  That's not a system-
killer, but make jails not always die correctly.
2021-02-24 21:54:49 -08:00
Jamie Gritton
ddfffb41a2 jail: back out 811e27fa3c until it doesn't break Jenkins
Reported by:	arichardson
2021-02-24 21:10:47 -08:00
Mark Johnston
1d44514fcd rmlock: Add a required compiler membar to the rlock slow path
The tracker flags need to be loaded only after the tracker is removed
from its per-CPU queue.  Otherwise, readers may fail to synchronize with
pending writers attempting to propagate priority to active readers, and
readers and writers deadlock on each other.  This was observed in a
stable/12-based armv7 kernel where the compiler had reordered the load
of rmp_flags to before the stores updating the queue.

Reviewed by:	rlibby, scottl
Discussed with:	kib
Sponsored by:	Rubicon Communications, LLC ("Netgate")
MFC after:	1 week
Differential Revision:	https://reviews.freebsd.org/D28821
2021-02-23 21:17:12 -05:00
Alex Richardson
fa32350347 close_range: add audit support
This fixes the closefrom test in sys/audit.

Includes cherry-picks of the following commits from openbsm:

4dfc628aaf
99ff6fe32a
da48a0399e

Reviewed By:	kevans
Differential Revision: https://reviews.freebsd.org/D28388
2021-02-23 17:47:07 +00:00
Jamie Gritton
0a2a96f35a jail: Don't allow jails under dying parents
If a jail is created with jail_set(...JAIL_DYING), and it has a parent
currently in a dying state, that will bring the parent jail back to
life.  Restrict that to require that the parent itself be explicitly
brought back first, and not implicitly created along with the new
child jail.

Differential Revision:	https://reviews.freebsd.org/D28515
2021-02-22 17:04:06 -08:00
Jamie Gritton
701d6b50ae jail: Fix a LOR introduced in 1158508a80 2021-02-22 15:51:10 -08:00
Jamie Gritton
811e27fa3c jail: Add PD_KILL to remove a prison in prison_deref().
Add the PD_KILL flag that instructs prison_deref() to take steps
to actively kill a prison and its descendents, namely marking it
PRISON_STATE_DYING, clearing its PR_PERSIST flag, and killing any
attached processes.

This replaces a similar loop in sys_jail_remove(), bringing the
operation under the same single hold on allprison_lock that it already
has. It is also used to clean up failed jail (re-)creations in
kern_jail_set(), which didn't generally take all the proper steps.

Differential Revision:  https://reviews.freebsd.org/D28473
2021-02-22 12:27:44 -08:00
Mark Johnston
608c44f96e m_uiotombuf_nomap(): Stop clearing PG_ZERO in newly allocated pages
The caller should not be passing M_ZERO in the first place, so PG_ZERO
will not be preserved by the page allocator and clearing it accomplishes
nothing.

Reviewed by:	gallatin, jhb
MFC after:	1 week
Sponsored by:	The FreeBSD Foundation
Differential Revision:	https://reviews.freebsd.org/D28808
2021-02-22 10:04:46 -05:00
Jamie Gritton
1158508a80 jail: Add pr_state to struct prison
Rather that using references (pr_ref and pr_uref) to deduce the state
of a prison, keep track of its state explicitly.  A prison is either
"invalid" (pr_ref == 0), "alive" (pr_uref > 0) or "dying"
(pr_uref == 0).

State transitions are generally tied to the reference counts, but with
some flexibility: a new prison is "invalid" even though it now starts
with a reference, and jail_remove(2) sets the state to "dying" before
the user reference count drops to zero (which was prviously
accomplished via the PR_REMOVE flag).

pr_state is protected by both the prison mutex and allprison_lock, so
it has the same availablity guarantees as the reference counts do.

Differential Revision:	https://reviews.freebsd.org/D27876
2021-02-21 13:24:47 -08:00
Mateusz Guzik
ee9b37ae5c jail: fix build after the previous commit
Noted by: Michael Butler <imb protected-networks.net>
2021-02-21 21:05:25 +00:00
Jamie Gritton
f7496dcab0 jail: Change the locking around pr_ref and pr_uref
Require both the prison mutex and allprison_lock when pr_ref or
pr_uref go to/from zero.  Adding a non-first or removing a non-last
reference remain lock-free.  This means that a shared hold on
allprison_lock is sufficient for prison_isalive() to be useful, which
removes a number of cases of lock/check/unlock on the prison mutex.

Expand the locking in kern_jail_set() to keep allprison_lock held
exclusive until the new prison is valid, thus making invalid prisons
invisible to any thread holding allprison_lock (except of course the
one creating or destroying the prison).  This renders prison_isvalid()
nearly redundant, now used only in asserts.

Differential Revision:	https://reviews.freebsd.org/D28419
Differential Revision:	https://reviews.freebsd.org/D28458
2021-02-21 10:55:44 -08:00
Konstantin Belousov
2bfd8992c7 vnode: move write cluster support data to inodes.
The data is only needed by filesystems that
1. use buffer cache
2. utilize clustering write support.

Requested by:	mjg
Reviewed by:	asomers (previous version), fsu (ext2 parts), mckusick
Tested by:	pho
Sponsored by:	The FreeBSD Foundation
Differential Revision:	https://reviews.freebsd.org/D28679
2021-02-21 11:38:21 +02:00
Konstantin Belousov
750ea20d3f Delete dead CLUSTERDEBUG config option.
Reviewed by:	mckusick
Tested by:	pho
MFC after:	1 week
Sponsored by:	The FreeBSD Foundation
Differential Revision:	https://reviews.freebsd.org/D28679
2021-02-21 11:38:21 +02:00
Mateusz Guzik
81174cd8e2 vfs: employ vfs_ref_from_vp in statfs and fstatfs
Avoids locking and unlocking the vnode.

Reviewed by:	kib
Differential Revision:	https://reviews.freebsd.org/D28695
2021-02-21 00:43:05 +00:00
Mateusz Guzik
a15f787adb vfs: add vfs_ref_from_vp
This generalizes what vop_stdgetwritemount used to be doing.

Reviewed by:	kib
Differential Revision:	https://reviews.freebsd.org/D28695
2021-02-21 00:43:05 +00:00
Jamie Gritton
6e1d1bfcac jail: Improve locking when removing prisons
Change the flow of prison_deref() so it doesn't let go of allprison_lock
until it's completely done using it (except for a possible drop as part
of an upgrade on its first try).

Differential Revision:	https://reviews.freebsd.org/D28458
MFC after:	3 days
2021-02-20 14:38:58 -08:00