with the reasoning that the flags did not worked properly, and were not
shipped in a release.
O_RESOLVE_BENEATH is kept as useful.
Reviewed by: markj
Tested by: arichardson, pho
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
Differential revision: https://reviews.freebsd.org/D28907
The default behavior for attaching processes to jails is that the jail's
cpuset augments the attaching processes, so that it cannot be used to
escalate a user's ability to take advantage of more CPUs than the
administrator wanted them to.
This is problematic when root needs to manage jails that have disjoint
sets with whatever process is attaching, as this would otherwise result
in a deadlock. Therefore, if we did not have an appropriate common
subset of cpus/domains for our new policy, we now allow the process to
simply take on the jail set *if* it has the privilege to widen its mask
anyways.
With the new logic, root can still usefully cpuset a process that
attaches to a jail with the desire of maintaining the set it was given
pre-attachment while still retaining the ability to manage child jails
without jumping through hoops.
A test has been added to demonstrate the issue; cpuset of a process
down to just the first CPU and attempting to attach to a jail without
access to any of the same CPUs previously resulted in EDEADLK and now
results in taking on the jail's mask for privileged users.
PR: 253724
Reviewed by: jamie (also discussed with)
MFC after: 3 days
Differential Revision: https://reviews.freebsd.org/D28952
r366302 broke copy_file_range(2) for small values of
input file offset and len.
It was possible for rem to be greater than len and then
"len - rem" was a large value, since both variables are
unsigned.
Reported by: koobs, Pablo <pablogsal gmail com> (Python)
Reviewed by: asomers, koobs
MFC after: 3 days
Differential Revision: https://reviews.freebsd.org/D28981
for the which which definitely use membar to sync with interrupt handlers.
libc and rtld uses of __compiler_membar() seems to want compiler barriers
proper.
The barrier in sched_unpin_lite() after td_pinned decrement seems to be not
needed and removed, instead of convertion.
Reviewed by: markj
MFC after: 1 week
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D28956
Historically it was allowed for any names, but arguably should never be
even attempted. Allow it again since there is a release pending and
allowing it is bug-compatible with previous behavior.
Reported by: otis
do_jail_attach() now only uses the PD_XXX flags that refer to lock
status, so make sure that something else like PD_KILL doesn't slip
through.
Add a KASSERT() in prison_deref() to catch any further PD_KILL misuse.
I missed updating this counter when rebasing the changes in
9c64fc4029 after the switch to
COUNTER_U64_DEFINE_EARLY in 1755b2b989.
Fixes: 9c64fc4029 Add Chacha20-Poly1305 as a KTLS cipher suite.
Sponsored by: Netflix
We were unlocking the vm object before reading the backing_object field.
In the meantime, the object could be freed and reused. This could cause
us to go off the rails in the object chain traversal, failing to unlock
the rest of the objects in the original chain and corrupting the lock
state of the victim chain.
Reviewed by: bdrewery, kib, markj, vangyzen
MFC after: 3 days
Sponsored by: Dell EMC Isilon
Differential Revision: https://reviews.freebsd.org/D28926
Modify lock_delay() to increase the delay time after spinning,
not before. Previously we would spin at least twice instead of once.
In NetApp's benchmarks this fixes a performance regression compared
to FreeBSD 10, which called cpu_spinwait() directly.
Reviewed By: mjg
Sponsored by: NetApp, Inc.
Sponsored by: Klara, Inc.
Differential Revision: https://reviews.freebsd.org/D27331
dirtybufthresh is a watermark, slightly below the high watermark for
dirty buffers. When a delayed write is issued, the dirtying thread will
start flushing buffers if the dirtybufthresh watermark is reached. This
helps ensure that the high watermark is not reached, otherwise
performance will degrade as clustering and other optimizations are
disabled (see buf_dirty_count_severe()).
When the buffer cache was partitioned into "domains", the dirtybufthresh
threshold checks were not updated. Fix this.
Reported by: Shrikanth R Kamath <kshrikanth@juniper.net>
Reviewed by: rlibby, mckusick, kib, bdrewery
Sponsored by: Juniper Networks, Inc., Klara, Inc.
Fixes: 3cec5c77d6
MFC after: 1 week
Differential Revision: https://reviews.freebsd.org/D28901
Previously sendfile would issue a VOP_GETATTR and use the returned size,
i.e., the file size. When paging in file data, sendfile_swapin() will
use the pager to determine whether it needs to zero-fill, most often
because of a hole in a sparse file. An attempt to page in beyond the
end of a file is treated this way, and occurs when the requested page is
past the end of the pager. In other words, both the file size and pager
size were used interchangeably.
With ZFS, updates to the pager and file sizes are not synchronized by
the exclusive vnode lock, at least partially due to its use of
MNTK_SHARED_WRITES. In particular, the pager size is updated after the
file size, so in the presence of a writer concurrently extending the
file, sendfile could incorrectly instantiate "holes" in the page cache
pages backing the file, which manifests as data corruption when reading
the file back from the page cache. The on-disk copy is unaffected.
Fix this by consistently using the pager size when available.
Reported by: dumbbell
Reviewed by: chs, kib
Tested by: dumbbell, pho
MFC after: 1 week
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D28811
Make sure PD_KILL isn't passed to do_jail_attach, where it might end
up trying to kill the caller's prison (even prison0).
Fix the child jail loop in prison_deref_kill, which was doing the
post-order part during the pre-order part. That's not a system-
killer, but make jails not always die correctly.
The tracker flags need to be loaded only after the tracker is removed
from its per-CPU queue. Otherwise, readers may fail to synchronize with
pending writers attempting to propagate priority to active readers, and
readers and writers deadlock on each other. This was observed in a
stable/12-based armv7 kernel where the compiler had reordered the load
of rmp_flags to before the stores updating the queue.
Reviewed by: rlibby, scottl
Discussed with: kib
Sponsored by: Rubicon Communications, LLC ("Netgate")
MFC after: 1 week
Differential Revision: https://reviews.freebsd.org/D28821
If a jail is created with jail_set(...JAIL_DYING), and it has a parent
currently in a dying state, that will bring the parent jail back to
life. Restrict that to require that the parent itself be explicitly
brought back first, and not implicitly created along with the new
child jail.
Differential Revision: https://reviews.freebsd.org/D28515
Add the PD_KILL flag that instructs prison_deref() to take steps
to actively kill a prison and its descendents, namely marking it
PRISON_STATE_DYING, clearing its PR_PERSIST flag, and killing any
attached processes.
This replaces a similar loop in sys_jail_remove(), bringing the
operation under the same single hold on allprison_lock that it already
has. It is also used to clean up failed jail (re-)creations in
kern_jail_set(), which didn't generally take all the proper steps.
Differential Revision: https://reviews.freebsd.org/D28473
The caller should not be passing M_ZERO in the first place, so PG_ZERO
will not be preserved by the page allocator and clearing it accomplishes
nothing.
Reviewed by: gallatin, jhb
MFC after: 1 week
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D28808
Rather that using references (pr_ref and pr_uref) to deduce the state
of a prison, keep track of its state explicitly. A prison is either
"invalid" (pr_ref == 0), "alive" (pr_uref > 0) or "dying"
(pr_uref == 0).
State transitions are generally tied to the reference counts, but with
some flexibility: a new prison is "invalid" even though it now starts
with a reference, and jail_remove(2) sets the state to "dying" before
the user reference count drops to zero (which was prviously
accomplished via the PR_REMOVE flag).
pr_state is protected by both the prison mutex and allprison_lock, so
it has the same availablity guarantees as the reference counts do.
Differential Revision: https://reviews.freebsd.org/D27876
Require both the prison mutex and allprison_lock when pr_ref or
pr_uref go to/from zero. Adding a non-first or removing a non-last
reference remain lock-free. This means that a shared hold on
allprison_lock is sufficient for prison_isalive() to be useful, which
removes a number of cases of lock/check/unlock on the prison mutex.
Expand the locking in kern_jail_set() to keep allprison_lock held
exclusive until the new prison is valid, thus making invalid prisons
invisible to any thread holding allprison_lock (except of course the
one creating or destroying the prison). This renders prison_isvalid()
nearly redundant, now used only in asserts.
Differential Revision: https://reviews.freebsd.org/D28419
Differential Revision: https://reviews.freebsd.org/D28458
The data is only needed by filesystems that
1. use buffer cache
2. utilize clustering write support.
Requested by: mjg
Reviewed by: asomers (previous version), fsu (ext2 parts), mckusick
Tested by: pho
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D28679
Change the flow of prison_deref() so it doesn't let go of allprison_lock
until it's completely done using it (except for a possible drop as part
of an upgrade on its first try).
Differential Revision: https://reviews.freebsd.org/D28458
MFC after: 3 days
jail_attach(2) performs an internal chroot operation, leaving it up to
the calling process to assure the working directory is inside the jail.
Add a matching internal chdir operation to the jail's root. Also
ignore kern.chroot_allow_open_directories, and always disallow the
operation if there are any directory descriptors open.
Reported by: mjg
Approved by: markj, kib
MFC after: 3 days
Chacha20-Poly1305 for TLS is an AEAD cipher suite for both TLS 1.2 and
TLS 1.3 (RFCs 7905 and 8446). For both versions, Chacha20 uses the
server and client IVs as implicit nonces xored with the record
sequence number to generate the per-record nonce matching the
construction used with AES-GCM for TLS 1.3.
Reviewed by: gallatin
Sponsored by: Netflix
Differential Revision: https://reviews.freebsd.org/D27839
The output has been broken since 1b6dd6d772. Casting to uintmax_t
before the call to printf is necessary to ensure that 32-bit addresses
are interpreted correctly.
PR: 243236
MFC after: 3 days
KCSAN complains about racy accesses in the locking code. Those races are
fine since they are inside a TD_SET_RUNNING() loop that expects the value
to be changed by another CPU.
Use relaxed atomic stores/loads to indicate that this variable can be
written/read by multiple CPUs at the same time. This will also prevent
the compiler from doing unexpected re-ordering.
Reported by: GENERIC-KCSAN
Test Plan: KCSAN no longer complains, kernel still runs fine.
Reviewed By: markj, mjg (earlier version)
Differential Revision: https://reviews.freebsd.org/D28569
or EX_SHLOCK. Do it by setting a vnode iflag indicating that the locking
exclusive open is in progress, and not allowing F_LOCK request to make
a progress until the first open finishes.
Requested by: mckusick
Reviewed by: markj, mckusick
Tested by: pho
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
Differential Revision: https://reviews.freebsd.org/D28697
jail_remove(2) includes a loop that sends SIGKILL to all processes
in a jail, but skips processes in PRS_NEW state. Thus it is possible
the a process in mid-fork(2) during jail removal can survive the jail
being removed.
Add a prison flag PR_REMOVE, which is checked before the new process
returns. If the jail is being removed, the process will then exit.
Also check this flag in jail_attach(2) which has a similar issue.
Reported by: trasz
Approved by: kib
MFC after: 3 days
If uio_offset is past end of the object size, calculated resid is negative.
Delegate handling this case to the locked read, as any other non-trivial
situation.
PR: 253158
Reported by: Harald Schmalzbauer <bugzilla.freebsd@omnilan.de>
Tested by: cy
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
When locating the anonymous memory region for a vm_map with ASLR
enabled, we try to keep the slid base address aligned on a superpage
boundary to minimize pagetable fragmentation and maximize the potential
usage of superpage mappings. We can't (portably) do this if superpages
have been disabled by loader tunable and pagesizes[1] is 0, and it
would be less beneficial in that case anyway.
PR: 253511
Reported by: johannes@jo-t.de
MFC after: 1 week
Reviewed by: kib
Differential Revision: https://reviews.freebsd.org/D28678
Currently the struct has a 4 byte padding stemming from 3 ints.
1. prio comfortably fits in short, unfortunately there is no dedicated
type for it and plumbing it throughout the codebase is not worth it
right now, instead an assert is added which covers also flags for
safety
2. lk_exslpfail can in principle exceed u_short, but the count is
already not considered reliable and it only ever gets modified
straight to 0. In other words it can be incrementing with an upper
bound of USHRT_MAX
With these in place struct lock shrinks from 48 to 40 bytes.
Reviewed by: kib (previous version)
Differential Revision: https://reviews.freebsd.org/D28680
In particular, replace a note that reload through vget() is obsoleted,
with explanation why this code is required.
Reviewed by: chs, mckusick
Tested by: pho
MFC after: 2 weeks
Sponsored by: The FreeBSD Foundation
When possible, relock the vnode and retry inactivation. Only vunref() is
required not to drop the vnode lock, so handle it specially by not retrying.
This is a part of the efforts to ensure that unlinked not referenced vnode
does not prevent inode from reusing.
Reviewed by: chs, mckusick
Tested by: pho
MFC after: 2 weeks
Sponsored by: The FreeBSD Foundation
The current list is limited to the cases where UFS needs to handle
vput(dvp) specially. Which means VOP_CREATE(), VOP_MKDIR(), VOP_MKNOD(),
VOP_LINK(), and VOP_SYMLINK().
Reviewed by: chs, mkcusick
Tested by: pho
MFC after: 2 weeks
Sponsored by: The FreeBSD Foundation
The VOP is intended to be used in situations where VFS has two
referenced locked vnodes, typically a directory vnode dvp and a vnode
vp that is linked from the directory, and at least dvp is vput(9)ed.
The child vnode can be also vput-ed, but optionally left referenced and
locked.
There, at least UFS may need to do some actions with dvp which cannot be
done while vp is also locked, so its lock might be dropped temporary.
For instance, in some cases UFS needs to sync dvp to avoid filesystem
state that is currently not handled by either kernel nor fsck. Having
such VOP provides the neccessary context for filesystem which can do
correct locking and handle potential reclamation of vp after relock.
Trivial implementation does vput(dvp) and optionally vput(vp).
Reviewed by: chs, mckusick
Tested by: pho
MFC after: 2 weeks
Sponsored by: The FreeBSD Foundation
Most future operations on the returned file descriptor will fail
anyway, and application should be ready to handle that failures. Not
forcing it to understand the transient failure mode on open, which is
implementation-specific, should make us less special without loss of
reporting of errors.
Suggested by: chs
Reviewed by: chs, mckusick
Tested by: pho
MFC after: 2 weeks
Sponsored by: The FreeBSD Foundation
and only call buf_complete() if previously started. Some error paths,
like CoW failire, might skip buf_start() and do bufdone(), which itself
call buf_complete().
Various SU handle_written_XXX() functions check that io was started
and incomplete parts of the buffer data reverted before restoring them.
This is a useful invariant that B_IO_STARTED on buffer layer allows to
keep instead of changing check and panic into check and return.
Reported by: pho
Reviewed by: chs, mckusick
Tested by: pho
MFC after: 2 weeks
Sponsored by: The FreeBSD Foundations