Commit Graph

18181 Commits

Author SHA1 Message Date
Jamie Gritton
0a2a96f35a jail: Don't allow jails under dying parents
If a jail is created with jail_set(...JAIL_DYING), and it has a parent
currently in a dying state, that will bring the parent jail back to
life.  Restrict that to require that the parent itself be explicitly
brought back first, and not implicitly created along with the new
child jail.

Differential Revision:	https://reviews.freebsd.org/D28515
2021-02-22 17:04:06 -08:00
Jamie Gritton
701d6b50ae jail: Fix a LOR introduced in 1158508a80 2021-02-22 15:51:10 -08:00
Jamie Gritton
811e27fa3c jail: Add PD_KILL to remove a prison in prison_deref().
Add the PD_KILL flag that instructs prison_deref() to take steps
to actively kill a prison and its descendents, namely marking it
PRISON_STATE_DYING, clearing its PR_PERSIST flag, and killing any
attached processes.

This replaces a similar loop in sys_jail_remove(), bringing the
operation under the same single hold on allprison_lock that it already
has. It is also used to clean up failed jail (re-)creations in
kern_jail_set(), which didn't generally take all the proper steps.

Differential Revision:  https://reviews.freebsd.org/D28473
2021-02-22 12:27:44 -08:00
Mark Johnston
608c44f96e m_uiotombuf_nomap(): Stop clearing PG_ZERO in newly allocated pages
The caller should not be passing M_ZERO in the first place, so PG_ZERO
will not be preserved by the page allocator and clearing it accomplishes
nothing.

Reviewed by:	gallatin, jhb
MFC after:	1 week
Sponsored by:	The FreeBSD Foundation
Differential Revision:	https://reviews.freebsd.org/D28808
2021-02-22 10:04:46 -05:00
Jamie Gritton
1158508a80 jail: Add pr_state to struct prison
Rather that using references (pr_ref and pr_uref) to deduce the state
of a prison, keep track of its state explicitly.  A prison is either
"invalid" (pr_ref == 0), "alive" (pr_uref > 0) or "dying"
(pr_uref == 0).

State transitions are generally tied to the reference counts, but with
some flexibility: a new prison is "invalid" even though it now starts
with a reference, and jail_remove(2) sets the state to "dying" before
the user reference count drops to zero (which was prviously
accomplished via the PR_REMOVE flag).

pr_state is protected by both the prison mutex and allprison_lock, so
it has the same availablity guarantees as the reference counts do.

Differential Revision:	https://reviews.freebsd.org/D27876
2021-02-21 13:24:47 -08:00
Mateusz Guzik
ee9b37ae5c jail: fix build after the previous commit
Noted by: Michael Butler <imb protected-networks.net>
2021-02-21 21:05:25 +00:00
Jamie Gritton
f7496dcab0 jail: Change the locking around pr_ref and pr_uref
Require both the prison mutex and allprison_lock when pr_ref or
pr_uref go to/from zero.  Adding a non-first or removing a non-last
reference remain lock-free.  This means that a shared hold on
allprison_lock is sufficient for prison_isalive() to be useful, which
removes a number of cases of lock/check/unlock on the prison mutex.

Expand the locking in kern_jail_set() to keep allprison_lock held
exclusive until the new prison is valid, thus making invalid prisons
invisible to any thread holding allprison_lock (except of course the
one creating or destroying the prison).  This renders prison_isvalid()
nearly redundant, now used only in asserts.

Differential Revision:	https://reviews.freebsd.org/D28419
Differential Revision:	https://reviews.freebsd.org/D28458
2021-02-21 10:55:44 -08:00
Konstantin Belousov
2bfd8992c7 vnode: move write cluster support data to inodes.
The data is only needed by filesystems that
1. use buffer cache
2. utilize clustering write support.

Requested by:	mjg
Reviewed by:	asomers (previous version), fsu (ext2 parts), mckusick
Tested by:	pho
Sponsored by:	The FreeBSD Foundation
Differential Revision:	https://reviews.freebsd.org/D28679
2021-02-21 11:38:21 +02:00
Konstantin Belousov
750ea20d3f Delete dead CLUSTERDEBUG config option.
Reviewed by:	mckusick
Tested by:	pho
MFC after:	1 week
Sponsored by:	The FreeBSD Foundation
Differential Revision:	https://reviews.freebsd.org/D28679
2021-02-21 11:38:21 +02:00
Mateusz Guzik
81174cd8e2 vfs: employ vfs_ref_from_vp in statfs and fstatfs
Avoids locking and unlocking the vnode.

Reviewed by:	kib
Differential Revision:	https://reviews.freebsd.org/D28695
2021-02-21 00:43:05 +00:00
Mateusz Guzik
a15f787adb vfs: add vfs_ref_from_vp
This generalizes what vop_stdgetwritemount used to be doing.

Reviewed by:	kib
Differential Revision:	https://reviews.freebsd.org/D28695
2021-02-21 00:43:05 +00:00
Jamie Gritton
6e1d1bfcac jail: Improve locking when removing prisons
Change the flow of prison_deref() so it doesn't let go of allprison_lock
until it's completely done using it (except for a possible drop as part
of an upgrade on its first try).

Differential Revision:	https://reviews.freebsd.org/D28458
MFC after:	3 days
2021-02-20 14:38:58 -08:00
Jamie Gritton
d4380c0cdd jail: Change both root and working directories in jail_attach(2)
jail_attach(2) performs an internal chroot operation, leaving it up to
the calling process to assure the working directory is inside the jail.

Add a matching internal chdir operation to the jail's root.  Also
ignore kern.chroot_allow_open_directories, and always disallow the
operation if there are any directory descriptors open.

Reported by:    mjg
Approved by:    markj, kib
MFC after:      3 days
2021-02-19 14:13:35 -08:00
John Baldwin
9c64fc4029 Add Chacha20-Poly1305 as a KTLS cipher suite.
Chacha20-Poly1305 for TLS is an AEAD cipher suite for both TLS 1.2 and
TLS 1.3 (RFCs 7905 and 8446).  For both versions, Chacha20 uses the
server and client IVs as implicit nonces xored with the record
sequence number to generate the per-record nonce matching the
construction used with AES-GCM for TLS 1.3.

Reviewed by:	gallatin
Sponsored by:	Netflix
Differential Revision:	https://reviews.freebsd.org/D27839
2021-02-18 09:26:32 -08:00
Thomas Skibo
9976b42b69 ddb: fix show devmap output on 32-bit arm
The output has been broken since 1b6dd6d772. Casting to uintmax_t
before the call to printf is necessary to ensure that 32-bit addresses
are interpreted correctly.

PR:		243236
MFC after:	3 days
2021-02-18 11:53:14 -04:00
Konstantin Belousov
662283b108 vn_printf: handle VI_FOPENING
Noted by:	mjg
Sponsored by:	The FreeBSD Foundation
MFC after:	6 days
Fixes:	fa3bd463ce
2021-02-18 16:28:28 +02:00
Alex Richardson
fa2528ac64 Use atomic loads/stores when updating td->td_state
KCSAN complains about racy accesses in the locking code. Those races are
fine since they are inside a TD_SET_RUNNING() loop that expects the value
to be changed by another CPU.

Use relaxed atomic stores/loads to indicate that this variable can be
written/read by multiple CPUs at the same time. This will also prevent
the compiler from doing unexpected re-ordering.

Reported by:	GENERIC-KCSAN
Test Plan:	KCSAN no longer complains, kernel still runs fine.
Reviewed By:	markj, mjg (earlier version)
Differential Revision: https://reviews.freebsd.org/D28569
2021-02-18 14:02:48 +00:00
Konstantin Belousov
fa3bd463ce lockf: ensure atomicity of lockf for open(O_CREAT|O_EXCL|O_EXLOCK)
or EX_SHLOCK.  Do it by setting a vnode iflag indicating that the locking
exclusive open is in progress, and not allowing F_LOCK request to make
a progress until the first open finishes.

Requested by:	mckusick
Reviewed by:	markj, mckusick
Tested by:	pho
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
Differential Revision:	https://reviews.freebsd.org/D28697
2021-02-18 01:22:05 +02:00
Warner Losh
00065c7630 Giant: move back Giant removal until 14
Update the Giant Lock warning message to FreeBSD 14. It's growing increasling
clear that this won't be done before 13.0.

MFC: Insta (re@'s request)
2021-02-17 14:33:09 -07:00
Jamie Gritton
cc7b730653 jail: Handle a possible race between jail_remove(2) and fork(2)
jail_remove(2) includes a loop that sends SIGKILL to all processes
in a jail, but skips processes in PRS_NEW state.  Thus it is possible
the a process in mid-fork(2) during jail removal can survive the jail
being removed.

Add a prison flag PR_REMOVE, which is checked before the new process
returns.  If the jail is being removed, the process will then exit.
Also check this flag in jail_attach(2) which has a similar issue.

Reported by:    trasz
Approved by:    kib
MFC after:      3 days
2021-02-16 11:19:13 -08:00
Konstantin Belousov
c61fae1475 pgcache read: protect against reads past end of the vm object size
If uio_offset is past end of the object size, calculated resid is negative.
Delegate handling this case to the locked read, as any other non-trivial
situation.

PR:	253158
Reported by:	Harald Schmalzbauer <bugzilla.freebsd@omnilan.de>
Tested by:	cy
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
2021-02-16 07:09:37 +02:00
Alex Richardson
0482d7c9e9 Fix fget_only_user() to return ENOTCAPABLE on a failed capsicum check
After eaad8d1303 four additional
capsicum-test tests started failing. It turns out this is because
fget_only_user() was returning EBADF on a failed capsicum check instead
of forwarding the return value of cap_check_inline() like
fget_unlocked_seq().

capsicum-test failures before this:
```
[  FAILED  ] 7 tests, listed below:
[  FAILED  ] Capability.OperationsForked
[  FAILED  ] Capability.NoBypassDAC
[  FAILED  ] Pdfork.OtherUserForked
[  FAILED  ] PipePdfork.WildcardWait
[  FAILED  ] OpenatTest.WithFlag
[  FAILED  ] ForkedOpenatTest_WithFlagInCapabilityMode._
[  FAILED  ] Select.LotsOFileDescriptorsForked
```
After:
```
[  FAILED  ] 3 tests, listed below:
[  FAILED  ] Capability.NoBypassDAC
[  FAILED  ] Pdfork.OtherUserForked
[  FAILED  ] PipePdfork.WildcardWait
```

Reviewed By:	mjg
MFC after:	1 week
Differential Revision: https://reviews.freebsd.org/D28691
2021-02-15 22:55:12 +00:00
Jason A. Harmening
41032835dc Fix divide-by-zero panic when ASLR is enabled and superpages disabled
When locating the anonymous memory region for a vm_map with ASLR
enabled, we try to keep the slid base address aligned on a superpage
boundary to minimize pagetable fragmentation and maximize the potential
usage of superpage mappings.  We can't (portably) do this if superpages
have been disabled by loader tunable and pagesizes[1] is 0, and it
would be less beneficial in that case anyway.

PR:		253511
Reported by:	johannes@jo-t.de
MFC after:	1 week
Reviewed by:	kib
Differential Revision:	https://reviews.freebsd.org/D28678
2021-02-15 10:38:04 -08:00
Mateusz Guzik
eac22dd480 lockmgr: shrink struct lock by 8 bytes on LP64
Currently the struct has a 4 byte padding stemming from 3 ints.

1. prio comfortably fits in short, unfortunately there is no dedicated
   type for it and plumbing it throughout the codebase is not worth it
   right now, instead an assert is added which covers also flags for
   safety
2. lk_exslpfail can in principle exceed u_short, but the count is
   already not considered reliable and it only ever gets modified
   straight to 0. In other words it can be incrementing with an upper
   bound of USHRT_MAX

With these in place struct lock shrinks from 48 to 40 bytes.

Reviewed by:	kib (previous version)
Differential Revision:	https://reviews.freebsd.org/D28680
2021-02-15 13:57:25 +00:00
Konstantin Belousov
25c6318c79 procstat: distinguish vm map guards in procstat vm output.
Requested and reviewed by:	rwatson (previous version)
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
Differential Revision:	https://reviews.freebsd.org/D28658
2021-02-14 03:24:58 +02:00
Konstantin Belousov
adf28ab456 fifo: minor comment and assert improvements.
In particular, replace a note that reload through vget() is obsoleted,
with explanation why this code is required.

Reviewed by:	chs, mckusick
Tested by:	pho
MFC after:	2 weeks
Sponsored by:	The FreeBSD Foundation
2021-02-12 03:02:22 +02:00
Konstantin Belousov
b59a8e63d6 Stop ignoring ERELOOKUP from VOP_INACTIVE()
When possible, relock the vnode and retry inactivation.  Only vunref() is
required not to drop the vnode lock, so handle it specially by not retrying.

This is a part of the efforts to ensure that unlinked not referenced vnode
does not prevent inode from reusing.

Reviewed by:	chs, mckusick
Tested by:	pho
MFC after:	2 weeks
Sponsored by:	The FreeBSD Foundation
2021-02-12 03:02:21 +02:00
Konstantin Belousov
3b2aa36024 Use VOP_VPUT_PAIR() for eligible VFS syscalls.
The current list is limited to the cases where UFS needs to handle
vput(dvp) specially. Which means VOP_CREATE(), VOP_MKDIR(), VOP_MKNOD(),
VOP_LINK(), and VOP_SYMLINK().

Reviewed by:	chs, mkcusick
Tested by:	pho
MFC after:	2 weeks
Sponsored by:	The FreeBSD Foundation
2021-02-12 03:02:20 +02:00
Konstantin Belousov
49c117c193 Add VOP_VPUT_PAIR() with trivial default implementation.
The VOP is intended to be used in situations where VFS has two
referenced locked vnodes, typically a directory vnode dvp and a vnode
vp that is linked from the directory, and at least dvp is vput(9)ed.
The child vnode can be also vput-ed, but optionally left referenced and
locked.

There, at least UFS may need to do some actions with dvp which cannot be
done while vp is also locked, so its lock might be dropped temporary.
For instance, in some cases UFS needs to sync dvp to avoid filesystem
state that is currently not handled by either kernel nor fsck. Having
such VOP provides the neccessary context for filesystem which can do
correct locking and handle potential reclamation of vp after relock.

Trivial implementation does vput(dvp) and optionally vput(vp).

Reviewed by:	chs, mckusick
Tested by:	pho
MFC after:	2 weeks
Sponsored by:	The FreeBSD Foundation
2021-02-12 03:02:20 +02:00
Konstantin Belousov
ee965dfa64 vn_open(): If the vnode is reclaimed during open(2), do not return error.
Most future operations on the returned file descriptor will fail
anyway, and application should be ready to handle that failures.  Not
forcing it to understand the transient failure mode on open, which is
implementation-specific, should make us less special without loss of
reporting of errors.

Suggested by: chs
Reviewed by:	chs, mckusick
Tested by:	pho
MFC after:	2 weeks
Sponsored by:	The FreeBSD Foundation
2021-02-12 03:02:20 +02:00
Konstantin Belousov
bf0db19339 buf SU hooks: track buf_start() calls with B_IOSTARTED flag
and only call buf_complete() if previously started.  Some error paths,
like CoW failire, might skip buf_start() and do bufdone(), which itself
call buf_complete().

Various SU handle_written_XXX() functions check that io was started
and incomplete parts of the buffer data reverted before restoring them.
This is a useful invariant that B_IO_STARTED on buffer layer allows to
keep instead of changing check and panic into check and return.

Reported by:	pho
Reviewed by:	chs, mckusick
Tested by:	pho
MFC after:	2 weeks
Sponsored by:	The FreeBSD Foundations
2021-02-12 03:02:19 +02:00
Mateusz Guzik
39e0c3f686 cache: assorted comment fixups 2021-02-09 17:09:44 +01:00
Kyle Evans
504ebd612e kern: sonewconn: set so_options before pru_attach()
Protocol attachment has historically been able to observe and modify
so->so_options as needed, and it still can for newly created sockets.
779f106aa1 moved this to after pru_attach() when we re-acquire the
lock on the listening socket.

Restore the historical behavior so that pru_attach implementations can
consistently use it. Note that some pru_attach() do currently rely on
this, though that may change in the future. D28265 contains a change to
remove the use in TCP and IB/SDP bits, as resetting the requested linger
time on incoming connections seems questionable at best.

This does move the assignment out from under the head's listen lock, but
glebius notes that head won't be going away and applications cannot
assume any specific ordering with a race between a connection coming in
and the application changing socket options anyways.

Discussed-with:	glebius
MFC-after:	1 week
2021-02-08 21:44:43 -06:00
Alexander V. Chernikov
924d1c9a05 Revert "SO_RERROR indicates that receive buffer overflows should be handled as errors."
Wrong version of the change was pushed inadvertenly.

This reverts commit 4a01b854ca.
2021-02-08 22:32:32 +00:00
Alexander V. Chernikov
4a01b854ca SO_RERROR indicates that receive buffer overflows should be handled as errors.
Historically receive buffer overflows have been ignored and programs
could not tell if they missed messages or messages had been truncated
because of overflows. Since programs historically do not expect to get
receive overflow errors, this behavior is not the default.

This is really really important for programs that use route(4) to keep in sync
with the system. If we loose a message then we need to reload the full system
state, otherwise the behaviour from that point is undefined and can lead
to chasing bogus bug reports.
2021-02-08 21:42:20 +00:00
Mark Johnston
b5aa9ad43a ktls: Make configuration sysctls available as tunables
Reviewed by:	gallatin, jhb
Sponsored by:	Ampere Computing
Submitted by:	Klara, Inc.
MFC after:	1 week
Differential Revision:	https://reviews.freebsd.org/D28499
2021-02-08 09:19:02 -05:00
Mark Johnston
1755b2b989 ktls: Use COUNTER_U64_DEFINE_EARLY
This makes it a bit more straightforward to add new counters when
debugging.  No functional change intended.

Reviewed by:	jhb
Sponsored by:	Ampere Computing
Submitted by:	Klara, Inc.
MFC after:	1 week
Differential Revision:	https://reviews.freebsd.org/D28498
2021-02-08 09:18:51 -05:00
Mateusz Guzik
2f8a844635 cache: remove the largely obsolete general description
Examples of inconsistencies with the current state:
- references LRU of all entries, removed years ago
- references a non-existent lock (neglist)
- claims negative entries have a NULL target

It will be replaced with a more accurate and more informative
description.

In the meantime take it out so it stops misleading.
2021-02-06 00:28:40 +01:00
Mateusz Guzik
0e1594e60e cache: fix vfs:namecache:lookup:miss probe call sites 2021-02-06 00:28:40 +01:00
Mateusz Guzik
2e96132a7d cache: drop spurious arg from panic in cache_validate
vp is already reported when noting mismatch
2021-02-06 00:28:39 +01:00
Mateusz Guzik
b54ed778fe cache: comment on FNV 2021-02-06 00:13:57 +01:00
Ed Maste
edc374e7c4 Correct description for kern.proc.proc_td
kern.proc.proc_td returns the process table with an entry for each
thread.  Previously the description included "no threads", presumably
a cut-and-pasteo in 2648efa621.

Description suggested by PauAmma.

PR:		253146
MFC after:	3 days
Sponsored by:	The FreeBSD Foundation
2021-02-02 17:00:05 -05:00
Mateusz Guzik
45456abc4c cache: fix trailing slash support in face of permission problems
Reported by:	Johan Hendriks <joh.hendriks gmail.com>
Tested by:	kevans
2021-02-02 18:13:51 +00:00
Mateusz Guzik
6f19dc2124 cache: add delayed degenerate path handling 2021-02-01 04:53:23 +00:00
Mateusz Guzik
bbfb1edd70 cache: move hash computation into the parsing loop 2021-02-01 04:36:45 +00:00
Mateusz Guzik
e027e24bfa cache: add trailing slash support
Tested by:	pho
2021-01-31 12:02:46 +00:00
Mateusz Guzik
8cbd164a17 cache: handle NOFOLLOW requests for symlinks
Tested by:	pho
2021-01-31 12:02:46 +00:00
Gleb Smirnoff
3f43ada98c Catch up with 6edfd179c8: mechanically rename IFCAP_NOMAP to IFCAP_MEXTPG.
Originally IFCAP_NOMAP meant that the mbuf has external storage pointer
that points to unmapped address.  Then, this was extended to array of
such pointers.  Then, such mbufs were augmented with header/trailer.
Basically, extended mbufs are extended, and set of features is subject
to change.  The new name should be generic enough to avoid further
renaming.
2021-01-29 11:46:24 -08:00
Mateusz Guzik
45e1f85414 poll: use fget_unlocked or fget_only_user when feasible
This follows select by eleminating the use of filedesc lock.
This is a win for single-threaded processes and a mixed bag for others
as at small concurrency it is faster to take the lock instead of
refing/unrefing each file descriptor.

Nonetheless, removal of shared lock usage is a step towards a
mtx-protected fd table.
2021-01-29 11:23:44 +00:00
Mateusz Guzik
6affe1b712 select: employ fget_only_user
Since most select users are single-threaded this avoid a lot of work
in the common case.

For example select of 16 fds (ops/s):
before:	2114536
after:	2991010
2021-01-29 11:23:44 +00:00