Commit Graph

18219 Commits

Author SHA1 Message Date
Jamie Gritton
d4380c0cdd jail: Change both root and working directories in jail_attach(2)
jail_attach(2) performs an internal chroot operation, leaving it up to
the calling process to assure the working directory is inside the jail.

Add a matching internal chdir operation to the jail's root.  Also
ignore kern.chroot_allow_open_directories, and always disallow the
operation if there are any directory descriptors open.

Reported by:    mjg
Approved by:    markj, kib
MFC after:      3 days
2021-02-19 14:13:35 -08:00
John Baldwin
9c64fc4029 Add Chacha20-Poly1305 as a KTLS cipher suite.
Chacha20-Poly1305 for TLS is an AEAD cipher suite for both TLS 1.2 and
TLS 1.3 (RFCs 7905 and 8446).  For both versions, Chacha20 uses the
server and client IVs as implicit nonces xored with the record
sequence number to generate the per-record nonce matching the
construction used with AES-GCM for TLS 1.3.

Reviewed by:	gallatin
Sponsored by:	Netflix
Differential Revision:	https://reviews.freebsd.org/D27839
2021-02-18 09:26:32 -08:00
Thomas Skibo
9976b42b69 ddb: fix show devmap output on 32-bit arm
The output has been broken since 1b6dd6d772. Casting to uintmax_t
before the call to printf is necessary to ensure that 32-bit addresses
are interpreted correctly.

PR:		243236
MFC after:	3 days
2021-02-18 11:53:14 -04:00
Konstantin Belousov
662283b108 vn_printf: handle VI_FOPENING
Noted by:	mjg
Sponsored by:	The FreeBSD Foundation
MFC after:	6 days
Fixes:	fa3bd463ce
2021-02-18 16:28:28 +02:00
Alex Richardson
fa2528ac64 Use atomic loads/stores when updating td->td_state
KCSAN complains about racy accesses in the locking code. Those races are
fine since they are inside a TD_SET_RUNNING() loop that expects the value
to be changed by another CPU.

Use relaxed atomic stores/loads to indicate that this variable can be
written/read by multiple CPUs at the same time. This will also prevent
the compiler from doing unexpected re-ordering.

Reported by:	GENERIC-KCSAN
Test Plan:	KCSAN no longer complains, kernel still runs fine.
Reviewed By:	markj, mjg (earlier version)
Differential Revision: https://reviews.freebsd.org/D28569
2021-02-18 14:02:48 +00:00
Konstantin Belousov
fa3bd463ce lockf: ensure atomicity of lockf for open(O_CREAT|O_EXCL|O_EXLOCK)
or EX_SHLOCK.  Do it by setting a vnode iflag indicating that the locking
exclusive open is in progress, and not allowing F_LOCK request to make
a progress until the first open finishes.

Requested by:	mckusick
Reviewed by:	markj, mckusick
Tested by:	pho
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
Differential Revision:	https://reviews.freebsd.org/D28697
2021-02-18 01:22:05 +02:00
Warner Losh
00065c7630 Giant: move back Giant removal until 14
Update the Giant Lock warning message to FreeBSD 14. It's growing increasling
clear that this won't be done before 13.0.

MFC: Insta (re@'s request)
2021-02-17 14:33:09 -07:00
Jamie Gritton
cc7b730653 jail: Handle a possible race between jail_remove(2) and fork(2)
jail_remove(2) includes a loop that sends SIGKILL to all processes
in a jail, but skips processes in PRS_NEW state.  Thus it is possible
the a process in mid-fork(2) during jail removal can survive the jail
being removed.

Add a prison flag PR_REMOVE, which is checked before the new process
returns.  If the jail is being removed, the process will then exit.
Also check this flag in jail_attach(2) which has a similar issue.

Reported by:    trasz
Approved by:    kib
MFC after:      3 days
2021-02-16 11:19:13 -08:00
Konstantin Belousov
c61fae1475 pgcache read: protect against reads past end of the vm object size
If uio_offset is past end of the object size, calculated resid is negative.
Delegate handling this case to the locked read, as any other non-trivial
situation.

PR:	253158
Reported by:	Harald Schmalzbauer <bugzilla.freebsd@omnilan.de>
Tested by:	cy
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
2021-02-16 07:09:37 +02:00
Alex Richardson
0482d7c9e9 Fix fget_only_user() to return ENOTCAPABLE on a failed capsicum check
After eaad8d1303 four additional
capsicum-test tests started failing. It turns out this is because
fget_only_user() was returning EBADF on a failed capsicum check instead
of forwarding the return value of cap_check_inline() like
fget_unlocked_seq().

capsicum-test failures before this:
```
[  FAILED  ] 7 tests, listed below:
[  FAILED  ] Capability.OperationsForked
[  FAILED  ] Capability.NoBypassDAC
[  FAILED  ] Pdfork.OtherUserForked
[  FAILED  ] PipePdfork.WildcardWait
[  FAILED  ] OpenatTest.WithFlag
[  FAILED  ] ForkedOpenatTest_WithFlagInCapabilityMode._
[  FAILED  ] Select.LotsOFileDescriptorsForked
```
After:
```
[  FAILED  ] 3 tests, listed below:
[  FAILED  ] Capability.NoBypassDAC
[  FAILED  ] Pdfork.OtherUserForked
[  FAILED  ] PipePdfork.WildcardWait
```

Reviewed By:	mjg
MFC after:	1 week
Differential Revision: https://reviews.freebsd.org/D28691
2021-02-15 22:55:12 +00:00
Jason A. Harmening
41032835dc Fix divide-by-zero panic when ASLR is enabled and superpages disabled
When locating the anonymous memory region for a vm_map with ASLR
enabled, we try to keep the slid base address aligned on a superpage
boundary to minimize pagetable fragmentation and maximize the potential
usage of superpage mappings.  We can't (portably) do this if superpages
have been disabled by loader tunable and pagesizes[1] is 0, and it
would be less beneficial in that case anyway.

PR:		253511
Reported by:	johannes@jo-t.de
MFC after:	1 week
Reviewed by:	kib
Differential Revision:	https://reviews.freebsd.org/D28678
2021-02-15 10:38:04 -08:00
Mateusz Guzik
eac22dd480 lockmgr: shrink struct lock by 8 bytes on LP64
Currently the struct has a 4 byte padding stemming from 3 ints.

1. prio comfortably fits in short, unfortunately there is no dedicated
   type for it and plumbing it throughout the codebase is not worth it
   right now, instead an assert is added which covers also flags for
   safety
2. lk_exslpfail can in principle exceed u_short, but the count is
   already not considered reliable and it only ever gets modified
   straight to 0. In other words it can be incrementing with an upper
   bound of USHRT_MAX

With these in place struct lock shrinks from 48 to 40 bytes.

Reviewed by:	kib (previous version)
Differential Revision:	https://reviews.freebsd.org/D28680
2021-02-15 13:57:25 +00:00
Konstantin Belousov
25c6318c79 procstat: distinguish vm map guards in procstat vm output.
Requested and reviewed by:	rwatson (previous version)
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
Differential Revision:	https://reviews.freebsd.org/D28658
2021-02-14 03:24:58 +02:00
Konstantin Belousov
adf28ab456 fifo: minor comment and assert improvements.
In particular, replace a note that reload through vget() is obsoleted,
with explanation why this code is required.

Reviewed by:	chs, mckusick
Tested by:	pho
MFC after:	2 weeks
Sponsored by:	The FreeBSD Foundation
2021-02-12 03:02:22 +02:00
Konstantin Belousov
b59a8e63d6 Stop ignoring ERELOOKUP from VOP_INACTIVE()
When possible, relock the vnode and retry inactivation.  Only vunref() is
required not to drop the vnode lock, so handle it specially by not retrying.

This is a part of the efforts to ensure that unlinked not referenced vnode
does not prevent inode from reusing.

Reviewed by:	chs, mckusick
Tested by:	pho
MFC after:	2 weeks
Sponsored by:	The FreeBSD Foundation
2021-02-12 03:02:21 +02:00
Konstantin Belousov
3b2aa36024 Use VOP_VPUT_PAIR() for eligible VFS syscalls.
The current list is limited to the cases where UFS needs to handle
vput(dvp) specially. Which means VOP_CREATE(), VOP_MKDIR(), VOP_MKNOD(),
VOP_LINK(), and VOP_SYMLINK().

Reviewed by:	chs, mkcusick
Tested by:	pho
MFC after:	2 weeks
Sponsored by:	The FreeBSD Foundation
2021-02-12 03:02:20 +02:00
Konstantin Belousov
49c117c193 Add VOP_VPUT_PAIR() with trivial default implementation.
The VOP is intended to be used in situations where VFS has two
referenced locked vnodes, typically a directory vnode dvp and a vnode
vp that is linked from the directory, and at least dvp is vput(9)ed.
The child vnode can be also vput-ed, but optionally left referenced and
locked.

There, at least UFS may need to do some actions with dvp which cannot be
done while vp is also locked, so its lock might be dropped temporary.
For instance, in some cases UFS needs to sync dvp to avoid filesystem
state that is currently not handled by either kernel nor fsck. Having
such VOP provides the neccessary context for filesystem which can do
correct locking and handle potential reclamation of vp after relock.

Trivial implementation does vput(dvp) and optionally vput(vp).

Reviewed by:	chs, mckusick
Tested by:	pho
MFC after:	2 weeks
Sponsored by:	The FreeBSD Foundation
2021-02-12 03:02:20 +02:00
Konstantin Belousov
ee965dfa64 vn_open(): If the vnode is reclaimed during open(2), do not return error.
Most future operations on the returned file descriptor will fail
anyway, and application should be ready to handle that failures.  Not
forcing it to understand the transient failure mode on open, which is
implementation-specific, should make us less special without loss of
reporting of errors.

Suggested by: chs
Reviewed by:	chs, mckusick
Tested by:	pho
MFC after:	2 weeks
Sponsored by:	The FreeBSD Foundation
2021-02-12 03:02:20 +02:00
Konstantin Belousov
bf0db19339 buf SU hooks: track buf_start() calls with B_IOSTARTED flag
and only call buf_complete() if previously started.  Some error paths,
like CoW failire, might skip buf_start() and do bufdone(), which itself
call buf_complete().

Various SU handle_written_XXX() functions check that io was started
and incomplete parts of the buffer data reverted before restoring them.
This is a useful invariant that B_IO_STARTED on buffer layer allows to
keep instead of changing check and panic into check and return.

Reported by:	pho
Reviewed by:	chs, mckusick
Tested by:	pho
MFC after:	2 weeks
Sponsored by:	The FreeBSD Foundations
2021-02-12 03:02:19 +02:00
Mateusz Guzik
39e0c3f686 cache: assorted comment fixups 2021-02-09 17:09:44 +01:00
Kyle Evans
504ebd612e kern: sonewconn: set so_options before pru_attach()
Protocol attachment has historically been able to observe and modify
so->so_options as needed, and it still can for newly created sockets.
779f106aa1 moved this to after pru_attach() when we re-acquire the
lock on the listening socket.

Restore the historical behavior so that pru_attach implementations can
consistently use it. Note that some pru_attach() do currently rely on
this, though that may change in the future. D28265 contains a change to
remove the use in TCP and IB/SDP bits, as resetting the requested linger
time on incoming connections seems questionable at best.

This does move the assignment out from under the head's listen lock, but
glebius notes that head won't be going away and applications cannot
assume any specific ordering with a race between a connection coming in
and the application changing socket options anyways.

Discussed-with:	glebius
MFC-after:	1 week
2021-02-08 21:44:43 -06:00
Alexander V. Chernikov
924d1c9a05 Revert "SO_RERROR indicates that receive buffer overflows should be handled as errors."
Wrong version of the change was pushed inadvertenly.

This reverts commit 4a01b854ca.
2021-02-08 22:32:32 +00:00
Alexander V. Chernikov
4a01b854ca SO_RERROR indicates that receive buffer overflows should be handled as errors.
Historically receive buffer overflows have been ignored and programs
could not tell if they missed messages or messages had been truncated
because of overflows. Since programs historically do not expect to get
receive overflow errors, this behavior is not the default.

This is really really important for programs that use route(4) to keep in sync
with the system. If we loose a message then we need to reload the full system
state, otherwise the behaviour from that point is undefined and can lead
to chasing bogus bug reports.
2021-02-08 21:42:20 +00:00
Mark Johnston
b5aa9ad43a ktls: Make configuration sysctls available as tunables
Reviewed by:	gallatin, jhb
Sponsored by:	Ampere Computing
Submitted by:	Klara, Inc.
MFC after:	1 week
Differential Revision:	https://reviews.freebsd.org/D28499
2021-02-08 09:19:02 -05:00
Mark Johnston
1755b2b989 ktls: Use COUNTER_U64_DEFINE_EARLY
This makes it a bit more straightforward to add new counters when
debugging.  No functional change intended.

Reviewed by:	jhb
Sponsored by:	Ampere Computing
Submitted by:	Klara, Inc.
MFC after:	1 week
Differential Revision:	https://reviews.freebsd.org/D28498
2021-02-08 09:18:51 -05:00
Mateusz Guzik
2f8a844635 cache: remove the largely obsolete general description
Examples of inconsistencies with the current state:
- references LRU of all entries, removed years ago
- references a non-existent lock (neglist)
- claims negative entries have a NULL target

It will be replaced with a more accurate and more informative
description.

In the meantime take it out so it stops misleading.
2021-02-06 00:28:40 +01:00
Mateusz Guzik
0e1594e60e cache: fix vfs:namecache:lookup:miss probe call sites 2021-02-06 00:28:40 +01:00
Mateusz Guzik
2e96132a7d cache: drop spurious arg from panic in cache_validate
vp is already reported when noting mismatch
2021-02-06 00:28:39 +01:00
Mateusz Guzik
b54ed778fe cache: comment on FNV 2021-02-06 00:13:57 +01:00
Ed Maste
edc374e7c4 Correct description for kern.proc.proc_td
kern.proc.proc_td returns the process table with an entry for each
thread.  Previously the description included "no threads", presumably
a cut-and-pasteo in 2648efa621.

Description suggested by PauAmma.

PR:		253146
MFC after:	3 days
Sponsored by:	The FreeBSD Foundation
2021-02-02 17:00:05 -05:00
Mateusz Guzik
45456abc4c cache: fix trailing slash support in face of permission problems
Reported by:	Johan Hendriks <joh.hendriks gmail.com>
Tested by:	kevans
2021-02-02 18:13:51 +00:00
Mateusz Guzik
6f19dc2124 cache: add delayed degenerate path handling 2021-02-01 04:53:23 +00:00
Mateusz Guzik
bbfb1edd70 cache: move hash computation into the parsing loop 2021-02-01 04:36:45 +00:00
Mateusz Guzik
e027e24bfa cache: add trailing slash support
Tested by:	pho
2021-01-31 12:02:46 +00:00
Mateusz Guzik
8cbd164a17 cache: handle NOFOLLOW requests for symlinks
Tested by:	pho
2021-01-31 12:02:46 +00:00
Gleb Smirnoff
3f43ada98c Catch up with 6edfd179c8: mechanically rename IFCAP_NOMAP to IFCAP_MEXTPG.
Originally IFCAP_NOMAP meant that the mbuf has external storage pointer
that points to unmapped address.  Then, this was extended to array of
such pointers.  Then, such mbufs were augmented with header/trailer.
Basically, extended mbufs are extended, and set of features is subject
to change.  The new name should be generic enough to avoid further
renaming.
2021-01-29 11:46:24 -08:00
Mateusz Guzik
45e1f85414 poll: use fget_unlocked or fget_only_user when feasible
This follows select by eleminating the use of filedesc lock.
This is a win for single-threaded processes and a mixed bag for others
as at small concurrency it is faster to take the lock instead of
refing/unrefing each file descriptor.

Nonetheless, removal of shared lock usage is a step towards a
mtx-protected fd table.
2021-01-29 11:23:44 +00:00
Mateusz Guzik
6affe1b712 select: employ fget_only_user
Since most select users are single-threaded this avoid a lot of work
in the common case.

For example select of 16 fds (ops/s):
before:	2114536
after:	2991010
2021-01-29 11:23:44 +00:00
Mateusz Guzik
eaad8d1303 fd: add fget_only_user
This can be used by single-threaded processes which don't share a file
descriptor table to access their file objects without having to
reference them.

For example select consumers tend to match the requirement and have
several file descriptors to inspect.
2021-01-29 11:23:43 +00:00
Jamie Gritton
c050ea803e jail: Handle a parent jail when a child is added to it
It's possible when adding a jail that its dying parent comes back to
life.  Only allow that to happen when JAIL_DYING is specified.  And if
it does happen, call PR_METHOD_CREATE on it.
2021-01-28 21:51:09 -08:00
Bryan Drewery
c926114f2f Fix getblk() with GB_NOCREAT returning false-negatives.
It is possible for a buf to be reassigned between the dirty and clean
lists while gbincore_unlocked() looks in each list.  Avoid creating
a buffer in that case and fallback to a locked lookup.

This fixes a regression from r363482.

More discussion on potential improvements to the clean and dirty lists
handling is in the review.

Reviewed by:	cem, kib, markj, vangyzen, rlibby
Reported by:	Suraj.Raju at dell.com
Submitted by:	Suraj.Raju at dell.com, cem, [based on both]
MFC after:	2 weeks
Sponsored by:	Dell EMC
Differential Revision:	https://reviews.freebsd.org/D28375
2021-01-28 11:24:24 -08:00
Mateusz Guzik
5c325977b1 cache: add missing MNT_NOSYMFOLLOW check to symlink traversal 2021-01-27 15:08:38 +00:00
Mateusz Guzik
5fc384d181 cache: fallback when encountering a mount point during .. lookup
The current abort is overzealous.
2021-01-27 16:00:31 +01:00
Bjoern A. Zeeb
6f65b50546 firmware(9): extend firmware_get() by a "no warn" flag.
With the upcoming usage from LinuxKPI but also from drivers
ported natively we are seeing more probing of various
firmware (names).

Add the ability to firmware(9) to silence the
"firmware image loading/registering errors" by adding a new
firmware_get_flags() functions extending firmware_get() and
taking a flags argument as firmware_put() already does.

Requested-by:	zeising (for future LinuxKPI/DRM)
Sponsored-by:	The FreeBSD Foundation
Sponsored-by:	Rubicon Communications, LLC ("Netgate")
MFC after:	3 days
Reviewed-by:	markj
Differential Revision:	https://reviews.freebsd.org/D27413
2021-01-27 13:51:26 +00:00
Mateusz Guzik
a098a831a1 cache: tidy up handling of foo/bar lookups where foo is not a directory
The code was performing an avoidable check for doomed state to account
for foo being a VDIR but turning VBAD. Now that dooming puts a vnode
in a permanent "modify" state this is no longer necessary as the final
status check will catch it.
2021-01-26 20:42:53 +00:00
Mateusz Guzik
a51eca7936 cache: stop referring to removing entries as invalidating them
Said use is a remnant from the old code and clashes with the NCF_INVALID
flag.
2021-01-26 20:42:53 +00:00
Brooks Davis
d89c1c461c Reserve gaps in syscall numbers for local use
It is best for auditing of syscalls.master if we only append to the
file.  Reserving unimplemented system call numbers for local use makes
this policy and provides a large set of syscall numbers FreeBSD
derivatives can use without risk of conflict.

Reviewed by:	jhb, kevans, kib
Sponsored by:	DARPA
Differential Revision:	https://reviews.freebsd.org/D27988
2021-01-26 18:27:45 +00:00
Brooks Davis
119fa6ee8a syscalls.master: Add a new syscall type: RESERVED
RESERVED syscall number are reserved for local/vendor use.  RESERVED is
identical to UNIMPL except that comments are ignored.

Reviewed by:	kevans
Sponsored by:	DARPA
Differential Revision:	https://reviews.freebsd.org/D27988
2021-01-26 18:27:44 +00:00
Brooks Davis
65a524b499 Remove documentation of unimplemented syscalls
We have not been able to run binaries from other BSDs well over a
decade.  There is no need to document their allocation decisions here.

We also don't need to reserve syscall numbers of never-implemented
syscalls.

Reviewed by:	jhb, kib
Sponsored by:	DARPA
Differential Revision:	https://reviews.freebsd.org/D27988
2021-01-26 18:27:44 +00:00
Mateusz Guzik
6943671b48 cache: convert cache_fplookup_parse to void now that it always succeeds 2021-01-26 13:24:03 +01:00
Mateusz Guzik
e7cf562a40 cache: change ->v_cache_dd synchronisation rules
Instead of resorting to seqc modification take advantage of immutability
of entries and check if the entry still matches after everything got
prepared.
2021-01-25 22:41:13 +00:00
Mateusz Guzik
6f08427649 cache: make ->v_cache_dd accesses atomic-clean for lockless usage 2021-01-25 22:41:13 +00:00
Mateusz Guzik
6ef8fede86 cache: make ->nc_flag accesses atomic-clean for lockless usage 2021-01-25 22:41:13 +00:00
Mateusz Guzik
ffcf8f97f8 cache: store vnodes in local vars in cache_zap_locked 2021-01-25 22:41:13 +00:00
Mateusz Guzik
868643e722 cache: assorted cleanups 2021-01-25 19:45:24 +00:00
Mateusz Guzik
1c7a65adb0 cache: track calls to cache_symlink_alloc with unsupported size
While here assert on size passed to free.
2021-01-25 19:45:23 +00:00
Mateusz Guzik
02ec31bdf6 cache: add back target entry on rename 2021-01-23 18:10:16 +00:00
Mateusz Guzik
739ecbcf1c cache: add symlink support to lockless lookup
Reviewed by:	kib (previous version)
Tested by:	pho (previous version)
Differential Revision:	https://reviews.freebsd.org/D27488
2021-01-23 15:04:43 +00:00
Jamie Gritton
195cd6ae24 jail: fix dangling reference bug from 6754ae2572
The change to use refcounts for pr_uref was mishandled in
prison_proc_free, so killing a jail's last process could add
an extra reference, leaving it an unkillable zombie.
2021-01-22 10:56:24 -08:00
Jamie Gritton
39c8ef90f6 jail: A jail could be removed without calling OSD methods
Fix a long-standing bug where setting nopersist on a process-less jail
would remove it without calling the the OSD PR_METHOD_REMOVE methods.
2021-01-22 10:50:10 -08:00
Marius Strobl
679e4cdabd kvprintf(9): add missing FALLTHROUGH
Reported by:	Coverity
CID:		1005166
2021-01-22 00:18:40 +01:00
Konstantin Belousov
1ac7c34486 malloc_aligned: roundup allocation size up to next power of two
to make it use the right aligned zone.

Reported by:	melifaro
Reviewed by:	alc, markj (previous version)
Discussed with:	jrtc27
Tested by:	pho (previous version)
MFC after:	1 week
Sponsored by:	The FreeBSD Foundation
Differential Revision:	https://reviews.freebsd.org/D28219
2021-01-21 23:34:10 +02:00
Konstantin Belousov
0781c79d48 Restrict supported alignment for malloc_domainset_aligned(9) to PAGE_SIZE.
UMA page_alloc() does not take an alignment, so UMA can only handle
alignment less then page size.

Noted by:	alc
Reviewed by:	alc, markj (previous version)
Discussed with:	jrtc27
Tested by:	pho (previous version)
MFC after:	1 week
Sponsored by:	The FreeBSD Foundation
Differential Revision:	https://reviews.freebsd.org/D28219
2021-01-21 23:34:10 +02:00
Jamie Gritton
6754ae2572 jail: Use refcount(9) for prison references.
Use refcount(9) for both pr_ref and pr_uref in struct prison.  This
allows prisons to held and freed without requiring the prison mutex.
An exception to this is that dropping the last reference will still
lock the prison, to keep the guarantee that a locked prison remains
valid and alive (provided it was at the time it was locked).

Among other things, this honors the promise made in a comment in
crcopy(9), that it will not block, which hasn't been true for two
decades.
2021-01-20 15:08:27 -08:00
Vladimir Kondratyev
e3dd8ed77b devinfo sysctl handler: Do not write zero-length strings in to sbuf twice
This fixes missing PnPinfo and location strings in devinfo(8) output
for devices with no attached drivers.
2021-01-21 02:06:16 +03:00
Alan Somers
2247f48941 aio: micro-optimize the lio_opcode assignments
This allows slightly more efficient opcode testing in-kernel.  It is
transparent to userland, except to applications that sneakily submit
aio fsync or aio mlock operations via lio_listio, which has never been
documented, requires the use of deliberately undefined constants
(LIO_SYNC and LIO_MLOCK), and is arguably a bug.

Reviewed by:	jhb
Differential Revision:	https://reviews.freebsd.org/D27942
2021-01-20 09:02:25 -07:00
Alex Richardson
7e99c034f7 Emit uprintf() output for initproc if there is no controlling terminal
This patch helped me debug why /sbin/init was not being loaded after
making changes to the image activator in CheriBSD.

Reviewed By:	jhb (earlier version), kib
Differential Revision: https://reviews.freebsd.org/D28121
2021-01-20 09:54:46 +00:00
Mateusz Guzik
2171b8e8a2 cache: augment sdt probe in cache_fplookup_dot
Same as 6d386b4c ("cache: save a branch in cache_fplookup_next")
2021-01-20 07:23:14 +00:00
Mateusz Guzik
aae03cfe64 cache: whitespace nit in cache_fplookup_modifying 2021-01-20 07:22:04 +00:00
Mark Johnston
4dc1b17dbb ktls: Improve handling of the bind_threads tunable a bit
- Only check for empty domains if we actually tried to configure domain
  affinity in the first place.  Otherwise setting bind_threads=1 will
  always cause the sysctl value to be reported as zero.  This is
  harmless since the threads end up being bound, but it's confusing.
- Try to improve the sysctl description a bit.

Reviewed by:	gallatin, jhb
Submitted by:	Klara, Inc.
Sponsored by:	Ampere Computing
Differential Revision:	https://reviews.freebsd.org/D28161
2021-01-19 21:32:33 -05:00
Mateusz Guzik
38baca17e0 lockmgr: fix upgrade
TRYUPGRADE requests kept failing when they should not have due to wrong
macro used to count readers.

Fixes:	f6b091fbbd ("lockmgr: rewrite upgrade to stop always dropping the lock")
Noted by:	asomers
Differential Revision:	https://reviews.freebsd.org/D27947
2021-01-19 12:21:38 +00:00
Mateusz Guzik
57dab0292a cache: fix some typos 2021-01-19 10:17:14 +01:00
Mateusz Guzik
84ab77ad27 cache: drop-write only var from cache_fplookup_preparse 2021-01-19 10:13:30 +01:00
Mateusz Guzik
6d386b4c8a cache: save a branch in cache_fplookup_next
Previously the code would branch on top find out whether it should
branch on SDT probe and bumping the numposhits counter, depending
on cache_fplookup_cross_mount.

Arguably it should be done regardless of what said function returns.
2021-01-19 10:08:24 +01:00
Jamie Gritton
effad35ed1 jail: Clean up some function placement and improve comments.
Move prison_hold, prison_hold_locked ,prison_proc_hold, and
prison_proc_free to a more intuitive part of the file (together with
with prison_free and prison_free_locked), and add or improve comments
to these and others, to better describe what's going in the prison
reference cycle.

No functional changes.
2021-01-18 17:23:51 -08:00
Oleksandr Tymoshenko
248f0cabca make maximum interrupt number tunable on ARM, ARM64, MIPS, and RISC-V
Use a machdep.nirq tunable intead of compile-time constant NIRQ
as a value for maximum number of interrupts. It allows keep a system
footprint small by default with an option to increase the limit
for large systems like server-grade ARM64

Reviewd by:	mhorne
Differential Revision:	https://reviews.freebsd.org/D27844
Submitted by:	Klara, Inc.
Sponsored by:	Ampere Computing
2021-01-18 16:36:39 -08:00
Jamie Gritton
83bc72a04e jail: Fix a stray mutex from 76ad42abf9. 2021-01-18 15:47:09 -08:00
Jamie Gritton
76ad42abf9 jail: Add prison_isvalid() and prison_isalive()
prison_isvalid() checks if a prison record can be used at all, i.e.
pr_ref > 0.  This filters out prisons that aren't fully created, and
those that are either in the process of being dismantled, or will be
at the next opportunity.  While the check for pr_ref > 0 is simple
enough to make without a convenience function, this prepares the way
for other measures of prison validity.

prison_isalive() checks not only validity as far as the useablity of
the prison structure, but also whether the prison is visible to user
space.  It replaces a test for pr_uref > 0, which is currently only
used within kern_jail.c, and not often there.

Both of these functions also assert that either the prison mutex or
allprison_lock is held, since it's generally the case that unlocked
prisons aren't guaranteed to remain useable for any length of time.
This isn't entirely true, for example a thread can assume its own
prison is good, but most exceptions will exist inside of kern_jail.c.
2021-01-18 10:56:20 -08:00
Konstantin Belousov
36bcc44e2c Add ddb 'show timecounter' command.
MFC after:	1 week
Sponsored by:	The FreeBSD Foundation
2021-01-18 09:51:48 +02:00
Jamie Gritton
25c2c952e3 jail: Add proper prison locking in mqfs_prison_remove. 2021-01-17 17:41:09 -08:00
Konstantin Belousov
3b15beb30b Implement malloc_domainset_aligned(9).
Change the power-of-two malloc zones to require alignment equal to the
size [*].  Current uma allocator already provides such alignment, so in
fact this change does not change anything except providing future-proof
setup.

Suggested by:	markj [*]
Reviewed by:	andrew, jah, markj
Tested by:	pho
MFC after:	1 week
Sponsored by:	The FreeBSD Foundation
Differential Revision:	https://reviews.freebsd.org/D28147
2021-01-17 19:29:05 +02:00
Mateusz Guzik
fe258f23ef Save on getpid in setproctitle by supporting -1 as curproc. 2021-01-16 09:36:54 +01:00
Kirk McKusick
79a5c790bd Eliminate a locking panic when cleaning up UFS snapshots after a
disk failure.

Each vnode has an embedded lock that controls access to its contents.
However vnodes describing a UFS snapshot all share a single snapshot
lock to coordinate their access and update. As part of mounting a
UFS filesystem with snapshots, each of the vnodes describing a
snapshot has its individual lock replaced with the snapshot lock.
When the filesystem is unmounted the vnode's original lock is
returned replacing the snapshot lock.

When a disk fails while the UFS filesystem it contains is still
mounted (for example when a thumb drive is removed) UFS forcibly
unmounts the filesystem. The loss of the drive causes the GEOM
subsystem to orphan the provider, but the consumer remains until
the filesystem has finished with the unmount. Information describing
the snapshot locks was being prematurely cleared during the orphaning
causing the return of the snapshot vnode's original locks to fail.
The fix is to not clear the needed information prematurely.

Sponsored by: Netflix
2021-01-15 16:36:42 -08:00
Mitchell Horne
818390ce0c arm64: fix early devmap assertion
The purpose of this KASSERT is to ensure that we do not run out of space
in the early devmap. However, the devmap grew beyond its initial size of
2MB in r336519, and this assertion did not grow with it.

A devmap mapping of a 1080p framebuffer requires 1920x1080 bytes, or
1.977 MB, so it is just barely able to fit without triggering the
assertion, provided no other devices are mapped before it. With the
addition of `options GDB` in GENERIC by bbfa199cbc, the uart is now
mapped for the purposes of a debug port, before mapping the framebuffer.
The presence of both these conditions pushes the selected virtual
address just below the threshold, triggering the assertion.

To fix this, use the correct size of the devmap, defined by
PMAP_MAPDEV_EARLY_SIZE. Since this code is shared with RISC-V, define
it for that platform as well (although it is a different size).

PR:		25241
Reported by:	gbe
MFC after:	3 days
Sponsored by:	The FreeBSD Foundation
2021-01-13 17:27:44 -04:00
Mateusz Guzik
ef23df1354 vfs: set NC_KEEPPOSENTRY alongside NOCACHE when creating a file
Arguably the entire NOCACHE logic should get retired, in the meantime
at least prevent the code from evicting existing entries.
2021-01-13 15:29:34 +00:00
Mateusz Guzik
5753be8e43 fd: add refcount argument to falloc_noinstall
This lets callers avoid atomic ops by initializing the count to required
value from the get go.

While here add falloc_abort to backpedal from this without having to
fdrop.
2021-01-13 15:29:34 +00:00
Mateusz Guzik
5171310e66 vfs: use finstall_refed in openat
This avoids 2 atomic ops in the common case: 1 to grab an extra
reference and 1 to release it.
2021-01-13 03:30:38 +00:00
Mateusz Guzik
530b699a62 fd: add finstall_refed
Can be used to consume an already existing reference and consequently
avoid atomic ops.
2021-01-13 03:27:03 +01:00
Mateusz Guzik
4faa375cdd fd: provide a dedicated closef variant for unix socket code
This avoids testing for td != NULL.
2021-01-13 03:27:03 +01:00
Konstantin Belousov
0659df6fad vm_map_protect: allow to set prot and max_prot in one go.
This prevents a situation where other thread modifies map entries
permissions between setting max_prot, then relocking, then setting prot,
confusing the operation outcome.  E.g. you can get an error that is not
possible if operation is performed atomic.

Also enable setting rwx for max_prot even if map does not allow to set
effective rwx protection.

Reviewed by:	brooks, markj (previous version)
Sponsored by:	The FreeBSD Foundation
Differential Revision:	https://reviews.freebsd.org/D28117
2021-01-13 01:35:22 +02:00
Mateusz Guzik
70ba77706d vfs: extend vfs:namei:lookup:return probe with nameidata 2021-01-12 13:35:27 +00:00
Mateusz Guzik
cdb62ab74e vfs: add NDFREE_NOTHING and convert several NDFREE_PNBUF callers
Check the comment above the routine for reasoning.
2021-01-12 13:16:10 +00:00
Mateusz Guzik
6b3a9a0f3d Convert remaining cap_rights_init users to cap_rights_init_one
semantic patch:

@@

expression rights, r;

@@

- cap_rights_init(&rights, r)
+ cap_rights_init_one(&rights, r)
2021-01-12 13:16:10 +00:00
Konstantin Belousov
57f22c828e sigfastblock: do not skip cursig/postsig loop in ast()
Even if sigfastblock block is non-zero, non-blockable signals must be
checked on ast and delivered now.  This also affects debugger ability
to attach, because issignal() also calls ptracestop() if there is
a pending stop for debugee.

Instead of checking for sigfastblock, and either setting PENDING flag
for usermode or doing signal delivery loop, always do the loop after
checking, and then handle PENDING bit. issignal() already does the right
thing for fast-blocked case, allowing only STOPs and SIGKILL delivery to
happen.

Reported by:	Vasily Postnicov <shamaz.mazum@gmail.com>, markj
Reviewed by:	markj
Sponsored by:	The FreeBSD Foundation
Differential Revision:	https://reviews.freebsd.org/D28089
2021-01-12 12:45:26 +02:00
Konstantin Belousov
513320c0f1 sigfastblock_setpend(): do not set PEND user flag unless TDP_SIGFASTPENDING is set.
User pending bit should not be set if kernel did not noted a pending signal.

Reviewed by:	markj
Sponsored by:	The FreeBSD Foundation
Differential Revision:	https://reviews.freebsd.org/D28089
2021-01-12 12:43:34 +02:00
Alan Somers
ff1a307801 lio_listio: validate aio_lio_opcode
Previously, we would accept any kind of LIO_* opcode, including ones
that were intended for in-kernel use only like LIO_SYNC (which is not
defined in userland).  The situation became more serious with
022ca2fc7f.  After that revision, setting
aio_lio_opcode to LIO_WRITEV or LIO_READV would trigger an assertion.

Note that POSIX does not specify what should happen if aio_lio_opcode is
invalid.

MFC-with:	022ca2fc7f
Reviewed by:	jhb, tmunro, 0mp
Differential Revision:	<https://reviews.freebsd.org/D28078
2021-01-11 19:53:01 -07:00
Jason A. Harmening
e8a5a1ad71 rctl(4): support throttling resource usage to 0
For rate-based resources that support throttling (e.g.
readiops/writeips), this fixes a divide-by-zero panic when rctl(8)
passes 0 as the throttle value.  For these resources, treat
zero-throttle requests as requests to suspend forward progress as long
as possible using the duration specified in
kern.racct.rctl.throttle_max.

PR:		251803
Reported by:	chris@cretaforce.gr
Reviewed by:	kib
MFC after:	1 week
Differential Revision:	https://reviews.freebsd.org/D27858
2021-01-11 15:36:57 -08:00
Konstantin Belousov
4ea65707d3 exec_new_vmspace: print useful error message on ctty if stack cannot be mapped.
After old vmspace is destroyed during execve(2), but before the new space
is fully constructed, an error during image activation cannot be returned
because there is no executing program to receive it.

In the relatively common case of failure to map stack, print some hints
on the control terminal.  Note that user has enough knobs to cause stack
mapping error, and this is the most common reason for execve(2) aborting
the process.

Requested by:	jhb
Reviewed by:	emaste, jhb
Sponsored by:	The FreeBSD Foundation
Differential Revision:	https://reviews.freebsd.org/D28050
2021-01-12 01:15:43 +02:00
Konstantin Belousov
2e1c94aa1f Implement enforcing write XOR execute mapping policy.
It is checked in vm_map_insert() and vm_map_protect() that PROT_WRITE |
PROT_EXEC are never specified together, if vm_map has MAP_WX flag set.
FreeBSD control flag allows specific binary to request WX exempt, and
there are per ABI boolean sysctls kern.elf{32,64}.allow_wx to enable/
disable globally.

Reviewed by:	emaste, jhb
Sponsored by:	The FreeBSD Foundation
Differential Revision:	https://reviews.freebsd.org/D28050
2021-01-12 01:15:43 +02:00
Robert Watson
30b68ecda8 Changes that improve DTrace FBT reliability on freebsd/arm64:
- Implement a dtrace_getnanouptime(), matching the existing
  dtrace_getnanotime(), to avoid DTrace calling out to a potentially
  instrumentable function.

  (These should probably both be under KDTRACE_HOOKS.  Also, it's not clear
  to me that they are correct implementations for the DTrace thread time
  functions they are used in .. fixes for another commit.)

- Don't allow FBT to instrument functions involved in EL1 exception handling
  that are involved in FBT trap processing: handle_el1h_sync() and
  do_el1h_sync().

- Don't allow FBT to instrument DDB and KDB functions, as that makes it
  rather harder to debug FBT problems.

Prior to these changes, use of FBT on FreeBSD/arm64 rapidly led to kernel
panics due to recursion in DTrace.

Reliable FBT on FreeBSD/arm64 is reliant on another change from @andrew to
have the aarch64 instrumentor more carefully check that instructions it
replaces are against the stack pointer, which can otherwise lead to memory
corruption.  That change remains under review.

MFC after:	2 weeks
Reviewed by:	andrew, kp, markj (earlier version), jrtc27 (earlier version)
Differential revision:	https://reviews.freebsd.org/D27766
2021-01-11 15:42:22 +00:00