Commit Graph

18388 Commits

Author SHA1 Message Date
Mateusz Guzik
074abaccfa cache: remove incomplete lockless lockout support during resize
This is already properly handled thanks to 2 step hash replacement.
2021-04-28 19:53:25 +00:00
Mark Johnston
d1e9441583 pipe: Avoid calling selrecord() on a closing pipe
pipe_poll() may add the calling thread to the selinfo lists of both ends
of a pipe.  It is ok to do this for the local end, since we know we hold
a reference on the file and so the local end is not closed.  It is not
ok to do this for the remote end, which may already be closed and have
called seldrain().  In this scenario, when the polling thread wakes up,
it may end up referencing a freed selinfo.

Guard the selrecord() call appropriately.

Reviewed by:	kib
Reported by:	syzkaller+KASAN
MFC after:	1 week
Differential Revision:	https://reviews.freebsd.org/D30016
2021-04-28 10:43:29 -04:00
Thomas Munro
3aaaa2efde poll(2): Add POLLRDHUP.
Teach poll(2) to support Linux-style POLLRDHUP events for sockets, if
requested.  Triggered when the remote peer shuts down writing or closes
its end.

Reviewed by:	kib
MFC after:	1 month
Differential Revision:	https://reviews.freebsd.org/D29757
2021-04-28 23:00:31 +12:00
Mark Johnston
409ab7e109 imgact_elf: Ensure that the return value in parse_notes is initialized
parse_notes relies on the caller-supplied callback to initialize "res".
Two callbacks are used in practice, brandnote_cb and note_fctl_cb, and
the latter fails to initialize res.  Fix it.

In the worst case, the bug would cause the inner loop of check_note to
examine more program headers than necessary, and the note header usually
comes last anyway.

Reviewed by:	kib
Reported by:	KMSAN
MFC after:	1 week
Sponsored by:	The FreeBSD Foundation
Differential Revision:	https://reviews.freebsd.org/D29986
2021-04-26 14:53:16 -04:00
Edward Tomasz Napierala
5d1d844a77 kern_linkat: modify to accept AT_ flags instead of FOLLOW/NOFOLLOW
This makes this API match other kern_xxxat() functions.

Reviewed By:	kib
Sponsored By:	EPSRC
Differential Revision:	https://reviews.freebsd.org/D29776
2021-04-25 14:13:12 +01:00
Robert Watson
af14713d49 Support run-time configuration of the PIPE_MINDIRECT threshold.
PIPE_MINDIRECT determines at what (blocking) write size one-copy
optimizations are applied in pipe(2) I/O.  That threshold hasn't
been tuned since the 1990s when this code was originally
committed, and allowing run-time reconfiguration will make it
easier to assess whether contemporary microarchitectures would
prefer a different threshold.

(On our local RPi4 baords, the 8k default would ideally be at least
32k, but it's not clear how generalizable that observation is.)

MFC after:	3 weeks
Reviewers:	jrtc27, arichardson
Differential Revision: https://reviews.freebsd.org/D29819
2021-04-24 20:04:28 +01:00
Mark Johnston
8e8f1cc9bb Re-enable network ioctls in capability mode
This reverts a portion of 274579831b ("capsicum: Limit socket
operations in capability mode") as at least rtsol and dhcpcd rely on
being able to configure network interfaces while in capability mode.

Reported by:	bapt, Greg V
Sponsored by:	The FreeBSD Foundation
2021-04-23 09:22:49 -04:00
Warner Losh
df456a1fcf newbus: style nit (align comments)
Sponsored by:		Netflix
2021-04-21 15:37:24 -06:00
Warner Losh
1eebd6158c newbus: Optimize/Simplify kobj_class_compile_common a little
"i" is not used in this loop at all. There's no need to initialize and
increment it.

Reviewed by:		markj@
Sponsored by:		Netflix
Differential Revision:	https://reviews.freebsd.org/D29898
2021-04-21 15:37:24 -06:00
Konstantin Belousov
54f98c4dbf vn_open_vnode(): handle error when fp == NULL
If VOP_ADD_WRITECOUNT() or adv locking failed, so VOP_CLOSE() needs to
be called, we cannot use fp fo_close() when there is no fp.  This occurs
when e.g. kernel code directly calls vn_open() instead of the open(2)
syscall.

In this case, VOP_CLOSE() can be called directly, after possible lock
upgrade.

Reported by:	nvass@gmx.com
PR:	255119
Tested by:	pho
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
Differential Revision:	https://reviews.freebsd.org/D29830
2021-04-21 18:06:51 +03:00
Konstantin Belousov
ecfbddf0cd sysctl vm.objects: report backing object and swap use
For anonymous objects, provide a handle kvo_me naming the object,
and report the handle of the backing object.  This allows userspace
to deconstruct the shadow chain.  Right now the handle is the address
of the object in KVA, but this is not guaranteed.

For the same anonymous objects, report the swap space used for actually
swapped out pages, in kvo_swapped field.  I do not believe that it is
useful to report full 64bit counter there, so only uint32_t value is
returned, clamped to the max.

For kinfo_vmentry, report anonymous object handle backing the entry,
so that the shadow chain for the specific mapping can be deconstructed.

Reviewed by:	markj
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
Differential revision:	https://reviews.freebsd.org/D29771
2021-04-19 21:32:01 +03:00
Konstantin Belousov
4342ba184c sysctl_handle_string: do not malloc when SYSCTL_IN cannot fault
In particular, this avoids malloc(9) calls when from early tunable handling,
with no working malloc yet.

Reported and tested by:	mav
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
2021-04-19 21:32:01 +03:00
Konstantin Belousov
578c26f31c linkat(2): check NIRES_EMPTYPATH on the first fd arg
Reported by:	arichardson
Reviewed by:	markj
MFC after:	1 week
Differential revision:	https://reviews.freebsd.org/D29834
2021-04-19 21:32:01 +03:00
Warner Losh
571a1a64b1 Minor style tidy: if( -> if (
Fix a few 'if(' to be 'if (' in a few places, per style(9) and
overwhelming usage in the rest of the kernel / tree.

MFC After:		3 days
Sponsored by:		Netflix
2021-04-18 11:19:15 -06:00
Warner Losh
f1f9870668 Minor style cleanup
We prefer 'while (0)' to 'while(0)' according to grep and stlye(9)'s
space after keyword rule. Remove a few stragglers of the latter.
Many of these usages were inconsistent within the file.

MFC After:		3 days
Sponsored by:		Netflix
2021-04-18 11:14:17 -06:00
Konstantin Belousov
bbf7a4e878 O_PATH: allow vnode kevent filter on such files
if VREAD access is checked as allowed during open

Requested by:	wulf
Reviewed by:	markj
Tested by:	pho
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
Differential revision:	https://reviews.freebsd.org/D29323
2021-04-15 12:49:18 +03:00
Konstantin Belousov
f9b923af34 O_PATH: Allow to open symlink
When O_NOFOLLOW is specified, namei() returns the symlink itself.  In
this case, open(O_PATH) should be allowed, to denote the location of symlink
itself.

Prevent O_EXEC in this case, execve(2) code is not ready to try to execute
symlinks.

Reported by:	wulf
Reviewed by:	markj
Tested by:	pho
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
Differential revision:	https://reviews.freebsd.org/D29323
2021-04-15 12:49:09 +03:00
Konstantin Belousov
a5970a529c Make files opened with O_PATH to not block non-forced unmount
by only keeping hold count on the vnode, instead of the use count.

Reviewed by:	markj
Tested by:	pho
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
Differential revision:	https://reviews.freebsd.org/D29323
2021-04-15 12:48:27 +03:00
Konstantin Belousov
8d9ed174f3 open(2): Implement O_PATH
Reviewed by:	markj
Tested by:	pho
Discussed with:	walker.aj325_gmail.com, wulf
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
Differential revision:	https://reviews.freebsd.org/D29323
2021-04-15 12:48:24 +03:00
Konstantin Belousov
509124b626 Add AT_EMPTY_PATH for several *at(2) syscalls
It is currently allowed to fchownat(2), fchmodat(2), fchflagsat(2),
utimensat(2), fstatat(2), and linkat(2).

For linkat(2), PRIV_VFS_FHOPEN privilege is required to exercise the flag.
It allows to link any open file.

Requested by:	trasz
Tested by:	pho, trasz
Reviewed by:	markj
Sponsored by:	The FreeBSD Foundation
MFC after:	2 weeks
Differential revision:	https://reviews.freebsd.org/D29111
2021-04-15 12:48:11 +03:00
Konstantin Belousov
437c241d0c vfs_vnops.c: Make vn_statfile() non-static
Reviewed by:	markj
Tested by:	pho
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
Differential revision:	https://reviews.freebsd.org/D29323
2021-04-15 12:47:56 +03:00
Konstantin Belousov
42be0a7b10 Style.
Add missed spaces, wrap long lines.

Reviewed by:	markj
Tested by:	pho
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
Differential revision:	https://reviews.freebsd.org/D29323
2021-04-15 12:47:46 +03:00
Mateusz Guzik
4f0279e064 cache: extend mismatch vnode assert print to include the name 2021-04-15 07:55:43 +00:00
Mark Johnston
29bb6c19f0 domainset: Define additional global policies
Add global definitions for first-touch and interleave policies.  The
former may be useful for UMA, which implements a similar policy without
using domainset iterators.

No functional change intended.

Reviewed by:	mav
MFC after:	2 weeks
Sponsored by:	The FreeBSD Foundation
Differential Revision:	https://reviews.freebsd.org/D29104
2021-04-14 13:03:33 -04:00
Konstantin Belousov
75c5cf7a72 filt_timerexpire: avoid process lock recursion
Found by:	syzkaller
Reported and reviewed by:	markj
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
Differential revision:	https://reviews.freebsd.org/D29746
2021-04-14 10:53:28 +03:00
Konstantin Belousov
5cc1d19941 realtimer_expire: avoid proc lock recursion when called from itimer_proc_continue()
It is fine to drop the process lock there, process cannot exit until its
timers are cleared.

Found by:	syzkaller
Reported and reviewed by:	markj
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
Differential revision:	https://reviews.freebsd.org/D29746
2021-04-14 10:53:19 +03:00
Konstantin Belousov
116f26f947 sbuf_uionew(): sbuf_new() takes int as length
and length should be not less than SBUF_MINSIZE

Reported and tested by:	pho
Noted and reviewed by:	markj
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
Differential revision:	https://reviews.freebsd.org/D29752
2021-04-14 10:23:20 +03:00
Mark Johnston
06a53ecf24 malloc: Add state transitions for KASAN
- Reuse some REDZONE bits to keep track of the requested and allocated
  sizes, and use that to provide red zones.
- As in UMA, disable memory trashing to avoid unnecessary CPU overhead.

MFC after:	2 weeks
Sponsored by:	The FreeBSD Foundation
Differential Revision:	https://reviews.freebsd.org/D29461
2021-04-13 17:42:21 -04:00
Mark Johnston
f1c3adefd9 execve: Mark exec argument buffers
We cache mapped execve argument buffers to avoid the overhead of TLB
shootdowns.  Mark them invalid when they are freed to the cache.

MFC after:	2 weeks
Sponsored by:	The FreeBSD Foundation
Differential Revision:	https://reviews.freebsd.org/D29460
2021-04-13 17:42:21 -04:00
Mark Johnston
b261bb4057 vfs: Add KASAN state transitions for vnodes
vnodes are a bit special in that they may exist on per-CPU lists even
while free.  Add a KASAN-only destructor that poisons regions of each
vnode that are not expected to be accessed after a free.

MFC after:	2 weeks
Sponsored by:	The FreeBSD Foundation
Differential Revision:	https://reviews.freebsd.org/D29459
2021-04-13 17:42:21 -04:00
Mark Johnston
6faf45b34b amd64: Implement a KASAN shadow map
The idea behind KASAN is to use a region of memory to track the validity
of buffers in the kernel map.  This region is the shadow map.  The
compiler inserts calls to the KASAN runtime for every emitted load
and store, and the runtime uses the shadow map to decide whether the
access is valid.  Various kernel allocators call kasan_mark() to update
the shadow map.

Since the shadow map tracks only accesses to the kernel map, accesses to
other kernel maps are not validated by KASAN.  UMA_MD_SMALL_ALLOC is
disabled when KASAN is configured to reduce usage of the direct map.
Currently we have no mechanism to completely eliminate uses of the
direct map, so KASAN's coverage is not comprehensive.

The shadow map uses one byte per eight bytes in the kernel map.  In
pmap_bootstrap() we create an initial set of page tables for the kernel
and preloaded data.

When pmap_growkernel() is called, we call kasan_shadow_map() to extend
the shadow map.  kasan_shadow_map() uses pmap_kasan_enter() to allocate
memory for the shadow region and map it.

Reviewed by:	kib
MFC after:	2 weeks
Sponsored by:	The FreeBSD Foundation
Differential Revision:	https://reviews.freebsd.org/D29417
2021-04-13 17:42:20 -04:00
Mark Johnston
38da497a4d Add the KASAN runtime
KASAN enables the use of LLVM's AddressSanitizer in the kernel.  This
feature makes use of compiler instrumentation to validate memory
accesses in the kernel and detect several types of bugs, including
use-after-frees and out-of-bounds accesses.  It is particularly
effective when combined with test suites or syzkaller.  KASAN has high
CPU and memory usage overhead and so is not suited for production
environments.

The runtime and pmap maintain a shadow of the kernel map to store
information about the validity of memory mapped at a given kernel
address.

The runtime implements a number of functions defined by the compiler
ABI.  These are prefixed by __asan.  The compiler emits calls to
__asan_load*() and __asan_store*() around memory accesses, and the
runtime consults the shadow map to determine whether a given access is
valid.

kasan_mark() is called by various kernel allocators to update state in
the shadow map.  Updates to those allocators will come in subsequent
commits.

The runtime also defines various interceptors.  Some low-level routines
are implemented in assembly and are thus not amenable to compiler
instrumentation.  To handle this, the runtime implements these routines
on behalf of the rest of the kernel.  The sanitizer implementation
validates memory accesses manually before handing off to the real
implementation.

The sanitizer in a KASAN-configured kernel can be disabled by setting
the loader tunable debug.kasan.disable=1.

Obtained from:	NetBSD
MFC after:	2 weeks
Sponsored by:	The FreeBSD Foundation
Differential Revision:	https://reviews.freebsd.org/D29416
2021-04-13 17:42:20 -04:00
Mitchell Horne
2816bd8442 rmlock(9): add an RM_DUPOK flag
Allows for duplicate locks to be acquired without witness complaining.
Similar flags exists already for rwlock(9) and sx(9).

Reviewed by:	markj
MFC after:	3 days
Sponsored by:	NetApp, Inc.
Sponsored by:	Klara, Inc.
NetApp PR:	52
Differential Revision:	https://reviews.freebsd.org/D29683n
2021-04-12 11:42:21 -03:00
Mark Johnston
dfff37765c Rename struct device to struct _device
types.h defines device_t as a typedef of struct device *.  struct device
is defined in subr_bus.c and almost all of the kernel uses device_t.
The LinuxKPI also defines a struct device, so type confusion can occur.

This causes bugs and ambiguity for debugging tools.  Rename the FreeBSD
struct device to struct _device.

Reviewed by:	gbe (man pages)
Reviewed by:	rpokala, imp, jhb
MFC after:	1 week
Sponsored by:	The FreeBSD Foundation
Differential Revision:	https://reviews.freebsd.org/D29676
2021-04-12 09:32:30 -04:00
Andrew Turner
5d2d599d3f Create VM_MEMATTR_DEVICE on all architectures
This is intended to be used with memory mapped IO, e.g. from
bus_space_map with no flags, or pmap_mapdev.

Use this new memory type in the map request configured by
resource_init_map_request, and in pciconf.

Reviewed by:	kib
Differential Revision:	https://reviews.freebsd.org/D29692
2021-04-12 06:15:31 +00:00
Konstantin Belousov
a091c35323 ptrace: restructure comments around reparenting on PT_DETACH
style code, and use {} for both branches.

Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
2021-04-11 14:44:30 +03:00
Konstantin Belousov
9d7e450b64 ptrace: remove dead call to FIX_SSTEP()
It was an alias for procfs_fix_sstep() long time ago.

Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
2021-04-11 14:44:30 +03:00
Piotr Pawel Stefaniak
a212f56d10 Balance parentheses in sysctl descriptions 2021-04-11 10:30:55 +02:00
Konstantin Belousov
2fd1ffefaa Stop arming kqueue timers on knote owner suspend or terminate
This way, even if the process specified very tight reschedule
intervals, it should be stoppable/killable.

Reported and reviewed by:	markj
Tested by:	markj, pho
Sponsored by:	The FreeBSD Foundation
MFC after:	2 weeks
Differential revision:	https://reviews.freebsd.org/D29106
2021-04-09 23:43:51 +03:00
Konstantin Belousov
533e5057ed Add helper for kqueue timers callout scheduling
Reviewed by:	markj
Tested by:	markj, pho
Sponsored by:	The FreeBSD Foundation
MFC after:	2 weeks
Differential revision:	https://reviews.freebsd.org/D29106
2021-04-09 23:42:56 +03:00
Konstantin Belousov
4d27d8d2f3 Stop arming realtime posix process timers on suspend or terminate
Reported and reviewed by:	markj
Tested by:	markj, pho
Sponsored by:	The FreeBSD Foundation
MFC after:	2 weeks
Differential revision:	https://reviews.freebsd.org/D29106
2021-04-09 23:42:51 +03:00
Konstantin Belousov
dc47fdf131 Stop arming periodic process timers on suspend or terminate
Reported and reviewed by:	markj
Tested by:	markj, pho
Sponsored by:	The FreeBSD Foundation
MFC after:	2 weeks
Differential revision:	https://reviews.freebsd.org/D29106
2021-04-09 23:42:44 +03:00
Mateusz Guzik
72b3b5a941 vfs: replace vfs_smr_quiesce with vfs_smr_synchronize
This ends up using a smr specific method.

Suggested by:	markj
Tested by:	pho
2021-04-08 11:14:45 +00:00
Mark Johnston
0f07c234ca Remove more remnants of sio(4)
Reviewed by:	imp
MFC after:	1 week
Sponsored by:	The FreeBSD Foundation
Differential Revision:	https://reviews.freebsd.org/D29626
2021-04-07 14:33:02 -04:00
Mark Johnston
274579831b capsicum: Limit socket operations in capability mode
Capsicum did not prevent certain privileged networking operations,
specifically creation of raw sockets and network configuration ioctls.
However, these facilities can be used to circumvent some of the
restrictions that capability mode is supposed to enforce.

Add capability mode checks to disallow network configuration ioctls and
creation of sockets other than PF_LOCAL and SOCK_DGRAM/STREAM/SEQPACKET
internet sockets.

Reviewed by:	oshogbo
Discussed with:	emaste
Reported by:	manu
Sponsored by:	The FreeBSD Foundation
MFC after:	2 weeks
Differential Revision:	https://reviews.freebsd.org/D29423
2021-04-07 14:32:56 -04:00
Mateusz Guzik
13b3862ee8 cache: update an assert on CACHE_FPL_STATUS_ABORTED
Since symlink support it can get upgraded to CACHE_FPL_STATUS_DESTROYED.

Reported by:	bdrewery
2021-04-06 22:31:58 +02:00
Mark Johnston
2425f5e912 mount: Disallow mounting over a jail root
Discussed with:	jamie
Approved by:	so
Security:	CVE-2020-25584
Security:	FreeBSD-SA-21:10.jail_mount
2021-04-06 14:49:36 -04:00
Edward Tomasz Napierala
7f6157f7fd lock_delay(9): improve interaction with restrict_starvation
After e7a5b3bd05, the la->delay value was adjusted after
being set by the starvation_limit code block, which is wrong.

Reported By:	avg
Reviewed By:	avg
Fixes:		e7a5b3bd05
Sponsored By:	NetApp, Inc.
Sponsored By:	Klara, Inc.
Differential Revision:	https://reviews.freebsd.org/D29513
2021-04-03 13:08:53 +01:00
Mark Johnston
52a99c72b5 sendfile: Fix error initialization in sendfile_getobj()
Reviewed by:	chs, kib
Reported by:	jhb
Fixes:		faa998f6ff
MFC after:	1 day
Differential Revision:	https://reviews.freebsd.org/D29540
2021-04-02 17:42:38 -04:00
Richard Scheffenegger
cad4fd0365 Make sbuf_drain safe for external use
While sbuf_drain was an internal function, two
KASSERTS checked the sanity of it being called.
However, an external caller may be ignorant if
there is any data to drain, or if an error has
already accumulated. Be nice and return immediately
with the accumulated error.

MFC after: 2 weeks
Reviewed By: tuexen, #transport
Sponsored by: NetApp, Inc.
Differential Revision: https://reviews.freebsd.org/D29544
2021-04-02 20:12:11 +02:00
Konstantin Belousov
aa3ea612be x86: remove gcov kernel support
Reviewed by:	jhb
Sponsored by:	The FreeBSD Foundation
Differential revision:	https://reviews.freebsd.org/D29529
2021-04-02 15:41:51 +03:00
Mateusz Guzik
f79bd71def cache: add high level overview
Differential Revision:	https://reviews.freebsd.org/D28675
2021-04-02 05:11:05 +02:00
Mateusz Guzik
dc532884d5 cache: fix resizing in face of lockless lookup
Reported by:	pho
Tested by:	pho
2021-04-02 05:11:05 +02:00
Lawrence Stewart
1eb402e47a stats(3): Improve t-digest merging of samples which result in mu adjustment underflow.
Allow the calculation of the mu adjustment factor to underflow instead of
rejecting the VOI sample from the digest and logging an error. This trades off
some (currently unquantified) additional centroid error in exchange for better
fidelity of the distribution's density, which is the right trade off at the
moment until follow up work to better handle and track accumulated error can be
undertaken.

Obtained from:	Netflix
MFC after:	immediately
2021-04-02 13:17:53 +11:00
Richard Scheffenegger
c804c8f2c5 Export sbuf_drain to orchestrate lock and drain action
While exporting large amounts of data to a sysctl
request, datastructures may need to be locked.

Exporting the sbuf_drain function allows the
coordination between drain events and held
locks, to avoid stalls.

PR:		254333
Reviewed By:	jhb
MFC after:	2 weeks
Sponsored by:	NetApp, Inc.
Differential Revision: https://reviews.freebsd.org/D29481
2021-03-31 19:17:37 +02:00
Konstantin Belousov
e243367b64 mbuf: add a way to mark flowid as calculated from the internal headers
In some settings offload might calculate hash from decapsulated packet.
Reserve a bit in packet header rsstype to indicate that.

Add m_adj_decap() that acts similarly to m_adj, but also either clear
flowid if it is not marked as inner, or transfer it to the decapsulated
header, clearing inner indicator. It depends on the internals of m_adj()
that reuses the argument packet header for the result.

Use m_adj_decap() for decapsulating vxlan(4) and gif(4) input packets.

Reviewed by:	ae, hselasky, np
Sponsored by:	Nvidia Networking / Mellanox Technologies
MFC after:	1 week
Differential Revision:	https://reviews.freebsd.org/D28773
2021-03-31 14:38:26 +03:00
Mark Johnston
3428b6c050 Fix several dev_clone callbacks to avoid out-of-bounds reads
Use strncmp() instead of bcmp(), so that we don't have to find the
minimum of the string lengths before comparing.

Reviewed by:	kib
Reported by:	KASAN
MFC after:	3 days
Sponsored by:	The FreeBSD Foundation
Differential Revision:	https://reviews.freebsd.org/D29463
2021-03-28 11:08:36 -04:00
Mark Johnston
71c160a8f6 vfs: Add an assertion around name length limits
Some filesystems assume that they can copy a name component, with length
bounded by NAME_MAX, into a dirent buffer of size MAXNAMLEN.  These
constants have the same value; add a compile-time assertion to that
effect.

Reported by:	Alexey Kulaev <alex.qart@gmail.com>
Reviewed by:	kib
MFC after:	1 week
Sponsored by:	The FreeBSD Foundation
Differential Revision:	https://reviews.freebsd.org/D29431
2021-03-27 13:45:19 -04:00
Mark Johnston
653a437c04 accept_filter: Fix filter parameter handling
For filters which implement accf_create, the setsockopt(2) handler
caches the filter name in the socket, but it also incorrectly frees the
buffer containing the copy, leaving a dangling pointer.  Note that no
accept filters provided in the base system are susceptible to this, as
they don't implement accf_create.

Reported by:	Alexey Kulaev <alex.qart@gmail.com>
Discussed with:	emaste
Security:	kernel use-after-free
MFC after:	3 days
Sponsored by:	The FreeBSD Foundation
2021-03-25 17:55:46 -04:00
Mark Johnston
ec8f1ea8d5 Generalize sanitizer interceptors for memory and string routines
Similar to commit 3ead60236f ("Generalize bus_space(9) and atomic(9)
sanitizer interceptors"), use a more generic scheme for interposing
sanitizer implementations of routines like memcpy().

No functional change intended.

MFC after:	1 month
Sponsored by:	The FreeBSD Foundation
2021-03-24 19:46:22 -04:00
Mark Johnston
3ead60236f Generalize bus_space(9) and atomic(9) sanitizer interceptors
Make it easy to define interceptors for new sanitizer runtimes, rather
than assuming KCSAN.  Lay a bit of groundwork for KASAN and KMSAN.

When a sanitizer is compiled in, atomic(9) and bus_space(9) definitions
in atomic_san.h are used by default instead of the inline
implementations in the platform's atomic.h.  These definitions are
implemented in the sanitizer runtime, which includes
machine/{atomic,bus}.h with SAN_RUNTIME defined to pull in the actual
implementations.

No functional change intended.

MFC after:	1 month
Sponsored by:	The FreeBSD Foundation
2021-03-22 22:21:53 -04:00
Adrian Chadd
25bfa44860 Add device and ifnet logging methods, similar to device_printf / if_printf
* device_printf() is effectively a printf
* if_printf() is effectively a LOG_INFO

This allows subsystems to log device/netif stuff using different log levels,
rather than having to invent their own way to prefix unit/netif  names.

Differential Revision: https://reviews.freebsd.org/D29320
Reviewed by: imp
2021-03-22 00:02:34 +00:00
Alex Richardson
6ceacebdf5 Unbreak MSG_CMSG_CLOEXEC
MSG_CMSG_CLOEXEC has not been working since 2015 (SVN r284380) because
_finstall expects O_CLOEXEC and not UF_EXCLOSE as the flags argument.
This was probably not noticed because we don't have a test for this flag
so this commit adds one. I found this problem because one of the
libwayland tests was failing.

Fixes:		ea31808c3b ("fd: move out actual fp installation to _finstall")
MFC after:	3 days
Reviewed By:	mjg, kib
Differential Revision: https://reviews.freebsd.org/D29328
2021-03-18 20:52:20 +00:00
Mateusz Guzik
e9272225e6 vfs: fix vnlru marker handling for filtered/unfiltered cases
The global list has a marker with an invariant that free vnodes are
placed somewhere past that. A caller which performs filtering (like ZFS)
can move said marker all the way to the end, across free vnodes which
don't match. Then a caller which does not perform filtering will fail to
find them. This makes vn_alloc_hard sleep for 1 second instead of
reclaiming, resulting in significant stalls.

Fix the problem by requiring an explicit marker by callers which do
filtering.

As a temporary measure extend vnlru_free to restart if it fails to
reclaim anything.

Big thanks go to the reporter for testing several iterations of the
patch.

Reported by:	Yamagi <lists yamagi.org>
Tested by:	Yamagi <lists yamagi.org>
Reviewed by:	kib
Differential Revision:	https://reviews.freebsd.org/D29324
2021-03-18 14:59:03 +00:00
Kyle Evans
f187d6dfbf base: remove if_wg(4) and associated utilities, manpage
After length decisions, we've decided that the if_wg(4) driver and
related work is not yet ready to live in the tree.  This driver has
larger security implications than many, and thus will be held to
more scrutiny than other drivers.

Please also see the related message sent to the freebsd-hackers@
and freebsd-arch@ lists by Kyle Evans <kevans@FreeBSD.org> on
2021/03/16, with the subject line "Removing WireGuard Support From Base"
for additional context.
2021-03-17 09:14:48 -05:00
Mark Johnston
4aa157dd5b link_elf_obj: Add a case missing from 5e6989ba4f
Fixes:		5e6989ba4f
MFC after:	3 days
Sponsored by:	The FreeBSD Foundation
2021-03-16 15:01:41 -04:00
Kyle Evans
74ae3f3e33 if_wg: import latest fixup work from the wireguard-freebsd project
This is the culmination of about a week of work from three developers to
fix a number of functional and security issues.  This patch consists of
work done by the following folks:

- Jason A. Donenfeld <Jason@zx2c4.com>
- Matt Dunwoodie <ncon@noconroy.net>
- Kyle Evans <kevans@FreeBSD.org>

Notable changes include:
- Packets are now correctly staged for processing once the handshake has
  completed, resulting in less packet loss in the interim.
- Various race conditions have been resolved, particularly w.r.t. socket
  and packet lifetime (panics)
- Various tests have been added to assure correct functionality and
  tooling conformance
- Many security issues have been addressed
- if_wg now maintains jail-friendly semantics: sockets are created in
  the interface's home vnet so that it can act as the sole network
  connection for a jail
- if_wg no longer fails to remove peer allowed-ips of 0.0.0.0/0
- if_wg now exports via ioctl a format that is future proof and
  complete.  It is additionally supported by the upstream
  wireguard-tools (which we plan to merge in to base soon)
- if_wg now conforms to the WireGuard protocol and is more closely
  aligned with security auditing guidelines

Note that the driver has been rebased away from using iflib.  iflib
poses a number of challenges for a cloned device trying to operate in a
vnet that are non-trivial to solve and adds complexity to the
implementation for little gain.

The crypto implementation that was previously added to the tree was a
super complex integration of what previously appeared in an old out of
tree Linux module, which has been reduced to crypto.c containing simple
boring reference implementations.  This is part of a near-to-mid term
goal to work with FreeBSD kernel crypto folks and take advantage of or
improve accelerated crypto already offered elsewhere.

There's additional test suite effort underway out-of-tree taking
advantage of the aforementioned jail-friendly semantics to test a number
of real-world topologies, based on netns.sh.

Also note that this is still a work in progress; work going further will
be much smaller in nature.

MFC after:	1 month (maybe)
2021-03-14 23:52:04 -05:00
Jonathan T. Looney
dbec10e088 Fetch the sigfastblock value in syscalls that wait for signals
We have seen several cases of processes which have become "stuck" in
kern_sigsuspend(). When this occurs, the kernel's td_sigblock_val
is set to 0x10 (one block outstanding) and the userspace copy of the
word is set to 0 (unblocked). Because the kernel's cached value
shows that signals are blocked, kern_sigsuspend() blocks almost all
signals, which means the process hangs indefinitely in sigsuspend().

It is not entirely clear what is causing this condition to occur.
However, it seems to make sense to add some protection against this
case by fetching the latest sigfastblock value from userspace for
syscalls which will sleep waiting for signals. Here, the change is
applied to kern_sigsuspend() and kern_sigtimedwait().

Reviewed by:	kib
Sponsored by:	Netflix
Differential Revision:	https://reviews.freebsd.org/D29225
2021-03-12 18:14:17 +00:00
John Baldwin
640d54045b Set TDP_KTHREAD before calling cpu_fork() and cpu_copy_thread().
This permits these routines to use special logic for initializing MD
kthread state.

For the kproc case, this required moving the logic to set these flags
from kproc_create() into do_fork().

Reviewed by:	kib
MFC after:	1 week
Sponsored by:	Netflix
Differential Revision:	https://reviews.freebsd.org/D29207
2021-03-12 09:48:20 -08:00
Konstantin Belousov
44691b33cc vlrureclaim: only skip vnode with resident pages if it own the pages
Nullfs vnode which shares vm_object and pages with the lower vnode should
not be exempt from the reclaim just because lower vnode cached a lot.
Their reclamation is actually very cheap and should be preferred over
real fs vnodes, but this change is already useful.

Reported and tested by:	pho
Reviewed by:	mckusick
Sponsored by:	The FreeBSD Foundation
MFC after:	2 weeks
Differential revision:	https://reviews.freebsd.org/D29178
2021-03-12 13:31:08 +02:00
Warner Losh
e52368365d config_intrhook: provide config_intrhook_drain
config_intrhook_drain will remove the hook from the list as
config_intrhook_disestablish does if the hook hasn't been called.  If it has,
config_intrhook_drain will wait for the hook to be disestablished in the normal
course (or expedited, it's up to the driver to decide how and when
to call config_intrhook_disestablish).

This is intended for removable devices that use config_intrhook and might be
attached early in boot, but that may be removed before the kernel can call the
config_intrhook or before it ends. To prevent all races, the detach routine will
need to call config_intrhook_train.

Sponsored by:		Netflix, Inc
Reviewed by:		jhb, mav, gde (in D29006 for man page)
Differential Revision:	https://reviews.freebsd.org/D29005
2021-03-11 09:45:10 -07:00
Hans Petter Selasky
6eb60f5b7f Use the word "LinuxKPI" instead of "Linux compatibility", to not confuse with
user-space Linux compatibility support. No functional change.

MFC after:	1 week
Sponsored by:	Mellanox Technologies // NVIDIA Networking
2021-03-10 12:35:16 +01:00
Kyle Evans
1ae20f7c70 kern: malloc: fix panic on M_WAITOK during THREAD_NO_SLEEPING()
Simple condition flip; we wanted to panic here after epoch_trace_list().

Reviewed by:	glebius, markj
MFC after:	3 days
Differential Revision:	https://reviews.freebsd.org/D29125
2021-03-09 05:16:39 -06:00
Warner Losh
88a5591203 config_intrhook: Move from TAILQ to STAILQ and padding
config_intrhook doesn't need to be a two-pointer TAILQ. We rarely add/delete
from this and so those need not be optimized. Instaed, use the one-pointer
STAILQ plus a uintptr_t to be used as a flags word. This will allow these
changes to be MFC'd to 12 and 13 to fix a race in removable devices.

Feedback from: jhb
Reviewed by: mav
Differential Revision:	https://reviews.freebsd.org/D29004
2021-03-08 15:59:00 -07:00
Mark Johnston
435c7cfb24 Rename _cscan_atomic.h and _cscan_bus.h to atomic_san.h and bus_san.h
Other kernel sanitizers (KMSAN, KASAN) require interceptors as well, so
put these in a more generic place as a step towards importing the other
sanitizers.

No functional change intended.

MFC after:	1 week
Sponsored by:	The FreeBSD Foundation
Differential Revision:	https://reviews.freebsd.org/D29103
2021-03-08 12:39:06 -05:00
Mark Johnston
7995dae9d3 posix timers: Improve the overrun calculation
timer_settime(2) may be used to configure a timeout in the past.  If
the timer is also periodic, we also try to compute the number of timer
overruns that occurred between the initial timeout and the time at which
the timer fired.  This is done in a loop which iterates once per period
between the initial timeout and now.  If the period is small and the
initial timeout was a long time ago, this loop can take forever to run,
so the system is effectively DOSed.

Replace the loop with a more direct calculation of
(now - initial timeout) / period to compute the number of overruns.

Reported by:	syzkaller
Reviewed by:	kib
MFC after:	1 week
Sponsored by:	The FreeBSD Foundation
Differential Revision:	https://reviews.freebsd.org/D29093
2021-03-08 12:39:06 -05:00
Mark Johnston
60d12ef952 posix timers: Sprinkle some style fixes
MFC after:	1 week
Sponsored by:	The FreeBSD Foundation
2021-03-08 12:39:06 -05:00
Mark Johnston
8ff2b41c05 posix timers: Declare unexported functions as static
MFC after:	1 week
Sponsored by:	The FreeBSD Foundation
2021-03-08 12:39:06 -05:00
Konstantin Belousov
56b9bee63a Make kern.timecounter.hardware tunable
Noted and reviewed by:	kevans
MFC after:	1 week
Sponsored by:	The FreeBSD Foundation
Differential revision:	https://reviews.freebsd.org/D29122
2021-03-08 03:48:21 +02:00
Hans Petter Selasky
c743a6bd4f Implement mallocarray_domainset(9) variant of mallocarray(9).
Reviewed by:	kib @
MFC after:	1 week
Sponsored by:	Mellanox Technologies // NVIDIA Networking
2021-03-06 11:38:55 +01:00
Konstantin Belousov
ead7697f04 Restore AT_RESOLVE_BENEATH support for funlinkat(2)/unlinkat(2).
MFC after:	1 week
Sponsored by:	The FreeBSD Foundation
2021-03-06 07:24:18 +02:00
Mark Johnston
89b650872b ktls: Hide initialization message behind bootverbose
We don't typically print anything when a subsystem initializes itself,
and KTLS is currently disabled by default anyway.

Reviewed by:	jhb
MFC after:	1 week
Sponsored by:	The FreeBSD Foundation
Differential Revision:	https://reviews.freebsd.org/D29097
2021-03-05 13:11:02 -05:00
Mark Johnston
5e6989ba4f link_elf_obj: Handle init_array sections in KLDs
Reuse existing handling for .ctors, print a warning if multiple
constructor sections are present.   Destructors are not handled as of
yet.

This is required for KASAN.

Reviewed by:	kib
MFC after:	2 weeks
Sponsored by:	The FreeBSD Foundation
Differential Revision:	https://reviews.freebsd.org/D29049
2021-03-04 10:07:10 -05:00
Mark Johnston
49f6925ca3 ktls: Cache output buffers for software encryption
Maintain a cache of physically contiguous runs of pages for use as
output buffers when software encryption is configured and in-place
encryption is not possible.  This makes allocation and free cheaper
since in the common case we avoid touching the vm_page structures for
the buffer, and fewer calls into UMA are needed.  gallatin@ reports a
~10% absolute decrease in CPU usage with sendfile/KTLS on a Xeon after
this change.

It is possible that we will not be able to allocate these buffers if
physical memory is fragmented.  To avoid frequently calling into the
physical memory allocator in this scenario, rate-limit allocation
attempts after a failure.  In the failure case we fall back to the old
behaviour of allocating a page at a time.

N.B.: this scheme could be simplified, either by simply using malloc()
and looking up the PAs of the pages backing the buffer, or by falling
back to page by page allocation and creating a mapping in the cache
zone.  This requires some way to save a mapping of an M_EXTPG page array
in the mbuf, though.  m_data is not really appropriate.  The second
approach may be possible by saving the mapping in the plinks union of
the first vm_page structure of the array, but this would force a vm_page
access when freeing an mbuf.

Reviewed by:	gallatin, jhb
Tested by:	gallatin
Sponsored by:	Ampere Computing
Submitted by:	Klara, Inc.
Differential Revision:	https://reviews.freebsd.org/D28556
2021-03-03 17:34:01 -05:00
Alan Somers
ab63da3564 Speed up geom_stats_resync in the presence of many devices
The old code had a O(n) loop, where n is the size of /dev/devstat.
Multiply that by another O(n) loop in devstat_mmap for a total of
O(n^2).

This change adds DIOCGMEDIASIZE support to /dev/devstat so userland can
quickly determine the right amount of memory to map, eliminating the
O(n) loop in userland.

This change decreases the time to run "gstat -bI0.001" with 16,384 md
devices from 29.7s to 4.2s.

Also, fix a memory leak first reported as PR 203097.

Sponsored by:	Axcient
Reviewed by:	mav, imp
MFC after:	2 weeks
Differential Revision:	https://reviews.freebsd.org/D28968
2021-03-02 18:33:45 -07:00
Robert Wing
0cc746f193 filt_fsevent: only record interested events
Respect filter-specific flags for the EVFILT_FS filter.

When a kevent is registered with the EVFILT_FS filter, it is always
triggered when an EVFILT_FS event occurs, regardless of the
filter-specific flags used. Fix that.

Reviewed by:	kib
Differential Revision:	https://reviews.freebsd.org/D28974
2021-03-02 14:19:22 -09:00
Konstantin Belousov
28cd3a673e O_RELATIVE_BENEATH: return ENOTCAPABLE instead of EINVAL for abs path
Requested and reviewed by:	markj
Tested by:	arichardson,  pho
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
Differential Revision:	https://reviews.freebsd.org/D28907
2021-03-02 20:21:40 +02:00
Konstantin Belousov
49c98a4bf3 nameicap_check_dotdot: trim tracker on check
Tracker should contain exactly the path from the starting directory to
the current lookup point. Otherwise we might not detect some cases of
dotdot escape. Consequently, if we are walking up the tree by dotdot
lookup, we must remove an entries below the walked directory.

Reviewed by:	markj
Tested by:	arichardson, pho
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
Differential revision:	https://reviews.freebsd.org/D28907
2021-03-02 20:21:35 +02:00
Konstantin Belousov
e8a2862aa0 Add nameicap_cleanup_from(), to clean tracker list starting from some element
Reviewed by:	markj
Tested by:	arichardson, pho
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
Differential revision:	https://reviews.freebsd.org/D28907
2021-03-02 20:21:30 +02:00
Konstantin Belousov
2388ad7c29 nameicap_tracker_add: avoid duplicates in the tracker list
Reviewed by:	markj
Tested by:	arichardson, pho
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
Differential revision:	https://reviews.freebsd.org/D28907
2021-03-02 20:21:23 +02:00
Konstantin Belousov
59e7494281 Do not call nameicap_tracker_add() for dotdot case.
Reviewed by:	markj
Tested by:	arichardson, pho
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
Differential revision:	https://reviews.freebsd.org/D28907
2021-03-02 20:21:14 +02:00
Konstantin Belousov
20e91ca36a open(2): Remove O_BENEATH and AT_BENEATH
with the reasoning that the flags did not worked properly, and were not
shipped in a release.

O_RESOLVE_BENEATH is kept as useful.

Reviewed by:	markj
Tested by:	arichardson, pho
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
Differential revision:	https://reviews.freebsd.org/D28907
2021-03-02 20:16:55 +02:00
Kyle Evans
60c4ec806d jail: allow root to implicitly widen its cpuset to attach
The default behavior for attaching processes to jails is that the jail's
cpuset augments the attaching processes, so that it cannot be used to
escalate a user's ability to take advantage of more CPUs than the
administrator wanted them to.

This is problematic when root needs to manage jails that have disjoint
sets with whatever process is attaching, as this would otherwise result
in a deadlock. Therefore, if we did not have an appropriate common
subset of cpus/domains for our new policy, we now allow the process to
simply take on the jail set *if* it has the privilege to widen its mask
anyways.

With the new logic, root can still usefully cpuset a process that
attaches to a jail with the desire of maintaining the set it was given
pre-attachment while still retaining the ability to manage child jails
without jumping through hoops.

A test has been added to demonstrate the issue; cpuset of a process
down to just the first CPU and attempting to attach to a jail without
access to any of the same CPUs previously resulted in EDEADLK and now
results in taking on the jail's mask for privileged users.

PR:		253724
Reviewed by:	jamie (also discussed with)
MFC after:	3 days
Differential Revision:	https://reviews.freebsd.org/D28952
2021-03-01 12:38:31 -06:00
Rick Macklem
a5f9fe2bab copy_file_range(2): Fix for small values of input file offset and len
r366302 broke copy_file_range(2) for small values of
input file offset and len.

It was possible for rem to be greater than len and then
"len - rem" was a large value, since both variables are
unsigned.

Reported by: koobs, Pablo <pablogsal gmail com> (Python)
Reviewed by:	asomers, koobs
MFC after:	3 days
Differential Revision:	https://reviews.freebsd.org/D28981
2021-03-01 06:31:10 -08:00
Konstantin Belousov
b5449c92b4 Use atomic_interrupt_fence() instead of bare __compiler_membar()
for the which which definitely use membar to sync with interrupt handlers.

libc and rtld uses of __compiler_membar() seems to want compiler barriers
proper.

The barrier in sched_unpin_lite() after td_pinned decrement seems to be not
needed and removed, instead of convertion.

Reviewed by:	markj
MFC after:	1 week
Sponsored by:	The FreeBSD Foundation
Differential Revision:	https://reviews.freebsd.org/D28956
2021-02-28 01:27:29 +02:00
Mateusz Guzik
1239a72221 cache: temporarily drop the assert that dvp != vp when adding an entry
Historically it was allowed for any names, but arguably should never be
even attempted. Allow it again since there is a release pending and
allowing it is bug-compatible with previous behavior.

Reported by:	otis
2021-02-27 22:29:50 +00:00
Jamie Gritton
589e4c1df4 jail: Add safety around prison_deref() flags.
do_jail_attach() now only uses the PD_XXX flags that refer to lock
status, so make sure that something else like PD_KILL doesn't slip
through.

Add a KASSERT() in prison_deref() to catch any further PD_KILL misuse.
2021-02-25 20:10:42 -08:00
Jamie Gritton
108a9384e9 jail: Fix locking on an early jail_set error.
I had locked allprison_lock without immediately setting PD_LIST_LOCKED.
2021-02-25 19:52:58 -08:00
John Baldwin
90972f0402 ktls: Use COUNTER_U64_DEFINE_EARLY for the ktls_toe_chacha20 counter.
I missed updating this counter when rebasing the changes in
9c64fc4029 after the switch to
COUNTER_U64_DEFINE_EARLY in 1755b2b989.

Fixes:		9c64fc4029 Add Chacha20-Poly1305 as a KTLS cipher suite.
Sponsored by:	Netflix
2021-02-25 15:00:13 -08:00
Ryan Libby
d7671ad8d6 Close races in vm object chain traversal for unlock
We were unlocking the vm object before reading the backing_object field.
In the meantime, the object could be freed and reused.  This could cause
us to go off the rails in the object chain traversal, failing to unlock
the rest of the objects in the original chain and corrupting the lock
state of the victim chain.

Reviewed by:	bdrewery, kib, markj, vangyzen
MFC after:	3 days
Sponsored by:	Dell EMC Isilon
Differential Revision:	https://reviews.freebsd.org/D28926
2021-02-25 12:11:19 -08:00
Edward Tomasz Napierala
e7a5b3bd05 Modify lock_delay() to increase the delay time after spinning
Modify lock_delay() to increase the delay time after spinning,
not before.  Previously we would spin at least twice instead of once.
In NetApp's benchmarks this fixes a performance regression compared
to FreeBSD 10, which called cpu_spinwait() directly.

Reviewed By:	mjg
Sponsored by:	NetApp, Inc.
Sponsored by:	Klara, Inc.
Differential Revision:	https://reviews.freebsd.org/D27331
2021-02-25 18:55:26 +00:00
Mark Johnston
369706a6f8 buf: Fix the dirtybufthresh check
dirtybufthresh is a watermark, slightly below the high watermark for
dirty buffers.  When a delayed write is issued, the dirtying thread will
start flushing buffers if the dirtybufthresh watermark is reached.  This
helps ensure that the high watermark is not reached, otherwise
performance will degrade as clustering and other optimizations are
disabled (see buf_dirty_count_severe()).

When the buffer cache was partitioned into "domains", the dirtybufthresh
threshold checks were not updated.  Fix this.

Reported by:	Shrikanth R Kamath <kshrikanth@juniper.net>
Reviewed by:	rlibby, mckusick, kib, bdrewery
Sponsored by:	Juniper Networks, Inc., Klara, Inc.
Fixes:		3cec5c77d6
MFC after:	1 week
Differential Revision:	https://reviews.freebsd.org/D28901
2021-02-25 10:04:44 -05:00
Mark Johnston
faa998f6ff sendfile: Use the pager size to determine the file extent when possible
Previously sendfile would issue a VOP_GETATTR and use the returned size,
i.e., the file size.  When paging in file data, sendfile_swapin() will
use the pager to determine whether it needs to zero-fill, most often
because of a hole in a sparse file.  An attempt to page in beyond the
end of a file is treated this way, and occurs when the requested page is
past the end of the pager.  In other words, both the file size and pager
size were used interchangeably.

With ZFS, updates to the pager and file sizes are not synchronized by
the exclusive vnode lock, at least partially due to its use of
MNTK_SHARED_WRITES.  In particular, the pager size is updated after the
file size, so in the presence of a writer concurrently extending the
file, sendfile could incorrectly instantiate "holes" in the page cache
pages backing the file, which manifests as data corruption when reading
the file back from the page cache.  The on-disk copy is unaffected.

Fix this by consistently using the pager size when available.

Reported by:	dumbbell
Reviewed by:	chs, kib
Tested by:	dumbbell, pho
MFC after:	1 week
Sponsored by:	The FreeBSD Foundation
Differential Revision:	https://reviews.freebsd.org/D28811
2021-02-25 10:04:44 -05:00
Jamie Gritton
c861373bdf jail: re-commit 811e27fa3c with fixes
Make sure PD_KILL isn't passed to do_jail_attach, where it might end
up trying to kill the caller's prison (even prison0).

Fix the child jail loop in prison_deref_kill, which was doing the
post-order part during the pre-order part.  That's not a system-
killer, but make jails not always die correctly.
2021-02-24 21:54:49 -08:00
Jamie Gritton
ddfffb41a2 jail: back out 811e27fa3c until it doesn't break Jenkins
Reported by:	arichardson
2021-02-24 21:10:47 -08:00
Mark Johnston
1d44514fcd rmlock: Add a required compiler membar to the rlock slow path
The tracker flags need to be loaded only after the tracker is removed
from its per-CPU queue.  Otherwise, readers may fail to synchronize with
pending writers attempting to propagate priority to active readers, and
readers and writers deadlock on each other.  This was observed in a
stable/12-based armv7 kernel where the compiler had reordered the load
of rmp_flags to before the stores updating the queue.

Reviewed by:	rlibby, scottl
Discussed with:	kib
Sponsored by:	Rubicon Communications, LLC ("Netgate")
MFC after:	1 week
Differential Revision:	https://reviews.freebsd.org/D28821
2021-02-23 21:17:12 -05:00
Alex Richardson
fa32350347 close_range: add audit support
This fixes the closefrom test in sys/audit.

Includes cherry-picks of the following commits from openbsm:

4dfc628aaf
99ff6fe32a
da48a0399e

Reviewed By:	kevans
Differential Revision: https://reviews.freebsd.org/D28388
2021-02-23 17:47:07 +00:00
Jamie Gritton
0a2a96f35a jail: Don't allow jails under dying parents
If a jail is created with jail_set(...JAIL_DYING), and it has a parent
currently in a dying state, that will bring the parent jail back to
life.  Restrict that to require that the parent itself be explicitly
brought back first, and not implicitly created along with the new
child jail.

Differential Revision:	https://reviews.freebsd.org/D28515
2021-02-22 17:04:06 -08:00
Jamie Gritton
701d6b50ae jail: Fix a LOR introduced in 1158508a80 2021-02-22 15:51:10 -08:00
Jamie Gritton
811e27fa3c jail: Add PD_KILL to remove a prison in prison_deref().
Add the PD_KILL flag that instructs prison_deref() to take steps
to actively kill a prison and its descendents, namely marking it
PRISON_STATE_DYING, clearing its PR_PERSIST flag, and killing any
attached processes.

This replaces a similar loop in sys_jail_remove(), bringing the
operation under the same single hold on allprison_lock that it already
has. It is also used to clean up failed jail (re-)creations in
kern_jail_set(), which didn't generally take all the proper steps.

Differential Revision:  https://reviews.freebsd.org/D28473
2021-02-22 12:27:44 -08:00
Mark Johnston
608c44f96e m_uiotombuf_nomap(): Stop clearing PG_ZERO in newly allocated pages
The caller should not be passing M_ZERO in the first place, so PG_ZERO
will not be preserved by the page allocator and clearing it accomplishes
nothing.

Reviewed by:	gallatin, jhb
MFC after:	1 week
Sponsored by:	The FreeBSD Foundation
Differential Revision:	https://reviews.freebsd.org/D28808
2021-02-22 10:04:46 -05:00
Jamie Gritton
1158508a80 jail: Add pr_state to struct prison
Rather that using references (pr_ref and pr_uref) to deduce the state
of a prison, keep track of its state explicitly.  A prison is either
"invalid" (pr_ref == 0), "alive" (pr_uref > 0) or "dying"
(pr_uref == 0).

State transitions are generally tied to the reference counts, but with
some flexibility: a new prison is "invalid" even though it now starts
with a reference, and jail_remove(2) sets the state to "dying" before
the user reference count drops to zero (which was prviously
accomplished via the PR_REMOVE flag).

pr_state is protected by both the prison mutex and allprison_lock, so
it has the same availablity guarantees as the reference counts do.

Differential Revision:	https://reviews.freebsd.org/D27876
2021-02-21 13:24:47 -08:00
Mateusz Guzik
ee9b37ae5c jail: fix build after the previous commit
Noted by: Michael Butler <imb protected-networks.net>
2021-02-21 21:05:25 +00:00
Jamie Gritton
f7496dcab0 jail: Change the locking around pr_ref and pr_uref
Require both the prison mutex and allprison_lock when pr_ref or
pr_uref go to/from zero.  Adding a non-first or removing a non-last
reference remain lock-free.  This means that a shared hold on
allprison_lock is sufficient for prison_isalive() to be useful, which
removes a number of cases of lock/check/unlock on the prison mutex.

Expand the locking in kern_jail_set() to keep allprison_lock held
exclusive until the new prison is valid, thus making invalid prisons
invisible to any thread holding allprison_lock (except of course the
one creating or destroying the prison).  This renders prison_isvalid()
nearly redundant, now used only in asserts.

Differential Revision:	https://reviews.freebsd.org/D28419
Differential Revision:	https://reviews.freebsd.org/D28458
2021-02-21 10:55:44 -08:00
Konstantin Belousov
2bfd8992c7 vnode: move write cluster support data to inodes.
The data is only needed by filesystems that
1. use buffer cache
2. utilize clustering write support.

Requested by:	mjg
Reviewed by:	asomers (previous version), fsu (ext2 parts), mckusick
Tested by:	pho
Sponsored by:	The FreeBSD Foundation
Differential Revision:	https://reviews.freebsd.org/D28679
2021-02-21 11:38:21 +02:00
Konstantin Belousov
750ea20d3f Delete dead CLUSTERDEBUG config option.
Reviewed by:	mckusick
Tested by:	pho
MFC after:	1 week
Sponsored by:	The FreeBSD Foundation
Differential Revision:	https://reviews.freebsd.org/D28679
2021-02-21 11:38:21 +02:00
Mateusz Guzik
81174cd8e2 vfs: employ vfs_ref_from_vp in statfs and fstatfs
Avoids locking and unlocking the vnode.

Reviewed by:	kib
Differential Revision:	https://reviews.freebsd.org/D28695
2021-02-21 00:43:05 +00:00
Mateusz Guzik
a15f787adb vfs: add vfs_ref_from_vp
This generalizes what vop_stdgetwritemount used to be doing.

Reviewed by:	kib
Differential Revision:	https://reviews.freebsd.org/D28695
2021-02-21 00:43:05 +00:00
Jamie Gritton
6e1d1bfcac jail: Improve locking when removing prisons
Change the flow of prison_deref() so it doesn't let go of allprison_lock
until it's completely done using it (except for a possible drop as part
of an upgrade on its first try).

Differential Revision:	https://reviews.freebsd.org/D28458
MFC after:	3 days
2021-02-20 14:38:58 -08:00
Jamie Gritton
d4380c0cdd jail: Change both root and working directories in jail_attach(2)
jail_attach(2) performs an internal chroot operation, leaving it up to
the calling process to assure the working directory is inside the jail.

Add a matching internal chdir operation to the jail's root.  Also
ignore kern.chroot_allow_open_directories, and always disallow the
operation if there are any directory descriptors open.

Reported by:    mjg
Approved by:    markj, kib
MFC after:      3 days
2021-02-19 14:13:35 -08:00
John Baldwin
9c64fc4029 Add Chacha20-Poly1305 as a KTLS cipher suite.
Chacha20-Poly1305 for TLS is an AEAD cipher suite for both TLS 1.2 and
TLS 1.3 (RFCs 7905 and 8446).  For both versions, Chacha20 uses the
server and client IVs as implicit nonces xored with the record
sequence number to generate the per-record nonce matching the
construction used with AES-GCM for TLS 1.3.

Reviewed by:	gallatin
Sponsored by:	Netflix
Differential Revision:	https://reviews.freebsd.org/D27839
2021-02-18 09:26:32 -08:00
Thomas Skibo
9976b42b69 ddb: fix show devmap output on 32-bit arm
The output has been broken since 1b6dd6d772. Casting to uintmax_t
before the call to printf is necessary to ensure that 32-bit addresses
are interpreted correctly.

PR:		243236
MFC after:	3 days
2021-02-18 11:53:14 -04:00
Konstantin Belousov
662283b108 vn_printf: handle VI_FOPENING
Noted by:	mjg
Sponsored by:	The FreeBSD Foundation
MFC after:	6 days
Fixes:	fa3bd463ce
2021-02-18 16:28:28 +02:00
Alex Richardson
fa2528ac64 Use atomic loads/stores when updating td->td_state
KCSAN complains about racy accesses in the locking code. Those races are
fine since they are inside a TD_SET_RUNNING() loop that expects the value
to be changed by another CPU.

Use relaxed atomic stores/loads to indicate that this variable can be
written/read by multiple CPUs at the same time. This will also prevent
the compiler from doing unexpected re-ordering.

Reported by:	GENERIC-KCSAN
Test Plan:	KCSAN no longer complains, kernel still runs fine.
Reviewed By:	markj, mjg (earlier version)
Differential Revision: https://reviews.freebsd.org/D28569
2021-02-18 14:02:48 +00:00
Konstantin Belousov
fa3bd463ce lockf: ensure atomicity of lockf for open(O_CREAT|O_EXCL|O_EXLOCK)
or EX_SHLOCK.  Do it by setting a vnode iflag indicating that the locking
exclusive open is in progress, and not allowing F_LOCK request to make
a progress until the first open finishes.

Requested by:	mckusick
Reviewed by:	markj, mckusick
Tested by:	pho
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
Differential Revision:	https://reviews.freebsd.org/D28697
2021-02-18 01:22:05 +02:00
Warner Losh
00065c7630 Giant: move back Giant removal until 14
Update the Giant Lock warning message to FreeBSD 14. It's growing increasling
clear that this won't be done before 13.0.

MFC: Insta (re@'s request)
2021-02-17 14:33:09 -07:00
Jamie Gritton
cc7b730653 jail: Handle a possible race between jail_remove(2) and fork(2)
jail_remove(2) includes a loop that sends SIGKILL to all processes
in a jail, but skips processes in PRS_NEW state.  Thus it is possible
the a process in mid-fork(2) during jail removal can survive the jail
being removed.

Add a prison flag PR_REMOVE, which is checked before the new process
returns.  If the jail is being removed, the process will then exit.
Also check this flag in jail_attach(2) which has a similar issue.

Reported by:    trasz
Approved by:    kib
MFC after:      3 days
2021-02-16 11:19:13 -08:00
Konstantin Belousov
c61fae1475 pgcache read: protect against reads past end of the vm object size
If uio_offset is past end of the object size, calculated resid is negative.
Delegate handling this case to the locked read, as any other non-trivial
situation.

PR:	253158
Reported by:	Harald Schmalzbauer <bugzilla.freebsd@omnilan.de>
Tested by:	cy
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
2021-02-16 07:09:37 +02:00
Alex Richardson
0482d7c9e9 Fix fget_only_user() to return ENOTCAPABLE on a failed capsicum check
After eaad8d1303 four additional
capsicum-test tests started failing. It turns out this is because
fget_only_user() was returning EBADF on a failed capsicum check instead
of forwarding the return value of cap_check_inline() like
fget_unlocked_seq().

capsicum-test failures before this:
```
[  FAILED  ] 7 tests, listed below:
[  FAILED  ] Capability.OperationsForked
[  FAILED  ] Capability.NoBypassDAC
[  FAILED  ] Pdfork.OtherUserForked
[  FAILED  ] PipePdfork.WildcardWait
[  FAILED  ] OpenatTest.WithFlag
[  FAILED  ] ForkedOpenatTest_WithFlagInCapabilityMode._
[  FAILED  ] Select.LotsOFileDescriptorsForked
```
After:
```
[  FAILED  ] 3 tests, listed below:
[  FAILED  ] Capability.NoBypassDAC
[  FAILED  ] Pdfork.OtherUserForked
[  FAILED  ] PipePdfork.WildcardWait
```

Reviewed By:	mjg
MFC after:	1 week
Differential Revision: https://reviews.freebsd.org/D28691
2021-02-15 22:55:12 +00:00
Jason A. Harmening
41032835dc Fix divide-by-zero panic when ASLR is enabled and superpages disabled
When locating the anonymous memory region for a vm_map with ASLR
enabled, we try to keep the slid base address aligned on a superpage
boundary to minimize pagetable fragmentation and maximize the potential
usage of superpage mappings.  We can't (portably) do this if superpages
have been disabled by loader tunable and pagesizes[1] is 0, and it
would be less beneficial in that case anyway.

PR:		253511
Reported by:	johannes@jo-t.de
MFC after:	1 week
Reviewed by:	kib
Differential Revision:	https://reviews.freebsd.org/D28678
2021-02-15 10:38:04 -08:00
Mateusz Guzik
eac22dd480 lockmgr: shrink struct lock by 8 bytes on LP64
Currently the struct has a 4 byte padding stemming from 3 ints.

1. prio comfortably fits in short, unfortunately there is no dedicated
   type for it and plumbing it throughout the codebase is not worth it
   right now, instead an assert is added which covers also flags for
   safety
2. lk_exslpfail can in principle exceed u_short, but the count is
   already not considered reliable and it only ever gets modified
   straight to 0. In other words it can be incrementing with an upper
   bound of USHRT_MAX

With these in place struct lock shrinks from 48 to 40 bytes.

Reviewed by:	kib (previous version)
Differential Revision:	https://reviews.freebsd.org/D28680
2021-02-15 13:57:25 +00:00
Konstantin Belousov
25c6318c79 procstat: distinguish vm map guards in procstat vm output.
Requested and reviewed by:	rwatson (previous version)
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
Differential Revision:	https://reviews.freebsd.org/D28658
2021-02-14 03:24:58 +02:00
Konstantin Belousov
adf28ab456 fifo: minor comment and assert improvements.
In particular, replace a note that reload through vget() is obsoleted,
with explanation why this code is required.

Reviewed by:	chs, mckusick
Tested by:	pho
MFC after:	2 weeks
Sponsored by:	The FreeBSD Foundation
2021-02-12 03:02:22 +02:00
Konstantin Belousov
b59a8e63d6 Stop ignoring ERELOOKUP from VOP_INACTIVE()
When possible, relock the vnode and retry inactivation.  Only vunref() is
required not to drop the vnode lock, so handle it specially by not retrying.

This is a part of the efforts to ensure that unlinked not referenced vnode
does not prevent inode from reusing.

Reviewed by:	chs, mckusick
Tested by:	pho
MFC after:	2 weeks
Sponsored by:	The FreeBSD Foundation
2021-02-12 03:02:21 +02:00
Konstantin Belousov
3b2aa36024 Use VOP_VPUT_PAIR() for eligible VFS syscalls.
The current list is limited to the cases where UFS needs to handle
vput(dvp) specially. Which means VOP_CREATE(), VOP_MKDIR(), VOP_MKNOD(),
VOP_LINK(), and VOP_SYMLINK().

Reviewed by:	chs, mkcusick
Tested by:	pho
MFC after:	2 weeks
Sponsored by:	The FreeBSD Foundation
2021-02-12 03:02:20 +02:00
Konstantin Belousov
49c117c193 Add VOP_VPUT_PAIR() with trivial default implementation.
The VOP is intended to be used in situations where VFS has two
referenced locked vnodes, typically a directory vnode dvp and a vnode
vp that is linked from the directory, and at least dvp is vput(9)ed.
The child vnode can be also vput-ed, but optionally left referenced and
locked.

There, at least UFS may need to do some actions with dvp which cannot be
done while vp is also locked, so its lock might be dropped temporary.
For instance, in some cases UFS needs to sync dvp to avoid filesystem
state that is currently not handled by either kernel nor fsck. Having
such VOP provides the neccessary context for filesystem which can do
correct locking and handle potential reclamation of vp after relock.

Trivial implementation does vput(dvp) and optionally vput(vp).

Reviewed by:	chs, mckusick
Tested by:	pho
MFC after:	2 weeks
Sponsored by:	The FreeBSD Foundation
2021-02-12 03:02:20 +02:00
Konstantin Belousov
ee965dfa64 vn_open(): If the vnode is reclaimed during open(2), do not return error.
Most future operations on the returned file descriptor will fail
anyway, and application should be ready to handle that failures.  Not
forcing it to understand the transient failure mode on open, which is
implementation-specific, should make us less special without loss of
reporting of errors.

Suggested by: chs
Reviewed by:	chs, mckusick
Tested by:	pho
MFC after:	2 weeks
Sponsored by:	The FreeBSD Foundation
2021-02-12 03:02:20 +02:00
Konstantin Belousov
bf0db19339 buf SU hooks: track buf_start() calls with B_IOSTARTED flag
and only call buf_complete() if previously started.  Some error paths,
like CoW failire, might skip buf_start() and do bufdone(), which itself
call buf_complete().

Various SU handle_written_XXX() functions check that io was started
and incomplete parts of the buffer data reverted before restoring them.
This is a useful invariant that B_IO_STARTED on buffer layer allows to
keep instead of changing check and panic into check and return.

Reported by:	pho
Reviewed by:	chs, mckusick
Tested by:	pho
MFC after:	2 weeks
Sponsored by:	The FreeBSD Foundations
2021-02-12 03:02:19 +02:00
Mateusz Guzik
39e0c3f686 cache: assorted comment fixups 2021-02-09 17:09:44 +01:00
Kyle Evans
504ebd612e kern: sonewconn: set so_options before pru_attach()
Protocol attachment has historically been able to observe and modify
so->so_options as needed, and it still can for newly created sockets.
779f106aa1 moved this to after pru_attach() when we re-acquire the
lock on the listening socket.

Restore the historical behavior so that pru_attach implementations can
consistently use it. Note that some pru_attach() do currently rely on
this, though that may change in the future. D28265 contains a change to
remove the use in TCP and IB/SDP bits, as resetting the requested linger
time on incoming connections seems questionable at best.

This does move the assignment out from under the head's listen lock, but
glebius notes that head won't be going away and applications cannot
assume any specific ordering with a race between a connection coming in
and the application changing socket options anyways.

Discussed-with:	glebius
MFC-after:	1 week
2021-02-08 21:44:43 -06:00
Alexander V. Chernikov
924d1c9a05 Revert "SO_RERROR indicates that receive buffer overflows should be handled as errors."
Wrong version of the change was pushed inadvertenly.

This reverts commit 4a01b854ca.
2021-02-08 22:32:32 +00:00
Alexander V. Chernikov
4a01b854ca SO_RERROR indicates that receive buffer overflows should be handled as errors.
Historically receive buffer overflows have been ignored and programs
could not tell if they missed messages or messages had been truncated
because of overflows. Since programs historically do not expect to get
receive overflow errors, this behavior is not the default.

This is really really important for programs that use route(4) to keep in sync
with the system. If we loose a message then we need to reload the full system
state, otherwise the behaviour from that point is undefined and can lead
to chasing bogus bug reports.
2021-02-08 21:42:20 +00:00
Mark Johnston
b5aa9ad43a ktls: Make configuration sysctls available as tunables
Reviewed by:	gallatin, jhb
Sponsored by:	Ampere Computing
Submitted by:	Klara, Inc.
MFC after:	1 week
Differential Revision:	https://reviews.freebsd.org/D28499
2021-02-08 09:19:02 -05:00
Mark Johnston
1755b2b989 ktls: Use COUNTER_U64_DEFINE_EARLY
This makes it a bit more straightforward to add new counters when
debugging.  No functional change intended.

Reviewed by:	jhb
Sponsored by:	Ampere Computing
Submitted by:	Klara, Inc.
MFC after:	1 week
Differential Revision:	https://reviews.freebsd.org/D28498
2021-02-08 09:18:51 -05:00
Mateusz Guzik
2f8a844635 cache: remove the largely obsolete general description
Examples of inconsistencies with the current state:
- references LRU of all entries, removed years ago
- references a non-existent lock (neglist)
- claims negative entries have a NULL target

It will be replaced with a more accurate and more informative
description.

In the meantime take it out so it stops misleading.
2021-02-06 00:28:40 +01:00
Mateusz Guzik
0e1594e60e cache: fix vfs:namecache:lookup:miss probe call sites 2021-02-06 00:28:40 +01:00
Mateusz Guzik
2e96132a7d cache: drop spurious arg from panic in cache_validate
vp is already reported when noting mismatch
2021-02-06 00:28:39 +01:00
Mateusz Guzik
b54ed778fe cache: comment on FNV 2021-02-06 00:13:57 +01:00
Ed Maste
edc374e7c4 Correct description for kern.proc.proc_td
kern.proc.proc_td returns the process table with an entry for each
thread.  Previously the description included "no threads", presumably
a cut-and-pasteo in 2648efa621.

Description suggested by PauAmma.

PR:		253146
MFC after:	3 days
Sponsored by:	The FreeBSD Foundation
2021-02-02 17:00:05 -05:00
Mateusz Guzik
45456abc4c cache: fix trailing slash support in face of permission problems
Reported by:	Johan Hendriks <joh.hendriks gmail.com>
Tested by:	kevans
2021-02-02 18:13:51 +00:00