Commit Graph

19424 Commits

Author SHA1 Message Date
Doug Rabson
71e9be1bd5 Don't allow stacking of file mounts
Reviewed by:    mjg, kib
Tested by:      pho
2022-12-19 16:46:27 +00:00
Doug Rabson
a1d74b2dab Allow realpath to work for file mounts
For file mounts, the directory vnode is not available from namei and this
prevents the use of vn_fullpath_hardlink. In this case, we can use the
vnode which was covered by the file mount with vn_fullpath.

This also disallows file mounts over files with link counts greater than
one to ensure a deterministic path to the mount point.

Reviewed by:    mjg, kib
Tested by:      pho
2022-12-19 16:46:27 +00:00
Doug Rabson
521fbb722c Add support for mounting single files in nullfs
The main use-case for this is to support mounting config files and
secrets into OCI containers. My current workaround copies the files into
the container which is messy and risks secrets leaking into container
images if the cleanup fails.

This adds a VFCF flag to indicate whether the filesystem supports file
mounts and allows fspath to be either a directory or a file if the flag
is set.

Test Plan:
$ sudo mkdir -p /mnt
$ sudo touch /mnt/foo
$ sudo mount -t nullfs /COPYRIGHT /mnt/foo

Reviewed by:    mjg, kib
Tested by:      pho
2022-12-19 16:46:13 +00:00
Doug Rabson
78d35459a2 Add vn_path_to_global_path_hardlink
This is similar to vn_path_to_global_path but allows for regular files
which may not be present in the cache.

Reviewed by:    mjg, kib
Tested by:      pho
2022-12-19 16:44:59 +00:00
Mateusz Guzik
8f7859e800 vfs: retire the now unused SAVESTART flag
Bump __FreeBSD_version to 1400075

Tested by:      pho
2022-12-19 08:11:08 +00:00
Mateusz Guzik
56da4aa554 vfs: stop using SAVESTART for rename
ni_startdir has never reached rename routines anyway

Reviewed by:	mckusick
Tested by:	pho
Differential Revision:	https://reviews.freebsd.org/D34468
2022-12-19 08:09:37 +00:00
Mateusz Guzik
8f874e92eb vfs: make relookup take an additional argument
instead of looking at SAVESTART

This is a step towards removing the flag.

Reviewed by:	mckusick
Tested by:	pho
Differential Revision:	https://reviews.freebsd.org/D34468
2022-12-19 08:09:00 +00:00
Mateusz Guzik
269c564b90 vfs: retire NDFREE
There are no consumers anymore. Interested parties can NDFREE_PNBUF
and vput or vrele relevant vnodes.

Tested by:	pho
2022-12-19 08:07:54 +00:00
Mateusz Guzik
85dac03e30 vfs: stop using NDFREE
It provides nothing but a branchfest and next to no consumers want it
anyway.

Tested by:	pho
2022-12-19 08:07:23 +00:00
Rick Macklem
bba7a2e896 kern_jail.c: Allow mountd/nfsd to optionally run in a jail
This patch adds "allow.nfsd" to the jail code based on a
new kernel build option VNET_NFSD.  This will not work
until future patches fix nmount(2) to allow mountd to
run in a vnet prison and the NFS server code is patched
so that global variables are in a vnet.

The jail(8) man page will be patched in a future commit.

Reviewed by:	jamie
MFC after:	4 months
Differential Revision:	https://reviews.freebsd.org/D37637
2022-12-17 13:43:49 -08:00
Rick Macklem
195f1b124d vfs_mount.c: fix vfs_domount() for PRIV_VFS_MOUNT_EXPORTED
It appears that, prior to r158857 vfs_domount() checked
suser() when MNT_EXPORTED was specified.

r158857 appears to have broken this, since MNT_EXPORTED
was no longer set when mountd.c was converted to use nmount(2).
r164033 replaced the suser() check with
priv_check(td, PRIV_VFS_MOUNT_EXPORTED), which does the
same thing (ie. checks for effective uid == 0 assuming suses_enabled
is set).

This patch restores this check by setting MNT_EXPORTED when the
"export" mount option is specified to nmount().

I think this is reasonable since only mountd(8) should be setting
exports and I doubt any non-root mounted file system would
be setting its own exports.

Reviewed by:	kib, markj
MFC after:	2 weeks
Differential Revision:	https://reviews.freebsd.org/D37718
2022-12-16 13:01:23 -08:00
John Baldwin
69542f2682 ktls: Close a race with setting so_error when dropping a connection.
pr_abort calls tcp_usr_abort which calls tcp_drop with ECONNABORTED.
After pr_abort returns, the so_error is then set to a more specific
error.  However, a reader can observe and return the ECONNABORTED
error before so_error is set to the desired error value.  This is
resulting in spurious test failures of recently added tests for
invalid conditions such as invalid headers.

To fix, refactor the code to abort a connection to call tcp_drop
directly with the desired error value.  ktls_reset_send_tag already
calls tcp_drop directly when it aborts a connection due to an error.

Reviewed by:	gallatin
Reported by:	CI (jenkins), gallatin, olivier
Sponsored by:	Chelsio Communications
Differential Revision:	https://reviews.freebsd.org/D37692
2022-12-15 12:06:26 -08:00
Andrew Gallatin
ac4e3a27ab Unbreak the build when MAC is not defined
7a2c93b86e removed the use of "error" when MAC was not
defined, resulting in an unused variable error.

Sponsored by: Netflix
Reviewed by: jhb
2022-12-14 17:39:25 -05:00
Gleb Smirnoff
7a2c93b86e sockets: provide sousrsend() that does socket specific error handling
Sockets have special handling for EPIPE on a write, that was spread out
into several places.  Treating transient errors is also special - if
protocol is atomic, than we should ignore any changes to uio_resid, a
transient error means the write had completely failed (see d2b3a0ed31).

- Provide sousrsend() that expects a valid uio, and leave sosend() for
  kernel consumers only.  Do all special error handling right here.
- In dofilewrite() don't do special handling of error for DTYPE_SOCKET.
- For send(2), write(2) and aio_write(2) call into sousrsend() and remove
  error handling for kern_sendit(), soo_write() and soaio_process_job().

PR:			265087
Reported by:            rz-rpi03 at h-ka.de
Reviewed by:            markj
Differential revision:	https://reviews.freebsd.org/D35863
2022-12-14 10:02:44 -08:00
Jason A. Harmening
42442d7a6e Generalize the VV_CROSSLOCK logic in vfs_lookup()
When VV_CROSSLOCK is present, the lock for the vnode at the current
stage of lookup must be held across the VFS_ROOT() call for the
filesystem mounted at the vnode.  Since VV_CROSSLOCK implies that
the root vnode reuses the already-held lock, the possibility for
recursion should be made clear in the flags passed to VFS_ROOT().

For cases in which the lock is held exclusive, this means passing
LK_CANRECURSE.  For cases in which the lock is held shared, it
means clearing LK_NODDLKTREAT to allow VFS_ROOT() to potentially
recurse on the shared lock even in the presence of an exclusive
waiter.

That the existing code works for unionfs is due to a coincidence
of the current unionfs implementation.

Reviewed by:	kib
Tested by:	pho
Differential Revision:	https://reviews.freebsd.org/D37458
2022-12-10 22:02:38 -06:00
Mateusz Guzik
ebdf27b6f3 uipc: remove accept_mtx
It is unused since 779f106aa1 ("Listening sockets improvements.")

Sponsored by:	Rubicon Communications, LLC ("Netgate")
2022-12-11 02:47:07 +00:00
Konstantin Belousov
0919f29d91 shmfd: account for the actually allocated pages
Return the value as stat(2) st_blocks.

Suggested and reviewed by:	markj (previous version)
Tested by:	pho
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
Differential revision:	https://reviews.freebsd.org/D37097
2022-12-09 14:17:12 +02:00
Konstantin Belousov
37aea2649f tmpfs: for used pages, account really allocated pages, instead of file sizes
This makes tmpfs size accounting correct for the sparce files. Also
correct report st_blocks/va_bytes. Previously the reported value did not
accounted for the swapped out pages.

PR:	223015
Reviewed by:	markj
Tested by:	pho
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
Differential revision:	https://reviews.freebsd.org/D37097
2022-12-09 14:17:12 +02:00
Konstantin Belousov
7ec4b29b08 uiomove_object: hide diagnostic under bootverbose
Reviewed by:	markj
Tested by:	pho
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
Differential revision:	https://reviews.freebsd.org/D37097
2022-12-09 14:15:37 +02:00
Doug Rabson
5eeb4f737f imgact_binmisc: Optionally pre-open the interpreter vnode
This allows the use of chroot and/or jail environments which depend on
interpreters registed with imgact_binmisc to use emulator binaries from
the host to emulate programs inside the chroot.

Reviewed by:	imp
MFC after:	2 weeks
Differential Revision: https://reviews.freebsd.org/D37432
2022-12-08 14:32:03 +00:00
Warner Losh
7014e78fb7 boot: Remove stray free()
Early versions of this code had a free, but this one doesn't need
it. Remove the forgotten free(vv); from earlier versions.

Fixes:			ed56dcfc6b
Noticed by:		Michael Butler
Sponsored by:		Netflix
2022-12-07 11:30:04 -07:00
Warner Losh
ed56dcfc6b boot: pass in args as const
Copy the arg that sets a variable to maximize the reuse of this
routine. There are places we call it from that are const char * and it
might not be safe to cast that away.

Sponsored by:		Netflix
2022-12-07 11:00:54 -07:00
Warner Losh
3cf97e91fa Revert "newbus: Change attach failure behavior"
This reverts commit 68c3f03021. There are
some weird crashes when KVMs switch caused by this, so revert this
commit until they are sorted out.

Reported by:		cy@
Sponsored by:		Netflix
2022-12-05 17:00:26 -07:00
Warner Losh
68c3f03021 newbus: Change attach failure behavior
In the rare case that we succeed in probing, but fail to attach, flip
the default to be to disable the
device. hw.bus.disable_failed_devices=false is no required to restore
the old behavior. The old behavior dates form a time when dynamic
control of devices wasn't yet present (devctl didn't exist). Now that
one can retry probe/attach the device with devctl, the default doesn't
make sense: The more desirable behaivor is to have stable device numbers
when one has several instances of the same device in a system (common
for NICs or HBAs).

Reviewed by:		jhb (verbal)
Sponsored by:		Netflix
2022-12-04 16:29:03 -07:00
Warner Losh
aa52c6bdd7 newbus: Create a knob to disable devices that fail to attach.
Normally, when a device fails to attach, we tear down the newbus state
for that device so that future driver loads can try again (maybe with a
different driver, or maybe with a re-loaded and fixed kld).

Sometimes, however, it is desirable to have the device fail
permanantly. We do this by calling device_disable() on a failed
attached, as well as keeping the device in DS_ATTACHING forever. This
prevents retries on that device. This is enabled via
hw.bus.disable_failed_devices=1 in either a hint via the loader, or at
runtime with a sysctl setting. Setting from 1 -> 0 at runtime will not
affect previously disabled devices, however: they remain disabled.
They can be re-enabled manually with devctl enable, however.

Sponsored by:		Netflix

Reviewed by:	gallatin, hselasky, jhb
Differential Revision:	https://reviews.freebsd.org/D37517
2022-12-04 16:20:24 -07:00
Warner Losh
7652743540 devd: Warn for deprecated 'kern' system type
One year ago, I deprecated 'kern' in favor of 'kernel' for the system
name for some power events. I'm about to remove it from the kernel, but
realized there's been no warning generated for users. Preserve POLA by
converting on the fly here and issuing a warning for 14.x, and an fatal
error after we branch 15. Make compiling it an error on 16 to remove
the gross hack after we branch.

Sponsored by:		Netflix
Reviewed by:		bapt
Differential Revision:	https://reviews.freebsd.org/D37584
2022-12-02 10:48:02 -07:00
Warner Losh
8d147537bf newbus: Remove deprecated "kern" system name for resume events.
The new "kernel" system name is the one that's documented and has
been generated for a year now. Remove the old one now that 14.0
is getting close.

Sponsored by:		Netflix
Reviewed by:		bapt
Differential Revision:	https://reviews.freebsd.org/D37582
2022-12-02 10:48:02 -07:00
Warner Losh
e59fa9b2e7 newbus: Comment style nit
Sponsored by:		Netflix
2022-11-29 13:11:24 -07:00
Warner Losh
1ea42a2880 md5: Use c89 function definitions
Use the c89 function definitions rather than the old K&R definitions.

Sponsored by:		Netflix
2022-11-27 13:22:31 -07:00
Eric van Gyzen
a134a12b14 Mark the debug.vnlru_nowhere sysctl as CTLFLAG_STATS
The kernel doesn't read it.  It's only writable so it can be cleared.

Sponsored by:	Dell EMC Isilon
2022-11-17 10:44:58 -06:00
Rick Macklem
4ee16246f9 vfs_vnops.c: Fix blksize for ZFS
Since ZFS reports _PC_MIN_HOLE_SIZE as 512 (although it
appears that an unwritten region must be at least f_iosize
to remain unallocated), vn_generic_copy_file_range()
uses 4096 for the copy blksize for ZFS, reulting in slow copies.

For most other file systems, _PC_MIN_HOLE_SIZE and f_iosize
are the same value, so this patch modifies the code to
use f_iosize for most cases.  It also documents in comments
why the blksize is being set a certain way, so that the code
does not appear to be doing "magic math".

Reported by:	allanjude
Reviewed by:	allanjude, asomers
MFC after:	2 weeks
Differential Revision:	https://reviews.freebsd.org/D37076
2022-11-16 17:37:22 -08:00
John Baldwin
9a673b7158 ktls: Add software support for AES-CBC decryption for TLS 1.1+.
This is mainly intended to provide a fallback for TOE TLS which may
need to use software decryption for an initial record at the start
of a connection.

Reviewed by:	markj
Sponsored by:	Chelsio Communications
Differential Revision:	https://reviews.freebsd.org/D37370
2022-11-15 12:02:03 -08:00
Mateusz Guzik
c3f1a13902 Retire broken GPROF support from the kernel
The option is not even recognized and with that patched it does not
compile. Even if it did work, it would be prohibitively expensive to
use.

Interested parties can use pmcstat or dtrace instead.
2022-11-15 14:17:10 +00:00
John Baldwin
5920f99d21 ktls: Inline ktls_cleanup() into ktls_destroy().
Reviewed by:	gallatin, markj
Sponsored by:	Chelsio Communications
Differential Revision:	https://reviews.freebsd.org/D37353
2022-11-11 16:01:02 -08:00
John Baldwin
d01db2b837 ktls: Don't leak ktls session objects for certain errors.
ktls_cleanup() does not free ktls session objects, it merely
cleans (and frees) members of the object.

Change callers to use ktls_free() instead.

Reviewed by:	gallatin, markj
Sponsored by:	Chelsio Communications
Differential Revision:	https://reviews.freebsd.org/D37352
2022-11-11 16:00:37 -08:00
Mateusz Guzik
83286682f8 vfs: whack mips remnant
This reverts commit 8ffa01a061.
2022-11-09 00:31:50 +00:00
Gleb Smirnoff
8840ae2288 tcp: don't store VNET in every tcpcb, take it from the inpcbinfo
Reviewed by:		rscheff
Differential revision:	https://reviews.freebsd.org/D37125
2022-11-08 10:24:40 -08:00
Gleb Smirnoff
9eb0e8326d tcp: provide macros to access inpcb and socket from a tcpcb
There should be no functional changes with this commit.

Reviewed by:		rscheff
Differential revision:	https://reviews.freebsd.org/D37123
2022-11-08 10:24:40 -08:00
Mark Johnston
2c10be9e06 arm64: Handle translation faults for thread structures
The break-before-make requirement poses a problem when promoting or
demoting mappings containing thread structures: a CPU may raise a
translation fault while accessing curthread, and data_abort() accesses
the thread again before pmap_fault() can translate the address and
return.

Normally this isn't a problem because we have a hack to ensure that
slabs used by the thread zone are always accessed via the direct map,
where promotions and demotions are rare.  However, this hack doesn't
work properly with UMA_MD_SMALL_ALLOC disabled, as is the case with
KASAN configured (since our KASAN implementation does not shadow the
direct map and so tries to force the use of the kernel map wherever
possible).

Fix the problem by modifying data_abort() to handle translation faults
in the kernel map without dereferencing "td", i.e., curthread, and
without enabling interrupts.  pmap_klookup() has special handling for
translation faults which makes it safe to call in this context.  Then,
revert the aforementioned hack.

Reviewed by:	kevans, alc, kib, andrew
MFC after:	1 month
Sponsored by:	Juniper Networks, Inc.
Sponsored by:	Klara, Inc.
Differential Revision:	https://reviews.freebsd.org/D37231
2022-11-02 13:46:25 -04:00
Andrew Gallatin
8b19898a78 Fix a panic on boot introduced by 555a861d68
First, an sbuf_new() in device_get_path() shadows the sb
passed in by dev_wired_cache_add(), leaving its sb in an
unfinished state, leading to a failed KASSERT().  Fixing this
is as simple as removing the sbuf_new() from device_get_path()

Second, we cannot simply take a pointer to the sbuf memory and
store it in the device location cache, because that sbuf
is freed immediately after we add data to the cache, leading
to a use-after-free and eventually a double-free.  Fixing this
requires allocating memory for the path.

After a discussion with jhb, we decided that one malloc was
better than two in dev_wired_cache_add, which is why it changed
so much.

Reviewed by: jhb
Sponsored by: Netflix
MFC after: 14 days
2022-11-01 13:44:39 -04:00
Mark Johnston
1f6b6cf177 atomic: Intercept atomic_(load|store)_bool for kernel sanitizers
Fixes:		2bed73739a ("atomic: Add plain atomic_load/store_bool()")
2022-10-29 11:10:58 -04:00
Konstantin Belousov
6b69465efb vfs_domount(): ensure that v_mountedhere and VIRF_MOUNTPOINT are set under the vnode lock
Fixes:	f7833196bd
Reported and tested by:	pho
Reviewed by:	jah, markj (previous version)
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
Differential revision:	https://reviews.freebsd.org/D37198
2022-10-29 14:29:55 +03:00
John Baldwin
744bfb2131 Import the WireGuard driver from zx2c4.com.
This commit brings back the driver from FreeBSD commit
f187d6dfbf plus subsequent fixes from
upstream.

Relative to upstream this commit includes a few other small fixes such
as additional INET and INET6 #ifdef's, #include cleanups, and updates
for recent API changes in main.

Reviewed by:	pauamma, gbe, kevans, emaste
Obtained from:	git@git.zx2c4.com:wireguard-freebsd @ 3cc22b2
Sponsored by:	The FreeBSD Foundation
Differential Revision:	https://reviews.freebsd.org/D36909
2022-10-28 13:36:12 -07:00
Jason A. Harmening
f7833196bd vfs_lookup(): Minor performance optimizations
Refactor the symlink and mountpoint traversal logic to avoid
repeatedly checking the vnode type; a symlink cannot be a mountpoint
and vice versa.  Avoid repeatedly checking cn_flags for NOCROSSMOUNT
and simplify the check which determines whether the vnode is a
mountpoint.

Suggested by:	mjg
Reviewed by:	kib
Tested by:	pho
Differential Revision:	https://reviews.freebsd.org/D35054
2022-10-26 19:33:33 -05:00
Jason A. Harmening
4390622c8d vfs_busy(): fix wording in comment
Reviewed by:	kib
Differential Revision:	https://reviews.freebsd.org/D35054
2022-10-26 19:33:30 -05:00
Jason A. Harmening
706f15c5fa Remove witness directives from crossmp locking VOPs
These are of limited use since the crossmp vnode locking ops have not
actually used a lock since commit
a2d3554542.  We in fact require that
these operations are always issued with LK_SHARED.  Additionally,
these directives can produce a false positive in certain VV_CROSSLOCK
cases which require upgrading of the covered vnode lock from shared
to exclusive.

While here, replace the runtime check of LK_SHARED with a KASSERT and
expand the check to include LK_NOWAIT, which all callers pass.

Reviewed by:	kib
Tested by:	pho
Differential Revision:	https://reviews.freebsd.org/D35054
2022-10-26 19:33:18 -05:00
Jason A. Harmening
080ef8a418 Add VV_CROSSLOCK vnode flag to avoid cross-mount lookup LOR
When a lookup operation crosses into a new mountpoint, the mountpoint
must first be busied before the root vnode can be locked. When a
filesystem is unmounted, the vnode covered by the mountpoint must
first be locked, and then the busy count for the mountpoint drained.
Ordinarily, these two operations work fine if executed concurrently,
but with a stacked filesystem the root vnode may in fact use the
same lock as the covered vnode. By design, this will always be
the case for unionfs (with either the upper or lower root vnode
depending on mount options), and can also be the case for nullfs
if the target and mount point are the same (which admittedly is
very unlikely in practice).

In this case, we have LOR. The lookup path holds the mountpoint
busy while waiting on what is effectively the covered vnode lock,
while a concurrent unmount holds the covered vnode lock and waits
for the mountpoint's busy count to drain.

Attempt to resolve this LOR by allowing the stacked filesystem
to specify a new flag, VV_CROSSLOCK, on a covered vnode as necessary.
Upon observing this flag, the vfs_lookup() will leave the covered
vnode lock held while crossing into the mountpoint. Employ this flag
for unionfs with the caveat that it can't be used for '-o below' mounts
until other unionfs locking issues are resolved.

Reported by:	pho
Tested by:	pho
Reviewed by:	kib
Differential Revision:	https://reviews.freebsd.org/D35054
2022-10-26 19:33:03 -05:00
Mateusz Guzik
d346e3ac33 vfs: use cache_assert_no_entries instead of open-coding it 2022-10-26 15:54:19 +00:00
Warner Losh
deb1e3b719 physmem: Add physmem_excluded to query if a region is excluded
In order to safely reuse excluded memory when it's reserved for special
purpose, we need to test whether or not the memory has been reserved
early in boot. physmem_excluded will return true when the entire range
is excluded, false otherwise.

Sponsored by:		Netflix
2022-10-25 09:32:49 -06:00
Mateusz Guzik
d653aaec7a cache: add cache_assert_no_entries 2022-10-24 15:37:43 +00:00