Commit Graph

19455 Commits

Author SHA1 Message Date
Gleb Smirnoff
fd53298799 unix: add myself to the copyright notice
for the new implementation of PF_UNIX/SOCK_DGRAM
2023-02-01 09:39:28 -08:00
Justin Hibbits
9507d03bfe IfAPI: Use the ifnet APIs in kern_poll()
The only API used is if_name().

Sponsored by:	Juniper Networks, Inc.
2023-01-31 15:02:16 -05:00
Sebastian Huber
c7c53e3ca6 Clarify hardpps() parameter name and comment
Since 32c203577a by phk in 1999 (Make even more of the PPSAPI
implementations generic), the "nsec" parameter of hardpps() is a time
difference and no longer a time point.  Change the name to "delta_nsec"
and adjust the comment.

Remove comment about a clock tick adjustment which is no longer in the code.

Pull Request: https://github.com/freebsd/freebsd-src/pull/640
Reviewed by: imp
2023-01-30 11:07:40 -07:00
Jose Luis Duran
df949e762c kern_environment: Partially apply style(9)
Sort include files, remove duplicates and remove trailing whitespce.

Pull Request:	https://github.com/freebsd/freebsd-src/pull/589
Reviewed by:	imp
2023-01-30 10:47:56 -07:00
Dmitry Chagin
2058f075b4 cpuset: Handle CPU_WHICH_TIDPID wherever cpuset_which() is called.
cpuset_which() resolves the argument pair which and id and returns references
to an appropriate resources. To avoid leaking resources or accessing unresolved
references to a resources handle new which CPU_WHICH_TIDPID wherever
cpuset_which() is called.
To avoid code duplication cpuset_which2() has been added.

Reported by:		syzbot+331e8402e0f7347f0f2a@syzkaller.appspotmail.com
Reviewed by:		kib
Differential Revision:	https://reviews.freebsd.org/D38272
MFC after:		2 weeks
2023-01-30 19:28:54 +03:00
Dmitry Chagin
e4754c8036 subr_smp: Trim trailing whitespaces.
MFC after:		1 week
2023-01-29 16:18:17 +03:00
Dmitry Chagin
c21b080f3d cpuset: Fix sched_[g|s]etaffinity() for better compatibility with Linux.
Under Linux to sched_[g|s]etaffinity() functions the value returned from a call
to gettid(2) (thread id) can be passed in the argument pid. Specifying pid as 0
will set the attribute for the calling thread, and passing the value returned
from a call to getpid(2) (process id) will set the attribute for the main thread
of the thread group.

Native cpuset(2) family of system calls has "which" argument to determine how
the value of id argument is interpreted, i.e., CPU_WHICH_TID is used to pass
a thread id and CPU_WHICH_PID - to pass a process id.

For now native sched_[g|s]etaffinity() implementation is wrong as uses "which"
CPU_WHICH_PID to pass both (process and thread id) to the kernel. To fix this
adding a new "which" CPU_WHICH_TIDPID intended to handle both id's.

Reviewed by:		kib
Differential Revision:	https://reviews.freebsd.org/D38209
MFC after:		1 week
2023-01-29 16:17:33 +03:00
Dmitry Chagin
01f74ccd5a libthr: Fix pthread_attr_[g|s]etaffinity_np to match it's manual and the kernel.
Since f35093f8 semantics of a thread affinity functions is changed to be a
compatible with Linux:

In case of getaffinity(), the minimum cpuset_t size that the kernel permits is
the maximum CPU id, present in the system, / NBBY bytes, the maximum size is not
limited.
In case of setaffinity(), the kernel does not limit the size of the user-provided
cpuset_t, internally using only the meaningful part of the set, where the upper
bound is the maximum CPU id, present in the system, no larger than the size of
the kernel cpuset_t.

To match pthread_attr_[g|s]etaffinity_np checks of the user-provided cpusets to
the kernel behavior export the minimum cpuset_t size allowed by running kernel
via new sysctl kern.sched.cpusetsizemin and use it in checks.

Reviewed by:
Differential Revision:	https://reviews.freebsd.org/D38112
MFC after:		1 week
2023-01-29 15:35:18 +03:00
Allan Jude
5ff13fbc19 MFV: zstd 1.5.2
Merge commit 'b3392d84da5bf2162baf937c77e0557f3fd8a52b' into zstd_1.5.2

full changelog: https://github.com/facebook/zstd/compare/v1.4.8...v1.5.2

Updated sys/kern/subr_compressor.c to new API

MFC after:	3 days
Relnotes:	yes
Sponsored by:	Klara, Inc.
2023-01-27 17:22:31 +00:00
Gleb Smirnoff
f394d9c0a4 sysctl: use correct types and names in sysctl_*sec_to_sbintime
The functions are intended to report kernel variables that are
stored as sbintime_t (pointed to by arg1) as human readable
nanoseconds or milliseconds (reported via sysctl_handle_64).
The variable types and names were reversed.  I guess there is
no functional change here, as all types flipped around were
signed 64.  Note that these function aren't used yet anywhere
in the kernel.

Reviewed by:		mav
Differential revision:	https://reviews.freebsd.org/D38217
2023-01-27 07:09:22 -08:00
Mitchell Horne
627ca221c3 kern_reboot: unconditionally call shutdown_reset()
Currently shutdown_reset() is registered as the final entry of the
shutdown_final event handler. However, if a panic occurs early in boot
before the event is registered (SI_SUB_INTRINSIC), we may end up
spinning in the subsequent infinite for loop and failing to reset
altogether. Instead we can simply call this function unconditionally.

Reviewed by:	markj
MFC after:	2 weeks
Differential Revision:	https://reviews.freebsd.org/D37981
2023-01-23 15:10:24 -04:00
Jiajie Chen
dec7db4960 Add kf_file_nlink field to kf_file and populate it
This will allow user-space programs (e.g. lsof) to locate deleted files
whose nlink equals zero. Prior to this commit, programs has to use
stat(kf_path) to get nlink, but that will fail if the file is deleted.

[mjg: s/fail/file in the commit message]

Reviewed by:	mjg
Differential Revision:  https://reviews.freebsd.org/D38169
2023-01-23 17:09:52 +00:00
Konstantin Belousov
456f05756b Handle int rank issues in in vn_getsize_locked() and vn_seek()
In vn_getsize_locked(), when storing vattr.va_size of type u_quad_t into
off_t size, we must avoid overflow.

Then, the check for fsize < 0, introduced in the commit
f45feecfb2 'vfs: add vn_getsize', is nop [1].

Reported and reviewed by:	jhb
Coverity CID:	1502346
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
Differential revision:	https://reviews.freebsd.org/D38133
2023-01-20 23:56:29 +02:00
Konstantin Belousov
5657f49ef3 kern_umtx.c do_wait(): correct confusing indent
Sponsored by:	The FreeBSD Foundation
MFC after:	3 days
2023-01-20 23:33:11 +02:00
Brooks Davis
fa1d803c0f epoch: replace hand coded assertion
The assertion is equivalent to kstack_contains() so use that rather
than spelling it out.

Suggested by:	jhb
Reviewed by:	jhb
MFC after:	1 week
Sponsored by:	DARPA, AFRL
Differential Revision:	https://reviews.freebsd.org/D38107
2023-01-20 18:04:40 +00:00
John Baldwin
846e4a206f ktls_disable_ifnet_help: Set curvnet around sorele().
This is required in kernels with VIMAGE such as GENERIC.

MFC after:	1 week
Sponsored by:	Chelsio Communications
2023-01-18 15:39:04 -08:00
Konstantin Belousov
0f80d5ebc8 Require INVARIANTS and WITNESS if DEBUG_VFS_LOCKS is set
Reported by:	pho
Reviewed by:	markj, mjg
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
Differential revision:	https://reviews.freebsd.org/D38070
2023-01-16 05:55:47 +02:00
Zhenlei Huang
8bce8d28ab jail: Avoid multipurpose return value of function prison_ip_restrict()
Currently function prison_ip_restrict() returns true if the replacement
buffer was used, or no buffer provided and allocation fails and should
redo. The logic is confusing and cause possibly infinite loop from
eb8dcdeac2 .

Reviewed by:	jamie, glebius
Approved by:	kp (mentor)
Differential Revision:	https://reviews.freebsd.org/D37918
2023-01-13 18:45:14 +08:00
Zhenlei Huang
89ddfbbac8 jail: Fix regression panic from eb8dcdeac2
And possibly infinite loop calling prison_ip_restrict() in
kern_jail_set() [2].

[1] It is possible that prisons do not have any IPv4 or IPv6 addresses.
[2] If prison_ip_restrict() is not provided with prison_ip, when it
    allocates prison_ip successfully, then it should return false to
    indicate not redo prison_ip_restrict() later.

Reviewed by:	glebius
Approved by:	kp (mentor)
Fixes:	eb8dcdeac2 jail: network epoch protection for IP address lists
Differential Revision:	https://reviews.freebsd.org/D37906
2023-01-13 18:45:14 +08:00
Zhenlei Huang
ddbf879d79 jail: Correctly access IPv[46] addresses of prison_ip
* Fix wrong IPv[46] addresses inherited from parent jail
* Properly restrict the child jail's IPv[46] addresses

Reviewed by:	melifaro, glebius
Approved by:	kp (mentor)
Fixes:	eb8dcdeac2 jail: network epoch protection for IP address lists
Differential Revision:	https://reviews.freebsd.org/D37871
Differential Revision:	https://reviews.freebsd.org/D37872
2023-01-13 18:45:14 +08:00
Konstantin Belousov
37b9fb1696 Add descrip_check_write_mp() helper
... which verifies that given file table does not have file descriptors
referencing vnodes on the specified mount point.  It is up to the caller
to ensure that the check is not racy.

Reviewed by:	mckusick
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
Differential revision:	https://reviews.freebsd.org/D37896
2022-12-29 22:55:39 +02:00
Mateusz Guzik
f45feecfb2 vfs: add vn_getsize
getattr is very expensive and in important cases only gets called to get
the size. This can be optimized with a dedicated routine which obtains
that statistic.

As a step towards that goal make size-only consumers use a dedicated
routine.

Reviewed by:	kib
Differential Revision:	https://reviews.freebsd.org/D37885
2022-12-28 22:43:49 +00:00
John Baldwin
07be751727 ktls: Post receive errors on partially closed sockets.
If an error such as an invalid record or one whose decryption fails is
detected on a socket that has received a RST then ktls_drop() could
ignore the error since INP_DROPPED could already be set.  In this case
soreceive_generic hangs since it does not return from a KTLS socket
with pending encrypted data unless there is an error (so_error) (this
behavior is to ensure that soreceive_generic doesn't return a
premature EOF when there is pending data still being decrypted).

Note that this was a bug prior to
69542f2682 as tcp_usr_abort would also
have ignored the error in this case.

Reviewed by:	gallatin
Sponsored by:	Chelsio Communications
Differential Revision:	https://reviews.freebsd.org/D37775
2022-12-27 16:00:17 -08:00
Mateusz Guzik
829f0bcb5f vfs: add the concept of vnode state transitions
To quote from a comment above vput_final:
<quote>
* XXX Some filesystems pass in an exclusively locked vnode and strongly depend
* on the lock being held all the way until VOP_INACTIVE. This in particular
* happens with UFS which adds half-constructed vnodes to the hash, where they
* can be found by other code.
</quote>

As is there is no mechanism which allows filesystems to denote that a
vnode is fully initialized, consequently problems like the above are
only found the hard way(tm).

Add rudimentary support for state transitions, which in particular allow
to assert the vnode is not legally unlocked until its fate is decided
(either construction finishes or vgone is called to abort it).

The new field lands in a 1-byte hole, thus it does not grow the struct.

Bump __FreeBSD_version to 1400077

Reviewed by:	kib (previous version)
Tested by:	pho
Differential Revision:	https://reviews.freebsd.org/D37759
2022-12-26 17:35:12 +00:00
Mateusz Guzik
94267fc907 vfs: use designated initializers for the typename array
While here prefix with v for better consistency with the vnode stuff.

Reviewed by:	kib (previous version)
Differential Revision:	https://reviews.freebsd.org/D37759
2022-12-26 17:34:41 +00:00
Konstantin Belousov
974be51b3f Fixes for ptrace_syscallreq()
Re-assign the sc local (syscall number) before moving args for SYS_syscall.
Correct the audit and kdtrace hooks invocations.

Fixes:	140ceb5d95
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
2022-12-23 01:53:41 +02:00
Konstantin Belousov
140ceb5d95 ptrace(2): add PT_SC_REMOTE remote syscall request
Reviewed by:	markj
Discussed with:	jhb
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
Differential revision:	https://reviews.freebsd.org/D37590
2022-12-22 23:11:35 +02:00
Konstantin Belousov
f0592b3c8d Add a thread debugging flag TDB_BOUNDARY
It indicates to a debugger that the thread is stopped at the
kernel->user exit path.

Reviewed by:	markj
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
Differential revision:	https://reviews.freebsd.org/D37590
2022-12-22 23:11:35 +02:00
Konstantin Belousov
e6feeae2f9 sys: rename td_coredump to td_remotereq
and TDB_COREDUMPRQ to TDB_COREDUMPREQ

Reviewed by:	markj
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
Differential revision:	https://reviews.freebsd.org/D37590
2022-12-22 23:11:35 +02:00
Zhenlei Huang
21ad3e27fa jail: Fix output of IPv[46] addresses of DDB show prison
Reviewed by:	melifaro, jamie
Approved by:	kp (mentor)
Fixes:		eb8dcdeac2 jail: network epoch protection for IP address lists
Differential Revision:	https://reviews.freebsd.org/D37732
2022-12-21 09:53:28 +08:00
Alfredo Dal'Ava Junior
b13110e9f3 ufs/ffs: detect endian mismatch between machine and filesystem
Mount on a LE machine a filesystem formatted for BE is not supported
currently. This adds a check for the superblock magic number using
swapped bytes to guess and warn the user that it may be a valid
superblock but endian is incompatible.

MFC after:	2 weeks
Reviewed by:	mckusick
Obtained from:	mckusick, alfredo
Differential Revision: https://reviews.freebsd.org/D37675
2022-12-20 00:20:11 -03:00
Doug Rabson
71e9be1bd5 Don't allow stacking of file mounts
Reviewed by:    mjg, kib
Tested by:      pho
2022-12-19 16:46:27 +00:00
Doug Rabson
a1d74b2dab Allow realpath to work for file mounts
For file mounts, the directory vnode is not available from namei and this
prevents the use of vn_fullpath_hardlink. In this case, we can use the
vnode which was covered by the file mount with vn_fullpath.

This also disallows file mounts over files with link counts greater than
one to ensure a deterministic path to the mount point.

Reviewed by:    mjg, kib
Tested by:      pho
2022-12-19 16:46:27 +00:00
Doug Rabson
521fbb722c Add support for mounting single files in nullfs
The main use-case for this is to support mounting config files and
secrets into OCI containers. My current workaround copies the files into
the container which is messy and risks secrets leaking into container
images if the cleanup fails.

This adds a VFCF flag to indicate whether the filesystem supports file
mounts and allows fspath to be either a directory or a file if the flag
is set.

Test Plan:
$ sudo mkdir -p /mnt
$ sudo touch /mnt/foo
$ sudo mount -t nullfs /COPYRIGHT /mnt/foo

Reviewed by:    mjg, kib
Tested by:      pho
2022-12-19 16:46:13 +00:00
Doug Rabson
78d35459a2 Add vn_path_to_global_path_hardlink
This is similar to vn_path_to_global_path but allows for regular files
which may not be present in the cache.

Reviewed by:    mjg, kib
Tested by:      pho
2022-12-19 16:44:59 +00:00
Mateusz Guzik
8f7859e800 vfs: retire the now unused SAVESTART flag
Bump __FreeBSD_version to 1400075

Tested by:      pho
2022-12-19 08:11:08 +00:00
Mateusz Guzik
56da4aa554 vfs: stop using SAVESTART for rename
ni_startdir has never reached rename routines anyway

Reviewed by:	mckusick
Tested by:	pho
Differential Revision:	https://reviews.freebsd.org/D34468
2022-12-19 08:09:37 +00:00
Mateusz Guzik
8f874e92eb vfs: make relookup take an additional argument
instead of looking at SAVESTART

This is a step towards removing the flag.

Reviewed by:	mckusick
Tested by:	pho
Differential Revision:	https://reviews.freebsd.org/D34468
2022-12-19 08:09:00 +00:00
Mateusz Guzik
269c564b90 vfs: retire NDFREE
There are no consumers anymore. Interested parties can NDFREE_PNBUF
and vput or vrele relevant vnodes.

Tested by:	pho
2022-12-19 08:07:54 +00:00
Mateusz Guzik
85dac03e30 vfs: stop using NDFREE
It provides nothing but a branchfest and next to no consumers want it
anyway.

Tested by:	pho
2022-12-19 08:07:23 +00:00
Rick Macklem
bba7a2e896 kern_jail.c: Allow mountd/nfsd to optionally run in a jail
This patch adds "allow.nfsd" to the jail code based on a
new kernel build option VNET_NFSD.  This will not work
until future patches fix nmount(2) to allow mountd to
run in a vnet prison and the NFS server code is patched
so that global variables are in a vnet.

The jail(8) man page will be patched in a future commit.

Reviewed by:	jamie
MFC after:	4 months
Differential Revision:	https://reviews.freebsd.org/D37637
2022-12-17 13:43:49 -08:00
Rick Macklem
195f1b124d vfs_mount.c: fix vfs_domount() for PRIV_VFS_MOUNT_EXPORTED
It appears that, prior to r158857 vfs_domount() checked
suser() when MNT_EXPORTED was specified.

r158857 appears to have broken this, since MNT_EXPORTED
was no longer set when mountd.c was converted to use nmount(2).
r164033 replaced the suser() check with
priv_check(td, PRIV_VFS_MOUNT_EXPORTED), which does the
same thing (ie. checks for effective uid == 0 assuming suses_enabled
is set).

This patch restores this check by setting MNT_EXPORTED when the
"export" mount option is specified to nmount().

I think this is reasonable since only mountd(8) should be setting
exports and I doubt any non-root mounted file system would
be setting its own exports.

Reviewed by:	kib, markj
MFC after:	2 weeks
Differential Revision:	https://reviews.freebsd.org/D37718
2022-12-16 13:01:23 -08:00
John Baldwin
69542f2682 ktls: Close a race with setting so_error when dropping a connection.
pr_abort calls tcp_usr_abort which calls tcp_drop with ECONNABORTED.
After pr_abort returns, the so_error is then set to a more specific
error.  However, a reader can observe and return the ECONNABORTED
error before so_error is set to the desired error value.  This is
resulting in spurious test failures of recently added tests for
invalid conditions such as invalid headers.

To fix, refactor the code to abort a connection to call tcp_drop
directly with the desired error value.  ktls_reset_send_tag already
calls tcp_drop directly when it aborts a connection due to an error.

Reviewed by:	gallatin
Reported by:	CI (jenkins), gallatin, olivier
Sponsored by:	Chelsio Communications
Differential Revision:	https://reviews.freebsd.org/D37692
2022-12-15 12:06:26 -08:00
Andrew Gallatin
ac4e3a27ab Unbreak the build when MAC is not defined
7a2c93b86e removed the use of "error" when MAC was not
defined, resulting in an unused variable error.

Sponsored by: Netflix
Reviewed by: jhb
2022-12-14 17:39:25 -05:00
Gleb Smirnoff
7a2c93b86e sockets: provide sousrsend() that does socket specific error handling
Sockets have special handling for EPIPE on a write, that was spread out
into several places.  Treating transient errors is also special - if
protocol is atomic, than we should ignore any changes to uio_resid, a
transient error means the write had completely failed (see d2b3a0ed31).

- Provide sousrsend() that expects a valid uio, and leave sosend() for
  kernel consumers only.  Do all special error handling right here.
- In dofilewrite() don't do special handling of error for DTYPE_SOCKET.
- For send(2), write(2) and aio_write(2) call into sousrsend() and remove
  error handling for kern_sendit(), soo_write() and soaio_process_job().

PR:			265087
Reported by:            rz-rpi03 at h-ka.de
Reviewed by:            markj
Differential revision:	https://reviews.freebsd.org/D35863
2022-12-14 10:02:44 -08:00
Jason A. Harmening
42442d7a6e Generalize the VV_CROSSLOCK logic in vfs_lookup()
When VV_CROSSLOCK is present, the lock for the vnode at the current
stage of lookup must be held across the VFS_ROOT() call for the
filesystem mounted at the vnode.  Since VV_CROSSLOCK implies that
the root vnode reuses the already-held lock, the possibility for
recursion should be made clear in the flags passed to VFS_ROOT().

For cases in which the lock is held exclusive, this means passing
LK_CANRECURSE.  For cases in which the lock is held shared, it
means clearing LK_NODDLKTREAT to allow VFS_ROOT() to potentially
recurse on the shared lock even in the presence of an exclusive
waiter.

That the existing code works for unionfs is due to a coincidence
of the current unionfs implementation.

Reviewed by:	kib
Tested by:	pho
Differential Revision:	https://reviews.freebsd.org/D37458
2022-12-10 22:02:38 -06:00
Mateusz Guzik
ebdf27b6f3 uipc: remove accept_mtx
It is unused since 779f106aa1 ("Listening sockets improvements.")

Sponsored by:	Rubicon Communications, LLC ("Netgate")
2022-12-11 02:47:07 +00:00
Konstantin Belousov
0919f29d91 shmfd: account for the actually allocated pages
Return the value as stat(2) st_blocks.

Suggested and reviewed by:	markj (previous version)
Tested by:	pho
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
Differential revision:	https://reviews.freebsd.org/D37097
2022-12-09 14:17:12 +02:00
Konstantin Belousov
37aea2649f tmpfs: for used pages, account really allocated pages, instead of file sizes
This makes tmpfs size accounting correct for the sparce files. Also
correct report st_blocks/va_bytes. Previously the reported value did not
accounted for the swapped out pages.

PR:	223015
Reviewed by:	markj
Tested by:	pho
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
Differential revision:	https://reviews.freebsd.org/D37097
2022-12-09 14:17:12 +02:00
Konstantin Belousov
7ec4b29b08 uiomove_object: hide diagnostic under bootverbose
Reviewed by:	markj
Tested by:	pho
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
Differential revision:	https://reviews.freebsd.org/D37097
2022-12-09 14:15:37 +02:00