There's no reason to use one over the other here, let's prefer the
interface that's used elsewhere in the kernel.
No functional change intended.
Reviewed by: mjg
Sponsored by: Klara, Inc.
Differential Revision: https://reviews.freebsd.org/D38360
Summary:
Add 2 new APIs for supporting recent mbuf changes:
* 36e0a362ac added the m_snd_tag_alloc() wrapper around
if_snd_tag_alloc(). Push this down to the ifnet level.
* 4d7a1361ef adds the m_rcvif_serialize()/m_rcvif_restore() KPIs to
serialize and restore an ifnet pointer. Add the necessary wrapper to
get the index generation for this.
Reviewed By: jhb
Sponsored by: Juniper Networks, Inc.
Differential Revision: https://reviews.freebsd.org/D38340
During testing of exporting file systems in jails, I
noticed that the export structures on a mount
were not being free'd when the mount is dismounted.
This bug appears to have been in the system for a
very long time. It would have resulted in a slow memory
leak when exported file systems were dismounted.
Prior to r362158, freeing the structures during dismount
would not have been safe, since VFS_CHECKEXP() returned
a pointer into an export structure, which might still have been
used by the NFS server for an in-progress RPC when the file system
is dismounted. r362158 fixed this, so it should now be safe
to free the structures in vfs_mount_destroy(), which is what
this patch does.
Reviewed by: kib
MFC after: 1 month
Differential Revision: https://reviews.freebsd.org/D38385
Writes on UFS through a mapped region do not allocate disk blocks in
holes immediately. The blocks are allocated when the pages are paged out
first time.
This breaks the algorithm in vn_bmap_seekhole() and ufs_bmap_seekdata(),
because VOP_BMAP() reports hole for the place which already contains a
valid data.
Clean the pages before doing VOP_BMAP() in the affected functions. In
principle, we could clean less by only requesting clean starting from
the offset, but it is probably not very important.
PR: 269261
Reported by: asomers
Reviewed by: asomers, markj
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
Differential revision: https://reviews.freebsd.org/D38379
mountd(8) basically does the following:
getmntinfo()
for each mount
delete_exports
using nmount(2) to do the creation/deletion of individual exports.
For prison0 (and for other prisons if enforce_statfs == 0) getmntinfo()
returns all mount points, including ones being used within other prisons.
This can cause confusion if the same file system is specified in the
exports(5) file for multiple prisons.
This patch adds a perminent identifier to each prison
and marks which prison did the exports in a field of
the mount structure called mnt_exjail. This field can
then be compared to the perminent identifier for the
prison that the thread's credentials is in.
Also required was a new function called prison_isalive_permid()
which returns if the prison is alive, so that the check can be
ignored for prisons that have been removed.
This prepares the system to allow mountd(8) to run in multiple
prisons, including prison0.
Future commits will complete the modifications to allow mountd(8)
to run in vnet prisons. Until then, these changes should not affect
semantics.
Reviewed by: markj
MFC after: 3 months
Differential Revision: https://reviews.freebsd.org/D38144
Since mountd(8) will not be able to do exports
when running in a vnet prison if enforce_statfs is
set to 0, add a check for this to prison_check_nfsd().
Reviewed by: jamie, markj
MFC after: 2 months
Differential Revision: https://reviews.freebsd.org/D38189
This somewhat undermines the initial goal of sousrsend() to have all
the special error handling for a write on a socket in a single place.
The aio(4) needs to see EWOULDBLOCK to re-schedule the job. Because
aio(4) handles return from soreceive() and sousrsend() with the same
code, we can't check for (error == 0 && done < job_nbytes). Keeping
this exclusion for aio(4) seems a lesser evil.
Fixes: 7a2c93b86e
Since 32c203577a by phk in 1999 (Make even more of the PPSAPI
implementations generic), the "nsec" parameter of hardpps() is a time
difference and no longer a time point. Change the name to "delta_nsec"
and adjust the comment.
Remove comment about a clock tick adjustment which is no longer in the code.
Pull Request: https://github.com/freebsd/freebsd-src/pull/640
Reviewed by: imp
cpuset_which() resolves the argument pair which and id and returns references
to an appropriate resources. To avoid leaking resources or accessing unresolved
references to a resources handle new which CPU_WHICH_TIDPID wherever
cpuset_which() is called.
To avoid code duplication cpuset_which2() has been added.
Reported by: syzbot+331e8402e0f7347f0f2a@syzkaller.appspotmail.com
Reviewed by: kib
Differential Revision: https://reviews.freebsd.org/D38272
MFC after: 2 weeks
Under Linux to sched_[g|s]etaffinity() functions the value returned from a call
to gettid(2) (thread id) can be passed in the argument pid. Specifying pid as 0
will set the attribute for the calling thread, and passing the value returned
from a call to getpid(2) (process id) will set the attribute for the main thread
of the thread group.
Native cpuset(2) family of system calls has "which" argument to determine how
the value of id argument is interpreted, i.e., CPU_WHICH_TID is used to pass
a thread id and CPU_WHICH_PID - to pass a process id.
For now native sched_[g|s]etaffinity() implementation is wrong as uses "which"
CPU_WHICH_PID to pass both (process and thread id) to the kernel. To fix this
adding a new "which" CPU_WHICH_TIDPID intended to handle both id's.
Reviewed by: kib
Differential Revision: https://reviews.freebsd.org/D38209
MFC after: 1 week
Since f35093f8 semantics of a thread affinity functions is changed to be a
compatible with Linux:
In case of getaffinity(), the minimum cpuset_t size that the kernel permits is
the maximum CPU id, present in the system, / NBBY bytes, the maximum size is not
limited.
In case of setaffinity(), the kernel does not limit the size of the user-provided
cpuset_t, internally using only the meaningful part of the set, where the upper
bound is the maximum CPU id, present in the system, no larger than the size of
the kernel cpuset_t.
To match pthread_attr_[g|s]etaffinity_np checks of the user-provided cpusets to
the kernel behavior export the minimum cpuset_t size allowed by running kernel
via new sysctl kern.sched.cpusetsizemin and use it in checks.
Reviewed by:
Differential Revision: https://reviews.freebsd.org/D38112
MFC after: 1 week
Merge commit 'b3392d84da5bf2162baf937c77e0557f3fd8a52b' into zstd_1.5.2
full changelog: https://github.com/facebook/zstd/compare/v1.4.8...v1.5.2
Updated sys/kern/subr_compressor.c to new API
MFC after: 3 days
Relnotes: yes
Sponsored by: Klara, Inc.
The functions are intended to report kernel variables that are
stored as sbintime_t (pointed to by arg1) as human readable
nanoseconds or milliseconds (reported via sysctl_handle_64).
The variable types and names were reversed. I guess there is
no functional change here, as all types flipped around were
signed 64. Note that these function aren't used yet anywhere
in the kernel.
Reviewed by: mav
Differential revision: https://reviews.freebsd.org/D38217
Currently shutdown_reset() is registered as the final entry of the
shutdown_final event handler. However, if a panic occurs early in boot
before the event is registered (SI_SUB_INTRINSIC), we may end up
spinning in the subsequent infinite for loop and failing to reset
altogether. Instead we can simply call this function unconditionally.
Reviewed by: markj
MFC after: 2 weeks
Differential Revision: https://reviews.freebsd.org/D37981
This will allow user-space programs (e.g. lsof) to locate deleted files
whose nlink equals zero. Prior to this commit, programs has to use
stat(kf_path) to get nlink, but that will fail if the file is deleted.
[mjg: s/fail/file in the commit message]
Reviewed by: mjg
Differential Revision: https://reviews.freebsd.org/D38169
In vn_getsize_locked(), when storing vattr.va_size of type u_quad_t into
off_t size, we must avoid overflow.
Then, the check for fsize < 0, introduced in the commit
f45feecfb2 'vfs: add vn_getsize', is nop [1].
Reported and reviewed by: jhb
Coverity CID: 1502346
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
Differential revision: https://reviews.freebsd.org/D38133
The assertion is equivalent to kstack_contains() so use that rather
than spelling it out.
Suggested by: jhb
Reviewed by: jhb
MFC after: 1 week
Sponsored by: DARPA, AFRL
Differential Revision: https://reviews.freebsd.org/D38107
Currently function prison_ip_restrict() returns true if the replacement
buffer was used, or no buffer provided and allocation fails and should
redo. The logic is confusing and cause possibly infinite loop from
eb8dcdeac2 .
Reviewed by: jamie, glebius
Approved by: kp (mentor)
Differential Revision: https://reviews.freebsd.org/D37918
And possibly infinite loop calling prison_ip_restrict() in
kern_jail_set() [2].
[1] It is possible that prisons do not have any IPv4 or IPv6 addresses.
[2] If prison_ip_restrict() is not provided with prison_ip, when it
allocates prison_ip successfully, then it should return false to
indicate not redo prison_ip_restrict() later.
Reviewed by: glebius
Approved by: kp (mentor)
Fixes: eb8dcdeac2 jail: network epoch protection for IP address lists
Differential Revision: https://reviews.freebsd.org/D37906
... which verifies that given file table does not have file descriptors
referencing vnodes on the specified mount point. It is up to the caller
to ensure that the check is not racy.
Reviewed by: mckusick
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
Differential revision: https://reviews.freebsd.org/D37896
getattr is very expensive and in important cases only gets called to get
the size. This can be optimized with a dedicated routine which obtains
that statistic.
As a step towards that goal make size-only consumers use a dedicated
routine.
Reviewed by: kib
Differential Revision: https://reviews.freebsd.org/D37885
If an error such as an invalid record or one whose decryption fails is
detected on a socket that has received a RST then ktls_drop() could
ignore the error since INP_DROPPED could already be set. In this case
soreceive_generic hangs since it does not return from a KTLS socket
with pending encrypted data unless there is an error (so_error) (this
behavior is to ensure that soreceive_generic doesn't return a
premature EOF when there is pending data still being decrypted).
Note that this was a bug prior to
69542f2682 as tcp_usr_abort would also
have ignored the error in this case.
Reviewed by: gallatin
Sponsored by: Chelsio Communications
Differential Revision: https://reviews.freebsd.org/D37775
To quote from a comment above vput_final:
<quote>
* XXX Some filesystems pass in an exclusively locked vnode and strongly depend
* on the lock being held all the way until VOP_INACTIVE. This in particular
* happens with UFS which adds half-constructed vnodes to the hash, where they
* can be found by other code.
</quote>
As is there is no mechanism which allows filesystems to denote that a
vnode is fully initialized, consequently problems like the above are
only found the hard way(tm).
Add rudimentary support for state transitions, which in particular allow
to assert the vnode is not legally unlocked until its fate is decided
(either construction finishes or vgone is called to abort it).
The new field lands in a 1-byte hole, thus it does not grow the struct.
Bump __FreeBSD_version to 1400077
Reviewed by: kib (previous version)
Tested by: pho
Differential Revision: https://reviews.freebsd.org/D37759
While here prefix with v for better consistency with the vnode stuff.
Reviewed by: kib (previous version)
Differential Revision: https://reviews.freebsd.org/D37759
Re-assign the sc local (syscall number) before moving args for SYS_syscall.
Correct the audit and kdtrace hooks invocations.
Fixes: 140ceb5d95
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
It indicates to a debugger that the thread is stopped at the
kernel->user exit path.
Reviewed by: markj
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
Differential revision: https://reviews.freebsd.org/D37590
and TDB_COREDUMPRQ to TDB_COREDUMPREQ
Reviewed by: markj
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
Differential revision: https://reviews.freebsd.org/D37590
Mount on a LE machine a filesystem formatted for BE is not supported
currently. This adds a check for the superblock magic number using
swapped bytes to guess and warn the user that it may be a valid
superblock but endian is incompatible.
MFC after: 2 weeks
Reviewed by: mckusick
Obtained from: mckusick, alfredo
Differential Revision: https://reviews.freebsd.org/D37675
For file mounts, the directory vnode is not available from namei and this
prevents the use of vn_fullpath_hardlink. In this case, we can use the
vnode which was covered by the file mount with vn_fullpath.
This also disallows file mounts over files with link counts greater than
one to ensure a deterministic path to the mount point.
Reviewed by: mjg, kib
Tested by: pho
The main use-case for this is to support mounting config files and
secrets into OCI containers. My current workaround copies the files into
the container which is messy and risks secrets leaking into container
images if the cleanup fails.
This adds a VFCF flag to indicate whether the filesystem supports file
mounts and allows fspath to be either a directory or a file if the flag
is set.
Test Plan:
$ sudo mkdir -p /mnt
$ sudo touch /mnt/foo
$ sudo mount -t nullfs /COPYRIGHT /mnt/foo
Reviewed by: mjg, kib
Tested by: pho