Use atomic_fetchadd in place of separate atomic_subtract / atomic_load.
Reviewed by: markj
Sponsored by: HPE TidalScale
Differential Revision: https://reviews.freebsd.org/D38559
Crucially, this allows releasing counters, and interrupt sources by
extension. Where before we were incrementing intrcnt_index with atomics,
now we protect the bitmap using the existing isrc_table_lock mutex.
Reviewed by: mmel
MFC after: 2 weeks
Differential Revision: https://reviews.freebsd.org/D38437
soconnectat() tries to ensure that one cannot connect a connected
socket. However, the check is racy and does not really prevent two
threads from attempting to connect the same TCP socket.
Modify tcp_connect() and tcp6_connect() to perform the check again, this
time synchronized by the inpcb lock, under which we call
soisconnecting().
Reported by: syzkaller
Reviewed by: glebius
MFC after: 2 weeks
Sponsored by: Klara, Inc.
Sponsored by: Modirum MDPay
Differential Revision: https://reviews.freebsd.org/D38507
Also do minor style adjustments.
Reviewed by: markj
Tested by: pho
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
Differential revision: https://reviews.freebsd.org/D38549
For sched_relinquish(). This fixes the build for some kernel configs.
Reported by: Jenkins
Fixes: 1029dab634 ("mi_switch(): clean up switch types and their usage")
Address some last minute review feedback on c0e4090e3d
by fixing spacing around comments, and clarifying that the
newly added destroy_task is not related to tls 1.0.
No functional change intended.
Pointed out by: jhb
Sponsored by: Netflix
This allows us to avoid spurious calls to ktls_disable_ifnet()
When we implemented ifnet kTLSe, we set a flag in the tx socket
buffer (SB_TLS_IFNET) to indicate ifnet kTLS. This flag meant that
now, or in the past, ifnet ktls was active on a socket. Later,
I added code to switch ifnet ktls sessions to software in the case
of lossy TCP connections that have a high retransmit rate.
Because TCP was using SB_TLS_IFNET to know if it needed to do math
to calculate the retransmit ratio and potentially call into
ktls_disable_ifnet(), it was doing unneeded work long after
a session was moved to software.
This patch carefully tracks whether or not ifnet ktls is still enabled
on a TCP connection. Because the inp is now embedded in the tcpcb, and
because TCP is the most frequent accessor of this state, it made sense to
move this from the socket buffer flags to the tcpcb. Because we now need
reliable access to the tcbcb, we take a ref on the inp when creating a tx
ktls session.
While here, I noticed that rack/bbr were incorrectly implementing
tfb_hwtls_change(), and applying the change to all pending sends,
when it should apply only to future sends.
This change reduces spurious calls to ktls_disable_ifnet() by 95% or so
in a Netflix CDN environment.
Reviewed by: markj, rrs
Sponsored by: Netflix
Differential Revision: https://reviews.freebsd.org/D38380
Overall, this is a non-functional change, except for kernels built with
SCHED_STATS. However, the switch types are useful for communicating the
intent of the caller.
1. Ensure that every caller provides a type. In most cases, we upgrade
the basic yield to sched_relinquish() aka SWT_RELINQUISH.
2. The case of sched_bind() is distinct, so add a new switch type SWT_BIND.
3. Remove the two unused types, SWT_PREEMPT and SWT_SLEEPQTIMO.
4. Remove SWT_NONE altogether and assert that callers always provide
a type flag.
5. Reference the mi_switch(9) man page in the comments, as these flags
will be documented there.
Reviewed by: kib, markj
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D38184
ULE uses the more specific SWT_REMOTEPREEMPT and SWT_REMOTEWAKEIDLE
switch types, let's do that here as well. SWT_PREEMPT is somewhat
redundant when we also have the SW_PREEMPT flag.
This only has an effect for kernels built with SCHED_STATS.
Reviewed by: kib, markj
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D38183
Do this ahead of adding a man page that describes the function. No
functional change.
Reviewed by: kib, markj
MFC after: 3 days
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D38181
The equivalent function is now named thread_create(). Mention
kthread_add() where it is also relevant.
Reviewed by: kib, markj
MFC after: 3 days
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D38180
Its purpose is to reserve all I/O space belonging to physical memory
from nexus, preventing it from being handed out by bus_alloc_resource()
to callers such as xenpv_alloc_physmem(), which looks for the first
available free range it can get. This mimics the existing pseudo-driver
on x86.
If needed, the device can be disabled with hint.ram.0.disabled="1" in
/boot/device.hints.
Reviewed by: imp
MFC after: 1 month
Differential Revision: https://reviews.freebsd.org/D32343
There's no reason to use one over the other here, let's prefer the
interface that's used elsewhere in the kernel.
No functional change intended.
Reviewed by: mjg
Sponsored by: Klara, Inc.
Differential Revision: https://reviews.freebsd.org/D38360
Summary:
Add 2 new APIs for supporting recent mbuf changes:
* 36e0a362ac added the m_snd_tag_alloc() wrapper around
if_snd_tag_alloc(). Push this down to the ifnet level.
* 4d7a1361ef adds the m_rcvif_serialize()/m_rcvif_restore() KPIs to
serialize and restore an ifnet pointer. Add the necessary wrapper to
get the index generation for this.
Reviewed By: jhb
Sponsored by: Juniper Networks, Inc.
Differential Revision: https://reviews.freebsd.org/D38340
During testing of exporting file systems in jails, I
noticed that the export structures on a mount
were not being free'd when the mount is dismounted.
This bug appears to have been in the system for a
very long time. It would have resulted in a slow memory
leak when exported file systems were dismounted.
Prior to r362158, freeing the structures during dismount
would not have been safe, since VFS_CHECKEXP() returned
a pointer into an export structure, which might still have been
used by the NFS server for an in-progress RPC when the file system
is dismounted. r362158 fixed this, so it should now be safe
to free the structures in vfs_mount_destroy(), which is what
this patch does.
Reviewed by: kib
MFC after: 1 month
Differential Revision: https://reviews.freebsd.org/D38385
Writes on UFS through a mapped region do not allocate disk blocks in
holes immediately. The blocks are allocated when the pages are paged out
first time.
This breaks the algorithm in vn_bmap_seekhole() and ufs_bmap_seekdata(),
because VOP_BMAP() reports hole for the place which already contains a
valid data.
Clean the pages before doing VOP_BMAP() in the affected functions. In
principle, we could clean less by only requesting clean starting from
the offset, but it is probably not very important.
PR: 269261
Reported by: asomers
Reviewed by: asomers, markj
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
Differential revision: https://reviews.freebsd.org/D38379
mountd(8) basically does the following:
getmntinfo()
for each mount
delete_exports
using nmount(2) to do the creation/deletion of individual exports.
For prison0 (and for other prisons if enforce_statfs == 0) getmntinfo()
returns all mount points, including ones being used within other prisons.
This can cause confusion if the same file system is specified in the
exports(5) file for multiple prisons.
This patch adds a perminent identifier to each prison
and marks which prison did the exports in a field of
the mount structure called mnt_exjail. This field can
then be compared to the perminent identifier for the
prison that the thread's credentials is in.
Also required was a new function called prison_isalive_permid()
which returns if the prison is alive, so that the check can be
ignored for prisons that have been removed.
This prepares the system to allow mountd(8) to run in multiple
prisons, including prison0.
Future commits will complete the modifications to allow mountd(8)
to run in vnet prisons. Until then, these changes should not affect
semantics.
Reviewed by: markj
MFC after: 3 months
Differential Revision: https://reviews.freebsd.org/D38144
Since mountd(8) will not be able to do exports
when running in a vnet prison if enforce_statfs is
set to 0, add a check for this to prison_check_nfsd().
Reviewed by: jamie, markj
MFC after: 2 months
Differential Revision: https://reviews.freebsd.org/D38189
This somewhat undermines the initial goal of sousrsend() to have all
the special error handling for a write on a socket in a single place.
The aio(4) needs to see EWOULDBLOCK to re-schedule the job. Because
aio(4) handles return from soreceive() and sousrsend() with the same
code, we can't check for (error == 0 && done < job_nbytes). Keeping
this exclusion for aio(4) seems a lesser evil.
Fixes: 7a2c93b86e
Since 32c203577a by phk in 1999 (Make even more of the PPSAPI
implementations generic), the "nsec" parameter of hardpps() is a time
difference and no longer a time point. Change the name to "delta_nsec"
and adjust the comment.
Remove comment about a clock tick adjustment which is no longer in the code.
Pull Request: https://github.com/freebsd/freebsd-src/pull/640
Reviewed by: imp
cpuset_which() resolves the argument pair which and id and returns references
to an appropriate resources. To avoid leaking resources or accessing unresolved
references to a resources handle new which CPU_WHICH_TIDPID wherever
cpuset_which() is called.
To avoid code duplication cpuset_which2() has been added.
Reported by: syzbot+331e8402e0f7347f0f2a@syzkaller.appspotmail.com
Reviewed by: kib
Differential Revision: https://reviews.freebsd.org/D38272
MFC after: 2 weeks
Under Linux to sched_[g|s]etaffinity() functions the value returned from a call
to gettid(2) (thread id) can be passed in the argument pid. Specifying pid as 0
will set the attribute for the calling thread, and passing the value returned
from a call to getpid(2) (process id) will set the attribute for the main thread
of the thread group.
Native cpuset(2) family of system calls has "which" argument to determine how
the value of id argument is interpreted, i.e., CPU_WHICH_TID is used to pass
a thread id and CPU_WHICH_PID - to pass a process id.
For now native sched_[g|s]etaffinity() implementation is wrong as uses "which"
CPU_WHICH_PID to pass both (process and thread id) to the kernel. To fix this
adding a new "which" CPU_WHICH_TIDPID intended to handle both id's.
Reviewed by: kib
Differential Revision: https://reviews.freebsd.org/D38209
MFC after: 1 week
Since f35093f8 semantics of a thread affinity functions is changed to be a
compatible with Linux:
In case of getaffinity(), the minimum cpuset_t size that the kernel permits is
the maximum CPU id, present in the system, / NBBY bytes, the maximum size is not
limited.
In case of setaffinity(), the kernel does not limit the size of the user-provided
cpuset_t, internally using only the meaningful part of the set, where the upper
bound is the maximum CPU id, present in the system, no larger than the size of
the kernel cpuset_t.
To match pthread_attr_[g|s]etaffinity_np checks of the user-provided cpusets to
the kernel behavior export the minimum cpuset_t size allowed by running kernel
via new sysctl kern.sched.cpusetsizemin and use it in checks.
Reviewed by:
Differential Revision: https://reviews.freebsd.org/D38112
MFC after: 1 week
Merge commit 'b3392d84da5bf2162baf937c77e0557f3fd8a52b' into zstd_1.5.2
full changelog: https://github.com/facebook/zstd/compare/v1.4.8...v1.5.2
Updated sys/kern/subr_compressor.c to new API
MFC after: 3 days
Relnotes: yes
Sponsored by: Klara, Inc.
The functions are intended to report kernel variables that are
stored as sbintime_t (pointed to by arg1) as human readable
nanoseconds or milliseconds (reported via sysctl_handle_64).
The variable types and names were reversed. I guess there is
no functional change here, as all types flipped around were
signed 64. Note that these function aren't used yet anywhere
in the kernel.
Reviewed by: mav
Differential revision: https://reviews.freebsd.org/D38217
Currently shutdown_reset() is registered as the final entry of the
shutdown_final event handler. However, if a panic occurs early in boot
before the event is registered (SI_SUB_INTRINSIC), we may end up
spinning in the subsequent infinite for loop and failing to reset
altogether. Instead we can simply call this function unconditionally.
Reviewed by: markj
MFC after: 2 weeks
Differential Revision: https://reviews.freebsd.org/D37981
This will allow user-space programs (e.g. lsof) to locate deleted files
whose nlink equals zero. Prior to this commit, programs has to use
stat(kf_path) to get nlink, but that will fail if the file is deleted.
[mjg: s/fail/file in the commit message]
Reviewed by: mjg
Differential Revision: https://reviews.freebsd.org/D38169
In vn_getsize_locked(), when storing vattr.va_size of type u_quad_t into
off_t size, we must avoid overflow.
Then, the check for fsize < 0, introduced in the commit
f45feecfb2 'vfs: add vn_getsize', is nop [1].
Reported and reviewed by: jhb
Coverity CID: 1502346
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
Differential revision: https://reviews.freebsd.org/D38133
The assertion is equivalent to kstack_contains() so use that rather
than spelling it out.
Suggested by: jhb
Reviewed by: jhb
MFC after: 1 week
Sponsored by: DARPA, AFRL
Differential Revision: https://reviews.freebsd.org/D38107
Currently function prison_ip_restrict() returns true if the replacement
buffer was used, or no buffer provided and allocation fails and should
redo. The logic is confusing and cause possibly infinite loop from
eb8dcdeac2 .
Reviewed by: jamie, glebius
Approved by: kp (mentor)
Differential Revision: https://reviews.freebsd.org/D37918
And possibly infinite loop calling prison_ip_restrict() in
kern_jail_set() [2].
[1] It is possible that prisons do not have any IPv4 or IPv6 addresses.
[2] If prison_ip_restrict() is not provided with prison_ip, when it
allocates prison_ip successfully, then it should return false to
indicate not redo prison_ip_restrict() later.
Reviewed by: glebius
Approved by: kp (mentor)
Fixes: eb8dcdeac2 jail: network epoch protection for IP address lists
Differential Revision: https://reviews.freebsd.org/D37906
... which verifies that given file table does not have file descriptors
referencing vnodes on the specified mount point. It is up to the caller
to ensure that the check is not racy.
Reviewed by: mckusick
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
Differential revision: https://reviews.freebsd.org/D37896
getattr is very expensive and in important cases only gets called to get
the size. This can be optimized with a dedicated routine which obtains
that statistic.
As a step towards that goal make size-only consumers use a dedicated
routine.
Reviewed by: kib
Differential Revision: https://reviews.freebsd.org/D37885
If an error such as an invalid record or one whose decryption fails is
detected on a socket that has received a RST then ktls_drop() could
ignore the error since INP_DROPPED could already be set. In this case
soreceive_generic hangs since it does not return from a KTLS socket
with pending encrypted data unless there is an error (so_error) (this
behavior is to ensure that soreceive_generic doesn't return a
premature EOF when there is pending data still being decrypted).
Note that this was a bug prior to
69542f2682 as tcp_usr_abort would also
have ignored the error in this case.
Reviewed by: gallatin
Sponsored by: Chelsio Communications
Differential Revision: https://reviews.freebsd.org/D37775
To quote from a comment above vput_final:
<quote>
* XXX Some filesystems pass in an exclusively locked vnode and strongly depend
* on the lock being held all the way until VOP_INACTIVE. This in particular
* happens with UFS which adds half-constructed vnodes to the hash, where they
* can be found by other code.
</quote>
As is there is no mechanism which allows filesystems to denote that a
vnode is fully initialized, consequently problems like the above are
only found the hard way(tm).
Add rudimentary support for state transitions, which in particular allow
to assert the vnode is not legally unlocked until its fate is decided
(either construction finishes or vgone is called to abort it).
The new field lands in a 1-byte hole, thus it does not grow the struct.
Bump __FreeBSD_version to 1400077
Reviewed by: kib (previous version)
Tested by: pho
Differential Revision: https://reviews.freebsd.org/D37759
While here prefix with v for better consistency with the vnode stuff.
Reviewed by: kib (previous version)
Differential Revision: https://reviews.freebsd.org/D37759
Re-assign the sc local (syscall number) before moving args for SYS_syscall.
Correct the audit and kdtrace hooks invocations.
Fixes: 140ceb5d95
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
It indicates to a debugger that the thread is stopped at the
kernel->user exit path.
Reviewed by: markj
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
Differential revision: https://reviews.freebsd.org/D37590
and TDB_COREDUMPRQ to TDB_COREDUMPREQ
Reviewed by: markj
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
Differential revision: https://reviews.freebsd.org/D37590
Mount on a LE machine a filesystem formatted for BE is not supported
currently. This adds a check for the superblock magic number using
swapped bytes to guess and warn the user that it may be a valid
superblock but endian is incompatible.
MFC after: 2 weeks
Reviewed by: mckusick
Obtained from: mckusick, alfredo
Differential Revision: https://reviews.freebsd.org/D37675
For file mounts, the directory vnode is not available from namei and this
prevents the use of vn_fullpath_hardlink. In this case, we can use the
vnode which was covered by the file mount with vn_fullpath.
This also disallows file mounts over files with link counts greater than
one to ensure a deterministic path to the mount point.
Reviewed by: mjg, kib
Tested by: pho
The main use-case for this is to support mounting config files and
secrets into OCI containers. My current workaround copies the files into
the container which is messy and risks secrets leaking into container
images if the cleanup fails.
This adds a VFCF flag to indicate whether the filesystem supports file
mounts and allows fspath to be either a directory or a file if the flag
is set.
Test Plan:
$ sudo mkdir -p /mnt
$ sudo touch /mnt/foo
$ sudo mount -t nullfs /COPYRIGHT /mnt/foo
Reviewed by: mjg, kib
Tested by: pho
instead of looking at SAVESTART
This is a step towards removing the flag.
Reviewed by: mckusick
Tested by: pho
Differential Revision: https://reviews.freebsd.org/D34468
This patch adds "allow.nfsd" to the jail code based on a
new kernel build option VNET_NFSD. This will not work
until future patches fix nmount(2) to allow mountd to
run in a vnet prison and the NFS server code is patched
so that global variables are in a vnet.
The jail(8) man page will be patched in a future commit.
Reviewed by: jamie
MFC after: 4 months
Differential Revision: https://reviews.freebsd.org/D37637
It appears that, prior to r158857 vfs_domount() checked
suser() when MNT_EXPORTED was specified.
r158857 appears to have broken this, since MNT_EXPORTED
was no longer set when mountd.c was converted to use nmount(2).
r164033 replaced the suser() check with
priv_check(td, PRIV_VFS_MOUNT_EXPORTED), which does the
same thing (ie. checks for effective uid == 0 assuming suses_enabled
is set).
This patch restores this check by setting MNT_EXPORTED when the
"export" mount option is specified to nmount().
I think this is reasonable since only mountd(8) should be setting
exports and I doubt any non-root mounted file system would
be setting its own exports.
Reviewed by: kib, markj
MFC after: 2 weeks
Differential Revision: https://reviews.freebsd.org/D37718
pr_abort calls tcp_usr_abort which calls tcp_drop with ECONNABORTED.
After pr_abort returns, the so_error is then set to a more specific
error. However, a reader can observe and return the ECONNABORTED
error before so_error is set to the desired error value. This is
resulting in spurious test failures of recently added tests for
invalid conditions such as invalid headers.
To fix, refactor the code to abort a connection to call tcp_drop
directly with the desired error value. ktls_reset_send_tag already
calls tcp_drop directly when it aborts a connection due to an error.
Reviewed by: gallatin
Reported by: CI (jenkins), gallatin, olivier
Sponsored by: Chelsio Communications
Differential Revision: https://reviews.freebsd.org/D37692
Sockets have special handling for EPIPE on a write, that was spread out
into several places. Treating transient errors is also special - if
protocol is atomic, than we should ignore any changes to uio_resid, a
transient error means the write had completely failed (see d2b3a0ed31).
- Provide sousrsend() that expects a valid uio, and leave sosend() for
kernel consumers only. Do all special error handling right here.
- In dofilewrite() don't do special handling of error for DTYPE_SOCKET.
- For send(2), write(2) and aio_write(2) call into sousrsend() and remove
error handling for kern_sendit(), soo_write() and soaio_process_job().
PR: 265087
Reported by: rz-rpi03 at h-ka.de
Reviewed by: markj
Differential revision: https://reviews.freebsd.org/D35863
When VV_CROSSLOCK is present, the lock for the vnode at the current
stage of lookup must be held across the VFS_ROOT() call for the
filesystem mounted at the vnode. Since VV_CROSSLOCK implies that
the root vnode reuses the already-held lock, the possibility for
recursion should be made clear in the flags passed to VFS_ROOT().
For cases in which the lock is held exclusive, this means passing
LK_CANRECURSE. For cases in which the lock is held shared, it
means clearing LK_NODDLKTREAT to allow VFS_ROOT() to potentially
recurse on the shared lock even in the presence of an exclusive
waiter.
That the existing code works for unionfs is due to a coincidence
of the current unionfs implementation.
Reviewed by: kib
Tested by: pho
Differential Revision: https://reviews.freebsd.org/D37458
Return the value as stat(2) st_blocks.
Suggested and reviewed by: markj (previous version)
Tested by: pho
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
Differential revision: https://reviews.freebsd.org/D37097
This makes tmpfs size accounting correct for the sparce files. Also
correct report st_blocks/va_bytes. Previously the reported value did not
accounted for the swapped out pages.
PR: 223015
Reviewed by: markj
Tested by: pho
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
Differential revision: https://reviews.freebsd.org/D37097
This allows the use of chroot and/or jail environments which depend on
interpreters registed with imgact_binmisc to use emulator binaries from
the host to emulate programs inside the chroot.
Reviewed by: imp
MFC after: 2 weeks
Differential Revision: https://reviews.freebsd.org/D37432
Early versions of this code had a free, but this one doesn't need
it. Remove the forgotten free(vv); from earlier versions.
Fixes: ed56dcfc6b
Noticed by: Michael Butler
Sponsored by: Netflix
Copy the arg that sets a variable to maximize the reuse of this
routine. There are places we call it from that are const char * and it
might not be safe to cast that away.
Sponsored by: Netflix
This reverts commit 68c3f03021. There are
some weird crashes when KVMs switch caused by this, so revert this
commit until they are sorted out.
Reported by: cy@
Sponsored by: Netflix
In the rare case that we succeed in probing, but fail to attach, flip
the default to be to disable the
device. hw.bus.disable_failed_devices=false is no required to restore
the old behavior. The old behavior dates form a time when dynamic
control of devices wasn't yet present (devctl didn't exist). Now that
one can retry probe/attach the device with devctl, the default doesn't
make sense: The more desirable behaivor is to have stable device numbers
when one has several instances of the same device in a system (common
for NICs or HBAs).
Reviewed by: jhb (verbal)
Sponsored by: Netflix
Normally, when a device fails to attach, we tear down the newbus state
for that device so that future driver loads can try again (maybe with a
different driver, or maybe with a re-loaded and fixed kld).
Sometimes, however, it is desirable to have the device fail
permanantly. We do this by calling device_disable() on a failed
attached, as well as keeping the device in DS_ATTACHING forever. This
prevents retries on that device. This is enabled via
hw.bus.disable_failed_devices=1 in either a hint via the loader, or at
runtime with a sysctl setting. Setting from 1 -> 0 at runtime will not
affect previously disabled devices, however: they remain disabled.
They can be re-enabled manually with devctl enable, however.
Sponsored by: Netflix
Reviewed by: gallatin, hselasky, jhb
Differential Revision: https://reviews.freebsd.org/D37517
One year ago, I deprecated 'kern' in favor of 'kernel' for the system
name for some power events. I'm about to remove it from the kernel, but
realized there's been no warning generated for users. Preserve POLA by
converting on the fly here and issuing a warning for 14.x, and an fatal
error after we branch 15. Make compiling it an error on 16 to remove
the gross hack after we branch.
Sponsored by: Netflix
Reviewed by: bapt
Differential Revision: https://reviews.freebsd.org/D37584
The new "kernel" system name is the one that's documented and has
been generated for a year now. Remove the old one now that 14.0
is getting close.
Sponsored by: Netflix
Reviewed by: bapt
Differential Revision: https://reviews.freebsd.org/D37582
Since ZFS reports _PC_MIN_HOLE_SIZE as 512 (although it
appears that an unwritten region must be at least f_iosize
to remain unallocated), vn_generic_copy_file_range()
uses 4096 for the copy blksize for ZFS, reulting in slow copies.
For most other file systems, _PC_MIN_HOLE_SIZE and f_iosize
are the same value, so this patch modifies the code to
use f_iosize for most cases. It also documents in comments
why the blksize is being set a certain way, so that the code
does not appear to be doing "magic math".
Reported by: allanjude
Reviewed by: allanjude, asomers
MFC after: 2 weeks
Differential Revision: https://reviews.freebsd.org/D37076
This is mainly intended to provide a fallback for TOE TLS which may
need to use software decryption for an initial record at the start
of a connection.
Reviewed by: markj
Sponsored by: Chelsio Communications
Differential Revision: https://reviews.freebsd.org/D37370
The option is not even recognized and with that patched it does not
compile. Even if it did work, it would be prohibitively expensive to
use.
Interested parties can use pmcstat or dtrace instead.
ktls_cleanup() does not free ktls session objects, it merely
cleans (and frees) members of the object.
Change callers to use ktls_free() instead.
Reviewed by: gallatin, markj
Sponsored by: Chelsio Communications
Differential Revision: https://reviews.freebsd.org/D37352
The break-before-make requirement poses a problem when promoting or
demoting mappings containing thread structures: a CPU may raise a
translation fault while accessing curthread, and data_abort() accesses
the thread again before pmap_fault() can translate the address and
return.
Normally this isn't a problem because we have a hack to ensure that
slabs used by the thread zone are always accessed via the direct map,
where promotions and demotions are rare. However, this hack doesn't
work properly with UMA_MD_SMALL_ALLOC disabled, as is the case with
KASAN configured (since our KASAN implementation does not shadow the
direct map and so tries to force the use of the kernel map wherever
possible).
Fix the problem by modifying data_abort() to handle translation faults
in the kernel map without dereferencing "td", i.e., curthread, and
without enabling interrupts. pmap_klookup() has special handling for
translation faults which makes it safe to call in this context. Then,
revert the aforementioned hack.
Reviewed by: kevans, alc, kib, andrew
MFC after: 1 month
Sponsored by: Juniper Networks, Inc.
Sponsored by: Klara, Inc.
Differential Revision: https://reviews.freebsd.org/D37231
First, an sbuf_new() in device_get_path() shadows the sb
passed in by dev_wired_cache_add(), leaving its sb in an
unfinished state, leading to a failed KASSERT(). Fixing this
is as simple as removing the sbuf_new() from device_get_path()
Second, we cannot simply take a pointer to the sbuf memory and
store it in the device location cache, because that sbuf
is freed immediately after we add data to the cache, leading
to a use-after-free and eventually a double-free. Fixing this
requires allocating memory for the path.
After a discussion with jhb, we decided that one malloc was
better than two in dev_wired_cache_add, which is why it changed
so much.
Reviewed by: jhb
Sponsored by: Netflix
MFC after: 14 days
This commit brings back the driver from FreeBSD commit
f187d6dfbf plus subsequent fixes from
upstream.
Relative to upstream this commit includes a few other small fixes such
as additional INET and INET6 #ifdef's, #include cleanups, and updates
for recent API changes in main.
Reviewed by: pauamma, gbe, kevans, emaste
Obtained from: git@git.zx2c4.com:wireguard-freebsd @ 3cc22b2
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D36909
Refactor the symlink and mountpoint traversal logic to avoid
repeatedly checking the vnode type; a symlink cannot be a mountpoint
and vice versa. Avoid repeatedly checking cn_flags for NOCROSSMOUNT
and simplify the check which determines whether the vnode is a
mountpoint.
Suggested by: mjg
Reviewed by: kib
Tested by: pho
Differential Revision: https://reviews.freebsd.org/D35054
These are of limited use since the crossmp vnode locking ops have not
actually used a lock since commit
a2d3554542. We in fact require that
these operations are always issued with LK_SHARED. Additionally,
these directives can produce a false positive in certain VV_CROSSLOCK
cases which require upgrading of the covered vnode lock from shared
to exclusive.
While here, replace the runtime check of LK_SHARED with a KASSERT and
expand the check to include LK_NOWAIT, which all callers pass.
Reviewed by: kib
Tested by: pho
Differential Revision: https://reviews.freebsd.org/D35054
When a lookup operation crosses into a new mountpoint, the mountpoint
must first be busied before the root vnode can be locked. When a
filesystem is unmounted, the vnode covered by the mountpoint must
first be locked, and then the busy count for the mountpoint drained.
Ordinarily, these two operations work fine if executed concurrently,
but with a stacked filesystem the root vnode may in fact use the
same lock as the covered vnode. By design, this will always be
the case for unionfs (with either the upper or lower root vnode
depending on mount options), and can also be the case for nullfs
if the target and mount point are the same (which admittedly is
very unlikely in practice).
In this case, we have LOR. The lookup path holds the mountpoint
busy while waiting on what is effectively the covered vnode lock,
while a concurrent unmount holds the covered vnode lock and waits
for the mountpoint's busy count to drain.
Attempt to resolve this LOR by allowing the stacked filesystem
to specify a new flag, VV_CROSSLOCK, on a covered vnode as necessary.
Upon observing this flag, the vfs_lookup() will leave the covered
vnode lock held while crossing into the mountpoint. Employ this flag
for unionfs with the caveat that it can't be used for '-o below' mounts
until other unionfs locking issues are resolved.
Reported by: pho
Tested by: pho
Reviewed by: kib
Differential Revision: https://reviews.freebsd.org/D35054
In order to safely reuse excluded memory when it's reserved for special
purpose, we need to test whether or not the memory has been reserved
early in boot. physmem_excluded will return true when the entire range
is excluded, false otherwise.
Sponsored by: Netflix
List of changes:
- Use integer multiplication instead of long multiplication, because the result is an integer.
- Remove multiple if-statements and predict new if-statements.
- Rename local variable name, "ticks" into "retval" to avoid shadowing
the system "ticks" global variable.
Reviewed by: kib@ and imp@
MFC after: 1 week
Sponsored by: NVIDIA Networking
Differential Revision: https://reviews.freebsd.org/D36859
This allows to fix a bug where sbuf allocation done in the context of
dev_wired_cache_match() must use non-sleepable allocations.
Suggested by: jhb
Reviewed by: jhb, takawata
Discussed with: imp
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
Differential revision: https://reviews.freebsd.org/D36899
Later it would silently converted to ENOMEM always, because any error
was reported as NULL return path.
Reviewed by: jhb, takawata
Discussed with: imp
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
Differential revision: https://reviews.freebsd.org/D36899
Some virtual machines pass virtio MMIO device parameters via the kernel
command line as a series of virtio_mmio.device=<parameters> options.
These get translated into FreeBSD kernel environment variables; but
unfortunately they all use the same variable name, which resulted in
all but the first such parameter being ignored when the dynamic kernel
environment is set up from the initial environment buffers.
With this commit, duplicate environment settings will instead be stored
as ${name}_1, ${name}_2... ${name}_9999. In the unlikely event that
the same variable is set over 10000 times before the dynamic kernel
environment is set up, we panic.
Variable settings after the dynamic environment is initialized continue
to override the previously-set value; the change is limited to the very
early kernel boot (prior to SI_SUB_KMEM + 1) and changes behaviour from
"ignore" to "store with a different name" only.
Reviewed by: imp
Feedback from: kevans
Sponsored by: https://patreon.com/cperciva
Differential Revision: https://reviews.freebsd.org/D36187
By convention, EINVAL is returned when validating arguments, not EPERM.
This matches the documented behaviour of sched_setscheduler(3), and that
of SCHED_OTHER.
PR: 227735
MFC after: 1 week
Reviewed by: kib, markj
Differential Revision: https://reviews.freebsd.org/D37021
It likely won't happen, but is consistent with the other functions of
this KPI.
Reviewed by: imp, jhb
MFC after: 3 days
Differential Revision: https://reviews.freebsd.org/D33479
It can be static within uart_tty.c. It is an open question whether there
remains any real benefit to having uart instances share a swi thread.
Reviewed by: imp, markj, jhb
MFC after: 1 week
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D36938
This reverts commit 76e6e4d72f.
Several programs in the tree use -1 instead of INT_MAX to use
the maximum value. Thanks to Eugene Grosbein for pointing this
out.
Ensure that a negative backlog argument is handled as it if was 0.
Reviewed by: markj@, glebius@
Sponsored by: Netflix, Inc.
Differential Revision: https://reviews.freebsd.org/D31821
This will resolve a reference and return the appropriate handle, a node
on the simplebus or an ACPI_HANDLE for ACPI. For now we do not try to
further abstract the return type.
MFC after: 2 weeks
Reviewed by: mw
Differential Revision: https://reviews.freebsd.org/D36793
Mechanically cleanup INP_TIMEWAIT from the kernel sources. After
0d7445193a, this commit shall not cause any functional changes.
Note: this flag was very often checked together with INP_DROPPED.
If we modify in_pcblookup*() not to return INP_DROPPED pcbs, we
will be able to remove most of this checks and turn them to
assertions. Some of them can be turned into assertions right now,
but that should be carefully done on a case by case basis.
Differential revision: https://reviews.freebsd.org/D36400
In non-periodic mode absolute timers fire at exactly the time given.
When specifying a fast clock, align the firing time so that less
timer interrupt events are needed.
Reviewed by: rrs @
Differential Revision: https://reviews.freebsd.org/D36858
MFC after: 1 week
Sponsored by: NVIDIA Networking
Add local functions to workaround an instruction segment trap (panic)
when the indirect functions copyin and copyout are called by an external
loadable kernel module (i.e. pfsync, zfs and linuxulator). The crash
was triggered by change 47a57144af, but
kernel binary linked with LLD 9 works fine. LLVM bisect points that LLD
behavior chaged after dc06b0bc9ad055d06535462d91bfc2a744b2f589.
This is know to affect powerpc targets only and the final fix is still
being discussed with the LLVM community.
PR: 266730
Reviewed by: luporl, jhibbits (on IRC, previous version)
MFC after: 2 days
Sponsored by: Instituto de Pesquisas Eldorado (eldorado.org.br)
Differential Revision: https://reviews.freebsd.org/D36234
Modify db_show_sysctl_all so that it does not copy more than once the
data of the input oid, and so that what it passes to db_show_oid does
not alarm coverity.
Reviewed by: markj
Differential Revision: https://reviews.freebsd.org/D36847
Protocols such as netlink may need a large socket receive buffer,
measured in tens of megabytes. This change allows netlink to
set larger socket buffers (given the privs are in place), without
requiring user to manuall bump maxsockbuf.
Reviewed by: glebius
Differential Revision: https://reviews.freebsd.org/D36747
Allow to set custom per-protocol handlers for the socket buffers
ioctls by introducing pr_setsbopt callback with the default value
set to the currently-used sbsetopt().
Reviewed by: glebius
Differential Revision: https://reviews.freebsd.org/D36746
The implementation of sysctl_search_oid no longer relies on the
initial value of nodes to be all NULL, so remove the comment that
demands it and let the caller stop enforcing it.
Reviewed by: hselasky
Differential Revision: https://reviews.freebsd.org/D36768
sysctl_search_old makes several tests in a loop that can be removed.
The first test in the loop is only ever true on the first loop
iteration, and is always true on that iteration, so its work can be
done before the loop begins.
The upper and lower bounds on the loop variable 'indx' are each tested
on each iteration, but 'indx' is changed in one direction or the other
only once within the loop, so only one bound needs to be checked.
Two ways remain in the loop that nodes[indx] can change (after one of
them is put before the loop start), and one of them applies exactly
when indx has been incremented, so no separate test for that case
requires testing.
Restructure and add comments that makes clearer that this is a basic
depth-first search.
Reviewed by: hselasky
Differential Revision: https://reviews.freebsd.org/D36741
sysctl_register_oid must check the uniqueness of any newly computed
oid_number in sysctl_register_oid.
Reviewed by: asomers
MFC with: d3f96f6610
Differential Revision: https://reviews.freebsd.org/D36743
To avoid using the sysctl list macros directly in external kernel modules.
Reviewed by: asomers, manu and asiciliano
Differential Revision: https://reviews.freebsd.org/D36748
MFC after: 1 week
Sponsored by: NVIDIA Networking
Specifically, when we receive a violation and we're configured to panic,
kasan_enabled gets unset before we descend into panic(). At this point,
there's no longer any reason to allow marking as kasan_shadow_check() is
disabled -- we have some inherent risk of faulting or panicking if the
system's in a bad enough state with no benefit.
Reviewed by: markj
Sponsored by: Juniper Networks, Inc.
Sponsored by: Klara, Inc.
Differential Revision: https://reviews.freebsd.org/D36742
Sysctl OIDs were internally stored in linked lists, triggering O(n^2)
behavior when userland iterates over many of them. The slowdown is
noticeable for MIBs that have > 100 children (for example, vm.uma). But
it's unignorable for kstat.zfs when a pool has > 1000 datasets.
Convert the linked lists into RB trees. This produces a ~25x speedup
for listing kstat.zfs with 4100 datasets, and no measurable penalty for
small dataset counts.
Bump __FreeBSD_version for the KPI change.
Sponsored by: Axcient
Reviewed by: mjg
Differential Revision: https://reviews.freebsd.org/D36500
The vn_rlimit_fsizex() function:
- checks that the write does not exceed RLIMIT_FSIZE limit and fs
maximum supported file size
- truncates write length if it exceeds the RLIMIT_FSIZE or max file size,
but there are some bytes to write
- sends SIGXFSZ if RLIMIT_FSIZE would be exceed otherwise
POSIX mandates the truncated write in case when some bytes can be
written but whole write request fails the RLIMIT_FSIZE check.
The function is supposed to be used from VOP_WRITE()s. Due to
pecularity in the VFS generic write syscall layer, uio_resid must
correctly reflect the written amount (noted by markj). Provide the dual
vn_rlimit_fsizex_res() function to correct uio_resid after the clamp
done in vn_rlimit_fsizex() on VOP_WRITE() return.
PR: 164793
Reviewed by: asomers, jah, markj
Tested by: pho
Sponsored by: The FreeBSD Foundation
MFC after: 2 weeks
Differential revision: https://reviews.freebsd.org/D36625