When we terminate a vnet (i.e. jail) we move interfaces back to their home
vnet. We need to protect our access to the V_ifnet CK_LIST.
We could enter NET_EPOCH, but if_detach_internal() (called from if_vmove())
waits for net epoch callback completion. That's not possible from NET_EPOCH.
Instead, we take the IFNET_WLOCK, build a list of the interfaces that need to
move and, once we've released the lock, move them back to their home vnet.
We cannot hold the IFNET_WLOCK() during if_vmove(), because that results in a
LOR between ifnet_sx, in_multi_sx and iflib ctx lock.
Separate out moving the ifp into or out of V_ifnet, so we can hold the lock as
we do the list manipulation, but do not hold it as we if_vmove().
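The collect-then-move pattern described above can be sketched in userland C.
This is only an illustration under stated assumptions: struct iface,
list_locked, move_home() and return_foreign_ifaces() are hypothetical names,
and a plain flag stands in for IFNET_WLOCK so the ordering is easy to check.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for an ifnet; not the kernel structure. */
struct iface {
	int		 home_vnet;	/* vnet the interface belongs to */
	int		 cur_vnet;	/* vnet it currently lives in */
	struct iface	*next;		/* linkage on the per-vnet list */
	struct iface	*move_next;	/* linkage on the private move list */
};

/* A flag standing in for the list lock, so the ordering can be asserted. */
static int list_locked;

/* Stand-in for the step that must not run with the list lock held. */
static void
move_home(struct iface *ifp)
{
	assert(!list_locked);
	ifp->cur_vnet = ifp->home_vnet;
}

/*
 * Unlink every foreign interface from *headp while "holding the lock",
 * then move each one after the lock has been dropped.  Returns the
 * number of interfaces moved.
 */
static int
return_foreign_ifaces(struct iface **headp, int dying_vnet)
{
	struct iface *ifp, **prevp, *tomove = NULL;
	int moved = 0;

	list_locked = 1;
	prevp = headp;
	while ((ifp = *prevp) != NULL) {
		if (ifp->home_vnet != dying_vnet) {
			*prevp = ifp->next;	/* list manipulation only */
			ifp->next = NULL;
			ifp->move_next = tomove;
			tomove = ifp;
		} else
			prevp = &ifp->next;
	}
	list_locked = 0;

	while ((ifp = tomove) != NULL) {
		tomove = ifp->move_next;
		ifp->move_next = NULL;
		move_home(ifp);
		moved++;
	}
	return (moved);
}
```

The private move list is what lets the second loop run unlocked: once an
entry is unlinked, no other walker of the shared list can reach it.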
Reviewed by: melifaro
MFC after: 2 weeks
Sponsored by: Modirum MDPay
Differential Revision: https://reviews.freebsd.org/D27279
-fstack-clash-protection was added in Clang commit e67cbac81211 but was
enabled only on Linux. It should work fine on FreeBSD as well, so
enable it.
To be discussed and upstreamed with a test. The OS test should probably
just be removed.
Reviewed by: dim
MFC after: 1 week
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D27366
It no longer serves any purpose, as evidenced by the fact that we never take it
without ifnet_sxlock.
Sponsored by: Modirum MDPay
Differential Revision: https://reviews.freebsd.org/D27278
The current logic is a fine choice for a system administrator modifying
process cpusets or a process creating a new cpuset(2), but not ideal for
processes attaching to a jail.
Currently, a process attaching to a jail behaves exactly like any other
process and loses any mask it might have applied, because cpuset_setproc()
is entirely based around the assumption that non-anonymous cpusets in the
process can be replaced with the new parent set.
This approach slightly improves the jail attach integration by modifying
cpuset_setproc() callers to indicate if they should rebase their cpuset to
the indicated set or not (i.e. cpuset_setproc_update_set).
If we're rebasing and the process currently has a cpuset assigned that is
not the containing jail's root set, then we will now create a new base set
for it hanging off the jail's root with the existing mask applied instead of
using the jail's root set as the new base set.
Note that the common case will be that the process doesn't have a cpuset
within the jail root, but the system root can freely assign a cpuset from
a jail to a process outside of the jail with no restriction. We assume that
that may have happened or that it could happen due to a race when we drop
the proc lock, so we must recheck both within the loop to gather up
sufficient freed cpusets and after the loop.
To recap, here's how it worked before in all cases:
0     4 <-- jail              0      4 <-- jail / process
|                             |
1                     ->      1
|
3 <-- process
Here's how it works now:
0     4 <-- jail              0      4 <-- jail
|                             |      |
1                     ->      1      5 <-- process
|
3 <-- process
or
0     4 <-- jail              0      4 <-- jail / process
|                             |
1 <-- process         ->      1
More importantly, in both cases, the attaching process still retains the
mask it had prior to attaching, or the attach fails with EDEADLK if it's
left with no CPUs to run on or the domain policy is incompatible. The
author of this patch considers this almost a security feature: a MAC
policy could grant PRIV_JAIL_ATTACH to an unprivileged user that's
restricted to some subset of available CPUs, and attaching to a jail with a
wider mask might otherwise lift that user's restrictions.
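The mask rule can be sketched as follows. This is a simplified model, not
the kernel code: CPUs are bits in a 64-bit word rather than a cpuset_t, the
domain policy check is left out, and attach_mask() is a hypothetical name.

```c
#include <errno.h>
#include <stdint.h>

/*
 * Compute the mask a process keeps after attaching to a jail.
 * Returns 0 and stores the result in *newmask, or EDEADLK if the
 * process would be left with no CPU to run on.
 */
static int
attach_mask(uint64_t proc_mask, uint64_t jail_mask, uint64_t *newmask)
{
	uint64_t m = proc_mask & jail_mask;

	if (m == 0)
		return (EDEADLK);
	*newmask = m;
	return (0);
}
```

In this model, a process limited to CPU 1 that attaches to a jail spanning
CPUs 0-3 stays on CPU 1; it does not gain the jail's wider mask.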
In most cases, it's anticipated that admins will use this to be able to,
for example, `cpuset -c -l 1 jail -c path=/ command=/long/running/cmd`,
and avoid the need for contortions to spawn a command inside a jail with a
more limited cpuset than the jail.
Reviewed by: jamie
MFC after: 1 month (maybe)
Differential Revision: https://reviews.freebsd.org/D27298
cpuset_init() is a better descriptor for what the function actually does. The
name was previously taken by a sysinit that set up cpuset_zero's mask
from all_cpus; it was removed in r331698 before stable/12 branched.
A comment referencing the removed sysinit has now also been removed, since
the setup previously done was moved into cpuset_thread0().
Suggested by: markj
MFC after: 1 week
Currently, it must always allocate a new set to be used for passing to
_cpuset_create, but it doesn't have to. This is purely kern_cpuset.c
internal and it's sparsely used, so just change it to use *setp if it's
non-NULL and modify the two consumers to pass in the address of a NULL
cpuset.
This paves the way for consumers that want the unr allocation without the
possibility of sleeping as long as they've done their due diligence to
ensure that the mask will properly apply atop the supplied parent
(i.e. avoiding the free_unr() in the last failure path).
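The "use *setp if it's non-NULL" convention can be sketched in userland C.
struct set and set_create() are hypothetical stand-ins, and calloc() takes
the place of the kernel allocator; only the calling convention is the point.

```c
#include <assert.h>
#include <errno.h>
#include <stdlib.h>

/* Hypothetical stand-in for struct cpuset. */
struct set {
	int id;
};

/*
 * Initialize a set.  If *setp is non-NULL the caller's preallocated
 * storage is used and nothing is allocated here; otherwise a new set
 * is allocated and returned through *setp.
 */
static int
set_create(struct set **setp, int id)
{
	struct set *s;

	assert(setp != NULL);
	s = *setp;
	if (s == NULL) {
		s = calloc(1, sizeof(*s));
		if (s == NULL)
			return (ENOMEM);
	}
	s->id = id;
	*setp = s;
	return (0);
}
```

A caller that cannot sleep would preallocate and pass the address of that
pointer; existing callers pass the address of a NULL pointer and keep the
old behavior.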
Reviewed by: jamie, markj
MFC after: 1 week
Differential Revision: https://reviews.freebsd.org/D27297
All paths leading into closefp() will either replace or remove the fd from
the filedesc table, and closefp() will call fo_close methods that can and do
currently sleep without regard for the possibility of an ERESTART. This can
be dangerous in multithreaded applications as another thread could have
opened another file in its place that is subsequently operated on upon
restart.
The following are seemingly the only ones that will pass back ERESTART
in-tree:
- sockets (SO_LINGER)
- fusefs
- nfsclient
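One plausible way to keep a close from being restarted is to translate the
restart code into EINTR before it reaches the syscall return path. The
message above does not spell out the remedy, so treat this as an assumption;
SK_ERESTART and close_result() are hypothetical names for illustration.

```c
#include <errno.h>

/* Hypothetical stand-in for the kernel-internal "restart syscall" code. */
#define	SK_ERESTART	(-1)

/*
 * Map the result of a close method to what the caller should see.
 * The fd is already gone from the table, so a restart must not happen.
 */
static int
close_result(int error)
{
	return (error == SK_ERESTART ? EINTR : error);
}
```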
Reviewed by: jilles, kib
MFC after: 1 week
Differential Revision: https://reviews.freebsd.org/D27310
Crypto file descriptors were added in the original OCF import as a way
to provide per-open data (specifically the list of symmetric
sessions). However, this gives a bit of a confusing API where one has
to open /dev/crypto and then invoke an ioctl to obtain a second file
descriptor. This also does not match the API used with /dev/crypto on
other BSDs or with Linux's /dev/crypto driver.
Character devices have gained support for per-open data via cdevpriv
since OCF was imported, so use cdevpriv to simplify the userland API
by permitting ioctls directly on /dev/crypto descriptors.
To provide backwards compatibility, CRIOGET now opens another
/dev/crypto descriptor via kern_openat() rather than dup'ing the
existing file descriptor. This preserves prior semantics in case
CRIOGET is invoked multiple times on a single file descriptor.
Reviewed by: markj
Relnotes: yes
Sponsored by: Chelsio Communications
Differential Revision: https://reviews.freebsd.org/D27302
This reduces some code duplication. One behavior change is that
ppt_assign_device() will now only succeed if the device is unowned.
Previously, a device could be assigned to the same VM multiple times,
but each time it was assigned, the device's state was reset.
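The new ownership rule can be sketched as below. struct pt_slot and
pt_assign() are hypothetical, and the EBUSY return value is an assumption;
the message does not say which error the real function returns.

```c
#include <errno.h>
#include <stddef.h>

/* Hypothetical stand-in for a passthrough device slot. */
struct pt_slot {
	const void *owner;	/* owning VM, or NULL if unowned */
};

/* Assign the device to vm only if nobody owns it yet. */
static int
pt_assign(struct pt_slot *slot, const void *vm)
{
	if (slot->owner != NULL)
		return (EBUSY);	/* includes re-assignment to the same VM */
	slot->owner = vm;
	return (0);
}
```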
Reviewed by: markj, grehan
MFC after: 2 weeks
Sponsored by: Chelsio Communications
Differential Revision: https://reviews.freebsd.org/D27301
Add a new ioctl to disable all MSI-X interrupts for a PCI passthrough
device and invoke it if a write to the MSI-X capability registers
disables MSI-X. This avoids leaving MSI-X interrupts enabled on the
host if a guest device driver has disabled them (e.g. as part of
detaching a guest device driver).
This was found by Chelsio QA when testing that a Linux guest could
switch from MSI-X to MSI interrupts when using the cxgb4vf driver.
While here, explicitly fail requests to enable MSI on a passthrough
device if MSI-X is enabled and vice versa.
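The mutual exclusion between MSI and MSI-X can be sketched like this.
struct pt_intr and its helpers are hypothetical, and EBUSY is an assumed
error code chosen for the sketch, not taken from the change.

```c
#include <errno.h>
#include <stdbool.h>

/* Hypothetical interrupt state of a passthrough device. */
struct pt_intr {
	bool msi_enabled;
	bool msix_enabled;
};

/* Refuse to enable MSI while MSI-X is active. */
static int
pt_enable_msi(struct pt_intr *pi)
{
	if (pi->msix_enabled)
		return (EBUSY);
	pi->msi_enabled = true;
	return (0);
}

/* Refuse to enable MSI-X while MSI is active. */
static int
pt_enable_msix(struct pt_intr *pi)
{
	if (pi->msi_enabled)
		return (EBUSY);
	pi->msix_enabled = true;
	return (0);
}

/* Called when the guest clears the MSI-X enable bit. */
static void
pt_disable_msix(struct pt_intr *pi)
{
	pi->msix_enabled = false;
}
```

With the disable hook in place, a guest can switch from MSI-X to MSI:
disabling MSI-X first makes the later MSI enable succeed.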
Reported by: Sony Arpita Das @ Chelsio
Reviewed by: grehan, markj
MFC after: 2 weeks
Sponsored by: Chelsio Communications
Differential Revision: https://reviews.freebsd.org/D27212
This driver provides support for Realtek PCI SD card readers. It attaches
mmc(4) bus on card insertion and detaches it on card removal. It has been
tested with RTS5209, RTS5227, RTS5229, RTS522A, RTS525A and RTL8411B. It
should also work with RTS5249, RTL8402 and RTL8411.
PR: 204521
Submitted by: Henri Hennebert (hlh at restart dot be)
Reviewed by: imp, jkim
Differential Revision: https://reviews.freebsd.org/D26435
Both RPI2 and BEAGLEBONE are still popular and widely used arm boards.
Both u-boots can coexist as they are named differently and live in the
fat partition.
This leaves us with only one image that can be used for both of those
boards, and for all the other ones supported by FreeBSD, provided that you
install the correct u-boot on it.
Reviewed by: imp
Relnotes: yes
Differential Revision: https://reviews.freebsd.org/D27283
All of those boards are impossible to buy nowadays and could boot using the
GENERICSD image after putting the correct u-boot on them.
Reviewed by: imp
Relnotes: yes
Differential Revision: https://reviews.freebsd.org/D27282
Remove the port for aml8726.
Kernel config was removed in r346096 and this port was never migrated
to GENERIC.
It is also impossible to obtain such hardware nowadays.
Reviewed by: imp
Relnotes: yes
Differential Revision: https://reviews.freebsd.org/D27281
Remove the port for rk30xx.
Kernel config was removed in r346096 and this port was never migrated
to GENERIC.
It is also impossible to obtain such hardware nowadays and this code
doesn't provide anything besides booting.
Reviewed by: imp
Relnotes: yes
Differential Revision: https://reviews.freebsd.org/D27280
tagname2tag() hashes the tag name before truncating it to 63 characters.
tag_unref() removes the tag from the name hash by computing the hash
over the truncated name. Ensure that both operations compute the same
hash for a given tag.
The larger issue is a lack of string validation in pf(4) ioctl handlers.
This is intended to be fixed with some future work, but an extra safety
belt in tagname2hashindex() is worthwhile regardless.
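The invariant is that insertion and removal must hash the same bytes. A
minimal sketch, assuming a 64-byte name buffer (63 characters plus the NUL)
and using FNV-1a as a stand-in hash; TAG_NAME_SIZE and tag_hash() are
hypothetical names, and pf(4) uses its own hash function.

```c
#include <stddef.h>
#include <stdint.h>

#define	TAG_NAME_SIZE	64	/* assumed buffer size, including the NUL */

/*
 * Hash at most TAG_NAME_SIZE - 1 characters of the name, so an
 * over-long name and its truncated copy land in the same bucket.
 */
static uint32_t
tag_hash(const char *name)
{
	uint32_t h = 2166136261u;	/* FNV-1a offset basis */
	size_t i;

	for (i = 0; i < TAG_NAME_SIZE - 1 && name[i] != '\0'; i++) {
		h ^= (unsigned char)name[i];
		h *= 16777619u;		/* FNV-1a prime */
	}
	return (h);
}
```

Bounding the loop inside the hash function means callers cannot reintroduce
the mismatch by forgetting to truncate first.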
Reported by: syzbot+a0988828aafb00de7d68@syzkaller.appspotmail.com
Reviewed by: kp
MFC after: 1 week
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D27346
It was broken by design and has been unused for years: different threads
conflicted while fighting for the same set of mailbox registers, which are
not designed for multiple requests at a time. So a request either has to be
synchronous and spin under the lock, or it should be sent asynchronously
through the queues as a Mailbox Command IOCB or in some other way.
This removes any OS specifics from the wait code, so it can be inlined.
The ratelimit tags may be shared, especially for unlimited TLS
traffic, and then the refcount is allowed to be greater than one
when freeing the send tag.
MFC after: 1 week
Sponsored by: Mellanox Technologies // NVIDIA Networking
Before this change, on request queue overflow the driver just froze the
device queue for 100ms and then retried, which was pretty bad for performance.
This change introduces SIM queue freezing when free space on the request
queue drops below 255 entries (the worst case for a maximum I/O size S/G
list), checking for a chance to release it on I/O completion. If the queue
still overflows somehow, the old mechanism remains in place, just with the
delay reduced to 10ms.
With the earlier queue length increase, overflows should not happen often,
but they are still easily reachable in synthetic tests.
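The freeze and release logic can be sketched with a simple counter. Only the
255-entry threshold comes from the description above; struct reqq,
reqq_submit() and reqq_complete() are hypothetical names.

```c
#include <stdbool.h>

#define	REQ_RESERVE	255	/* worst-case entries for one maximum-size I/O */

/* Hypothetical model of the request queue state. */
struct reqq {
	unsigned int	free_slots;
	bool		frozen;
};

/* Try to queue a request that needs n entries. */
static bool
reqq_submit(struct reqq *q, unsigned int n)
{
	if (q->frozen || n > q->free_slots)
		return (false);
	q->free_slots -= n;
	/* Freeze early, before a worst-case request could overflow. */
	if (q->free_slots < REQ_RESERVE)
		q->frozen = true;
	return (true);
}

/* An I/O using n entries has completed; release the freeze if possible. */
static void
reqq_complete(struct reqq *q, unsigned int n)
{
	q->free_slots += n;
	if (q->frozen && q->free_slots >= REQ_RESERVE)
		q->frozen = false;
}
```

Freezing on the threshold rather than on actual overflow means the queue is
released as soon as completions free enough space, with no fixed delay.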
Adding to the zombie list can be performed by idle threads, which on ppc64
leads to panics as it requires a sleepable lock.
Reported by: alfredo
Reviewed by: kib, markj
Fixes: r367842 ("thread: numa-aware zombie reaping")
Differential Revision: https://reviews.freebsd.org/D27288
Exec and exit are the same as the corresponding eventhandler hooks.
The thread exit hook is called somewhat earlier, while the thread is still
owned by the process and enough context is available. Note that the
process lock is owned when the hook is called.
Reviewed by: markj
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D27309
The acl_from_stat function accepts a stat_t * argument, but only uses its
st_mode field. There is no reason to pass the whole struct, so make it accept
a mode_t and rename the function to acl_from_mode.
Linux has a non-standard acl_from_mode function in its libacl, so naming the
function this way may help with discovering it during porting efforts.
Reviewed by: tsoome, markj
Approved by: markj
Differential Revision: https://reviews.freebsd.org/D27292