The break-before-make requirement poses a problem when promoting or
demoting mappings containing thread structures: a CPU may raise a
translation fault while accessing curthread, and data_abort() accesses
the thread again before pmap_fault() can translate the address and
return.
Normally this isn't a problem because we have a hack to ensure that
slabs used by the thread zone are always accessed via the direct map,
where promotions and demotions are rare. However, this hack doesn't
work properly with UMA_MD_SMALL_ALLOC disabled, as is the case with
KASAN configured (since our KASAN implementation does not shadow the
direct map and so tries to force the use of the kernel map wherever
possible).
Fix the problem by modifying data_abort() to handle translation faults
in the kernel map without dereferencing "td", i.e., curthread, and
without enabling interrupts. pmap_klookup() has special handling for
translation faults which makes it safe to call in this context. Then,
revert the aforementioned hack.
Reviewed by: kevans, alc, kib, andrew
MFC after: 1 month
Sponsored by: Juniper Networks, Inc.
Sponsored by: Klara, Inc.
Differential Revision: https://reviews.freebsd.org/D37231
They could potentially be abused to overwrite kernel memory, so
shouldn't be accessible when mac_ddb is loaded.
Reviewed by: mhorne
Fixes: bc4ea61d55cb ("ddb: tag core commands with DB_CMD_MEMSAFE")
Sponsored by: Juniper Networks, Inc.
Sponsored by: Klara, Inc.
Differential Revision: https://reviews.freebsd.org/D37105
Currently SO_REUSEPORT_LB silently does nothing when set by a jailed
process. It is trivial to support this option in VNET jails, but it's
also useful in traditional jails.
This patch enables LB groups in jails with the following semantics:
- all PCBs in a group must belong to the same jail,
- PCB lookup prefers jailed groups to non-jailed groups
This is a straightforward extension of the semantics used for individual
listening sockets. One pre-existing quirk of the lbgroup implementation
is that non-jailed lbgroups are searched before jailed listening
sockets; that is preserved with this change.
Discussed with: glebius
MFC after: 1 month
Sponsored by: Modirum MDPay
Sponsored by: Klara, Inc.
Differential Revision: https://reviews.freebsd.org/D37029
It is ok to use memcmp() in the kernel. No functional change intended.
Reviewed by: glebius, melifaro
MFC after: 1 week
Sponsored by: Modirum MDPay
Sponsored by: Klara, Inc.
Differential Revision: https://reviews.freebsd.org/D37028
If a memory allocation failure causes bind to fail, we should take the
inpcb back out of its LB group since it's not prepared to handle
connections.
Reviewed by: glebius
MFC after: 2 weeks
Sponsored by: Modirum MDPay
Sponsored by: Klara, Inc.
Differential Revision: https://reviews.freebsd.org/D37027
Some auditing of the code shows that "cred" is never non-NULL in these
functions, either because all callers pass a non-NULL reference or
because they unconditionally dereference "cred". So, let's simplify the
code a bit and remove NULL checks. No functional change intended.
Reviewed by: glebius
MFC after: 1 week
Sponsored by: Modirum MDPay
Sponsored by: Klara, Inc.
Differential Revision: https://reviews.freebsd.org/D37025
Refer to sockets rather than processes, since one can have multiple
sockets in a load-balancing group within the same process.
MFC after: 1 week
Sponsored by: Modirum MDPay
Sponsored by: Klara, Inc.
With the change to return EAI_ADDRFAMILY from getaddrinfo(), fetch
would print "Unknown resolver error" for that error. Add that error
and its string to libfetch's table, using an #ifdef just in case.
Correct error strings for EAI_NODATA (although it is currently unused)
and EAI_NONAME. Should maybe rework the code to use gai_strerror(3),
but that doesn't map directly, and the current strings are shortened.
Reviewed in https://reviews.freebsd.org/D37139 with related changes.
Reviewed by: bz
MFC after: 1 month
Rework getaddrinfo(3) to return different error values for unresolvable
names (same as before, EAI_NONAME) and those without a requested addr
(EAI_ADDRFAMILY) when using DNS. This is implemented via an added
error in the nsswitch layer, NS_ADDRFAMILY, which is used only by
getaddrinfo(). The error is passed through nsdispatch(3), but that
routine has no changes to handle this error. The error originates in
the getaddrinfo DNS layer called via nsdispatch(), and is processed
by the search layer that calls nsdispatch().
While here, add a little style to returns near those that were
modified.
Reviewed in https://reviews.freebsd.org/D37139 with related changes.
Reviewed by: bz
MFC after: 1 month
gai_strerror.c still has messages for EAI_ADDRFAMILY and EAI_NODATA,
but not the man page. Re-add to the man page, and update comments
in the source. Document the errors that are not in RFC 3493 or
POSIX.
Reviewed in https://reviews.freebsd.org/D37139 with related changes.
Reviewed by: bz, pauamma
MFC after: 1 month
EAI_ADDRFAMILY and EAI_NODATA are not in RFC 3493, but are available
and used in many other systems. It is desirable to have at least one
of them in order to distinguish between names that do not resolve and
those that do not have the requested address type. A change to
getaddrinfo() will use EAI_ADDRFAMILY. Both were "#if 0"; re-enable,
conditioned on __BSD_VISIBLE, and update comments. Also add comments
and __BSD_VISIBLE conditional for the last three EAI errors, which
are not in the RFC or POSIX. Note, all of these are available in
NetBSD and OpenBSD, and EAI_ADDRFAMILY and EAI_NODATA are available
in Linux (glibc).
Reviewed in https://reviews.freebsd.org/D37139 with related changes.
Reviewed by: bz
MFC after: 1 month
Previously we relied on the .s.o rule in share/mk/bsd.suffixes.mk to
tell make that linux_support.o is built from linux_support.s, even
though we do not use the .s.o rule to assemble it.
Reviewed by: sjg
MFC after: 1 week
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D35864
vcpuHint has been expanded to 16 bit on host side to enable
interruptions to be routed to more CPUs. Guest side should align with
the change.
This change has been tested with hosts with 8-bit and 16-bit vcpuHint,
on both platforms host side can get correct value.
This driver is for ESXi product which only supports x86/x64. They are
little-endian. So there is no need to consider big-endian system.
PR: 264840
Reviewed by: imp@, Zhenlei Huang
Allow pf (l2) to be used to redirect ethernet packets to a different
interface.
The intended use case is to send 802.1x challenges out to a side
interface, to enable AT&T links to function with pfSense as a gateway,
rather than the AT&T provided hardware.
Sponsored by: Rubicon Communications, LLC ("Netgate")
Differential Revision: https://reviews.freebsd.org/D37193
The root cause of the intermittent span test failures has been
identified as a race between sending the packet and starting the bpf
capture.
This is now resolved, so the test can be re-enabled.
PR: 260461
MFC after: 1 week
Sponsored by: Rubicon Communications, LLC ("Netgate")
The Sniffer class is often used by test tools such as pft_ping to verify
that packets actually get sent where they're expected.
It starts a background thread to capture packets, but this thread needs
some time to start, leading to intermittent test failures when the
capture doesn't start before the relevant packet is sent.
Add a semaphore to ensure the Sniffer constructor doesn't return until
the capture is actually running.
PR: 260461
MFC after: 1 week
Sponsored by: Rubicon Communications, LLC ("Netgate")
Fedora defines shell functions for some commands used by FreeBSD build
scripts. Unortunatelly it makes them behave incorrectly for our purposes.
For instance 'which which' returns something like:
which ()
{
( alias;
eval ${which_declare} ) | /usr/bin/which --tty-only --read-alias ...
}
instead of
/usr/bin/which
This patch unsets those functions to restore original/expected behavior
Reviewed by: emaste, imp
MFC after: 1 week
Differential Revision: https://reviews.freebsd.org/D36900
The ipmi watchdog pretimeout action can trigger unintentionally in
certain rare, complicated situations. What we have seen at Netflix
is that the BMC can sometimes be sent a continuous stream of
writes to port 0x80, and due to what is a bug or misconfiguration
in the BMC software, this results in the BMC running out of memory,
becoming very slow to respond to KCS requests, and eventually being
rebooted by its own internal watchdog. While that is going on in
the BMC, back in the host OS, a number of requests are pending in
the ipmi request queue, and the kcs_loop thread is working on
processing these requests. All of the KCS accesses to process
those requests are timing out and eventually failing because the
BMC is responding very slowly or not at all, and the kcs_loop thread
is holding the IPMI_IO_LOCK the whole time that is going on.
Meanwhile the watchdogd process in the host is trying to pat the
BMC watchdog, and this process is sleeping waiting to get the
IPMI_IO_LOCK. It's not entirely clear why the watchdogd process
is sleeping for this lock, because the intention is that a thread
holding the IPMI_IO_LOCK should not sleep and thus any thread
that wants the lock should just spin to wait for it. My best guess
is that the kcs_loop thread is spinning waiting for the BMC to
respond for so long that it is eventually preempted, and during
the brief interval when the kcs_loop thread is not running,
the watchdogd thread notices that the lock holder is not running
and sleeps. When the kcs_loop thread eventually finishes processing
one request, it drops the IPMI_IO_LOCK and then immediately takes the
lock again so it can process the next request in the queue.
Because the watchdogd thread is sleeping at this point, the kcs_loop
always wins the race to acquire the IPMI_IO_LOCK, thus starving
the watchdogd thread. The callout for the watchdog pretimeout
would be reset by the watchdogd thread after its request to the BMC
watchdog completes, but since that request never processed, the
pretimeout callout eventually fires, even though there is nothing
actually wrong with the host.
To prevent this saga from unfolding:
- when kcs_driver_request() is called in a context where it can sleep,
queue the request and let the worker thread process it rather than
trying to process in the original thread.
- add a new high-priority queue for driver requests, so that the
watchdog patting requests will be processed as quickly as possible
even if lots of application requests have already been queued.
With these two changes, the watchdog pretimeout action does not trigger
even if the BMC is completely out to lunch for long periods of time
(as long as the watchdogd check command does not also get stuck).
Sponsored by: Netflix
Reviewed by: imp
Differential Revision: https://reviews.freebsd.org/D36555
Rather than using a Scapy-based Python script only check if the state
still exists. Scapy tends to be slow to start, it appears because it
lists all interfaces and gets their (IPv6) addresses a couple of times
at startup. This can be sufficient for the ICMP state to time out and
the test to fail.
We now only check if the state exists or is removed as expected, which
makes things faster, and should mean the test is more robust on slower
machines (such as CI VMs).
Sponsored by: Rubicon Communications, LLC ("Netgate")
First, an sbuf_new() in device_get_path() shadows the sb
passed in by dev_wired_cache_add(), leaving its sb in an
unfinished state, leading to a failed KASSERT(). Fixing this
is as simple as removing the sbuf_new() from device_get_path()
Second, we cannot simply take a pointer to the sbuf memory and
store it in the device location cache, because that sbuf
is freed immediately after we add data to the cache, leading
to a use-after-free and eventually a double-free. Fixing this
requires allocating memory for the path.
After a discussion with jhb, we decided that one malloc was
better than two in dev_wired_cache_add, which is why it changed
so much.
Reviewed by: jhb
Sponsored by: Netflix
MFC after: 14 days
It answers a question that someone might have when faced with the source
tree for the first time, and improves discoverability of the platforms
page.
Reviewed by: emaste
MFC after: 1 week
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D37179
The layout of the source tree is now only described in README.md. Retain
the cross-reference to hier(7) in SEE ALSO; it is still useful to
readers.
Reviewed by: imp, emaste
MFC after: 1 month
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D37136
It poses a maintenance burden, since much of the information is
duplicated in the src tree's README.md file. Readers who are interested
enough in learning about the structure of the src tree can download it,
or browse the README online. Have hier(7) just point them there instead.
PR: 261349
Discussed with: freebsd-arch@, freebsd-doc@ lists
Reviewed by: imp, emaste
MFC after: 1 month
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D37135
Document it in sys/README.md instead. Describe the purpose of LINT
config files as well.
Reviewed by: imp
MFC after: 1 month
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D37134
Tweak the existing descriptions slightly. Add entries for several
directories which are not yet documented, but not exhaustively.
Reviewed by: imp (previous version), emaste
MFC after: 1 month
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D37133
Add this primarily to document the sys/ subdirectories of the source tree.
This is a straight copy from the contents of hier(7). Improvements will
follow in other changes.
Reviewed by: imp, emaste
MFC after: 1 month
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D37132
In non-Hyper-V systems during Hyper-V initialization, system
initialization was getting hung, as hyperv_identify(),
was returning successful irrespective of the type of the platform.
Reviewed by: andrew, whu
Fixes: 9729f076e4d
Sponsored by: Microsoft
Differential Revision: https://reviews.freebsd.org/D37219