The "findstack", "show all trace", and "show active trace" commands
were iterating over allproc to enumerate threads. This missed threads
executing in exit1() after being removed from allproc.
Reviewed by: kib
Sponsored by: Netflix
Differential Revision: https://reviews.freebsd.org/D27829
Not all debugger operations that enumerate threads require thread
stacks to be resident in memory to be useful. Instead, push P_INMEM
checks (if needed) into callers.
Reviewed by: kib
Sponsored by: Netflix
Differential Revision: https://reviews.freebsd.org/D27827
Processes part way through exit1() are not included in allproc. Using
allproc to enumerate processes prevented getting the stack trace of a
thread in this part of exit1() via ddb.
Reviewed by: kib
Sponsored by: Netflix
Differential Revision: https://reviews.freebsd.org/D27826
This is used by kernel debuggers to enumerate processes via the pid
hash table.
Reviewed by: kib
Sponsored by: Netflix
Differential Revision: https://reviews.freebsd.org/D27825
Exiting processes that have been removed from allproc but are still
executing are not yet marked PRS_ZOMBIE, so they were not listed (for
example, if a thread panics during exit1()). To detect these
processes, clear p_list.le_prev to NULL explicitly after removing a
process from the allproc list and check for this sentinel rather than
PRS_ZOMBIE when walking the pidhash.
While here, simplify the pidhash walk to use a single outer loop.
Reviewed by: kib
Sponsored by: Netflix
Differential Revision: https://reviews.freebsd.org/D27824
Keep explicit track of the allprison_lock state during the final part
of kern_jail_set, instead of deducing it from the JAIL_ATTACH flag.
Also properly clean up when the attachment fails, fixing a long-
standing (though minor) memory leak.
Use atomic testandset and testandclear to catch concurrent double free,
and to reduce the number of atomic operations.
Submitted by: jeff
Reviewed by: cem, kib, markj (all previous version)
Sponsored by: Dell EMC Isilon
Differential Revision: https://reviews.freebsd.org/D22703
That is, provide wrappers around the atomic_testandclear and
atomic_testandset primitives.
Submitted by: jeff
Reviewed by: cem, kib, markj
Sponsored by: Dell EMC Isilon
Differential Revision: https://reviews.freebsd.org/D22702
After the removal of obsolete GDB 6.1.1 from the base system in
1c0ea326aa we no longer need to downgrade to DWARF2 debug info.
We will need to ensure that our tools (e.g. ctfconvert) handle DWARF5
prior to it becoming the default in the Clang and GCC versions we use.
Reported by: jhb
Relnotes: Yes
Sponsored by: The FreeBSD Foundation
An 8MB max stack size is quite limiting in today's world, and in-fact is
the *default* stack size for almost every other arch (including mips).
Raise the default to 4MB (should be pretty reasonable) and the max to 64MB.
NetBSD made a similar move back in 2015 and raised MAXDSIZ to 1856 at the
same time, so let's just roll that in as well. They later lowered it, but
eventually raised it back to 1856 in order to build rust.
This was noticed while looking at qemu-bsd-user's default stack sizes and
growth behavior (or lack thereof).
Reviewed by: ian
Differential Revision: https://reviews.freebsd.org/D27218
When a break-to-debugger is triggered, kdb will grab the console and vt(4)
will generally switch back to ttyv0. If one issues a continue from the
debugger, then kdb will ungrab the console and the system rolls on.
This change adds a perhaps minor feature: when we're down to grab == 0 and
if vt actually switched away to ttyv0, switch back to the tty it was
previously on before the console was grabbed.
The justification behind this is that a typical flow is to work in
!ttyv0 to avoid console spam while occasionally dropping to ddb to inspect
system state before returning. This could easily enough be tossed behind
a sysctl or something if it's not generally appreciated, but I anticipate
indifference.
Reviewed by: ray
Differential Revision: https://reviews.freebsd.org/D27110
vt_allocate_keyboard only needs to unwind the effects of keyboard-grabbing,
rather than any associated vt window action that may have also happened.
Split out the bits that do the keyboard work into *_noswitch equivalents,
and use those in keyboard allocation. This will be less error-prone when a
later change will offer up different window state behavior when the console
is ungrabbed.
Reviewed by: ray
Differential Revision: https://reviews.freebsd.org/D27110
FUSE_LSEEK reports holes on fuse file systems, and is used for example
by bsdtar.
MFC after: 2 weeks
Relnotes: yes
Reviewed by: cem
Differential Revision: https://reviews.freebsd.org/D27804
The firmware header loaded into an rsu(4) device has to be customized
to reflect device settings. The driver was overwriting the header
from the shared firmware image before sending it to the device. If
two devices attached at the same time with different settings, one
device could potentially get a corrupted header. The recent changes
in a095390344 exposed this bug in the
form of a panic as the firmware blobs are now marked read-only in
object files and mapped read-only by the kernel.
To avoid the bug, change the driver to allocate a copy of the firmware
header on the stack that is initialized before writing it to the
device.
PR: 252163
Reported by: vidwer+fbsdbugs@gmail.com
Tested by: vidwer+fbsdbugs@gmail.com
Reviewed by: hselasky, bz, emaste
Sponsored by: DARPA
Differential Revision: https://reviews.freebsd.org/D27850
Enable build of IPMI over OPAL on powerpc64le
Reviewed by: bdragon
Sponsored by: Eldorado Research Institute (eldorado.org.br)
Differential Revision: https://reviews.freebsd.org/D27443
It's possible for a context switch, and CPU migration, to occur between
fetching the PCPU context and extracting the pc_curpcb. This can cause
the fault handler to be installed for the wrong thread, leading to a
panic in copyin()/copyout(). Since curthread is already in %r13, just
use that directly, as GPRs are migrated, so there is no migration race
risk.
Currently copyinstr() uses fubyte() to read each byte from userspace.
However, this means that for each byte, it calls pmap_map_user_ptr() to
map the string into memory. This is needlessly wasteful, since the
string will rarely ever cross a segment boundary. Instead, map a
segment at a time, and copy as much from that segment as possible at a
time.
Measured with the HPT pmap on powerpc64, this saves roughly 8% time on
buildkernel, and 5% on buildworld, in wallclock time.
- fix values returned by 'sysctls dev.opal_sensor.*.sensor'
- fix missing 'dev.opal_sensor.*.sensor_[max|min]' on sysctl
Reviewed-by: jhibbits
Sponsored-by: Eldorado Research Institute (eldorado.org.br)
Differential-Revision: https://reviews.freebsd.org/D27365
The handshake timer can race with another thread sending a FIN or RST
to close a TOE TLS socket. Just bail from the timer without
rescheduling if the connection is closed when the timer fires.
Reported by: Sony Arpita Das @ Chelsio QA
Reviewed by: np
Differential Revision: https://reviews.freebsd.org/D27583
This tool can generate kernel images without changing the offsets in
the final executable. It replaces the ELF header by properly sized zeroed
block then emits a relative jump to _start(for 'v7jump' or 'v8jump' option)
or the booti header (for 'v8booti' option) to the beginning of the converted file.
Submited by: ian
Xenstore watches received are queued in a list and processed in a
deferred thread. Such queuing was done without any checking, so a
guest could potentially trigger a resource starvation against the
FreeBSD kernel if such kernel is watching any user-controlled xenstore
path.
Allowing limiting the amount of pending events a watch can accumulate
to prevent a remote guest from triggering this resource starvation
issue.
For the PV device backends and frontends this limitation is only
applied to the other end /state node, which is limited to 1 pending
event, the rest of the watched paths can still have unlimited pending
watches because they are either local or controlled by a privileged
domain.
The xenstore user-space device gets special treatment as it's not
possible for the kernel to know whether the paths being watched by
user-space processes are controlled by a guest domain. For this reason
watches set by the xenstore user-space device are limited to 1000
pending events. Note this can be modified using the
max_pending_watch_events sysctl of the device.
This is XSA-349.
Sponsored by: Citrix Systems R&D
MFC after: 3 days
The issue was found while building cxgbe with gcc 10 (in illumos),
the array subscription check is warning us about outside the bounds
access.
See also: https://gcc.gnu.org/onlinedocs/gcc/Zero-Length.html
Each entry actually stores a native pointer, not a uint64_t quantity. While
we're here, go ahead and export the pointer as-is rather than converting it
to KVA. This may be more useful as consumers can map /dev/mem and observe
the entry.
For reference, see: sys/contrib/edk2/Include/Uefi/UefiSpec.h
Reviewed by: kib
MFC after: 1 week
Differential Revision: https://reviews.freebsd.org/D27669
Specifically implement the if_requestencap callback function for infiniband.
Most of the changes are simply a cut and paste of the equivalent ethernet part.
Reviewed by: melifaro @
Differential Revision: https://reviews.freebsd.org/D27631
MFC after: 1 week
Sponsored by: Mellanox Technologies // NVIDIA Networking
Need to update both link layer address and broadcast address when active link changes for IP over infiniband.
This is because the broadcast address contains the so-called P-key, which is interface dependent.
Reviewed by: kib @
Differential Revision: https://reviews.freebsd.org/D27658
MFC after: 1 week
Sponsored by: Mellanox Technologies // NVIDIA Networking
This fixes a failed assertion in scenario where the provider
disappears, disk_gone() gets called, and at the exact same
time something else closes the device node triggering a retaste.
Reviewed By: mav
Sponsored by: NetApp, Inc.
Sponsored by: Klara, Inc.
Differential Revision: https://reviews.freebsd.org/D27330
Before r332974 the old code would sometimes cause a rare lock order
reversal against pagequeue, which looked roughly like this:
witness_checkorder()
__mtx_lock-flags()
vm_page_alloc()
uma_small_alloc()
keg_alloc_slab()
keg_fetch-slab()
zone_fetch-slab()
zone_import()
zone_alloc_bucket()
uma_zalloc_arg()
bucket_alloc()
uma_zfree_arg()
free()
devfs_metoo()
devfs_populate_loop()
devfs_populate()
devfs_rioctl()
VOP_IOCTL_APV()
VOP_IOCTL()
vn_ioctl()
fo_ioctl()
kern_ioctl()
sys_ioctl()
Since r332974 the original problem no longer exists, but it still
makes sense to move things out of the - often congested - lock.
Reviewed By: kib, markj
Sponsored by: NetApp, Inc.
Sponsored by: Klara, Inc.
Differential Revision: https://reviews.freebsd.org/D27334
-mno-align-long-strings was a flag maintained by FreeBSD for the
now-deleted in-tree gcc. Upstream gcc has no such flag, so just drop
it.
The flag was originally submitted by bde and committed in 2002 (svn
r97911 & r104455). However, upstream gcc did address this same issue in
2004 (gcc svn r76694 / git 4137ba7ab7a), reducing long string alignment
in general, and to 1 with -Os.
Reviewed by: kib
Sponsored by: Dell EMC Isilon
Differential Revision: https://reviews.freebsd.org/D27768
The original fusefs GSoC project seems to have envisioned exchanging two
types of messages with FUSE servers. Perhaps vectored and non-vectored?
But in practice only one type has ever been used. Delete the other type.
Reviewed by: cem
Differential Revision: https://reviews.freebsd.org/D27770
Commit 41fb066511 doubled the number of glyph maps in the vfnt format
from 2 to 4 to support double-width characters, but a comment describing
the maps was not updated to match.
MFC after: 1 week
Sponsored by: The FreeBSD Foundation
Both FreeBSD and Linux mkdir -p walk the tree up ignoring any EEXIST on
the way and both are used a lot when building respective kernels.
This poses a problem as spurious locking avoidably interferes with
concurrent operations like getdirentries on affected directories.
Work around the problem by adding FAILIFEXISTS flag. In case of lockless
lookup this manages to avoid any work to begin with, there is no speed
up for the locked case but perhaps this can be augmented later on.
For simplicity the only supported semantics are as used by mkdir.
Reviewed by: kib (previous version)
Differential Revision: https://reviews.freebsd.org/D27789
lua: avoid gcc -Wreturn-local-addr bug
Avoid a bug with gcc's -Wreturn-local-addr warning with some
obfuscation. In buggy versions of gcc, if a return value is an
expression that involves the address of a local variable, and even if
that address is legally converted to a non-pointer type, a warning may
be emitted and the value of the address may be replaced with zero.
Howerver, buggy versions don't emit the warning or replace the value
when simply returning a local variable of non-pointer type.
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=90737
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Ryan Libby <rlibby@FreeBSD.org>
Closes#11337
spa: avoid type narrowing warning
Building the spa module for i386 caused gcc to emit
-Wint-to-pointer-cast "cast to pointer from integer of different size"
because spa.spa_did was uint64_t but pthread_join (via thread_join in
spa_deactivate) takes a pointer (32-bit on i386). Define spa_did to be
pointer-size instead. For now spa_did is in fact never non-zero and the
thread_join could instead be ifdef'd out, but changing the size of
spa_did may be more useful for the future.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Ryan Libby <rlibby@FreeBSD.org>
Closes#11336
FreeBSD libzfs: gcc requires __thread after static
Building libzfs with gcc on FreeBSD failed because gcc is picky about
the order of keywords in declarations with __thread, whereas clang is
more relaxed.
https://gcc.gnu.org/onlinedocs/gcc/Thread-Local.html
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Ryan Moeller <ryan@iXsystems.com>
Signed-off-by: Ryan Libby <rlibby@FreeBSD.org>
Closes#11331
Fix compiling on FreeBSD + gcc - don't assume illmnos bits
This looks like it was once from the illumnos compat code.
FreeBSD doesn't have cmn_err as a compiler format attribute, so
it definitely errors out.
It doesn't show up on LLVM because it doesn't trigger at all.
Add in the format flags but keep them behind #if 0 for now;
there are too many format issues that trigger when one does
format checking in the shared code.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Ryan Moeller <ryan@iXsystems.com>
Signed-off-by: adrian chadd <adrian@freebsd.org>
Closes#11068Closes#11069
Fix pointer-is-uint64_t-sized assumption in the ioctl path
This shows up when compiling freebsd-head on amd64 using gcc-6.4.
The lib32 compat build ends up tripping over this assumption.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Ryan Moeller <ryan@iXsystems.com>
Signed-off-by: adrian chadd <adrian@freebsd.org>
Closes#11068Closes#11069
- Suppress -Wredundant-decls. Ultimately this warning is harmless in
any case, and it does not look like there is a simple way to avoid
redundant declarations in this case without a lot of header pollution
(e.g. having openzfs's shim param.h pulling in sys/kernel.h for hz).
- Suppress -Wnested-externs, which is useless anyway.
Unfortunately it was not sufficient just to modify OPENZFS_CFLAGS,
because the warning suppressions need to appear on the command line
after they are explicitly enabled by CWARNFLAGS from sys/conf/kern.mk,
but OPENZFS_CFLAGS get added before due to use of -I for the shims.
Reviewed by: markj
Sponsored by: Dell EMC Isilon
Differential Revision: https://reviews.freebsd.org/D27685
This was missed in r340856 / commit
6d2e2df764. Three bytes from the kernel
stack may be leaked when reading directory entries.
Reported by: Syed Faraz Abrar <faraz@elttam.com>
MFC after: 3 days
Sponsored by: The FreeBSD Foundation
In vm_page_busy_acquire(), load the object pointer using
atomic_load_ptr() as we do elsewhere. Per the comment, the object
identity must be consistent across sleeps.
In vm_page_grab_sleep(), pass the correct pindex to
_vm_page_busy_sleep(). The pindex is used to re-check the page's
identity before going to sleep. In particular, vm_page_grab_sleep() is
used in unlocked grab, so the object lock is not necessarily held when
verifying the page's identity, and the pindex may change if the page is
moved, or freed and re-allocated. I believe this can result in spurious
VM_PAGER_FAILs from vm_page_grab_valid_unlocked() or early termination
of vm_page_grab_pages_unlocked().
In vm_page_grab_pages(), pass the correct pindex to
vm_page_grab_sleep(). Otherwise I believe vm_page_grab_pages() will
effectively spin when attempting to busy a busy page after the first
index in the range.
Reviewed by: alc, kib
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D27607
Account for any residual bytes. This is only relevant for vnode-backed
md(4) devices.
Reviewed by: kib
MFC after: 2 weeks
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D27738
The previous patch failed to set the ISDOTDOT flag when appropriate,
which in turn fail to properly handle degenerate lookups.
While here sprinkle some extra assertions.
Tested by: pho (previous version)
dvl reported that "make installkernel" failed with "amd64/arm64/i386
kernel requires linker ifunc support." This test should apply to builds
only; the linker is not used at install time.
I think the same (ifunc-supporting) linker used to build the kernel
should be detected at install time in usual cases (and so not trigger
this error). However, there is no reason to disallow the install, if
for some reason the expected linker isn't the one tested at install
time.
PR: 251580
Reported by: dvl
MFC after: 2 weeks
Sponsored by: The FreeBSD Foundation
The divider table already contains the correct HW divider value, it should
not be modified by other flags such as 'CLK_DIV_ZERO_BASED'.
MFC after: 4 weeks
These handlers could interrupt code which has interrupts disabled,
and if a spurious page fault occurs during exception handler run,
we get clobbered %cr2 in higher level stack.
This is mostly a speculation, but it is based on hints from good sources.
MFC after: 1 week
Reviewed by: markj
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D27772
eventfd is a Linux system call that produces special file descriptors
for event notification. When porting Linux software, it is currently
usually emulated by epoll-shim on top of kqueues. Unfortunately, kqueues
are not passable between processes. And, as noted by the author of
epoll-shim, even if they were, the library state would also have to be
passed somehow. This came up when debugging strange HW video decode
failures in Firefox. A native implementation would avoid these problems
and help with porting Linux software.
Since we now already have an eventfd implementation in the kernel (for
the Linuxulator), it's pretty easy to expose it natively, which is what
this patch does.
Submitted by: greg@unrelenting.technology
Reviewed by: markj (previous version)
MFC after: 2 weeks
Differential Revision: https://reviews.freebsd.org/D26668
allprison_lock should be at least held shared when jail OSD methods
are called. Add a shared lock around one such call where that wasn't
the case.
In another such call, change an exclusive lock grab to be shared in
what is likely the more common case.
Return a boolean (i.e. 0 or 1) from prison_allow, instead of the flag
value itself, which is what sysctl expects.
Add prison_set_allow(), which can set or clear a permission bit, and
propagates cleared bits down to child jails.
Use prison_allow() and prison_set_allow() in the various jail.allow.*
sysctls, and others that depend on thoe permissions.
Add locking around checking both pr_allow and pr_enforce_statfs in
prison_priv_check().
We initialize sfio->npages only when some I/O is required to satisfy the
request. However, sendfile_iodone() contains an INVARIANTS-only check
that references sfio->npages, and this check is executed even if no I/O
is performed, so the check may use an uninitialized value.
Fix the problem by initializing sfio->npages earlier. Note that
sendfile_swapin() always initializes the page array. In some rare cases
we need to trim the page array so ensure that sfio->npages gets updated
accordingly.
Reported by: syzkaller (with KASAN)
Reviewed by: kib
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D27726
Use atomic access and a memory barrier to ensure that the flag parameter
in pr_flag_allow is indeed set after the rest of the structure is valid.
Simplify adding flag bits with pr_allow_all, a dynamic version of
PR_ALLOW_ALL_STATIC.
Use the kernel physical base rather than the ttbr0 base when building
the kernel identity map. The latter is correct with current assumptions
but may not always be the case.
Sponsored by: Innovate UK
These drivers should have been removed along with tl(4) as part of
7c897ca91f and r347918 respectively
as these fromer made sure to only ever attach to the latter, e. g.:
<...>
static int
tlphy_probe(device_t dev)
{
if (!mii_dev_mac_match(dev, "tl"))
return (ENXIO);
<...>
When a jail is added using the default (system-chosen) JID, and
non-default-JID jails already exist, a loop through the allprison
list could restart and result in unnecessary O(n^2) behaviour.
There should never be more than two list passes required.
Also clean up inefficient (though still O(n)) allprison list traversal
when finding jails by ID, or when adding jails in the common case of
all default JIDs.
We have stopped using SVN, so the notes containing the old SVN revisions
are no longer populated, so fall back to purely counting the number of
commits (currently at about 255337).
Also turn the format more into what git-describe produces, with a name
first, then the number of commits and the hash last. Note that as we
don't tag anything on `main`, git describe will never produce something
useful there and finds the newest vendor tag that was merged in instead.
Sample output:
FreeBSD 13.0-CURRENT #6 main-c255126-gb81783dc98e6-dirty
FreeBSD 12.2-STABLE #0 stable/12-c243035-gd16dac42b641-dirty
MFC after: 3 weeks
Reviewed by: imp, glebius
Differential Revision: https://reviews.freebsd.org/D27751
Use recently-added combination of `fib[46]_lookup_rt()` which
returns rtentry & raw nexthop with `rt_get_inet[6]_plen()` which
returns address/prefix length of prefix inside `rt`.
Add `nhop_select_func()` wrapper around inlined `nhop_select()` to
allow callers external to the routing subsystem select the proper
nexthop from the multipath group without including internal headers.
New calls does not require reference counting objects and reduce
the amount of copied/processed rtentry data.
Differential Revision: https://reviews.freebsd.org/D27675
These functions get/set tty winsize respectively, and are trivial wrappers
around corresponding termio ioctls.
The functions are expected to be a part of POSIX.1 issue 8:
https://www.austingroupbugs.net/view.php?id=1151#c3856.
They are currently available in NetBSD and in musl libc.
PR: 251868
Submitted by: Soumendra Ganguly <soumendraganguly@gmail.com>
MFC after: 1 week
Differential Revision: https://reviews.freebsd.org/D27650
This change introduces framework that allows to dynamically
attach or detach longest prefix match (lpm) lookup algorithms
to speed up datapath route tables lookups.
Framework takes care of handling initial synchronisation,
route subscription, nhop/nhop groups reference and indexing,
dataplane attachments and fib instance algorithm setup/teardown.
Framework features automatic algorithm selection, allowing for
picking the best matching algorithm on-the-fly based on the
amount of routes in the routing table.
Currently framework code is guarded under FIB_ALGO config option.
An idea is to enable it by default in the next couple of weeks.
The following algorithms are provided by default:
IPv4:
* bsearch4 (lockless binary search in a special IP array), tailored for
small-fib (<16 routes)
* radix4_lockless (lockless immutable radix, re-created on every rtable change),
tailored for small-fib (<1000 routes)
* radix4 (base system radix backend)
* dpdk_lpm4 (DPDK DIR24-8-based lookups), lockless datastrucure, optimized
for large-fib (D27412)
IPv6:
* radix6_lockless (lockless immutable radix, re-created on every rtable change),
tailed for small-fib (<1000 routes)
* radix6 (base system radix backend)
* dpdk_lpm6 (DPDK DIR24-8-based lookups), lockless datastrucure, optimized
for large-fib (D27412)
Performance changes:
Micro benchmarks (I7-7660U, single-core lookups, 2048k dst, code in D27604):
IPv4:
8 routes:
radix4: ~20mpps
radix4_lockless: ~24.8mpps
bsearch4: ~69mpps
dpdk_lpm4: ~67 mpps
700k routes:
radix4_lockless: 3.3mpps
dpdk_lpm4: 46mpps
IPv6:
8 routes:
radix6_lockless: ~20mpps
dpdk_lpm6: ~70mpps
100k routes:
radix6_lockless: 13.9mpps
dpdk_lpm6: 57mpps
Forwarding benchmarks:
+ 10-15% IPv4 forwarding performance (small-fib, bsearch4)
+ 25% IPv4 forwarding performance (full-view, dpdk_lpm4)
+ 20% IPv6 forwarding performance (full-view, dpdk_lpm6)
Control:
Framwork adds the following runtime sysctls:
List algos
* net.route.algo.inet.algo_list: bsearch4, radix4_lockless, radix4
* net.route.algo.inet6.algo_list: radix6_lockless, radix6, dpdk_lpm6
Debug level (7=LOG_DEBUG, per-route)
net.route.algo.debug_level: 5
Algo selection (currently only for fib 0):
net.route.algo.inet.algo: bsearch4
net.route.algo.inet6.algo: radix6_lockless
Support for manually changing algos in non-default fib will be added
soon. Some sysctl names will be changed in the near future.
Differential Revision: https://reviews.freebsd.org/D27401
If LED state is set through evdev interface, than asynchronous nature
of USB transfer callback can lead to change of order of events echoed
back to userland as it causes LED events to be echoed with some lag.
Fix that with echoing of LED events synchronously in ioctl handler.
Reviewed by: hselasky
Obtained from: sysutils/iichid
MFC after: 1 month
Differential Revision: https://reviews.freebsd.org/D27750
Unlike AT keyboards, HID devices are able to send all pc105 key
states within a single report. Let evdev to transmit all key state
changes within a single report too.
Reviewed by: hselasky
Obtained from: sysutils/iichid
MFC after: 1 month
Differential Revision: https://reviews.freebsd.org/D27749
Add support for LAN found on Thinkpad USB-C and Thunderbolt Gen 2
docking stations.
Submitted by: ali.abdallah@suse.com
MFC after: 1 week
Sponsored by: Mellanox Technologies // NVIDIA Networking
Hardware timestamp reporting is disabled by default as it produces many
extra events which are not handled by consumers like libinput.
Add hw.usb.wmt.timestamps=1 tunable to loader.conf to enable it.
Obtained from: sysutils/iichid
In Hybrid mode, the number of contacts that can be reported in one
report is less than the maximum number of contacts that the device
supports. For example, a device that supports a maximum of 4
concurrent physical contacts, can set up its top-level collection to
deliver a maximum of two contacts in one report. If four contact
points are present, the device can break these up into two serial
reports that deliver two contacts each.
Obtained from: sysutils/iichid
When using NFS-over-TLS, an NFS client can optionally provide an X.509
certificate to the server during the TLS handshake. For some situations,
such as different NFS servers or different certificates being mapped
to different user credentials on the NFS server, there may be a need
for different mounts to provide different certificates.
This new mount option called "tlscertname" may be used to specify a
non-default certificate be provided. This alernate certificate will
be stored in /etc/rpc.tlsclntd in a file with a name based on what is
provided by this mount option.
Original if_dwc driver used m_defrag as an implementation shortcut but on
1000Mb networks it affects performance. Implement multi-descriptor support for
TX path.
Tested on RK3399-Firefly, patch adds ~15% of network throughput.
Reviewed By: manu
Differential Revision: https://reviews.freebsd.org/D27520
The existing values correspond to x86 exception vector numbers, but the
trap numbers used in the kernel do not match these 1-to-1. Prefer the
definitions from x86/trap.h, as they are what actually get passed to
kdb_trap(). This is of little consequence, as gdb_cpu_signal() only
reports the trap reason (signal number) to the gdb client.
This is limited to the subset of trap values for which kdb_trap() is
reachable.
Reviewed by: kib
Discussed with: jhb
MFC after: 1 week
Sponsored by: NetApp, Inc.
Sponsored by: Klara, Inc.
Differential Revision: https://reviews.freebsd.org/D27645
Add support for the remote 'G' packet. This is not widely used by gdb
when 'P' is supported, but is technically required by any remote gdb
stub implementation [1].
[1] https://sourceware.org/gdb/current/onlinedocs/gdb/Overview.html
Reviewed by: cem
MFC after: 1 week
Sponsored by: NetApp, Inc.
Sponsored by: Klara, Inc.
NetApp PR: 44
Differential Revision: https://reviews.freebsd.org/D27644
We support bulk reads of the register set, but not reading specific
registers via the 'p' packet. This is useful at least for the 'call'
command in gdb.
Reviewed by: cem
MFC after: 1 week
Sponsored by: NetApp, Inc.
Sponsored by: Klara, Inc.
NetApp PR: 44
Differential Revision: https://reviews.freebsd.org/D27644
Now that bhyve(8) supports UART, bvmconsole and bvmdebug are no longer needed.
This also removes the '-b' and '-g' flag from bhyve(8). These two flags were
marked deprecated in r368519.
Reviewed by: grehan, kevans
Approved by: kevans (mentor)
Differential Revision: https://reviews.freebsd.org/D27490
r368193 was suppsed to rename the MOF firmware image, but the
qat_c2xxxfw makefile defined the two images in the wrong order so the
MMP image was renamed instead.
MFC after: 3 days
Sponsored by: Rubicon Communications, LLC (Netgate)
When copying sockaddrs out to userspace, we pad them to a multiple of
the platform alignment (sizeof(long)). However, some sockaddr sizes,
such as struct sockaddr_dl, are not an integer multiple of the
alignment, so we may end up copying out uninitialized bytes.
Fix this by always bouncing through a pre-zeroed sockaddr_storage.
Reported by: KASAN
Reviewed by: melifaro
MFC after: 3 days
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D27729
g_handleattr_int() consumes the bio if the attribute matches, so when we
check bp->bio_cmd bp may have been freed.
Move GETATTR handling to a separate function to avoid the problem. We
do not need to set bio_completed for such bios, g_handleattr_int() will
handle it. Also remove the setting of bio_resid before the
devstat_end_transaction_bio() call. All of the md(4) bio handlers set
bio_resid already.
Reported by: KASAN
Reviewed by: kib
MFC after: 2 weeks
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D27724
We use a bitmap to track which cylinder groups have changed between
snapshot creation and filesystem suspension. The "legs" of the bitmap
are four bytes wide (see ACTIVESET()) so we must round up the allocation
size to a multiple of four bytes.
I believe this bug is harmless since UMA/kmem_* will both pad the
allocation and zero the full allocation. Note that malloc() does inline
zeroing when the allocation size is known at compile-time.
Reported by: pho (using KASAN)
Reviewed by: kib, mckusick
MFC after: 1 week
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D27731
Vectored aio will require each aiocb to be associated with multiple
bios, so we can't store a link to the latter from the former. But we
don't really need to. aio_biowakeup already knows the bio it's using,
and the other fields can be stored within the bio and/or buf itself.
Also, remove the unused kaiocb.backend2 field.
Reviewed By: kib
Differential Revision: https://reviews.freebsd.org/D27682
This improves cache behaviour by not writing to the same variable from
multiple cores simultaneously.
pf_state is only used in the kernel, so can be safely modified.
Reviewed by: Lutz Donnerhacke, philip
MFC after: 1 week
Sponsed by: Orange Business Services
Differential Revision: https://reviews.freebsd.org/D27661
The algorithm we use to update checksums only works correctly if the
updated data is aligned on 16-bit boundaries (relative to the start of
the packet).
Import the OpenBSD fix for this issue.
PR: 240416
Obtained from: OpenBSD
MFC after: 1 week
Reviewed by: tuexen (previous version)
Differential Revision: https://reviews.freebsd.org/D27696
And switch from int to bool while at it.
Reviewed by: melifaro@
Differential Revision: https://reviews.freebsd.org/D27725
MFC after: 1 week
Sponsored by: Mellanox Technologies // NVIDIA Networking
I suspect that virtualization techniques improved from the time when we
have to effectively disable TSC use in VM. For instance, it was reported
(complained) in https://github.com/JuliaLang/julia/issues/38877 that
FreeBSD is groundlessly slow on AWS with some loads.
Remove the check and start watching for complaints.
Reviewed by: emaste, grehan
Discussed with: cperciva
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D27629
The packet, if processed at this point, was already parsed to be UDP
directed to a vxlan port.
Connect-X 4+ does not provide easy method to infer which parser
processed the packet, so driver cannot set the flag without a lot of
efforts which are only to satisfy the formal requirements.
Reviewed by: bryanv, np
Sponsored by: Mellanox Technologies/NVidia Networking
Differential revision: https://reviews.freebsd.org/D27449
MFC after: 1 week
Otherwise libinput refuses to recoginize some Synaptics touchpads with
"kernel bug: device has min == max on ABS_X" message in Xorg.log.
PR: 251149
Reported-by: Jens Grassel <freebsd-ports@jan0sch.de>
Tested-by: Jens Grassel <freebsd-ports@jan0sch.de>
MFC-after: 2 weeks
Conducted tests showed that Embedded Controller is not mandatory for
WMI extensions to work.
Reported-by: yuripv
Reviewed-by: avg
MFC-after: 2 weeks
Differential-Revision: https://reviews.freebsd.org/D27653
The adr instruction allows for an address of +-1M from the instruction.
If we replace these with an adrp and an add instruction we can generate
an address +-4G. The adrp will get an address of the 4k page the label
is within, and the add uses the :lo12: prefix to add just the low bits
to this address.
This will allow us to move things around with fewer issues than if we
needed to keep them within the +-1MB range.
Sponsored by: Innovate UK
When tearing down a VNET, netgraph sends shutdown messages to all of the
nodes before detaching interfaces (SI_SUB_NETGRAPH comes before
SI_SUB_INIT_IF in teardown order). ng_ether nodes handle this by
destroying themselves without detaching from the parent ifnet. Then,
when ifnets go away they detach their ng_ether nodes again, triggering a
use-after-free.
Handle this by modifying ng_ether_shutdown() to detach from the ifnet.
If the shutdown was triggered by an ifnet being destroyed, we will clear
priv->ifp in the ng_ether detach callback, so priv->ifp may be NULL.
Also get rid of the printf in vnet_netgraph_uninit(). It can be
triggered trivially by ng_ether since ng_ether_shutdown() persists the
node unless NG_REALLY_DIE is set.
PR: 233622
Reviewed by: afedorov, kp, Lutz Donnerhacke
MFC after: 2 weeks
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D27662
In order to efficiently serve web traffic on a NUMA
machine, one must avoid as many NUMA domain crossings as
possible. With SO_REUSEPORT_LB, a number of workers can share a
listen socket. However, even if a worker sets affinity to a core
or set of cores on a NUMA domain, it will receive connections
associated with all NUMA domains in the system. This will lead to
cross-domain traffic when the server writes to the socket or
calls sendfile(), and memory is allocated on the server's local
NUMA node, but transmitted on the NUMA node associated with the
TCP connection. Similarly, when the server reads from the socket,
he will likely be reading memory allocated on the NUMA domain
associated with the TCP connection.
This change provides a new socket ioctl, TCP_REUSPORT_LB_NUMA. A
server can now tell the kernel to filter traffic so that only
incoming connections associated with the desired NUMA domain are
given to the server. (Of course, in the case where there are no
servers sharing the listen socket on some domain, then as a
fallback, traffic will be hashed as normal to all servers sharing
the listen socket regardless of domain). This allows a server to
deal only with traffic that is local to its NUMA domain, and
avoids cross-domain traffic in most cases.
This patch, and a corresponding small patch to nginx to use
TCP_REUSPORT_LB_NUMA allows us to serve 190Gb/s of kTLS encrypted
https media content from dual-socket Xeons with only 13% (as
measured by pcm.x) cross domain traffic on the memory controller.
Reviewed by: jhb, bz (earlier version), bcr (man page)
Tested by: gonzo
Sponsored by: Netfix
Differential Revision: https://reviews.freebsd.org/D21636
When ktls_bind_thread is 2, we pick a ktls worker thread that is
bound to the same domain as the TCP connection associated with
the socket. We use roughly the same code as netinet/tcp_hpts.c to
do this. This allows crypto to run on the same domain as the TCP
connection is associated with. Assuming TCP_REUSPORT_LB_NUMA
(D21636) is in place & in use, this ensures that the crypto source
and destination buffers are local to the same NUMA domain as we're
running crypto on.
This change (when TCP_REUSPORT_LB_NUMA, D21636, is used) reduces
cross-domain traffic from over 37% down to about 13% as measured
by pcm.x on a dual-socket Xeon using nginx and a Netflix workload.
Reviewed by: jhb
Sponsored by: Netflix
Differential Revision: https://reviews.freebsd.org/D21648
This partially lifts a restriction imposed by r191639 ("Prevent a superuser
inside a jail from modifying the dedicated root cpuset of that jail") that's
perhaps beneficial after r192895 ("Add hierarchical jails."). Jails still
cannot modify their own cpuset, but they can modify child jails' roots to
further restrict them or widen them back to the modifying jails' own mask.
As a side effect of this, the system root may once again widen the mask of
jails as long as they're still using a subset of the parent jails' mask.
This was previously prevented by the fact that cpuset_getroot of a root set
will return that root, rather than the root's parent -- cpuset_modify uses
cpuset_getroot since it was introduced in r327895, previously it was just
validating against set->cs_parent which allowed the system root to widen
jail masks.
Reviewed by: jamie
MFC after: 1 week
Differential Revision: https://reviews.freebsd.org/D27352
These Mini-Box LCDs are using Microchip components and sub-licensed product
IDs. Whilst here, update the constant names and descriptions for the products
to use the names listed on the manufacturer's website rather than vague ones.
The picoLCD 4x20 is named that on the manufacturer's website so prefer that
name, even though linux-usb.org lists it with the numbers reversed as one might
expect.
Reviewed by: hselasky
Differential Revision: https://reviews.freebsd.org/D27670
Also centralize and unify checks to enable ASLR stack gap in a new
helper exec_stackgap().
PR: 239873
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
Use traditional explicit leading zero format for hex numbers.
Align P2_ hex values.
Wrap long lines by splitting comments.
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
rtsock code was build around the assumption that each rtentry record
in the system radix tree is a ready-to-use sockaddr. This assumptions
turned out to be not quite true:
* masks have their length tweaked, so we have rtsock_fix_netmask() hack
* IPv6 addresses have their scope embedded, so we have another explicit
deembedding hack.
Change the code to decouple rtentry internals from rtsock code using
newly-created rtentry accessors. This will allow to eventually eliminate
both of the hacks and change rtentry dst/mask format.
Differential Revision: https://reviews.freebsd.org/D27451
This sysctl node can generate very verbose output, so don't trigger it
for sysctl -a or sysctl vm.pmap.
Reviewed by: markj, kib
Differential Revision: https://reviews.freebsd.org/D27504
These implementation IDs are defined in the SBI spec, so we should print
their name if detected.
Submitted by: Danjel Qyteza <danq1222@gmail.com>
Reviewed by: jhb, kp
Differential Revision: https://reviews.freebsd.org/D27660
Similar to the recent patch to arm's gdb stub in r368414, allow GDB to
update the contents of most general purpose registers.
Reviewed by: cem, jhb, markj
MFC after: 2 weeks
Sponsored by: NetApp, Inc.
Sponsored by: Klara, Inc.
NetApp PR: 44
Differential Revision: https://reviews.freebsd.org/D27642
The SRAT may contain multiple distinct entries that together describe a
contiguous region of physical memory. In this case we were not
coalescing the corresponding entries in the memory affinity table, which
led to fragmented phys_avail[] entries. Since r338431 the vm_phys_segs[]
entries derived from phys_avail[] will be coalesced, resulting in a
situation where vm_phys_segs[] entries do not have a covering
phys_avail[] entry. vm_page_startup() will not add such segments to the
physical memory allocator, leaving them unused.
Reported by: Don Morris <dgmorris@earthlink.net>
Reviewed by: kib, vangyzen
MFC after: 2 weeks
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D27620
This caused us to write to the low half of the feature word twice, once with
the high bits and once with the low bits. Common legacy device implementations
seem to be fairly lenient about being able to write to the feature bits
multiple times, but Arm's models use a stricter implementation that will ignore
the second write. This fixes using vtnet(4) on those models.
Reported by: Jean-Philippe Brucker <jean-philippe@linaro.org>
Pointy hat: jrtc27
instead of bailing out with EBUSY if there are any.
If driver module is unloaded, or just device is forcibly detached from
the driver, there is no way for driver to correctly unload otherwise.
Esp. if there are resources dedicated to the VFs which prevent turning
down other resources.
Reviewed by: jhb
Sponsored by: Mellanox Technologies / NVidia Networking
MFC after: 1 week
Differential revision: https://reviews.freebsd.org/D27615
The argument is a void * so there's no need to cast it to caddr_t.
Update documentation to match function decleration.
Reviewed by: freqlabs
Obtained from: CheriBSD
MFC after: 1 week
Sponsored by: DARPA
Differential Revision: https://reviews.freebsd.org/D27093
Similar to r366897, this uses the .incbin directive to pull in a
firmware file's contents into a .fwo file. The same scheme for
computing symbol names from the filename is used as before to maximize
compatiblity and not require rebuilding existing .fwo files for
NO_CLEAN builds. Using ld -o binary requires extra hacks in linkers
to either specify ABI options (e.g. soft- vs hard-float) or to ignore
ABI incompatiblities when linking certain objects (e.g. object files
with only data). Using the compiler driver avoids the need for these
hacks as the compiler driver is able to set all the appropriate ABI
options.
Reviewed by: imp, markj
Obtained from: CheriBSD
Sponsored by: DARPA
Differential Revision: https://reviews.freebsd.org/D27579
- Use [u]intptr_t casts to convert pointers to integers.
- Change IS_ERR* to return bool instead of long.
Reviewed by: manu
Obtained from: CheriBSD
Sponsored by: DARPA
Differential Revision: https://reviews.freebsd.org/D27577
Since we do not own the session lock, a parallel killjobc() might
reset s_leader to NULL after we checked it. Read s_leader only once
and ensure that compiler is not allowed to reload.
While there, make access to t_session somewhat more pretty by using
local variable.
PR: 251915
Submitted by: Jakub Piecuch <j.piecuch96@gmail.com>
MFC after: 1 week
This is just a minor optimization, but it's sensitive. This gives an improvement of 30-50 kpps.
Reviewed by: kp, markj, glebius, lutz_donnerhacke.de
Approved by: vmaffione (mentor)
Sponsored by: vstack.com
Differential Revision: https://reviews.freebsd.org/D27382
Add capability to SPIBUS to have child device with IRQ.
For example many ADC chip have a dedicated pin to signal "data ready"
and the host can just wait for a interrupt to go out and read the result.
It is the same code as in R282674 and R282702 for IICBUS by Michal Meloun
Submitted by: Oskar Holmund <oskar.holmlund@ohdata.se>
Differential Revision: https://reviews.freebsd.org/D27396
Remove the exynos SoC support, this haven't been updated in a while,
isn't present in GENERIC and nobody is motivated to resurect it.
Differential Revision: https://reviews.freebsd.org/D24444
We're looking for file content differences, so ask the question of git
more directly. This helps a lot, saving tens of thousands of fork()s,
when the builder and editor see different stat() results (e.g., UIDs),
as they might with containers.
Submitted by: Nathaniel Wesley Filardo <nwf20@cl.cam.ac.uk>
Reviewed by: bdrewery, emaste, imp
Obtained from: CheriBSD
MFC after: 3 days
Sponsored by: DARPA
Differential Revision: https://reviews.freebsd.org/D27646
refcount_acquire_if_not_zero returns true on saturation.
The case of 0 is handled by looping again, after which the originally
found pointer will no longer be there.
Noted by: kib
Current way, hardcoded value plus heuristic is not conform to the PCI(e)
specification and it fails on systems where MSI-X bar is not initialized by
BIOS/ACPI (many arm or arm64 systems for example).
Instead, use the standard PCI(e) capability for determining of
MSIX table bar address.
MFC after: 3 weeks
Differential Revision: https://reviews.freebsd.org/D27265
One of the disadvantages of our current busdma code is the fact that
we process the bounced buffer in a page-by-page manner. This means that
the short (subpage) buffer allocated across page boundaries is bounced
to 2 separate pages.
This suboptimal behavior is consistent across all platforms and can be
related to (probably unimplementable or incompatible with bouncing)
BUS_DMA_KEEP_PG_OFFSET flag.
Therefore, allocate one additional page to be fully comply with this
requirement.
Discused with: markj
PR: 251018
- Use a uintptr_t cast to get the virtual address of a pointer in
USB_P2U() instead of a ptrdiff_t.
- Add offsets to a char * pointer directly without roundtripping the
pointer through a ptrdiff_t in USB_ADD_BYTES().
Reviewed by: imp, hselasky
Obtained from: CheriBSD
Sponsored by: DARPA
Differential Revision: https://reviews.freebsd.org/D27581
The sense_ptr thing is quite broken. As near as I can tell, the
driver tries to copyout to a physical address rather than whatever
user address the sense buffer should be copied to. It is not
immediately obvious what user address the sense buffer should be
copied to.
Reviewed by: imp
Obtained from: CheriBSD
Sponsored by: DARPA
Differential Revision: https://reviews.freebsd.org/D27578