Remove all the 'entry' and 'return' probes; they clutter up the source
and are redundant to FBT.
Reviewed By: dchagin
Sponsored By: EPSRC
Differential Revision: https://reviews.freebsd.org/D30040
DXR maintains compressed lookup structures with a trivial search
procedure. A two-stage trie is indexed by the more significant bits of
the search key (IPv4 address), while the remaining bits are used for
finding the next hop in a sorted array. The tradeoff between memory
footprint and search speed depends on the split between the trie and
the remaining binary search. The default of 20 bits of the key being
used for trie indexing yields good performance (see below) with
footprints of around 2.5 Bytes per prefix with current BGP snapshots.
Rebuilding lookup structures takes some time, which is compensated for by
batching several RIB change requests into a single FIB update, i.e. FIB
synchronization with the RIB may be delayed for a fraction of a second.
RIB to FIB synchronization, next-hop table housekeeping, and lockless
lookup capability is provided by the FIB_ALGO infrastructure.
DXR works well on modern CPUs with several MBytes of caches, especially
in VMs, where is outperforms other currently available IPv4 FIB
algorithms by a large margin.
Synthetic single-thread LPM throughput test method:
kldload test_lookup; kldload dpdk_lpm4; kldload fib_dxr
sysctl net.route.test.run_lps_rnd=N
sysctl net.route.test.run_lps_seq=N
where N is the number of randomly generated keys (IPv4 addresses) which
should be chosen so that each test iteration runs for several seconds.
Each reported score represents the best of three runs, in million
lookups per second (MLPS), for two bechmarks (RND & SEQ) with two FIBs:
host: single interface address, local subnet route + default route
BGP: snapshot from linx.routeviews.org, 887957 prefixes, 496 next hops
Bhyve VM on an Intel(R) Xeon(R) CPU E5-2670 0 @ 2.60 GHz:
inet.algo host, RND host, SEQ BGP, RND BGP, SEQ
bsearch4 40.6 20.2 N/A N/A
radix4 7.8 3.8 1.2 0.6
radix4_lockless 18.0 9.0 1.6 0.8
dpdk_lpm4 14.4 5.0 14.6 5.0
dxr 70.3 34.7 43.0 19.5
Intel(R) Core(TM) i5-5300U CPU @ 2.30 GHz:
inet.algo host, RND host, SEQ BGP, RND BGP, SEQ
bsearch4 47.0 23.1 N/A N/A
radix4 8.5 4.2 1.9 1.0
radix4_lockless 19.2 9.5 2.5 1.2
dpdk_lpm4 31.2 9.4 31.6 9.3
dxr 84.9 41.4 51.7 23.6
Intel(R) Core(TM) i7-4771 CPU @ 3.50 GHz:
inet.algo host, RND host, SEQ BGP, RND BGP, SEQ
bsearch4 59.5 29.4 N/A N/A
radix4 10.8 5.5 2.5 1.3
radix4_lockless 24.7 12.0 3.1 1.6
dpdk_lpm4 29.1 9.0 30.2 9.1
dxr 101.3 49.9 69.8 32.5
AMD Ryzen 7 3700X 8-Core Processor @ 3.60 GHz:
inet.algo host, RND host, SEQ BGP, RND BGP, SEQ
bsearch4 70.8 35.4 N/A N/A
radix4 14.4 7.2 2.8 1.4
radix4_lockless 30.2 15.1 3.7 1.8
dpdk_lpm4 29.9 9.0 30.0 8.9
dxr 163.3 81.5 99.5 44.4
AMD Ryzen 5 5600X 6-Core Processor @ 3.70 GHz:
inet.algo host, RND host, SEQ BGP, RND BGP, SEQ
bsearch4 93.6 46.7 N/A N/A
radix4 18.9 9.3 4.3 2.1
radix4_lockless 37.2 18.6 5.3 2.7
dpdk_lpm4 51.8 15.1 51.6 14.9
dxr 218.2 103.3 114.0 49.0
Reviewed by: melifaro
MFC after: 1 week
Differential Revision: https://reviews.freebsd.org/D29821
Add a LPS benchmark variant which introduces artificial dependencies
between successive lookups. While here, instead of writing the results
from the lookups to a huge array, add them to an accumulator, in a more
lightweight attempt at preventing the CPU's OOO machinery from
discarding the lookup results if they would be completely unused.
net.route.test.run_lps_rnd measures LPS throughput with independent
uniformly random keys
net.route.test.run_lps_seq measures LPS throughput with uniformly
random keys with artificial interdependencies
Reviewed by: melifaro
MFC after: 7 days
Differential Revision: https://reviews.freebsd.org/D30096
Document what __FreeBSD_version means a bit better by documenting the
sorts of events it should be bumped for. Also include a handy shorthand
for what it means. Add a some advice for how frequently to change this
as well.
Added a note about the approved way to parse this from the param.h file,
though that was not in the review. All in-tree users have been updated
to this method prior to this commit. Move and reword the comment that
was on the same line.
Suggestions by: greg@unrelenting, arch@
Reviewed by: rgrimes@ (earlier version).
Sponsored by: Netflix
Differential Revision: https://reviews.freebsd.org/D29850
The _ext event notification includes the address being added/removed and
that gives the driver an easy way to ignore non-IPv6 addresses. Remove
'tom' from the handler's name while here, it was moved out of t4_tom a
long time ago.
MFC after: 1 week
Sponsored by: Chelsio Communications
Add a new control message to move ethernet addresses to a given link
in ng_bridge(4). Send this message instead of doing the work directly.
This decouples the read-only activity from the modification under a
more strict writer lock.
Decoupling the work is a prerequisite for multithreaded operation.
Approved by: manpages (bcr), kp (earlier version)
MFC: 3 weeks
Differential Revision: https://reviews.freebsd.org/D28516
This fixes strace(1) erroneously reporting return values
as "Function not implemented", combined with reporting the binary
ABI as X32.
Very similar code in linux_ptrace_getregs() is left as it is - it's
probably wrong too, but I don't have a way to test it.
Sponsored By: EPSRC
Differential Revision: https://reviews.freebsd.org/D29927
When loading attributes from the cache, the NFS client is careful to
copy only the fields that it initialized. After fetching attributes
from the server, however, it would copy the entire vattr structure
initialized from the RPC response, so uninitialized stack bytes would
end up being copied to userspace. In particular, va_birthtime (v2 and
v3) and va_gen (v3) had this problem.
Use a common subroutine to copy fields provided by the NFS client, and
ensure that we provide a dummy va_gen for the v3 case.
Reviewed by: rmacklem
Reported by: KMSAN
MFC after: 1 week
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D30090
AIM (adaptive interrupt moderation) was part of BSD11 driver. Upon IFLIB
migration, AIM feature got lost. Re-introducing AIM back into IFLIB
based IXGBE driver.
One caveat is that in BSD11 driver, a queue comprises both Rx and Tx
ring. Starting from BSD12, Rx and Tx have their own queues and rings.
Also, IRQ is now only configured for Rx side. So, when AIM is
re-enabled, we should now consider only Rx stats for configuring EITR
register in contrast to BSD11 where Rx and Tx stats were considered to
manipulate EITR register.
Reviewed by: gallatin, markj
Sponsored by: NetApp, Inc.
MFC after: 2 weeks
Differential Revision: https://reviews.freebsd.org/D27344
Several protocol methods take a sockaddr as input. In some cases the
sockaddr lengths were not being validated, or were validated after some
out-of-bounds accesses could occur. Add requisite checking to various
protocol entry points, and convert some existing checks to assertions
where appropriate.
Reported by: syzkaller+KASAN
Reviewed by: tuexen, melifaro
MFC after: 2 weeks
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D29519
These should never get values large enough for sign to matter, but one
of them becoming negative could cause problems.
MFC after: 1 week
Reviewed by: markj
Differential Revision: https://reviews.freebsd.org/D29327
devvn_refthread() will initialize *devp only if it succeeds, so check for
success before comparing with fp->f_data. Other devvn_refthread()
callers are careful to do this.
Reported by: KMSAN
Reviewed by: kib
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D30068
Some filesystems, e.g., pseudofs and the NFSv3 client, do not provide
one.
Reviewed by: kib
Reported by: KMSAN
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D30091
Otherwise, if !smp_started is true, then smp_rendezvous_cpus_done() will
harmlessly perform an atomic RMW on an uninitialized variable.
Reported by: KMSAN
MFC after: 1 week
Sponsored by: The FreeBSD Foundation
User-supplied data might make this loop too time-consuming. Divide
directly, and handle both the possibility that we were woken up earlier,
and arithmetic overflows/underflows from the calculation.
Reported and tested by: pho (previous version)
Reviewed by: markj
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
Differential revision: https://reviews.freebsd.org/D30069
It writes the core of live stopped process to the file descriptor
provided as an argument.
Based on the initial version from https://reviews.freebsd.org/D29691,
submitted by Michał Górny <mgorny@gentoo.org>.
Reviewed by: markj
Tested by: pho
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
Differential revision: https://reviews.freebsd.org/D29955
This way threads in ptracestop can be discovered by debugger
Reviewed by: markj
Tested by: pho
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
Differential revision: https://reviews.freebsd.org/D29955
Set a new P2_PTRACEREQ flag around the request Wait for the target .
process P2_PTRACEREQ flag to clear before setting ours .
Otherwise, we rely on the moment that the process lock is not dropped
until the stopped target state is important. This is going to be no
longer true after some future change.
Reviewed by: markj
Tested by: pho
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
Differential revision: https://reviews.freebsd.org/D29955
It unsuspends single suspended thread, passed as the argument.
It is up to the caller to arrange the target thread to suspend later,
since the state of the process is not changed from stopped. In particular,
the unsuspended thread must not leave to userspace, since boundary code
is not prepared to this situation.
Reviewed by: markj
Tested by: pho
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
Differential revision: https://reviews.freebsd.org/D29955
The helper removes the thread from a sleep queue, assuming that it would
need to sleep. The sleepq_remove_nested() function is intended for quite
special case, where suspended thread from traced stopped process is
temporary unsuspended to do some work on behalf of the debugger in the
target context, and this work might require sleep.
Reviewed by: markj
Tested by: pho
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
Differential revision: https://reviews.freebsd.org/D29955
- SVC_ALL request dumping all map entries, including those marked as
non-dumpable
- SVC_NOCOMPRESS disallows compressing the dump regardless of the sysctl
policy
- SVC_PC_COREDUMP is provided for future use by userspace core dump
request
Reviewed by: markj
Tested by: pho
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
Differential revision: https://reviews.freebsd.org/D29955
From my understanding this could happen with iSCSI LUNs with
unusually long names. The bug would make CAM fail to retrieve
the full inquiry data. Instead of bumping the size of the local
variable, just use a macro.
Reviewed By: imp, mav
Sponsored by: NetApp, Inc.
Sponsored by: Klara, Inc.
X-NetApp-PR: #50
Differential Revision: https://reviews.freebsd.org/D29991
Add PCI IDs for Intel Apollo Lake Series HSUARTs:
# pciconf -ll
drv selector class rev hdr vendor device subven subdev
uart0@pci0:0:24:0: 118000 0b 00 8086 5abc 8086 7270
uart1@pci0:0:24:1: 118000 0b 00 8086 5abe 8086 7270
uart2@pci0:0:24:2: 118000 0b 00 8086 5ac0 8086 7270
uart3@pci0:0:24:3: 118000 0b 00 8086 5aee 8086 7270
NB (Intel Document Number 336256-004US):
1. The E3900 and A3900 Series Processors support four LPSS_UART ports,
while the N- and J- Series Processors support only LPSS_UART [2:1]
ports.
2. The LPSS_UART1 port is dedicated for discrete Global Navigation
Satellite System (GNSS). This port can be used for generic UART
functionality if GNSS is not used.
3. The LPSS_UART2 port is dedicated for host OS debug.
4. The LPSS_UART0 and LPSS_UART3 ports are for generic UART functionality.
5. Only UART [1:0] ports support DMA.
PR: 255556
Submitted by: Jose Luis Duran <jlduran@gmail.com>
MFC after: 1 week
Fix what appears to have been a small copy/paste typo in ifconfig(8)'s
documentation (man page and header file).
Not that it matters anymore.
Reference: Table I-2 in IEEE Std 802.1Q-2014.
PR: 255557
Submitted by: Jose Luis Duran <jlduran@gmail.com>
MFC after: 1 week
When estimating working set size, measure only allocation batches, not free
batches. Allocation and free patterns can be very different. For example,
ZFS on vm_lowmem event can free to UMA few gigabytes of memory in one call,
but it does not mean it will request the same amount back that fast too, in
fact it won't.
Update working set size on every reclamation call, shrinking caches faster
under pressure. Lack of this caused repeating vm_lowmem events squeezing
more and more memory out of real consumers only to make it stuck in UMA
caches. I saw ZFS drop ARC size in half before previous algorithm after
periodic WSS update decided to reclaim UMA caches.
Introduce voluntary reclamation of UMA caches not used for a long time. For
each zdom track longterm minimal cache size watermark, freeing some unused
items every UMA_TIMEOUT after first 15 minutes without cache misses. Freed
memory can get better use by other consumers. For example, ZFS won't grow
its ARC unless it see free memory, since it does not know it is not really
used. And even if memory is not really needed, periodic free during
inactivity periods should reduce its fragmentation.
Reviewed by: markj, jeff (previous version)
MFC after: 2 weeks
Sponsored by: iXsystems, Inc.
Differential Revision: https://reviews.freebsd.org/D29790
When processing INIT and INIT-ACK information, also during
COOKIE processing, delete the current association, when it
would end up in an inconsistent state.
MFC after: 3 days
PR#255523 reported that a file copy for a file with a large hole
to EOF on ZFS ran slowly over NFSv4.2.
The problem was that vn_generic_copy_file_range() would
loop around reading the hole's data and then see it is all
0s. It was coded this way since UFS always allocates a data
block near the end of the file, such that a hole to EOF never exists.
This patch modifies vn_generic_copy_file_range() to check for a
ENXIO returned from VOP_IOCTL(..FIOSEEKDATA..) and handle that
case as a hole to EOF. asomers@ confirms that it works for his
ZFS test case.
PR: 255523
Tested by: asomers
Reviewed by: asomers
MFC after: 2 weeks
Differential Revision: https://reviews.freebsd.org/D30076
Not all interrupt controllers enable IPIs by default as the Arm
GIC specs make it an implementation defined option. As at least two
hypervisors have also previously masked the IPIs on boot.
As we already enable these IPIs on the non-boot CPUs it is expected
this is a safe operation.
Differential Revision: https://reviews.freebsd.org/D26975
This will allow us to allocate an unmapped memory resource, then
later map it with a specific memory attribute.
This is also needed for virtio with the modern PCI attachment.
Reviewed by: kib (via D29723)
Sponsored by: Innovate UK
Differential Revision: https://reviews.freebsd.org/D29694
It is defined as a uint64_t in the UEFI spec. As it's not used as a
pointer by the kernel follow this and define it as the same in the
kernel.
Reviewed by: kib, manu, imp
Sponsored by: Innovate UK
Differential Revision: https://reviews.freebsd.org/D29759
On arm64 we currently use a non-posted write for device memory, however
we should move to use posted writes. This is expected to work on most
hardware, however we will need to support a non-posted option for some
broken hardware.
Reviewed by: imp, manu, bcr (manpage)
Differential Revision: https://reviews.freebsd.org/D29722
Summary:
Since PCPU can live in a GPR for a while longer, let it, rather than
re-getting it in yet another register. MFSPR is an expensive operation,
12 clock latency on POWER9, so the fewer operations we need, the better.
Since the check is tightly coupled to the fetch, by reducing the number
of fetch+check, we reduce the stalls, and improve the performance
marginally. Buildworld was measured at a ~5-7% improvement on a single
run.
Reviewed By: nwhitehorn
Differential Revision: https://reviews.freebsd.org/D30003
It has to be zeroed before committing it to device.
We do that by allocating it with M_ZERO, but there was no
memory barrier or cache flush to ensure its sees it zeroed.
This fixes MSIX on LS1028A SoC.
Submitted by: Kornel Duleba <mindal@semihalf.com>
Reviewed by: andrew
Obtained from: Semihalf
Sponsored by: Alstom Group
Differential Revision: https://reviews.freebsd.org/D30033
When sending an IPI, if a previous IPI is still pending delivery,
native_lapic_ipi_vectored() waits for the previous IPI to be sent.
We've seen a few inexplicable panics with the current timeout of 50 ms.
Increase the timeout to 1 second and make it tunable.
No hardware specification mentions a timeout in this case; I checked
the Intel SDM, Intel MP spec, and Intel x2APIC spec. Linux and illumos
wait forever. In Linux, see __default_send_IPI_shortcut() in
arch/x86/kernel/apic/ipi.c. In illumos, see apic_send_ipi() in
usr/src/uts/i86pc/io/pcplusmp/apic_common.c. However, misbehaving hardware
could hang the system if we wait forever.
Reviewed by: mav kib
MFC after: 1 week
Sponsored by: Dell EMC Isilon
Differential Revision: https://reviews.freebsd.org/D29942
Its use is for cases where some filler is needed for cmd, or we need an
indication that there were no cmd supplied, and so on.
Reviewed by: jhb
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
Differential revision: https://reviews.freebsd.org/D29935
Filter on fifos is real filter for the object, and not a filesystem
events filter like EVFILT_VNODE.
Reported by: markj using syzkaller
Reviewed by: markj
Sponsored by: The FreeBSD Foundation
MFC after: 3 days
A testing on the real hardware uncovered an issue, and since I do not have
access to the machine, disable until the bug can be fixed.
Reported by: "Pieper, Jeffrey E" <jeffrey.e.pieper@intel.com>
Sponsored by: The FreeBSD Foundation
MFC after: 3 days