Move the xenisrc structure which needs to be shared between the core Xen
interrupt code and architecture-dependent code into a separate header. A
similar situation exists for the NR_EVENT_CHANNELS constant.
Turn xi_intsrc into a type definition named xi_arch to reflect the new
purpose of being an architectural variable for the interrupt source.
This was originally implemented by Julien Grall, but has been heavily
modified. The core side was renamed "intr-internal.h" and is #include'd
by "arch-intr.h" instead of the other way around. This allows the
architecture to add function definitions which use struct xenisrc.
The original version only moved xi_intsrc into xen_arch_isrc_t. Moving
xi_vector was done by the submitter.
The submitter had also moved xi_activehi and xi_edgetrigger into
xen_arch_isrc_t. Those disappeared with the removal of PVHv1 support.
Copyright note. The current xenisrc structure was introduced at
76acc41fb7 by Justin T. Gibbs. Traces remain, but the strength of
Copyright claims from before 2013 seem pretty weak.
Reviewed by: royger
Submitted by: Elliott Mitchell <ehem+freebsd@m5p.com>, 2021-03-17 19:09:01
Original implementation: Julien Grall <julien@xen.org>, 2015-10-20 09:14:56
Differential Revision: https://reviews.freebsd.org/D30648
[royger]
- Adjust some line lengths
- Fix comment about NR_EVENT_CHANNELS after movement.
- Use #include instead of symlinks.
xen_intr_handle_upcall() has two interfaces. It needs to be called by
the x86 assembly code invoked by the APIC. Second, it needs to be called
as a driver_filter_t for the XenPCI code and for architectures besides
x86.
Unfortunately the driver_filter_t interface was implemented as a wrapper
around the x86-APIC interface. Now create a simple wrapper for the
x86-APIC code, which calls an architecture-independent
xen_intr_handle_upcall().
When called via intr_event_handle(), driver_filter_t functions expect
preemption to be disabled. This removes the need for
critical_enter()/critical_exit() when called this way.
The lapic_eoi() call is only needed on x86 in some cases when invoked
directly as an APIC vector handler.
Additionally driver_filter_t functions have no need to handle interrupt
counters. The intrcnt_add() calling function was reworked to match the
current situation. intrcnt_add() is now only called via one path.
The increment/decrement of curthread->td_intr_nesting_level had
previously been left out. Appears this was mostly harmless, but this
was noticed during implementation and has been added.
CONFIG_X86 is a leftover from use with Linux. While the barrier isn't
needed for FreeBSD on x86, it will be needed for FreeBSD on other
architectures.
Copyright note. xen_intr_intrcnt_add() was introduced at 76acc41fb7
by Justin T. Gibbs. xen_intrcnt_init() was introduced at fd036deac1
by John Baldwin.
sys/x86/xen/xen_arch_intr.c was originally created by Julien Grall in
2015 for the purpose of holding the x86 interrupt interface. Later it
was found xen_intr_handle_upcall() was better earlier, and the x86
interrupt interface better later. As such the filename and header list
belong to Julien Grall, but what those were created for is later.
Reviewed by: royger
Differential Revision: https://reviews.freebsd.org/D30006
Keeping released xenisrcs in a known state simplifies allocation, but
forces the allocation function to maintain that state. This turns into
a problem when trying to allow for interchangeable allocation functions.
Fix this issue by ensuring xenisrcs are always *fully* initialized
during binding.
Reviewed by: royger
There are actually several distinct locking domains in xen_intr.c, but
all were sharing the same lock. Both xen_intr_port_to_isrc[] and the
x86 interrupt structures needed protection. Split these two apart as a
precursor to splitting the architecture portions off the file.
Reviewed by: royger
Differential Revision: https://reviews.freebsd.org/D30726
Locking for allocation was being done in xen_intr_bind_isrc(), but the
unlock was inside xen_intr_alloc_isrc(). While the lock acquisition at
the end of xen_intr_alloc_isrc() was to modify xen_intr_port_to_isrc[],
NOT allocation. Fix this garbled (though working) locking scheme.
Now locking for allocation is strictly in xen_intr_alloc_isrc(), while
locking to modify xen_intr_port_to_isrc[] is in xen_intr_bind_isrc().
Reviewed by: royger
Differential Revision: https://reviews.freebsd.org/D30726
The call structure around xen_intr_alloc_isrc() was rather awful.
Notably finding a structure for reuse is part of allocation, but this
was done outside xen_intr_alloc_isrc(). Move this into
xen_intr_alloc_isrc() so the function handles all allocation steps.
Reviewed by: royger
Differential Revision: https://reviews.freebsd.org/D30726
As "CPUs", IRQs (vector) and virtual IRQs are always positive integers,
adjust the Xen code to use unsigned integers. Several format strings
need adjustment to match. Additionally single-bit bitfields are
boolean.
No functional change expected.
Reviewed by: royger
This overlaps the purpose of __XEN_INTERFACE_VERSION__. Remove Xen 3.0.2
compatibility. __XEN_INTERFACE_VERSION__ has compatibility to Xen 3.2.8
enabled. As Xen 3.3 was released almost 15 years ago, it seems unlikely
anyone hasn't updated.
Reviewed by: royger
HYPERVISOR_poll(), HYPERVISOR_block(), and HYPERVISOR_crash() appear no
longer used. Further get_system_time() appears to have disappeared at
some point in the past, so HYPERVISOR_poll() was broken anyway.
No functional change intended.
Reviewed by: royger
The present implementation is only for x86. Other architectures need
adjustments for querying presence of EFI.
Xen's EFI support is also quite troublesome on non-x86. This is being
slowly remedied, but until in better shape the EFI clock functionality
should be disabled.
Reviewed by: royger
Differential Revision: https://reviews.freebsd.org/D31065
Part of the series for allowing FreeBSD/ARM to run on Xen. On ARM the
function is a trivial pass-through, other architectures need distinct
implementations.
While implementing XEN_VCPUID() as a call to XEN_CPUID_TO_VCPUID()
works, that involves multiple accesses to the PCPU region. As such make
this a distinct macro. Only callers in machine independent code have
been switched.
Add a wrapper for the x86 PIC interface to use matching the old
prototype.
Partially inspired by the work of Julien Grall <julien@xen.org>,
2015-08-01 09:45:06, but XEN_VCPUID() was redone by Elliott Mitchell on
2022-06-13 12:51:57.
Reviewed by: royger
Submitted by: Elliott Mitchell <ehem+freebsd@m5p.com>
Original implementation: Julien Grall <julien@xen.org>, 2014-04-19 08:57:40
Original implementation: Julien Grall <julien@xen.org>, 2014-04-19 14:32:01
Differential Revision: https://reviews.freebsd.org/D29404
Previously the upper layer handle was being set before the last
potential error condition. The reasoning appears to have been it was
assumed invalid in case of an error being returned. Now ensure it is
invalid until just before a successful return.
Fixes: 76acc41fb7 ("Implement vector callback for PVHVM and unify event channel implementations")
Fixes: 6d54cab1fe ("xen: allow to register event channels without handlers")
Reviewed by: royger
n further non-LRO testing I found a case where rack is supposed to be waking up but
it is not now. In this special case it sets the flag rc_ack_can_sendout_data. When that is
set we should not prohibit output.
Reviewed by: tuexen
Sponsored by: Netflix Inc
Differential Revision:https://reviews.freebsd.org/D39565
The bridge treated no vlan tag as being equivalent to vlan ID 1, which
causes confusion if the bridge sees both untagged and vlan 1 tagged
traffic.
Use DOT1Q_VID_NULL when there's no tag, and fix up the lookup code by
using 'DOT1Q_VID_RSVD_IMPL' to mean 'any vlan', rather than vlan 0. Note
that we have to account for userspace expecting to use 0 as meaning 'any
vlan'.
PR: 270559
Suggested by: Zhenlei Huang <zlei@FreeBSD.org>
Reviewed by: philip, zlei
Differential Revision: https://reviews.freebsd.org/D39478
It is shorter and more readable.
No functional change intended.
Reviewed by: kp
Fixes: 2d3614fb13 bridge: Log MAC address port flapping
MFC after: 1 week
Differential Revision: https://reviews.freebsd.org/D39542
Introduce the OpenBSD syntax of "scrub" option for "match" and "pass"
rules and the "set reassemble" flag. The patch is backward-compatible,
pf.conf can be still written in FreeBSD-style.
Obtained from: OpenBSD
MFC after: never
Sponsored by: InnoGames GmbH
Differential Revision: https://reviews.freebsd.org/D38025
Add MODULE_PNP_INFO() to the driver to make it autoload if not linked
statically into the kernel. Remove the device from amd64/i386 GENERIC.
Reviewed by: imp
Differential Revision: https://reviews.freebsd.org/D35074
In going through all the TCP stacks I have found we have a few little bugs and niggles that need to
be cleaned up in tcp_subr.c including the following:
a) Set tcp_restoral_thresh to 450 (45%) not 550. This is a better proven value in my testing.
b) Lets track when we try to do pacing but fail via a counter for connections that do pace.
c) If a switch away from the default stack occurs and it fails we need to make sure the time
scale is in the right mode (just in case the other stack changed it but then failed).
d) Use the TP_RXTCUR() macro when starting the TT_REXMT timer.
e) When we end a default flow lets log that in BBlogs as well as cleanup any t_acktime (disable).
f) When we respond with a RST lets make sure to update the log_end_status properly.
g) When starting a new pcb lets assure that all LRO features are off.
h) When discarding a connection lets make sure that any t_in_pkt's that might be there are freed properly.
Reviewed by: tuexen
Sponsored by: Netflix Inc
Differential Revision:https://reviews.freebsd.org/D39501
Quoting from https://maskray.me/blog/2023-04-12-elf-hash-function:
The System V Application Binary Interface (generic ABI) specifies the
ELF object file format. When producing an output executable or shared
object needing a dynamic symbol table (.dynsym), a linker generates a
.hash section with type SHT_HASH to hold a symbol hash table. A DT_HASH
tag is produced to hold the address of .hash.
The function is supposed to return a value no larger than 0x0fffffff.
Unfortunately, there is a bug. When unsigned long consists of more than
32 bits, the return value may be larger than UINT32_MAX. For instance,
elf_hash((const unsigned char *)"\xff\x0f\x0f\x0f\x0f\x0f\x12") returns
0x100000002, which is clearly unintended, as the function should behave
the same way regardless of whether long represents a 32-bit integer or
a 64-bit integer.
Reviewed by: kib, Fangrui Song
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D39517
During compressed ack and mbuf queuing we determine if we need to wake up. A
new function was added that is optional to the tfb so that the stack itself can also
be asked if a wakeup should happen. This helps compensate for late hpts calls.
Reviewed by: tuexen
Sponsored by: Netflix Inc
Differential Revision:https://reviews.freebsd.org/D39502
This fixes a bug where stacks could not be deregistered when
end points in the non-default vnet are using it.
Reviewed by: glebius, zlei
MFC after: 1 week
Sponsored by: Netflix, Inc.
Differential Revision: https://reviews.freebsd.org/D39514
Use it when possible, instead of separated flags.
No functional change intended.
Reviewed by: hselasky, erj
MFC after: 3 days
Differential Revision: https://reviews.freebsd.org/D39466
Use it when possible, instead of separated flags.
No functional change intended.
Reviewed by: hselasky, erj
MFC after: 3 days
Differential Revision: https://reviews.freebsd.org/D39466
sysctl variables rx_budget and max_aggregation_size are read-only loader
tunable. Mark them with CTLFLAG_RD flag.
No functional change intended.
Reviewed by: hselasky, erj
MFC after: 3 days
Differential Revision: https://reviews.freebsd.org/D39466
Use it when possible, instead of separated flags.
No functional change intended.
Reviewed by: hselasky, erj
MFC after: 3 days
Differential Revision: https://reviews.freebsd.org/D39466
Use them when possible, instead of separated flags.
No functional change intended.
Reviewed by: hselasky, erj
MFC after: 3 days
Differential Revision: https://reviews.freebsd.org/D39466
Use them when possible, instead of separated flags.
No functional change intended.
Reviewed by: hselasky, erj
MFC after: 3 days
Differential Revision: https://reviews.freebsd.org/D39466
Avoid magic numbers when handling the IPv6 flow ID for
DSCP and ECN fields and use the named variable instead.
Reviewed By: tuexen, #transport
Sponsored by: NetApp, Inc.
Differential Revision: https://reviews.freebsd.org/D39503
which returns an indicator if the current thread owns the specified
lock.
Reviewed by: jah, markj
Tested by: pho
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
Differential revision: https://reviews.freebsd.org/D39477
This ensures that *mpp != NULL iff vn_finished_write() should be
called, regardless of the returned error, except for V_NOWAIT.
The only exception that must be maintained is the case where
vn_start_write(V_NOWAIT) is called with the intent of later dropping
other locks and then doing vn_start_write(V_XSLEEP), which needs the mp
value calculated from the non-waitable call above it.
Also note that V_XSLEEP is not supported by vn_start_secondary_write().
Reviewed by: markj, mjg (previous version), rmacklem (previous version)
Tested by: pho
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
Differential revision: https://reviews.freebsd.org/D39441
Falling back to the multicast key may cause unicast traffic to leak.
Instead fail when no key is found.
For more information see the 'Framing Frames: Bypassing Wi-Fi Encryption
by Manipulating Transmit Queues' paper.
[ I updated the commit message to reference the paper and the code
comment to record historic behaviour as discussed in private email. ]
Security: CVE-2022-47522
Rack is capable of fixed rate or dynamic rate pacing. Both of these can get mixed up when
LRO is not available. This is because LRO will hold off waking up the tcp connection to
processing the inbound packets until the pacing timer is up. Without LRO the pacing only
sort-of works. Sometimes we pace correctly, other times not so much.
This set of changes will make it so pacing works properly in the absence of LRO.
Reviewed by: tuexen
Sponsored by: Netflix Inc
Differential Revision:https://reviews.freebsd.org/D39494
This function is clearly a stub, but it seems better to leave the stub
bits in place than to remove the function entirely.
Differential Revision: https://reviews.freebsd.org/D39355
If all of the mirror's children have the same rotation rate, report
that. But if they have mixed rotation rates, or if any child has an
unknown rotation rate, report "Unknown".
MFC after: 2 weeks
Sponsored by: Axcient
Reviewed by: imp
Differential Revision: https://reviews.freebsd.org/D39458
This will be used by a forthcoming port of the kinst provider.
Reviewed by: markj
MFC after: 2 weeks
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D39481
if_bridge receives packets via a special interface, if_bridge_input,
rather than by if_input. Thus, netmap's usual hooking of ifnet routines
does not work as expected. Instead, modify bridge_input() to pass
packets directly to netmap when it is enabled. This applies to both
locally delivered packets and forwarded packets.
When a netmap application transmits a packet by writing it to the host
TX ring, the mbuf chain is passed to if_input, which ordinarily points
to ether_input(). However, when transmitting via if_bridge,
bridge_input() needs to see the packet again in order to decide whether
to deliver or forward. Thus, introduce a new protocol flag,
M_BRIDGE_INJECT, which 1) causes the packet to be passed to
bridge_input() again after Ethernet processing, and 2) avoids passing
the packet back to netmap. The source MAC address of the packet is used
to determine the original "receiving" interface.
Reviewed by: vmaffione
MFC after: 2 months
Sponsored by: Zenarmor
Sponsored by: OPNsense
Sponsored by: Klara, Inc.
Differential Revision: https://reviews.freebsd.org/D38066
We already remove mbuf tags from packets transitting an if_epair, but we
didn't remove vlan metadata.
In certain configurations this could lead to unexpected vlan tags
turning up on the rx side.
PR: 270736
Reviewed by: markj
MFC after: 1 week
Sponsored by: Rubicon Communications, LLC ("Netgate")
Differential Revision: https://reviews.freebsd.org/D39482
The assert_vop_locked messages are ignored, and file/line information
is not too useful. Fixing this without changing both witness and VFS
asserts KPIs is not possible.
Reviewed by: markj (previous version)
Tested by: pho
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
Differential revision: https://reviews.freebsd.org/D39464
Use route destination sockaddr when the gateway is eiter AF_LINK or
has the different family (IPv4 over IPv6). This change ensures
the nexthop IFA has the same family as the destination.
Reported by: Dmitriy Smirnov <fox@sage.su>
Tested by: Dmitriy Smirnov <fox@sage.su>
MFC after: 3 days
Bad behaving user-space USB applicatoins may crash the kernel by issuing
USB FS related ioctl(2)'s out of their expected order. By default
the USB FS ioctl(2) interface is only available to the
administrator, root, and driver applications like webcamd(8) needs
to be hijacked in order for this to happen.
The issue is the fast-path code does not always see updates made
by the slow-path code, and may then work on freed memory.
This is easily fixed by using an EPOCH(9) type of synchronization
mechanism. A SX(9) lock will be used as a substitute for EPOCH(9),
due to the need for sleepability. In addition most calls going into
the fast-path originate from a single user-space process and the
need for multi-thread performance is not present.
Differential Revision: https://reviews.freebsd.org/D39373
Reviewed by: markj@
Reported by: C Turt <ecturt@gmail.com>
admbugs: 994
MFC after: 1 week
Sponsored by: NVIDIA Networking
Move code in switch cases into own functions to make later changes easier to track.
No functional change, except for removing a superfluous break statement when
range checking USB_FS_MAX_FRAMES, in the USB_FS_OPEN case.
It should not have been there at all.
Suggested by: emaste@
MFC after: 1 week
Sponsored by: NVIDIA Networking
If either of vnodes is shared locked, lock must not be recursed.
Requested by: rmacklem
Reviewed by: markj, rmacklem
Tested by: pho
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
Differential revision: https://reviews.freebsd.org/D39444
It is not implemented and causes panics on boot.
This is a temporary measure until someone(tm) sorts it out.
Reported by: many
Sponsored by: Rubicon Communications, LLC ("Netgate")
Although the NFS client does not currently perform Null RPCs,
this fix is needed if/when it might do so.
Found during testing of experimental code that uses Null RPCs
to maintain/monitor TCP connections for "nconnect" mounts.
MFC after: 3 months
Commit f4179ad46f added support for operation bitmaps for
NFSv4.1/4.2. This commit uses those to implement the SP4_MACH_CRED
case for the NFSv4.1/4.2 ExchangeID operation since the Linux
NFSv4.1/4.2 client is now using this for Kerberized mounts.
The Linux Kerberized NFSv4.1/4.2 mounts currently work without
support for this because Linux will fall back to SP4_NONE,
but there is no guarantee this fallback will work forever.
This commit only affects Kerberized NFSv4.1/4.2 mounts from
Linux at this time.
MFC after: 3 months
The socket argument is superfluous, as a tcpcb always has one and
only one socket.
Reviewed by: rrs
Differential Revision: https://reviews.freebsd.org/D39434
API contract requires VOPs to handle EXDEV internally, worst case by
falling back to the generic copy routine. This broke with the recent
changes.
While here whack custom loop to lock 2 vnodes with vn_lock_pair, which
provides the same functionality internally. write start/finish around
it plays no role so got eliminated.
One difference is that vn_lock_pair always takes an exclusive lock on
both vnodes. I did not patch around it because current code takes an
exclusive lock on the target vnode. zfs supports shared-locking for
writes, so this serializes different calls to the routine as is, despite
range locking inside. At the same time you may notice the source vnode
can get some traffic if only shared-locked, thus once more this goes
the safer route of exclusive-locking. Note this should be patched to
use shared-locking for both once the feature is considered stable.
Technically the switch to vn_lock_pair should be a separate change, but
it would only introduce churn immediately whacked by the rest of the
patch.
[note: technically the review is still in progress, but so is the
active breakage]
Sponsored by: Rubicon Communications, LLC ("Netgate")
MAC flapping occurs when a bridge receives packets with the same source MAC
address on different member interfaces. The common reasons are:
- user roams from one bridge port to another
- user has wrong network setup, bridge loops e.g.
- someone set duplicated ethernet address on his/her nic
- some bad guy / virus / trojan send spoofed packets
if_bridge currently updates the bridge routing entry silently hence it is hard
to diagnose.
Emit logs when MAC address port flapping occurs to make it easier to diagnose.
Reviewed by: kp
MFC after: 2 weeks
Differential Revision: https://reviews.freebsd.org/D39375
Both BBR and Rack have the ability to log socket options, which is currently disabled. Rack
has an experimental SaD (Sack Attack Detection) algorithm that should be made available. Also
there is a t_maxpeak_rate that needs to be removed (its un-used).
Reviewed by: tuexen, cc
Sponsored by: Netflix Inc
Differential Revision: https://reviews.freebsd.org/D39427
To allow it to be used before ENTRY we need to ensure the symbol is
in the .text section. It also needs to be aligned correctly.
While here mark the symbol type as a function as in the ENTRY macro.
Reported by: jrtc27
Sponsored by: Arm Ltd
ifnets are allowed to pass batches of multiple packets to if_input,
linked by the m_nextpkt pointer. iflib_rxeof() sometimes does this, for
example. Netmap's generic mode did not handle this and would only
deliver the first packet in the batch, leaking the rest.
PR: 270636
Reviewed by: vmaffione
MFC after: 1 week
Differential Revision: https://reviews.freebsd.org/D39426
In emulated mode, the FreeBSD netmap port attempts to perform zero-copy
transmission. This works as follows: the kernel ring is populated with
mbuf headers to which netmap buffers are attached. When transmitting,
the mbuf refcount is initialized to 2, and when the counter value has
been decremented to 1 netmap infers that the driver has freed the mbuf
and thus transmission is complete.
This scheme does not generalize to the situation where netmap is
attaching to a software interface which may transmit packets among
multiple "queues", as is the case with bridge or lagg interfaces. In
that case, we would be relying on backing hardware drivers to free
transmitted mbufs promptly, but this isn't guaranteed; a driver may
reasonably defer freeing a small number of transmitted buffers
indefinitely. If such a buffer ends up at the tail of a netmap transmit
ring, further transmits can end up blocked indefinitely.
Fix the problem by removing the zero-copy scheme (which is also not
implemented in the Linux port of netmap). Instead, the kernel ring is
populated with regular mbuf clusters into which netmap buffers are
copied by nm_os_generic_xmit_frame(). The refcounting scheme is
preserved, and this lets us avoid allocating a fresh cluster per
transmitted packet in the common case. If the transmit ring is full, a
callout is used to free the "stuck" mbuf, avoiding the queue deadlock
described above.
Furthermore, when recycling mbuf clusters, be sure to fully reinitialize
the mbuf header instead of simply re-setting M_PKTHDR. Some software
interfaces, like if_vlan, may set fields in the header which should be
reset before the mbuf is reused.
Reviewed by: vmaffione
MFC after: 1 month
Sponsored by: Zenarmor
Sponsored by: OPNsense
Sponsored by: Klara, Inc.
Differential Revision: https://reviews.freebsd.org/D38065
This is counterpart to e87c494015, which did the same for ethernet.
Suggested by: hselasky
Reviewed by: hselasky, kib
MFC after: 2 weeks
Differential Revision: https://reviews.freebsd.org/D39405
- Let the compiler use constant folding to eliminate conditionals.
- Fix some inconsistent whitespace.
No functional change intended.
Reviewed by: zlei
MFC after: 2 weeks
Sponsored by: Klara, Inc.
Differential Revision: https://reviews.freebsd.org/D38410
by making it accept some open(2) flags. More precisely, only
O_CLOEXEC is supported, the flag is translated into the KQUEUE_CLOEXEC flag
for kqueuex(2), and O_NONBLOCK is silently ignored.
Reported and tested by: vishwin
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
Differential revision: https://reviews.freebsd.org/D39377
Handling of the CLOSE_RANGE_UNSHARE flag is not implemented due to
difference in fd unsharing mechanism in the Linux and FreeBSD.
Reviewed by: mjg
Differential revision: https://reviews.freebsd.org/D39398
MFC after: 2 weeks
There have been many changes to rack over the last couple of years, including:
a) Ability when switching stacks to have one stack query another.
b) Internal use of micro-second timers instead of ticks.
c) Many changes to pacing in forms of
1) Improvements to Dynamic Goodput Pacing (DGP)
2) Improvements to fixed rate paciing
3) A new feature called hybrid pacing where the requestor can
get a combination of DGP and fixed rate pacing with deadlines
for delivery that can dynamically speed things up.
d) All kinds of bugs found during extensive testing and use of the
rack stack for streaming video and in fact all data transferred
by NF
Reviewed by: glebius, gallatin, tuexen
Sponsored By: Netflix Inc.
Differential Revision: https://reviews.freebsd.org/D39402
So stack switching as always been a bit of a issue. We currently use a break before make setup which means that
if something goes wrong you have to try to get back to a stack. This patch among a lot of other things changes that so
that it is a make before break. We also expand some of the function blocks in prep for new features in rack that will allow
more controlled pacing. We also add other abilities such as the pathway for a stack to query a previous stack to acquire from
it critical state information so things in flight don't get dropped or mis-handled when switching stacks. We also add the
concept of a timer granularity. This allows an alternate stack to change from the old ticks granularity to microseconds and
of course this even gives us a pathway to go to nanosecond timekeeping if we need to (something for the data center to consider
for sure).
Once all this lands I will then update rack to begin using all these new features.
Reviewed by: tuexen
Sponsored by: Netflix Inc
Differential Revision: https://reviews.freebsd.org/D39210
If block_cloning is disabled, or other errors from zfs_clone_range()
return an EXDEV we should fall back to vn_generic_copy_file_range().
This fixes issues when copying files on the same dataset with
block_cloning disabled.
Upstreamed as pull request to OpenZFS.
Reviewed by: Mateusz Guzik <mjguzik@gmail.com>
OpenZFS pull request: 14713
Handle data-lanes property for pcie phy and set it accordingly.
This makes devices attached to pcie3 work properly.
For some RK3568 based boards, RTL8125B based device is
connected it. So with this, realtek-re-kmod driver attaches
and works.
Partially obtained from OpenBSD.
Tested on NanoPI-R5S, FireFly Station P2 boards.
Notable upstream pull request merges:
#12194 Fix short-lived txg caused by autotrim
#13368 ZFS_IOC_COUNT_FILLED does unnecessary txg_wait_synced()
#13392 Implementation of block cloning for ZFS
#13741 SHA2 reworking and API for iterating over multiple implementations
#14282 Sync thread should avoid holding the spa config write lock
when possible
#14283 txg_sync should handle write errors in ZIL
#14359 More adaptive ARC eviction
#14469 Fix NULL pointer dereference in zio_ready()
#14479 zfs redact fails when dnodesize=auto
#14496 improve error message of zfs redact
#14500 Skip memory allocation when compressing holes
#14501 FreeBSD: don't verify recycled vnode for zfs control directory
#14502 partially revert PR 14304 (eee9362a7)
#14509 Fix per-jail zfs.mount_snapshot setting
#14514 Fix data race between zil_commit() and zil_suspend()
#14516 System-wide speculative prefetch limit
#14517 Use rw_tryupgrade() in dmu_bonus_hold_by_dnode()
#14519 Do not hold spa_config in ZIL while blocked on IO
#14523 Move dmu_buf_rele() after dsl_dataset_sync_done()
#14524 Ignore too large stack in case of dsl_deadlist_merge
#14526 Use .section .rodata instead of .rodata on FreeBSD
#14528 ICP: AES-GCM: Refactor gcm_clear_ctx()
#14529 ICP: AES-GCM: Unify gcm_init_ctx() and gmac_init_ctx()
#14532 Handle unexpected errors in zil_lwb_commit() without ASSERT()
#14544 icp: Prevent compilers from optimizing away memset()
in gcm_clear_ctx()
#14546 Revert zfeature_active() to static
#14556 Remove bad kmem_free() oversight from previous zfsdev_state_list
patch
#14563 Optimize the is_l2cacheable functions
#14565 FreeBSD: zfs_znode_alloc: lock the vnode earlier
#14566 FreeBSD: fix false assert in cache_vop_rmdir when replaying ZIL
#14567 spl: Add cmn_err_once() to log a message only on the first call
#14568 Fix incremental receive silently failing for recursive sends
#14569 Restore ASMABI and other Unify work
#14576 Fix detection of IBM Power8 machines (ISA 2.07)
#14577 Better handling for future crypto parameters
#14600 zcommon: Refactor FPU state handling in fletcher4
#14603 Fix prefetching of indirect blocks while destroying
#14633 Fixes in persistent error log
#14639 FreeBSD: Remove extra arc_reduce_target_size() call
#14641 Additional limits on hole reporting
#14649 Drop lying to the compiler in the fletcher4 code
#14652 panic loop when removing slog device
#14653 Update vdev state for spare vdev
#14655 Fix cloning into already dirty dbufs
#14678 Revert "Do not hold spa_config in ZIL while blocked on IO"
Obtained from: OpenZFS
OpenZFS commit: 431083f75b
Different lagg protocols have different means and policies to process incoming
traffic. For example, for failover protocol, by default received traffic is only
accepted when they are received through the active port. For lacp protocol, LACP
control messages are tapped off, also traffic will be dropped if they are
received through the port which is not in collecting state or is not joined to
the active aggregator. It confuses if user dump and see inbound traffic on
lagg(4) interfaces but they are actually silently dropped and not passed into
the net stack.
Tap traffic after protocol processing so that user will have consistent view of
the inbound traffic, meanwhile mbuf is set with correct receiving interface and
bpf(4) will diagnose the right direction of inbound packets.
PR: 270417
Reviewed by: melifaro (previous version)
MFC after: 1 week
Differential Revision: https://reviews.freebsd.org/D39225
From static code analysis, some device drivers (cxgbe, mlx4, mthca, and qlnx)
do not enter net epoch before lagg_input_infiniband(). If IPoIB interface is a
member of lagg(4) interface, and after returning from lagg_input_infiniband()
the receiving interface of mbuf is set to lagg(4) interface, then when
concurrently destroying the lagg(4) interface, there is a small window that the
interface gets destroyed and becomes invalid before infiniband_input() re-enter
net epoch, thus leading use-after-free.
Widen NET_EPOCH coverage to prevent use-after-free.
Thanks hselasky@ for testing with mlx5 devices.
Reviewed by: hselasky
Tested by: hselasky
MFC after: 1 week
Differential Revision: https://reviews.freebsd.org/D39275
NETLINK is going to replace rtsock and a number of other ioctl/sysctl interfaces.
In-base utilies such as route(8), netstat(8) and soon ifconfig(8)
are being converted to use netlink sockets as a transport between
kernel and userland.
In the current configuration, it still possible have the kernel
without NETLINK (`nooptions NETLINK`) and use the aforementioned
utilies by buidling the world with `WITHOUT_NETLINK` src.conf knob.
However, this approach does not cover the cases when person unintentionally
builds a custom kernel without netlink and tries to use the standard userland.
This change adds `option NETLINK` to the default options for each
architecture, fixing the custom kernel issue.
For arm, this change uses `std.armv6` and `std.armv7` (netlink already in)
instead of DEFAULTS.
Reviewed By: imp
Differential Revision: https://reviews.freebsd.org/D39339
Use already-existing RTM_F_PREFIX rtm_flag to indicate that the
request assumes exact-prefix lookup instead of the
longest-prefix-match.
MFC after: 2 weeks
This will be used later in the linsysfs module to filter out VNETs.
Reviewed by: des
Differential revision: https://reviews.freebsd.org/D39382
MFC after: 1 month
Since 81167243b the size of struct pfs_node is 280 bytes, so the kernel
memory allocator takes memory from 384 bytes sized bucket. However, the
length of the node name is mostly short, e.g., for Linux emulation layer
it is up to 16 bytes. The size of struct pfs_node w/o pfs_name is 152
bytes, i.e., we have 104 bytes left to fit the node name into the 256
bytes-sized bucket.
Reviewed by: des
Differential revision: https://reviews.freebsd.org/D39381
MFC after: 1 month
Each physical port has an associated loopback tx channel and anything
transmitted over that channel by the driver is looped back internally by
the hardware as if received on that physical port. This change allows
tracing filters to be installed in this loopback path.
MFC after: 1 week
Sponsored by: Chelsio Communications
NFSv4.1/4.2 uses operation bitmaps for various operations,
such as the SP4_MACH_CRED case for ExchangeID.
This patch adds support for operation bitmaps so that
support for SP4_MACH_CRED can be added to the NFSv4.1/4.2
server in a future commit.
This commit should not change any NFSv4.1/4.2 semantics.
MFC after: 3 months
Split up the lhw lock and the scan lock. The latter is a mtx
while the former changes from mtx to sx as mac80211 downcalls may
sleep (and the ic lock is not usable in that case either and a larger
project to fix).
This will also enforce some lookups under lock (mostly scan) as well
as general protection for more compat code and avoid a possible
deadlock with one of the upcoming callbacks from driver into the
compat code.
Sponsored by: The FreeBSD Foundation
MFC after: 7 days
- Mark assert dummy variables as __unused.
- Use a dummy (void) cast of the flags argument passed to
spin_unlock_irqrestore so it gets treated as used.
Reviewed by: manu, hselasky
Differential Revision: https://reviews.freebsd.org/D39349
Use the GICD_SIZE macro (0x10000), which is half the size of the current
fixed-sized mapping (128 * 1024 == 0x20000).
In ARM64 Hyper-V instances, it seems the Distributor's registers are
located immediately preceding a range of physical memory in the bus
address space. Thus, when ram0 is attaching and attempts to reserve
SYS_RES_MEMORY resources corresponding to its physmem ranges, it fails,
because the first 0x10000 bytes of this range are already owned by gic0.
PR: 270415
Reported by: whu
Tested by: whu
Differential Revision: https://reviews.freebsd.org/D39260
init_pagetables is mapped into the segment containing the BSS, but does
not get zeroed by locore. It is used for bootstrap page table pages.
It happens that the bootstrap kernel stack is also placed in that
section, but there's no reason it shouldn't live in the BSS, so move it
there. No functional change intended.
Reviewed by: andrew
MFC after: 1 week
Sponsored by: Klara, Inc.
Sponsored by: Juniper Networks
Differential Revision: https://reviews.freebsd.org/D39367
The ENTRY macro adds instructions to the start of a function but not
EENTRY. To use these instructions in both functions move the EENTRY
use before the ENTRY use.
Sponsored by: Arm Ltd
On arm64, the PCB is stored at the top of the thread stack. For thread0
this comes from the static "initstack" region, which is placed in the
.init_pagetable section, which is not part of the BSS and thus doesn't
get zeroed by locore. (See the comment in ldscript.arm64.) It is thus
possible for the pcb_flags field to be uninitialized, which can result
in PCB_SINGLE_STEP being set.
Fix this by simply initializing the field. A separate commit will move
initstack out of the .init_pagetable section, since it has no reason to
be there, but it is preferable to explicitly initialize PCB fields
anyway. In particular, regular kernel stacks are not zeroed upon
allocation, so we should be consistent here.
Reviewed by: andrew
MFC after: 1 week
Sponsored by: Klara, Inc.
Sponsored by: Juniper Networks
Differential Revision: https://reviews.freebsd.org/D39343
Add opt_netlink.h to the linux_common module, on i386, where we don't
uses linux_common module, move opt_netlink.h inclusion under
i386 condition.
MFC after: 2 weeks
Get/set commands can now choose to provide the interface name rather
than the interface index. This allows userspace to avoid a call to
if_nametoindex().
Suggested by: melifaro
Reviewed by: melifaro
Sponsored by: Rubicon Communications, LLC ("Netgate")
Differential Revision: https://reviews.freebsd.org/D39359