Make sys/amd64/linux/linux_ptrace.c machine-independent,
in preparation for moving it into sys/compat/linux/.
No functional changes.
Reviewed By: kib
Sponsored By: EPSRC
Differential Revision: https://reviews.freebsd.org/D32756
In preparation for machine-independent sys/compat/linux/linux_ptrace.c,
rename the i386-specific Linux ptrace(2) implementation. No functional
changes.
Sponsored By: EPSRC
Differential Revision: https://reviews.freebsd.org/D32757
The separator here should be tabs, not spaces. This fixes a warning
from chromium-browser on Bionic:
[1022/162248.137612:ERROR:process_info_linux.cc(107)] format error: unrecognized Uid format
Sponsored By: EPSRC
Differential Revision: https://reviews.freebsd.org/D32612
The existing logic didn't take into account newly inserted mappings
wholly contained by an existing region (or vice versa), nor did it
account for weird overlap scenarios. The latter is probably unlikely
to happen, but the former may happen in UEFI: BootServicesData allocated
within a large chunk of ConventionalMemory. This situation blows up vm
initialization.
While we're here, remove the "exact match" logic as it's likely wrong;
if an exact match exists with conflicting flags, for instance, then we
should probably be doing something else. The new logic takes into
account exact matches as part of the overlapping efforts.
Reviewed by: kib, mhorne (both earlier version)
Differential Revision: https://reviews.freebsd.org/D32701
The nfscl_getref() function is called within nfscl_doiods() when
the NFSv4.1/4.2 pNFS client is doing I/O on a DS. As such,
nfscl_getref() needs to check for a forced dismount.
This patch adds that check.
Found during a recent IETF NFSv4 working group testing event.
MFC after: 2 weeks
Move the efimedia reporting to g_part_mbr_efimedia and use that from
g_part_mbr_dumpconf to report it.
Sponsored by: Netflix
Reviewed by: mav
Differential Revision: https://reviews.freebsd.org/D32781
Move the efimedia reporting to g_part_gpt_efimedia and use that from
g_part_gpt_dumpconf to report it.
Sponsored by: Netflix
Reviewed by: mav
Differential Revision: https://reviews.freebsd.org/D32780
This makes hwpmc(4) sampling work on ACPI-based AArch64 systems.
Tested on ARM Neoverse N1.
Submitted by: Greg V <greg@unrelenting.technology>
Reviewed by: jrtc27, mhorne
Differential Revision: https://reviews.freebsd.org/D24423
Don't pass the same name to multiple mutexes while using unique types
for WITNESS. Just use the unique types as the mutex names.
Reviewed by: markj
MFC after: 1 week
Sponsored by: Chelsio Communications
Differential Revision: https://reviews.freebsd.org/D32740
The purpose of this change is to reduce the amount of dmesg(8) noise when
VT switching after a panic.
Submitted by: Greg V <greg@unrelenting.technology>
MFC after: 1 week
Differential Revision: https://reviews.freebsd.org/D30174
Sponsored by: NVIDIA Networking
Correctly recognize NEON/SIMD and VFP instructions in THUMB2 mode and pass
these to the appropriate handler. Note that it is not necessary to filter
all undefined instruction variant or register combinations, this is a job
for given handler.
Reported by: Robert Clausecker <fuz@fuz.su>
PR: 259187
MFC after: 2 weks
Rework if_epair(4) to no longer use netisr and dpcpu.
Instead use mbufq and swi_net.
This simplifies the code and seems to make it work better and
no longer hang.
Work largely by bz@, with minor tweaks by kp@.
Reviewed by: bz, kp
MFC after: 3 weeks
Differential Revision: https://reviews.freebsd.org/D31077
For NFS RPCs that receive a NFSERR_DELAY reply, the delay time
is initially 1sec and then increases exponentially to NFS_TRYLATERDEL.
It was found that this delay time is excessive for some NFSv4
servers, which work well with a 1msec delay.
A 1sec delay resulted in very slow performance for Remove and
Rename when delegations and pNFS were enabled.
This patch decreases the initial delay time to 1msec.
Found during a recent IETF NFSv4 working group testing event.
MFC after: 2 weeks
It was a hack only needed for trpt, which can just define it locally.
This makes it possible to fix up systat which also includes the file.
Sponsored by: Rubicon Communications, LLC ("Netgate")
Given nodes 1 and 2, where node 1 has an advskew of 0 and node 2 has an
advskew of 100, making them master and backup respectively.
If net.inet.carp.demotion is set to a negative value on node 1, node 2
might become master while node 1 still retains it master status. Wether
or not node 2 becomes master seems to depend on the nodes advskew and
what the demotion sysctl was set to on node 1.
The reason for node 2 becoming master seems to be that the calculated
advskew taking demotion into account is truncated to a single unsigned
byte when copied into the carp header for sending, and node 1 stays
master since it takes uses the whole non-truncated calculated advskew
when deciding wether to stay master.
PR: 259528
Reviewed by: donner, glebius
MFC after: 3 weeks
Sponsored by: Modirum MDPay
Differential Revision: https://reviews.freebsd.org/D32759
Kegs with no items reserved have uk_reserve = 0. So the check
keg->uk_reserve >= dom->ud_free_items will be true once all slabs are
depleted. Then, rather than go and allocate a fresh slab, we return to
the cache layer.
The intent was to do this only when the keg actually has a reserve, so
modify the check to verify this first. Another approach would be to
make uk_reserve signed and set it to -1 until uma_zone_reserve() is
called, but this requires a few casts elsewhere.
Fixes: 1b2dcc8c54 ("uma: Avoid depleting keg reserves when filling a bucket")
MFC after: 2 weeks
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D32516
M_USE_RESERVE is used in a couple of places in the VM to avoid unbounded
recursion when the direct map is not available, as is the case on 32-bit
platforms or when certain kernel sanitizers (KASAN and KMSAN) are
enabled. For example, to allocate KVA, the kernel might allocate a
kernel map entry, which might require a new slab, which requires KVA.
For these zones, we use uma_prealloc() to populate a reserve of items,
and then in certain serialized contexts M_USE_RESERVE can be used to
guarantee a successful allocation. uma_prealloc() allocates the
requested number of items, distributing them evenly among NUMA domains.
Thus, in a first-touch zone, to satisfy an M_USE_RESERVE allocation we
might have to check the slab lists of other domains than the current one
to provide the semantics expected by consumers.
So, try harder to find an item if M_USE_RESERVE is specified and the keg
doesn't have anything for current (first-touch) domain. Specifically,
fall back to a round-robin slab allocation. This change fixes boot-time
panics on NUMA systems with KASAN or KMSAN enabled.[1]
Alternately we could have uma_prealloc() allocate the requested number
of items for each domain, but for some existing consumers this would be
quite wasteful. In general I think keg_fetch_slab() should try harder
to find free slabs in other domains before trying to allocate fresh
ones, but let's limit this to M_USE_RESERVE for now.
Also fix a separate problem that I noticed: in a non-round-robin slab
allocation with M_WAITOK, rather than sleeping after a failed slab
allocation we simply try again. Call vm_wait_domain() before retrying.
Reported by: mjg, tuexen [1]
Reviewed by: alc
MFC after: 2 weeks
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D32515
In 7ec86b6609 ("Also print symbols when printing arm64 registers")
a new function was created to print most registers. Unfortunately the
Link Register (LR) was being printed when we should have printed the
Exception Link Register (ELR).
Fix this by adding the missing 'e'.
Sponsored by: The FreeBSD Foundation
The iicoc driver supports the OpenCores I2C IP. This is included in at
least the SiFive "Unleashed" and "Unmatched" cores and probably others.
Suggested by: jrtc27
Only build on RISC-V for now, since we're not aware of any other cores
with this IP supported by FreeBSD.
Reviewed by: jrtc27, philip
MFC after: 3 days
Differential Revision: https://reviews.freebsd.org/D32737
For pNFS servers that specify that Layouts are to be returned
upon close, they may expect that LayoutReturn to happen before
the associated Close.
This patch modifies the NFSv4.1/4.2 pNFS client so that this
is done. This only affects a pNFS mount against a non-FreeBSD
NFSv4.1/4.2 server that specifies return_on_close in LayoutGet
replies.
Found during a recent IETF NFSv4 working group testing event.
MFC after: 2 weeks
Add a void *ni_drv_data field to struct ieee80211_node that drivers
can use to backtrack to their internal state from a net80211 node.
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
X-Differential Revision: https://reviews.freebsd.org/D30654 (abandoned)
Use proc_get_binpath() to get the hardlink right.
PR: 248184
Reviewed by: emaste, markj
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
Differential revision: https://reviews.freebsd.org/D32738
Similar to commit 2be417843a, I believe there could be a race between
the NFS client VOP_LOOKUP() and file Writing that could result in stale
file attributes being loaded into the NFS vnode by VOP_LOOKUP().
I have not been able to reproduce a failure due to this race, but
I believe that there are two possibilities:
The Lookup RPC happens while VOP_WRITE() is being executed and loads
stale file attributes after VOP_WRITE() returns when it has already
completed the Write/Commit RPC(s).
--> For this case, setting the local modify timestamp at the end of
VOP_WRITE() should ensure that stale file attributes are not loaded.
The Lookup RPC occurs after VOP_WRITE() has returned, while
asynchronous Write/Commit RPCs are in progress and then is
blocked by the vnode held by VOP_OPEN/VOP_CLOSE/VOP_FSYNC which
will flush writes via ncl_flush() or ncl_vinvalbuf(), clearing the
NMODIFIED flag (which indicates Writes-in-progress). The VOP_LOOKUP()
then acquires the NFS vnode lock and fills in stale file attributes.
--> Setting the local modify timestamp in ncl_flsuh() and ncl_vinvalbuf()
when they clear NMODIFIED should ensure that stale file attributes
are not loaded.
This patch does the above.
PR: 259071
Reviewed by: asomers
MFC after: 2 weeks
Differential Revision: https://reviews.freebsd.org/D32677
Commit 2be417843a added n_localmodtime, which is used by Lookup
and ReaddirPlus to check to see if the file attributes in an RPC
reply might be stale. This patch sets n_localmodtime in Deallocate.
Done as a separate commit, since Deallocate is not in stable/13.
PR: 259071
Reviewed by: asomers
Differential Revision: https://reviews.freebsd.org/D32635
Testing with it, there appears to be a race between Lookup
and VOPs like Setattr-of-size, where Lookup ends up loading
stale attributes (including what might be the wrong file size)
into the NFS vnode's attribute cache.
The race occurs when the modifying VOP (which holds a lock
on the vnode), blocks the acquisition of the vnode in Lookup,
after the RPC (with now potentially stale attributes).
Here's what seems to happen:
Child Parent
does stat(), which does
VOP_LOOKUP(), doing the Lookup
RPC with the directory vnode
locked, acquiring file attributes
valid at this point in time
blocks waiting for locked file does ftruncate(), which
vnode does VOP_SETATTR() of Size,
changing the file's size
while holding an exclusive
lock on the file's vnode
releases the vnode lock
acquires file vnode and fills in
now stale attributes including
the old wrong Size
does a read() which returns
wrong data size
This patch fixes the problem by saving a timestamp in the NFS vnode
in the VOPs that modify the file (Setattr-of-size, Allocate).
Then lookup/readdirplus compares that timestamp with the time just
before starting the RPC after it has acquired the file's vnode.
If the modifying RPC occurred during the Lookup, the attributes
in the RPC reply are discarded, since they might be stale.
With this patch the test program works as expected.
Note that the test program does not fail on a July stable/12,
although this race is in the NFS client code. I suspect a
fairly recent change to the name caching code exposed this
bug.
PR: 259071
Reviewed by: asomers
MFC after: 2 weeks
Differential Revision: https://reviews.freebsd.org/D32635
Note that this is largely untested at this point, as was
the previous version; I'm committing this mostly to get
rid of `struct linux_pt_reg`.
Sponsored By: EPSRC
Differential Revision: https://reviews.freebsd.org/D32735
In 6e66030c4c, additional ptracestop was added in order
to implement PTRACE_EVENT_EXEC. Make it only apply to cases
where the debugger is a Linux processes; native FreeBSD
debuggers can trace Linux processes too, but they don't
expect that additonal ptracestop.
Fixes: 6e66030c4c
Reported By: kib
Reviewed By: kib
Sponsored By: EPSRC
Differential Revision: https://reviews.freebsd.org/D32726
Commit 5e5ca4c8fc added a NFSMNTP_DELEGISSUED flag to indicate when
a delegation has been issued to the mount. For the common case
where an NFSv4 server is not issuing delegations, this flag
can be checked to avoid acquisition of the NFSCLSTATEMUTEX.
This patch adds checks for NFSMNTP_DELEGISSUED being set
to two more functions.
This change appears to be performance neutral for a small number
of opens, but should reduce lock contention for a large number of opens
for the common case where server is not issuing delegations.
MFC after: 2 week
If we get a Sacked peer with an MTU change we can retransmit forever if the
last bytes are sacked and the client goes away (think power off). Then we
never see the end condition and continually retransmit.
Reviewed by: Michael Tuexen
Sponsored by: Netflix Inc.
Differential Revision: https://reviews.freebsd.org/D32671
Previously it returned a shorter struct. I can't find any
modern software that uses it, but tests/ptrace from strace(1)
repo complained.
Differential Revision: https://reviews.freebsd.org/D32601
This fixes ./waitid.gen.test from the strace(1) test suite.
Reviewed By: kib
Sponsored By: EPSRC
Differential Revision: https://reviews.freebsd.org/D32617
Translate ERESTART into Linux "internal" errno ERESTARTSYS.
This fixes the erestartsys.gen.test from strace(1).
Reviewed By: kib
Sponsored By: EPSRC
Differential Revision: https://reviews.freebsd.org/D32623
The pic_* interface was used.
Only edge interrupts are supported by this controller.
Driver mutex had to be converted to a spin lock so that it can
be used in the interrupt filter context.
Two types of intr_map_data are supported - INTR_MAP_DATA_GPIO and
INTR_MAP_DATA_FDT. This way interrupts can be allocated using the
userspace gpio interrupt allocation method, as well as directly from
simplebus. The latter can be used by devices that have its irq routed
to a GPIO pin.
Obtained from: Semihalf
Sponsored by: Alstom Group
Differential revision: https://reviews.freebsd.org/D32587
Driver polls status of all PHYs connected to the switch in a
fixed interval.
Add a sysctl that allows to control frequency of that.
The value is expressed in ticks and defaults to "hz", or 1 second.
Obtained from: Semihalf
Sponsored by: Alstom Group
It was previously used by felix(4) for PHY communication.
Since that is not the case anymore this driver is now left unused.
Obtained from: Semihalf
Sponsored by: Alstom Group
Previously we would use an external MDIO device found on the PCI bus.
Switch to using MDIO mapped in a separate BAR of the switch device.
It is much easier this way since we don't have to depend on another
driver anymore.
Obtained from: Semihalf
Sponsored by: Alstom Group
Some BIOSes protect memory region they reside in by using DMAR to
prevent devices from doing any DMA transactions to that part of RAM.
AMI refers to this as "DMA Control Guarantee".
Disable the protection when address translation is enabled.
I stumbled upon this while investigation a failing coredump on a device
which has this feature enabled.
Sponsored by: Stormshield
Obtained from: Semihalf
Reviewed by: kib
Differential revision: https://reviews.freebsd.org/D32591
In some cases we might have to create DMAR context before the
corresponding device has been enumerated by the PCI bus.
In that case we get called with NULL dev, because of that trying
to reserve PCI regions causes a NULL pointer dereference in
pci_find_pcie_root_port.
Sponsored by: Stormshield
Obtained from: Semihalf
MFC after: 2 weeks
Reviewed by: kib, rlibby
Differential revision: https://reviews.freebsd.org/D32589
Turns out that if a peer sends in a window update right after rack fires off
a persists probe, we can mis-interpret the window update and calculate
a bogus RTT (very short). We still process the window update and send
the data but we incorrectly generate an RTT. We should be only doing
the RTT stuff if the rwnd is still small and has not changed.
Reviewed by: Michael Tuexen
Sponsored by: Netflix Inc.
Differential Revision: https://reviews.freebsd.org/D32717
This change is a slight performance optimization for systems with a slow
64-bit division.
The th->th_scale and th->th_large_delta values only depend on the
timecounter frequency and the th->th_adjustment. The timecounter
frequency of a timehand only changes when a new timecounter is activated
for the timehand. The th->th_adjustment is only changed by the NTP
second update. The NTP second update is not done for every call of
tc_windup().
Move the code block to recalculate the scaling factor and
the large delta of a timehand to the new helper function
recalculate_scaling_factor_and_large_delta().
Call recalculate_scaling_factor_and_large_delta() when a new timecounter
is activated and a NTP second update occurred.
MFC after: 1 week
This allows the pmap_remove(min, max) call to see empty pmap and exploit
empty pmap optimization.
Reviewed by: markj
Tested by: markj
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
Differential revision: https://reviews.freebsd.org/D32569
to match the added accounting of the top-level page table pages.
Reviewed by: markj
Tested by: markj
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
Differential revision: https://reviews.freebsd.org/D32569
both for kernel and user page tables, the later exist in the PTI case.
Reviewed by: markj
Tested by: markj
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
Differential revision: https://reviews.freebsd.org/D32569
While doing it, also move all the code to resolve pathnames and obtain
text vp and dvp, into single place. Besides simplifying the code, it
avoids spurious vnode relocks and validates the explanation why
a transient text reference on the script vnode is not harmful.
Reviewed by: markj
Tested by: pho
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
Differential revision: https://reviews.freebsd.org/D32611
For this, use vn_fullpath_hardlink() to resolve executable name for
execve(2).
This should provide the right hardlink name, used for execution, instead
of random hardlink pointing to this binary. Also this should make the
AT_EXECNAME reliable for execve(2), since kernel only needs to resolve
parent directory path, which should always succeed (except pathological
cases like unlinking a directory).
PR: 248184
Reviewed by: markj
Tested by: pho
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
Differential revision: https://reviews.freebsd.org/D32611
Also re-align comments, and group booleans and char members together.
Reviewed by: markj
Tested by: pho
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
Differential revision: https://reviews.freebsd.org/D32611
Dummynet differs from ALTQ in that ALTQ schedules packets after they
leave pf. Dummynet schedules them after they leave pf, but then
re-injects them.
We currently deal with this by ensuring we don't re-schedule a packet we
get from dummynet, but this produces unexpected results when combined
with NAT, as dummynet processing is done after the NAT transformation.
In other words, the second time the packet is handed to pf it may have a
different source and destination address.
Simplify this by moving dummynet processing to after all other pf
processing, and not re-processing (but always passing) packets from
dummynet.
This fixes NAT of dummynet delayed packets, and also reduces processing
overhead (because we only do state/rule lookup for each dummynet packet
once, rather than twice).
MFC after: 3 weeks
Sponsored by: Rubicon Communications, LLC ("Netgate")
Differential Revision: https://reviews.freebsd.org/D32665
We should clear firewall tags on loopback, icmp reflection, or if_epair
transmission. Left over tags can produce unexpected behaviour,
especially on if_epair where a and b interfaces can be in different
vnets, and have different firewall policies set.
MFC after: 3 weeks
Sponsored by: Rubicon Communications, LLC ("Netgate")
Differential Revision: https://reviews.freebsd.org/D32664
Remove all (non-persistent) tags when we transmit a packet. Real network
interfaces do not carry any tags either, and leaving tags attached can
produce unexpected results.
Reviewed by: bz, glebius
MFC after: 3 weeks
Sponsored by: Rubicon Communications, LLC ("Netgate")
Differential Revision: https://reviews.freebsd.org/D32663
instead of do {} while (0).
This makes them real void expressions, and they can be used anywhere
where a void function call can be used, for example in a conditional
operator.
Reviewed by: kib, mjg
Differential revision: https://reviews.freebsd.org/D32696
I dropped the + 1 from the other two instances in each file but failed
to do so for this one, resulting in a more egregious buffer overread
than the one I was fixing (since the read character ended up in the
output if there was space).
Reported by: Jenkins
Fixes: 34fb1c133c ("Fix intra-object buffer overread for labeled msdosfs volumes")
The starting sequence number used to verify that TLS 1.0 CBC records
are encrypted in-order in the OCF layer was always set to 0 and not to
the initial sequence number from the struct tls_enable.
In practice, OpenSSL always starts TLS transmit offload with a
sequence number of zero, so this only matters for tests that use a
random starting sequence number.
Reviewed by: markj
Sponsored by: Netflix
Differential Revision: https://reviews.freebsd.org/D32676
Volume labels, like directory entries, are padded with spaces and so
have no NUL terminator. Whilst the MIN for the dsize argument to strlcpy
ensures that the copy does not overflow the destination, strlcpy is
defined to return the number of characters in the source string,
regardless of the provided dsize, and so keeps reading until it finds a
NUL, which likely exists somewhere within the following fields, but On
CHERI with the subobject bounds enabled in the compiler this buffer
overread will be detected and trap with a bounds violation.
Found by: CHERI
Reviewed by: imp
Differential Revision: https://reviews.freebsd.org/D32579
In the ATA/ATAPI spec these are space-padded fixed-length strings with
no NUL-terminator (and byte swapped). When performing the identify we
call ata_param_fixup to swap the bytes back to be in order, strip any
leading/trailing spaces and coalesce consecutive spaces, padding with
NULs. However, if the input has no padding spaces, the fixed-up strings
are still not NUL-terminated. This causes two issues. The first is that
strlcpy will truncate the string by replacing the final byte with a NUL.
The second is that strlcpy will keep reading src until it finds a NUL in
order to calculate the return value, which is defined as the length of
src (so that callers can then compare it with the dsize input to see if
the input string was truncated), thereby reading past the end of the
buffer and into whatever adjacent fields are in the structure. In
practice there's a NUL byte somewhere in the structure, but on CHERI
with subobject bounds enabled in the compiler this overread will be
detected and trap as a bounds violation.
Note this matches ata_xpt's aprobedone, which does a bcopy to a
malloc'ed buffer and manually NUL-terminates it for the CAM path's
device's serial_num.
Found by: CHERI
Reviewed by: imp, scottl
Differential Revision: https://reviews.freebsd.org/D32567
Currently, to support 64-byte contexts, xhci_ctx_[gs]et_le(32|64) take a
pointer to the field within a 32-byte context and, if 64-byte contexts
are in use, compute where the 64-byte context field is and use that
instead by deriving a pointer from the 32-byte field pointer. This is
done by exploiting a combination of 64-byte contexts being the same
layout as their 32-byte counterparts, just with 32 bytes of padding at
the end, and that all individual contexts are either in a device
context or an input context which itself is page-aligned. By masking out
the low 4 bits (which is the offset of the field within the 32-byte
contxt) of the offset within the page, the offset of the invididual
context within the containing device/input context can be determined,
which is itself 32 times the number of preceding contexts. Thus, adding
this value to the pointer again gets 64 times the number of preceding
contexts plus the field offset, which gives the offset of the 64-byte
context plus the field offset, which is the address of the field in the
64-byte context.
However, this involves a fair amount of lying to the compiler when
constructing these intermediate pointers, and is rather difficult to
reason about. In particular, this is problematic for CHERI, where we
compile the kernel with subobject bounds enabled; that is, unless
annotated to opt out (e.g. for C struct inheritance reasons where you
need to be able to downcast, or containerof idioms), a pointer to a
member of a struct is a capability whose bounds only cover that field,
and any attempt to dereference outside those bounds will fault,
protecting against intra-object buffer overflows. Thus the pointer given
to xhci_ctx_[gs]et_le(32|64) is a capability whose bounds only cover the
field in the 32-byte context, and computing the pointer to the 64-byte
context field takes the address out of bounds, resulting in a fault when
later dereferenced.
This can be cleaned up by using a different abstraction. Instead of
doing the 32-byte to 64-byte conversion on access to the field, we can
do the conversion when getting a pointer to the context itself, and
define proper 64-byte versions of contexts in order to let the compiler
do all the necessary arithmetic rather than do it manually ourselves.
This provides a cleaner implementation, works for CHERI and may even be
slightly more performant as it avoids the need to mess with masking
pointers (which cannot in the general case be optimised by compilers to
be reused across accesses to different fields within the same context,
since it does not know that the contexts are over-aligned compared with
the C ABI requirements).
Reviewed by: hselasky
Differential Revision: https://reviews.freebsd.org/D32554
Rack caches TCP/IP header for fast send, so it doesn't call
tcpip_fillheaders(). After certain socket option changes,
namely IPV6_TCLASS, IP_TOS and IP_TTL it needs to update
its fast block to be in sync with the inpcb.
Reviewed by: rrs
Differential Revision: https://reviews.freebsd.org/D32655
Pass control for IP/IP6 level options from generic tcp_ctloutput_set()
down to per-stack ctloutput.
Call tcp6_use_min_mtu() from tcp stack tcp_default_ctloutput().
Reviewed by: rrs
Differential Revision: https://reviews.freebsd.org/D32655
After handling them in IP level ctloutput, pass them down to TCP
ctloutput.
We already have a hack to handle IPV6_USE_MIN_MTU. Leave it in place
for now, but comment out how it should be handled.
For IPv4 we are interested in IP_TOS and IP_TTL.
Reviewed by: rrs
Differential Revision: https://reviews.freebsd.org/D32655
TCP stack sysctl nodes are currently inserted using the stack
name alias. Allow the user to get the current stack's alias to
allow for programatic sysctl access.
Obtained from: Netflix
This brings it inline with what's in openbsd. I tested it locally
with 2G and 5G association; it seems to work.
Tested: Intel 7260 AC, hw 0x140, STA mode, 2G/5G
Differential Revision: https://reviews.freebsd.org/D32627
Subscribers: imp
Obtainde from: OpenBSD
Summary:
This updates the if_iwmreg.h definitions to;
OpenBSD: if_iwmreg.h,v 1.65 2021/10/11 09:03:22 stsp Exp
A few things haven't been fully converted, namely:
* I left a couple things as enums for now just to reduce the
other diffs needed; but they're the same values
* The IWM_SCD_QUEUE_* macros have different offsets which I
didn't update in case they broke things / changed based on later
firmware. But they also may be real bugfixes which are needed
for later chips. It'll need more testing before flipping this on.
The c file updates are:
* Use the newer names for things if the name changed but the semantics
didn't
* Explicitly use the earlier firmware structs which maintain compat
with the current firmware and code. The newer ones are in here and
they'll get converted when more openbsd code is merged into this tree.
* Use the older iwm rate table for now, which has entries for legacy
rates, HT and VHT. Our code works with that right now, updating it
to openbsd's err, "different" version can be done at a later date
when HT/VHT support is added.
Notably, a bunch of definitions were deleted that weren't used.
They're not used either in the openbsd/dfbsd drivers so I think it's
safe to delete them in the long run.
Test Plan: 7260 hw 0x140
Subscribers: imp
Differential Revision: https://reviews.freebsd.org/D32627
Reviewed by: md5
Obtained From: OpenBSD
According to 11.4.8 in RFC 7143, ExpDataSN MUST be 0 if the response
code is not Command Completed, but we were requiring it to always be
the count of DataIn PDUs regardless of the response code.
In addition, at least one target (OCI Oracle iSCSI block device)
returns an ExpDataSN of 0 when returning a valid completion with an
error status (Check Condition) in response to a SCSI Inquiry. As a
workaround for this target, only warn without resetting the connection
for a 0 ExpDataSN for responses with a non-zero error status.
PR: 259152
Reported by: dch
Reviewed by: dch, mav, emaste
Fixes: 4f0f5bf995 iscsi: Validate DataSN values in Data-In PDUs in the initiator.
Sponsored by: Chelsio Communications
Differential Revision: https://reviews.freebsd.org/D32650
The new iSCSI initiator iscsi(4) was introduced with FreeBSD 10.0, and
the old intiator was marked obsolete shortly thereafter (in commit
d32789d95c, MFC'd to stable/10 in ba54910169). Remove it now.
Reviewed by: jhb, mav
Relnotes: yes
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D32673
If the congestion window is very large the fact that we multiply it by 1000 (for microseconds) can
cause the uint32_t to overflow and we incorrectly calculate a very small divisor. This will then
cause the burst timer to be very large when it should be 0. Instead lets make the three variables
uint64_t and avoid the issue.
Reviewed by: Michael Tuexen
Sponsored by: Netflix Inc.
Differential Revision: https://reviews.freebsd.org/D32668
A BPF descriptor only has an associated interface descriptor once it is
attached to an interface, e.g., with BIOCSETIF. Avoid dereferencing a
NULL pointer in filt_bpfwrite() if the BPF descriptor is not attached.
Reviewed by: ae
Reported by: syzbot+ae45d5166afe15a5a21d@syzkaller.appspotmail.com
Fixes: ded77e0237 ("Allow the BPF to be select for write.")
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D32561
-Each CQ start task queue to poll when completion happens.
This means every rx and tx queue has its own cleanup task
thread to poll the completion.
- Arm EQ everytime no matter it is mana or hwc. CQ arming
depends on the budget.
- Fix a warning in mana_poll_tx_cq() when cqe_read is 0.
- Move cqe_poll from EQ to CQ struct.
- Support EQ sharing up to 8 vPorts.
- Ease linkdown message from mana_info to mana_dbg.
Tested by: whu
MFC after: 2 weeks
Sponsored by: Microsoft
There was a case in nfscl_doiods() where the function would return
without releasing the delegation shared lock, if it was aquired by
the call to nfscl_getstateid(). This patch adds that release.
I have never observed a failure due to this missing release, so I
do not know if it ever happens in practice. However, since the pNFS
client is not yet heavily used, it might be the case.
Found by code inspection during a recent NFSv4 IETF working group
testing event.
MFC after: 2 week
Add a dummy MODULE_SUPPORTED_DEVICE define as we do for other
MODULE_* macros. This is needed by a wireless driver.
MFC after: 3 days
Differential Revision: https://reviews.freebsd.org/D32641
Add bcd2bin() as linuxkpi_bcd2bin(). Libkern does provide a bcd2bin()
which cannot be used leaving us with a conflict (see comment in file).
Fortunately this is only seen in one driver so far and it seems easier
to drop this in and change a single line in the driver than to add this
inline in the driver.
MFC after: 3 days
Reviewed by: hselasky
Differential Revision: https://reviews.freebsd.org/D32647
Rename the struct pci_driver {} field got the list_head from links
to node as a driver is actually initialsing this to {} which seems
questionable but it will at least make us match the Linux structure
field name.
MFC after: 3 days
Reviewed by: manu, hselasky
Differential Revision: https://reviews.freebsd.org/D32645
Make the struct pci_dev argument to the pci_{read,write}_config*()
functions "const" to match the Linux definition as some drivers
try to pass in a const argument which we currently fail to honor.
Sponsored by: The FreeBSD Foundation
MFC after: 3 days
Reviewed by: hselasky
Differential Revision: https://reviews.freebsd.org/D32644
Add netdev_features.h as a spearate file from the future netdevice.h
implementation to avoid include problems with a future skbuff.h.
Sponsored by: The FreeBSD Foundation
MFC after: 3 days
Reviewed by: hselasky
Differential Revision: https://reviews.freebsd.org/D32643
Add a dummy simple_open() to fs.h as we have for other
(unsupported) functions.
This is needed by a wireless driver.
MFC after: 3 days
Reviewed by: hselasky
Differential Revision: https://reviews.freebsd.org/D32642
netdev_priv() is a LinuxKPI function which was used with the old ifnet
linux/netdevice.h implementation which was not adaptable to modern
Linux drviers unless rewriting them for ifnet in first place which
defeats the purpose.
Rename the netdev_priv() calls in mlx4 to mlx4_netdev_priv()
returning the ifnet softc to avoid conflicting symbol names
with different implementations in the future.
MFC after: 3 days
Reviewed by: hselasky, kib
Differential Revision: https://reviews.freebsd.org/D32640
It was here since divert(4) was introduced, probably just came with a
protocol definition boilerplate. There is no useful socket option
that can be set or get for a divert socket.
Reviewed by: donner
Differential Revision: https://reviews.freebsd.org/D32608
For making the call faster, do not count active/inactive object queues,
and do not report vnode info if any (for tmpfs).
Reviewed by: markj
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
Differential revision: https://reviews.freebsd.org/D31163
Previously the MSR-based timecounter was registered during
SI_SUB_HYPERVISOR, i.e., very early during boot, and before SI_SUB_LOCK.
After commit 621fd9dcb2 this triggers a panic since the timecounter
list lock is not yet initialized.
The hyperv timecounter does not need to be registered so early, so defer
that to SI_SUB_DRIVERS, at the same time the hyperv TSC timecounter is
registered.
Reported by: whu
Approved by: whu
Fixes: 621fd9dcb2 ("timecounter: Lock the timecounter list")
MFC after: 1 week
Sponsored by: The FreeBSD Foundation
Analogous to the other sized version of kstrto[u]<type>() and
kstrtobool_from_user() add the "u8" versions needed by a driver.
MFC after: 3 days
Reviewed by: hselasky
Differential Revision: https://reviews.freebsd.org/D32598
When EVDEV_SUPPORT was introduced, the USB transfers may be running
after the main FIFO is closed. In connection to this a race may appear
which can lead to use-after-free scenarios. Fix this for all FIFO
consumers by initializing and resetting the FIFO queues under the
lock used by the client. Then the client driver will see an empty
queue in all cases a race may appear.
Found by: pho@
MFC after: 1 week
Sponsored by: NVIDIA Networking
unionfs uses a per-directory hashtable to cache subdirectory nodes.
Currently this hashtable is looked up using the directory name, but
since unionfs nodes aren't removed from the cache until they're
reclaimed, this poses some problems. For example, if a directory is
created on a unionfs mount shortly after deleting a previous directory
with the same path, the cache may end up reusing the node for the
previous directory, including its upper/lower FS vnodes. Operations
against those vnodes with then likely fail because the vnodes
represent deleted files; for example UFS will reject VOP_MKDIR()
against such a vnode because its effective link count is 0. This may
then manifest as e.g. mkdir(2) or open(2) returning ENOENT for an
attempt to create a file under the re-created directory.
While it would be possible to fix this by explicitly managing the
name-based cache during delete or rename operations, or by rejecting
cache hits if the underlying FS vnodes don't match those passed to
unionfs_nodeget(), it seems cleaner to instead hash the unionfs nodes
based on their underlying FS vnodes. Since unionfs prefers to operate
against the upper vnode if one is present, the lower vnode will only
be used for hashing as long as the upper vnode is NULL. This should
also make hashing faster by eliminating string traversal and using
the already-computed hash index stored in each vnode.
While here, fix a couple of other cache-related issues:
--Remove 8 bytes of unnecessary baggage from each unionfs node by
getting rid of the stored hash mask field. The mask is knowable
at compile time.
--When a matching node is found in the cache, reference its vnode
using vrefl() while still holding the vnode interlock. Previously
unionfs_nodeget() would vref() the vnode after the interlock was
dropped, but the vnode may be reclaimed during that window. This
caused intermittent panics from vn_lock(9) during unionfs stress
testing.
Reviewed by: kib, markj
Tested by: pho
Differential Revision: https://reviews.freebsd.org/D32533
An ordered series of BIO_READ and BIO_WRITE operations are
typically done as:
while (work to do) {
setup bp for I/O
g_io_request(bp, consumer);
biowait(bp);
}
Here you need to have biodone() called at the completion of
the I/O to set the BIO_DONE flag and awaken the biowait(). The
obvious way to do this would be to set bio_done = biodone, but
biodone() will only take the desired action if bio_done == NULL.
The relevant code at the end of biodone() is:
done = bp->bio_done;
if (done == NULL) {
mtxp = mtx_pool_find(mtxpool_sleep, bp);
mtx_lock(mtxp);
bp->bio_flags |= BIO_DONE;
wakeup(bp);
mtx_unlock(mtxp);
} else
done(bp);
This code would infinitely recurse if biodone() is specified as the
routine to use at completion. So before this change, a wrapper done
function had to be written:
static void
g_io_done(struct bio *bp)
{
bp->bio_done = NULL;
biodone(bp);
bp->bio_done = g_io_done;
}
This commit changes
if (done == NULL)
to
if (done == NULL || done == biodone)
which eliminates the need for the wrapper function.
Reviewed by: kib
Sponsored by: Netflix
This fixes panic when trying to run strace(8) from Focal.
Reviewed By: kib
Sponsored By: EPSRC
Differential Revision: https://reviews.freebsd.org/D32355
The Linux way for sendfile(2) to tell the application
to fallback to another way of copying data is by EINVAL,
not ENOTSOCK. This fixes package installation scripts
for Mono packages from Focal.
Sponsored By: EPSRC
Differential Revision: https://reviews.freebsd.org/D32604
It seems that clang IAS erronously adds repz prefix which should not be
there. Cpu would try to store around %ecx bytes of random, while we
only expect a word.
PR: 259218
Reported and tested by: Dennis Clarke <dclarke@blastwave.org>
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
The modification to the hash are already naturally locked by
in_control_sx. Convert the hash lists to CK lists. Remove the
in_ifaddr_rmlock. Assert the network epoch where necessary.
Most cases when the hash lookup is done the epoch is already entered.
Cover a few cases, that need entering the epoch, which mostly is
initial configuration of tunnel interfaces and multicast addresses.
Reviewed by: melifaro
Differential revision: https://reviews.freebsd.org/D32584
This reverts commit 59eab1093a.
The change suppressed EFAULT originating from uiomove(). The deadlock
avoidance mechanism implemented by vn_io_fault1() in the VFS handles
such errors by wiring the user pages and retrying, but this change
caused read() to return early instead. This can result in short I/O,
causing misbehaviour in some applications, and possibly other
consequences.
Until this is resolved somehow, revert the commit.
Approved by: mm
The last two drivers that required sppp are cp(4) and ce(4).
These devices are still produced and can be purchased
at Cronyx <http://cronyx.ru/hardware/wan.html>.
Since Roman Kurakin <rik@FreeBSD.org> has quit them, they no
longer support FreeBSD officially. Later they have dropped
support for Linux drivers to. As of mid-2020 they don't even
have a developer to maintain their Windows driver. However,
their support verbally told me that they could provide aid to
a FreeBSD developer with documentaion in case if there appears
a new customer for their devices.
These drivers have a feature to not use sppp(4) and create an
interface, but instead expose the device as netgraph(4) node.
Then, you can attach ng_ppp(4) with help of ports/net/mpd5 on
top of the node and get your synchronous PPP. Alternatively
you can attach ng_frame_relay(4) or ng_cisco(4) for HDLC.
Actually, last time I used cp(4) back in 2004, using netgraph(4)
instead of sppp(4) was already the right way to do.
Thus, remove the sppp(4) related part of the drivers and enable
by default the negraph(4) part. Further maintenance of these
drivers in the tree shouldn't be a big deal.
While doing that, remove some cruft and enable cp(4) compilation
on amd64. The ce(4) for some unknown reason marks its internal
DDK functions with __attribute__ fastcall, which most likely is
safe to remove, but without hardware I'm not going to do that, so
ce(4) remains i386-only.
Reviewed by: emaste, imp, donner
Differential Revision: https://reviews.freebsd.org/D32590
See also: https://reviews.freebsd.org/D23928
vm_reserv_reclaim_*() will release pages to the default freepool, not
the direct freepool from which noobj allocations are drawn. But if both
pools are empty, the noobj allocator variants must break reservations to
make progress.
Reported by: cy
Reviewed by: kib (previous version)
Fixes: b498f71bc5 ("vm_page: Add a new page allocator interface for unnamed pages")
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D32592
TCP Hystart draft version -03:
https://datatracker.ietf.org/doc/html/draft-ietf-tcpm-hystartplusplus
Is a new version of hystart that allows one to carefully exit slow start if the RTT
spikes too much. The newer version has a slower-slow-start so to speak that then
kicks in for five round trips. To see if you exited too early, if not into congestion avoidance.
This commit will add that feature to our newreno CC and add the needed bits in rack to
be able to enable it.
Reviewed by: tuexen
Sponsored by: Netflix Inc.
Differential Revision: https://reviews.freebsd.org/D32373
Fix clearing of bits in RCTL for the non-bpf/non-allmulti case.
Update RCTL after modifying the multicast filter registers as per
the Linux driver.
This fixes LACP on igc interfaces, where incoming LACP multicasti
control packets were being dropped.
Reviewed by: kbowling
Obtained from: Rubicon Communications, LLC ("Netgate")
MFC after: 3 days
Differential Revision: https://reviews.freebsd.org/D32574
Correct input_sta "assertion" checks. CTS/ACK CTRL frames are shorter
then sizeof(struct ieee80211_frame_min) and were thus running into the
is_rx_tooshort error case.
Use ieee80211_anyhdrsize() to handle this better but make sure we do
at least have the first 2 octets needed for that.
While here move the safety checks before any code which may not obey
them later, just for good style.
The non-scanning check further down assumes a frame format also not
matching control frames. For now skip the checks for control frames
which allows us to deal with some of them at least now.
Sponsored by: The FreeBSD Foundation
Obtained from: 20210906 wireless v0.91 code drop
MFC after: 3 days
Reviewed by: adrian
Differential Revision: https://reviews.freebsd.org/D32238
While IEEE80211_R_BAND was defined, there was no place to store the
band. Add a field for that, adjust ieee80211_lookup_channel_rxstatus()
to require it, and update drivers passing "R_{FREQ|IEEE}" in already to
provide the band as well. For the moment keep the fall-back code
requiring all three fields.
Sponsored by: The FreeBSD Foundation
MFC after: 3 days
Reviewed by: adrian
Differential Revision: https://reviews.freebsd.org/D30662
When we route-to a packet that later turns out to not fit in the
outbound interface MTU we generate an ICMP error.
However, if we've already changed those (i.e. we've passed through a NAT
rule) we have to undo the transformation first.
Obtained from: pfSense
MFC after: 3 weeks
Sponsored by: Rubicon Communications, LLC ("Netgate")
Differential Revision: https://reviews.freebsd.org/D32571
In shm_largepage_phys_populate(), the result from vm_page_grab() is only
needed for assertion.
In shm_dotruncate_largepage(), there is a commented-out prototype code
for managed largepages. The oldobjsz is saved for its sake, so mark
the variable as __unused directly.
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
The function ignores result returned by linker_release_module().
The FW_UNLOAD flag on the file is cleared, so even on error it would
not be tried again.
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
For some of them, used only when KTR or KMSAN are configured, apply
__unused attribute directly.
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
Mark variables as __diagused for invariant-only vars
Reviewed by: imp, mjg
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
Differential revision: https://reviews.freebsd.org/D32577
A future change to TOE TLS will require a software fallback for the
first few TLS records received. Future support for NIC TLS on receive
will also require a software fallback for certain cases.
Reviewed by: gallatin, hselasky
Sponsored by: Chelsio Communications
Differential Revision: https://reviews.freebsd.org/D32566
As a followup to SW KTLS assuming an OCF backend, rename
struct ocf_session to struct ktls_ocf_session and forward
declare it in <sys/ktls.h> to use as the type of
struct ktls_session.cipher.
Reviewed by: gallatin, hselasky
Sponsored by: Chelsio Communications
Differential Revision: https://reviews.freebsd.org/D32565
In particular, ktls_pending_rx_info() determines which TLS record is
at the end of the current receive socket buffer (including
not-yet-decrypted data) along with how much data in that TLS record is
not yet present in the socket buffer.
This is useful for future changes to support NIC TLS receive offload
and enhancements to TOE TLS receive offload. Those use cases need a
way to synchronize a state machine on the NIC with the TLS record
boundaries in the TCP stream.
Reviewed by: gallatin, hselasky
Sponsored by: Chelsio Communications
Differential Revision: https://reviews.freebsd.org/D32564
Stack gap code used on amd64 can also be reused for arm64. Point
sv_stackgap to elf64_stackgap to enable this feature.
Reviewed by: mw, kib, emaste
Tested by: mw
MFC: after 1 month
Differential Revision: https://reviews.freebsd.org/D32588
Notable upstream pull request merges:
#12392 Avoid panic in case of pool errors and missing L2ARC
#12448 skip snapshot in zfs_iter_mounted()
#12516 Fix NFS and large reads on older kernels
#12533 Fail invalid incremental recursive send gracefully
#12569 FreeBSD: Really zero the zero page
#12575 Reject zfs send -RI with nonexistent fromsnap
#12602 Correct refcount_add in dmu_zfetch
#12650 zpool should call zfs_nicestrtonum() with non-NULL handle
Obtained from: OpenZFS
OpenZFS commit: ec64fdb93d
Devices in sys/dev should be architecture-independent and NOT #include
intr_machdep.h.
Reviewed by: mhorne royger
Differential Revision: https://reviews.freebsd.org/D29959