The Realtek driver assumes an early chandef to be set. At the time
of linuxkpi_ieee80211_ifattach() we do not really know one yet so
try to find the first one which is available and set that.
This prevents a NULL-deref panic.
MFC after: 3 days
Turned out all the workq's taskqueues were named "wlanNA" if you had
more then one card in a machine as by the time we called wiphy_name()
the device name was not set yet and we returned the fallback.
Move the alloc_ordered_workqueue() from linuxkpi_ieee80211_alloc_hw()
to linuxkpi_ieee80211_ifattach() at which time the device name has
to be set to give us a unique name.
Sponsored by: The FreeBSD Foundation
MFC after: 3 days
Realtek's rtw88 is returning a hard-coded 1 in case they cannot
hw_scan (fw not advertising it). In that case if we want any scan
to run we need to fall-back to sw scan. Start dealing with this.
Long-term we probably need to keep internal state.
MFC after: 3 days
Currently armv8crypto copies the scheme used in aesni(9), where payload
data and output buffers are allocated on the fly if the crypto buffer is
not virtually contiguous. This scheme is simple but incurs a lot of
overhead: for an encryption request with a separate output buffer we
have to
- allocate a temporary buffer to hold the payload
- copy input data into the buffer
- copy the encrypted payload to the output buffer
- zero the temporary buffer before freeing it
We have a handy crypto buffer cursor abstraction now, so reimplement the
armv8crypto routines using that instead of temporary buffers. This
introduces some extra complexity, but gallatin@ reports a 10% throughput
improvement with a KTLS workload without additional CPU usage. The
driver still allocates an AAD buffer for AES-GCM if necessary.
Reviewed by: jhb
Tested by: gallatin
Sponsored by: Ampere Computing LLC
Submitted by: Klara Inc.
MFC after: 1 month
Differential Revision: https://reviews.freebsd.org/D28950
This is in preparation for using buffer cursors. No functional change
intended.
Reviewed by: jhb
Sponsored by: Ampere Computing LLC
Submitted by: Klara Inc.
MFC after: 1 month
Differential Revision: https://reviews.freebsd.org/D28948
This was useful in converting armv8crypto to use buffer cursors. There
are some cases where one wants to make two passes over data, and this
provides a way to "reset" a cursor.
Reviewed by: jhb
MFC after: 1 month
Differential Revision: https://reviews.freebsd.org/D28949
Various updates to skbuff for new/updated drivers and some housekeeping:
- update types and struct members, add new (stub) functions
- improve freeing of frags.
- fix an issue with sleeping during alloc for dev_alloc_skb().
- Adjust a KASSERT for skb_reserve() which apparently can be called
multiple times if no data was put into the skb yet.
- move the sysctl from linux_8022.c (which may be in a different module)
to linux_skbuff.c so in case we turn debugging on we do not run into
unresolved symbols. Rename the sysctl variable to be less conflicting
and update debugging macros along with that; also add IMPROVE().
- add DDB support to show an skbuff.
- adjust comments/whitespace.
No functional changes intended for iwlwifi.
Sponsored by: The FreeBSD Foundation (partially)
MFC after: 3 days
This update is for more/newer versions of drivers:
- add and properly place more structs, enums, defines needed by drivers.
- correct types of struct fields.
- make various function arguments const.
- move REG_RULE() macro to its own file regulatory.h and
use macros for calculations.
- add linuxkpi_ieee80211_get_channel() implementation.
- change linuxkpi_ieee80211_ifattach() to return int for error checking.
No intended functional changes for iwlwifi.
Sponsored by: The FreeBSD Foundation (partially)
MFC after: 3 days
Add lockdep_assert_not_held() asserting LA_UNLOCKED as needed by a
driver.
MFC after: 3 days
Reviewed by: hselasky
Differential Revision: https://reviews.freebsd.org/D34232
callout *_sbt functions are used to reduce ping/timeout scheduling
overhead, while allowing later improvments in the functionality.
Keep similar 1000ms callouts while adding a 10 ms window, to allow
some kernel scheduling improvements.
Reviewed By: jhb
Sponsored by: NetApp, Inc.
Differential Revision: https://reviews.freebsd.org/D34222
These zones are cache zones used to allocate TLS offload contexts from
firmware. Releasing items from the cache is a sleepable operation due
to the need to await a response from the firmware command freeing the
tag, so items cannot be reclaimed from the zone in non-sleepable
contexts. Since the cache size is limited by firmware limits, avoid
this by setting UMA_ZONE_UNMANAGED to avoid reclamation by uma_timeout()
and the low memory handler.
Reviewed by: hselasky, kib
MFC after: 3 weeks
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D34142
Allow a zone to opt out of cache size management. In particular,
uma_reclaim() and uma_reclaim_domain() will not reclaim any memory from
the zone, nor will uma_timeout() purge cached items if the zone is idle.
This effectively means that the zone consumer has control over when
items are reclaimed from the cache. In particular, uma_zone_reclaim()
will still reclaim cached items from an unmanaged zone.
Reviewed by: hselasky, kib
MFC after: 3 weeks
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D34142
The padding in CURRENT shall not shrink. It is
designed that in CURRENT at always stays
the same, and then when a new stable is branched, it
inherits 6 pointer placeholders that can be used
withing this stable/X lifetime to extend the structure.
Reviewed By: tuexen, #transport
Sponsored by: NetApp, Inc.
Differential Revision: https://reviews.freebsd.org/D34269
Allow multiple cores to be used to process if_epair traffic. We do this
(if RSS is enabled) based on the RSS hash of the incoming packet. This
allows us to distribute the load over multiple cores, rather than
sending everything to the same one.
We also switch from swi_sched() to taskqueues, which also contributes to
better throughput.
Benchmark results:
With net.isr.maxthreads=-1
Setup A: (cc0 - bridge0 - epair0a) (epair0b - bridge1 - cc1)
Before 627 Kpps
After (no RSS) 1.198 Mpps
After (RSS) 3.148 Mpps
Setup B: (cc0 - bridge0 - epaira0) (epair0b - vnet jail - epair1a) (epair1b - bridge1 - cc1)
Before 7.705 Kpps
After (no RSS) 1.017 Mpps
After (RSS) 2.083 Mpps
MFC after: 3 weeks
Sponsored by: Orange Business Services
Differential Revision: https://reviews.freebsd.org/D33731
The proper way to drop this kind of CQE is advancing rxq tail
without indicating the packet to the upper network layer.
MFC after: 2 weeks
Sponsored by: Microsoft
Add a marker after GUID_INIT() and linux/pm_qos.h were added, so
that future version of drm-kmod can selectively remove these bits.
The latest port version does not require user updates for this so
no UPDATING entry.
Add a linux/pm_qos.h with three dummy functions and a struct as needed
by a driver and drm-kmod [1] with no intend to support this for the moment.
Submitted by: wulf (drm-kmod bits) [1]
Sponsored by: The FreeBSD Foundation (drm-kmod requested updates)
MFC after: 3 days
Reviewed by: hselasky (earlier version), wulf
Differential Revision: https://reviews.freebsd.org/D34234
Add a definition for UUID_STRING_LEN to uuid.h as needed by a driver.
Also add GUID_INIT for drm-kmod [1].
Submitted by: wulf [1]
MFC after: 3 days
Reviewed by: hselasky (earlier), wulf
Differential Revision: https://reviews.freebsd.org/D34235
Users are seeing warnings about 2 channels (1 per band)
triggered by an ioctl from wpa_supplicant usually:
lkpi_ic_getradiocaps: Adding chan ... returned error 55
This was an early FAQ.
Check the current number of channels against maxchans and the return
code from net80211. In case net80211 reports that we reached the limit
do not print the warning and do not try to add further channels.
Sponsored by: The FreeBSD Foundation
MFC after: 3 days
Add maxchans to the disabled debugging in addchan() and copychan_prev()
to aid debugging possible errors rreturned due to reaching maxchans
limits.
Sponsored by: The FreeBSD Foundation
MFC after: 3 days
GCC is more pedantic than clang about warning when a function doesn't
handle undefined enum values (see GCC bug 87950). Clang's warning
gives a more pragmatic coverage and should find any real bugs, so
disable the warning for GCC rather than adding __unreachable
annotations to appease GCC.
Reviewed by: imp, emaste
Differential Revision: https://reviews.freebsd.org/D34147
This is behavior what some programs expect and what Linux does. For
example nginx expects EAGAIN when sending messages to /var/run/log,
which it connects to with O_NONBLOCK. Particularly with nginx the
problem is magnified by the fact that a ENOBUFS on send(2) is also
logged, so situation creates a log-bomb - a failed log message
triggers another log message.
Reviewed by: markj
Differential revision: https://reviews.freebsd.org/D34187
HugeSectors * BytesPerSec should be computed before converting
HugeSectors to a DEV_BSIZE-based count.
Fixes: ba2c98389b ("msdosfs: sanity check sector count from BPB")
Reviewed by: kib
MFC after: 1 week
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D34264
After commit 74cf7cae4d ("softclock: Use dedicated ithreads for
running callouts."), there is a lock order reversal between the per-CPU
callout lock and the scheduler lock. softclock_thread() locks callout
lock then the scheduler lock, when preparing to switch off-CPU, and
sleepq_remove_thread() stops the timed sleep callout while potentially
holding a scheduler lock. In the latter case, it's the thread itself
that's locked, and if the thread is sleeping then its lock will be a
sleepqueue lock, but if it's still in the process of going to sleep
it'll be a scheduler lock.
We could perhaps change softclock_thread() to try to acquire locks in
the opposite order, but that'd require dropping and re-acquiring the
callout lock, which seems expensive for an operation that will happen
quite frequently. We can instead perhaps avoid stopping the
td_slpcallout callout if the thread is still going to sleep, which is
what this patch does. This will result in a spurious call to
sleepq_timeout(), but some counters suggest that this is very rare.
PR: 261198
Fixes: 74cf7cae4d ("softclock: Use dedicated ithreads for running callouts.")
Reported and tested by: thj
Reviewed by: kib
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D34204
Add a beginning of a tracepoint.h implementation to ease porting drivers
making use of this Linux facility.
MFC after: 3 days
Reviewed by: hselasky
Differential Revision: https://reviews.freebsd.org/D34236
Add NETIF_F_HW_CSUM to netdev_features.h as needed by a driver.
MFC after: 3 days
Reviewed by: hselasky
Differential Revision: https://reviews.freebsd.org/D34233
Add an implementation of kstrtoint_from_user() based on the other
implementations and an attempt at DECLARE_FLEX_ARRAY() which works
for the driver needing it.
MFC after: 3 days
Reviewed by: hselasky
Differential Revision: https://reviews.freebsd.org/D34231
Add an initial ethtool.h for a define and a dummy struct for now
needed by drivers.
MFC after: 3 days
Reviewed by: hselasky
Differential Revision: https://reviews.freebsd.org/D34229
Add eth_random_addr() and a dummy of device_get_mac_address()
pending OF (FDT) support needed by drivers.
While here remove a white space in random_ether_addr().
MFC after: 3 days
Reviewed by: hselasky
Differential Revision: https://reviews.freebsd.org/D34228
Add ENOMEDIUM, ENOSR, and ELNRNG to linux/errno.h needed by drivers.
MFC after: 3 days
Reviewed by: hselasky
Differential Revision: https://reviews.freebsd.org/D34227
Add sizeof_field() to linux/compiler.h needed by a driver.
MFC after: 3 days
Reviewed by: hselasky
Differential Revision: https://reviews.freebsd.org/D34226
Add get_unaligned_le16() to asm/unaligned.h needed by a driver.
MFC after: 3 days
Reviewed by: hselasky
Differential Revision: https://reviews.freebsd.org/D34224
Improve the "syncache: mbuf too small" assertion message with various
variables (some not actually needed) but enough that it will be obvious
if (a) we use IPv4 or IPv6, (b) if UDP tunneling is on, (c) what
max_linkhdr is, and (d) what MHLEN is.
This should help diagnostics in the future.
The case was hit with wireless drivers setting a large ic_headroom
and using IPv6.
Reviewed by: gallatin, tuexen, rscheff
MFC after: 3 days
Differential Revision: https://reviews.freebsd.org/D34217
On architectures with strict alignment requirements (e.g. arm), clang 14
warns about a packed struct which encloses a non-packed union:
In file included from sys/dev/bwi/bwimac.c:79:
sys/dev/bwi/if_bwivar.h:308:7: error: field iv_val within 'struct bwi_fw_iv' is less aligned than 'union (unnamed union at sys/dev/bwi/if_bwivar.h:305:2)' and is usually due to 'struct bwi_fw_iv' being packed, which can lead to unaligned accesses [-Werror,-Wunaligned-access]
} iv_val;
^
It appears to help if you also add __packed to the inner union (i.e.
iv_val). No change to the layout is intended.
MFC after: 3 days
Differential Revision: https://reviews.freebsd.org/D34196
cxgbe_refresh_stats takes into account VI_SKIP_STATS but not
VI_INIT_DONE when deciding whether to read the hardware stats. But
before this change VI_SKIP_STATS was set only for VIs with VI_INIT_DONE.
That meant that cxgbe_refresh_stats always accessed the hardware for
uninitialized VIs, and this is a problem if the adapter is suspended or
in the middle of a reset.
Fix this by setting VI_SKIP_STATS on all VIs during suspend. While
here, ignore VI_INIT_DONE in vi_refresh_stats too to be consistent with
cxgbe_refresh_stats.
MFC after: 1 week
Sponsored by: Chelsio Communications
The TCP rate pacing code relies on being able to read this pointer
safely while holding an INP lock. The initial TLS session pointer is
set while holding the write lock already.
Reviewed by: gallatin, hselasky
Sponsored by: Netflix
Differential Revision: https://reviews.freebsd.org/D34086
It is still wrong-ish as fget* funcs don't expect to operate on abitrary
file descriptor tables, but this at least moves it out of the way of an
upcoming change while being bug-compatible.
Add a static assert for the siginfo_t, mcontext_t and ucontext_t
sizes. These are de-factor ABI options and cannot change size ever.
Differential Revision: https://reviews.freebsd.org/D34214
Add a static assert for the siginfo_t, mcontext_t and ucontext_t
sizes. These are de-facto ABI options and cannot change size ever. For
powerpc64, also add asserts for {u,m}mcontext32_t and siginfo32.
Reviewed by: andrew
Differential Revision: https://reviews.freebsd.org/D34213
Add a static assert for the siginfo_t, mcontext_t and ucontext_t
sizes. These are de-facto ABI options and cannot change size ever.
Reviewed by: kib
Differential Revision: https://reviews.freebsd.org/D34212
Add a static assert for the siginfo_t, mcontext_t and ucontext_t
sizes. These are de-facto ABI options and cannot change size ever.
Reviewed by: kib
Differential Revision: https://reviews.freebsd.org/D34211
Add a static assert for the siginfo_t, mcontext_t and ucontext_t
sizes. These are de-facto ABI options and cannot change size ever.
Reviewed by: andrew
Differential Revision: https://reviews.freebsd.org/D34210
Add a static assert for the siginfo{,32}_t, mcontext{,32}_t and
ucontext{,32}_t sizes. These are de-facto ABI options and cannot change
size ever.
Reviewed by: kib, andrew, jhb
Differential Revision: https://reviews.freebsd.org/D32958
The layout of the structure ends up depending on whether the including
file includes opt_inet.h and opt_inet6.h, so different compilation units
can end up seeing different versions of the structure. Fix this by
unconditionally defining the address fields.
As a side effect, this eliminates some duplication in the kernel's CTF
type graph.
Reviewed by: rscheff, tuexen
MFC after: 1 week
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D34242
Reserve couters in the tcps struct in preparation
for AccECN, extend the debugging output for TF2
flags, optimize the syncache flags from individual
bits to a codepoint for the specifc ECN handshake.
This is in preparation of AccECN.
No functional chance except for extended debug
output capabilities.
Reviewed By: #transport, rrs
Sponsored by: NetApp, Inc.
Differential Revision: https://reviews.freebsd.org/D34161
This reverts commit 3de96d664a.
Problem is that it is possible to reach the state with ref_count ==
1 for the mapped non-anonymous object. For instance, anonymous posix
shmfd or linux shmfs object could be mapped, and then corresponding
file descriptor closed, dropping the object reference owned by the
shmfd/shmfs file. Then the check in inactive scan assumes that the
object and page are not mapped and frees the page, while they are not.
PR: 261707
Discussed with: markj
Sponsored by: The FreeBSD Foundation
MFC after: now
Backport from Linux 5.17 (drivers/infiniband/hw/mlx5/fs.c)
This fixes creating flow rules from user-space after the
kernel space update based on Linux 5.7-rc1 .
Sponsored by: NVIDIA Networking
This was missed in 74d6c131cb where other geom modules were annotated
with MODULE_VERSION. Again, the problem is the same: we can't detect
that geom_md is loaded into the kernel without it.
This was noticed in release builds on the cluster; mdconfig attempts to
load geom_md because it can't detect it in the kernel, but the cluster
config includes md(4) and does not build the kmod. This problem would
have been masked on hosts with the kmod built, as the kmod attempts to
register the g_md module and fails. With this commit, mdconfig would
not even try to load it again.
Reported by: re (cperciva)
MFC after: 3 days
The ESXi NFSv4.1 client bogusly sends the wrong value
for the csa_sequence argument for a Create_session operation.
RFC8881 requires this value to be the same as the sequence
reply from the ExchangeID operation most recently done for
the client ID.
Without this patch, the server replies NFSERR_STALECLIENTID,
which is the correct response for an NFSv4.0 SetClientIDConfirm
but is not the correct error for NFSv4.1/4.2, which is
specified as NFSERR_SEQMISORDERED in RFC8881.
This patch fixes this.
This change does not fix the issue reported in the PR, where
the ESXi client loops, attempting ExchangeID/Create_session
repeatedly.
Reported by: asomers
Tested by: asomers
PR: 261291
MFC after: 1 week
The NVMe 1.4 spec simply says that Model and Serial numbers are
ASCII strings. Unlike SCSI, it doesn't prohibit non-printable
characters or say that the strings should be padded with spaces.
Since 2014, we have had cam_strvis_sbuf(), which gives additional
options for handling non-ASCII characters. That behavior hasn't
been available for non-sbuf consumers, so users of cam_strvis()
were left with having octal ASCII codes inserted.
So, to avoid having garbage or octal chracters in the strings, use
cam_strvis_sbuf() to create a new function, cam_strvis_flag(), and
re-implement cam_strvis() using cam_strvis_flag().
Now, for the NVMe drives, we can use cam_strvis_flag with the
CAM_STRVIS_FLAG_NONASCII_SPC flag. This transforms non-printable
characters into spaces.
sys/cam/cam.c:
Add a new function, cam_strvis_flag(), that creates an sbuf
on the stack with the user's destination buffer, and calls
cam_strvis_sbuf() with the given flag argument.
Re-implement cam_strvis() to call cam_strvis_flag with the
CAM_STRVIS_FLAG_NONASCII_ESC argument. This should be the
equivalent of the old cam_strvis() function, except for the
overhead of creating the sbuf and calling sbuf_putc/printf.
sys/cam/cam.h:
Declaration for cam_strvis_flag.
sys/cam/nvme/nvme_all.c:
In nvme_print_ident, use the NONASCII_SPC flag with
cam_strvis_flag().
sys/cam/nvme/nvme_da.c:
In ndaregister(), use cam_strvis_flag() with the
NONASCII_SPC flag for the disk description and serial
number we report to GEOM.
sys/cam/nvme/nvme_xpt.c:
In nvme_probe_done(), use cam_strvis_flag with the
NONASCII_SPC flag when storing the drive serial number
in the CAM EDT.
MFC after: 1 week
Sponsored by: Spectra Logic
Differential Revision: https://reviews.freebsd.org/D33973
During a recent deep dive into all the variables so I could
discover why stack switching caused larger retransmits I examined
every variable in rack. In the process I found quite a few bits
that were not used and needed cleanup. This update pulls
out all the unused pieces from rack. Note there are *no* functional
changes here, just the removal of unused variables and a bit of
spacing clean up.
Reviewed by: Michael Tuexen, Richard Scheffenegger
Sponsored by: Netflix Inc.
Differential Revision: https://reviews.freebsd.org/D34205
disconnection function.
Disconnecting hooks are called outside of NET_EPOCH, but
ng_pppoe_disconnect() calls NG_SEND_DATA_ONLY() which should be called
in NET_EPOCH.
PR: 257067
Reported by: niels=freebsd@bakker.net
Reviewed by: vmaffione (mentor), glebius, donner
Approved by: vmaffione (mentor), glebius, donner
Sponsored by: vstack.com
Differential Revision: https://reviews.freebsd.org/D34185
While there, remove a duplicate inclusion of sysctl.h.
Reported by: Gary Jennejohn
Fixes: a35bdd4489 - main - tcp: add sysctl interface for setting socket options
Sponsored by: Netflix, Inc.
This interface allows to set a socket option on a TCP endpoint,
which is specified by its inp_gencnt. This interface will be
used in an upcoming command line tool tcpsso.
Reviewed by: glebius, rrs
Sponsored by: Netflix, Inc.
Differential Revision: https://reviews.freebsd.org/D34138
In the (extremely unlikely) case of vd->vd_height ==
vt_logo_sprite_height the vd_drawrect code would write outside of
frame-buffer memory.
MFC after: 1 week
Reviewed by: cem
Differential Revision: https://reviews.freebsd.org/D34220
tcp_ctloutput_set() will be used via the sysctl interface in a
upcoming command line tool tcpsso.
Reviewed by: glebius, rscheff
Sponsored by: Netflix, Inc.
Differential Revision: https://reviews.freebsd.org/D34164
Parts of zstd, used in openzfs and other places, trigger a new clang 14
-Werror warning:
```
sys/contrib/zstd/lib/decompress/huf_decompress.c:889:25: error: use of bitwise '&' with boolean operands [-Werror,-Wbitwise-instead-of-logical]
(BIT_reloadDStreamFast(&bitD1) == BIT_DStream_unfinished)
^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
```
While the warning is benign, it should ideally be fixed upstream and
then vendor-imported, but for now silence it selectively.
MFC after: 3 days
Since TD_IS_RUNNING() and TS_ON_RUNQ() are defined as logical
expressions involving '==', clang 14 warns about them being checked with
a bitwise operator instead of a logical one:
```
sys/kern/tty_info.c:124:9: error: use of bitwise '|' with boolean operands [-Werror,-Wbitwise-instead-of-logical]
runa = TD_IS_RUNNING(td) | TD_ON_RUNQ(td);
^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
||
sys/sys/proc.h:562:27: note: expanded from macro 'TD_IS_RUNNING'
^
sys/kern/tty_info.c:124:9: note: cast one or both operands to int to silence this warning
sys/sys/proc.h:562:27: note: expanded from macro 'TD_IS_RUNNING'
^
sys/kern/tty_info.c:129:9: error: use of bitwise '|' with boolean operands [-Werror,-Wbitwise-instead-of-logical]
runb = TD_IS_RUNNING(td2) | TD_ON_RUNQ(td2);
^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
||
sys/sys/proc.h:562:27: note: expanded from macro 'TD_IS_RUNNING'
^
sys/kern/tty_info.c:129:9: note: cast one or both operands to int to silence this warning
sys/sys/proc.h:562:27: note: expanded from macro 'TD_IS_RUNNING'
^
```
Fix this by using logical operators instead. No functional change
intended.
Reviewed by: cem, emaste, kevans, markj
MFC after: 3 days
Differential Revision: https://reviews.freebsd.org/D34186
All pmaps share the top half of the address space. With 3-level page
tables, the top-level kernel map entries are not static: they might
change if the kernel map is extended (via pmap_growkernel()) or a 1GB
mapping in the direct map is demoted (not implemented yet). Thus the
riscv pmap maintains the allpmaps list to synchronize updates to
top-level entries.
When a pmap is created, it is inserted into this list after copying
top-level entries from the kernel pmap. The copying is done without
holding the allpmaps lock, and it is possible for pmap_pinit() to race
with kernel map updates. In particular, if a thread is modifying L1
entries, and a concurrent pmap_pinit() copies the old version of the
entries, it might not receive the update.
Fix the problem by copying the kernel map entries after inserting the
pmap into the list. This ensures that the nascent pmap always receives
updates, though pmap_distribute_l1() may race with the page copy.
Reviewed by: mhorne, jhb
MFC after: 2 weeks
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D34158
There was nothing preventing one from sending an empty fragment on an
arbitrary KTLS TX-enabled socket, but ktls_frame() asserts that this
could not happen. Though the transmit path handles this case for TLS
1.0 with AES-CBC, we should be strict and allow empty fragments only in
modes where it is explicitly allowed.
Modify sosend_generic() to reject writes to a KTLS-enabled socket if the
number of data bytes is zero, so that userspace cannot trigger the
aforementioned assertion.
Add regression tests to exercise this case.
Reported by: syzkaller
Reviewed by: gallatin, jhb
MFC after: 1 week
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D34195
Most fget*() functions initialize the output parameter to NULL. Make
the externally visible interface behave consistently, and make
fget_unlocked_seq() private to kern_descrip.c.
This fixes at least one bug in a consumer, _filemon_wrapper_openat(),
which assumes that getvnode() sets the output file pointer to NULL upon
an error.
Reported by: syzbot+01c0459408f896a5933a@syzkaller.appspotmail.com
Reviewed by: kib
MFC after: 1 week
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D34190
Having a single pool of worker threads adds extra complexity and
overhead. The software backend also uses per-connection kthreads.
Sponsored by: Chelsio Communications
Previously the driver was called to send PDUs to the NIC synchronously
from the icl_conn_pdu_queue_cb callback. However, this performed a
fair bit of work while holding the icl connection lock. Instead,
change the callback to add sent PDUs to a STAILQ and defer dispatching
of PDUs to the NIC to a helper thread similar to the scheme used in
the TCP iSCSI backend.
- Replace rx_flags int and the sole RXF_ACTIVE flag with a simple
rx_active bool.
- Add a pool of transmit worker threads for cxgbei.
- Fix worker thread exit to depend on the wakeup in kthread_exit()
to fix a race with module unload.
Reported by: mav
Sponsored by: Chelsio Communications
- Add a starting index to 'struct vmstats' and change the
VM_STATS ioctl to fetch the 64 stats starting at that index.
A compat shim for <= 13 continues to fetch only the first 64
stats.
- Extend vm_get_stats() in libvmmapi to use a loop and a static
thread local buffer which grows to hold the stats needed.
Reviewed by: markj
Differential Revision: https://reviews.freebsd.org/D27463
The ref_count counter is global (i.e. not per-vnet) so we can't use a
per-vnet lock to protect it. Moreover, in callouts curvnet is not set,
so we'd end up panicing when trying to use DN_BH_WLOCK().
Instead we use the global sched_lock, which is already used when
evaluating ref_count (in unload_dn_aqm()).
Sponsored by: Rubicon Communications, LLC ("Netgate")
Differential Revision: https://reviews.freebsd.org/D34059
The ntp_init() function did set a couple of global objects to zero. These
objects are in the .bss section and already initialized to zero during kernel
or module loading.
The dummy PDU needs to be freed before marking task abortion complete
as otherwise cfiscsi_session_terminate_tasks can return and destroy
the session in another thread before the PDU is freed.
Fixes: 2e8d1a5525 iscsi: Allocate a dummy PDU for the internal nexus reset task.
Sponsored by: Chelsio Communications
Differential Revision: https://reviews.freebsd.org/D34176
clang doesn't implement it, and Linux doesn't enforce it. As a
result, new instances keep cropping up both in FreeBSD's code and in
upstream sources from vendors.
Reviewed by: emaste
Differential Revision: https://reviews.freebsd.org/D34144
This LOR happens when reading from a file backed MD device:
lock order reversal:
1st 0xfffffe00431eaac0 pbufwait (pbufwait, lockmgr) @ /cobra/src/sys/vm/vm_pager.c:471
2nd 0xfffff80003f17930 ufs (ufs, lockmgr) @ /cobra/src/sys/dev/md/md.c:977
lock order pbufwait -> ufs attempted at:
#0 0xffffffff80c78ead at witness_checkorder+0xbdd
#1 0xffffffff80bd6a52 at lockmgr_lock_flags+0x182
#2 0xffffffff80f52d5c at ffs_lock+0x6c
#3 0xffffffff80d0f3f4 at _vn_lock+0x54
#4 0xffffffff80708629 at mdstart_vnode+0x499
#5 0xffffffff807060ec at md_kthread+0x20c
#6 0xffffffff80bbfcd0 at fork_exit+0x80
#7 0xffffffff810b809e at fork_trampoline+0xe
This LOR was previously blessed by witness before commit 531f8cfea0
("Use dedicated lock name for pbufs").
Instead of blessing ufs and pbufwait, use LK_NOWAIT to prevent recording
the lock order. LK_NOWAIT will be a nop here as the lock is dropped in
pbuf_dtor(). The takes the same approach as 5875b94c74 ("buf_alloc():
lock the buffer with LK_NOWAIT").
Reviewed by: kib
Differential Revision: https://reviews.freebsd.org/D34183
We should clear the single step flag when entering a signal hander and
set it when returning. This fixes the ptrace__PT_STEP_with_signal test.
While here add support for userspace to set the single step bit as on
x86. This can be used by userspace for self tracing.
Reviewed by: kib
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D34170
Missed to move the tcp_set_flags() past ECN processing
in rack_fast_output() earlier.
Reviewed By: rrs, #transport
Sponsored by: NetApp, Inc.
Differential Revision: https://reviews.freebsd.org/D34180
When debugging 32-bit programs a debugger may insert a instruction that
will raise the undefined instruction trap. The kernel handles these
by raising a SIGTRAP, however the code was incorrect.
Fix this by using the expected TRAP_BRKPT signal code.
Sponsored by: The FreeBSD Foundation
As promised to the transport call on 11/4/22 here is an implementation
of hystart++ for cubic. It also cleans up the tcp_congestion function
to have a better name. Common variables are moved into the general
cc.h structure so that both cubic and newreno can use them for
hystart++
Reviewed by: Michael Tuexen, Richard Scheffenegger
Sponsored by: Netflix Inc.
Differential Revision: https://reviews.freebsd.org/D33035
These headers originate with the Xen project and shouldn't be mixed with
the main portion of the FreeBSD kernel. Notably they shouldn't be the
target of clean-up commits.
Switch to use the headers in sys/contrib/xen.
Reviewed by: royger
The current path of the Xen headers at /sys/xen/interface/ is not
correct, as those headers are imported verbatim from the Xen sources
and shouldn't be modified, as any modifications would be lost when a
new version is imported.
Changes to the public headers must be first done in Xen upstream so
that they can be backported and new imports will already carry them.
Import Xen 4.16 headers in sys/contrib/xen/. It's unlikely that we
will import different Xen code, so don't place them inside of any
subdirectory. If in the future other pieces of Xen code need to be
imported the headers will need to move into an include/ subdirectory.
Note that this commit does not yet modify the include path to use the
newly imported headers.
Sponsored by: Citrix Systems R&D
There's no need to explicitly add linear mappings for the grant table
area, as the memory is allocated using xenmem_alloc and it should
already have a linear mapping that can be obtained using
rman_get_virtual.
While there also remove the return value of gnttab_map, since there's
no return value anymore.
Sponsored by: Citrix Systems R&D
Reviewed by: Elliott Mitchell <ehem+freebsd@m5p.com>
Differential revision: https://reviews.freebsd.org/D29602
Since 59f256ec35 dmesg(8) will always skip first line of the message
buffer, cause it might be incomplete. The problem is that in most cases
it is complete, valid and contains the "---<<BOOT>>---" marker. This
skip can be disabled with '-a', but that would also unhide all non-kernel
messages. Move this functionality from dmesg(8) to kernel, since kernel
actually knows if wrap has happened or not.
The main motivation for the change is not actually the value of the
"---<<BOOT>>---" marker. The problem breaks unit tests, that clear
message buffer, perform a test and then check the message buffer for
a result. Example of such test is sys/kern/sonewconn_overflow.
Incorrectly used KMOD_ marco in static kernel ECN functions.
Both eventually resolve to counter_s64_add(), but better
use the correct macros.
Reviewed By: tuexen, #transport
Sponsored by: NetApp, Inc.
Differential Revision: https://reviews.freebsd.org/D34181
Reduce the burden to maintain correct and
extensible ECN related code across multiple
stacks and codepaths.
Formally no functional change.
Incidentially this establishes correct
ECN operation in one instance.
Reviewed By: rrs, #transport
Sponsored by: NetApp, Inc.
Differential Revision: https://reviews.freebsd.org/D34162
When we create a table without counters, add an entry and later
re-define the table to have counters we wound up trying to read
non-existent counters.
We now cope with this by attempting to add them if needed, removing them
when they're no longer needed and not trying to read from counters that
are not present.
MFC after: 2 weeks
Sponsored by: Rubicon Communications, LLC ("Netgate")
Differential Revision: https://reviews.freebsd.org/D34131
Reduce the burden to maintain correct and
extensible ECN related code across multiple
stacks and codepaths.
Formally no functional change.
Incidentially this establishes correct
ECN operation in one instance.
Reviewed By: rrs, #transport
Sponsored by: NetApp, Inc.
Differential Revision: https://reviews.freebsd.org/D34162
sbcut() returns mbufs in reverse order so is not suitable for reading
data from the socket buffer. Instead, check for already-received data
in the receive worker thread before passing offload PDUs up to the
iSCSI layer. This uses soreceive() to read data from the socket and
is also to use M_WAITOK since it now runs from a worker thread instead
of an interrupt thread.
Also, fix decoding of the data segment length for pre-offload PDUs.
Reported by: Jithesh Arakkan @ Chelsio
Fixes: a8c4147edc cxgbei: Parse all PDUs received prior to enabling offload mode.
Sponsored by: Chelsio Communications
FUSE file systems that do not set FUSE_NO_OPENDIR_SUPPORT do not
guarantee that d_off will be valid after closing and reopening a
directory. That conflicts with NFS's statelessness, that results in
unresolvable bugs when NFS reads large directories, if:
* The file system _does_ change the d_off field for the last directory
entry previously returned by VOP_READDIR, or
* The file system deletes the last directory entry previously seen by
NFS.
Rather than doing a poor job of exporting such file systems, it's better
just to refuse.
Even though this is technically a breaking change, 13.0-RELEASE's
NFS-FUSE support was bad enough that an MFC should be allowed.
MFC after: 3 weeks.
Reviewed by: rmacklem
Differential Revision: https://reviews.freebsd.org/D33726
In its lowest common denominator, FUSE does not require that a directory
entry's d_off field is valid outside of the lifetime of the directory's
FUSE file handle. But since NFS is stateless, it must reopen the
directory on every call to VOP_READDIR. That means reading the
directory all the way from the first entry. Not only does this create
an O(n^2) condition for large directories, but it can also result in
incorrect behavior if either:
* The file system _does_ change the d_off field for the last directory
entry previously seen by NFS, or
* The file system deletes the last directory entry previously seen by
NFS.
Handily, for file systems that set FUSE_NO_OPENDIR_SUPPORT d_off is
guaranteed to be valid for the lifetime of the directory entry, there is
no need to read the directory from the start.
MFC after: 3 weeks
Reviewed by: rmacklem
The FUSE protocol does not require that a directory entry's d_off field
outlive the lifetime of its directory's file handle. Since the NFS
server must reopen the directory on every VOP_READDIR call, that means
it can't pass uio->uio_offset down to the FUSE server. Instead, it must
read the directory from 0 each time. It may need to issue multiple
FUSE_READDIR operations until it finds the d_off field that it's looking
for. That was the intention behind SVN r348209 and r297887, but a logic
bug prevented subsequent FUSE_READDIR operations from ever being issued,
rendering large directories incompletely browseable.
MFC after: 3 weeks
Reviewed by: rmacklem
For write operation pseudofs creates an sbuf with the data.
Use this data instead of the uio as it's not usable anymore after
uiomove.
Reviewed by: hselasky
MFC after: 1 week
Sponsored by: Beckhoff Automation GmbH & Co. KG
Differential Revision: https://reviews.freebsd.org/D34114
Modules no longer call kernel functions for atomic ops, and since the
previous commit, we always use lock prefix.
Submitted by: Elliott Mitchell <ehem+freebsd@m5p.com>
Reviewed by: jhb, markj
MFC after: 1 week
Differential revision: https://reviews.freebsd.org/D34153
Atomics have significant other use besides providing in-system
primitives for safe memory updates. They are used for implementing
communication with out of system software or hardware following some
protocols.
For instance, even UP kernel might require a protocol using atomics to
communicate with the software-emulated device on SMP hypervisor. Or
real hardware might need atomic accesses as part of the proper
management protocol.
Another point is that UP configurations on x86 are extinct, so slight
performance hit by unconditionally use proper atomics is not important.
It is compensated by less code clutter, which in fact improves the
UP/i386 lifetime expectations.
Requested by: Elliott Mitchell <ehem+freebsd@m5p.com>
Reviewed by: Elliott Mitchell, imp, jhb, markj, royger
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
Differential revision: https://reviews.freebsd.org/D34153
While here clean up the names for the naming convention of the other
registers in this file.
Reviewed by: kib, mhorne (earlier version)
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D34060
Summary:
This switch is based off of the AR8327/AR8337 external switch/PHY.
However unlike the AR8327/AR8337 it itself doesn't have any PHYs;
instead an external PHY connects to it using the PSGMII port.
Differential Revision: https://reviews.freebsd.org/D34112
Reviewed by: manu
This code is inspired by the ar40xx code in openwrt, which itself
is based on the Qualcomm QCA-SSDK. Both of these sources are, amusingly,
BSD licenced - and thus I have included some of the comments in the
hardware workaround paths to document some of the magic numbers.
This adds the ethernet MAC and ethernet switch definitions.
I've rewritten the header file and the DTS based on documentation
and the required driver fields rather than the GPL'ed
ones from openwrt.
Differential Revision: https://reviews.freebsd.org/D34111
Reviewed by: manu
This adds support for the IPQ4018/IPQ4019 MDIO bus. This is used to
talk to external PHYs and switches. (There's an internal switch
in the IPQ4018/IPQ4019 as well, but it's accessible via MMIO/AXI.)
Differential Revision: https://reviews.freebsd.org/D34110
Reviewed by: manu
A lot more generic cam related things were done in mmc_sim so this
simplifies the driver a lot.
Differential Revision: https://reviews.freebsd.org/D32154
Reviewed by: imp
After b5d227b0 FreeBSD was panicking on boot with "Duplicate free" in
UMA. Analyzing the asm, the '1' mask was treated as an integer, rather
than a long, causing 'slw' (shift left word) to be used for the shifting
instruction, not 'sld' (shift left double). This means the upper bits
of the bitfield were not getting used, resulting in corruption of the
bitfield.
While fixing this, the 'and' check of the mask does not need to be
recorded, so don't record (drop the '.').
There seem to be systems returning some garbage here. I still don't
know why, but at least I hope this check fix indefinite printf loop.
MFC after: 2 weeks
setsockopt() grants full access to the deprecated
TOS byte. For TCP, mask out the ECN codepoint, so that
only the DSCP portion can be adjusted.
Reviewed By: tuexen, hselasky, #manpages, #transport, debdrup
Sponsored by: NetApp, Inc.
Differential Revision: https://reviews.freebsd.org/D34154
GCC's -Wformat complains about NULL format strings passed to
iwl_fw_dbg_collect_trig (though the function handles NULL format
strings). Curious that upstream iwlwifi in Linux is built with GCC
and explicitly opts into this warning via the __printf() attribute.
Reviewed by: bz
Differential Revision: https://reviews.freebsd.org/D34146
Due to variable size of struct ctl_ha_msg_mode ctl_isc_announce_mode()
sent only first 4 bytes of modified mode page to the other HA side,
that caused its corruption there, noticeable only after failover.
I've found alike bug also in ctl_isc_announce_lun(), but there it was
sending slightly more than needed, that is a smaller problem.
MFC after: 1 week
Sponsored by: iXsystems, Inc.
74cf7cae4d ("softclock: Use dedicated ithreads for running callouts.")
switched callouts away from the swi infrastructure. It turns out that
this was a major source of entropy in early boot, which we've now lost.
As a result, first boot on hardware without a 'fast' entropy source
would block waiting for fortuna to be seeded with little hope of
progressing without manual intervention.
Let's resolve it by explicitly harvesting entropy in callout_process()
if we've handled any callouts. cc/curthread/now seem to be reasonable
sources of entropy, so use those.
Discussed with: jhb (also proposed initial patch)
Reported by: many
Reviewed by: cem, markm (both csprng)
Differential Revision: https://reviews.freebsd.org/D34150
TCP RACK can cache the IP header while preparing
a new TCP packet for transmission. Thus all the
IP ECN codepoint bits need to be assigned, without
assuming a clear field beforehand.
Reviewed By: tuexen, kbowling, #transport
MFC after: 3 days
Sponsored by: NetApp, Inc.
Differential Revision: https://reviews.freebsd.org/D34148
In order to consistently provide access to all
(including reserved) TCP header flag bits,
use an accessor function tcp_get_flags and
tcp_set_flags. Also expand any flag variable from
uint8_t / char to uint16_t.
Reviewed By: hselasky, tuexen, glebius, #transport
Sponsored by: NetApp, Inc.
Differential Revision: https://reviews.freebsd.org/D34130
When FILEMON_SET_FD is used, the filemon handle effectively wraps the
passed file. In particular, the handle may be inherited by a child
process, or transferred over a unix domain socket, so we must verify
that the backing file permits this.
Reported by: syzbot+36e6be9e02735fe66ca8@syzkaller.appspotmail.com
Reviewed by: emaste
MFC after: 1 week
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D34128
Consistently only pass the inp and the sopt around. Don't pass the
so around, since in a upcoming commit tcp_ctloutput_set() will be
called from a context different from setsockopt(). Also expect
the inp to be locked when calling tcp_ctloutput_[gs]et(), this is
also required for the upcoming use by tcpsso, a command line tool
to set socket options.
Reviewed by: glebius, rscheff
Sponsored by: Netflix, Inc.
Differential Revision: https://reviews.freebsd.org/D34151
According to Broadcom, mixing 64-bit SGEs with 32-bit chain entries can
lead to IOC Fault code 0x40000d04. This fault code has been observed to
suddenly increase on certain machines when the OCA firmware images are
deployed. The hardware interprets all elements of a 64-bit SGE, even
ones marked as 32-bit. Depending on the other bits, this will just work,
but sometimes generate the above fault. Broadcom recommends this
practice, and the Linux and NetBSD drivers follow it.
Rework the chaining code to use MPI2_SGE_CHAIN64 instead of
MPI2_SGE_CHAIN32. Adjust MPS_SGC_SIZE from 8 to 12 to match the size of
the new structure. Flag the structure as being 64-bits now. Since
MPS_SGE64_SIZE and MPS_SGC_SIZE are the same now, mps_push_sge could be
simplified (after the same fashion of mpr). The different number of
cases collapse to whether or not there's room for the segments and if
not we need a chain, however these changes haven't been made yet as the
current code handles those cases properly with the new defines.
Made chain_busaddr 64-bits, even though we ask for all allocations to be
below 4GB for this tag. Use it to set both parts of the CHAIN64 address
rather than baking the 4GB assumption. Add asserts around the allocation
to detect and BUSDMA bugs in allocation.
Remove asserts and associated comment in mpi_pre_fw_download and
mpi_pre_fw_upload. The code does not, it seems, depend on this
invariant. The mpr driver has similar code, no asserts and also doesn't
depend on this.
Adjust comments to reflect the updated size.
Sponsored by: Netflix
Reviewed by: scottl, mav
Differential Revision: https://reviews.freebsd.org/D34016
I see no apparent need to avoid waiting on the lock just because
vinactive() may be called on another thread while the thread that
cleared the vnode refcount has the lock dropped. In fact, this
can at least lead to a panic of the form "vn_lock: error <errno>
incompatible with flags" if LK_RETRY was passed to VOP_LOCK().
In this case LK_NOWAIT may cause the underlying FS to return an
error which is incompatible with LK_RETRY.
Reported by: pho
Reviewed by: kib, markj, pho
Differential Revision: https://reviews.freebsd.org/D34109
The unionfs root vnode will always share a lock with its lower vnode.
If unionfs was mounted with the 'below' option, this will also be the
vnode covered by the unionfs mount. During unmount, the covered vnode
will be locked by dounmount() while the unionfs root vnode will be
locked by vgone(). This effectively requires recursion on the same
underlying like, albeit through two different vnodes.
Reported by: pho
Reviewed by: kib, markj, pho
Differential Revision: https://reviews.freebsd.org/D34109
VOP_LOCK() may be handed a vnode that is concurrently reclaimed.
unionfs_lock() accounts for this by checking for empty vnode private
data under the interlock. But it incorrectly asserts that the vnode
is using the unionfs dispatch table before making this check.
Reverse the order, and also update KASSERT_UNIONFS_VNODE() to provide
more useful information.
Reported by: pho
Reviewed by: kib, markj, pho
Differential Revision: https://reviews.freebsd.org/D34109
Commit b0b7d978b6 changed the NFSv4 server's default
behaviour to check the file's mode or ACL for permission to
open the file, to be Linux and Solaris compatible.
However, it turns out that Linux makes an exception for
the case of Claim_delegate_cur(_fh).
When a NFSv4 client is returning a delegation, it must
acquire Opens against the server to replace the ones
done locally in the client. The client does this via
an Open operation with Claim_delegate_cur(_fh). If
this operation fails, due to a change to the file's
mode or ACL after the delegation was issued, the
client does not have any way to retain the open.
As such, the Linux client allows the file's owner
to perform an Open with Claim_delegate_cur(_fh)
no matter what the mode or ACL allows.
This patch makes the FreeBSD server allow this case,
to be Linux compatible.
This patch only affects the case where delegations
are enabled, which is not the default.
MFC after: 2 weeks
If port resume fails, likely the USB device is detached. Ignore such errors,
because else the USB stack might try forever trying to resume the device,
before it will proceed detaching it.
MFC after: 1 week
Sponsored by: NVIDIA Networking
Eliminate shlq $3,address shift after masking of the va is done, which
is needed to convert pt_entry_t[] array index into byte offset.
Do it by preshifting the mask, and compensating the right shift of va.
Suggested by: alc
Reviewed by: markj
Tested by: pho
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
Differential revision: https://reviews.freebsd.org/D33786
instead of only OBJ_ANON objects that are backing, as it is now.
This is required for e.g. vm_meter is_object_active() detection, and
should be useful in some more cases.
Use refcount KPI for all objects, regardless of owning the object lock,
and the fact that currently OBJ_ANON cannot change for the live object.
Noted and reviewed by: markj
Tested by: pho
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
Differential revision: https://reviews.freebsd.org/D33549
This allows entering the debugger at the earliest possible time, if
the '-d' argument is passed to the kernel.
Reviewed by: jhb
Sponsored by: The FreeBSD Foundation
MFC after: 3 days
Differential Revision: https://reviews.freebsd.org/D34120
TCP per RFC793 has 4 reserved flag bits for future use. One
of those bits may be used for Accurate ECN.
This patch is to include these bits in the LRO code to ease
the extensibility if/when these bits are used.
Reviewed By: hselasky, rrs, #transport
Sponsored by: NetApp, Inc.
Differential Revision: https://reviews.freebsd.org/D34127
This fixes a -Wsign-compare error reported by GCC due to the two
results of the ternary operator having differing signedness.
Reviewed by: dougm, rlibby
Differential Revision: https://reviews.freebsd.org/D34122
6d4baa0d01 incorrectly rounded the lenght of the pflog header up to 8
bytes, rather than 4.
PR: 261566
Reported by: Guy Harris <gharris@sonic.net>
MFC after: 1 week
Sponsored by: Rubicon Communications, LLC ("Netgate")
TLS RX support is modeled after TLS TX support. The basic structures and layouts
are almost identical, except that the send tag created filters RX traffic and
not TX traffic.
The TLS RX tag keeps track of past TLS records up to a certain limit,
approximately 1 Gbyte of TCP data. TLS records of same length are joined
into a single database record.
Regularly the HW is queried for TLS RX progress information. The TCP sequence
number gotten from the HW is then matches against the database of TLS TCP
sequence number records and lengths. If a match is found a static params WQE
is queued on the IQ and the hardware should immediately resume decrypting TLS
data until the next non-sequential TCP packet arrives.
Offloading TLS RX data is supported for untagged, prio-tagged, and
regular VLAN traffic.
MFC after: 1 week
Sponsored by: NVIDIA Networking
If the driver_version capability bit is enabled, send the driver
version to firmware after the init HCA command, for display purposes.
Example of driver version: "FreeBSD,mlx5_core,14.0.0,3.x-xxx"
Linux commits:
012e50e109fd27ff989492ad74c50ca7ab21e6a1
MFC after: 1 week
Sponsored by: NVIDIA Networking
Currently, unicast/multicast loopback raw ethernet (non-RDMA) packets
are sent back to the vport. A unicast loopback packet is the packet
with destination MAC address the same as the source MAC address. For
multicast, the destination MAC address is in the vport's multicast
filter list.
Moreover, the local loopback is not needed if there is one or none
user space context.
After this patch, the raw ethernet unicast and multicast local
loopback are disabled by default. When there is more than one user
space context, the local loopback is enabled.
Note that when local loopback is disabled, raw ethernet packets are
not looped back to the vport and are forwarded to the next routing
level (eswitch, or multihost switch, or out to the wire depending on
the configuration).
Linux commits:
c85023e153e3824661d07307138fdeff41f6d86a
8978cc921fc7fad3f4d6f91f1da01352aeeeff25
MFC after: 1 week
Sponsored by: NVIDIA Networking
This change adds convenience functions to setup a flow steering rule based on
a TCP socket. The helper function gets all the address information from the
socket and returns a steering rule, to be used with HW TLS RX offload.
MFC after: 1 week
Sponsored by: NVIDIA Networking
Previously flow steering tables and rules were only created and destroyed
at link up and down events, respectivly. Due to new requirements for adding
TLS RX flow tables and rules, the main flow steering table must always be
available as there are permanent redirections from the TLS RX flow table
to the vlan flow table.
MFC after: 1 week
Sponsored by: NVIDIA Networking
All packets must go through the indirection table, RQT,
because it is not possible to modify the RQN of the TIR
for direct dispatchment after it is created, typically
when the link goes up and down.
MFC after: 1 week
Sponsored by: NVIDIA Networking
Add support to map an SQ to a specific schedule queue using a
special WQE as performance enhancement.
SQ remap operation is handled by a privileged internal queue, IQ,
and the mapping is enabled from one rate to another.
The transition from paced to non-paced should however always go
through FW.
MFC after: 1 week
Sponsored by: NVIDIA Networking
Internal send queues are regular sendqueues which are reserved for WQE commands
towards the hardware and firmware. These queues typically carry resync
information for ongoing TLS RX connections and when changing schedule queues
for rate limited connections.
The internal queue, IQ, code is more or less a stripped down copy
of the existing SQ managing code with exception of:
1) An optional single segment memory buffer which can be read or
written as a whole by the hardware, may be provided.
2) An optional completion callback for all transmit operations, may
be provided.
3) Does not support mbufs.
MFC after: 1 week
Sponsored by: NVIDIA Networking
Allocate the RQT once, pointing all initial entries to the drop RQN.
When opening the channels simplify modify the RQT, directing all traffic
to the new RQNs. Similarly when closing the channels point all RQT entries
back to the so-called drop RQN.
MFC after: 1 week
Sponsored by: NVIDIA Networking
What is a drop RQ and why is it needed?
The RSS indirection table, also called the RQT, selects the
destination RQ based on the receive queue number, RQN. The RQT is
frequently referred to by flow steering rules to distribute traffic
among multiple RQs. The problem is that the RQs cannot be destroyed
before the RQT referring them is destroyed too. Further, TLS RX
rules may still be referring to the RQT even if the link went
down. Because there is no magic RQN for dropping packets, we create
a dummy RQ, also called drop RQ, which sole purpose is to drop all
received packets. When the link goes down this RQN is filled in all
RQT entries, of the main RQT, so the real RQs which are about to be
destroyed can be released and the TLS RX rules can be sustained.
MFC after: 1 week
Sponsored by: NVIDIA Networking
During packet reception the network stack frequently transmit data in
response to TCP window updates. To reduce the number of transmit doorbells
needed, inhibit all transmit doorbells designated for the same channel until
after the reception of packets for the given channel is completed.
While at it slightly refactor the mlx5e_tx_notify_hw() function:
1) The doorbell information is always stored into sq->doorbell.d64 .
No need to pass a separate pointer to this variable.
2) Move checks for skipping doorbell writes inside this function.
MFC after: 1 week
Sponsored by: NVIDIA Networking
Instead of allocating directly from a normal zone. This way
import and release are guaranteed to process all allocated and then
deallocated items. Also, the release occurs in a sleepable context when
caller of uma_zfree() or uma_zdestroy() can sleep itself.
MFC after: 1 week
Sponsored by: NVIDIA Networking
When allocating new vnode, we need to lock it exclusively before
making it externally visible. Since other threads cannot observe the
vnode yet, current lock order cannot create LoR conditions.
Reviewed by: mckusick
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
Differential revision: https://reviews.freebsd.org/D34126
It prevents WITNESS from recording the lock order for the buffer lock
acquired by getblkx().
Reviewed by: mckusick
Discussed with: markj
Tested by: pho
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
Differential revision: https://reviews.freebsd.org/D34073
it is needed for __read_mostly attribute definition, which right now
comes from vm/vm_page.h including sys/systm.h
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
Differential revision: https://reviews.freebsd.org/D34089
It contains assert-related definitions previously provided by
sys/systm.h. The new header is leaner than whole systm.h.
Include kassert.h from systm.h for compatibility.
The copyright assignment to Eivind Eklund was suggested by Kirk McKusick
and is based in the commit 5526d2d920.
Suggested by: jhb
Reviewed by: alc, imp, jhb
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
Differential revision: https://reviews.freebsd.org/D34089
struct sglist is intended for holding S/G lists of physical address
ranges, not virtual address ranges. GCC 9.x issues several warnings
due to casts between pointers and integers of different sizes as a
result (vm_paddr_t is 64-bits on i386). Instead, add a local 'struct
hv_sglist' which uses an array of 'struct iovec' to hold the S/G list
of virtual address ranges.
Differential Revision: https://reviews.freebsd.org/D31933
- After a connection has fallen back from NIC TLS to SW TLS, any
pacing rate changes should modify the inpcb send tag even though
SB_TLS_IFNET is set.
- If a connection tries to modify the pacing rate before the send
tag has been converted from plain TLS to TLS + RL, don't fail
the rate request set but let it fall through to setting the rate
on the non-TLS inpcb RL tag.
Reviewed by: gallatin, rrs, hselasky
Sponsored by: Netflix
Differential Revision: https://reviews.freebsd.org/D34085
At the moment this is mostly a no-op but in the future there will be
in-flight encrypted data which requires software decryption. This
same setup is also needed for NIC TLS RX.
Note that this does break TOE TLS RX for AES-CBC ciphers since there
is no software fallback for AES-CBC receive. This will be resolved
one way or another before 14.0 is released.
Reviewed by: hselasky
Sponsored by: Chelsio Communications
Differential Revision: https://reviews.freebsd.org/D34082
There are some error paths in ioctl handlers that will call
pf_krule_free() before the rule's rpool.mtx field is initialized,
causing a panic with INVARIANTS enabled.
Fix the problem by introducing pf_krule_alloc() and initializing the
mutex there. This does mean that the rule->krule and pool->kpool
conversion functions need to stop zeroing the input structure, but I
don't see a nicer way to handle this except perhaps by guarding the
mtx_destroy() with a mtx_initialized() check.
Constify some related functions while here and add a regression test
based on a syzkaller reproducer.
Reported by: syzbot+77cd12872691d219c158@syzkaller.appspotmail.com
Reviewed by: kp
MFC after: 1 week
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D34115
Also remove once-used functions to clean up after failed insmntque1(),
which were destructor callbacks in previous life.
Reviewed by: markj
Tested by: pho
Sponsored by: The FreeBSD Foundation
Differential revision: https://reviews.freebsd.org/D34071
According to the RM it's not safe to disable a TX ring while it is busy
transmitting frames.
In order to be safe wait until the ring is empty. (cidx==pidx)
Use this opportunity to remove a set-but-unused variable.
Obtained from: Semihalf
Sponsored by: Alstom Group
According to the RM rings can hold at most ring_size - 1 descriptors at any time.
No additional logic is needed since iflib already respects this constrain.
Thanks to that the pidx == cidx situation is not ambiguous and indicates an
empty ring.
Use that to simplify the logic that calculates the amount of processed frames.
Obtained from: Semihalf
Sponsored by: Alstom Group
The NIC can IP align received packets.
It was observed that it caused some rare stalls, that required full board reset.
Disable this feature for now. It doesn't provide any significant performance
improvement anyway.
Obtained from: Semihalf
Sponsored by: Alstom Group
when the vnode is doomed after relock. The mere fact that the vnode is
doomed does not prevent us from doing UFS operations on it while it is
still belongs to UFS, which is determined by non-NULL v_data. Not
finishing some operations, e.g. not syncing the inode block only because
the vnode started reclamation, is not correct.
Add macro IS_UFS() which incapsulates the v_data != NULL, and use it
instead of VN_IS_DOOMED() for places where the operation completion is
important.
Reviewed by: markj, mckusick
Tested by: pho
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
Differential revision: https://reviews.freebsd.org/D34072
The lock is unneccessary since the mount point is busied, which prevents
unmount and syncer vnode deallocation. Having the vnode locked causes
innocent LoRs and complicates debugging.
Also stop starting write accounting around it. Any caller of
VOP_FSYNC() must do it already, and sync_vnode() does.
Reported and tested by: pho
Reviewed by: markj, mckusick
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
Differential revision: https://reviews.freebsd.org/D34072