Commit Graph

147132 Commits

Author SHA1 Message Date
Mark Johnston
5fd1a67e88 inpcb: Release the inpcb cred reference before freeing the structure
Now that the inp_cred pointer is accessed only while the inpcb lock is
held, we can avoid deferring a crfree() call when freeing an inpcb.

This fixes a problem introduced when inpcb hash tables started being
synchronized with SMR: the credential reference previously could not be
released until all lockless readers have drained, and there is no
mechanism to explicitly purge cached, freed UMA items.  Thus, ucred
references could linger indefinitely, and since ucreds hold a jail
reference, the jail would linger indefinitely as well.  This manifests
as jails getting stuck in the DYING state.

Discussed with:	glebius
Tested by:	glebius
Sponsored by:	Klara, Inc.
Sponsored by:	Modirum MDPay
Differential Revision:	https://reviews.freebsd.org/D38573
2023-04-20 12:13:06 -04:00
Mark Johnston
7b92493ab1 inpcb: Avoid inp_cred dereferences in SMR-protected lookup
The SMR-protected inpcb lookup algorithm currently has to check whether
a matching inpcb belongs to a jail, in order to prioritize jailed
bound sockets.  To do this it has to maintain a ucred reference, and for
this to be safe, the reference can't be released until the UMA
destructor is called, and this will not happen within any bounded time
period.

Changing SMR to periodically recycle garbage is not trivial.  Instead,
let's implement SMR-synchronized lookup without needing to dereference
inp_cred.  This will allow the inpcb code to free the inp_cred reference
immediately when a PCB is freed, ensuring that ucred (and thus jail)
references are released promptly.

Commit 220d892129 ("inpcb: immediately return matching pcb on lookup")
gets us part of the way there.  This patch goes further to handle
lookups of unconnected sockets.  Here, the strategy is to maintain a
well-defined order of items within a hash chain so that a wild lookup
can simply return the first match and preserve existing semantics.  This
makes insertion of listening sockets more complicated in order to make
lookup simpler, which seems like the right tradeoff anyway given that
bind() is already a fairly expensive operation and lookups are more
common.

In particular, when inserting an unconnected socket, in_pcbinhash() now
keeps the following ordering:
- jailed sockets before non-jailed sockets,
- specified local addresses before unspecified local addresses.

Most of the change adds a separate SMR-based lookup path for inpcb hash
lookups.  When a match is found, we try to lock the inpcb and
re-validate its connection info.  In the common case, this works well
and we can simply return the inpcb.  If this fails, typically because
something is concurrently modifying the inpcb, we go to the slow path,
which performs a serialized lookup.

Note, I did not touch lbgroup lookup, since there the credential
reference is formally synchronized by net_epoch, not SMR.  In
particular, lbgroups are rarely allocated or freed.

I think it is possible to simplify in_pcblookup_hash_wild_locked() now,
but I didn't do it in this patch.

Discussed with:	glebius
Tested by:	glebius
Sponsored by:	Klara, Inc.
Sponsored by:	Modirum MDPay
Differential Revision:	https://reviews.freebsd.org/D38572
2023-04-20 12:13:06 -04:00
Mark Johnston
3e98dcb3d5 inpcb: Move inpcb matching logic into separate functions
These functions will get some additional callers in future revisions.

No functional change intended.

Discussed with:	glebius
Tested by:	glebius
Sponsored by:	Modirum MDPay
Sponsored by:	Klara, Inc.
Differential Revision:	https://reviews.freebsd.org/D38571
2023-04-20 12:13:06 -04:00
Mark Johnston
fdb987bebd inpcb: Split PCB hash tables
Currently we use a single hash table per PCB database for connected and
bound PCBs.  Since we started using net_epoch to synchronize hash table
lookups, there's been a bug, noted in a comment above in_pcbrehash():
connecting a socket can cause an inpcb to move between hash chains, and
this can cause a concurrent lookup to follow the wrong linkage pointers.
I believe this could cause rare, spurious ECONNREFUSED errors in the
worse case.

Address the problem by introducing a second hash table and adding more
linkage pointers to struct inpcb.  Now the database has one table each
for connected and unconnected sockets.

When inserting an inpcb into the hash table, in_pcbinhash() now looks at
the foreign address of the inpcb to figure out which table to use.  This
ensures that queue linkage pointers are stable until the socket is
disconnected, so the problem described above goes away.  There is also a
small benefit in that in_pcblookup_*() can now search just one of the
two possible hash buckets.

I also made the "rehash" parameter of in(6)_pcbconnect() unused.  This
parameter seems confusing and it is simpler to let the inpcb code figure
out what to do using the existing INP_INHASHLIST flag.

UDP sockets pose a special problem since they can be connected and
disconnected multiple times during their lifecycle.  To handle this, the
patch plugs a hole in the inpcb structure and uses it to store an SMR
sequence number.  When an inpcb is disconnected - an operation which
requires the global PCB database hash lock - the write sequence number
is advanced, and in order to reconnect, the connecting thread must wait
for readers to drain before reusing the inpcb's hash chain linkage
pointers.

raw_ip (ab)uses the hash table without using the corresponding
accessors.  Since there are now two hash tables, it arbitrarily uses the
"connected" table for all of its PCBs.  This will be addressed in some
way in the future.

inp interators which specify a hash bucket will only visit connected
PCBs.  This is not really correct, but nothing in the tree uses that
functionality except raw_ip, which as mentioned above places all of its
PCBs in the "connected" table and so is unaffected.

Discussed with:	glebius
Tested by:	glebius
Sponsored by:	Klara, Inc.
Sponsored by:	Modirum MDPay
Differential Revision:	https://reviews.freebsd.org/D38569
2023-04-20 12:13:06 -04:00
Bjoern A. Zeeb
35f7fa4ac1 LinuxKPI: 802.11: improve assertion and tkip code
Move a KASSERT out of a function and make it a CTASSERT with
appropriate comments.

Skeleton implement two tkip functions, still left TODO, initializing
variables with dummy values to quiten compiler warnings.  It is
unclear to me if we should still ever properly implement TKIP
compat code at this point.  If so the current code gives a good
idea what needs to be done in addition to allocating references
to real state along with keyconf.

Sponsored by:	The FreeBSD Foundation
MFC after:	3 days
2023-04-20 16:07:50 +00:00
Bjoern A. Zeeb
7db7bfe1a7 iwlwifi: quieten more compiler warnings
Quieten some more (valid) gcc warnings and disable dead code.
There are more warnings, some probably a compiler problem, the
other related to firmware structs which I do not want to adjust
just locally.  Leave a comment to revisit after a next driver
update.

Sponsored by:	The FreeBSD Foundation
MFC after:	3 days
2023-04-20 16:07:05 +00:00
Bjoern A. Zeeb
74e908b3c6 LinuxKPI: fix READ_ONCE() -Wcast-equal warnings
Rather than using ACCESS_ONCE() in READ_ONCE() add a missing cast
to const in order to satisfy -Wcast-equal by gcc.
Sadly we cannot do the same to WRITE_ONCE() which still is very
noisy.

Sponsored by:	The FreeBSD Foundation
MFC after:	3 days
Reviewed by:	hselasky
Differential Revision: https://reviews.freebsd.org/D39706
2023-04-20 11:51:22 +00:00
Alexander V. Chernikov
56d4550c4d ifnet: factor out interface renaming into a separate function.
This change is required to support interface renaming via Netlink.
No functional changes intended.

Reviewed by:	zlei
Differential Revision: https://reviews.freebsd.org/D39692
MFC after:	2 weeks
2023-04-20 10:23:37 +00:00
Mateusz Guzik
9c4e270822 zfs: fix up EINVAL from getdirentries on .zfs
PR:	270909
2023-04-20 08:38:28 +00:00
Mateusz Guzik
7ff3143809 zfs: add missing vn state transition for .zfs
Reported by:	des
2023-04-20 08:09:59 +00:00
Bjoern A. Zeeb
f369f10dd8 LinuxKPI: 802.11: fix a -Wenum-compare warning
We are asserting that two values from different enums are the same.
gcc warns about these.  Cast the values to (int) to avoid the warning.

Sponsored by:	The FreeBSD Foundation
MFC after:	3 days
2023-04-19 21:49:17 +00:00
Bjoern A. Zeeb
b2dcb84868 LinuxKPI: skbuff.h: fix -Warray-bounds warnings
Harmonize sk_buff_head and sk_buff further and fix -Warray-bounds
warnings reports by gcc.  At the same time simplify some code by
re-using other functions or factoring some code out.

Sponsored by:	The FreeBSD Foundation
MFC after:	3 days
2023-04-19 21:49:00 +00:00
Dimitry Andric
87f55ab0b4 ichiic: use bool for one-bit wide bit-fields
A one-bit wide bit-field can take only the values 0 and -1. Clang 16
introduced a warning that "implicit truncation from 'int' to a one-bit
wide bit-field changes value from 1 to -1". Fix by using c99 bool.

Reported by:	Clang
Reviewed by:	emaste, wulf
MFC after:	3 days
Differential Revision: https://reviews.freebsd.org/D39665
2023-04-19 22:25:50 +02:00
Igor Ostapenko
0e0c47ecd6 vfs cache: fix vfs.cache.stats.* name typos
Two vfs.cache.stats names are fixed:
- s/.dotdothis/.dotdothits/
- s/.posszaps/.poszaps/

Signed-off-by: Igor Ostapenko <pm@igoro.pro>
[mjg: massaged the header a little bit]
2023-04-19 18:47:38 +00:00
Randall Stewart
4e8a20a764 tcp: rack the request level logging is a bit too noisy when doing point logging.
When doing request level BB logging the hybrid_bw_log() does not have proper screening to minimize logging
when point level logging is in use. Lets fix it properly so you have to have the proper knobs set to get the
more noisy logging.

Reviewed by: tuexen
Sponsored by: Netflix Inc
Differential Revision:https://reviews.freebsd.org/D39699
2023-04-19 14:02:12 -04:00
Randall Stewart
7a842346c3 tcp: Rack can crash with the new non-TSO fix..
Turns out the location of the check to see if we can do output is in the wrong place. We need
to jump off to the compressed acks before handling that case since th is NULL in the
compressed ack case which is handled differently anyway.

Reviewed by: tuexen
Sponsored by: Netflix Inc
Differential Revision:https://reviews.freebsd.org/D39690
2023-04-19 13:17:04 -04:00
Randall Stewart
303246dcdf We have a TCP_LOG_CONNEND log that should come out at the very last log of every connection. This
holds some nice stats about why/how the connection ended. Though with the current code it does not
come out without accounting due to the placement of the ifdefs. Also we need to make sure the stacks
fini has ran before calling in from tcp_subr so we get all logs the stack may make at its ending.

Reviewed by: rscheff
Sponsored by: Netflix Inc
Differential Revision:https://reviews.freebsd.org/D39693
2023-04-19 12:54:25 -04:00
Navdeep Parhar
7adf138ba9 cxgbe/iw_cxgbe: debug routines to dump STAG (steering tag) entries.
t4_dump_stag to dump hw state for a known STAG.

t4_dump_all_stag to dump hw state for all valid STAGs.  This routine
walks the entire STAG region looking for valid entries and this can take
a while for some configurations.

MFC after:	1 week
Sponsored by:	Chelsio Communications
2023-04-19 09:38:14 -07:00
Justin Hibbits
12e99b63d2 ofed: Fix a logic inversion from IfAPI conversion
Reported by:	bartosz.sobczak_intel.com
Fixes:		3e142e0767 ("ofed: Mechanically convert to IfAPI")
Sponsored by:	Juniper Networks, Inc.
2023-04-19 11:56:25 -04:00
Dmitry Salychev
4cd9661428
dpaa2: Avoid dpaa2_cmd race conditions
struct dpaa2_cmd is no longer malloc'ed, but can be allocated on stack
and initialized with DPAA2_CMD_INIT() on demand. Drivers stopped caching
their DPAA2 command objects (and associated tokens) in the software
contexts in order to avoid using them concurrently.

Reviewed by:		bz
Approved by:		bz (mentor)
MFC after:		3 weeks
Differential Revision:	https://reviews.freebsd.org/D39509
2023-04-19 17:39:05 +02:00
Bjoern A. Zeeb
f621b087c0 iwlwifi: rtw88: rtw89: fix gcc warnings
Fix -Wno-format and unused variables warnings with gcc by adopting
(to|the) FreeBSD-specific code.

Reported by:	jhb
Sponsored by:	The FreeBSD Foundation
MFC after:	3 days
Reviewed by:	jhb
Differential Revision: https://reviews.freebsd.org/D39673
2023-04-19 12:21:40 +00:00
Randall Stewart
960985a209 tcp: bbr.c is non-capable of doing ECN and sets an INP flag to fend off ECN however our syncache is not aware of that flag.
We need to make the syncache aware of the flag and not do ECN if its set. Note that this
is not 100% full proof but the best we can do (i.e. its still possible that you can get in a
situation where the peer try's to do ecn).

Reviewed by: tuexen, glebius, rscheff
Sponsored by: Netflix Inc
Differential Revision:https://reviews.freebsd.org/D39672
2023-04-18 12:21:56 -04:00
Kristof Provost
2e6cdfe293 pf: change pf_rules_lock and pf_ioctl_lock to per-vnet locks
Both pf_rules_lock and pf_ioctl_lock only ever affect one vnet, so
there's no point in having these locks affect other vnets.
(In fact, the only lock in pf that can affect multiple vnets is
pf_end_lock.)

That's especially important for the rules lock, because taking the write
lock suspends all network traffic until it's released. This will reduce
the impact a vnet running pf can have on other vnets, and improve
concurrency on machines running multiple pf-enabled vnets.

Reviewed by:	zlei
Sponsored by:	Rubicon Communications, LLC ("Netgate")
Differential Revision:	https://reviews.freebsd.org/D39658
2023-04-19 09:50:52 +02:00
Konstantin Belousov
617a11eab6 x86: initialize use_xsave once
The explanation from https://reviews.freebsd.org/D39637 by stevek:
The "use_xsave" variable is a global and that is only supposed to be
initialized early before scheduling gets started. However, with the way
the ifuncs for "fpusave" and "fpurestore" are implemented, the value
could be changed at runtime when scheduling is active if "use_xsave"
was set to 0 by the tunable. This leaves a window of opportunity where
"use_xsave" gets re-initialized to 1 and a context switch could occur
with a thread that was not set up to be able to use xsave functionality.
This can lead to an "privileged instruction fault".

The fix is to protect "use_xsave" from being initialized more than once.

Reported and reviewed by:	stevek
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
Differential revision:	https://reviews.freebsd.org/D39660
2023-04-19 02:22:28 +03:00
Konstantin Belousov
93ca6ff295 umtx: allow to configure minimal timeout (in nanoseconds)
PR:	270785
Reviewed by:	markj, mav
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
Differential revision:	https://reviews.freebsd.org/D39584
2023-04-19 02:22:28 +03:00
Konstantin Belousov
7aeea73e30 syncer vnode: add VOP_GETWRITEMOUNT() definition explicitly
Since syncer vnode vector does not provide a fallback to the default
one, its VOP_GETWRITEMOUNT() implementation implicitly returned
EOPNOTSUPP, which means that syncer ignored suspension.

Reported and tested by:	pho
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
2023-04-19 02:21:40 +03:00
Konstantin Belousov
d8a096621b sync_vnode(): add assert to check vn_start_write() correctness
Tested by:	pho
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
2023-04-19 02:21:40 +03:00
Steve Kiernan
8deb442cf7 mac: Honor order when registering MAC modules.
Ensure MAC modules are inserted in order that they are registered.

Reviewed by:	markj
Obtained from:	Juniper Networks, Inc.
Differential Revision: https://reviews.freebsd.org/D39589
2023-04-18 15:36:27 -04:00
Mateusz Guzik
5e954b9216 tmpfs: add missing vop_fplookup ops to tmpfs_fifoop_entries
Reported by:	gbe
PR:	270917
2023-04-18 18:06:30 +00:00
Marius Strobl
8defc88c13 gem(4): Remove onboard-only Sun ERI and remnants of SBus support
These bits are obsolete since 58aa35d429.
This change reverts part of 9ba2b298df as
well as effectively bd3d9826d7, i. e. the
SBus-related modifications. This also gets rid of a nasty hack required
as bus_{read,write}_N(9) doesn't really fit bus_space_subregion(9).
2023-04-18 19:17:24 +02:00
Marius Strobl
bd15d31cef mmc(4): Don't call bridge driver for timings not requiring tuning
The original idea behind calling into the bridge driver was to have the
logic deciding whether tuning is actually required for a particular bus
timing in a given slot as well as doing the sanity checking only on the
controller layer which also generally is better suited for these due to
say SDHCI_SDR50_NEEDS_TUNING. On another thought, not every such driver
should need to check whether tuning is required at all, though, and not
everything is SDHCI in the first place.
Adjust sdhci{,_fsl_fdt}(4) accordingly, but keep sdhci_generic_tune() a
bit cautious still.
2023-04-18 19:17:24 +02:00
Randall Stewart
2ad584c555 tcp: Inconsistent use of hpts_calling flag
Gleb has noticed there were some inconsistency's in the way the inp_hpts_calls flag was being used. One
such inconsistency results in a bug when we can't allocate enough sendmap entries to entertain a call to
rack_output().. basically a timer won't get started like it should. Also in cleaning this up I find that the
"no_output" side of input needs to be adjusted to make sure we don't try to re-pace too quickly outside
the hpts assurance of 250useconds.

Another thing here is we end up with duplicate calls to tcp_output() which we should not. If packets go
from hpts for processing the input side of tcp will call the output side of tcp on the last packet if it is needed.
This means that when that occurs a second call to tcp_output would be made that is not needed and if pacing
is going on may be harmful.

Lets fix all this and explicitly state the contract that hpts is making with transports that care about the
flag.

Reviewed by: tuexen, glebius
Sponsored by: Netflix Inc
Differential Revision:https://reviews.freebsd.org/D39653
2023-04-17 17:10:26 -04:00
Steve Kiernan
fb5ff7384c arm64: Use FULLKERNEL instead of .ALLSRC in .bin target
Using .ALLSRC may get additional arguments that we may not want
and could cause the objcopy to fail.

Reviewed by:	emaste
Obtained from:	Juniper Networks, Inc.
Differential Revision: https://reviews.freebsd.org/D39639
2023-04-18 11:41:57 -04:00
Kristof Provost
af94d8cc17 pf: fix incorrect lock define
PF_TABLE_STATS_ASSERT() should be checking pf_table_stats_lock not
pf_rules_lock.

Fortunately the define is not yet used anywhere so this was harmless.
Fix it anyway, in case it does get used.

Sponsored by:	Rubicon Communications, LLC ("Netgate")
2023-04-18 15:51:05 +02:00
Hans Petter Selasky
1943c40cd6 mlx5en(4): Don't wait for receive queue to fill up with mbufs during open channels.
Failure to get mbufs may be transient.
Don't permanently fail to open the channels due to lack of mbufs.
This also makes modifying channel parameters faster.

MFC after:	1 week
Sponsored by:	NVIDIA Networking
2023-04-18 15:01:07 +02:00
Hans Petter Selasky
6bd4bb9bdb mlx5en(4): Explain why CQE zipping is off.
MFC after:	1 week
Sponsored by:	NVIDIA Networking
2023-04-18 15:01:07 +02:00
Hans Petter Selasky
80b4ef6d10 mlx5: Remove unused debugfs node pointers.
No functional change intended.

MFC after:	1 week
Sponsored by:	NVIDIA Networking
2023-04-18 15:01:07 +02:00
Hans Petter Selasky
aa7bbdabde mlx5: Implement diagostic counters as sysctl(8) nodes.
MFC after:	1 week
Sponsored by:	NVIDIA Networking
2023-04-18 15:01:07 +02:00
Hans Petter Selasky
95bf70a4bf mlx5: Don't give zero number of pages to the firmware.
Can happen when using virtual mlx5_core<N> functions, VFs.

MFC after:	1 week
Sponsored by:	NVIDIA Networking
2023-04-18 15:01:06 +02:00
Hans Petter Selasky
273bfac08f mlx5: Implement mlx5_core_modify_cq_by_mask().
Implement one CQ modify function supporting all firmware versions,
instead of having more variants of CQ modify.

MFC after:	1 week
Sponsored by:	NVIDIA Networking
2023-04-18 15:01:06 +02:00
Hans Petter Selasky
2f7e9a8a21 mlx5: Fix duplicate free of default flow rule in error case.
MFC after:	1 week
Sponsored by:	NVIDIA Networking
2023-04-18 15:01:06 +02:00
Hans Petter Selasky
b0b87d9151 mlx5: Make mlx5_del_flow_rule() NULL safe.
This change factors out repeated NULL checks.

No functional change intended.

MFC after:	1 week
Sponsored by:	NVIDIA Networking
2023-04-18 15:01:06 +02:00
Hans Petter Selasky
3bb3e4768f mlx5: Make MLX5_COMP_EQ_SIZE tunable.
When using hardware pacing, this value can be increased, because more SQ's
means more EQ events aswell. Make it tunable, hw.mlx5.comp_eq_size .

MFC after:	1 week
Sponsored by:	NVIDIA Networking
2023-04-18 15:01:06 +02:00
Randall Stewart
37229fed38 tcp: Blackbox logging and tcp accounting together can cause a crash.
If you currently turn BB logging on and in combination have TCP Accounting on we can get a
crash where we have no NULL check and we run out of memory. Also lets make sure we
don't do a divide by 0 in calculating any BB ratios.

Reviewed by: tuexen
Sponsored by: Netflix Inc
Differential Revision:https://reviews.freebsd.org/D39622
2023-04-17 13:52:00 -04:00
Alexander V. Chernikov
28abf63277 netlink: sync interface IFLA attributes
MFC after:	2 weeks
2023-04-18 12:34:05 +00:00
Gordon Bergling
105e397eb6 kern_sysctl: Remove double words in source code comments
- s/on on/on/

MFC after:	5 days
2023-04-18 07:14:57 +02:00
Gordon Bergling
93e4914816 net80211: Remove double words in source code comments
- s/we we/we/

MFC after:	5 days
2023-04-18 07:14:50 +02:00
Stephen J. Kiernan
76735c7439 flash: Add "n25q64" to mx25l driver
This is for 64Mb Micron N25Q serial NOR flash memory

Obtained from:	Juniper Networks, Inc.
2023-04-18 00:21:17 -04:00
Jason A. Harmening
0c01203e47 vfs_lookup(): re-check v_mountedhere on lock upgrade
The VV_CROSSLOCK handling logic may need to upgrade the covered
vnode lock depending upon the requirements of the filesystem into
which vfs_lookup() is walking.  This may involve transiently
dropping the lock, which can allow the target mount to be unmounted.

Tested by:	pho
Reviewed by:	markj
Differential Revision:	https://reviews.freebsd.org/D39272
2023-04-17 20:31:40 -05:00
Jason A. Harmening
93fe61afde unionfs_mkdir(): handle dvp reclamation
The underlying VOP_MKDIR() implementation may temporarily drop the
parent directory vnode's lock.  If the vnode is reclaimed during that
window, the unionfs vnode will effectively become unlocked because
the its v_vnlock field will be reset.  To uphold the locking
requirements of VOP_MKDIR() and to avoid triggering various VFS
assertions, explicitly re-lock the unionfs vnode before returning
in this case.

Note that there are almost certainly other cases in which we'll
similarly need to handle vnode relocking by the underlying FS; this
is the only one that's caused problems in stress testing so far.
A more general solution, such as that employed for nullfs in
null_bypass(), will likely need to be implemented.

Tested by:	pho
Reviewed by:	kib, markj
Differential Revision: https://reviews.freebsd.org/D39272
2023-04-17 20:31:40 -05:00