Commit Graph

423 Commits

Author SHA1 Message Date
Gleb Smirnoff
eb93b99d69 in_pcb: delay crfree() down into UMA dtor
inpcb lookups, which check inp_cred, work with pcbs that potentially went
through in_pcbfree().  So inp_cred should stay valid until SMR guarantees
its invisibility to lookups.

While here, put the whole inpcb destruction sequence of in_pcbfree(),
inpcb_dtor() and inpcb_fini() sequentially.

Submitted by:		markj
Differential revision:	https://reviews.freebsd.org/D33273
2021-12-05 10:46:37 -08:00
Peter Lei
4c018b5aed in_pcb: limit the effect of wraparound in TCP random port allocation check
The check to see if TCP port allocation should change from random to
sequential port allocation mode may incorrectly cause a false positive
due to negative wraparound.
Example:
    V_ipport_tcpallocs = 2147483585 (0x7fffffc1)
    V_ipport_tcplastcount = 2147483553 (0x7fffffa1)
    V_ipport_randomcps = 100
The original code would compare (2147483585 <= -2147483643) and thus
incorrectly move to sequential allocation mode.

Compute the delta first before comparing against the desired limit to
limit the wraparound effect (since tcplastcount is always a snapshot
of a previous tcpallocs).
2021-12-03 12:38:12 -08:00
Peter Lei
13e3f3349f in_pcb: fix TCP local ephemeral port accounting
Fix logic error causing UDP(-Lite) local ephemeral port bindings
to count against the TCP allocation counter, potentially causing
TCP to go from random to sequential port allocation mode prematurely.
2021-12-03 12:30:21 -08:00
Cy Schubert
db0ac6ded6 Revert "wpa: Import wpa_supplicant/hostapd commit 14ab4a816"
This reverts commit 266f97b5e9, reversing
changes made to a10253cffe.

A mismerge of a merge to catch up to main resulted in files being
committed which should not have been.
2021-12-02 14:45:04 -08:00
Cy Schubert
266f97b5e9 wpa: Import wpa_supplicant/hostapd commit 14ab4a816
This is the November update to vendor/wpa committed upstream 2021-11-26.

MFC after:      1 month
2021-12-02 13:35:14 -08:00
Gleb Smirnoff
f971e79139 tcp_hpts: rename input queue to drop queue and trim dead code
The HPTS input queue is in reality used only for "delayed drops".
When a TCP stack decides to drop a connection on the output path
it can't do that due to locking protocol between main tcp_output()
and stacks.  So, rack/bbr utilize HPTS to drop the connection in
a different context.

In the past the queue could also process input packets in context
of HPTS thread, but now no stack uses this, so remove this
functionality.

Reviewed by:		rrs
Differential revision:	https://reviews.freebsd.org/D33025
2021-12-02 10:48:48 -08:00
Gleb Smirnoff
de2d47842e SMR protection for inpcbs
With introduction of epoch(9) synchronization to network stack the
inpcb database became protected by the network epoch together with
static network data (interfaces, addresses, etc).  However, inpcb
aren't static in nature, they are created and destroyed all the
time, which creates some traffic on the epoch(9) garbage collector.

Fairly new feature of uma(9) - Safe Memory Reclamation allows to
safely free memory in page-sized batches, with virtually zero
overhead compared to uma_zfree().  However, unlike epoch(9), it
puts stricter requirement on the access to the protected memory,
needing the critical(9) section to access it.  Details:

- The database is already build on CK lists, thanks to epoch(9).
- For write access nothing is changed.
- For a lookup in the database SMR section is now required.
  Once the desired inpcb is found we need to transition from SMR
  section to r/w lock on the inpcb itself, with a check that inpcb
  isn't yet freed.  This requires some compexity, since SMR section
  itself is a critical(9) section.  The complexity is hidden from
  KPI users in inp_smr_lock().
- For a inpcb list traversal (a pcblist sysctl, or broadcast
  notification) also a new KPI is provided, that hides internals of
  the database - inp_next(struct inp_iterator *).

Reviewed by:		rrs
Differential revision:	https://reviews.freebsd.org/D33022
2021-12-02 10:48:48 -08:00
Gleb Smirnoff
565655f4e3 inpcb: reduce some aliased functions after removal of PCBGROUP.
Reviewed by:		rrs
Differential revision:	https://reviews.freebsd.org/D33021
2021-12-02 10:48:48 -08:00
Gleb Smirnoff
93c67567e0 Remove "options PCBGROUP"
With upcoming changes to the inpcb synchronisation it is going to be
broken. Even its current status after the move of PCB synchronization
to the network epoch is very questionable.

This experimental feature was sponsored by Juniper but ended never to
be used in Juniper and doesn't exist in their source tree [sjg@, stevek@,
jtl@]. In the past (AFAIK, pre-epoch times) it was tried out at Netflix
[gallatin@, rrs@] with no positive result and at Yandex [ae@, melifaro@].

I'm up to resurrecting it back if there is any interest from anybody.

Reviewed by:		rrs
Differential revision:	https://reviews.freebsd.org/D33020
2021-12-02 10:48:48 -08:00
Gordon Bergling
27c4abc7cd inet(3): Fix two typos in sysctl descriptions
- s/sequental/sequential/

MFC after:	3 days
2021-11-30 10:21:47 +01:00
Roy Marples
5c5340108e net: Allow binding of unspecified address without address existance
Previously in_pcbbind_setup returned EADDRNOTAVAIL for empty
V_in_ifaddrhead (i.e., no IPv4 addresses configured) and in6_pcbbind
did the same for empty V_in6_ifaddrhead (no IPv6 addresses).

An equivalent test has existed since 4.4-Lite.  It was presumably done
to avoid extra work (assuming the address isn't going to be found
later).

In normal system operation *_ifaddrhead will not be empty: they will
at least have the loopback address(es).  In practice no work will be
avoided.

Further, this case caused net/dhcpd to fail when run early in boot
before assignment of any addresses.  It should be possible to bind the
unspecified address even if no addresses have been configured yet, so
just remove the tests.

The now-removed "XXX broken" comments were added in 59562606b9,
which converted the ifaddr lists to TAILQs.  As far as I (emaste) can
tell the brokenness is the issue described above, not some aspect of
the TAILQ conversion.

PR:		253166
Reviewed by:	ae, bz, donner, emaste, glebius
MFC after:	1 month
Differential Revision:	https://reviews.freebsd.org/D32563
2021-10-20 19:25:51 -04:00
Gleb Smirnoff
0f617ae48a Add in_pcb_var.h for KPIs that are private to in_pcb.c and in6_pcb.c. 2021-10-18 10:19:57 -07:00
Gleb Smirnoff
744a64bd92 in_pcb: garbage collect in_pcbrele() 2021-10-18 10:07:16 -07:00
Gleb Smirnoff
5a78df20ce in_pcb: garbage collect unused structure in_pcblist 2021-10-18 10:06:39 -07:00
Gleb Smirnoff
2144431c11 Remove in_ifaddr_lock acquisiton to access in_ifaddrhead.
An IPv4 address is embedded into an ifaddr which is freed
via epoch. And the in_ifaddrhead is already a CK list. Use
the network epoch to protect against use after free.

Next step would be to CK-ify the in_addr hash and get rid of the...

Reviewed by:		melifaro
Differential Revision:	https://reviews.freebsd.org/D32434
2021-10-13 10:04:46 -07:00
John Baldwin
c782ea8bb5 Add a switch structure for send tags.
Move the type and function pointers for operations on existing send
tags (modify, query, next, free) out of 'struct ifnet' and into a new
'struct if_snd_tag_sw'.  A pointer to this structure is added to the
generic part of send tags and is initialized by m_snd_tag_init()
(which now accepts a switch structure as a new argument in place of
the type).

Previously, device driver ifnet methods switched on the type to call
type-specific functions.  Now, those type-specific functions are saved
in the switch structure and invoked directly.  In addition, this more
gracefully permits multiple implementations of the same tag within a
driver.  In particular, NIC TLS for future Chelsio adapters will use a
different implementation than the existing NIC TLS support for T6
adapters.

Reviewed by:	gallatin, hselasky, kib (older version)
Sponsored by:	Chelsio Communications
Differential Revision:	https://reviews.freebsd.org/D31572
2021-09-14 11:43:41 -07:00
Mark Johnston
f161d294b9 Add missing sockaddr length and family validation to various protocols
Several protocol methods take a sockaddr as input.  In some cases the
sockaddr lengths were not being validated, or were validated after some
out-of-bounds accesses could occur.  Add requisite checking to various
protocol entry points, and convert some existing checks to assertions
where appropriate.

Reported by:	syzkaller+KASAN
Reviewed by:	tuexen, melifaro
MFC after:	2 weeks
Sponsored by:	The FreeBSD Foundation
Differential Revision:	https://reviews.freebsd.org/D29519
2021-05-03 13:35:19 -04:00
Gleb Smirnoff
1db08fbe3f tcp_input: always request read-locking of PCB for any pure SYN segment.
This is further rework of 08d9c92027.  Now we carry the knowledge of
lock type all the way through tcp_input() and also into tcp_twcheck().
Ideally the rlocking for pure SYNs should propagate all the way into
the alternative TCP stacks, but not yet today.

This should close a race when socket is bind(2)-ed but not yet
listen(2)-ed and a SYN-packet arrives racing with listen(2), discovered
recently by pho@.
2021-04-20 10:02:20 -07:00
Gleb Smirnoff
08d9c92027 tcp_input/syncache: acquire only read lock on PCB for SYN,!ACK packets
When packet is a SYN packet, we don't need to modify any existing PCB.
Normally SYN arrives on a listening socket, we either create a syncache
entry or generate syncookie, but we don't modify anything with the
listening socket or associated PCB. Thus create a new PCB lookup
mode - rlock if listening. This removes the primary contention point
under SYN flood - the listening socket PCB.

Sidenote: when SYN arrives on a synchronized connection, we still
don't need write access to PCB to send a challenge ACK or just to
drop. There is only one exclusion - tcptw recycling. However,
existing entanglement of tcp_input + stacks doesn't allow to make
this change small. Consider this patch as first approach to the problem.

Reviewed by:	rrs
Differential revision:	https://reviews.freebsd.org/D29576
2021-04-12 08:25:31 -07:00
Randall Stewart
1a714ff204 This pulls over all the changes that are in the netflix
tree that fix the ratelimit code. There were several bugs
in tcp_ratelimit itself and we needed further work to support
the multiple tag format coming for the joint TLS and Ratelimit dances.

    Sponsored by: Netflix Inc.
    Differential Revision:  https://reviews.freebsd.org/D28357
2021-01-28 11:53:05 -05:00
Hans Petter Selasky
093e723190 Add missing decrement of active ratelimit connections.
Reviewed by:	rrs@
MFC after:	1 week
Sponsored by:	Mellanox Technologies // NVIDIA Networking
2021-01-26 18:00:21 +01:00
Hans Petter Selasky
85d8d30f9f Don't allow allocating a new send tag on an INP which is being torn down.
This fixes a potential send tag leak.

Reviewed by:	rrs@
MFC after:	1 week
Sponsored by:	Mellanox Technologies // NVIDIA Networking
2021-01-26 18:00:02 +01:00
Andrew Gallatin
a034518ac8 Filter TCP connections to SO_REUSEPORT_LB listen sockets by NUMA domain
In order to efficiently serve web traffic on a NUMA
machine, one must avoid as many NUMA domain crossings as
possible. With SO_REUSEPORT_LB, a number of workers can share a
listen socket. However, even if a worker sets affinity to a core
or set of cores on a NUMA domain, it will receive connections
associated with all NUMA domains in the system. This will lead to
cross-domain traffic when the server writes to the socket or
calls sendfile(), and memory is allocated on the server's local
NUMA node, but transmitted on the NUMA node associated with the
TCP connection. Similarly, when the server reads from the socket,
he will likely be reading memory allocated on the NUMA domain
associated with the TCP connection.

This change provides a new socket ioctl, TCP_REUSPORT_LB_NUMA. A
server can now tell the kernel to filter traffic so that only
incoming connections associated with the desired NUMA domain are
given to the server. (Of course, in the case where there are no
servers sharing the listen socket on some domain, then as a
fallback, traffic will be hashed as normal to all servers sharing
the listen socket regardless of domain). This allows a server to
deal only with traffic that is local to its NUMA domain, and
avoids cross-domain traffic in most cases.

This patch, and a corresponding small patch to nginx to use
TCP_REUSPORT_LB_NUMA allows us to serve 190Gb/s of kTLS encrypted
https media content from dual-socket Xeons with only 13% (as
measured by pcm.x) cross domain traffic on the memory controller.

Reviewed by:	jhb, bz (earlier version), bcr (man page)
Tested by: gonzo
Sponsored by:	Netfix
Differential Revision:	https://reviews.freebsd.org/D21636
2020-12-19 22:04:46 +00:00
John Baldwin
36e0a362ac Add m_snd_tag_alloc() as a wrapper around if_snd_tag_alloc().
This gives a more uniform API for send tag life cycle management.

Reviewed by:	gallatin, hselasky
Sponsored by:	Netflix
Differential Revision:	https://reviews.freebsd.org/D27000
2020-10-29 23:28:39 +00:00
John Baldwin
98d7a8d9cd Call m_snd_tag_rele() to free send tags.
Send tags are refcounted and if_snd_tag_free() is called by
m_snd_tag_rele() when the last reference is dropped on a send tag.

Reviewed by:	gallatin, hselasky
Sponsored by:	Netflix
Differential Revision:	https://reviews.freebsd.org/D26995
2020-10-29 22:18:56 +00:00
John Baldwin
aebfdc1fec Store the new send tag in the right place.
r350501 added the 'st' parameter, but did not pass it down to
if_snd_tag_alloc().

Reviewed by:	gallatin
Sponsored by:	Netflix
Differential Revision:	https://reviews.freebsd.org/D26997
2020-10-29 22:14:34 +00:00
Alexander V. Chernikov
0c325f53f1 Implement flowid calculation for outbound connections to balance
connections over multiple paths.

Multipath routing relies on mbuf flowid data for both transit
 and outbound traffic. Current code fills mbuf flowid from inp_flowid
 for connection-oriented sockets. However, inp_flowid is currently
 not calculated for outbound connections.

This change creates simple hashing functions and starts calculating hashes
 for TCP,UDP/UDP-Lite and raw IP if multipath routes are present in the
 system.

Reviewed by:	glebius (previous version),ae
Differential Revision:	https://reviews.freebsd.org/D26523
2020-10-18 17:15:47 +00:00
Mateusz Guzik
662c13053f net: clean up empty lines in .c and .h files 2020-09-01 21:19:14 +00:00
Mike Karels
2fdbcbea76 Fix NULL-pointer bug from r361228.
Note that in_pcb_lport and in_pcb_lport_dest can be called with a NULL
local address for IPv6 sockets; handle it.  Found by syzkaller.

Reported by:	cem
MFC after:	1 month
2020-05-19 01:05:13 +00:00
Mike Karels
2510235150 Allow TCP to reuse local port with different destinations
Previously, tcp_connect() would bind a local port before connecting,
forcing the local port to be unique across all outgoing TCP connections
for the address family. Instead, choose a local port after selecting
the destination and the local address, requiring only that the tuple
is unique and does not match a wildcard binding.

Reviewed by:	tuexen (rscheff, rrs previous version)
MFC after:	1 month
Sponsored by:	Forcepoint LLC
Differential Revision:	https://reviews.freebsd.org/D24781
2020-05-18 22:53:12 +00:00
Alexander V. Chernikov
9ac7c6cfed Convert IP/IPv6 forwarding, ICMP processing and IP PCB laddr selection to
the new routing KPI.

Reviewed by:	ae
Differential Revision:	https://reviews.freebsd.org/D24245
2020-04-14 23:06:25 +00:00
Andrew Gallatin
98085bae8c make lacp's use_numa hashing aware of send tags
When I did the use_numa support, I missed the fact that there is
a separate hash function for send tag nic selection. So when
use_numa is enabled, ktls offload does not work properly, as it
does not reliably allocate a send tag on the proper egress nic
since different egress nics are selected for send-tag allocation
and packet transmit. To fix this, this change:

- refectors lacp_select_tx_port_by_hash() and
     lacp_select_tx_port() to make lacp_select_tx_port_by_hash()
     always called by lacp_select_tx_port()

-   pre-shifts flowids to convert them to hashes when calling lacp_select_tx_port_by_hash()

-   adds a numa_domain field to if_snd_tag_alloc_params

-   plumbs the numa domain into places where we allocate send tags

In testing with NIC TLS setup on a NUMA machine, I see thousands
of output errors before the change when enabling
kern.ipc.tls.ifnet.permitted=1. After the change, I see no
errors, and I see the NIC sysctl counters showing active TLS
offload sessions.

Reviewed by:	rrs, hselasky, jhb
Sponsored by:	Netflix
2020-03-09 13:44:51 +00:00
Pawel Biernacki
7029da5c36 Mark more nodes as CTLFLAG_MPSAFE or CTLFLAG_NEEDGIANT (17 of many)
r357614 added CTLFLAG_NEEDGIANT to make it easier to find nodes that are
still not MPSAFE (or already are but aren’t properly marked).
Use it in preparation for a general review of all nodes.

This is non-functional change that adds annotations to SYSCTL_NODE and
SYSCTL_PROC nodes using one of the soon-to-be-required flags.

Mark all obvious cases as MPSAFE.  All entries that haven't been marked
as MPSAFE before are by default marked as NEEDGIANT

Approved by:	kib (mentor, blanket)
Commented by:	kib, gallatin, melifaro
Differential Revision:	https://reviews.freebsd.org/D23718
2020-02-26 14:26:36 +00:00
Randall Stewart
481be5de9d White space cleanup -- remove trailing tab's or spaces
from any line.

Sponsored by:	Netflix Inc.
2020-02-12 13:31:36 +00:00
Gleb Smirnoff
c1604fe4d2 Make in_pcbladdr() require network epoch entered by its callers. Together
with this widen network epoch coverage up to tcp_connect() and udp_connect().

Revisions from r356974 and up to this revision cover D23187.

Differential Revision:	https://reviews.freebsd.org/D23187
2020-01-22 06:10:41 +00:00
Gleb Smirnoff
5c722e2ad3 The network epoch changes in the TCP stack combined with old r286227,
actually make removal of a PCB not needing ipi_lock in any form.  The
ipi_list_lock is sufficient.
2020-01-22 06:03:45 +00:00
Gleb Smirnoff
6a2954a17d Relax locking requirements for in_pcballoc(). All pcbinfo fields
modified by this function are protected by the PCB list lock that is
acquired inside the function.

This could have been done even before epoch changes, after r286227.
2020-01-22 05:58:29 +00:00
Gleb Smirnoff
2a4bd982d0 Introduce NET_EPOCH_CALL() macro and use it everywhere where we free
data based on the network epoch.   The macro reverses the argument
order of epoch_call(9) - first function, then its argument. NFC
2020-01-15 06:05:20 +00:00
Michael Tuexen
fe1274ee39 Fix race when accepting TCP connections.
When expanding a SYN-cache entry to a socket/inp a two step approach was
taken:
1) The local address was filled in, then the inp was added to the hash
   table.
2) The remote address was filled in and the inp was relocated in the
   hash table.
Before the epoch changes, a write lock was held when this happens and
the code looking up entries was holding a corresponding read lock.
Since the read lock is gone away after the introduction of the
epochs, the half populated inp was found during lookup.
This resulted in processing TCP segments in the context of the wrong
TCP connection.
This patch changes the above procedure in a way that the inp is fully
populated before inserted into the hash table.

Thanks to Paul <devgs@ukr.net> for reporting the issue on the net@
mailing list and for testing the patch!

Reviewed by:		rrs@
MFC after:		1 week
Sponsored by:		Netflix, Inc.
Differential Revision:	https://reviews.freebsd.org/D22971
2020-01-12 17:52:32 +00:00
Gleb Smirnoff
d797164a86 Since r353292 on input path we are always in network epoch, when
we lookup PCBs.  Thus, do not enter epoch recursively in
in_pcblookup_hash() and in6_pcblookup_hash().  Same applies to
tcp_ctlinput() and tcp6_ctlinput().

This leaves several sysctl(9) handlers that return PCB credentials
unprotected.  Add epoch enter/exit to all of them.

Differential Revision:	https://reviews.freebsd.org/D22197
2019-11-07 20:49:56 +00:00
Gleb Smirnoff
1a49612526 Mechanically convert INP_INFO_RLOCK() to NET_EPOCH_ENTER().
Remove few outdated comments and extraneous assertions.  No
functional change here.
2019-11-07 00:08:34 +00:00
Xin LI
903c4ee6ec Fix !INET build. 2019-08-02 22:43:09 +00:00
Randall Stewart
20abea6663 This adds the third step in getting BBR into the tree. BBR and
an updated rack depend on having access to the new
ratelimit api in this commit.

Sponsored by:	Netflix Inc.
Differential Revision:	https://reviews.freebsd.org/D20953
2019-08-01 14:17:31 +00:00
Hans Petter Selasky
59854ecf55 Convert all IPv4 and IPv6 multicast memberships into using a STAILQ
instead of a linear array.

The multicast memberships for the inpcb structure are protected by a
non-sleepable lock, INP_WLOCK(), which needs to be dropped when
calling the underlying possibly sleeping if_ioctl() method. When using
a linear array to keep track of multicast memberships, the computed
memory location of the multicast filter may suddenly change, due to
concurrent insertion or removal of elements in the linear array. This
in turn leads to various invalid memory access issues and kernel
panics.

To avoid this problem, put all multicast memberships on a STAILQ based
list. Then the memory location of the IPv4 and IPv6 multicast filters
become fixed during their lifetime and use after free and memory leak
issues are easier to track, for example by: vmstat -m | grep multi

All list manipulation has been factored into inline functions
including some macros, to easily allow for a future hash-list
implementation, if needed.

This patch has been tested by pho@ .

Differential Revision: https://reviews.freebsd.org/D20080
Reviewed by:	markj @
MFC after:	1 week
Sponsored by:	Mellanox Technologies
2019-06-25 11:54:41 +00:00
John Baldwin
fb3bc59600 Restructure mbuf send tags to provide stronger guarantees.
- Perform ifp mismatch checks (to determine if a send tag is allocated
  for a different ifp than the one the packet is being output on), in
  ip_output() and ip6_output().  This avoids sending packets with send
  tags to ifnet drivers that don't support send tags.

  Since we are now checking for ifp mismatches before invoking
  if_output, we can now try to allocate a new tag before invoking
  if_output sending the original packet on the new tag if allocation
  succeeds.

  To avoid code duplication for the fragment and unfragmented cases,
  add ip_output_send() and ip6_output_send() as wrappers around
  if_output and nd6_output_ifp, respectively.  All of the logic for
  setting send tags and dealing with send tag-related errors is done
  in these wrapper functions.

  For pseudo interfaces that wrap other network interfaces (vlan and
  lagg), wrapper send tags are now allocated so that ip*_output see
  the wrapper ifp as the ifp in the send tag.  The if_transmit
  routines rewrite the send tags after performing an ifp mismatch
  check.  If an ifp mismatch is detected, the transmit routines fail
  with EAGAIN.

- To provide clearer life cycle management of send tags, especially
  in the presence of vlan and lagg wrapper tags, add a reference count
  to send tags managed via m_snd_tag_ref() and m_snd_tag_rele().
  Provide a helper function (m_snd_tag_init()) for use by drivers
  supporting send tags.  m_snd_tag_init() takes care of the if_ref
  on the ifp meaning that code alloating send tags via if_snd_tag_alloc
  no longer has to manage that manually.  Similarly, m_snd_tag_rele
  drops the refcount on the ifp after invoking if_snd_tag_free when
  the last reference to a send tag is dropped.

  This also closes use after free races if there are pending packets in
  driver tx rings after the socket is closed (e.g. from tcpdrop).

  In order for m_free to work reliably, add a new CSUM_SND_TAG flag in
  csum_flags to indicate 'snd_tag' is set (rather than 'rcvif').
  Drivers now also check this flag instead of checking snd_tag against
  NULL.  This avoids false positive matches when a forwarded packet
  has a non-NULL rcvif that was treated as a send tag.

- cxgbe was relying on snd_tag_free being called when the inp was
  detached so that it could kick the firmware to flush any pending
  work on the flow.  This is because the driver doesn't require ACK
  messages from the firmware for every request, but instead does a
  kind of manual interrupt coalescing by only setting a flag to
  request a completion on a subset of requests.  If all of the
  in-flight requests don't have the flag when the tag is detached from
  the inp, the flow might never return the credits.  The current
  snd_tag_free command issues a flush command to force the credits to
  return.  However, the credit return is what also frees the mbufs,
  and since those mbufs now hold references on the tag, this meant
  that snd_tag_free would never be called.

  To fix, explicitly drop the mbuf's reference on the snd tag when the
  mbuf is queued in the firmware work queue.  This means that once the
  inp's reference on the tag goes away and all in-flight mbufs have
  been queued to the firmware, tag's refcount will drop to zero and
  snd_tag_free will kick in and send the flush request.  Note that we
  need to avoid doing this in the middle of ethofld_tx(), so the
  driver grabs a temporary reference on the tag around that loop to
  defer the free to the end of the function in case it sends the last
  mbuf to the queue after the inp has dropped its reference on the
  tag.

- mlx5 preallocates send tags and was using the ifp pointer even when
  the send tag wasn't in use.  Explicitly use the ifp from other data
  structures instead.

- Sprinkle some assertions in various places to assert that received
  packets don't have a send tag, and that other places that overwrite
  rcvif (e.g. 802.11 transmit) don't clobber a send tag pointer.

Reviewed by:	gallatin, hselasky, rgrimes, ae
Sponsored by:	Netflix
Differential Revision:	https://reviews.freebsd.org/D20117
2019-05-24 22:30:40 +00:00
Andrew Gallatin
50575ce11c Track TCP connection's NUMA domain in the inpcb
Drivers can now pass up numa domain information via the
mbuf numa domain field.  This information is then used
by TCP syncache_socket() to associate that information
with the inpcb. The domain information is then fed back
into transmitted mbufs in ip{6}_output(). This mechanism
is nearly identical to what is done to track RSS hash values
in the inp_flowid.

Follow on changes will use this information for lacp egress
port selection, binding TCP pacers to the appropriate NUMA
domain, etc.

Reviewed by:	markj, kib, slavash, bz, scottl, jtl, tuexen
Sponsored by:	Netflix
Differential Revision:	https://reviews.freebsd.org/D20028
2019-04-25 15:37:28 +00:00
John Baldwin
43b65e3c98 Don't check the inp socket pointer in in_pcboutput_eagain.
Reviewed by:	hps (by saying it was ok to be removed)
MFC after:	1 month
Sponsored by:	Netflix
2019-03-29 19:47:42 +00:00
Andrey V. Elsukov
c7ee62fcd5 In r335015 PCB destroing was made deferred using epoch_call().
But ipsec_delete_pcbpolicy() uses some VNET-virtualized variables,
and thus it needs VNET context, that is missing during gtaskqueue
executing. Use inp_vnet context to set curvnet in in_pcbfree_deferred().

PR:		235684
MFC after:	1 week
2019-02-13 15:46:05 +00:00
Gleb Smirnoff
a68cc38879 Mechanical cleanup of epoch(9) usage in network stack.
- Remove macros that covertly create epoch_tracker on thread stack. Such
  macros a quite unsafe, e.g. will produce a buggy code if same macro is
  used in embedded scopes. Explicitly declare epoch_tracker always.

- Unmask interface list IFNET_RLOCK_NOSLEEP(), interface address list
  IF_ADDR_RLOCK() and interface AF specific data IF_AFDATA_RLOCK() read
  locking macros to what they actually are - the net_epoch.
  Keeping them as is is very misleading. They all are named FOO_RLOCK(),
  while they no longer have lock semantics. Now they allow recursion and
  what's more important they now no longer guarantee protection against
  their companion WLOCK macros.
  Note: INP_HASH_RLOCK() has same problems, but not touched by this commit.

This is non functional mechanical change. The only functionally changed
functions are ni6_addrs() and ni6_store_addrs(), where we no longer enter
epoch recursively.

Discussed with:	jtl, gallatin
2019-01-09 01:11:19 +00:00
Mateusz Guzik
cc426dd319 Remove unused argument to priv_check_cred.
Patch mostly generated with cocinnelle:

@@
expression E1,E2;
@@

- priv_check_cred(E1,E2,0)
+ priv_check_cred(E1,E2)

Sponsored by:	The FreeBSD Foundation
2018-12-11 19:32:16 +00:00