Commit Graph

306 Commits

Author SHA1 Message Date
Mark Johnston
317fa5169d netinet: Remove the IP(V6)_RSS_LISTEN_BUCKET socket option
It has no effect, and an exp-run revealed that it is not in use.

PR:		261398 (exp-run)
Reviewed by:	mjg, glebius
Sponsored by:	Klara, Inc.
Differential Revision:	https://reviews.freebsd.org/D38822
2023-02-28 15:57:21 -05:00
Mark Johnston
3aff4ccdd7 netinet: Remove IP(V6)_BINDMULTI
This option was added in commit 0a100a6f1e but was never completed.
In particular, there is no logic to map flowids to different listening
sockets, so it accomplishes basically the same thing as SO_REUSEPORT.
Meanwhile, we've since added SO_REUSEPORT_LB, which at least tries to
balance among listening sockets using a hash of the 4-tuple and some
optional NUMA policy.

The option was never documented or completed, and an exp-run revealed
nothing using it in the ports tree.  Moreover, it complicates the
already very complicated in_pcbbind_setup(), and the checking in
in_pcbbind_check_bindmulti() is insufficient.  So, let's remove it.

PR:		261398 (exp-run)
Reviewed by:	glebius
Sponsored by:	Klara, Inc.
Differential Revision:	https://reviews.freebsd.org/D38574
2023-02-27 10:03:11 -05:00
Justin Hibbits
3d0d5b21c9 IfAPI: Explicitly include <net/if_private.h> in netstack
Summary:
In preparation of making if_t completely opaque outside of the netstack,
explicitly include the header.  <net/if_var.h> will stop including the
header in the future.

Sponsored by:	Juniper Networks, Inc.
Reviewed by:	glebius, melifaro
Differential Revision: https://reviews.freebsd.org/D38200
2023-01-31 15:02:16 -05:00
Elliott Mitchell
21cc0918c7 sys: Nuke double-semicolons
A distinct number of double-semicolons have ended up in FreeBSD.  Take a
pass at getting rid of many of these harmless typos.

Reviewed by: emaste, rrs
Pull Request: https://github.com/freebsd/freebsd-src/pull/609
Differential Revision: https://reviews.freebsd.org/D31716
2022-11-02 09:34:20 -06:00
Gleb Smirnoff
2e0e273927 netinet6: trim overly long lines in GET_PKTOPT_VAR(), fit into 80 chars 2022-10-13 09:03:38 -07:00
Gleb Smirnoff
53af690381 tcp: remove INP_TIMEWAIT flag
Mechanically cleanup INP_TIMEWAIT from the kernel sources.  After
0d7445193a, this commit shall not cause any functional changes.

Note: this flag was very often checked together with INP_DROPPED.
If we modify in_pcblookup*() not to return INP_DROPPED pcbs, we
will be able to remove most of this checks and turn them to
assertions.  Some of them can be turned into assertions right now,
but that should be carefully done on a case by case basis.

Differential revision:	https://reviews.freebsd.org/D36400
2022-10-06 19:24:37 -07:00
Gleb Smirnoff
46ddeb6be8 netinet6: retire ip6protosw.h
The netinet/ipprotosw.h and netinet6/ip6protosw.h were KAME relics, with
the former removed in f0ffb944d2 in 2001 and the latter survived until
today.  It has been reduced down to only one useful declaration that
moves to ip6_var.h

Reviewed by:		melifaro
Differential revision:	https://reviews.freebsd.org/D36726
2022-10-03 20:53:04 -07:00
Mateusz Guzik
dda6376b04 net: employ newly added pfil_mbuf_{in,out} where approriate
Reviewed by:	glebius
Sponsored by:	Rubicon Communications, LLC ("Netgate")
Differential Revision:	https://reviews.freebsd.org/D36454
2022-09-08 16:21:08 +00:00
Gleb Smirnoff
74ed2e8ab2 raw ip: fix regression with multicast and RSVP
With 61f7427f02 raw sockets protosw has wildcard pr_protocol.  Protocol
of a specific pcb is stored in inp_ip_p.

Reviewed by:		karels
Reported by:		karels
Differential revision:	https://reviews.freebsd.org/D36429
Fixes:			61f7427f02
2022-09-02 12:17:09 -07:00
Alexander V. Chernikov
50fa27e795 netinet6: fix interface handling for loopback traffic
Currently, processing of IPv6 local traffic is partially broken:
 link-local connection fails and global unicast connect() takes
 3 seconds to complete.
This happens due to the combination of multiple factors.
IPv6 code passes original interface "origifp" when passing
traffic via loopack to retain the scope that is mandatory for the
correct hadling of link-local traffic. First problem is that the logic
of passing source interface is not working correcly for TCP connections,
resulting in passing "origifp" on the first 2 connection attempts and
lo0 on the subsequent ones. Second problem is that source address
validation logic skips its checks iff the source interface is loopback,
which doesn't cover "origifp" case.
More detailed description is available at https://reviews.freebsd.org/D35732

Fix the first problem by untangling&simplifying ifp/origifp logic.
Fix the second problem by switching source address validation check to
using M_LOOP mbuf flag instead of interface type.

PR:		265089
Reviewed by:	ae, bz(previous version)
Differential Revision:	https://reviews.freebsd.org/D35732
MFC after:	2 weeks
2022-07-10 12:47:47 +00:00
Andrey V. Elsukov
7d98cc096b Fix ipfw fwd that doesn't work in some cases
For IPv4 use dst pointer as destination address in fib4_lookup().
It keeps destination address from IPv4 header and can be changed
when PACKET_TAG_IPFORWARD tag was set by packet filter.

For IPv6 override destination address with address from dst_sa.sin6_addr,
that was set from PACKET_TAG_IPFORWARD tag.

Reviewed by:	eugen
MFC after:	1 week
PR:		256828, 261697, 255705
Differential Revision: https://reviews.freebsd.org/D34732
2022-04-11 14:16:43 +03:00
Andrew Gallatin
9ba117960e Fix a memory leak when ip_output_send() returns EAGAIN due to send tag issues
When ip_output_send() returns EAGAIN due to issues with send tags (route
change, lagg failover, etc), it must free the mbuf. This is because
ip_output_send() was written as a wrapper/replacement for a direct
call to  if_output(), and the contract with if_output() has
historically been that it owns the mbufs once called. When
ip_output_send() failed to free mbufs, it violated this assumption
and lead to leaked mbufs.

This was noticed when using NIC TLS in combination with hardware
rate-limited connections. When seeing lots of NIC output drops
triggered ratelimit send tag changes, we noticed we were leaking
ktls_sessions, send tags and mbufs. This was due ip_output_send()
leaking mbufs which held references to ktls_sessions, which in
turn held references to send tags.

Many thanks to jbh, rrs, hselasky and markj for their help in
debugging this.

Sponsored by: Netflix
Differential Revision: https://reviews.freebsd.org/D34054
Reviewed by: hselasky, jhb, rrs
MFC after: 2 weeks
2022-01-27 10:34:34 -05:00
Kristof Provost
9f5432d5e5 netinet6: ip6_setpktopt() requires NET_EPOCH
ip6_setpktopt() can call ifnet_byindex() which requires epoch. Mark the
function as requiring NET_EPOCH, and ensure we enter it priot to calling
it.

Reported-by: syzbot+92526116441688fea8a3@syzkaller.appspotmail.com
Reviewed by:	glebius
Sponsored by:	Rubicon Communications, LLC ("Netgate")
Differential Revision:	https://reviews.freebsd.org/D33462
2021-12-17 17:30:36 +01:00
Gleb Smirnoff
d74b7baeb0 ifnet_byindex() actually requires network epoch
Sweep over potentially unsafe calls to ifnet_byindex() and wrap them
in epoch.  Most of the code touched remains unsafe, as the returned
pointer is being used after epoch exit.  Mark that with a comment.

Validate the index argument inside the function, reducing argument
validation requirement from the callers and making V_if_index
private to if.c.

Reviewed by:		melifaro
Differential revision:	https://reviews.freebsd.org/D33263
2021-12-06 09:32:31 -08:00
Mark Johnston
44775b163b netinet: Remove unneeded mb_unmapped_to_ext() calls
in_cksum_skip() now handles unmapped mbufs on platforms where they're
permitted.

Reviewed by:	glebius, jhb
MFC after:	1 week
Sponsored by:	The FreeBSD Foundation
Differential Revision:	https://reviews.freebsd.org/D33097
2021-11-24 13:31:16 -05:00
Michael Tuexen
a8d54fc903 ipv6: Fix getsockopt() for some IPPROTO_IPV6 level socket options
Fix getsockopt() for the IPPROTO_IPV6 level socket options with the
following names: IPV6_HOPOPTS, IPV6_RTHDR, IPV6_RTHDRDSTOPTS,
IPV6_DSTOPTS, and IPV6_NEXTHOP.

Reviewed by:		markj
MFC after:		3 days
Sponsored by:		Netflix, Inc.
Differential Revision:	https://reviews.freebsd.org/D31458
2021-08-09 09:29:13 +02:00
Ryan Stone
2290dfb40f Enter the net epoch before calling ip6_setpktopts
ip6_setpktopts() can look up ifnets via ifnet_by_index(), which
is only safe in the net epoch.  Ensure that callers are in the net
epoch before calling this function.

Sponsored by: Dell EMC Isilon
MFC after: 4 weeks
Reviewed by: donner, kp
Differential Revision: https://reviews.freebsd.org/D30630
2021-06-04 13:18:11 -04:00
Kristof Provost
bb4a7d94b9 net: Introduce IPV6_DSCP(), IPV6_ECN() and IPV6_TRAFFIC_CLASS() macros
Introduce convenience macros to retrieve the DSCP, ECN or traffic class
bits from an IPv6 header.

Use them where appropriate.

Reviewed by:	ae (previous version), rscheff, tuexen, rgrimes
MFC after:	2 weeks
Sponsored by:	Rubicon Communications, LLC ("Netgate")
Differential Revision:	https://reviews.freebsd.org/D29056
2021-03-04 20:56:48 +01:00
Alexander V. Chernikov
1bd44b11e5 Do not reference returned ifa in in6_ifawithifp().
The only place where in6_ifawithifp() is used is ip6_output(),
 which uses the returned ifa to bump traffic counters.
Given ifa stability guarantees is provided by epoch, do not refcount ifa.

This eliminates 2 atomic ops from IPv6 fast path.

Reviewed By:	rstone
MFC after:	1 week
Differential Revision:	https://reviews.freebsd.org/D28649
2021-02-14 10:11:18 +00:00
Gleb Smirnoff
3f43ada98c Catch up with 6edfd179c8: mechanically rename IFCAP_NOMAP to IFCAP_MEXTPG.
Originally IFCAP_NOMAP meant that the mbuf has external storage pointer
that points to unmapped address.  Then, this was extended to array of
such pointers.  Then, such mbufs were augmented with header/trailer.
Basically, extended mbufs are extended, and set of features is subject
to change.  The new name should be generic enough to avoid further
renaming.
2021-01-29 11:46:24 -08:00
Alexander V. Chernikov
0c325f53f1 Implement flowid calculation for outbound connections to balance
connections over multiple paths.

Multipath routing relies on mbuf flowid data for both transit
 and outbound traffic. Current code fills mbuf flowid from inp_flowid
 for connection-oriented sockets. However, inp_flowid is currently
 not calculated for outbound connections.

This change creates simple hashing functions and starts calculating hashes
 for TCP,UDP/UDP-Lite and raw IP if multipath routes are present in the
 system.

Reviewed by:	glebius (previous version),ae
Differential Revision:	https://reviews.freebsd.org/D26523
2020-10-18 17:15:47 +00:00
Richard Scheffenegger
868aabb470 Add IP(V6)_VLAN_PCP to set 802.1 priority per-flow.
This adds a new IP_PROTO / IPV6_PROTO setsockopt (getsockopt)
option IP(V6)_VLAN_PCP, which can be set to -1 (interface
default), or explicitly to any priority between 0 and 7.

Note that for untagged traffic, explicitly adding a
priority will insert a special 801.1Q vlan header with
vlan ID = 0 to carry the priority setting

Reviewed by:	gallatin, rrs
MFC after:	2 weeks
Sponsored by:	NetApp, Inc.
Differential Revision:	https://reviews.freebsd.org/D26409
2020-10-09 12:06:43 +00:00
Navdeep Parhar
b092fd6c97 if_vxlan(4): add support for hardware assisted checksumming, TSO, and RSS.
This lets a VXLAN pseudo-interface take advantage of hardware checksumming (tx
and rx), TSO, and RSS if the NIC is capable of performing these operations on
inner VXLAN traffic.

A VXLAN interface inherits the capabilities of its vxlandev interface if one is
specified or of the interface that hosts the vxlanlocal address. If other
interfaces will carry traffic for that VXLAN then they must have the same
hardware capabilities.

On transmit, if_vxlan verifies that the outbound interface has the required
capabilities and then translates the CSUM_ flags to their inner equivalents.
This tells the hardware ifnet that it needs to operate on the inner frame and
not the outer VXLAN headers.

An event is generated when a VXLAN ifnet starts. This allows hardware drivers to
configure their devices to expect VXLAN traffic on the specified incoming port.

On receive, the hardware does RSS and checksum verification on the inner frame.
if_vxlan now does a direct netisr dispatch to take full advantage of RSS. It is
not very clear why it didn't do this already.

Future work:
Rx: it should be possible to avoid the first trip up the protocol stack to get
the frame to if_vxlan just so it can decapsulate and requeue for a second trip
up the stack. The hardware NIC driver could directly call an if_vxlan receive
routine for VXLAN traffic instead.

Rx: LRO. depends on what happens with the previous item. There will have to to
be a mechanism to indicate that it's time for if_vxlan to flush its LRO state.

Reviewed by:	kib@
Relnotes:	Yes
Sponsored by:	Chelsio Communications
Differential Revision:	https://reviews.freebsd.org/D25873
2020-09-18 02:37:57 +00:00
Mateusz Guzik
662c13053f net: clean up empty lines in .c and .h files 2020-09-01 21:19:14 +00:00
Mark Johnston
19afc65a7e ip6_output(): Check the return value of in6_getlinkifnet().
If the destination address has an embedded scope ID, make sure that it
corresponds to a valid ifnet before proceeding.  Otherwise a sendto()
with a bogus link-local address can trigger a NULL pointer dereference.

Reported by:	syzkaller
Reviewed by:	ae
Fixes:		r358572
Sponsored by:	The FreeBSD Foundation
Differential Revision:	https://reviews.freebsd.org/D25887
2020-07-30 17:43:23 +00:00
Mark Johnston
95033af923 Add the SCTP_SUPPORT kernel option.
This is in preparation for enabling a loadable SCTP stack.  Analogous to
IPSEC/IPSEC_SUPPORT, the SCTP_SUPPORT kernel option must be configured
in order to support a loadable SCTP implementation.

Discussed with:	tuexen
MFC after:	2 weeks
Sponsored by:	The FreeBSD Foundation
2020-06-18 19:32:34 +00:00
Alexander V. Chernikov
1483c1c508 Replace ip6_ouput fib6_lookup_nh_<ext|basic> calls with fib6_lookup().
fib6_lookup_nh_ represents pre-epoch generation of fib api,
providing less guarantees over pointer validness and requiring
on-stack data copying.

Conversion is straight-forwarded, as the only 2 differences are
requirement of running in network epoch and the need to handle
RTF_GATEWAY case in the caller code.

Reviewed by:	ae
Differential Revision:	https://reviews.freebsd.org/D24973
2020-05-28 07:29:44 +00:00
Andrew Gallatin
d7452d89ad IPv6: sync IP_NO_SND_TAG_RL support from IPv4
The IP_NO_SND_TAG_RL flag to ip{,6}_output() means that the packets
being sent should bypass hardware rate limiting. This is typically used
by modern TCP stacks for rexmits.

This support was added to IPv4 in r352657, but never added to IPv6, even
though rack and bbr call ip6_output() with this flag.

Reviewed by:	rrs
Sponsored by:	Netflix
Differential Revision:	https://reviews.freebsd.org/D24822
2020-05-12 14:01:12 +00:00
Andrew Gallatin
84af4cc153 Fix the build
Back out the IPv6 portion of r360903, as the stamp_tag param
is apparently not supported in upstream FreeBSD.

Sponsored by:	Netflix
Pointy hat to: gallatin
2020-05-11 21:23:22 +00:00
Andrew Gallatin
6043ac201a Ktls: never skip stamping tags for NIC TLS
The newer RACK and BBR TCP stacks have added a mechanism
to disable hardware packet pacing for TCP retransmits.
This mechanism works by skipping the send-tag stamp
on rate-limited connections when the TCP stack calls
ip_output() with the IP_NO_SND_TAG_RL flag set.

When doing NIC TLS, we must ignore this flag, as
NIC TLS packets must always be stamped.  Failure
to stamp a NIC TLS packet will result in crypto
issues.

Reviewed by:	hselasky, rrs
Sponsored by:	Netflix, Mellanox
2020-05-11 19:17:33 +00:00
Gleb Smirnoff
7b6c99d08d Step 3: anonymize struct mbuf_ext_pgs and move all its fields into mbuf
within m_epg namespace.
All edits except the 'struct mbuf' declaration and mb_dupcl() were done
mechanically with sed:

s/->m_ext_pgs.nrdy/->m_epg_nrdy/g
s/->m_ext_pgs.hdr_len/->m_epg_hdrlen/g
s/->m_ext_pgs.trail_len/->m_epg_trllen/g
s/->m_ext_pgs.first_pg_off/->m_epg_1st_off/g
s/->m_ext_pgs.last_pg_len/->m_epg_last_len/g
s/->m_ext_pgs.flags/->m_epg_flags/g
s/->m_ext_pgs.record_type/->m_epg_record_type/g
s/->m_ext_pgs.enc_cnt/->m_epg_enc_cnt/g
s/->m_ext_pgs.tls/->m_epg_tls/g
s/->m_ext_pgs.so/->m_epg_so/g
s/->m_ext_pgs.seqno/->m_epg_seqno/g
s/->m_ext_pgs.stailq/->m_epg_stailq/g

Reviewed by:	gallatin
Differential Revision:	https://reviews.freebsd.org/D24598
2020-05-03 00:12:56 +00:00
Alexander V. Chernikov
983066f05b Convert route caching to nexthop caching.
This change is build on top of nexthop objects introduced in r359823.

Nexthops are separate datastructures, containing all necessary information
 to perform packet forwarding such as gateway interface and mtu. Nexthops
 are shared among the routes, providing more pre-computed cache-efficient
 data while requiring less memory. Splitting the LPM code and the attached
 data solves multiple long-standing problems in the routing layer,
 drastically reduces the coupling with outher parts of the stack and allows
 to transparently introduce faster lookup algorithms.

Route caching was (re)introduced to minimise (slow) routing lookups, allowing
 for notably better performance for large TCP senders. Caching works by
 acquiring rtentry reference, which is protected by per-rtentry mutex.
 If the routing table is changed (checked by comparing the rtable generation id)
 or link goes down, cache record gets withdrawn.

Nexthops have the same reference counting interface, backed by refcount(9).
This change merely replaces rtentry with the actual forwarding nextop as a
 cached object, which is mostly mechanical. Other moving parts like cache
 cleanup on rtable change remains the same.

Differential Revision:	https://reviews.freebsd.org/D24340
2020-04-25 09:06:11 +00:00
Andrew Gallatin
23feb56348 KTLS: Re-work unmapped mbufs to carry ext_pgs in the mbuf itself.
While the original implementation of unmapped mbufs was a large
step forward in terms of reducing cache misses by enabling mbufs
to carry more than a single page for sendfile, they are rather
cache unfriendly when accessing the ext_pgs metadata and
data. This is because the ext_pgs part of the mbuf is allocated
separately, and almost guaranteed to be cold in cache.

This change takes advantage of the fact that unmapped mbufs
are never used at the same time as pkthdr mbufs. Given this
fact, we can overlap the ext_pgs metadata with the mbuf
pkthdr, and carry the ext_pgs meta directly in the mbuf itself.
Similarly, we can carry the ext_pgs data (TLS hdr/trailer/array
of pages) directly after the existing m_ext.

In order to be able to carry 5 pages (which is the minimum
required for a 16K TLS record which is not perfectly aligned) on
LP64, I've had to steal ext_arg2. The only user of this in the
xmit path is sendfile, and I've adjusted it to use arg1 when
using unmapped mbufs.

This change is almost entirely mechanical, except that we
change mb_alloc_ext_pgs() to no longer allow allocating
pkthdrs, the change to avoid ext_arg2 as mentioned above,
and the removal of the ext_pgs zone,

This change saves roughly 2% "raw" CPU (~59% -> 57%), or over
3% "scaled" CPU on a Netflix 100% software kTLS workload at
90+ Gb/s on Broadwell Xeons.

In a follow-on commit, I plan to remove some hacks to avoid
access ext_pgs fields of mbufs, since they will now be in
cache.

Many thanks to glebius for helping to make this better in
the Netflix tree.

Reviewed by:	hselasky, jhb, rrs, glebius (early version)
Sponsored by:	Netflix
Differential Revision:	https://reviews.freebsd.org/D24213
2020-04-14 14:46:06 +00:00
Mark Johnston
e02582d1ae Fix synchronization in the IPV6_2292PKTOPTIONS set handler.
The inpcb needs to be locked when we update output packet options.
Otherwise it is possible for the IPV6_2292PKTOPTIONS handler to free
packet option structures while another thread is reading or updating
them.

Note that the option handler is still kind of broken.  For instance it
frees all options before performing privilege checks for individual
options.  However, this can be fixed separately.

Reported by:	syzbot+52eb0fd4ddc119787f9d@syzkaller.appspotmail.com
Reviewed by:	bz, tuexen
MFC after:	2 weeks
Sponsored by:	The FreeBSD Foundation
Differential Revision:	https://reviews.freebsd.org/D24125
2020-03-19 21:38:52 +00:00
Bjoern A. Zeeb
8483fce695 ip6: retire in6_selectroute_fib() as promised 8 years ago
In r231852 I added in6_selectroute_fib() as a compat function with the
fibnum as an extra argument compared to in6_selectroute() to keep the
KPI stable.
Way too late retire this function again and add the fib to in6_selectroute()
which also only has a single consumer now and was an orphan function before.
2020-03-03 13:48:12 +00:00
Bjoern A. Zeeb
000c42faf3 ip6_output: use new routing KPI when not passed a cached route
Implement the equivalent of r347375 (IPv4) for the IPv6 output path.
In IPv6 we get passed a cached route (and inp) by udp6_output()
depending on whether we acquired a write lock on the INP.
In case we neither bind nor connect a first UDP packet would come in
with a cached route (wlocked) and all further packets would not.
In case we bind and do not connect we never write-lock the inp.

When we do not pass in a cached route, rather than providing the
storage for a route locally and pass it over the old lookup code
and down the stack, use the new route lookup KPI and acquire all
details we need to send the packet.

Compared to the IPv4 code the IPv6 code has a couple of possible
complications: given an option with a routing hdr/caching route there,
and path mtu (ro_pmtu) case which now equally has to deal with the
possibility of having a route which is NULL passed in, and the
fwd_tag in case a firewall changes the next hop (something to
factor out in the future).

Sponsored by:	Netflix
Reviewed by:	glebius
MFC after:	2 weeks
Differential Revision:	https://reviews.freebsd.org/D23886
2020-03-03 11:32:47 +00:00
Bjoern A. Zeeb
3db6053160 ip6_output: fix regression introduced in r358167 for ipv6 fragmentation
When moving the calculations for the optlen into the if (opt) block
which deals with possible extension headers I failed to initialise
unfragpartlen to the ipv6 header length if there were no extension
headers present.  Correct that mistake to make IPv6 fragment length
calculcations work again.

Reported by:	hselasky, kp
OKed by:	hselasky, kp
MFC after:	3 days
X-MFC with:	r358167
PR:		244393
2020-02-25 15:03:41 +00:00
Bjoern A. Zeeb
3459050c9a Fix IPv6 checksums when exthdrs are present.
In two places in ip6_output we are doing (delayed) checksum calculations.
The initial logic came from SCTP in r205075,205104 and later I copied
and adjusted it for the TCP|UDP case in r235958.
The problem was that the original SCTP offsets were already wrong for any
case with extension headers present given IPv6 extension headers are not
part of the pseudo checksum calculations.
The later changes do not help in case there is checksum offloading as for
extension headers (incl. fragments) we do currrently never offload as we
have no infrastructure to know whether the NIC can handle these cases.

Correct the offsets for delayed checksum calculations and properly handle
mbuf flags.  In addition harmonize the almost identical duplicate code.

While here eliminate the now unneeded variable hlen and add an always
missing mtod() call in the 1-b and 3 cases after the introduction of
the mb_unmapped_to_ext() calls.

Reported by:	Francis Dupont (fdupont isc.org)
PR:		243675
MFC after:	6 days
Reviewed by:	markj (earlier version), gallatin
Differential Revision:	https://reviews.freebsd.org/D23760
2020-02-24 19:12:20 +00:00
Bjoern A. Zeeb
a1a6c01e41 ip6_output: improve extension header handling
Move IPv6 source address checks from after extension header heandling
to the top of the function. If we do not pass these checks there is
no reason to do a lot of work upfront.

Fold extension header preparations and length calculations together into
a single branch and macro rather than doing them sequentially.
Likewise move extension header concatination into a single branch block
only doing it if we recorded any extension header length length.

Reviewed by:	melifaro (earlier version), markj, gallatin
Sponsored by:	Netflix (partially, originally)
MFC after:	1 week
Differential Revision:	https://reviews.freebsd.org/D23740
2020-02-20 10:56:12 +00:00
Bjoern A. Zeeb
7c1daefe2c ip6_output: update comments.
Clear up some comments and improve to panic messages.

No functional changes.

MFC after:	3 days
2020-02-18 11:28:00 +00:00
Gleb Smirnoff
b955545386 Make ip6_output() and ip_output() require network epoch.
All callers that before may called into these functions
without network epoch now must enter it.
2020-01-22 05:51:22 +00:00
Gleb Smirnoff
e00ee1a9f4 In r343631 error code for a packet blocked by a firewall was
changed from EACCES to EPERM.  This change was not intentional,
so fix that.  Return EACCESS if a firewall forbids sending.

Noticed by:	ae
2020-01-01 17:32:20 +00:00
Gleb Smirnoff
1e4f4e56b9 ip6_output() has a complex set of gotos, and some can jump out of
the epoch section towards return statement. Since entering epoch
is cheap, it is easier to cover the whole function with epoch,
rather than try to properly maintain its state.
2019-10-09 17:02:28 +00:00
Mark Johnston
cb49ec5431 Improve locking in the IPV6_V6ONLY socket option handler.
Acquire the inp lock before checking whether the socket is already bound,
and around updates to the inp_vflag field.

MFC after:	1 week
Sponsored by:	The FreeBSD Foundation
Differential Revision:	https://reviews.freebsd.org/D21867
2019-10-07 23:35:23 +00:00
Gleb Smirnoff
b8a6e03fac Widen NET_EPOCH coverage.
When epoch(9) was introduced to network stack, it was basically
dropped in place of existing locking, which was mutexes and
rwlocks. For the sake of performance mutex covered areas were
as small as possible, so became epoch covered areas.

However, epoch doesn't introduce any contention, it just delays
memory reclaim. So, there is no point to minimise epoch covered
areas in sense of performance. Meanwhile entering/exiting epoch
also has non-zero CPU usage, so doing this less often is a win.

Not the least is also code maintainability. In the new paradigm
we can assume that at any stage of processing a packet, we are
inside network epoch. This makes coding both input and output
path way easier.

On output path we already enter epoch quite early - in the
ip_output(), in the ip6_output().

This patch does the same for the input path. All ISR processing,
network related callouts, other ways of packet injection to the
network stack shall be performed in net_epoch. Any leaf function
that walks network configuration now asserts epoch.

Tricky part is configuration code paths - ioctls, sysctls. They
also call into leaf functions, so some need to be changed.

This patch would introduce more epoch recursions (see EPOCH_TRACE)
than we had before. They will be cleaned up separately, as several
of them aren't trivial. Note, that unlike a lock recursion the
epoch recursion is safe and just wastes a bit of resources.

Reviewed by:	gallatin, hselasky, cy, adrian, kristof
Differential Revision:	https://reviews.freebsd.org/D19111
2019-10-07 22:40:05 +00:00
John Baldwin
b2e60773c6 Add kernel-side support for in-kernel TLS.
KTLS adds support for in-kernel framing and encryption of Transport
Layer Security (1.0-1.2) data on TCP sockets.  KTLS only supports
offload of TLS for transmitted data.  Key negotation must still be
performed in userland.  Once completed, transmit session keys for a
connection are provided to the kernel via a new TCP_TXTLS_ENABLE
socket option.  All subsequent data transmitted on the socket is
placed into TLS frames and encrypted using the supplied keys.

Any data written to a KTLS-enabled socket via write(2), aio_write(2),
or sendfile(2) is assumed to be application data and is encoded in TLS
frames with an application data type.  Individual records can be sent
with a custom type (e.g. handshake messages) via sendmsg(2) with a new
control message (TLS_SET_RECORD_TYPE) specifying the record type.

At present, rekeying is not supported though the in-kernel framework
should support rekeying.

KTLS makes use of the recently added unmapped mbufs to store TLS
frames in the socket buffer.  Each TLS frame is described by a single
ext_pgs mbuf.  The ext_pgs structure contains the header of the TLS
record (and trailer for encrypted records) as well as references to
the associated TLS session.

KTLS supports two primary methods of encrypting TLS frames: software
TLS and ifnet TLS.

Software TLS marks mbufs holding socket data as not ready via
M_NOTREADY similar to sendfile(2) when TLS framing information is
added to an unmapped mbuf in ktls_frame().  ktls_enqueue() is then
called to schedule TLS frames for encryption.  In the case of
sendfile_iodone() calls ktls_enqueue() instead of pru_ready() leaving
the mbufs marked M_NOTREADY until encryption is completed.  For other
writes (vn_sendfile when pages are available, write(2), etc.), the
PRUS_NOTREADY is set when invoking pru_send() along with invoking
ktls_enqueue().

A pool of worker threads (the "KTLS" kernel process) encrypts TLS
frames queued via ktls_enqueue().  Each TLS frame is temporarily
mapped using the direct map and passed to a software encryption
backend to perform the actual encryption.

(Note: The use of PHYS_TO_DMAP could be replaced with sf_bufs if
someone wished to make this work on architectures without a direct
map.)

KTLS supports pluggable software encryption backends.  Internally,
Netflix uses proprietary pure-software backends.  This commit includes
a simple backend in a new ktls_ocf.ko module that uses the kernel's
OpenCrypto framework to provide AES-GCM encryption of TLS frames.  As
a result, software TLS is now a bit of a misnomer as it can make use
of hardware crypto accelerators.

Once software encryption has finished, the TLS frame mbufs are marked
ready via pru_ready().  At this point, the encrypted data appears as
regular payload to the TCP stack stored in unmapped mbufs.

ifnet TLS permits a NIC to offload the TLS encryption and TCP
segmentation.  In this mode, a new send tag type (IF_SND_TAG_TYPE_TLS)
is allocated on the interface a socket is routed over and associated
with a TLS session.  TLS records for a TLS session using ifnet TLS are
not marked M_NOTREADY but are passed down the stack unencrypted.  The
ip_output_send() and ip6_output_send() helper functions that apply
send tags to outbound IP packets verify that the send tag of the TLS
record matches the outbound interface.  If so, the packet is tagged
with the TLS send tag and sent to the interface.  The NIC device
driver must recognize packets with the TLS send tag and schedule them
for TLS encryption and TCP segmentation.  If the the outbound
interface does not match the interface in the TLS send tag, the packet
is dropped.  In addition, a task is scheduled to refresh the TLS send
tag for the TLS session.  If a new TLS send tag cannot be allocated,
the connection is dropped.  If a new TLS send tag is allocated,
however, subsequent packets will be tagged with the correct TLS send
tag.  (This latter case has been tested by configuring both ports of a
Chelsio T6 in a lagg and failing over from one port to another.  As
the connections migrated to the new port, new TLS send tags were
allocated for the new port and connections resumed without being
dropped.)

ifnet TLS can be enabled and disabled on supported network interfaces
via new '[-]txtls[46]' options to ifconfig(8).  ifnet TLS is supported
across both vlan devices and lagg interfaces using failover, lacp with
flowid enabled, or lacp with flowid enabled.

Applications may request the current KTLS mode of a connection via a
new TCP_TXTLS_MODE socket option.  They can also use this socket
option to toggle between software and ifnet TLS modes.

In addition, a testing tool is available in tools/tools/switch_tls.
This is modeled on tcpdrop and uses similar syntax.  However, instead
of dropping connections, -s is used to force KTLS connections to
switch to software TLS and -i is used to switch to ifnet TLS.

Various sysctls and counters are available under the kern.ipc.tls
sysctl node.  The kern.ipc.tls.enable node must be set to true to
enable KTLS (it is off by default).  The use of unmapped mbufs must
also be enabled via kern.ipc.mb_use_ext_pgs to enable KTLS.

KTLS is enabled via the KERN_TLS kernel option.

This patch is the culmination of years of work by several folks
including Scott Long and Randall Stewart for the original design and
implementation; Drew Gallatin for several optimizations including the
use of ext_pgs mbufs, the M_NOTREADY mechanism for TLS records
awaiting software encryption, and pluggable software crypto backends;
and John Baldwin for modifications to support hardware TLS offload.

Reviewed by:	gallatin, hselasky, rrs
Obtained from:	Netflix
Sponsored by:	Netflix, Chelsio Communications
Differential Revision:	https://reviews.freebsd.org/D21277
2019-08-27 00:01:56 +00:00
Bjoern A. Zeeb
0ecd976e80 IPv6 cleanup: kernel
Finish what was started a few years ago and harmonize IPv6 and IPv4
kernel names.  We are down to very few places now that it is feasible
to do the change for everything remaining with causing too much disturbance.

Remove "aliases" for IPv6 names which confusingly could indicate
that we are talking about a different data structure or field or
have two fields, one for each address family.
Try to follow common conventions used in FreeBSD.

* Rename sin6p to sin6 as that is how it is spelt in most places.
* Remove "aliases" (#defines) for:
  - in6pcb which really is an inpcb and nothing separate
  - sotoin6pcb which is sotoinpcb (as per above)
  - in6p_sp which is inp_sp
  - in6p_flowinfo which is inp_flow
* Try to use ia6 for in6_addr rather than in6p.
* With all these gone  also rename the in6p variables to inp as
  that is what we call it in most of the network stack including
  parts of netinet6.

The reasons behind this cleanup are that we try to further
unify netinet and netinet6 code where possible and that people
will less ignore one or the other protocol family when doing
code changes as they may not have spotted places due to different
names for the same thing.

No functional changes.

Discussed with:		tuexen (SCTP changes)
MFC after:		3 months
Sponsored by:		Netflix
2019-08-02 07:41:36 +00:00
John Baldwin
82334850ea Add an external mbuf buffer type that holds multiple unmapped pages.
Unmapped mbufs allow sendfile to carry multiple pages of data in a
single mbuf, without mapping those pages.  It is a requirement for
Netflix's in-kernel TLS, and provides a 5-10% CPU savings on heavy web
serving workloads when used by sendfile, due to effectively
compressing socket buffers by an order of magnitude, and hence
reducing cache misses.

For this new external mbuf buffer type (EXT_PGS), the ext_buf pointer
now points to a struct mbuf_ext_pgs structure instead of a data
buffer.  This structure contains an array of physical addresses (this
reduces cache misses compared to an earlier version that stored an
array of vm_page_t pointers).  It also stores additional fields needed
for in-kernel TLS such as the TLS header and trailer data that are
currently unused.  To more easily detect these mbufs, the M_NOMAP flag
is set in m_flags in addition to M_EXT.

Various functions like m_copydata() have been updated to safely access
packet contents (using uiomove_fromphys()), to make things like BPF
safe.

NIC drivers advertise support for unmapped mbufs on transmit via a new
IFCAP_NOMAP capability.  This capability can be toggled via the new
'nomap' and '-nomap' ifconfig(8) commands.  For NIC drivers that only
transmit packet contents via DMA and use bus_dma, adding the
capability to if_capabilities and if_capenable should be all that is
required.

If a NIC does not support unmapped mbufs, they are converted to a
chain of mapped mbufs (using sf_bufs to provide the mapping) in
ip_output or ip6_output.  If an unmapped mbuf requires software
checksums, it is also converted to a chain of mapped mbufs before
computing the checksum.

Submitted by:	gallatin (earlier version)
Reviewed by:	gallatin, hselasky, rrs
Discussed with:	ae, kp (firewalls)
Relnotes:	yes
Sponsored by:	Netflix
Differential Revision:	https://reviews.freebsd.org/D20616
2019-06-29 00:48:33 +00:00
John Baldwin
77a0144145 Sort opt_foo.h #includes and add a missing blank line in ip_output(). 2019-06-11 22:07:39 +00:00
John Baldwin
fb3bc59600 Restructure mbuf send tags to provide stronger guarantees.
- Perform ifp mismatch checks (to determine if a send tag is allocated
  for a different ifp than the one the packet is being output on), in
  ip_output() and ip6_output().  This avoids sending packets with send
  tags to ifnet drivers that don't support send tags.

  Since we are now checking for ifp mismatches before invoking
  if_output, we can now try to allocate a new tag before invoking
  if_output sending the original packet on the new tag if allocation
  succeeds.

  To avoid code duplication for the fragment and unfragmented cases,
  add ip_output_send() and ip6_output_send() as wrappers around
  if_output and nd6_output_ifp, respectively.  All of the logic for
  setting send tags and dealing with send tag-related errors is done
  in these wrapper functions.

  For pseudo interfaces that wrap other network interfaces (vlan and
  lagg), wrapper send tags are now allocated so that ip*_output see
  the wrapper ifp as the ifp in the send tag.  The if_transmit
  routines rewrite the send tags after performing an ifp mismatch
  check.  If an ifp mismatch is detected, the transmit routines fail
  with EAGAIN.

- To provide clearer life cycle management of send tags, especially
  in the presence of vlan and lagg wrapper tags, add a reference count
  to send tags managed via m_snd_tag_ref() and m_snd_tag_rele().
  Provide a helper function (m_snd_tag_init()) for use by drivers
  supporting send tags.  m_snd_tag_init() takes care of the if_ref
  on the ifp meaning that code alloating send tags via if_snd_tag_alloc
  no longer has to manage that manually.  Similarly, m_snd_tag_rele
  drops the refcount on the ifp after invoking if_snd_tag_free when
  the last reference to a send tag is dropped.

  This also closes use after free races if there are pending packets in
  driver tx rings after the socket is closed (e.g. from tcpdrop).

  In order for m_free to work reliably, add a new CSUM_SND_TAG flag in
  csum_flags to indicate 'snd_tag' is set (rather than 'rcvif').
  Drivers now also check this flag instead of checking snd_tag against
  NULL.  This avoids false positive matches when a forwarded packet
  has a non-NULL rcvif that was treated as a send tag.

- cxgbe was relying on snd_tag_free being called when the inp was
  detached so that it could kick the firmware to flush any pending
  work on the flow.  This is because the driver doesn't require ACK
  messages from the firmware for every request, but instead does a
  kind of manual interrupt coalescing by only setting a flag to
  request a completion on a subset of requests.  If all of the
  in-flight requests don't have the flag when the tag is detached from
  the inp, the flow might never return the credits.  The current
  snd_tag_free command issues a flush command to force the credits to
  return.  However, the credit return is what also frees the mbufs,
  and since those mbufs now hold references on the tag, this meant
  that snd_tag_free would never be called.

  To fix, explicitly drop the mbuf's reference on the snd tag when the
  mbuf is queued in the firmware work queue.  This means that once the
  inp's reference on the tag goes away and all in-flight mbufs have
  been queued to the firmware, tag's refcount will drop to zero and
  snd_tag_free will kick in and send the flush request.  Note that we
  need to avoid doing this in the middle of ethofld_tx(), so the
  driver grabs a temporary reference on the tag around that loop to
  defer the free to the end of the function in case it sends the last
  mbuf to the queue after the inp has dropped its reference on the
  tag.

- mlx5 preallocates send tags and was using the ifp pointer even when
  the send tag wasn't in use.  Explicitly use the ifp from other data
  structures instead.

- Sprinkle some assertions in various places to assert that received
  packets don't have a send tag, and that other places that overwrite
  rcvif (e.g. 802.11 transmit) don't clobber a send tag pointer.

Reviewed by:	gallatin, hselasky, rgrimes, ae
Sponsored by:	Netflix
Differential Revision:	https://reviews.freebsd.org/D20117
2019-05-24 22:30:40 +00:00