tcp_outflags isn't used in this source file and compilation failed with
external GCC on sparc64. I'm not sure why only that case failed (perhaps
inconsistent -Werror config) but it is a legitimate issue to fix.
Reviewed by: tuexen
Differential Revision: https://reviews.freebsd.org/D23068
also commonizes the functions that both the freebsd and
rack stack uses.
Sponsored by:Netflix Inc
Differential Revision: https://reviews.freebsd.org/D23052
a connection can now have a separate tag added to the id.
Obtained from: Lawrence Stewart
Sponsored by: Netflix Inc
Differential Revision: https://reviews.freebsd.org/D22866
SCTP_PEER_ADDR_PARAMS socket option. The code in the stack assumes
sane values for the MTU.
This issue was found by running an instance of syzkaller.
MFC after: 1 week
This fixes deadlock between CARP and bridge. Bridge calls this function
taking CARP lock while holding bridge lock. Same time CARP tries to send
its announcements via the bridge while holding CARP lock.
Use of CARP_LOCK() here does not solve anything, since sc_addr is constant
while race on sc_state is harmless and use of the lock does not close it.
Reviewed by: glebius
MFC after: 2 weeks
Sponsored by: iXsystems, Inc.
in the case where a packet not marked was received.
Submitted by: Richard Scheffenegger
Reviewed by: rgrimes@, tuexen@
MFC after: 1 week
Differential Revision: https://reviews.freebsd.org/D19143
reported length.
Thanks to Natalie Silvanovich from Google for finding one of these
issues in the SCTP userland stack and reporting it.
MFC after: 1 week
The code in questions walks IPv6 tree every 60 seconds and looks into
the routes with non-zero expiration time (typically, redirected routes).
For each such route it sets RTF_PROBEMTU flag at the expiration time.
No other part of the kernel checks for RTF_PROBEMTU flag.
RTF_PROBEMTU was defined 21 years ago, 30 Jun 1999, as RTF_PROTO1.
RTF_PROTO1 is a de-facto standard indication of a route installed
by a routing daemon for a last decade.
Reviewed by: bz, ae
MFC after: 2 weeks
Differential Revision: https://reviews.freebsd.org/D22865
IPv4 and IPv6.
This fixes a regression issue after r349369. When trying to exit a
multicast group before closing the socket, a multicast leave packet
should be sent.
Differential Revision: https://reviews.freebsd.org/D22848
PR: 242677
Reviewed by: bz (network)
Tested by: Aleksandr Fedorov <aleksandr.fedorov@itglobal.com>
MFC after: 1 week
Sponsored by: Mellanox Technologies
gets both rack and bbr ready for the completion of the STATs
framework in FreeBSD. For now if you don't have both NF_stats and
stats on it disables them. As soon as the rest of the stats framework
lands we can remove that restriction and then just uses stats when
defined.
Sponsored by: Netflix Inc.
Differential Revision: https://reviews.freebsd.org/D22479
Change the remaining caddr_t usages to char * following the removal
of the KAME macros
No functional change.
Requested by: glebius
Reviewed by: glebius
MFC after: 2 weeks
Sponsored by: Netflix (originally)
Differential Revision: https://reviews.freebsd.org/D22399
groups. Do not acquire additional references. This makes the IPv4 IGMP
code in line with the IPv6 MLD code.
Background:
The IPv4 multicast code puts an extra reference on the in_multi struct
when joining groups. This becomes visible when using daemons like
igmpproxy from ports, that multicast entries do not disappear from the
output of ifmcstat(8) when multicast streams are disconnected.
This fixes a regression issue after r349762.
While at it factor the ip_mfilter_insert() and ip6_mfilter_insert() calls
to avoid repeated "is_new" check.
Differential Revision: https://reviews.freebsd.org/D22595
Tested by: Guido van Rooij <guido@gvr.org>
Reviewed by: rgrimes (network)
MFC after: 1 week
Sponsored by: Mellanox Technologies
This makes it possible to retrieve per-connection statistical
information such as the receive window size, RTT, or goodput,
using a newly added TCP_STATS getsockopt(3) option, and extract
them using the stats_voistat_fetch(3) API.
See the net/tcprtt port for an example consumer of this API.
Compared to the existing TCP_INFO system, the main differences
are that this mechanism is easy to extend without breaking ABI,
and provides statistical information instead of raw "snapshots"
of values at a given point in time. stats(3) is more generic
and can be used in both userland and the kernel.
Reviewed by: thj
Tested by: thj
Obtained from: Netflix
Relnotes: yes
Sponsored by: Klara Inc, Netflix
Differential Revision: https://reviews.freebsd.org/D20655
This allows adding more ECN related flags in the future.
No functional change intended.
Submitted by: Richard Scheffenegger
Reviewed by: rrs@, tuexen@
Differential Revision: https://reviews.freebsd.org/D22497
to make them aligned.
Submitted by: Richard Scheffenegger
Reviewed by: rgrimes@, rrs@, tuexen@
Differential Revision: https://reviews.freebsd.org/D22428
to add support for L4S or SCE, which require processing of the IP TOS
field.
Submitted by: Richard Scheffenegger
Reviewed by: rgrimes@, rrs@, tuexen@
Differential Revision: https://reviews.freebsd.org/D22426
improvements, the ECN bits need to be exposed to the TCP SYNcache.
This change is a minimal modification to the function headers, without any
functional change intended.
Submitted by: Richard Scheffenegger
Reviewed by: rgrimes@, rrs@, tuexen@
Differential Revision: https://reviews.freebsd.org/D22436
RTT measurements, but also scheldule new ones for the future.
Submitted by: Julius Flohr
MFC after: 1 week
Differential Revision: https://reviews.freebsd.org/D22547
r354748-354750 replaced the KAME macros with m_pulldown() calls.
Contrary to the rest of the network stack m_len checks before m_pulldown()
were not put in placed (see r354748).
Put these m_len checks in place for now (to go along with the style of the
network stack since the initial commits). These are not put in for
performance but to avoid an error scenario (even though it also will help
performance at the moment as it avoid allocating an extra mbuf; not because
of the unconditional function call).
The observed error case went like this:
(1) an mbuf with M_EXT arrives and we call m_pullup() unconditionally on it.
(2) m_pullup() will call m_get() unless the requested length is larger than
MHLEN (in which case it'll m_freem() the perfectly fine mbuf) and migrate the
requested length of data and pkthdr into the new mbuf.
(3) If m_get() succeeds, a further m_pullup() call going over MHLEN will fail.
This was observed with failing auto-configuration as an RA packet of
200 bytes exceeded MHLEN and the m_pullup() called from nd6_ra_input()
dropped the mbuf.
(Re-)adding the m_len checks before m_pullup() calls avoids this problems
with mbufs using external storage for now.
MFC after: 3 weeks
Sponsored by: Netflix
as requiresd by the socket API specification.
Thanks to Inaki Baz Castillo, who found this bug running the userland
stack with valgrind and reported the issue in
https://github.com/sctplab/usrsctp/issues/408
MFC after: 1 week
One leak happens when there is not enough memory to allocate the
the resources for streams. The other leak happens if the are
unknown parameters in the received INIT-ACK chunk which require
reporting and the INIT-ACK requires sending an ABORT due to illegal
parameter combinations.
Hopefully this fixes
https://bugs.chromium.org/p/oss-fuzz/issues/detail?id=19083
MFC after: 1 week
With VIMAGE kernels modules get special treatment as they need
to also keep the original values and make copies for each instance.
For that a few pages of vnet modspace are provided and the
kernel-linker and the VNET framework know how to deal with things.
When the modspace is (almost) full, other modules which would
overflow the modspace cannot be loaded and kldload will fail.
ip_mroute uses a lot of variable space, mostly be four big arrays:
set_vnet 0000000000000510 vnet_entry_multicast_register_if
set_vnet 0000000000000700 vnet_entry_viftable
set_vnet 0000000000002000 vnet_entry_bw_meter_timers
set_vnet 0000000000002800 vnet_entry_bw_upcalls
Dynamically malloc the three big ones for each instance we need
and free them again on vnet teardown (the 4th is an ifnet).
That way they only need module space for a single pointer and
allow a lot more modules using virtualized variables to be loaded
on a VNET kernel.
PR: 206583
Reviewed by: hselasky, kp
MFC after: 3 weeks
Differential Revision: https://reviews.freebsd.org/D22443
The adjustments are inspired by the Linux stack, which has had a
functionally equivalent implementation for more than a decade now.
Submitted by: Richard Scheffenegger
Reviewed by: Cheng Cui
Differential Revision: https://reviews.freebsd.org/D18982
This patch addresses a very common case of frequent application stalls,
where TCP runs idle and looses the state of the network.
Submitted by: Richard Scheffenegger
Reviewed by: Cheng Cui
Differential Revision: https://reviews.freebsd.org/D18954
After r354748-354750 all uses of the IP6_EXTHDR_CHECK() and
IP6_EXTHDR_GET() macros are gone from the kernel. IP6_EXTHDR_GET0()
was unused. Remove the macros and update the documentation.
Sponsored by: Netflix
While r354748 removed almost all IP6_EXTHDR_CHECK() calls, these
are not part of the PULLDOWN_TESTS.
Equally convert these IP6_EXTHDR_CHECK()s here to m_pullup() and remove
the extra check and m_pullup() in tcp_input() under isipv6 given
tcp6_input() has done exactly that pullup already.
MFC after: 8 weeks
Sponsored by: Netflix
In a few places we have IP6_EXTHDR_GET() left in upper layer protocols.
The IP6_EXTHDR_GET() macro might perform an m_pulldown() in case the data
fragment is not contiguous.
Convert these last remaining instances into m_pullup()s instead.
In CARP, for example, we will a few lines later call m_pullup() anyway,
the IPsec code coming from OpenBSD would otherwise have done the m_pullup()
and are copying the data a bit later anyway, so pulling it in seems no
better or worse.
Note: this leaves very few m_pulldown() cases behind in the tree and we
might want to consider removing them as well to make mbuf management
easier again on a path to variable size mbufs, especially given
m_pulldown() still has an issue not re-checking M_WRITEABLE().
Reviewed by: gallatin
MFC after: 8 weeks
Sponsored by: Netflix
Differential Revision: https://reviews.freebsd.org/D22335
to the higher of the previous ssthresh or 3/4 of the prior cwnd.
Submitted by: Richard Scheffenegger
Reviewed by: Cheng Cui
Differential Revision: https://reviews.freebsd.org/D18982
In ip6_[direct_]input() we are looping over the extension headers
to deal with the next header. We pass a pointer to an mbuf pointer
to the handling functions. In certain cases the mbuf can be updated
there and we need to pass the new one back. That missing in
dest6_input() and route6_input(). In tcp6_input() we should also
update it before we call tcp_input().
In addition to that mark the mbuf NULL all the times when we return
that we are done with handling the packet and no next header should
be checked (IPPROTO_DONE). This will eventually allow us to assert
proper behaviour and catch the above kind of errors more easily,
expecting *mp to always be set.
This change is extracted from a larger patch and not an exhaustive
change across the entire stack yet.
PR: 240135
Reported by: prabhakar.lakhera gmail.com
MFC after: 3 weeks
Sponsored by: Netflix
handlers can be greatly simplified. All the previous double
cycling and complex locking was added to avoid these functions
holding global PCB locks for extended period of time, preventing
addition of new entries.
in the network epoch, we can greatly simplify synchronization.
Remove all unneccesary epoch enters hidden under INP_INFO_RLOCK macro.
Remove some unneccesary assertions and convert necessary ones into the
NET_EPOCH_ASSERT macro.
locking in udp_output() and udp6_output().
First, we select if we need read or write lock in PCB itself, we take
the lock and enter network epoch. Then, we proceed for the rest of
the function. In case if we need to modify PCB hash, we would take
write lock on it for a short piece of code.
We could exit the epoch before allocating an mbuf, but with this
patch we are keeping it all the way into ip_output()/ip6_output().
Today this creates an epoch recursion, since ip_output() enters epoch
itself. However, once all protocols are reviewed, ip_output() and
ip6_output() would require epoch instead of entering it.
Note: I'm not 100% sure that in udp6_output() the epoch is required.
We don't do PCB hash lookup for a bound socket. And all branches of
in6_select_src() don't require epoch, at least they lack assertions.
Today inet6 address list is protected by rmlock, although it is CKLIST.
AFAIU, the future plan is to protect it by network epoch. That would
require epoch in in6_select_src(). Anyway, in future ip6_output()
would require epoch, udp6_output() would need to enter it.
we lookup PCBs. Thus, do not enter epoch recursively in
in_pcblookup_hash() and in6_pcblookup_hash(). Same applies to
tcp_ctlinput() and tcp6_ctlinput().
This leaves several sysctl(9) handlers that return PCB credentials
unprotected. Add epoch enter/exit to all of them.
Differential Revision: https://reviews.freebsd.org/D22197
In preparation for another change factor out various variable cleanups.
These mainly include:
(1) do not assign values to variables during declaration: this makes
the code more readable and does allow for better grouping of
variable declarations,
(2) do not assign values to variables before need; e.g., if a variable
is only used in the 2nd half of a function and we have multiple
return paths before that, then do not set it before it is needed, and
(3) try to avoid assigning the same value multiple times.
MFC after: 3 weeks
Sponsored by: Netflix
In theory the eventhandler invoke should be in the same VNET as
the the current interface. We however cannot guarantee that for
all cases in the future.
So before checking if the fragmentation handling for this VNET
is active, switch the VNET to the VNET of the interface to always
get the one we want.
Reviewed by: hselasky
MFC after: 3 weeks
Sponsored by: Netflix
Differential Revision: https://reviews.freebsd.org/D22153
happens is we are more delayed in the pacer calling in so
we remove the stack from the pacer and recalculate how
much time is left after all data has been acknowledged. However
the comparision was backwards so we end up with a negative
value in the last_pacing_delay time which causes us to
add in a huge value to the next pacing time thus stalling
the connection.
Reported by: vm2.finance@gmail.com
Thanks to cem@ for discussing the issue which resulted in this patch.
Reviewed by: cem@
Sponsored by: Netflix, Inc.
Differential Revision: https://reviews.freebsd.org/D22089
It remains unattached to any client protocol. Netdump is unaffected
(remaining half-duplex). The intended consumer is NetGDB.
Submitted by: John Reimer <john.reimer AT emc.com> (earlier version)
Discussed with: markj
Differential Revision: https://reviews.freebsd.org/D21541
Loosen requirements for connecting to debugnet-type servers. Only require a
destination address; the rest can theoretically be inferred from the routing
table.
Relax corresponding constraints in netdump(4) and move ifp validation to
debugnet connection time.
Submitted by: John Reimer <john.reimer AT emc.com> (earlier version)
Reviewed by: markj
Differential Revision: https://reviews.freebsd.org/D21482
Add a 'X -s <server> -c <client> [-g <gateway>] -i <interface>' subroutine
to the generic debugnet code. The imagined use is both netdump, shown here,
and NetGDB (vaporware). It uses the ddb(4) lexer, with some new extensions,
to parse out IPv4 addresses.
'Netdump' uses the generic debugnet routine to load a configuration and
start a dump, without any netdump configuration prior to panic.
Loosely derived from work by: John Reimer <john.reimer AT emc.com>
Reviewed by: markj
Differential Revision: https://reviews.freebsd.org/D21460
Debugnet is a simplistic and specialized panic- or debug-time reliable
datagram transport. It can drive a single connection at a time and is
currently unidirectional (debug/panic machine transmit to remote server
only).
It is mostly a verbatim code lift from netdump(4). Netdump(4) remains
the only consumer (until the rest of this patch series lands).
The INET-specific logic has been extracted somewhat more thoroughly than
previously in netdump(4), into debugnet_inet.c. UDP-layer logic and up, as
much as possible as is protocol-independent, remains in debugnet.c. The
separation is not perfect and future improvement is welcome. Supporting
INET6 is a long-term goal.
Much of the diff is "gratuitous" renaming from 'netdump_' or 'nd_' to
'debugnet_' or 'dn_' -- sorry. I thought keeping the netdump name on the
generic module would be more confusing than the refactoring.
The only functional change here is the mbuf allocation / tracking. Instead
of initiating solely on netdump-configured interface(s) at dumpon(8)
configuration time, we watch for any debugnet-enabled NIC for link
activation and query it for mbuf parameters at that time. If they exceed
the existing high-water mark allocation, we re-allocate and track the new
high-water mark. Otherwise, we leave the pre-panic mbuf allocation alone.
In a future patch in this series, this will allow initiating netdump from
panic ddb(4) without pre-panic configuration.
No other functional change intended.
Reviewed by: markj (earlier version)
Some discussion with: emaste, jhb
Objection from: marius
Differential Revision: https://reviews.freebsd.org/D21421
partial fragmented packets before a network interface is detached.
When sending IPv4 or IPv6 fragmented packets and a fragment is lost
before the network device is freed, the mbuf making up the fragment
will remain in the temporary hashed fragment list and cause a panic
when it times out due to accessing a freed network interface
structure.
1) Make sure the m_pkthdr.rcvif always points to a valid network
interface. Else the rcvif field should be set to NULL.
2) Use the rcvif of the last received fragment as m_pkthdr.rcvif for
the fully defragged packet, instead of the first received fragment.
Panic backtrace for IPv6:
panic()
icmp6_reflect() # tries to access rcvif->if_afdata[AF_INET6]->xxx
icmp6_error()
frag6_freef()
frag6_slowtimo()
pfslowtimo()
softclock_call_cc()
softclock()
ithread_loop()
Reviewed by: bz
Differential Revision: https://reviews.freebsd.org/D19622
MFC after: 1 week
Sponsored by: Mellanox Technologies
instead of calling an SCTP specific function from the IP code.
This is a requirement of supporting SCTP as a kernel loadable module.
This patch was developed by markj@, I tweaked a bit the SCTP related
code.
Submitted by: markj@
MFC after: 3 days
Previously they were defined in a header which was included exactly
once. Change this to follow the usual practice of putting definitions
in C files. No functional change intended.
Discussed with: tuexen
MFC after: 1 week
Sponsored by: The FreeBSD Foundation
Cast the pointers to (uintptr_t) before assigning to type
uint64_t. This eliminates an error from gcc when we cast the pointer
to a larger integer type.
Ensure the epoch_call() function is not called more than one time
before the callback has been executed, by always checking the
RS_FUNERAL_SCHD flag before invoking epoch_call().
The "rs_number_dead" is balanced again after r353353.
Discussed with: rrs@
Sponsored by: Mellanox Technologies
This adds the glue to allocate TLS sessions and invokes it from
the TLS enable socket option handler. This also adds some counters
for active TOE sessions.
The TOE KTLS mode is returned by getsockopt(TLSTX_TLS_MODE) when
TOE KTLS is in use on a socket, but cannot be set via setsockopt().
To simplify various checks, a TLS session now includes an explicit
'mode' member set to the value returned by TLSTX_TLS_MODE. Various
places that used to check 'sw_encrypt' against NULL to determine
software vs ifnet (NIC) TLS now check 'mode' instead.
Reviewed by: np, gallatin
Sponsored by: Chelsio Communications
Differential Revision: https://reviews.freebsd.org/D21891
When epoch(9) was introduced to network stack, it was basically
dropped in place of existing locking, which was mutexes and
rwlocks. For the sake of performance mutex covered areas were
as small as possible, so became epoch covered areas.
However, epoch doesn't introduce any contention, it just delays
memory reclaim. So, there is no point to minimise epoch covered
areas in sense of performance. Meanwhile entering/exiting epoch
also has non-zero CPU usage, so doing this less often is a win.
Not the least is also code maintainability. In the new paradigm
we can assume that at any stage of processing a packet, we are
inside network epoch. This makes coding both input and output
path way easier.
On output path we already enter epoch quite early - in the
ip_output(), in the ip6_output().
This patch does the same for the input path. All ISR processing,
network related callouts, other ways of packet injection to the
network stack shall be performed in net_epoch. Any leaf function
that walks network configuration now asserts epoch.
Tricky part is configuration code paths - ioctls, sysctls. They
also call into leaf functions, so some need to be changed.
This patch would introduce more epoch recursions (see EPOCH_TRACE)
than we had before. They will be cleaned up separately, as several
of them aren't trivial. Note, that unlike a lock recursion the
epoch recursion is safe and just wastes a bit of resources.
Reviewed by: gallatin, hselasky, cy, adrian, kristof
Differential Revision: https://reviews.freebsd.org/D19111
The new sysctl was not added to the siftr.4 man page at the time.
This updates the man page, and removes one left over trailing whitespace.
Submitted by: Richard Scheffenegger
Reviewed by: bcr@
MFC after: 3 days
Differential Revision: https://reviews.freebsd.org/D21619
no longer worked. The problem was that the defines used the
same space as the VLAN id. This commit does three things.
1) Move the LRO used fields to the PH_per fields. This is
safe since the entire PH_per is used for IP reassembly
which LRO code will not hit.
2) Remove old unused pace fields that are not used in mbuf.h
3) The VLAN processing is not in the mbuf queueing code. Consequently
if a VLAN submits to Rack or BBR we need to bypass the mbuf queueing
for now until rack_bbr_common is updated to handle the VLAN properly.
Reported by: Brad Davis
that this path is taken by setting the tail pointer correctly.
There is still bug related to handling unordered unfragmented messages
which were delayed in deferred handling.
This issue was found by OSS-Fuzz testing the usrsctp stack and reported in
https://bugs.chromium.org/p/oss-fuzz/issues/detail?id=17794
MFC after: 3 days
including the TCP header in the first IP packet.
Enforce this in tcp_output(). In addition make sure that at least
one byte payload fits in the TCP segement to allow making progress.
Without this check, a kernel with INVARIANTS will panic.
This issue was found by running an instance of syzkaller.
Reviewed by: jtl@
MFC after: 3 days
Sponsored by: Netflix, Inc.
Differential Revision: https://reviews.freebsd.org/D21665
initialisation, which is important when the host is dealing with a
SYN flood.
This affects the computation of the initial TCP sequence number for
the client side.
This has been discussed with secteam@.
Reviewed by: gallatin@
Sponsored by: Netflix, Inc.
Differential Revision: https://reviews.freebsd.org/D21616
for RACK specific socket options.
These issues were found by a syzkaller instance.
Reviewed by: rrs@
Sponsored by: Netflix, Inc.
Differential Revision: https://reviews.freebsd.org/D21825
syn cache overflows. Whether this is due to an attack or due to the system
having more legitimate connections than the syn cache can hold, this
situation can quickly impact performance.
To make the system perform better during these periods, the code will now
switch to exclusively using cookies until the syn cache stops overflowing.
In order for this to occur, the system must be configured to use the syn
cache with syn cookie fallback. If syn cookies are completely disabled,
this change should have no functional impact.
When the system is exclusively using syn cookies (either due to
configuration or the overflow detection enabled by this change), the
code will now skip acquiring a lock on the syn cache bucket. Additionally,
the code will now skip lookups in several places (such as when the system
receives a RST in response to a SYN|ACK frame).
Reviewed by: rrs, gallatin (previous version)
Discussed with: tuexen
Sponsored by: Netflix, Inc.
Differential Revision: https://reviews.freebsd.org/D21644
rather than indirectly through the backpointer to the tcp_syncache
structure stored in the hashtable bucket.
This also allows us to remove the requirement in syncookie_generate()
and syncookie_lookup() that the syncache hashtable bucket must be
locked.
Reviewed by: gallatin, rrs
Sponsored by: Netflix, Inc.
Differential Revision: https://reviews.freebsd.org/D21644
use of this parameter was removed in r313330. This commit now removes
passing this now-unused parameter.
Reviewed by: gallatin, rrs
Sponsored by: Netflix, Inc.
Differential Revision: https://reviews.freebsd.org/D21644
that instead of functions only being inside the _KERNEL and
the absence of RATELIMIT causing us to have NULL/error returning
interfaces we ended up with non-kernel getting the error path.
opps..
is a completely separate TCP stack (tcp_bbr.ko) that will be built only if
you add the make options WITH_EXTRA_TCP_STACKS=1 and also include the option
TCPHPTS. You can also include the RATELIMIT option if you have a NIC interface that
supports hardware pacing, BBR understands how to use such a feature.
Note that this commit also adds in a general purpose time-filter which
allows you to have a min-filter or max-filter. A filter allows you to
have a low (or high) value for some period of time and degrade slowly
to another value has time passes. You can find out the details of
BBR by looking at the original paper at:
https://queue.acm.org/detail.cfm?id=3022184
or consult many other web resources you can find on the web
referenced by "BBR congestion control". It should be noted that
BBRv1 (which this is) does tend to unfairness in cases of small
buffered paths, and it will usually get less bandwidth in the case
of large BDP paths(when competing with new-reno or cubic flows). BBR
is still an active research area and we do plan on implementing V2
of BBR to see if it is an improvement over V1.
Sponsored by: Netflix Inc.
Differential Revision: https://reviews.freebsd.org/D21582
This avoids a double lock bug in the NAT colliding state processing
of SCTP. Thanks to Felix Weinrank for finding and reporting this issue in
https://github.com/sctplab/usrsctp/issues/374
He found this bug using fuzz testing.
MFC after: 3 days
before computing the RTO.
This should fix an overflow issue reported by Felix Weinrank in
https://github.com/sctplab/usrsctp/issues/375
for the userland stack and found by running a fuzz tester.
MFC after: 3 days
it wasn't taking the IP level options into account. This patch fixes this.
In addition, it also corrects a KASSERT and adds protection code to assure
that the IP header chain and the TCP head fit in the first fragment as
required by RFC 7112.
Reviewed by: rrs@
MFC after: 3 days
Sponsored by: Nertflix, Inc.
Differential Revision: https://reviews.freebsd.org/D21666
checksum computation, do not assume that the IP header chain and the
SCTP common header are in contiguous memory although the SCTP lays
out the mbuf chains that way. If there are IP-level options inserted
by the IP layer, the constraint is not fulfilled anymore.
This issues was found by running syzkaller. Thanks to markj@ who is
running an instance which also provides kernel dumps. This allowed me
to find this issue.
MFC after: 3 days
Don't call arc4random() unconditionally to initialize sc_iss, and
then when syncookies are enabled, just overwrite it with the
return value from from syncookie_generate(). Instead, only call
arc4random() to initialize sc_iss when syncookies are not
enabled.
Note that on a system under a syn flood attack, arc4random()
becomes quite expensive, and the chacha_poly crypto that it calls
is one of the more expensive things happening on the
system. Removing this unneeded arc4random() call reduces CPU from
about 40% to about 35% in my test scenario (Broadwell Xeon, 6Mpps
syn flood attack).
Reviewed by: rrs, tuxen, bz
Sponsored by: Netflix
Differential Revision: https://reviews.freebsd.org/D21591
sb_tls_flags, its just the sb_flags. Also the ratelimit
code, now that the defintion is in sockbuf.h, does not
need the ktls.h file (or its predecessor).
Sponsored by: Netflix Inc
This fixes hitting a KASSERT with a valid packet exchange.
Reviewed by: rrs@, Richard Scheffenegger
MFC after: 3 days
Sponsored by: Netflix, Inc.
Differential Revision: https://reviews.freebsd.org/D21567
to add BBR. These changes make it so you can get an
array of timestamps instead of a compressed ack/data segment.
BBR uses this to aid with its delivery estimates. We also
now (via Drew's suggestions) will not go to the expense of
the tcb lookup if no stack registers to want this feature. If
HPTS is not present the feature is not present either and you
just get the compressed behavior.
Sponsored by: Netflix Inc
Differential Revision: https://reviews.freebsd.org/D21127
* Convert the TCP delayed ACK timer from ms to ticks as required.
This fixes the timer on platforms with hz != 1000.
* Don't delay acknowledgements which report duplicate data using
DSACKs.
Reviewed by: rrs@
MFC after: 1 week
Sponsored by: Netflix, Inc.
Differential Revision: https://reviews.freebsd.org/D21512
The lowest SACK block is used when multiple Blocks would be elegible as
DSACK blocks ACK blocks get reordered - while maintaining the ordering of
SACK blocks not relevant in the DSACK context is maintained.
Reviewed by: rrs@, tuexen@
Obtained from: Richard Scheffenegger
MFC after: 1 week
Differential Revision: https://reviews.freebsd.org/D21038
This fixes problem with parameters indicating a zero length or partial
parameters after an unknown parameter indicating to stop processing. It
also fixes a problem with state cookie parameters after unknown
parametes indicating to stop porcessing.
Thanks to Mark Wodrich from Google for finding two of these issues
by fuzz testing the userland stack and reporting them in
https://github.com/sctplab/usrsctp/issues/355
and
https://github.com/sctplab/usrsctp/issues/352
MFC after: 3 days
data chunks. Abort the association if there are data chunks with larger
fragement sequence numbers than the fragement sequence of the last
fragment.
Thanks to Mark Wodrich from Google who found this issue by fuzz testing
the userland stack and reporting this issue in
https://github.com/sctplab/usrsctp/issues/355
MFC after: 3 days
KTLS adds support for in-kernel framing and encryption of Transport
Layer Security (1.0-1.2) data on TCP sockets. KTLS only supports
offload of TLS for transmitted data. Key negotation must still be
performed in userland. Once completed, transmit session keys for a
connection are provided to the kernel via a new TCP_TXTLS_ENABLE
socket option. All subsequent data transmitted on the socket is
placed into TLS frames and encrypted using the supplied keys.
Any data written to a KTLS-enabled socket via write(2), aio_write(2),
or sendfile(2) is assumed to be application data and is encoded in TLS
frames with an application data type. Individual records can be sent
with a custom type (e.g. handshake messages) via sendmsg(2) with a new
control message (TLS_SET_RECORD_TYPE) specifying the record type.
At present, rekeying is not supported though the in-kernel framework
should support rekeying.
KTLS makes use of the recently added unmapped mbufs to store TLS
frames in the socket buffer. Each TLS frame is described by a single
ext_pgs mbuf. The ext_pgs structure contains the header of the TLS
record (and trailer for encrypted records) as well as references to
the associated TLS session.
KTLS supports two primary methods of encrypting TLS frames: software
TLS and ifnet TLS.
Software TLS marks mbufs holding socket data as not ready via
M_NOTREADY similar to sendfile(2) when TLS framing information is
added to an unmapped mbuf in ktls_frame(). ktls_enqueue() is then
called to schedule TLS frames for encryption. In the case of
sendfile_iodone() calls ktls_enqueue() instead of pru_ready() leaving
the mbufs marked M_NOTREADY until encryption is completed. For other
writes (vn_sendfile when pages are available, write(2), etc.), the
PRUS_NOTREADY is set when invoking pru_send() along with invoking
ktls_enqueue().
A pool of worker threads (the "KTLS" kernel process) encrypts TLS
frames queued via ktls_enqueue(). Each TLS frame is temporarily
mapped using the direct map and passed to a software encryption
backend to perform the actual encryption.
(Note: The use of PHYS_TO_DMAP could be replaced with sf_bufs if
someone wished to make this work on architectures without a direct
map.)
KTLS supports pluggable software encryption backends. Internally,
Netflix uses proprietary pure-software backends. This commit includes
a simple backend in a new ktls_ocf.ko module that uses the kernel's
OpenCrypto framework to provide AES-GCM encryption of TLS frames. As
a result, software TLS is now a bit of a misnomer as it can make use
of hardware crypto accelerators.
Once software encryption has finished, the TLS frame mbufs are marked
ready via pru_ready(). At this point, the encrypted data appears as
regular payload to the TCP stack stored in unmapped mbufs.
ifnet TLS permits a NIC to offload the TLS encryption and TCP
segmentation. In this mode, a new send tag type (IF_SND_TAG_TYPE_TLS)
is allocated on the interface a socket is routed over and associated
with a TLS session. TLS records for a TLS session using ifnet TLS are
not marked M_NOTREADY but are passed down the stack unencrypted. The
ip_output_send() and ip6_output_send() helper functions that apply
send tags to outbound IP packets verify that the send tag of the TLS
record matches the outbound interface. If so, the packet is tagged
with the TLS send tag and sent to the interface. The NIC device
driver must recognize packets with the TLS send tag and schedule them
for TLS encryption and TCP segmentation. If the the outbound
interface does not match the interface in the TLS send tag, the packet
is dropped. In addition, a task is scheduled to refresh the TLS send
tag for the TLS session. If a new TLS send tag cannot be allocated,
the connection is dropped. If a new TLS send tag is allocated,
however, subsequent packets will be tagged with the correct TLS send
tag. (This latter case has been tested by configuring both ports of a
Chelsio T6 in a lagg and failing over from one port to another. As
the connections migrated to the new port, new TLS send tags were
allocated for the new port and connections resumed without being
dropped.)
ifnet TLS can be enabled and disabled on supported network interfaces
via new '[-]txtls[46]' options to ifconfig(8). ifnet TLS is supported
across both vlan devices and lagg interfaces using failover, lacp with
flowid enabled, or lacp with flowid enabled.
Applications may request the current KTLS mode of a connection via a
new TCP_TXTLS_MODE socket option. They can also use this socket
option to toggle between software and ifnet TLS modes.
In addition, a testing tool is available in tools/tools/switch_tls.
This is modeled on tcpdrop and uses similar syntax. However, instead
of dropping connections, -s is used to force KTLS connections to
switch to software TLS and -i is used to switch to ifnet TLS.
Various sysctls and counters are available under the kern.ipc.tls
sysctl node. The kern.ipc.tls.enable node must be set to true to
enable KTLS (it is off by default). The use of unmapped mbufs must
also be enabled via kern.ipc.mb_use_ext_pgs to enable KTLS.
KTLS is enabled via the KERN_TLS kernel option.
This patch is the culmination of years of work by several folks
including Scott Long and Randall Stewart for the original design and
implementation; Drew Gallatin for several optimizations including the
use of ext_pgs mbufs, the M_NOTREADY mechanism for TLS records
awaiting software encryption, and pluggable software crypto backends;
and John Baldwin for modifications to support hardware TLS offload.
Reviewed by: gallatin, hselasky, rrs
Obtained from: Netflix
Sponsored by: Netflix, Chelsio Communications
Differential Revision: https://reviews.freebsd.org/D21277
an retransmission of the initial SYN (with data) would
cause us to strip the SYN and decrement/increase offset/len
which then caused us a -1 offset and a panic.
Reported by: Larry Rosenman
(Michael Tuexen helped me debug this at the IETF)
nd_buf is used to buffer headers (for both the kernel dump itself and
for EKCD) before the final call to netdump_dumper(), which flushes
residual data in nd_buf. As a result, a small portion of the residual
data would be corrupted. This manifests when kernel dump compression
is enabled since both zstd and zlib detect the corruption during
decompression.
Reviewed by: cem
Differential Revision: https://reviews.freebsd.org/D21294
Since ipvoly is used for checksum calculation, part of original IP
header is zeroed. This part includes ip_ttl field, that can be used
later in IP_MINTTL socket option handling.
PR: 239799
MFC after: 1 week
IPPROTO 33 is DCCP in the IANA Registry:
https://www.iana.org/assignments/protocol-numbers/protocol-numbers.xhtml
IPPROTO_SEP was added about 20 years ago in r33804. The entries were added
straight from RFC1700, without regard to whether they were used.
The reference in RFC1700 for SEP is '[JC120] <mystery contact>', this is an
indication that the protocol number was probably in use in a private network.
As RFC1700 is no longer the authoritative list of internet numbers and that
IANA assinged 33 to DCCP in RFC4340, change the header to the actual
authoritative source.
Reviewed by: Richard Scheffenegger, bz
Approved by: bz (mentor)
MFC after: 1 week
Differential Revision: https://reviews.freebsd.org/D21178
Finish what was started a few years ago and harmonize IPv6 and IPv4
kernel names. We are down to very few places now that it is feasible
to do the change for everything remaining with causing too much disturbance.
Remove "aliases" for IPv6 names which confusingly could indicate
that we are talking about a different data structure or field or
have two fields, one for each address family.
Try to follow common conventions used in FreeBSD.
* Rename sin6p to sin6 as that is how it is spelt in most places.
* Remove "aliases" (#defines) for:
- in6pcb which really is an inpcb and nothing separate
- sotoin6pcb which is sotoinpcb (as per above)
- in6p_sp which is inp_sp
- in6p_flowinfo which is inp_flow
* Try to use ia6 for in6_addr rather than in6p.
* With all these gone also rename the in6p variables to inp as
that is what we call it in most of the network stack including
parts of netinet6.
The reasons behind this cleanup are that we try to further
unify netinet and netinet6 code where possible and that people
will less ignore one or the other protocol family when doing
code changes as they may not have spotted places due to different
names for the same thing.
No functional changes.
Discussed with: tuexen (SCTP changes)
MFC after: 3 months
Sponsored by: Netflix
chunk. This also plugs an potential mbuf leak.
Thanks to Felix Weinrank for reporting this issue found by fuzz-testing
the userland stack.
MFC after: 3 days
HOSTNAME parameter or a parameter with an illegal length, only
include an error cause indicating why the ABORT was sent.
This also fixes an mbuf leak which could occur.
MFC after: 3 days
an updated rack depend on having access to the new
ratelimit api in this commit.
Sponsored by: Netflix Inc.
Differential Revision: https://reviews.freebsd.org/D20953
DCTCP specific methods. Also fallthrough NewReno for non ECN capable
TCP connections and improve the integer arithmetic.
Obtained from: Richard Scheffenegger
MFC after: 1 week
Differential Revision: https://reviews.freebsd.org/D20550
* Initialize the alpha parameter to a conservative value (like Linux)
* Improve handling of arithmetic.
* Improve man-page
Obtained from: Richard Scheffenegger
MFC after: 1 week
Differential Revision: https://reviews.freebsd.org/D20549
of the TCP TS offset from taking the IP addresses and the TCP port
numbers into account to a version just taking only the IP addresses
into account. This works around broken middleboxes or endpoints.
The default is to keep the behaviour, which is also the behaviour
recommended in RFC 7323.
Reported by: devgs@ukr.net
Reviewed by: rrs@
MFC after: 2 weeks
Sponsored by: Netflix, Inc.
Differential Revision: https://reviews.freebsd.org/D20980
When compiling RACK on platforms using gcc, a warning that tcp_outflags
is defined but not used is issued and terminates compilation on PPC64,
for example. So don't indicate that tcp_outflags is used.
Reviewed by: rrs@
Sponsored by: Netflix, Inc.
Differential Revision: https://reviews.freebsd.org/D20971
get BBRv1 into the tree. This fixes the DSACK bug but
is also needed by BBR. We have yet to go two more
one will be for the pacing code (tcp_ratelimit.c) and
the second will be for the new updated LRO code that
allows a transport to know the arrival times of packets
and (tcp_lro.c). After that we should finally be able
to get BBRv1 into head.
Sponsored by: Netflix Inc
Differential Revision: https://reviews.freebsd.org/D20908
well as sets in some of the groundwork for committing BBR. The
hpts system is updated as well as some other needed utilities
for the entrance of BBR. This is actually part 1 of 3 more
needed commits which will finally complete with BBRv1 being
added as a new tcp stack.
Sponsored by: Netflix Inc.
Differential Revision: https://reviews.freebsd.org/D20834
Unmapped mbufs allow sendfile to carry multiple pages of data in a
single mbuf, without mapping those pages. It is a requirement for
Netflix's in-kernel TLS, and provides a 5-10% CPU savings on heavy web
serving workloads when used by sendfile, due to effectively
compressing socket buffers by an order of magnitude, and hence
reducing cache misses.
For this new external mbuf buffer type (EXT_PGS), the ext_buf pointer
now points to a struct mbuf_ext_pgs structure instead of a data
buffer. This structure contains an array of physical addresses (this
reduces cache misses compared to an earlier version that stored an
array of vm_page_t pointers). It also stores additional fields needed
for in-kernel TLS such as the TLS header and trailer data that are
currently unused. To more easily detect these mbufs, the M_NOMAP flag
is set in m_flags in addition to M_EXT.
Various functions like m_copydata() have been updated to safely access
packet contents (using uiomove_fromphys()), to make things like BPF
safe.
NIC drivers advertise support for unmapped mbufs on transmit via a new
IFCAP_NOMAP capability. This capability can be toggled via the new
'nomap' and '-nomap' ifconfig(8) commands. For NIC drivers that only
transmit packet contents via DMA and use bus_dma, adding the
capability to if_capabilities and if_capenable should be all that is
required.
If a NIC does not support unmapped mbufs, they are converted to a
chain of mapped mbufs (using sf_bufs to provide the mapping) in
ip_output or ip6_output. If an unmapped mbuf requires software
checksums, it is also converted to a chain of mapped mbufs before
computing the checksum.
Submitted by: gallatin (earlier version)
Reviewed by: gallatin, hselasky, rrs
Discussed with: ae, kp (firewalls)
Relnotes: yes
Sponsored by: Netflix
Differential Revision: https://reviews.freebsd.org/D20616
instead of a linear array.
The multicast memberships for the inpcb structure are protected by a
non-sleepable lock, INP_WLOCK(), which needs to be dropped when
calling the underlying possibly sleeping if_ioctl() method. When using
a linear array to keep track of multicast memberships, the computed
memory location of the multicast filter may suddenly change, due to
concurrent insertion or removal of elements in the linear array. This
in turn leads to various invalid memory access issues and kernel
panics.
To avoid this problem, put all multicast memberships on a STAILQ based
list. Then the memory location of the IPv4 and IPv6 multicast filters
become fixed during their lifetime and use after free and memory leak
issues are easier to track, for example by: vmstat -m | grep multi
All list manipulation has been factored into inline functions
including some macros, to easily allow for a future hash-list
implementation, if needed.
This patch has been tested by pho@ .
Differential Revision: https://reviews.freebsd.org/D20080
Reviewed by: markj @
MFC after: 1 week
Sponsored by: Mellanox Technologies
With this opcode it is possible to match TCP packets with specified
MSS option, whose value corresponds to configured in opcode value.
It is allowed to specify single value, range of values, or array of
specific values or ranges. E.g.
# ipfw add deny log tcp from any to any tcpmss 0-500
Reviewed by: melifaro,bcr
Obtained from: Yandex LLC
MFC after: 1 week
Sponsored by: Yandex LLC
If we take the slow path for forwarding we should still tell our
firewalls (hooked through pfil(9)) that we're forwarding. Pass the
ip_output() flags to ip_output_pfil() so it can set the PFIL_FWD flag
when we're forwarding.
MFC after: 1 week
Sponsored by: Axiado
in response to SACKs. The default behavior is unchanged; however, the limit
can be activated by changing the new net.inet.tcp.rack.split_limit sysctl.
Submitted by: Peter Lei <peterlei@netflix.com>
Reported by: jtl
Reviewed by: lstewart (earlier version)
Security: CVE-2019-5599
rename the source to gsb_crc32.c.
This is a prerequisite of unifying kernel zlib instances.
PR: 229763
Submitted by: Yoshihiro Ota <ota at j.email.ne.jp>
Differential Revision: https://reviews.freebsd.org/D20193
Apply a linker script when linking i386 kernel modules to apply padding
to a set_pcpu or set_vnet section. The padding value is kind-of random
and is used to catch modules not compiled with the linker-script, so
possibly still having problems leading to kernel panics.
This is needed as the code generated on certain architectures for
non-simple-types, e.g., an array can generate an absolute relocation
on the edge (just outside) the section and thus will not be properly
relocated. Adding the padding to the end of the section will ensure
that even absolute relocations of complex types will be inside the
section, if they are the last object in there and hence relocation will
work properly and avoid panics such as observed with carp.ko or ipsec.ko.
There is a rather lengthy discussion of various options to apply in
the mentioned PRs and their depends/blocks, and the review.
There seems no best solution working across multiple toolchains and
multiple version of them, so I took the liberty of taking one,
as currently our users (and our CI system) are hitting this on
just i386 and we need some solution. I wish we would have a proper
fix rather than another "hack".
Also backout r340009 which manually, temporarily fixed CARP before 12.0-R
"by chance" after a lead-up of various other link-elf.c and related fixes.
PR: 230857,238012
With suggestions from: arichardson (originally last year)
Tested by: lwhsu
Event: Waterloo Hackathon 2019
Reported by: lwhsu, olivier
MFC after: 6 weeks
Differential Revision: https://reviews.freebsd.org/D17512
The corresponding changes for the RACK stack where missed and are added
by this commit.
Reviewed by: Richard Scheffenegger, rrs@
MFC after: 3 days
Differential Revision: https://reviews.freebsd.org/D20372
another one in udp_output(). This one is a race condition.
We do check on the laddr and lport without holding a lock in
order to determine whether we want a read or a write lock
(this is in the "sendto/sendmsg" cases where addr (sin) is given).
Instrumenting the kernel showed that after taking the lock, we
had bound to a local port from a parallel thread on the same socket.
If we find that case, unlock, and retry again. Taking the write
lock would not be a problem in first place (apart from killing some
parallelism). However the retry is needed as later on based on
similar condition checks we do acquire the pcbinfo lock and if the
conditions have changed, we might find ourselves with a lock
inconsistency, hence at the end of the function when trying to
unlock, hitting the KASSERT.
Reported by: syzbot+bdf4caa36f3ceeac198f@syzkaller.appspotmail.com
Reviewed by: markj
MFC after: 6 weeks
Event: Waterloo Hackathon 2019
netdump waits for acknowledgement from the server for each write. When
dumping page table pages, we perform many small writes, limiting
throughput. Use the netdump client's buffer to buffer small contiguous
writes before calling netdump_send() to flush the MAXDUMPPGS-sized
buffer. This results in a significant reduction in the time taken to
complete a netdump.
Submitted by: Sam Gwydir <sam@samgwydir.com>
Reviewed by: cem
MFC after: 2 weeks
Differential Revision: https://reviews.freebsd.org/D20317
received and support for time stamps was negotiated in the SYN/SYNACK
exchange, perform the PAWS check and only expand the syn cache entry if
the check is passed.
Without this check, endpoints may get stuck on the incomplete queue.
Reviewed by: jtl@
MFC after: 3 days
Sponsored by: Netflix, Inc.
Differential Revision: https://reviews.freebsd.org/D20374
- Perform ifp mismatch checks (to determine if a send tag is allocated
for a different ifp than the one the packet is being output on), in
ip_output() and ip6_output(). This avoids sending packets with send
tags to ifnet drivers that don't support send tags.
Since we are now checking for ifp mismatches before invoking
if_output, we can now try to allocate a new tag before invoking
if_output sending the original packet on the new tag if allocation
succeeds.
To avoid code duplication for the fragment and unfragmented cases,
add ip_output_send() and ip6_output_send() as wrappers around
if_output and nd6_output_ifp, respectively. All of the logic for
setting send tags and dealing with send tag-related errors is done
in these wrapper functions.
For pseudo interfaces that wrap other network interfaces (vlan and
lagg), wrapper send tags are now allocated so that ip*_output see
the wrapper ifp as the ifp in the send tag. The if_transmit
routines rewrite the send tags after performing an ifp mismatch
check. If an ifp mismatch is detected, the transmit routines fail
with EAGAIN.
- To provide clearer life cycle management of send tags, especially
in the presence of vlan and lagg wrapper tags, add a reference count
to send tags managed via m_snd_tag_ref() and m_snd_tag_rele().
Provide a helper function (m_snd_tag_init()) for use by drivers
supporting send tags. m_snd_tag_init() takes care of the if_ref
on the ifp meaning that code alloating send tags via if_snd_tag_alloc
no longer has to manage that manually. Similarly, m_snd_tag_rele
drops the refcount on the ifp after invoking if_snd_tag_free when
the last reference to a send tag is dropped.
This also closes use after free races if there are pending packets in
driver tx rings after the socket is closed (e.g. from tcpdrop).
In order for m_free to work reliably, add a new CSUM_SND_TAG flag in
csum_flags to indicate 'snd_tag' is set (rather than 'rcvif').
Drivers now also check this flag instead of checking snd_tag against
NULL. This avoids false positive matches when a forwarded packet
has a non-NULL rcvif that was treated as a send tag.
- cxgbe was relying on snd_tag_free being called when the inp was
detached so that it could kick the firmware to flush any pending
work on the flow. This is because the driver doesn't require ACK
messages from the firmware for every request, but instead does a
kind of manual interrupt coalescing by only setting a flag to
request a completion on a subset of requests. If all of the
in-flight requests don't have the flag when the tag is detached from
the inp, the flow might never return the credits. The current
snd_tag_free command issues a flush command to force the credits to
return. However, the credit return is what also frees the mbufs,
and since those mbufs now hold references on the tag, this meant
that snd_tag_free would never be called.
To fix, explicitly drop the mbuf's reference on the snd tag when the
mbuf is queued in the firmware work queue. This means that once the
inp's reference on the tag goes away and all in-flight mbufs have
been queued to the firmware, tag's refcount will drop to zero and
snd_tag_free will kick in and send the flush request. Note that we
need to avoid doing this in the middle of ethofld_tx(), so the
driver grabs a temporary reference on the tag around that loop to
defer the free to the end of the function in case it sends the last
mbuf to the queue after the inp has dropped its reference on the
tag.
- mlx5 preallocates send tags and was using the ifp pointer even when
the send tag wasn't in use. Explicitly use the ifp from other data
structures instead.
- Sprinkle some assertions in various places to assert that received
packets don't have a send tag, and that other places that overwrite
rcvif (e.g. 802.11 transmit) don't clobber a send tag pointer.
Reviewed by: gallatin, hselasky, rgrimes, ae
Sponsored by: Netflix
Differential Revision: https://reviews.freebsd.org/D20117
that we end up in a consistent locking state at the end of
udp_output() in order to be able to see what the values are based
on which we once took a decision (note: some values may have changed).
This helped to debug a syzkaller report.
MFC after: 2 months
Event: Waterloo Hackathon 2019
These are obviously missing from the .c files, but don't show up in any
tinderbox configuration (due to latent header pollution of some kind). It
seems some configurations don't have this pollution, and the includes are
obviously missing, so go ahead and add them.
Reported by: Peter Jeremy <peter AT rulingia.com>
X-MFC-With: r347984
This allows replacing "sys/eventfilter.h" includes with "sys/_eventfilter.h"
in other header files (e.g., sys/{bus,conf,cpu}.h) and reduces header
pollution substantially.
EVENTHANDLER_DECLARE and EVENTHANDLER_LIST_DECLAREs were moved out of .c
files into appropriate headers (e.g., sys/proc.h, powernv/opal.h).
As a side effect of reduced header pollution, many .c files and headers no
longer contain needed definitions. The remainder of the patch addresses
adding appropriate includes to fix those files.
LOCK_DEBUG and LOCK_FILE_LINE_ARG are moved to sys/_lock.h, as required by
sys/mutex.h since r326106 (but silently protected by header pollution prior
to this change).
No functional change (intended). Of course, any out of tree modules that
relied on header pollution for sys/eventhandler.h, sys/lock.h, or
sys/mutex.h inclusion need to be fixed. __FreeBSD_version has been bumped.
r333175 converted the global multicast lock to a sleepable sx lock,
so the lock order with respect to the (non-sleepable) inp lock changed.
To handle this, r333175 and r333505 added code to drop the inp lock,
but this opened races that could leave multicast group description
structures in an inconsistent state. This change fixes the problem by
simply acquiring the global lock sooner. Along the way, this fixes
some LORs and bogus error handling introduced in r333175, and commits
some related cleanup.
Reported by: syzbot+ba7c4943547e0604faca@syzkaller.appspotmail.com
Reported by: syzbot+1b803796ab94d11a46f9@syzkaller.appspotmail.com
Reviewed by: ae
MFC after: 3 weeks
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D20070
Boot-time netdump configuration is much more useful if one can configure the
client and gateway addresses. Fix trivial typo.
(Long-standing bug, I believe it dates to the original netdump commit.)
Spotted by: one of vangyzen@ or markj@
Sponsored by: Dell EMC Isilon
Prior to this revision, struct diocskerneldump_arg (and struct netdump_conf
with embedded diocskerneldump_arg before r347192), were copied in their
entirety to the global 'nd_conf' variable. Also prior to this revision,
de-configuring netdump would *not* remove the the key material from global
nd_conf.
As part of Encrypted Kernel Crash Dumps (EKCD), which was developed
contemporaneously with netdump but happened to land first, the
diocskerneldump_arg structure will contain sensitive key material
(kda_key[]) when encrypted dumps are configured.
Netdump doesn't have any use for the key data -- encryption is handled in
the core dumper code -- so in this revision, we no longer store it.
Unfortunately, I think this leak dates to the initial import of netdump in
r333283; so it's present in FreeBSD 12.0.
Fortunately, the impact *seems* relatively minor. Any new *netdump*
configuration would overwrite the key material; for active encrypted netdump
configurations, the key data stored was just a duplicate of the key material
already in the core dumper code; and no user interface (other than
/dev/kmem) actually exposed the leaked material to userspace.
Reviewed by: markj, rpokala (earlier commit message)
MFC after: 2 weeks
Security: yes (minor)
Sponsored by: Dell EMC Isilon
Differential Revision: https://reviews.freebsd.org/D20233
Bind the TCP pacer threads to NUMA domains and build per-domain
pacer-thread lookup tables. These tables allow us to use the
inpcb's NUMA domain information to match an inpcb with a pacer
thread on the same domain.
The motivation for this is to keep the TCP connection local to a
NUMA domain as much as possible.
Thanks to jhb for pre-reviewing an earlier version of the patch.
Reviewed by: rrs
Sponsored by: Netflix
Differential Revision: https://reviews.freebsd.org/D20134
This adds initial support for RFC 2883.
Submitted by: Richard Scheffenegger
Reviewed by: rrs@
Differential Revision: https://reviews.freebsd.org/D19334
This is descrined in RFC 6582, which updates RFC 3782.
Submitted by: Richard Scheffenegger
Reviewed by: lstewart@
MFC after: 1 week
Differential Revision: https://reviews.freebsd.org/D17614
lookup KPI in ip_output() like it is already used in ip_forward().
However, when there is no PCB provided we can use fast KPI, gaining
performance advantage.
Typical case when ip_output() is called without a PCB pointer is a
sendto(2) on a not connected UDP socket. In practice DNS servers do
this.
Reviewed by: melifaro
Differential Revision: https://reviews.freebsd.org/D19804
Allow users to specify multiple dump configurations in a prioritized list.
This enables fallback to secondary device(s) if primary dump fails. E.g.,
one might configure a preference for netdump, but fallback to disk dump as a
second choice if netdump is unavailable.
This change does not list-ify netdump configuration, which is tracked
separately from ordinary disk dumps internally; only one netdump
configuration can be made at a time, for now. It also does not implement
IPv6 netdump.
savecore(8) is already capable of scanning and iterating multiple devices
from /etc/fstab or passed on the command line.
This change doesn't update the rc or loader variables 'dumpdev' in any way;
it can still be set to configure a single dump device, and rc.d/savecore
still uses it as a single device. Only dumpon(8) is updated to be able to
configure the more complicated configurations for now.
As part of revving the ABI, unify netdump and disk dump configuration ioctl
/ structure, and leave room for ipv6 netdump as a future possibility.
Backwards-compatibility ioctls are added to smooth ABI transition,
especially for developers who may not keep kernel and userspace perfectly
synced.
Reviewed by: markj, scottl (earlier version)
Relnotes: maybe
Sponsored by: Dell EMC Isilon
Differential Revision: https://reviews.freebsd.org/D19996
Turning on multicast debug made multicast failure worse
because the strings and #define values no longer matched
up. Fix them, and make sure they stay matched-up.
Submitted by: torek
MFC after: 1 week
Sponsored by: iXsystems, Inc.
Drivers can now pass up numa domain information via the
mbuf numa domain field. This information is then used
by TCP syncache_socket() to associate that information
with the inpcb. The domain information is then fed back
into transmitted mbufs in ip{6}_output(). This mechanism
is nearly identical to what is done to track RSS hash values
in the inp_flowid.
Follow on changes will use this information for lacp egress
port selection, binding TCP pacers to the appropriate NUMA
domain, etc.
Reviewed by: markj, kib, slavash, bz, scottl, jtl, tuexen
Sponsored by: Netflix
Differential Revision: https://reviews.freebsd.org/D20028
This GRE-in-UDP encapsulation allows the UDP source port field to be
used as an entropy field for load-balancing of GRE traffic in transit
networks. Also most of multiqueue network cards are able distribute
incoming UDP datagrams to different NIC queues, while very little are
able do this for GRE packets.
When an administrator enables UDP encapsulation with command
`ifconfig gre0 udpencap`, the driver creates kernel socket, that binds
to tunnel source address and after udp_set_kernel_tunneling() starts
receiving of all UDP packets destined to 4754 port. Each kernel socket
maintains list of tunnels with different destination addresses. Thus
when several tunnels use the same source address, they all handled by
single socket. The IP[V6]_BINDANY socket option is used to be able bind
socket to source address even if it is not yet available in the system.
This may happen on system boot, when gre(4) interface is created before
source address become available. The encapsulation and sending of packets
is done directly from gre(4) into ip[6]_output() without using sockets.
Reviewed by: eugen
MFC after: 1 month
Relnotes: yes
Differential Revision: https://reviews.freebsd.org/D19921
In r297225 the initial INP_RLOCK() was replaced by an early
acquisition of an r- or w-lock depending on input variables
possibly extending the write locked area for reasons not entirely
clear but possibly to avoid a later case of unlock and relock
leading to a possible race condition and possibly in order to
allow the route cache to work for connected sockets.
Unfortunately the conditions were not 1:1 replicated (probably
because of the route cache needs). While this would not be a
problem the legacy IP code compared to IPv6 has an extra case
when dealing with IP_SENDSRCADDR. In a particular case we were
holding an exclusive inp lock and acquired the shared udbinfo
lock (now epoch).
When then running into an error case, the locking assertions
on release fired as the udpinfo and inp lock levels did not match.
Break up the special case and in that particular case acquire
and udpinfo lock depending on the exclusitivity of the inp lock.
MFC After: 9 days
Reported-by: syzbot+1f5c6800e4f99bdb1a48@syzkaller.appspotmail.com
Reviewed by: tuexen
Differential Revision: https://reviews.freebsd.org/D19594
fragmented packets.
When sending IPv4 and IPv6 fragmented packets and a fragment is lost,
the mbuf making up the fragment will remain in the temporary hashed
fragment list for a while. If the network interface departs before the
so-called slow timeout clears the packet, the fragment causes a panic
when the timeout kicks in due to accessing a freed network interface
structure.
Make sure that when a network device is departing, all hashed IPv4 and
IPv6 fragments belonging to it, get freed.
Backtrace:
panic()
icmp6_reflect()
hlim = ND_IFINFO(m->m_pkthdr.rcvif)->chlim;
^^^^ rcvif->if_afdata[AF_INET6] is NULL.
icmp6_error()
frag6_freef()
frag6_slowtimo()
pfslowtimo()
softclock_call_cc()
softclock()
ithread_loop()
Differential Revision: https://reviews.freebsd.org/D19622
Reviewed by: bz (network), adrian
MFC after: 1 week
Sponsored by: Mellanox Technologies
The logic was present for the 11 version of the DIOCSKERNELDUMP ioctl, but
had not been updated for the 12 ABI.
Reviewed by: markj
Differential Revision: https://reviews.freebsd.org/D19980
The inp lock is not needed for testing the V6 flag as that flag is set
once when the inp is created and never changes. For non-TCP socket
options the lock is immediately dropped after checking that flag.
This just pushes the lock down to only be acquired for TCP socket
options.
This isn't a hot-path, more a cosmetic cleanup I noticed while reading
the code.
Reviewed by: bz
MFC after: 1 month
Sponsored by: Netflix
Differential Revision: https://reviews.freebsd.org/D19740
ensure that the ip_hl field is valid. Furthermore, ensure that the complete
IPv4 header is contained in the first mbuf. Finally, move the length checks
before relying on them when accessing fields of the IPv4 header.
Reported by: jtl@
Reviewed by: jtl@
MFC after: 1 week
Differential Revision: https://reviews.freebsd.org/D19181
When leaving a multicast group, a hole may be created in the inpcb's
source filter and group membership arrays. To remove the hole, the
succeeding array elements are copied over by one entry. The multicast
code expects that a newly allocated array element is initialized, but
the code which shifts a tail of the array was leaving stale data
in the final entry. Fix this by explicitly reinitializing the last
entry following such a copy.
Reported by: syzbot+f8c3c564ee21d650475e@syzkaller.appspotmail.com
Reviewed by: ae
MFC after: 2 weeks
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D19872
was unlocked and yet the bucket-unlock flag was not
changed to false. This can cause a panic if INVARIANTS
is on and we go through the right path (though rare).
This fixes the correct bug :)
Reported by: syzbot+179a1ad49f3c4c215fa2@syzkaller.appspotmail.com
Reviewed by: tuexen@
There are a few places that use hand crafted versions of the macros
from sys/netinet/in.h making it difficult to actually alter the
values in use by these macros. Correct that by replacing handcrafted
code with proper macro usage.
Reviewed by: karels, kristof
Approved by: bde (mentor)
MFC after: 3 weeks
Sponsored by: John Gilmore
Differential Revision: https://reviews.freebsd.org/D19317
was unlocked and yet the bucket-unlock flag was not
changed to false. This can cause a panic if INVARIANTS
is on and we go through the right path (though rare).
Reported by: syzbot+179a1ad49f3c4c215fa2@syzkaller.appspotmail.com
Reviewed by: tuexen@
MFC after: 1 week
SCTP_SENDALL flag. Allow also only one operation per SCTP endpoint.
This fixes an issue found by running syzkaller and is joint work with rrs@.
MFC after: 1 week
CLAT is customer-side translator that algorithmically translates 1:1
private IPv4 addresses to global IPv6 addresses, and vice versa.
It is implemented as part of ipfw_nat64 kernel module. When module
is loaded or compiled into the kernel, it registers "nat64clat" external
action. External action named instance can be created using `create`
command and then used in ipfw rules. The create command accepts two
IPv6 prefixes `plat_prefix` and `clat_prefix`. If plat_prefix is ommitted,
IPv6 NAT64 Well-Known prefix 64:ff9b::/96 will be used.
# ipfw nat64clat CLAT create clat_prefix SRC_PFX plat_prefix DST_PFX
# ipfw add nat64clat CLAT ip4 from IPv4_PFX to any out
# ipfw add nat64clat CLAT ip6 from DST_PFX to SRC_PFX in
Obtained from: Yandex LLC
Submitted by: Boris N. Lytochkin
MFC after: 1 month
Relnotes: yes
Sponsored by: Yandex LLC
r344504 added an extra ARP_LOG() call in case of an if_output() failure.
It turns out IPv4 can be noisy. In order to not spam the console by default:
(a) add a counter for these events so people can keep better track of how
often it happens, and
(b) add a sysctl to select the default ARP_LOG log level and set it to
INFO avoiding the one (the new) DEBUG level by default.
Claim a spare (1st one after 10 years since the stats were added) in order
to not break netstat from FreeBSD 12->13 updates in the future.
Reviewed by: karels
Differential Revision: https://reviews.freebsd.org/D19490
level socket option SCTP_GET_LOCAL_ADDRESSES in a getsockopt() call.
Thanks to Thomas Barabosch for reporting the issue which was found by
running syzkaller.
MFC after: 3 days