Commit Graph

6307 Commits

Author SHA1 Message Date
glebius
6d3bde7c4a Now that there is no R/W lock on PCB list the pcblist sysctls
handlers can be greatly simplified.  All the previous double
cycling and complex locking was added to avoid these functions
holding global PCB locks for extended period of time, preventing
addition of new entries.
2019-11-07 21:27:32 +00:00
glebius
cb8ae442ac Now that all of the tcp_input() and all its branches are executed
in the network epoch, we can greatly simplify synchronization.
Remove all unneccesary epoch enters hidden under INP_INFO_RLOCK macro.
Remove some unneccesary assertions and convert necessary ones into the
NET_EPOCH_ASSERT macro.
2019-11-07 21:23:07 +00:00
glebius
05dccf5517 Remove unnecessary recursive epoch enter via INP_INFO_RLOCK
macro in udp_input().  It shall always run in the network epoch.
2019-11-07 21:08:49 +00:00
glebius
e3012914f1 Remove now unused INP_HASH_RLOCK() macros. 2019-11-07 21:03:15 +00:00
glebius
f83a4563fd Now with epoch synchronized PCB lookup tables we can greatly simplify
locking in udp_output() and udp6_output().

First, we select if we need read or write lock in PCB itself, we take
the lock and enter network epoch.  Then, we proceed for the rest of
the function.  In case if we need to modify PCB hash, we would take
write lock on it for a short piece of code.

We could exit the epoch before allocating an mbuf, but with this
patch we are keeping it all the way into ip_output()/ip6_output().
Today this creates an epoch recursion, since ip_output() enters epoch
itself.  However, once all protocols are reviewed, ip_output() and
ip6_output() would require epoch instead of entering it.

Note: I'm not 100% sure that in udp6_output() the epoch is required.
We don't do PCB hash lookup for a bound socket.  And all branches of
in6_select_src() don't require epoch, at least they lack assertions.
Today inet6 address list is protected by rmlock, although it is CKLIST.
AFAIU, the future plan is to protect it by network epoch.  That would
require epoch in in6_select_src().  Anyway, in future ip6_output()
would require epoch, udp6_output() would need to enter it.
2019-11-07 21:01:36 +00:00
glebius
deb3c19c55 Add INP_UNLOCK() which will do whatever R/W unlock is required. 2019-11-07 20:57:51 +00:00
glebius
76a6e088e6 Since r353292 on input path we are always in network epoch, when
we lookup PCBs.  Thus, do not enter epoch recursively in
in_pcblookup_hash() and in6_pcblookup_hash().  Same applies to
tcp_ctlinput() and tcp6_ctlinput().

This leaves several sysctl(9) handlers that return PCB credentials
unprotected.  Add epoch enter/exit to all of them.

Differential Revision:	https://reviews.freebsd.org/D22197
2019-11-07 20:49:56 +00:00
glebius
d7a93f3623 Remove unnecessary recursive epoch enter via INP_INFO_RLOCK
macro in divert_packet().  This function is called only from
pfil(9) filters, which in their place always run in the
network epoch.
2019-11-07 20:44:34 +00:00
glebius
61ee0e91c4 Remove unnecessary recursive epoch enter via INP_INFO_RLOCK
macro in raw input functions for IPv4 and IPv6.  They shall
always run in the network epoch.
2019-11-07 20:40:44 +00:00
bz
be19ea6cb3 netinet*: variable cleanup
In preparation for another change factor out various variable cleanups.
These mainly include:
(1) do not assign values to variables during declaration:  this makes
    the code more readable and does allow for better grouping of
    variable declarations,
(2) do not assign values to variables before need; e.g., if a variable
    is only used in the 2nd half of a function and we have multiple
    return paths before that, then do not set it before it is needed, and
(3) try to avoid assigning the same value multiple times.

MFC after:	3 weeks
Sponsored by:	Netflix
2019-11-07 18:29:51 +00:00
glebius
a09b9c105d TCP timers are executed in callout context, so they need to enter network
epoch to look into PCB lists.  Mechanically convert INP_INFO_RLOCK() to
NET_EPOCH_ENTER().  No functional change here.
2019-11-07 00:27:23 +00:00
glebius
d85e3111e4 Mechanically convert INP_INFO_RLOCK() to NET_EPOCH_ENTER() in
TCP functions that are executed in syscall context.  No
functional change here.
2019-11-07 00:10:14 +00:00
glebius
62dc620e39 Mechanically convert INP_INFO_RLOCK() to NET_EPOCH_ENTER().
Remove few outdated comments and extraneous assertions.  No
functional change here.
2019-11-07 00:08:34 +00:00
bz
d7908b9933 Properly set VNET when nuking recvif from fragment queues.
In theory the eventhandler invoke should be in the same VNET as
the the current interface. We however cannot guarantee that for
all cases in the future.

So before checking if the fragmentation handling for this VNET
is active, switch the VNET to the VNET of the interface to always
get the one we want.

Reviewed by:	hselasky
MFC after:	3 weeks
Sponsored by:	Netflix
Differential Revision:	https://reviews.freebsd.org/D22153
2019-10-25 18:54:06 +00:00
tuexen
56626fc5ad Ensure that the flags indicating IPv4/IPv6 are not changed by failing
bind() calls. This would lead to inconsistent state resulting in a panic.
A fix for stable/11 was committed in
https://svnweb.freebsd.org/base?view=revision&revision=338986
An accelerated MFC is planned as discussed with emaste@.

Reported by:		syzbot+2609a378d89264ff5a42@syzkaller.appspotmail.com
Obtained from:		jtl@
MFC after:		1 day
Sponsored by:		Netflix, Inc.
2019-10-24 20:05:10 +00:00
tuexen
ddaae075ce Store a handle for the event handler. This will be used when unloading the
SCTP as a module.

Obtained from:		markj@
2019-10-24 09:22:23 +00:00
rrs
f651cbcfcc Fix a small bug in bbr when running under a VM. Basically what
happens is we are more delayed in the pacer calling in so
we remove the stack from the pacer and recalculate how
much time is left after all data has been acknowledged. However
the comparision was backwards so we end up with a negative
value in the last_pacing_delay time which causes us to
add in a huge value to the next pacing time thus stalling
the connection.

Reported by:	vm2.finance@gmail.com
2019-10-24 05:54:30 +00:00
tuexen
71224580cf Fix compile issues when building a kernel without the VIMAGE option.
Thanks to cem@ for discussing the issue which resulted in this patch.

Reviewed by:		cem@
Sponsored by:		Netflix, Inc.
Differential Revision:	https://reviews.freebsd.org/D22089
2019-10-19 20:48:53 +00:00
cem
abc2745a10 debugnet(4): Add optional full-duplex mode
It remains unattached to any client protocol.  Netdump is unaffected
(remaining half-duplex).  The intended consumer is NetGDB.

Submitted by:	John Reimer <john.reimer AT emc.com> (earlier version)
Discussed with:	markj
Differential Revision:	https://reviews.freebsd.org/D21541
2019-10-17 20:25:15 +00:00
cem
f92f351606 debugnet(4): Infer non-server connection parameters
Loosen requirements for connecting to debugnet-type servers.  Only require a
destination address; the rest can theoretically be inferred from the routing
table.

Relax corresponding constraints in netdump(4) and move ifp validation to
debugnet connection time.

Submitted by:	John Reimer <john.reimer AT emc.com> (earlier version)
Reviewed by:	markj
Differential Revision:	https://reviews.freebsd.org/D21482
2019-10-17 20:10:32 +00:00
cem
b0452a96d2 Add ddb(4) 'netdump' command to netdump a core without preconfiguration
Add a 'X -s <server> -c <client> [-g <gateway>] -i <interface>' subroutine
to the generic debugnet code.  The imagined use is both netdump, shown here,
and NetGDB (vaporware).  It uses the ddb(4) lexer, with some new extensions,
to parse out IPv4 addresses.

'Netdump' uses the generic debugnet routine to load a configuration and
start a dump, without any netdump configuration prior to panic.

Loosely derived from work by:	John Reimer <john.reimer AT emc.com>
Reviewed by:	markj
Differential Revision:	https://reviews.freebsd.org/D21460
2019-10-17 19:49:20 +00:00
glebius
a38665c7e7 Quickly fix up r353683: enter the epoch before calling into netisr_dispatch(). 2019-10-17 17:02:50 +00:00
cem
f3a0ee41db Split out a more generic debugnet(4) from netdump(4)
Debugnet is a simplistic and specialized panic- or debug-time reliable
datagram transport.  It can drive a single connection at a time and is
currently unidirectional (debug/panic machine transmit to remote server
only).

It is mostly a verbatim code lift from netdump(4).  Netdump(4) remains
the only consumer (until the rest of this patch series lands).

The INET-specific logic has been extracted somewhat more thoroughly than
previously in netdump(4), into debugnet_inet.c.  UDP-layer logic and up, as
much as possible as is protocol-independent, remains in debugnet.c.  The
separation is not perfect and future improvement is welcome.  Supporting
INET6 is a long-term goal.

Much of the diff is "gratuitous" renaming from 'netdump_' or 'nd_' to
'debugnet_' or 'dn_' -- sorry.  I thought keeping the netdump name on the
generic module would be more confusing than the refactoring.

The only functional change here is the mbuf allocation / tracking.  Instead
of initiating solely on netdump-configured interface(s) at dumpon(8)
configuration time, we watch for any debugnet-enabled NIC for link
activation and query it for mbuf parameters at that time.  If they exceed
the existing high-water mark allocation, we re-allocate and track the new
high-water mark.  Otherwise, we leave the pre-panic mbuf allocation alone.
In a future patch in this series, this will allow initiating netdump from
panic ddb(4) without pre-panic configuration.

No other functional change intended.

Reviewed by:	markj (earlier version)
Some discussion with:	emaste, jhb
Objection from:	marius
Differential Revision:	https://reviews.freebsd.org/D21421
2019-10-17 16:23:03 +00:00
glebius
a6c3971a22 igmp_v1v2_queue_report() doesn't require epoch. 2019-10-17 16:02:34 +00:00
hselasky
94dc322ef6 Fix panic in network stack due to use after free when receiving
partial fragmented packets before a network interface is detached.

When sending IPv4 or IPv6 fragmented packets and a fragment is lost
before the network device is freed, the mbuf making up the fragment
will remain in the temporary hashed fragment list and cause a panic
when it times out due to accessing a freed network interface
structure.


1) Make sure the m_pkthdr.rcvif always points to a valid network
interface. Else the rcvif field should be set to NULL.

2) Use the rcvif of the last received fragment as m_pkthdr.rcvif for
the fully defragged packet, instead of the first received fragment.

Panic backtrace for IPv6:

panic()
icmp6_reflect() # tries to access rcvif->if_afdata[AF_INET6]->xxx
icmp6_error()
frag6_freef()
frag6_slowtimo()
pfslowtimo()
softclock_call_cc()
softclock()
ithread_loop()

Reviewed by:	bz
Differential Revision:	https://reviews.freebsd.org/D19622
MFC after:	1 week
Sponsored by:	Mellanox Technologies
2019-10-16 09:11:49 +00:00
tuexen
9c9657076e Separate out SCTP related dtrace code.
This is based on work done by markj@.

Discussed with:		markj@
MFC after:		3 days
2019-10-14 20:32:11 +00:00
rrs
2271f78dc9 if_hw_tsomaxsegsize needs to be initialized to zero, just
like in bbr.c and tcp_output.c
2019-10-14 13:10:29 +00:00
tuexen
c376e82038 Rename sctp_dtrace_declare.h to sctp_kdtrace.h for consistentcy.
MFC after:		3 days
2019-10-14 13:02:49 +00:00
glebius
c58d10ed59 Revert r353313. It is not needed with r353357 and is actually incorrect. 2019-10-14 04:10:00 +00:00
tuexen
e037f7f53e Use an event handler to notify the SCTP about IP address changes
instead of calling an SCTP specific function from the IP code.
This is a requirement of supporting SCTP as a kernel loadable module.
This patch was developed by markj@, I tweaked a bit the SCTP related
code.

Submitted by:		markj@
MFC after:		3 days
2019-10-13 18:17:08 +00:00
markj
8cb00574b8 Move SCTP DTrace probe definitions into a .c file.
Previously they were defined in a header which was included exactly
once.  Change this to follow the usual practice of putting definitions
in C files.  No functional change intended.

Discussed with:	tuexen
MFC after:	1 week
Sponsored by:	The FreeBSD Foundation
2019-10-13 16:14:04 +00:00
tuexen
b9f81c4458 Ensure that local variables are reset to their initial value when
dealing with error cases in a loop over all remote addresses.
This issue was found and reported by OSS_Fuzz in:
https://bugs.chromium.org/p/oss-fuzz/issues/detail?id=18080
https://bugs.chromium.org/p/oss-fuzz/issues/detail?id=18086
https://bugs.chromium.org/p/oss-fuzz/issues/detail?id=18121
https://bugs.chromium.org/p/oss-fuzz/issues/detail?id=18163

MFC after:		3 days
2019-10-12 17:57:03 +00:00
glebius
3c520efd9f The divert(4) module must always be running in network epoch, thus
call to if_addr_rlock() isn't needed.
2019-10-10 23:48:42 +00:00
imp
516b5533ca Fix casting error from newer gcc
Cast the pointers to (uintptr_t) before assigning to type
uint64_t. This eliminates an error from gcc when we cast the pointer
to a larger integer type.
2019-10-09 21:02:06 +00:00
hselasky
ff0a2f4dbc Factor out TCP rateset destruction code.
Ensure the epoch_call() function is not called more than one time
before the callback has been executed, by always checking the
RS_FUNERAL_SCHD flag before invoking epoch_call().

The "rs_number_dead" is balanced again after r353353.

Discussed with:	rrs@
Sponsored by:	Mellanox Technologies
2019-10-09 17:08:40 +00:00
glebius
ac17d49973 Revert most of the multicast changes from r353292. This needs a more
accurate approach.
2019-10-09 17:03:20 +00:00
hselasky
166e590d9c Fix locking order reversal in the TCP ratelimit code by moving
destructors outside the rsmtx mutex.

Witness message:
lock order reversal: (sleepable after non-sleepable)
   1st tcp_rs_mtx (rsmtx) @ sys/netinet/tcp_ratelimit.c:242
   2nd sysctl lock (sysctl lock) @ sys/kern/kern_sysctl.c:607

Backtrace:
witness_debugger
witness_checkorder
_rm_wlock_debug
sysctl_ctx_free
rs_destroy
epoch_call_task
gtaskqueue_run_locked
gtaskqueue_thread_loop

Discussed with:	rrs@
Sponsored by:	Mellanox Technologies
2019-10-09 16:48:48 +00:00
jhb
02e5a4c53c Add a TOE KTLS mode and a TOE hook for allocating TLS sessions.
This adds the glue to allocate TLS sessions and invokes it from
the TLS enable socket option handler.  This also adds some counters
for active TOE sessions.

The TOE KTLS mode is returned by getsockopt(TLSTX_TLS_MODE) when
TOE KTLS is in use on a socket, but cannot be set via setsockopt().

To simplify various checks, a TLS session now includes an explicit
'mode' member set to the value returned by TLSTX_TLS_MODE.  Various
places that used to check 'sw_encrypt' against NULL to determine
software vs ifnet (NIC) TLS now check 'mode' instead.

Reviewed by:	np, gallatin
Sponsored by:	Chelsio Communications
Differential Revision:	https://reviews.freebsd.org/D21891
2019-10-08 21:34:06 +00:00
glebius
1c64917ffa Quickly plug another regression from r353292. Again, multicast locking needs
lots of work...

Reported by:	pho
2019-10-08 16:59:17 +00:00
tuexen
e8c2889eec Validate length before use it, not vice versa.
r353060 should have contained this...
This fixes
https://bugs.chromium.org/p/oss-fuzz/issues/detail?id=18070
MFC after:		3 days
2019-10-08 11:07:16 +00:00
glebius
337378e04f Widen NET_EPOCH coverage.
When epoch(9) was introduced to network stack, it was basically
dropped in place of existing locking, which was mutexes and
rwlocks. For the sake of performance mutex covered areas were
as small as possible, so became epoch covered areas.

However, epoch doesn't introduce any contention, it just delays
memory reclaim. So, there is no point to minimise epoch covered
areas in sense of performance. Meanwhile entering/exiting epoch
also has non-zero CPU usage, so doing this less often is a win.

Not the least is also code maintainability. In the new paradigm
we can assume that at any stage of processing a packet, we are
inside network epoch. This makes coding both input and output
path way easier.

On output path we already enter epoch quite early - in the
ip_output(), in the ip6_output().

This patch does the same for the input path. All ISR processing,
network related callouts, other ways of packet injection to the
network stack shall be performed in net_epoch. Any leaf function
that walks network configuration now asserts epoch.

Tricky part is configuration code paths - ioctls, sysctls. They
also call into leaf functions, so some need to be changed.

This patch would introduce more epoch recursions (see EPOCH_TRACE)
than we had before. They will be cleaned up separately, as several
of them aren't trivial. Note, that unlike a lock recursion the
epoch recursion is safe and just wastes a bit of resources.

Reviewed by:	gallatin, hselasky, cy, adrian, kristof
Differential Revision:	https://reviews.freebsd.org/D19111
2019-10-07 22:40:05 +00:00
tuexen
03060312fd In r343587 a simple port filter as sysctl tunable was added to siftr.
The new sysctl was not added to the siftr.4 man page at the time.
This updates the man page, and removes one left over trailing whitespace.

Submitted by:		Richard Scheffenegger
Reviewed by:		bcr@
MFC after:		3 days
Differential Revision:	https://reviews.freebsd.org/D21619
2019-10-07 20:35:04 +00:00
rrs
b1b5cae2c8 Brad Davis identified a problem with the new LRO code, VLAN's
no longer worked. The problem was that the defines used the
same space as the VLAN id. This commit does three things.
1) Move the LRO used fields to the PH_per fields. This is
   safe since the entire PH_per is used for IP reassembly
   which LRO code will not hit.
2) Remove old unused pace fields that are not used in mbuf.h
3) The VLAN processing is not in the mbuf queueing code. Consequently
   if a VLAN submits to Rack or BBR we need to bypass the mbuf queueing
   for now until rack_bbr_common is updated to handle the VLAN properly.

Reported by:	Brad Davis
2019-10-06 22:29:02 +00:00
tuexen
f1dc180209 Plumb an mbuf leak in a code path that should not be taken. Also avoid
that this path is taken by setting the tail pointer correctly.
There is still bug related to handling unordered unfragmented messages
which were delayed in deferred handling.
This issue was found by OSS-Fuzz testing the usrsctp stack and reported in
https://bugs.chromium.org/p/oss-fuzz/issues/detail?id=17794

MFC after:		3 days
2019-10-06 08:47:10 +00:00
tuexen
e878b27186 Fix a use after free bug when removing remote addresses.
This bug was found by OSS-Fuzz and reported in
https://bugs.chromium.org/p/oss-fuzz/issues/detail?id=18004

MFC after:		3 days
2019-10-05 13:28:01 +00:00
tuexen
15678bfa06 Plumb an mbuf leak found by Mark Wodrich from Google by fuzz testing the
userland stack and reporting it in:
https://github.com/sctplab/usrsctp/issues/396

MFC after:		3 days
2019-10-05 12:34:50 +00:00
tuexen
353256a9ce Fix the adding of padding to COOKIE-ECHO chunks.
Thanks to Mark Wodrich who found this issue while fuzz testing the
usrsctp stack and reported the issue in
https://github.com/sctplab/usrsctp/issues/382

MFC after:		3 days
2019-10-05 09:46:11 +00:00
tuexen
f630826573 When skipping the address parameter, take the padding into account.
MFC after:		3 days
2019-10-03 20:47:57 +00:00
tuexen
9d0210a9cf Cleanup sctp_asconf_error_response() and ensure that the parameter
is padded as required. This fixes the followig bug reported by
OSS-Fuzz for the usersctp stack:
https://bugs.chromium.org/p/oss-fuzz/issues/detail?id=17790

MFC after:		3 days
2019-10-03 20:39:17 +00:00
tuexen
ff50a5791b Add missing input validation. This could result in reading from
uninitialized memory.
The issue was found by OSS-Fuzz for usrsctp  and reported in
https://bugs.chromium.org/p/oss-fuzz/issues/detail?id=17780

MFC after:		3 days
2019-10-03 18:36:54 +00:00