6389 Commits

Author SHA1 Message Date
Conrad Meyer
e2e050c8ef Extract eventhandler declarations to sys/_eventhandler.h
This allows replacing "sys/eventhandler.h" includes with "sys/_eventhandler.h"
in other header files (e.g., sys/{bus,conf,cpu}.h) and reduces header
pollution substantially.

EVENTHANDLER_DECLARE and EVENTHANDLER_LIST_DECLAREs were moved out of .c
files into appropriate headers (e.g., sys/proc.h, powernv/opal.h).

As a side effect of reduced header pollution, many .c files and headers no
longer contain needed definitions.  The remainder of the patch addresses
adding appropriate includes to fix those files.

LOCK_DEBUG and LOCK_FILE_LINE_ARG are moved to sys/_lock.h, as required by
sys/mutex.h since r326106 (but silently protected by header pollution prior
to this change).

No functional change (intended).  Of course, any out of tree modules that
relied on header pollution for sys/eventhandler.h, sys/lock.h, or
sys/mutex.h inclusion need to be fixed.  __FreeBSD_version has been bumped.
2019-05-20 00:38:23 +00:00
Michael Tuexen
3a4f12e3b1 Allow sending on demand SCTP HEARTBEATS only in the ESTABLISHED state.
This issue was found by running syzkaller.

MFC after:		3 days
2019-05-19 17:53:36 +00:00
Michael Tuexen
fc26bf717c Improve input validation for the IPPROTO_SCTP level socket options
SCTP_CONNECT_X and SCTP_CONNECT_X_DELAYED.

Some issues were found by running syzkaller.

MFC after:		3 days
2019-05-19 17:28:00 +00:00
Mark Johnston
f00876fb60 Revert r347582 for now.
The inp lock still needs to be dropped when calling into the driver ioctl
handler, as some drivers expect to be able to sleep.

Reported by:	kib
2019-05-16 13:04:26 +00:00
Mark Johnston
5a1e222bfd Close some races in multicast socket option handling.
r333175 converted the global multicast lock to a sleepable sx lock,
so the lock order with respect to the (non-sleepable) inp lock changed.
To handle this, r333175 and r333505 added code to drop the inp lock,
but this opened races that could leave multicast group description
structures in an inconsistent state.  This change fixes the problem by
simply acquiring the global lock sooner.  Along the way, this fixes
some LORs and bogus error handling introduced in r333175, and commits
some related cleanup.

Reported by:	syzbot+ba7c4943547e0604faca@syzkaller.appspotmail.com
Reported by:	syzbot+1b803796ab94d11a46f9@syzkaller.appspotmail.com
Reviewed by:	ae
MFC after:	3 weeks
Sponsored by:	The FreeBSD Foundation
Differential Revision:	https://reviews.freebsd.org/D20070
2019-05-14 21:30:55 +00:00
Conrad Meyer
64e7d18f34 netdump: Ref the interface we're attached to
Serialize netdump configuration / deconfiguration, and discard our
configuration when the affiliated interface goes away by monitoring
ifnet_departure_event.

Reviewed by:	markj, with input from vangyzen@ (earlier version)
Sponsored by:	Dell EMC Isilon
Differential Revision:	https://reviews.freebsd.org/D20206
2019-05-10 23:12:59 +00:00
Conrad Meyer
070e7bf95e netdump: Fix boot-time configuration typo
Boot-time netdump configuration is much more useful if one can configure the
client and gateway addresses.  Fix trivial typo.

(Long-standing bug, I believe it dates to the original netdump commit.)

Spotted by:	one of vangyzen@ or markj@
Sponsored by:	Dell EMC Isilon
2019-05-10 23:10:22 +00:00
Conrad Meyer
6144b50f8b netdump: Don't store sensitive key data we don't need
Prior to this revision, struct diocskerneldump_arg (and struct netdump_conf
with embedded diocskerneldump_arg before r347192), were copied in their
entirety to the global 'nd_conf' variable.  Also prior to this revision,
de-configuring netdump would *not* remove the key material from global
nd_conf.

As part of Encrypted Kernel Crash Dumps (EKCD), which was developed
contemporaneously with netdump but happened to land first, the
diocskerneldump_arg structure will contain sensitive key material
(kda_key[]) when encrypted dumps are configured.

Netdump doesn't have any use for the key data -- encryption is handled in
the core dumper code -- so in this revision, we no longer store it.

Unfortunately, I think this leak dates to the initial import of netdump in
r333283; so it's present in FreeBSD 12.0.

Fortunately, the impact *seems* relatively minor.  Any new *netdump*
configuration would overwrite the key material; for active encrypted netdump
configurations, the key data stored was just a duplicate of the key material
already in the core dumper code; and no user interface (other than
/dev/kmem) actually exposed the leaked material to userspace.

Reviewed by:	markj, rpokala (earlier commit message)
MFC after:	2 weeks
Security:	yes (minor)
Sponsored by:	Dell EMC Isilon
Differential Revision:	https://reviews.freebsd.org/D20233
2019-05-10 21:55:11 +00:00
Gleb Smirnoff
54bb7ac0c4 Fix regression from r347375: do not panic when sending an IP multicast
packet from an interface that doesn't have IPv4 address.

Reported by:	Michael Butler <imb protected-networks.net>
2019-05-10 21:51:17 +00:00
Andrew Gallatin
4e255d7479 Bind TCP HPTS (pacer) threads to NUMA domains
Bind the TCP pacer threads to NUMA domains and build per-domain
pacer-thread lookup tables. These tables allow us to use the
inpcb's NUMA domain information to match an inpcb with a pacer
thread on the same domain.
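
The lookup described above can be sketched as follows. This is only an illustration under stated assumptions, not the HPTS code: the table layout, the limits MAX_DOMAINS and MAX_THREADS, and the names pacer_table_build() and pacer_pick() are all hypothetical.

```c
#include <stdint.h>
#include <string.h>

#define MAX_DOMAINS	8	/* assumed limit for this sketch */
#define MAX_THREADS	64	/* assumed limit for this sketch */

/* Hypothetical per-domain lookup table of pacer thread indices. */
struct pacer_table {
	int nthreads;
	int count[MAX_DOMAINS];
	int thread[MAX_DOMAINS][MAX_THREADS];
};

/* thread_domain[i] is the NUMA domain that pacer thread i is bound to. */
static void
pacer_table_build(struct pacer_table *t, const int *thread_domain, int nthreads)
{
	memset(t, 0, sizeof(*t));
	if (nthreads > MAX_THREADS)
		nthreads = MAX_THREADS;
	t->nthreads = nthreads;
	for (int i = 0; i < nthreads; i++) {
		int d = thread_domain[i];

		if (d >= 0 && d < MAX_DOMAINS)
			t->thread[d][t->count[d]++] = i;
	}
}

/*
 * Pick a pacer thread for a connection: prefer a thread on the
 * connection's own domain, otherwise fall back to any thread.
 * Returns -1 if there are no pacer threads at all.
 */
static int
pacer_pick(const struct pacer_table *t, int domain, uint32_t flowid)
{
	if (t->nthreads <= 0)
		return (-1);
	if (domain < 0 || domain >= MAX_DOMAINS || t->count[domain] == 0)
		return ((int)(flowid % (uint32_t)t->nthreads));
	return (t->thread[domain][flowid % (uint32_t)t->count[domain]]);
}
```

Spreading connections within a domain by flow id is a choice made for the sketch; any stable per-connection value would do.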

The motivation for this is to keep the TCP connection local to a
NUMA domain as much as possible.

Thanks to jhb for pre-reviewing an earlier version of the patch.

Reviewed by:	rrs
Sponsored by:	Netflix
Differential Revision:	https://reviews.freebsd.org/D20134
2019-05-10 13:41:19 +00:00
Michael Tuexen
b5a154d8e3 Don't use C++ style comments.
These were introduced in r347382.
Reported by:		ngie@
2019-05-09 21:00:15 +00:00
Michael Tuexen
5acfd95cbc Receiver side DSACK implementation.
This adds initial support for RFC 2883.

Submitted by:		Richard Scheffenegger
Reviewed by:		rrs@
Differential Revision:	https://reviews.freebsd.org/D19334
2019-05-09 07:34:15 +00:00
Michael Tuexen
5cc11a89db Prevent cwnd from collapsing down to 1 MSS after exiting recovery.
This is described in RFC 6582, which obsoletes RFC 3782.

Submitted by:		Richard Scheffenegger
Reviewed by:		lstewart@
MFC after:		1 week
Differential Revision:	https://reviews.freebsd.org/D17614
2019-05-09 07:11:08 +00:00
Gleb Smirnoff
6ca363eb7b Existence of PCB route caching doesn't allow us to use new fast route
lookup KPI in ip_output() like it is already used in ip_forward().
However, when there is no PCB provided we can use fast KPI, gaining
performance advantage.

Typical case when ip_output() is called without a PCB pointer is a
sendto(2) on a not connected UDP socket. In practice DNS servers do
this.

Reviewed by:	melifaro
Differential Revision:	https://reviews.freebsd.org/D19804
2019-05-08 23:39:24 +00:00
Conrad Meyer
6b6e2954dd List-ify kernel dump device configuration
Allow users to specify multiple dump configurations in a prioritized list.
This enables fallback to secondary device(s) if primary dump fails.  E.g.,
one might configure a preference for netdump, but fallback to disk dump as a
second choice if netdump is unavailable.

This change does not list-ify netdump configuration, which is tracked
separately from ordinary disk dumps internally; only one netdump
configuration can be made at a time, for now.  It also does not implement
IPv6 netdump.

savecore(8) is already capable of scanning and iterating multiple devices
from /etc/fstab or passed on the command line.

This change doesn't update the rc or loader variables 'dumpdev' in any way;
it can still be set to configure a single dump device, and rc.d/savecore
still uses it as a single device.  Only dumpon(8) is updated to be able to
configure the more complicated configurations for now.

As part of revving the ABI, unify netdump and disk dump configuration ioctl
/ structure, and leave room for ipv6 netdump as a future possibility.
Backwards-compatibility ioctls are added to smooth ABI transition,
especially for developers who may not keep kernel and userspace perfectly
synced.

Reviewed by:	markj, scottl (earlier version)
Relnotes:	maybe
Sponsored by:	Dell EMC Isilon
Differential Revision:	https://reviews.freebsd.org/D19996
2019-05-06 18:24:07 +00:00
Alexander Motin
fb6a844704 ip multicast debug: fix strings vs defines
Turning on multicast debug made multicast failure worse
because the strings and #define values no longer matched
up.  Fix them, and make sure they stay matched-up.

Submitted by:	torek
MFC after:	1 week
Sponsored by:	iXsystems, Inc.
2019-04-29 18:09:55 +00:00
Andrew Gallatin
50575ce11c Track TCP connection's NUMA domain in the inpcb
Drivers can now pass up numa domain information via the
mbuf numa domain field.  This information is then used
by TCP syncache_socket() to associate that information
with the inpcb. The domain information is then fed back
into transmitted mbufs in ip{6}_output(). This mechanism
is nearly identical to what is done to track RSS hash values
in the inp_flowid.

Follow on changes will use this information for lacp egress
port selection, binding TCP pacers to the appropriate NUMA
domain, etc.

Reviewed by:	markj, kib, slavash, bz, scottl, jtl, tuexen
Sponsored by:	Netflix
Differential Revision:	https://reviews.freebsd.org/D20028
2019-04-25 15:37:28 +00:00
Andrey V. Elsukov
aee793eec9 Add GRE-in-UDP encapsulation support as defined in RFC8086.
This GRE-in-UDP encapsulation allows the UDP source port field to be
used as an entropy field for load-balancing of GRE traffic in transit
networks. Also, most multiqueue network cards are able to distribute
incoming UDP datagrams to different NIC queues, while very few are
able to do this for GRE packets.

When an administrator enables UDP encapsulation with the command
`ifconfig gre0 udpencap`, the driver creates a kernel socket that binds
to the tunnel source address and, after udp_set_kernel_tunneling(), starts
receiving all UDP packets destined to port 4754. Each kernel socket
maintains a list of tunnels with different destination addresses. Thus,
when several tunnels use the same source address, they are all handled by
a single socket.  The IP[V6]_BINDANY socket option is used to be able to bind
the socket to the source address even if it is not yet available in the system.
This may happen at system boot, when the gre(4) interface is created before
the source address becomes available. The encapsulation and sending of packets
is done directly from gre(4) into ip[6]_output() without using sockets.
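
As a rough illustration of using the UDP source port as an entropy field, the sketch below maps a per-flow hash onto the 49152-65535 range that RFC 8086 suggests for source ports. The function name gre_udp_sport() and the modulo mapping are assumptions made for this example and are not taken from the gre(4) driver.

```c
#include <stdint.h>

#define GRE_UDP_DPORT	4754	/* GRE-in-UDP destination port */
#define EPHEMERAL_FIRST	49152	/* first port of the suggested source range */
#define EPHEMERAL_COUNT	16384	/* 49152..65535 inclusive */

/*
 * Derive a UDP source port from a flow hash.  Packets of one flow share
 * a hash and therefore a port, so transit load balancing keeps them on
 * one path while different flows can spread out.
 */
static uint16_t
gre_udp_sport(uint32_t flowhash)
{
	return ((uint16_t)(EPHEMERAL_FIRST + flowhash % EPHEMERAL_COUNT));
}
```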

Reviewed by:	eugen
MFC after:	1 month
Relnotes:	yes
Differential Revision:	https://reviews.freebsd.org/D19921
2019-04-24 09:05:45 +00:00
Conrad Meyer
a9f7f19242 netdump: Fix !COMPAT_FREEBSD11 unused variable warning
Reported by:	Ralf Wenk <iz-rpi03_hs-karlsruhe.de>
Sponsored by:	Dell EMC Isilon
2019-04-23 17:05:57 +00:00
Bjoern A. Zeeb
d86ecbe993 Fix udp_output() lock inconsistency.
In r297225 the initial INP_RLOCK() was replaced by an early
acquisition of an r- or w-lock depending on input variables
possibly extending the write locked area for reasons not entirely
clear but possibly to avoid a later case of unlock and relock
leading to a possible race condition and possibly in order to
allow the route cache to work for connected sockets.

Unfortunately the conditions were not 1:1 replicated (probably
because of the route cache needs). While this would not be a
problem the legacy IP code compared to IPv6 has an extra case
when dealing with IP_SENDSRCADDR. In a particular case we were
holding an exclusive inp lock and acquired the shared udpinfo
lock (now epoch).
When then running into an error case, the locking assertions
on release fired as the udpinfo and inp lock levels did not match.

Break up the special case and in that particular case acquire
the udpinfo lock depending on the exclusivity of the inp lock.

MFC After:	9 days
Reported-by:	syzbot+1f5c6800e4f99bdb1a48@syzkaller.appspotmail.com
Reviewed by:	tuexen
Differential Revision:	https://reviews.freebsd.org/D19594
2019-04-23 10:12:33 +00:00
Hans Petter Selasky
6bbdbbb830 Revert r346530 until further notice.
MFC after:		1 week
Sponsored by:		Mellanox Technologies
2019-04-22 19:36:19 +00:00
Bjoern A. Zeeb
ade1258dc1 r297225 moved the assignment of sin from addr to the top of the function.
sin is not changed after the initial assignment, so no need to set it again.

MFC after:	10 days
2019-04-22 14:53:53 +00:00
Bjoern A. Zeeb
e932299837 Remove some excessive brackets.
No functional change.

MFC after:	10 days
2019-04-22 14:20:49 +00:00
Hans Petter Selasky
04f44499ca Fix build for mips and powerpc after r346530.
Need to include sys/kernel.h to define SYSINIT() which is used
by sys/eventhandler.h .

MFC after:		1 week
Sponsored by:		Mellanox Technologies
2019-04-22 08:32:00 +00:00
Hans Petter Selasky
40eb389666 Fix panic in network stack due to memory use after free in relation to
fragmented packets.

When sending IPv4 and IPv6 fragmented packets and a fragment is lost,
the mbuf making up the fragment will remain in the temporary hashed
fragment list for a while. If the network interface departs before the
so-called slow timeout clears the packet, the fragment causes a panic
when the timeout kicks in due to accessing a freed network interface
structure.

Make sure that when a network device is departing, all hashed IPv4 and
IPv6 fragments belonging to it, get freed.
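
The purge step can be pictured with a small singly linked list. This is a simplified sketch, not the reassembly code: struct frag, the integer interface id, and the helpers frag_push() and frag_purge_if() are hypothetical stand-ins for the hashed fragment queues and the interface pointer.

```c
#include <stdlib.h>

/* Hypothetical queued fragment that remembers its receiving interface. */
struct frag {
	struct frag *next;
	int rcvif;		/* stand-in for the interface pointer */
};

/* Queue a new fragment received on interface ifid. */
static void
frag_push(struct frag **head, int ifid)
{
	struct frag *f = malloc(sizeof(*f));

	if (f == NULL)
		abort();
	f->rcvif = ifid;
	f->next = *head;
	*head = f;
}

/*
 * Unlink and free every queued fragment that was received on ifid, so
 * that no later timeout can touch the departed interface through it.
 * Returns the number of fragments freed.
 */
static int
frag_purge_if(struct frag **head, int ifid)
{
	struct frag **pp = head, *f;
	int freed = 0;

	while ((f = *pp) != NULL) {
		if (f->rcvif == ifid) {
			*pp = f->next;
			free(f);
			freed++;
		} else
			pp = &f->next;
	}
	return (freed);
}
```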

Backtrace:
panic()
icmp6_reflect()

hlim = ND_IFINFO(m->m_pkthdr.rcvif)->chlim;
^^^^ rcvif->if_afdata[AF_INET6] is NULL.

icmp6_error()
frag6_freef()
frag6_slowtimo()
pfslowtimo()
softclock_call_cc()
softclock()
ithread_loop()

Differential Revision:	https://reviews.freebsd.org/D19622
Reviewed by:		bz (network), adrian
MFC after:		1 week
Sponsored by:		Mellanox Technologies
2019-04-22 07:27:24 +00:00
Conrad Meyer
60ade167fd netdump: Fix 11 compatibility DIOCSKERNELDUMP ioctl
The logic was present for the 11 version of the DIOCSKERNELDUMP ioctl, but
had not been updated for the 12 ABI.

Reviewed by:	markj
Differential Revision:	https://reviews.freebsd.org/D19980
2019-04-20 16:07:29 +00:00
John Baldwin
68cea2b106 Push down INP_WLOCK slightly in tcp_ctloutput.
The inp lock is not needed for testing the V6 flag as that flag is set
once when the inp is created and never changes.  For non-TCP socket
options the lock is immediately dropped after checking that flag.
This just pushes the lock down to only be acquired for TCP socket
options.

This isn't a hot-path, more a cosmetic cleanup I noticed while reading
the code.

Reviewed by:	bz
MFC after:	1 month
Sponsored by:	Netflix
Differential Revision:	https://reviews.freebsd.org/D19740
2019-04-18 23:21:26 +00:00
Michael Tuexen
20a6a3a7a7 When sending IPv4 packets on a SOCK_RAW socket using the IP_HDRINCL option,
ensure that the ip_hl field is valid. Furthermore, ensure that the complete
IPv4 header is contained in the first mbuf. Finally, move the length checks
before relying on them when accessing fields of the IPv4 header.
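
The order of checks described above can be sketched on a flat buffer. This is an assumption-laden illustration: the kernel works on mbuf chains, and ipv4_hdr_ok() is a hypothetical helper, not a function from the tree.

```c
#include <stddef.h>
#include <stdint.h>

#define IPV4_MIN_HDRLEN	20	/* header without options, in bytes */

/*
 * Return 1 if buf (the first contiguous chunk of a user-supplied packet,
 * buflen bytes long) starts with a plausible IPv4 header, 0 otherwise.
 */
static int
ipv4_hdr_ok(const uint8_t *buf, size_t buflen)
{
	size_t hlen;

	/* Length check first: do not read header fields that may be absent. */
	if (buflen < IPV4_MIN_HDRLEN)
		return (0);
	/* ip_hl is the low nibble of the first byte, in 32-bit words. */
	hlen = (size_t)(buf[0] & 0x0f) << 2;
	if (hlen < IPV4_MIN_HDRLEN)
		return (0);
	/* The complete header, options included, must be in this chunk. */
	if (hlen > buflen)
		return (0);
	return (1);
}
```

A caller would reject the packet with an error when the helper returns 0.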
Reported by:		jtl@
Reviewed by:		jtl@
MFC after:		1 week
Differential Revision:	https://reviews.freebsd.org/D19181
2019-04-13 10:47:47 +00:00
Michael Tuexen
6982c0fac1 Fix an SCTP related locking issue. Don't report that the TCB_SEND_LOCK
is owned, when it is not.

This issue was found by running syzkaller.
MFC after:		1 week
2019-04-11 20:39:12 +00:00
Mark Johnston
f1ef572a1e Reinitialize multicast source filter structures after invalidation.
When leaving a multicast group, a hole may be created in the inpcb's
source filter and group membership arrays.  To remove the hole, the
succeeding array elements are copied over by one entry.  The multicast
code expects that a newly allocated array element is initialized, but
the code which shifts a tail of the array was leaving stale data
in the final entry.  Fix this by explicitly reinitializing the last
entry following such a copy.
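
A minimal sketch of the shift-then-reinitialize pattern, assuming a plain array of fixed-size entries. struct filter_entry and filter_remove() are hypothetical names, and zero-filling stands in for whatever initialization the real structures need.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in for one per-group source filter entry. */
struct filter_entry {
	int mode;
	int nsources;
};

/*
 * Remove entries[idx] from an array holding n live entries by shifting
 * the tail down by one slot, then reinitialize the vacated last slot so
 * that a later reuse of it does not observe stale data.
 * Returns the new number of live entries.
 */
static size_t
filter_remove(struct filter_entry *entries, size_t n, size_t idx)
{
	assert(idx < n);
	memmove(&entries[idx], &entries[idx + 1],
	    (n - idx - 1) * sizeof(entries[0]));
	memset(&entries[n - 1], 0, sizeof(entries[0]));
	return (n - 1);
}
```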

Reported by:	syzbot+f8c3c564ee21d650475e@syzkaller.appspotmail.com
Reviewed by:	ae
MFC after:	2 weeks
Sponsored by:	The FreeBSD Foundation
Differential Revision:	https://reviews.freebsd.org/D19872
2019-04-11 08:00:59 +00:00
Randall Stewart
8021928623 Fix a small bug in the tcp_log_id where the bucket
was unlocked and yet the bucket-unlock flag was not
changed to false. This can cause a panic if INVARIANTS
is on and we go through the right path (though rare).
This fixes the correct bug :)

Reported by:	syzbot+179a1ad49f3c4c215fa2@syzkaller.appspotmail.com
Reviewed by:	tuexen@
2019-04-10 18:58:11 +00:00
Rodney W. Grimes
6c1c6ae537 Use IN_foo() macros from sys/netinet/in.h in place of handcrafted code
There are a few places that use handcrafted versions of the macros
from sys/netinet/in.h making it difficult to actually alter the
values in use by these macros.  Correct that by replacing handcrafted
code with proper macro usage.

Reviewed by:		karels, kristof
Approved by:		bde (mentor)
MFC after:		3 weeks
Sponsored by:		John Gilmore
Differential Revision:	https://reviews.freebsd.org/D19317
2019-04-04 19:01:13 +00:00
Randall Stewart
fd29ff5d53 Undo my previous erroneous commit changing the tcp_output kassert.
Hmm now the question is where did the tcp_log_id change go :o
2019-04-03 19:35:07 +00:00
Navdeep Parhar
7893235ff0 tcp_autorcvbuf_inc was removed in r344433.
Discussed with:	tuexen@
Sponsored by:	Chelsio Communications
2019-03-29 21:39:47 +00:00
John Baldwin
43b65e3c98 Don't check the inp socket pointer in in_pcboutput_eagain.
Reviewed by:	hps (by saying it was ok to be removed)
MFC after:	1 month
Sponsored by:	Netflix
2019-03-29 19:47:42 +00:00
Mark Johnston
7762bbc30e Add CTLFLAG_VNET to the net.inet.icmp.tstamprepl definition.
Reported by:	Hans Fiedler <hans@hfconsulting.com>
MFC after:	3 days
2019-03-26 22:14:50 +00:00
Randall Stewart
7854c63d6f Fix a small bug in the tcp_log_id where the bucket
was unlocked and yet the bucket-unlock flag was not
changed to false. This can cause a panic if INVARIANTS
is on and we go through the right path (though rare).

Reported by:	syzbot+179a1ad49f3c4c215fa2@syzkaller.appspotmail.com
Reviewed by:	tuexen@
MFC after:	1 week
2019-03-26 10:41:27 +00:00
Michael Tuexen
eb3b9ea3fe Fix a double free of an SCTP association in an error path.
This is joint work with rrs@. The issue was found by running
syzkaller.

MFC after:		1 week
2019-03-26 08:27:00 +00:00
Michael Tuexen
7c96d54f20 Initialize scheduler specific data for the FCFS scheduler.
This is joint work with rrs@. The issue was reported by using
syzkaller.

MFC after:		1 week
2019-03-25 16:40:54 +00:00
Michael Tuexen
689ed08920 Improve locking when tearing down an SCTP association.
This is joint work with rrs@ and the issue was found by
syzkaller.

MFC after:		1 week
2019-03-25 15:23:20 +00:00
Michael Tuexen
2de5b90420 Fix the handling of fragmented unordered messages when using DATA chunks
and FORWARD-TSN.

This bug was reported in https://github.com/sctplab/usrsctp/issues/286
for the userland stack.

This is joint work with rrs@.

MFC after:		1 week
2019-03-25 09:47:22 +00:00
Michael Tuexen
58e6eeef45 Fix build issue for the userland stack.
Joint work with rrs@.

MFC after:		1 week
2019-03-24 12:13:05 +00:00
Michael Tuexen
6b6de29ca1 Fix more signed/unsigned issues. This time on the send path.
This is joint work with rrs@ and was found by running syzkaller.

MFC after:		1 week
2019-03-24 10:40:20 +00:00
Michael Tuexen
0d3cf13dab Fix a signed/unsigned bug when receiving SCTP messages.
This is joint work with rrs@.

Reported by:		syzbot+6b8a4bc8cc828e9d9790@syzkaller.appspotmail.com
MFC after:		1 week
2019-03-24 09:46:16 +00:00
Michael Tuexen
7de4780412 Limit the size of messages sent on 1-to-many style SCTP sockets with the
SCTP_SENDALL flag. Allow also only one operation per SCTP endpoint.

This fixes an issue found by running syzkaller and is joint work with rrs@.

MFC after:		1 week
2019-03-23 22:56:03 +00:00
Michael Tuexen
2ef5bd2f0c Limit the number of bytes which can be queued for SCTP sockets.
This is joint work with rrs@.
Reported by:		syzbot+307f167f9bc214f095bc@syzkaller.appspotmail.com
MFC after:		1 week
2019-03-23 22:46:29 +00:00
Michael Tuexen
0999766ddf Add sysctl variable net.inet.tcp.rexmit_initial for setting RTO.Initial
used by TCP.

Reviewed by:		rrs@, 0mp@
Sponsored by:		Netflix, Inc.
Differential Revision:	https://reviews.freebsd.org/D19355
2019-03-23 21:36:59 +00:00
Michael Tuexen
05fb056c06 Fix a KASSERT() in tcp_output().
When checking the length of the headers at this point, the IP level
options have not been added to the mbuf chain.
So don't take them into account.

Reported by:		syzbot+16025fff7ee5f7c5957b@syzkaller.appspotmail.com
Reported by:		syzbot+adb5836b8a9ff621b2aa@syzkaller.appspotmail.com
Reported by:		syzbot+d25a5352bcdf40acdbb8@syzkaller.appspotmail.com
Reviewed by:		rrs@
MFC after:		3 days
Sponsored by:		Netflix, Inc.
2019-03-23 09:56:41 +00:00
Andrey V. Elsukov
5c04f73e07 Add NAT64 CLAT implementation as defined in RFC6877.
CLAT is customer-side translator that algorithmically translates 1:1
private IPv4 addresses to global IPv6 addresses, and vice versa.
It is implemented as part of ipfw_nat64 kernel module. When module
is loaded or compiled into the kernel, it registers "nat64clat" external
action. External action named instance can be created using `create`
command and then used in ipfw rules. The create command accepts two
IPv6 prefixes `plat_prefix` and `clat_prefix`. If plat_prefix is omitted,
IPv6 NAT64 Well-Known prefix 64:ff9b::/96 will be used.

  # ipfw nat64clat CLAT create clat_prefix SRC_PFX plat_prefix DST_PFX
  # ipfw add nat64clat CLAT ip4 from IPv4_PFX to any out
  # ipfw add nat64clat CLAT ip6 from DST_PFX to SRC_PFX in
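
For a /96 prefix such as 64:ff9b::/96, the algorithmic mapping places the IPv4 address in the last 32 bits of the IPv6 address (RFC 6052). The sketch below shows only that /96 case; nat64_embed96() and nat64_extract96() are hypothetical helpers, not functions from ipfw_nat64.

```c
#include <stdint.h>
#include <string.h>

/* First 12 bytes of the Well-Known Prefix 64:ff9b::/96. */
static const uint8_t nat64_wkp[12] = {
	0x00, 0x64, 0xff, 0x9b, 0, 0, 0, 0, 0, 0, 0, 0
};

/*
 * Build the 16-byte IPv6 address for the IPv4 address v4 (4 bytes in
 * network order) under a /96 prefix: prefix in bytes 0-11, IPv4 address
 * in bytes 12-15.
 */
static void
nat64_embed96(uint8_t v6[16], const uint8_t prefix[12], const uint8_t v4[4])
{
	memcpy(v6, prefix, 12);
	memcpy(v6 + 12, v4, 4);
}

/* Recover the embedded IPv4 address from a /96-mapped IPv6 address. */
static void
nat64_extract96(uint8_t v4[4], const uint8_t v6[16])
{
	memcpy(v4, v6 + 12, 4);
}
```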

Obtained from:	Yandex LLC
Submitted by:	Boris N. Lytochkin
MFC after:	1 month
Relnotes:	yes
Sponsored by:	Yandex LLC
2019-03-18 11:44:53 +00:00
Gleb Smirnoff
dc0fa4f712 Remove 'dir' argument from dummynet_io(). This makes it possible to make
dn_dir flags private to dummynet. There is still some room for improvement.
2019-03-14 22:32:50 +00:00
Gleb Smirnoff
cef9f220cd Remove 'dir' argument in ng_ipfw_input, since ip_fw_args now has this info.
While here make 'tee' boolean.
2019-03-14 22:30:05 +00:00
Gleb Smirnoff
1830dae3d3 Make second argument of ip_divert(), that specifies packet direction a bool.
This allows pf(4) to avoid including ipfw(4) private files.
2019-03-14 22:23:09 +00:00
Bjoern A. Zeeb
b25d74e06c Improve ARP logging.
r344504 added an extra ARP_LOG() call in case of an if_output() failure.
It turns out IPv4 can be noisy. In order to not spam the console by default:
(a) add a counter for these events so people can keep better track of how
    often it happens, and
(b) add a sysctl to select the default ARP_LOG log level and set it to
    INFO avoiding the one (the new) DEBUG level by default.

Claim a spare (1st one after 10 years since the stats were added) in order
to not break netstat from FreeBSD 12->13 updates in the future.

Reviewed by:		karels
Differential Revision:	https://reviews.freebsd.org/D19490
2019-03-09 01:12:59 +00:00
Michael Tuexen
3a35ad54a8 Fix locking bug.
MFC after:		3 days
2019-03-08 18:17:57 +00:00
Michael Tuexen
a458a6e620 Some cleanup and consistency improvements.
MFC after:		3 days
2019-03-08 18:16:19 +00:00
Michael Tuexen
e6dcce69ca After removing an entry from the stream scheduler list, set the pointers
to NULL, since we are checking for it in case the element gets inserted
again.

This issue was found by running syzkaller.

MFC after:		3 days
2019-03-07 08:43:20 +00:00
Michael Tuexen
be62c88b80 Allocate an association id and register the stcb while holding the lock.
This avoids a race where stcbs can be found, which are not completely
initialized.

This was found by running syzkaller.

MFC after:		3 days
2019-03-03 19:55:06 +00:00
Michael Tuexen
5f98c80550 Remove debug output.
MFC after:		3 days
2019-03-02 16:10:11 +00:00
Michael Tuexen
bab9988af5 Allow SCTP stream reconfiguration operations only in ESTABLISHED
state.

This issue was found by running syzkaller.

MFC after:		3 days
2019-03-02 14:30:27 +00:00
Michael Tuexen
49f1449309 Handle the case when calling the IPPROTO_SCTP level socket option
SCTP_STATUS on an association with no primary path (early state).

This issue was found by running syzkaller.

MFC after:		3 days
2019-03-02 14:15:33 +00:00
Michael Tuexen
e57d481c5e Report the correct length when using the IPPROTO_SCTP level
socket options SCTP_GET_PEER_ADDRESSES and SCTP_GET_LOCAL_ADDRESSES.
2019-03-02 13:12:37 +00:00
Michael Tuexen
20ab225b61 Honor the memory limits provided when processing the IPPROTO_SCTP
level socket option SCTP_GET_LOCAL_ADDRESSES in a getsockopt() call.

Thanks to Thomas Barabosch for reporting the issue which was found by
running syzkaller.

MFC after:		3 days
2019-03-01 18:47:41 +00:00
Michael Tuexen
3aee58ca76 Improve consistency, no functional change.
MFC after:		3 days
2019-03-01 15:57:55 +00:00
John Baldwin
dbcc200058 Various cleanups to the management of multiple TCP stacks.
- Use strlcpy() with sizeof() instead of strncpy().

- Simplify initialization of TCP functions structures.

  init_tcp_functions() was already called before the first call to
  register a stack.  Just inline the work in the SYSINIT and remove
  the racy helper variable.  Instead, KASSERT that the rw lock is
  initialized when registering a stack.

- Protect the default stack via a direct pointer comparison.

  The default stack uses the name "freebsd" instead of "default" so
  this protection wasn't working for the default stack anyway.

Reviewed by:	rrs
Sponsored by:	Netflix
Differential Revision:	https://reviews.freebsd.org/D19152
2019-02-27 20:24:23 +00:00
Bjoern A. Zeeb
a4c69b8bc0 Make arp code return (more) errors.
arprequest() is a void function and in case of error we simply
return without any feedback. In case of any local operation
or *if_output() failing no feedback is sent up the stack for the
packet which triggered the arp request to be sent.
arpresolve_full() has three pre-canned possible errors returned
(if we have not yet sent enough arp requests or if we tried
often enough without success) otherwise "no error" is returned.

Make arprequest() an "internal" function arprequest_internal() which
does return a possible error to the caller. Preserve arprequest()
as a void wrapper function for external consumers.
In arpresolve_full() add an extra error checking. Use the
arprequest_internal() function and only return an error if none
of the three (mentioned above) is already set.

This will return possible errors all the way up the stack and
allows functions and programs to react on the send errors rather
than leaving them in the dark. Also they might get more detailed
feedback of why packets cannot be sent and they will receive it
quicker.
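
The wrapper pattern described above can be sketched in isolation. All names here (struct fake_if, request_internal(), request(), resolve(), always_down()) are hypothetical and much simpler than the real ARP code; the sketch only shows an error-returning internal function, a void wrapper for existing callers, and a caller that keeps an already-set error.

```c
#include <errno.h>
#include <stddef.h>

/* Hypothetical stand-in for an interface with an output routine. */
struct fake_if {
	int (*output)(const void *pkt, size_t len);
};

/* Example output routine that always fails, for demonstration. */
static int
always_down(const void *pkt, size_t len)
{
	(void)pkt;
	(void)len;
	return (ENETDOWN);
}

/* Internal variant: reports why the request could not be sent. */
static int
request_internal(struct fake_if *ifp, const void *pkt, size_t len)
{
	if (ifp == NULL || ifp->output == NULL)
		return (ENXIO);
	return (ifp->output(pkt, len));
}

/* Void wrapper kept for callers that do not look at the result. */
static void
request(struct fake_if *ifp, const void *pkt, size_t len)
{
	(void)request_internal(ifp, pkt, len);
}

/* Keep an already-set error; otherwise surface the send error. */
static int
resolve(struct fake_if *ifp, int precanned_error)
{
	int error;

	error = request_internal(ifp, "who-has", 7);
	return (precanned_error != 0 ? precanned_error : error);
}
```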

Reviewed by:		karels, hselasky
Differential Revision:	https://reviews.freebsd.org/D18904
2019-02-24 22:49:56 +00:00
Gleb Smirnoff
0dfc145abe Support struct ip_mreqn as argument for IP_ADD_MEMBERSHIP. Legacy support
for struct ip_mreq remains in place.

The struct ip_mreqn is a Linux extension to the classic BSD multicast API. It
has an extra field that allows specifying the interface index explicitly. In
Linux it is used as the argument for IP_MULTICAST_IF and IP_ADD_MEMBERSHIP.
The FreeBSD kernel also declares this structure and has supported it as an
argument to IP_MULTICAST_IF since r170613. So, we had the structure declared
but not fully supported, which confused third-party application configure
scripts.

Code handling IP_ADD_MEMBERSHIP was mixed together with code for
IP_ADD_SOURCE_MEMBERSHIP.  Bringing legacy and new structure support
into the mess would have made the "argument switcharoo" intolerable, so
code was separated into its own switch case clause.

MFC after:	3 months
Differential Revision:	https://reviews.freebsd.org/D19276
2019-02-23 06:03:18 +00:00
Michael Tuexen
560c058683 The receive buffer autoscaling for TCP is based on a linear growth, which
is acceptable in the congestion avoidance phase, but not during slow start.
The MTU is also not taken into account.
Use a method instead, which is based on exponential growth working also in
slow start and being independent from the MTU.

This is joint work with rrs@.

Reviewed by:		rrs@, Richard Scheffenegger
Sponsored by:		Netflix, Inc.
Differential Revision:	https://reviews.freebsd.org/D18375
2019-02-21 10:35:32 +00:00
Michael Tuexen
a1f0e13475 This patch addresses an issue brought up by bz@ in D18968:
When TCP_REASS_LOGGING is defined, a NULL pointer dereference would happen,
if user data was received during the TCP handshake and BB logging is used.

A KASSERT is also added to detect tcp_reass() calls with illegal parameter
combinations.

Reported by:		bz@
Reviewed by:		rrs@
MFC after:		3 days
Sponsored by:		Netflix, Inc.
Differential Revision:	https://reviews.freebsd.org/D19254
2019-02-21 09:34:47 +00:00
Michael Tuexen
3b853844d7 Reduce the TCP initial retransmission timeout from 3 seconds to
1 second as allowed by RFC 6298.

Reviewed by:		kbowling@, Richard Scheffenegger
Sponsored by:		Netflix, Inc.
Differential Revision:	https://reviews.freebsd.org/D18941
2019-02-20 18:03:43 +00:00
Michael Tuexen
c6dcb64b18 Use exponential backoff for retransmitting SYN segments as specified
in the TCP RFCs.
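
Exponential backoff doubles the timeout for each retransmission. The sketch below starts from the 1 second initial RTO allowed by RFC 6298; the 64 second cap and the function name syn_rexmit_timeout_ms() are assumptions for illustration and are not taken from the FreeBSD timer code.

```c
#include <stdint.h>

#define RTO_INITIAL_MS	1000	/* 1 second initial RTO (RFC 6298) */
#define RTO_MAX_MS	64000	/* assumed upper bound for this sketch */

/*
 * Timeout to arm before the next retransmission of a SYN, given how
 * many retransmissions have already been sent: 1s, 2s, 4s, ... capped.
 */
static uint32_t
syn_rexmit_timeout_ms(unsigned int nrexmits)
{
	uint32_t rto = RTO_INITIAL_MS;

	while (nrexmits-- > 0 && rto < RTO_MAX_MS)
		rto *= 2;
	return (rto > RTO_MAX_MS ? RTO_MAX_MS : rto);
}
```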

Reviewed by:		rrs@, Richard Scheffenegger
Sponsored by:		Netflix, Inc.
Differential Revision:	https://reviews.freebsd.org/D18974
2019-02-20 17:56:38 +00:00
Michael Tuexen
e82fdca156 Fix a byte ordering issue for the advertised receiver window in ACK
segments sent in TIMEWAIT state, which I introduced in r336937.

MFC after:	3 days
Sponsored by:	Netflix, Inc.
2019-02-15 09:45:17 +00:00
Andrey V. Elsukov
c7ee62fcd5 In r335015 PCB destroying was made deferred using epoch_call().
But ipsec_delete_pcbpolicy() uses some VNET-virtualized variables,
and thus it needs VNET context, that is missing during gtaskqueue
executing. Use inp_vnet context to set curvnet in in_pcbfree_deferred().

PR:		235684
MFC after:	1 week
2019-02-13 15:46:05 +00:00
Kristof Provost
3838c6a3e6 garp: Fix vnet related panic for gratuitous arp
Gratuitous ARP packets are sent from a timer, which means we don't have a vnet
context set. As a result we panic trying to send the packet.

Set the vnet context based on the interface associated with the interface
address.

To reproduce:
  sysctl net.link.ether.inet.garp_rexmit_count=2
  ifconfig vtnet1 10.0.0.1/24 up

PR:		235699
Reviewed by:	vangyzen@
MFC after:	1 week
2019-02-12 21:22:57 +00:00
Michael Tuexen
aef0641755 Improve input validation for raw IPv4 socket using the IP_HDRINCL
option.

This issue was found by running syzkaller on OpenBSD.
Greg Steuck made me aware that the problem might also exist on FreeBSD.

Reported by:		Greg Steuck
MFC after:		1 month
Differential Revision:	https://reviews.freebsd.org/D18834
2019-02-12 10:17:21 +00:00
Michael Tuexen
d9707e43df Fix a locking issue when reporting outbound messages.
MFC after:		3 days
2019-02-10 14:02:14 +00:00
Michael Tuexen
507bb10421 Fix a locking issue in the IPPROTO_SCTP level SCTP_PEER_ADDR_THLDS socket
option. The problem affects only setsockopt with invalid parameters.

This issue was found by syzkaller.

MFC after:		3 days
2019-02-10 13:55:32 +00:00
Michael Tuexen
6cf360772f Fix a locking bug in the IPPROTO_SCTP level SCTP_EVENT socket option.
This occurs when calling setsockopt() with invalid parameters.

This issue was found by syzkaller.

MFC after:		3 days
2019-02-10 10:42:16 +00:00
Michael Tuexen
333669e016 Fix locking for IPPROTO_SCTP level SCTP_DEFAULT_PRINFO socket option.
This problem occurred when calling setsockopt() with invalid parameters.

This issue was found by running syzkaller.

MFC after:		3 days
2019-02-10 08:28:56 +00:00
Michael Tuexen
aa36fbd6fa Ensure that when using the TCP CDG congestion control and setting the
sysctl variable net.inet.tcp.cc.cdg.smoothing_factor to 0, the smoothing
is disabled. Without this patch, a division by zero occurs.
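
The kind of guard involved can be shown with a generic smoothing step. This is not the CDG code: smooth_sample() and its averaging formula are assumptions chosen only to show a factor of 0 being treated as "smoothing disabled" instead of being used as a divisor.

```c
#include <stdint.h>

/*
 * Move avg toward sample by 1/factor of the difference.  A factor of 0
 * disables smoothing and returns the raw sample, which also avoids
 * dividing by zero.
 */
static int64_t
smooth_sample(int64_t avg, int64_t sample, uint32_t factor)
{
	if (factor == 0)
		return (sample);
	return (avg + (sample - avg) / (int64_t)factor);
}
```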

PR:			193762
Reviewed by:		lstewart@, rrs@
MFC after:		3 days
Sponsored by:		Netflix, Inc.
Differential Revision:	https://reviews.freebsd.org/D19071
2019-02-08 20:42:49 +00:00
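
The guard described in aa36fbd6fa above can be sketched in userland as follows. This is only an illustration, not the CDG code: `ex_smooth()` and its parameters are hypothetical names, and returning the latest sample unchanged when the factor is 0 is an assumption about what "smoothing disabled" means.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Hypothetical helper: average `sum` over `smoothing_factor` samples.
 * A factor of 0 is taken to mean "smoothing disabled", so the latest
 * sample is returned as-is and no division takes place.
 */
static int64_t
ex_smooth(int64_t latest, int64_t sum, uint32_t smoothing_factor)
{
	if (smoothing_factor == 0)
		return (latest);
	assert(smoothing_factor != 0);	/* the divisor is non-zero from here on */
	return (sum / (int64_t)smoothing_factor);
}
```

The point of the sketch is the ordering: the zero check comes before any expression that divides by the tunable.
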
Michael Tuexen
baed5270e1 Only reduce the PMTU after the send call. The only way to increase it is
via PMTUD.

This fixes an MTU issue reported by Timo Voelker.

MFC after:		3 days
2019-02-05 10:29:31 +00:00
Michael Tuexen
e4c42fa266 Fix an off-by-one error in the input validation of the SCTP_RESET_STREAMS
socket option.

This was found by running syzkaller.

MFC after:		3 days
2019-02-05 10:13:51 +00:00
Warner Losh
52467047aa Regularize the Netflix copyright
Use recent best practices for Copyright form at the top of
the license:
1. Remove all the All Rights Reserved clauses on our stuff. Where we
   piggybacked others, use a separate line to make things clear.
2. Use "Netflix, Inc." everywhere.
3. Use a single line for the copyright for grep friendliness.
4. Use date ranges in all places for our stuff.

Approved by: Netflix Legal (who gave me the form), adrian@ (pmc files)
2019-02-04 21:28:25 +00:00
Michael Tuexen
116ef4d6e7 When handling SYN-ACK segments in the SYN-RCVD state, set tp->snd_wnd
consistently.

This inconsistency was observed when working on the bug reported in
PR 235256, although it does not fix the reported issue. The fix for
the PR will be a separate commit.

PR:			235256
Reviewed by:		rrs@, Richard Scheffenegger
MFC after:		3 days
Sponsored by:		Netflix, Inc.
Differential Revision:	https://reviews.freebsd.org/D19033
2019-02-01 12:33:00 +00:00
Gleb Smirnoff
547392731f Repair siftr(4): PFIL_IN and PFIL_OUT are defines of some value; relying
on them having particular values can break things.
2019-02-01 08:10:26 +00:00
Gleb Smirnoff
b252313f0b New pfil(9) KPI together with newborn pfil API and control utility.
The KPI has been reviewed and cleansed of features that were planned
back 20 years ago and never implemented.  The pfil(9) internals have
been made opaque to protocols with only returned types and function
declarations exposed. The KPI is made more strict, but at the same time
more extensible, as kernel uses same command structures that userland
ioctl uses.

In a nutshell, the [KA]PI is about declaring filtering points, declaring
filters and linking and unlinking them together.

New [KA]PI makes it possible to reconfigure pfil(9) configuration:
change order of hooks, rehook filter from one filtering point to a
different one, disconnect a hook on output leaving it on input only,
prepend/append a filter to existing list of filters.

Now it is possible for a single packet filter to provide multiple rulesets
that may be linked to different points. Think of per-interface ACLs in
Cisco or Juniper. None of the existing packet filters support that yet;
however, limited usage is already possible, e.g. the default ruleset can
be moved to a single interface, as soon as interfaces provide their
filtering points.

Another future feature is the possibility to create pfil heads that provide
not an mbuf pointer but just a memory pointer with length. That would
allow filtering at very early stages of a packet lifecycle, e.g. when
packet has just been received by a NIC and no mbuf was yet allocated.

Differential Revision:	https://reviews.freebsd.org/D18951
2019-01-31 23:01:03 +00:00
Brooks Davis
435a8c1560 Add a simple port filter to SIFTR.
SIFTR does not allow any kind of filtering, but captures every packet
processed by the TCP stack.
Often, only a specific session or service is of interest, and doing the
filtering in post-processing of the log adds to the overhead of SIFTR.

This adds a new sysctl net.inet.siftr.port_filter. When set to zero, all
packets get captured as previously. If set to any other value, only
packets where either the source or the destination ports match, are
captured in the log file.

Submitted by:	Richard Scheffenegger
Reviewed by:	Cheng Cui
Differential Revision:	https://reviews.freebsd.org/D18897
2019-01-30 17:44:30 +00:00
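
The filter semantics described in 435a8c1560 above (zero captures everything, any other value must match the source or the destination port) can be sketched as below. `ex_port_filter_match()` is a hypothetical name used for illustration; it is not taken from the SIFTR source.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/*
 * Hypothetical predicate modelling the port filter: a filter value of 0
 * means "no filtering", otherwise either port of the packet must match.
 */
static bool
ex_port_filter_match(uint16_t port_filter, uint16_t sport, uint16_t dport)
{
	if (port_filter == 0)
		return (true);	/* filter unset: capture every packet */
	return (sport == port_filter || dport == port_filter);
}
```
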
Michael Tuexen
bf7fcdb18a Fix the detection of ECN-setup SYN-ACK packets.
RFC 3168 defines an ECN-setup SYN-ACK packet as one with the ECE flag
set and the CWR flag not set. The code was only checking if the ECE flag
is set. This patch adds the check to verify that the CWR flag is not
set.

Submitted by:		Richard Scheffenegger
Reviewed by:		tuexen@
MFC after:		1 week
Differential Revision:	https://reviews.freebsd.org/D18996
2019-01-28 12:45:31 +00:00
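
As a sketch of the check that bf7fcdb18a above describes: an ECN-setup SYN-ACK has SYN, ACK and ECE set and CWR clear. The `EX_TH_*` constants mirror the standard TCP header flag bits, and `ex_is_ecn_setup_synack()` is a hypothetical helper, not the function used in the stack.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* TCP header flag bits (same positions as in the TCP header). */
#define	EX_TH_SYN	0x02
#define	EX_TH_ACK	0x10
#define	EX_TH_ECE	0x40
#define	EX_TH_CWR	0x80

/*
 * Hypothetical predicate: true only for a SYN-ACK that has ECE set
 * and CWR clear. Testing ECE alone would also accept ECE|CWR.
 */
static bool
ex_is_ecn_setup_synack(uint8_t flags)
{
	if ((flags & (EX_TH_SYN | EX_TH_ACK)) != (EX_TH_SYN | EX_TH_ACK))
		return (false);
	return ((flags & (EX_TH_ECE | EX_TH_CWR)) == EX_TH_ECE);
}
```

Masking with both bits and comparing against ECE alone covers the "ECE set" and "CWR not set" conditions in one expression.
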
Michael Tuexen
f635b1c264 Don't include two header files when not needed.
This allows the part of the rewrite of TCP reassembly in this
file to be MFCed to stable/11 with manual changes.

MFC after:		3 days
Sponsored by:		Netflix, Inc.
2019-01-25 17:08:28 +00:00
Michael Tuexen
7dc90a1de0 Fix a bug in the restart window computation of TCP New Reno
When implementing support for IW10, an update in the computation
of the restart window used after an idle phase was missed. To
minimize code duplication, implement the logic in tcp_compute_initwnd()
and call it. This fixes a bug in NewReno, which was not aware of
IW10.

Submitted by:		Richard Scheffenegger
Reviewed by:		tuexen@
MFC after:		1 week
Differential Revision:	https://reviews.freebsd.org/D18940
2019-01-25 13:57:09 +00:00
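
To illustrate the idea in 7dc90a1de0 above of deriving the restart window from a single initial-window computation, here is a minimal sketch. It assumes the IW10 rule from RFC 6928, min(10*MSS, max(2*MSS, 14600)), and the restart rule RW = min(IW, cwnd) from RFC 5681. `ex_initwnd()` and `ex_restart_window()` are hypothetical stand-ins and do not reproduce tcp_compute_initwnd(), which also honours sysctl settings.

```c
#include <assert.h>
#include <stdint.h>

static uint32_t
ex_min(uint32_t a, uint32_t b)
{
	return (a < b ? a : b);
}

static uint32_t
ex_max(uint32_t a, uint32_t b)
{
	return (a > b ? a : b);
}

/* Initial window per RFC 6928: min(10*MSS, max(2*MSS, 14600)). */
static uint32_t
ex_initwnd(uint32_t mss)
{
	assert(mss > 0 && mss <= 65535);
	return (ex_min(10 * mss, ex_max(2 * mss, 14600)));
}

/* Restart window after an idle period per RFC 5681: min(IW, cwnd). */
static uint32_t
ex_restart_window(uint32_t cwnd, uint32_t mss)
{
	return (ex_min(ex_initwnd(mss), cwnd));
}
```

Because the restart window calls the same helper as connection setup, a later change to the initial window cannot leave the idle-restart path behind.
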
Michael Tuexen
989321df11 Get the arithmetic right...
MFC after:		3 days
Sponsored by:		Netflix, Inc.
2019-01-24 16:47:18 +00:00
Michael Tuexen
42395cbe31 Kill a trailing whitespace character...
MFC after:		3 days
Sponsored by:		Netflix, Inc.
2019-01-24 16:43:13 +00:00
Michael Tuexen
34bb795ba1 Update a comment to reflect the current reality.
SYN-cache entries live for about 12 seconds, not 45, when default
settings are used.

MFC after:		1 week
Sponsored by:		Netflix, Inc.
2019-01-24 16:40:14 +00:00
Mark Johnston
49cf58e559 Style.
Reviewed by:	bz
MFC after:	3 days
Sponsored by:	The FreeBSD Foundation
2019-01-23 22:19:49 +00:00
Mark Johnston
c06cc56e39 Fix an LLE lookup race.
After the afdata read lock was converted to epoch(9), readers could
observe a linked LLE and block on the LLE while a thread was
unlinking the LLE.  The writer would then release the lock and schedule
the LLE for deferred free, allowing readers to continue and potentially
schedule the LLE timer.  By the point the timer fires, the structure is
freed, typically resulting in a crash in the callout subsystem.

Fix the problem by modifying the lookup path to check for the LLE_LINKED
flag upon acquiring the LLE lock.  If it's not set, the lookup fails.

PR:		234296
Reviewed by:	bz
Tested by:	sbruno, Victor <chernov_victor@list.ru>,
		Mike Andrews <mandrews@bit0.com>
MFC after:	3 days
Sponsored by:	The FreeBSD Foundation
Differential Revision:	https://reviews.freebsd.org/D18906
2019-01-23 22:18:23 +00:00
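
The fix in c06cc56e39 above follows a general pattern: after blocking on an object's lock, re-check that the object is still linked before using it. The userland sketch below models that pattern with a pthread mutex. `struct ex_lle`, `EX_LLE_LINKED` and `ex_lle_acquire()` are hypothetical stand-ins for the kernel structures and are not the kernel code.

```c
#include <assert.h>
#include <pthread.h>
#include <stddef.h>

#define	EX_LLE_LINKED	0x1	/* hypothetical stand-in for LLE_LINKED */

struct ex_lle {
	pthread_mutex_t	lock;
	unsigned int	flags;
};

/*
 * Returns the entry with its lock held, or NULL if the entry was
 * unlinked while the caller was waiting for the lock. On NULL the
 * lock has already been dropped again.
 */
static struct ex_lle *
ex_lle_acquire(struct ex_lle *lle)
{
	assert(lle != NULL);
	pthread_mutex_lock(&lle->lock);
	if ((lle->flags & EX_LLE_LINKED) == 0) {
		pthread_mutex_unlock(&lle->lock);
		return (NULL);	/* lookup fails; do not touch timers */
	}
	return (lle);
}
```

A caller that gets NULL treats the lookup as a miss, so it never schedules work against an entry that is about to be freed.
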
Brooks Davis
c53d6b90ba Make SIFTR work again after r342125 (D18443).
Correct a logic error.

Only disable when already enabled or enable when disabled.

Submitted by:	Richard Scheffenegger
Reviewed by:	Cheng Cui
Obtained from:	Cheng Cui
MFC after:	3 days
Differential Revision:	https://reviews.freebsd.org/D18885
2019-01-18 21:46:38 +00:00
Michael Tuexen
d9ba240c1c Limit the user-controllable amount of memory the kernel allocates
via IPPROTO_SCTP level socket options.

This issue was found by running syzkaller.

MFC after:	1 week
2019-01-16 11:33:47 +00:00
Stephen Hurd
500759395a Fix window update issue when scaling disabled
When the TCP window scale option is not used, and the window
opens up enough in one soreceive, a window update will not be sent.

For example, if recwin == 65535, so->so_rcv.sb_hiwat >= 262144, and
so->so_rcv.sb_hiwat <= 524272, the window update will never be sent.
This is because recwin and adv are clamped to TCP_MAXWIN << tp->rcv_scale,
and so will never be >= so->so_rcv.sb_hiwat / 4
or <= so->so_rcv.sb_hiwat / 8.

This patch ensures a window update is sent if the window opens by
TCP_MAXWIN << tp->rcv_scale, which should only happen when the window
size goes from zero to the max expressible.

This issue looks like it was introduced in r306769 when recwin was clamped
to TCP_MAXWIN << tp->rcv_scale.

MFC after:	1 week
Sponsored by:	Limelight Networks
Differential Revision:	https://reviews.freebsd.org/D18821
2019-01-15 17:40:19 +00:00
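
The arithmetic in 500759395a above can be checked with a small model. The sketch keeps only the two thresholds quoted in the commit message (a quarter and an eighth of sb_hiwat) plus the added TCP_MAXWIN clause; the real decision in the output path has further conditions that are omitted here. All `ex_`/`EX_` names are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define	EX_TCP_MAXWIN	65535	/* largest window expressible without scaling */

/* Simplified pre-patch test: only the two thresholds from the message. */
static bool
ex_update_old(int32_t adv, int32_t recwin, uint32_t hiwat)
{
	return (adv >= (int32_t)(hiwat / 4) || recwin <= (int32_t)(hiwat / 8));
}

/* Post-patch test: also send once the window opens by the clamp value. */
static bool
ex_update_new(int32_t adv, int32_t recwin, uint32_t hiwat, int rcv_scale)
{
	assert(rcv_scale >= 0 && rcv_scale <= 14);
	return (ex_update_old(adv, recwin, hiwat) ||
	    adv >= (int32_t)(EX_TCP_MAXWIN << rcv_scale));
}
```

With adv and recwin clamped to 65535, a buffer of 262144 bytes gives 262144 / 4 = 65536, just above the clamp, and a buffer of 524272 bytes gives 524272 / 8 = 65534, just below it, so neither old threshold is reached anywhere in that range.
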
Michael Tuexen
10731c54b6 Fix getsockopt() for IP_OPTIONS/IP_RETOPTS.
r336616 copies inp->inp_options using the m_dup() function.
However, this function expects an mbuf packet header at the beginning,
which is not true in this case.
Therefore, use m_copym() instead of m_dup().

This issue was found by syzkaller.
Reviewed by:		mmacy@
MFC after:		1 week
Sponsored by:		Netflix, Inc.
Differential Revision:	https://reviews.freebsd.org/D18753
2019-01-09 06:36:57 +00:00
Gleb Smirnoff
a68cc38879 Mechanical cleanup of epoch(9) usage in network stack.
- Remove macros that covertly create epoch_tracker on thread stack. Such
  macros are quite unsafe, e.g. they will produce buggy code if the same
  macro is used in nested scopes. Explicitly declare epoch_tracker always.

- Unmask interface list IFNET_RLOCK_NOSLEEP(), interface address list
  IF_ADDR_RLOCK() and interface AF specific data IF_AFDATA_RLOCK() read
  locking macros to what they actually are - the net_epoch.
  Keeping them as is is very misleading. They all are named FOO_RLOCK(),
  while they no longer have lock semantics. Now they allow recursion and
  what's more important they now no longer guarantee protection against
  their companion WLOCK macros.
  Note: INP_HASH_RLOCK() has same problems, but not touched by this commit.

This is a non-functional mechanical change. The only functionally changed
functions are ni6_addrs() and ni6_store_addrs(), where we no longer enter
epoch recursively.

Discussed with:	jtl, gallatin
2019-01-09 01:11:19 +00:00
Mark Johnston
2f2ddd68a5 Support MSG_DONTWAIT in send*(2).
As it does for recv*(2), MSG_DONTWAIT indicates that the call should
not block, returning EAGAIN instead.  Linux and OpenBSD both implement
this, so the change makes porting easier, especially since we do not
return EINVAL or so when unrecognized flags are specified.

Submitted by:	Greg V <greg@unrelenting.technology>
Reviewed by:	tuexen
MFC after:	1 week
Differential Revision:	https://reviews.freebsd.org/D18728
2019-01-04 17:31:50 +00:00
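
From userland, the behaviour described in 2f2ddd68a5 above can be observed with a sketch like the following. It fills the send buffer of a local socket pair that nobody reads from and reports the errno of the first send(2) that cannot make progress. `ex_fill_without_blocking()` is a hypothetical helper, and the choice of an AF_UNIX stream pair is only for a self-contained demonstration.

```c
#include <sys/socket.h>

#include <assert.h>
#include <errno.h>
#include <string.h>
#include <unistd.h>

/*
 * Hypothetical demo: keep sending with MSG_DONTWAIT until the call
 * fails. Returns the errno of the failing send(2), or -1 if the
 * socket pair could not be created.
 */
static int
ex_fill_without_blocking(void)
{
	char buf[4096];
	int error, sv[2];

	if (socketpair(AF_UNIX, SOCK_STREAM, 0, sv) != 0)
		return (-1);
	memset(buf, 0, sizeof(buf));
	/* sv[1] is never read, so the send buffer eventually fills up. */
	while (send(sv[0], buf, sizeof(buf), MSG_DONTWAIT) >= 0)
		;
	error = errno;	/* save before close(2) can overwrite it */
	close(sv[0]);
	close(sv[1]);
	return (error);
}
```

The socket itself is left in blocking mode; only the individual calls carrying the flag are non-blocking, which is the per-call convenience the flag offers over setting O_NONBLOCK.
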
Michael Tuexen
09423f72fd Fix a regression in the TCP handling of received segments.
When receiving TCP segments the stack protects itself by limiting
the resources allocated for a TCP connection. This patch adds
an exception to these limitations for the TCP segment which is the next
expected in-sequence segment. Without this patch, TCP connections
may stall and finally fail in some cases of packet loss.

Reported by:		jhb@
Reviewed by:		jtl@, rrs@
MFC after:		3 days
Sponsored by:		Netflix, Inc.
Differential Revision:	https://reviews.freebsd.org/D18580
2018-12-20 16:05:30 +00:00
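
The exception described in 09423f72fd above can be modelled as a small admission check. This is a sketch under assumptions: the real limits in the reassembly code are more involved, and `ex_reass_admit()` with its `queued` and `limit` parameters is a hypothetical simplification.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/*
 * Hypothetical admission check for a reassembly queue: the segment
 * that starts exactly at rcv_nxt is always admitted, because it is
 * the one that lets the connection make progress. Everything else
 * is subject to the resource limit.
 */
static bool
ex_reass_admit(uint32_t seg_seq, uint32_t rcv_nxt, unsigned int queued,
    unsigned int limit)
{
	if (seg_seq == rcv_nxt)
		return (true);
	return (queued < limit);
}
```

Without the first branch, a full queue of out-of-order segments would reject the very segment needed to drain it, which matches the stall the commit message mentions.
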
Hiren Panchasara
51e712f865 Revert r331567 CC Cubic: fix underflow for cubic_cwnd()
This change is causing TCP connections using cubic to hang. Need to dig more to
find exact cause and fix it.

Reported by:	tj at mrsk dot me, Matt Garber (via twitter)
Discussed with:	sbruno (previously), allanjude, cperciva
MFC after:	3 days
2018-12-15 17:01:16 +00:00
Brooks Davis
855acb84ca Fix bugs in pluggable CC algorithm and siftr sysctls.
Use the sysctl_handle_int() handler to write out the old value and read
the new value into a temporary variable. Use the temporary variable
for any checks of values rather than using the CAST_PTR_INT() macro on
req->newptr. The prior usage read directly from userspace memory if the
sysctl() was called correctly. This is unsafe and doesn't work at all on
some architectures (at least i386.)

In some cases, the code could also be tricked into reading from kernel
memory and leaking limited information about the contents or crashing
the system. This was true for CDG, newreno, and siftr on all platforms
and true for i386 in all cases. The impact of this bug is largest in
VIMAGE jails which have been configured to allow writing to these
sysctls.

Per discussion with the security officer, we will not be issuing an
advisory for this issue as root access and a non-default config are
required to be impacted.

Reviewed by:	markj, bz
Discussed with:	gordon (security officer)
MFC after:	3 days
Security:	kernel information leak, local DoS (both require root)
Differential Revision:	https://reviews.freebsd.org/D18443
2018-12-15 15:06:22 +00:00
Mateusz Guzik
cc426dd319 Remove unused argument to priv_check_cred.
Patch mostly generated with coccinelle:

@@
expression E1,E2;
@@

- priv_check_cred(E1,E2,0)
+ priv_check_cred(E1,E2)

Sponsored by:	The FreeBSD Foundation
2018-12-11 19:32:16 +00:00
Mark Johnston
9d2877fc3d Clamp the INPCB port hash tables to IPPORT_MAX + 1 chains.
Memory beyond that limit was previously unused, wasting roughly 1MB per
8GB of RAM.  Also retire INP_PCBLBGROUP_PORTHASH, which was identical to
INP_PCBPORTHASH.

Reviewed by:	glebius
MFC after:	2 weeks
Differential Revision:	https://reviews.freebsd.org/D17803
2018-12-05 17:06:00 +00:00
Andrey V. Elsukov
d66f9c86fa Add ability to request listing and deleting only for dynamic states.
This can be useful, when net.inet.ip.fw.dyn_keep_states is enabled, but
after rules reloading some state must be deleted. Added new flag '-D'
for such purpose.

Retire '-e' flag, since there can not be expired states in the meaning
that this flag historically had.

Also add "verbose" mode for listing of dynamic states, it can be enabled
with '-v' flag and adds additional information to states list. This can
be useful for debugging.

Obtained from:	Yandex LLC
MFC after:	2 months
Sponsored by:	Yandex LLC
2018-12-04 16:12:43 +00:00
Michael Tuexen
c8b53ced95 Limit option_len for the TCP_CCALGOOPT.
Limiting the length to 2048 bytes seems to be acceptable, since
the values used right now are using 8 bytes.

Reviewed by:		glebius, bz, rrs
MFC after:		3 days
Sponsored by:		Netflix, Inc.
Differential Revision:	https://reviews.freebsd.org/D18366
2018-11-30 10:50:07 +00:00
Mark Johnston
79db6fe7aa Plug some networking sysctl leaks.
Various network protocol sysctl handlers were not zero-filling their
output buffers and thus would export uninitialized stack memory to
userland.  Fix a number of such handlers.

Reported by:	Thomas Barabosch, Fraunhofer FKIE
Reviewed by:	tuexen
MFC after:	3 days
Security:	kernel memory disclosure
Sponsored by:	The FreeBSD Foundation
Differential Revision:	https://reviews.freebsd.org/D18301
2018-11-22 20:49:41 +00:00
Michael Tuexen
ad2be38941 A TCP stack is required to check SEG.ACK first, when processing a
segment in the SYN-SENT state as stated in Section 3.9 of RFC 793,
page 66. Ensure this is also done by the TCP RACK stack.

Reviewed by:		rrs@
MFC after:		1 week
Sponsored by:		Netflix, Inc.
Differential Revision:	https://reviews.freebsd.org/D18034
2018-11-22 20:05:57 +00:00
Michael Tuexen
fef56019e9 Ensure that the TCP RACK stack honours the setting of the
net.inet.tcp.drop_synfin sysctl-variable.

Reviewed by:		rrs@
MFC after:		1 week
Sponsored by:		Netflix, Inc.
Differential Revision:	https://reviews.freebsd.org/D18033
2018-11-22 20:02:39 +00:00
Michael Tuexen
7e729f0787 Ensure that the default TCP stack can make an RTT measurement if
the TCP connection was initiated using the RACK stack, but the
peer does not support the TCP RACK extension.

This ensures that the TCP behaviour on the wire is the same if
the TCP connection is initiated using the RACK stack or the default
stack.

Reviewed by:		rrs@
MFC after:		1 week
Sponsored by:		Netflix, Inc.
Differential Revision:	https://reviews.freebsd.org/D18032
2018-11-22 19:56:52 +00:00
Michael Tuexen
794107181a Ensure that TCP RST-segments announce consistently a receiver window of
zero. This was already done when sending them via tcp_respond().

Reviewed by:		rrs@
MFC after:		1 week
Sponsored by:		Netflix, Inc.
Differential Revision:	https://reviews.freebsd.org/D17949
2018-11-22 19:49:52 +00:00
Michael Tuexen
3bea9a2664 Improve two KASSERTs in the TCP RACK stack.
There are two locations where an always true comparison was made in
a KASSERT. Replace this by an appropriate check and use a consistent
panic message. Also use this code when checking a similar condition.

PR:			229664
Reviewed by:		rrs@
MFC after:		1 week
Sponsored by:		Netflix, Inc.
Differential Revision:	https://reviews.freebsd.org/D18021
2018-11-21 18:19:15 +00:00
Andrey V. Elsukov
5786c6b9f9 Make multiline APPLY_MASK() macro to be function-like.
Reported by:	cem
MFC after:	1 week
2018-11-20 18:38:28 +00:00
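
Making a multi-statement macro function-like, as 5786c6b9f9 above does, is usually done with the do { ... } while (0) idiom. The sketch below shows the idiom on a hypothetical four-word mask macro; `EX_APPLY_MASK` and `ex_mask_if()` are illustrative names and do not reproduce the ipfw macro.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical four-word mask macro. Wrapping the statements in
 * do { ... } while (0) makes the expansion a single statement, so the
 * macro can be followed by ';' and used like a function call.
 */
#define	EX_APPLY_MASK(addr, mask) do {		\
	(addr)[0] &= (mask)[0];			\
	(addr)[1] &= (mask)[1];			\
	(addr)[2] &= (mask)[2];			\
	(addr)[3] &= (mask)[3];			\
} while (0)

static bool
ex_mask_if(bool want, uint32_t addr[4], const uint32_t mask[4])
{
	assert(addr != NULL && mask != NULL);
	if (want)
		EX_APPLY_MASK(addr, mask);	/* safe in an unbraced if/else */
	else
		return (false);
	return (true);
}
```

Without the wrapper, only the first statement would belong to the `if`, and the `else` would not compile.
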
Bjoern A. Zeeb
945aad9c62 Improve the comment for arpresolve_full() in if_ether.c.
No functional changes.

MFC after:	6 weeks
2018-11-17 16:13:09 +00:00
Bjoern A. Zeeb
90d99b6587 Retire arpresolve_addr(), which is not used anywhere, from if_ether.c.
2018-11-17 16:08:36 +00:00
Jonathan T. Looney
2157f3c36a Add some additional length checks to the IPv4 fragmentation code.
Specifically, block 0-length fragments, even when the MF bit is clear.
Also, ensure that every fragment with the MF bit clear ends at the same
offset and that no subsequently-received fragments exceed that offset.

Reviewed by:	glebius, markj
MFC after:	3 days
Sponsored by:	Netflix
Differential Revision:	https://reviews.freebsd.org/D17922
2018-11-16 18:32:48 +00:00
Mark Johnston
86af1d0241 Ensure that IP fragments do not extend beyond IP_MAXPACKET.
Such fragments are obviously invalid, and when processed may end up
violating the sort order (by offset) of fragments of a given packet.
This doesn't appear to be exploitable, however.

Reviewed by:	emaste
Discussed with:	jtl
MFC after:	3 days
Sponsored by:	The FreeBSD Foundation
Differential Revision:	https://reviews.freebsd.org/D17914
2018-11-10 03:00:36 +00:00
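
The bound described in 86af1d0241 above amounts to checking that a fragment's byte offset plus its length stays within IP_MAXPACKET. A minimal sketch, assuming `ip_off` is already in host byte order and `frag_len` is the fragment's payload length; the `EX_` constants mirror the usual IPv4 values and `ex_frag_in_bounds()` is a hypothetical helper.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define	EX_IP_MAXPACKET	65535	/* maximum IPv4 packet size */
#define	EX_IP_OFFMASK	0x1fff	/* fragment offset bits of ip_off */

/*
 * Hypothetical check: the offset field counts 8-byte units, so it is
 * shifted left by 3 to obtain a byte offset before adding the length.
 */
static bool
ex_frag_in_bounds(uint16_t ip_off, uint16_t frag_len)
{
	uint32_t off;

	off = (uint32_t)(ip_off & EX_IP_OFFMASK) << 3;
	assert(off % 8 == 0);	/* offsets are always 8-byte aligned */
	return (off + frag_len <= EX_IP_MAXPACKET);
}
```

The sum is computed in 32 bits so that an oversized fragment cannot wrap around and pass the comparison.
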
Ed Maste
2bfaf585ca Avoid buffer underwrite in icmp_error
icmp_error allocates either an mbuf (with pkthdr) or a cluster depending
on the size of data to be quoted in the ICMP reply, but the calculation
failed to account for the additional padding that m_align may apply.

Include the ip header in the size passed to m_align.  On 64-bit archs
this will have the net effect of moving everything 4 bytes later in the
mbuf or cluster.  This will result in slightly pessimal alignment for
the ICMP data copy.

Also add an assertion that we do not move m_data before the beginning of
the mbuf or cluster.

Reported by:	A reddit user
Reviewed by:	bz, jtl
MFC after:	3 days
Security:	CVE-2018-17156
Sponsored by:	The FreeBSD Foundation
Differential Revision:	https://reviews.freebsd.org/D17909
2018-11-08 20:17:36 +00:00
Michael Tuexen
8553b984a5 Don't use a function when neither INET nor INET6 are defined.
This is a valid case for the userland stack, where this fixes
two set-but-not-used warnings in this case.

Thanks to Christian Wright for reporting the issue.
2018-11-06 12:55:03 +00:00
Jonathan T. Looney
54e675342b m_pulldown() may reallocate n. Update the oip pointer after the
m_pulldown() call.

MFC after:	2 weeks
Sponsored by:	Netflix
2018-11-02 19:14:15 +00:00
Bjoern A. Zeeb
e2c532f156 carpstats are the last virtualised variable in the file and end up at the
end of the vnet_set.  The generated code uses an absolute relocation at
one byte beyond the end of the carpstats array.  This means the relocation
for the vnet does not happen for carpstats initialisation and as a result
the kernel panics on module load.

This problem has only been observed with carp and only on i386.
We considered various possible solutions including using linker scripts
to add padding to all kernel modules for pcpu and vnet sections.

While the symbols (by chance) stay in the order of appearance in the file
adding an unused non-file-local variable at the end of the file will extend
the size of set_vnet and hence make the absolute relocation for carpstats
work (think of this as a single-module set_vnet padding).

This is a (temporary) hack.  It is the least intrusive one as we need a
timely solution for the upcoming release.  We will revisit the problem in
HEAD.  For a lot more information and the possible alternate solutions
please see the PR and the references therein.

PR:			230857
MFC after:		3 days
2018-11-01 17:26:18 +00:00
Mark Johnston
d9ff5789be Remove redundant checks for a NULL lbgroup table.
No functional change intended.

MFC after:	2 weeks
Sponsored by:	The FreeBSD Foundation
Differential Revision:	https://reviews.freebsd.org/D17108
2018-11-01 15:52:49 +00:00
Mark Johnston
79ee680b65 Improve style in in_pcbinslbgrouphash() and related subroutines.
No functional change intended.

MFC after:	2 weeks
Sponsored by:	The FreeBSD Foundation
Differential Revision:	https://reviews.freebsd.org/D17107
2018-11-01 15:51:49 +00:00
Michael Tuexen
6999f6975c Remove debug code which slipped in accidentally.
MFC after:		4 weeks
X-MFC with:		r339989
Sponsored by:		Netflix, Inc.
2018-11-01 11:41:40 +00:00
Michael Tuexen
099ab39f44 Improve a comment to refer to the actual sections in the TCP
specification for the comparisons made.
Thanks to lstewart@ for the suggestion.

MFC after:		4 weeks
Sponsored by:		Netflix, Inc.
Differential Revision:	https://reviews.freebsd.org/D17595
2018-11-01 11:35:28 +00:00
Bjoern A. Zeeb
201100c58b Initial implementation of draft-ietf-6man-ipv6only-flag.
This change defines the RA "6" (IPv6-Only) flag which routers
may advertise, kernel logic to check if all routers on a link
have the flag set and accordingly update a per-interface flag.

If all routers agree that it is an IPv6-only link, ether_output_frame(),
based on the interface flag, will filter out all ETHERTYPE_IP/ARP
frames, drop them, and return EAFNOSUPPORT to upper layers.

The change also updates ndp to show the "6" flag, ifconfig to
display the IPV6_ONLY nd6 flag if set, and rtadvd to allow
announcing the flag.

Further changes to tcpdump (contrib code) are available and will
be upstreamed.

Tested the code (slightly earlier version) with 2 FreeBSD
IPv6 routers, a FreeBSD laptop on ethernet as well as wifi,
and with Win10 and OSX clients (which did not fall over with
the "6" flag set but not understood).

We may also want to (a) implement an RX filter, and (b) over
time enhance user space to, say, stop dhclient from running
when the interface flag is set.  Also we might want to start
IPv6 before IPv4 in the future.

All the code is hidden under the EXPERIMENTAL option and not
compiled by default as the draft is a work-in-progress and
we cannot rely on the fact that IANA will assign the bits
as requested by the draft and hence they may change.

Dear 6man, you have running code.

Discussed with:	Bob Hinden, Brian E Carpenter
2018-10-30 20:08:48 +00:00
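
The output-side filtering described in 201100c58b above can be sketched as a pure function over the interface flag and the frame's EtherType. The `EX_ETHERTYPE_*` values are the standard EtherType numbers; `ex_output_check()` and its boolean `ipv6_only` parameter are hypothetical and stand in for the per-interface flag.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stdint.h>

#define	EX_ETHERTYPE_IP		0x0800
#define	EX_ETHERTYPE_ARP	0x0806
#define	EX_ETHERTYPE_IPV6	0x86dd

/*
 * Hypothetical check: on a link that all routers declared IPv6-only,
 * IPv4 and ARP frames are refused with EAFNOSUPPORT. Returns 0 when
 * the frame may be sent.
 */
static int
ex_output_check(bool ipv6_only, uint16_t ether_type)
{
	if (ipv6_only &&
	    (ether_type == EX_ETHERTYPE_IP || ether_type == EX_ETHERTYPE_ARP))
		return (EAFNOSUPPORT);
	return (0);
}
```
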
Mark Johnston
da7d7778b0 Expose some netdump configuration parameters through sysctl.
Reviewed by:	cem
MFC after:	1 week
Differential Revision:	https://reviews.freebsd.org/D17755
2018-10-29 21:16:26 +00:00
Eugene Grosbein
1a5995cc88 Prevent ip_input() from panicking due to unprotected access to INADDR_HASH.
PR:			220078
MFC after:		1 month
Differential Revision:	https://reviews.freebsd.org/D12457
Tested-by:		Cassiano Peixoto and others
2018-10-27 04:59:35 +00:00
Eugene Grosbein
4f1e3122ac Prevent multicast code from panicking due to unprotected access to INADDR_HASH.
PR:			220078
MFC after:		1 month
Differential Revision:	https://reviews.freebsd.org/D12457
Tested-by:		Cassiano Peixoto and others
2018-10-27 04:53:25 +00:00
Michael Tuexen
de00ad05e6 Add initial descriptions for SCTP related MIB variables.
This work was mostly done by Marie-Helene Kvello-Aune.

MFC after:		3 days
Differential Revision:	https://reviews.freebsd.org/D3583
2018-10-26 21:04:17 +00:00
Andrey V. Elsukov
8796e291f8 Add the check that current VNET is ready and access to srchash is allowed.
This change is similar to r339646. The callback that checks for appearing
and disappearing of tunnel ingress address can be called during VNET
teardown. To prevent access to already freed memory, add check to the
callback and epoch_wait() call to be sure that callback has finished its
work.

MFC after:	20 days
2018-10-23 13:11:45 +00:00
John Baldwin
74e10fb613 A couple of style fixes in recent TCP changes.
- Add a blank line before a block comment to match other block comments
  in the same function.
- Sort the prototype for sbsndptr_adv and fix whitespace between return
  type and function name.

Reviewed by:	gallatin, bz
Differential Revision:	https://reviews.freebsd.org/D17474
2018-10-22 21:17:36 +00:00
Eugene Grosbein
410634efd1 New sysctl: net.inet.icmp.error_keeptags
Currently, icmp_error() function copies FIB number from original packet
into generated ICMP response but not mbuf_tags(9) chain.
This prevents us from easily matching ICMP responses corresponding
to tagged original packets by means of packet filter such as ipfw(8).
For example, ICMP "time-exceeded in-transit" packets usually generated
in response to traceroute probes lose tags attached to original packets.

This change adds new sysctl net.inet.icmp.error_keeptags
that defaults to 0 to avoid extra overhead when this feature not needed.

Set net.inet.icmp.error_keeptags=1 to make icmp_error() copy mbuf_tags
from original packet to generated ICMP response.

PR:		215874
MFC after:	1 month
2018-10-21 21:29:19 +00:00
Andrey V. Elsukov
f252e3f2f2 Include <sys/eventhandler.h> to fix the build.
MFC after:	1 month
2018-10-21 18:39:34 +00:00
Andrey V. Elsukov
19873f4780 Add handling for appearing/disappearing of ingress addresses to if_gre(4).
* register handler for ingress address appearing/disappearing;
* add new srcaddr hash table for fast softc lookup by srcaddr;
* when srcaddr disappears, clear IFF_DRV_RUNNING flag from interface,
  and set it otherwise;

MFC after:	1 month
Sponsored by:	Yandex LLC
Differential Revision:	https://reviews.freebsd.org/D17214
2018-10-21 18:13:45 +00:00
Andrey V. Elsukov
009d82ee0f Add handling for appearing/disappearing of ingress addresses to if_gif(4).
* register handler for ingress address appearing/disappearing;
* add new srcaddr hash table for fast softc lookup by srcaddr;
* when srcaddr disappears, clear IFF_DRV_RUNNING flag from interface,
  and set it otherwise;
* remove the note about ingress address from BUGS section.

MFC after:	1 month
Sponsored by:	Yandex LLC
Differential Revision:	https://reviews.freebsd.org/D17134
2018-10-21 18:06:15 +00:00
Andrey V. Elsukov
8251c68d5c Add KPI that can be used by tunneling interfaces to handle IP addresses
appearing and disappearing on the host system.

Such handling is needed, because tunneling interfaces must use addresses,
that are configured on the host as ingress addresses for tunnels.
Otherwise the system can send spoofed packets with source address, that
belongs to foreign host.

The KPI uses ifaddr_event_ext event to implement addresses tracking.
Tunneling interfaces register event handlers and then they are
notified by the kernel, when an address disappears or appears.

ifaddr_event_compat() handler from if.c replaced by srcaddr_change_event()
in the ip_encap.c

MFC after:	1 month
Sponsored by:	Yandex LLC
Differential Revision:	https://reviews.freebsd.org/D17134
2018-10-21 17:55:26 +00:00
Andrey V. Elsukov
094d6f8d75 Add IPFW_RULE_JUSTOPTS flag, that is used by ipfw(8) to mark rule,
that was added using "new rule format". And then, when the kernel
returns rule with this flag, ipfw(8) can correctly show it.

Reported by:	lev
MFC after:	3 weeks
Sponsored by:	Yandex LLC
Differential Revision:	https://reviews.freebsd.org/D17373
2018-10-21 15:10:59 +00:00
Andrey V. Elsukov
64d63b1e03 Add ifaddr_event_ext event. It is similar to ifaddr_event, but the
handler receives the type of event IFADDR_EVENT_ADD/IFADDR_EVENT_DEL,
and the pointer to ifaddr. Also ifaddr_event now is implemented using
ifaddr_event_ext handler.

MFC after:	3 weeks
Sponsored by:	Yandex LLC
Differential Revision:	https://reviews.freebsd.org/D17100
2018-10-21 15:02:06 +00:00
Michael Tuexen
93899d10b4 The handling of RST segments in the SYN-RCVD state exists in two
code paths. The two are not consistent, and the one in the syn cache code
does not conform to the relevant specifications (Page 69 of RFC 793
and Section 4.2 of RFC 5961).

This patch fixes this:
* The sequence number checks are fixed as specified on
  Page 69 of RFC 793.
* The sysctl variable net.inet.tcp.insecure_rst is now honoured
  and the behaviour is as specified in Section 4.2 of RFC 5961.

Approved by:		re (gjb@)
Reviewed by:		bz@, glebius@, rrs@,
Differential Revision:	https://reviews.freebsd.org/D17595
Sponsored by:		Netflix, Inc.
2018-10-18 19:21:18 +00:00
Jonathan T. Looney
ac75e35d85 In r338102, the TCP reassembly code was substantially restructured. Prior
to this change, the code sometimes used a temporary stack variable to hold
details of a TCP segment. r338102 stopped using the variable to hold
segments, but did not actually remove the variable.

Because the variable is no longer used, we can safely remove it.

Approved by:	re (gjb)
2018-10-16 14:41:09 +00:00
Bjoern A. Zeeb
4ba16a92c7 In udp_input() when walking the pcblist we can come across
an inp marked FREED after the epoch(9) changes.
Check once we hold the lock and skip the inp if it is the case.

Contrary to IPv6 the locking of the inp is outside the multicast
section and hence a single check seems to suffice.

PR:		232192
Reviewed by:	mmacy, markj
Approved by:	re (kib)
Differential Revision:	https://reviews.freebsd.org/D17540
2018-10-12 22:51:45 +00:00
Bjoern A. Zeeb
3afdfcaf33 r217592 moved the check for imo in udp_input() into the conditional block
but leaving the variable assignment outside the block, where it is no longer
used. Move both the variable and the assignment one block further in.

This should result in no functional changes. It will however make upcoming
changes slightly easier to apply.

Reviewed by:		markj, jtl, tuexen
Approved by:		re (kib)
Differential Revision:	https://reviews.freebsd.org/D17525
2018-10-12 11:30:46 +00:00
Jonathan T. Looney
13c6ba6d94 There are three places where we return from a function which entered an
epoch section without exiting that epoch section. This is bad for two
reasons: the epoch section won't exit, and we will leave the epoch tracker
from the stack on the epoch list.

Fix the epoch leak by making sure we exit epoch sections before returning.

Reviewed by:	ae, gallatin, mmacy
Approved by:	re (gjb, kib)
Sponsored by:	Netflix
Differential Revision:	https://reviews.freebsd.org/D17450
2018-10-09 13:26:06 +00:00
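
The leak fixed in 13c6ba6d94 above is the classic "early return skips the cleanup" bug. One common way to avoid it is a single exit path. The sketch below models an epoch section with a plain counter; `ex_epoch_enter()`, `ex_epoch_exit()` and `ex_lookup()` are hypothetical userland stand-ins and not the epoch(9) API.

```c
#include <assert.h>
#include <errno.h>

static int ex_epoch_depth;	/* stand-in for "sections currently entered" */

static void
ex_epoch_enter(void)
{
	ex_epoch_depth++;
}

static void
ex_epoch_exit(void)
{
	assert(ex_epoch_depth > 0);
	ex_epoch_depth--;
}

/*
 * Hypothetical function with an error path. Every return goes through
 * the `out` label, so the section is always exited exactly once.
 */
static int
ex_lookup(int key)
{
	int error = 0;

	ex_epoch_enter();
	if (key < 0) {
		error = EINVAL;
		goto out;	/* the early failure still leaves the section */
	}
	/* ... work that must run inside the section ... */
out:
	ex_epoch_exit();
	return (error);
}
```

After any call, successful or not, the depth counter is back to its previous value, which is the invariant the commit restores.
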
Michael Tuexen
3535cdc43e Avoid truncating unrecognised parameters when reporting them.
This resulted in sending malformed packets.

Approved by:		re (kib@)
MFC after:		1 week
2018-10-07 15:13:47 +00:00
Michael Tuexen
3924dfa721 Ensure that the ips_localout counter is incremented for
locally generated SCTP packets sent over IPv4. This makes
the behaviour consistent with IPv6.

Reviewed by:		ae@, bz@, jtl@
Approved by:		re (kib@)
MFC after:		1 week
Differential Revision:	https://reviews.freebsd.org/D17406
2018-10-07 11:26:15 +00:00
Tom Jones
b6e870116f Convert UDP length to host byte order
When getting the number of bytes to checksum make sure to convert the UDP
length to host byte order when the entire header is not in the first mbuf.

Reviewed by: jtl, tuexen, ae
Approved by: re (gjb), jtl (mentor)
Differential Revision:  https://reviews.freebsd.org/D17357
2018-10-05 12:51:30 +00:00
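
The conversion that b6e870116f above adds can be illustrated in userland: the UDP length field is carried in network byte order and has to pass through ntohs() before it is used as a byte count. `ex_udp_len()` is a hypothetical helper that reads the field from a raw 8-byte UDP header; it is not the kernel code path.

```c
#include <arpa/inet.h>

#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define	EX_UDP_LEN_OFFSET	4	/* length field follows the two ports */

/*
 * Hypothetical helper: extract the UDP length from a raw header and
 * return it in host byte order. memcpy() avoids unaligned access.
 */
static uint16_t
ex_udp_len(const uint8_t *udp_hdr)
{
	uint16_t wire_len;

	assert(udp_hdr != NULL);
	memcpy(&wire_len, udp_hdr + EX_UDP_LEN_OFFSET, sizeof(wire_len));
	return (ntohs(wire_len));
}
```

On a little-endian host, skipping the conversion would turn a length of 300 (0x012c) into 11265 (0x2c01), so the checksum would cover the wrong number of bytes.
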
Ryan Stone
083a010c62 Hold a write lock across udp_notify()
With the new route cache feature udp_notify() will modify the inp when it
needs to invalidate the route cache.  Ensure that we hold a write lock on
the inp before calling the function to ensure that multiple threads don't
race while trying to invalidate the cache (which previously led to a page
fault).

Differential Revision: https://reviews.freebsd.org/D17246
Reviewed by: sbruno, bz, karels
Sponsored by: Dell EMC Isilon
Approved by:	re (gjb)
2018-10-04 22:03:58 +00:00
Michael Tuexen
15a087e551 Mitigate providing a timing signal if the COOKIE or AUTH
validation fails.
Thanks to jmg@ for reporting the issue, which was discussed in
https://admbugs.freebsd.org/show_bug.cgi?id=878

Approved by:            re (TBD@)
MFC after:              1 week
2018-10-01 14:05:31 +00:00
Michael Tuexen
9d2e3f14c4 After allocating chunks set the fields in a consistent way.
This removes two assignments for the flags field being done
twice and adds one, which was missing.
Thanks to Felix Weinrank for reporting the issue he found
by using fuzz testing of the userland stack.

Approved by:            re (kib@)
MFC after:              1 week
2018-10-01 13:09:18 +00:00
Andrey V. Elsukov
384a5c3c28 Add INP_INFO_WUNLOCK_ASSERT() macro and use it instead of
INP_INFO_UNLOCK_ASSERT() in TCP-related code. For encapsulated traffic
it is possible that the code is running in a net_epoch_preempt section,
and INP_INFO_UNLOCK_ASSERT() is too strict an assertion for such a case.

PR:		231428
Reviewed by:	mmacy, tuexen
Approved by:	re (kib)
Differential Revision:	https://reviews.freebsd.org/D17335
2018-10-01 10:46:00 +00:00
Michael Tuexen
1b084a5e5e Plug mbuf leak in the SCTP input path in an error case.
Approved by:            re (kib@)
MFC after:              1 week
CID:			749312
2018-09-30 21:54:02 +00:00
Michael Tuexen
66bcf0b333 Plug mbuf leaks in the SCTP output path in error cases.
Approved by:            re (kib@)
MFC after:              1 week
CID:			1395307
2018-09-30 21:31:33 +00:00
Michael Tuexen
8184648425 Fix the handling of ancillary data for SCTP socket. Implement
sctp_process_cmsgs_for_init() and sctp_findassociation_cmsgs()
similar to sctp_find_cmsg() to improve consistency and avoid
the signed/unsigned issues in sctp_process_cmsgs_for_init()
and sctp_findassociation_cmsgs().

Thanks to andrew@ for reporting the problem he found using
syzkaller.

Approved by:            re (kib@)
MFC after:              1 week
2018-09-30 16:21:31 +00:00
Michael Tuexen
ae0a9a8850 Increment the corresponding UDP stats counter (udps_opackets) when
sending UDP encapsulated SCTP packets.
This is consistent with the behaviour that when such packets are received,
the corresponding UDP stats counter (udps_ipackets) is incremented.
Thanks to Peter Lei for making me aware of this inconsistency.

Approved by:            re (kib@)
MFC after:              1 week
2018-09-30 12:16:06 +00:00
Michael Tuexen
3552f16d82 Fix typo in comment.
Reported by:		@danfe
Approved by:		re (kib@)
MFC after:		1 week
X-MFC:			r338941
2018-09-28 19:47:32 +00:00
Michael Tuexen
0277ec9c43 Whitespace changes and fixing a typo. No functional change.
Approved by:	re (kib@)
MFC after:	1 week
2018-09-26 10:24:50 +00:00
Michael Tuexen
078a49a077 Remove the unused parameter 'locked' from the function
syncache_respond(). There is no functional change. The
parameter became unused in r313330, but wasn't removed.

Approved by:		re (kib@)
MFC after:		1 month
Sponsored by:		Netflix, Inc.
2018-09-23 16:37:32 +00:00
Andrey V. Elsukov
76b09d1823 Add new field max_hdrsize to struct encap_config.
It is currently unused and reserved for future use to keep KBI/KPI.
Also add several spare pointers to be able to extend the structure if
needed.

Approved by:	re (gjb)
2018-09-20 19:45:27 +00:00
Michael Tuexen
ba4704a278 Remove unused code.
Approved by:	re (kib@)
MFC after:	1 week
2018-09-18 10:53:07 +00:00
Michael Tuexen
a8a8a8a808 Fix TCP Fast Open for the TCP RACK stack.
* Fix a bug where the SYN handling during established state was
  applied to a front state.
* Move a check for retransmission after the timer handling.
  This was suppressing timer based retransmissions.
* Fix an off-by-one error in the sequence number of retransmissions.
* Apply fixes corresponding to
  https://svnweb.freebsd.org/changeset/base/336934

Reviewed by:		rrs@
Approved by:		re (kib@)
MFC after:		1 month
Sponsored by:		Netflix, Inc.
Differential Revision:	https://reviews.freebsd.org/D16912
2018-09-12 10:27:58 +00:00
Mark Johnston
54af3d0dac Fix synchronization of LB group access.
Lookups are protected by an epoch section, so the LB group linkage must
be a CK_LIST rather than a plain LIST.  Furthermore, we were not
deferring LB group frees, so in_pcbremlbgrouphash() could race with
readers and cause a use-after-free.

Reviewed by:	sbruno, Johannes Lundberg <johalun0@gmail.com>
Tested by:	gallatin
Approved by:	re (gjb)
Sponsored by:	The FreeBSD Foundation
Differential Revision:	https://reviews.freebsd.org/D17031
2018-09-10 19:00:29 +00:00
Mark Johnston
a7026c7fd9 Use ratecheck(9) in in_pcbinslbgrouphash().
Reviewed by:	bz, Johannes Lundberg <johalun0@gmail.com>
Approved by:	re (kib)
Sponsored by:	The FreeBSD Foundation
Differential revision:	https://reviews.freebsd.org/D17065
2018-09-07 21:11:41 +00:00
Bjoern A. Zeeb
113c4fad55 The inp_lle field to struct inpcb, along with two "valid" flags
for the rt and lle cache were added in r191129 (2009).
To my best knowledge they have never been used and route caching
has converted the inp_rt field from that commit to inp_route
rendering this field and these flags obsolete.

Convert the pointer into a spare pointer to not change the size of
the structure anymore (and to have a spare pointer) and mark the
two fields as unused.

Reviewed by:	markj, karels
Approved by:	re (gjb)
Differential Revision:	https://reviews.freebsd.org/D17062
2018-09-06 19:55:40 +00:00
Bjoern A. Zeeb
6d2b0c0166 Make tcp_hpts.c compile a LINT kernel with options RSS and PCBGROUPS added by
adding the missing include files and changing the type of cpuid which
would otherwise cause a false comparison with NETISR_CPUID_NONE.

Reviewed by:	rrs
Approved by:	re (marius)
Differential Revision:	https://reviews.freebsd.org/D16891
2018-09-06 16:11:24 +00:00
Mark Johnston
49365eb433 Define sctp probes only when SCTP is configured.
Otherwise the "depends_on provider" guard in sctp.d does not work as
intended.

Reported by:	mjg
Reviewed by:	tuexen
Approved by:	re (gjb)
Differential Revision:	https://reviews.freebsd.org/D17057
2018-09-06 14:15:03 +00:00
Mark Johnston
8be02ee4da Fix style bugs in in_pcblookup_lbgroup().
No functional change intended.

Reviewed by:	bz, Johannes Lundberg <johalun0@gmail.com>
Approved by:	re (rgrimes)
Sponsored by:	The FreeBSD Foundation
Differential Revision:	https://reviews.freebsd.org/D17030
2018-09-05 15:04:11 +00:00
Eugene Grosbein
d5d21ad932 Fix "ipfw fwd" to work for incoming IPv4 packets when ip_tryforward() chooses
fast forwarding path, as it already works for IPv6 and for both of them
on old slow path.

PR:			231143
Reviewed by:		ae
Approved by:		re (gjb)
MFC after:		1 week
Differential Revision:	https://reviews.freebsd.org/D17039
2018-09-05 13:59:36 +00:00
Mark Johnston
73ad0b6abf Use the correct malloc type in in_pcblbgroup_free().
Approved by:	re (kib)
Sponsored by:	The FreeBSD Foundation
2018-09-03 17:39:09 +00:00
Michael Tuexen
c6c0be2765 Fix a shadowed variable warning.
Thanks to Peter Lei for reporting the issue.

Approved by:		re (kib@)
MFH:			1 month
Sponsored by:		Netflix, Inc.
2018-08-24 10:50:19 +00:00
Michael Tuexen
90ab3571d8 Use arc4rand() instead of read_random() in the SCTP and TCP code.
This was suggested by jmg@.

Reviewed by:		delphij@, jmg@, jtl@
MFC after:		1 month
Sponsored by:		Netflix, Inc.
Differential Revision:	https://reviews.freebsd.org/D16860
2018-08-23 19:10:45 +00:00
Michael Tuexen
4ba1513d1a Don't use the explicit number 32 for the length of the secrets,
use sizeof() or explicit #defines instead. No functional change.
This was suggested by jmg@.

MFC after:		1 month
X-MFC with:		r338053
Sponsored by:		Netflix, Inc.
2018-08-23 06:03:59 +00:00
Michael Tuexen
1e88cc8b59 Add support for send, receive and state-change DTrace providers for
SCTP. They are based on what is specified in the Solaris DTrace manual
for Solaris 11.4.

Reviewed by:		0mp, dteske, markj
Relnotes:		yes
Differential Revision:	https://reviews.freebsd.org/D16839
2018-08-22 21:23:32 +00:00
Matt Macy
d3878608d7 in_mcast: fix copy paste error when clearing flag
2018-08-22 04:09:55 +00:00
Michael Tuexen
5dff1c3845 Enabling the IPPROTO_IPV6 level socket option IPV6_USE_MIN_MTU on a TCP
socket resulted in sending fragmented IPV6 packets.

This is fixed by reducing the MSS to the appropriate value. In addition,
if the socket option is set before the handshake happens, announce this
MSS to the peer. This is not strictly required, but done since TCP
is conservative.
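
As a worked example of the "appropriate value": the IPv6 minimum MTU is 1280 bytes, so with a 40-byte IPv6 header and a 20-byte TCP header the MSS becomes 1280 - 40 - 20 = 1220. The sketch below encodes that arithmetic; the function `clamp_mss_min_mtu()` and the `SKETCH_*` constants are hypothetical names, and extension headers and TCP options are ignored.

```c
#include <assert.h>
#include <stdint.h>

#define SKETCH_IPV6_MIN_MTU	1280u	/* IPv6 minimum link MTU */
#define SKETCH_IPV6_HDR_LEN	40u	/* fixed IPv6 header */
#define SKETCH_TCP_HDR_LEN	20u	/* TCP header without options */

/*
 * Hypothetical helper: when the minimum-MTU option is enabled, clamp
 * the MSS so that a full segment fits into a 1280-byte IPv6 packet.
 */
static uint32_t
clamp_mss_min_mtu(uint32_t cur_mss, int use_min_mtu)
{
	uint32_t min_mtu_mss;

	assert(cur_mss > 0);
	min_mtu_mss = SKETCH_IPV6_MIN_MTU - SKETCH_IPV6_HDR_LEN -
	    SKETCH_TCP_HDR_LEN;
	if (use_min_mtu && cur_mss > min_mtu_mss)
		return (min_mtu_mss);
	return (cur_mss);
}
```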

PR:			173444
Reviewed by:		bz@, rrs@
MFC after:		1 month
Sponsored by:		Netflix, Inc.
Differential Revision:	https://reviews.freebsd.org/D16796
2018-08-21 14:12:30 +00:00
Michael Tuexen
7d4dcc36a8 Fix the inheritance of IPv6 level socket options on TCP sockets.
This was broken for IPv6 listening sockets, which are not IPV6_ONLY,
and the accepted TCP connection was using IPv4.

Reviewed by:		bz@, rrs@
MFC after:		1 month
Sponsored by:		Netflix, Inc.
Differential Revision:	https://reviews.freebsd.org/D16792
2018-08-21 14:07:36 +00:00
Michael Tuexen
6ef849e601 Whitespace change.
2018-08-21 13:37:06 +00:00
Michael Tuexen
1a0b021677 Refactor the SHUTDOWN_PENDING state handling.
This is not a functional change but a preparation for the upcoming
DTrace support. It is necessary to change the state in one
logical operation, even if it involves clearing the sub state
SHUTDOWN_PENDING.

MFC after:		1 month
2018-08-21 13:25:32 +00:00
Bjoern A. Zeeb
10b070c166 GC inc_isipv6; it was added for "temp" compatibility in 2001, r86764
and does not seem to be used.
2018-08-20 20:06:36 +00:00
Randall Stewart
c28440db29 This change represents a substantial restructure of the way we
reassemble inbound TCP segments. The old algorithm just blindly
dropped in segments without coalescing. This meant that every
segment could take up greater and greater room on the linked list
of segments. This of course is now subject to a tighter limit (100)
of segments which in a high BDP situation will cause us to be a
lot more inefficient as we drop segments beyond 100 entries that
we receive. What this restructure does is cause the reassembly
buffer to coalesce segments putting an emphasis on the two
common cases (which avoid walking the list of segments) i.e.
where we add to the back of the queue of segments and where we
add to the front. We also have the reassembly buffer supporting
a couple of debug options (black box logging as well as counters
for code coverage). These are compiled out by default but can
be added by uncommenting the defines.
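
The two common cases mentioned above (adding to the back or to the front of the queue without walking it) can be sketched as follows. This is a simplified illustration, not the committed code: `struct seg`, `struct reass_q` and `reass_try_fast()` are hypothetical, segments are plain `[start, end)` ranges, and sequence number wraparound and overlapping data are ignored.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* One coalesced run of received data, covering [start, end). */
struct seg {
	uint32_t start;
	uint32_t end;
	struct seg *next;
};

/* Queue sorted by sequence number; head is lowest, tail is highest. */
struct reass_q {
	struct seg *head;
	struct seg *tail;
	int count;
};

/*
 * Try the two fast paths.  Returns true if the new data was merged
 * into an existing entry, so the queue length does not grow.  On
 * false the caller would have to walk the list (not shown here).
 */
static bool
reass_try_fast(struct reass_q *q, uint32_t start, uint32_t end)
{
	assert(q != NULL && start < end);
	if (q->tail != NULL && start == q->tail->end) {
		q->tail->end = end;		/* extend the last entry */
		return (true);
	}
	if (q->head != NULL && end == q->head->start) {
		q->head->start = start;		/* extend the first entry */
		return (true);
	}
	return (false);
}
```

Because adjacent data is merged into an existing entry, in-order arrivals behind a single hole keep the queue at one entry instead of consuming one entry per segment.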

Sponsored by:	Netflix Inc.
Differential Revision:	https://reviews.freebsd.org/D16626
2018-08-20 12:43:18 +00:00
Michael Tuexen
8e02b4e00c Don't expose the uptime via the TCP timestamps.
The TCP client side or the TCP server side when not using SYN-cookies
used the uptime as the TCP timestamp value. This patch uses in all
cases an offset, which is the result of a keyed hash function taking
the source and destination addresses and port numbers into account.
The keyed hash function is the same as the one used for the initial TSN.
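
The idea can be sketched in userland C as follows. The hash below is a seeded FNV-1a used purely as a stand-in so the example is self-contained; it is not cryptographically strong and is not the keyed hash function the commit uses. `struct conn4`, `ts_offset()` and `ts_value()` are hypothetical names.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Connection 4-tuple (IPv4 only in this sketch); no padding bytes. */
struct conn4 {
	uint32_t laddr;
	uint32_t faddr;
	uint16_t lport;
	uint16_t fport;
};

/* Stand-in keyed hash: FNV-1a over the secret, then over the tuple. */
static uint32_t
ts_offset(const struct conn4 *c, const uint8_t secret[16])
{
	const unsigned char *p = (const unsigned char *)c;
	uint32_t h = 2166136261u;
	size_t i;

	assert(c != NULL && secret != NULL);
	for (i = 0; i < 16; i++)
		h = (h ^ secret[i]) * 16777619u;
	for (i = 0; i < sizeof(*c); i++)
		h = (h ^ p[i]) * 16777619u;
	return (h);
}

/* Timestamp put on the wire: uptime plus the per-connection offset. */
static uint32_t
ts_value(uint32_t uptime_ms, uint32_t offset)
{
	return (uptime_ms + offset);	/* wraps modulo 2^32 by design */
}
```

Differences between two timestamps of the same connection are unchanged by the offset, while the absolute value no longer reveals the uptime to a peer that does not know the secret.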

Reviewed by:		rrs@
MFC after:		1 month
Sponsored by:		Netflix, Inc.
Differential Revision:	https://reviews.freebsd.org/D16636
2018-08-19 14:56:10 +00:00
Navdeep Parhar
32d2623ae2 Add the ability to look up the 3b PCP of a VLAN interface. Use it in
toe_l2_resolve to fill up the complete vtag and not just the vid.

Reviewed by:	kib@
MFC after:	1 week
Sponsored by:	Chelsio Communications
Differential Revision:	https://reviews.freebsd.org/D16752
2018-08-16 23:46:38 +00:00
Matt Macy
f9be038601 Fix in6_multi double free
This is actually several different bugs:
- The code is not designed to handle inpcb deletion after interface deletion
  - add reference for inpcb membership
- The multicast address has to be removed from interface lists when the refcount
  goes to zero OR when the interface goes away
  - decouple list disconnect from refcount (v6 only for now)
- ifmultiaddr can exist past being on interface lists
  - add flag for tracking whether or not it's enqueued
- deferring freeing moptions makes the inpcb cleanup code simpler but opens the
  door wider still to races
  - call inp_gcmoptions synchronously after dropping the inpcb lock

Fundamentally multicast needs a rewrite - but keep applying band-aids for now.

Tested by: kp
Reported by: novel, kp, lwhsu
2018-08-15 20:23:08 +00:00
Luiz Otavio O Souza
59b2022f94 Late style follow up on r312770.
Submitted by:	glebius
X-MFC with:	r312770
MFC after:	3 days
2018-08-15 15:44:30 +00:00
Jonathan T. Looney
a967df1c8f Lower the default limits on the IPv4 reassembly queue.
In particular, try to ensure that no bucket will have a reassembly
queue larger than approximately 100 items. This limits the cost to
find the correct reassembly queue when processing an incoming
fragment.

Due to the low limits on each bucket's length, increase the size of
the hash table from 64 to 1024.

Reviewed by:	jhb
Security:	FreeBSD-SA-18:10.ip
Security:	CVE-2018-6923
2018-08-14 17:30:46 +00:00
Jonathan T. Looney
ff790bbad0 Implement a limit on the number of IPv4 reassembly queues per bucket.
There is a hashing algorithm which should distribute IPv4 reassembly
queues across the available buckets in a relatively even way. However,
if there is a flaw in the hashing algorithm which allows a large number
of IPv4 fragment reassembly queues to end up in a single bucket, a per-
bucket limit could help mitigate the performance impact of this flaw.

Implement such a limit, with a default of twice the maximum number of
reassembly queues divided by the number of buckets. Recalculate the
limit any time the maximum number of reassembly queues changes.
However, allow the user to override the value using a sysctl
(net.inet.ip.maxfragbucketsize).
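
The default described above works out to simple integer arithmetic. The helper below is a hypothetical sketch of that formula only; the function name is made up, and corner cases the kernel may handle (such as negative or very small limits) are left out.

```c
#include <assert.h>

/*
 * Hypothetical helper: default per-bucket limit, computed as twice
 * the maximum number of reassembly queues divided by the number of
 * hash buckets.
 */
static int
frag_bucket_limit(int maxfragpackets, int nbuckets)
{
	assert(maxfragpackets >= 0 && nbuckets > 0);
	return ((2 * maxfragpackets) / nbuckets);
}
```

For example, with 4096 reassembly queues and 1024 buckets the default per-bucket limit is 8, twice the average load of 4 queues per bucket.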

Reviewed by:	jhb
Security:	FreeBSD-SA-18:10.ip
Security:	CVE-2018-6923
2018-08-14 17:23:05 +00:00
Jonathan T. Looney
7b9c5eb0a5 Add a global limit on the number of IPv4 fragments.
The IP reassembly fragment limit is based on the number of mbuf clusters,
which are a global resource. However, the limit is currently applied
on a per-VNET basis. Given enough VNETs (or given sufficient customization
of enough VNETs), it is possible that the sum of all the VNET limits
will exceed the number of mbuf clusters available in the system.

Given the fact that the fragment limit is intended (at least in part) to
regulate access to a global resource, the fragment limit should
be applied on a global basis.

VNET-specific limits can be adjusted by modifying the
net.inet.ip.maxfragpackets and net.inet.ip.maxfragsperpacket
sysctls.

To disable fragment reassembly globally, set net.inet.ip.maxfrags to 0.
To disable fragment reassembly for a particular VNET, set
net.inet.ip.maxfragpackets to 0.

Reviewed by:	jhb
Security:	FreeBSD-SA-18:10.ip
Security:	CVE-2018-6923
2018-08-14 17:19:49 +00:00
Jonathan T. Looney
5d9bd45518 Improve hashing of IPv4 fragments.
Currently, IPv4 fragments are hashed into buckets based on a 32-bit
key which is calculated by (src_ip ^ ip_id) and combined with a random
seed. However, because an attacker can control the values of src_ip
and ip_id, it is possible to construct an attack which causes very
deep chains to form in a given bucket.

To ensure more uniform distribution (and lower predictability for
an attacker), calculate the hash based on a key which includes all
the fields we use to identify a reassembly queue (dst_ip, src_ip,
ip_id, and the ip protocol) as well as a random seed.
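
A minimal sketch of hashing over the full key is shown below. It uses a seeded FNV-1a as a stand-in hash so that the example is self-contained; this is an assumption for illustration and not the hash function used by the kernel. `struct frag_key` and `frag_bucket()` are hypothetical names.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* All fields that identify a reassembly queue. */
struct frag_key {
	uint32_t dst_ip;
	uint32_t src_ip;
	uint16_t ip_id;
	uint8_t proto;
	uint8_t pad;	/* kept zero so the hashed bytes are well defined */
};

/* Stand-in hash: FNV-1a with the random seed folded into the basis. */
static uint32_t
fnv1a_seeded(const void *buf, size_t len, uint32_t seed)
{
	const unsigned char *p = buf;
	uint32_t h = 2166136261u ^ seed;
	size_t i;

	for (i = 0; i < len; i++)
		h = (h ^ p[i]) * 16777619u;
	return (h);
}

/* Map a fragment to a bucket index in [0, nbuckets). */
static uint32_t
frag_bucket(uint32_t dst, uint32_t src, uint16_t id, uint8_t proto,
    uint32_t seed, uint32_t nbuckets)
{
	struct frag_key k;

	assert(nbuckets > 0);
	memset(&k, 0, sizeof(k));
	k.dst_ip = dst;
	k.src_ip = src;
	k.ip_id = id;
	k.proto = proto;
	return (fnv1a_seeded(&k, sizeof(k), seed) % nbuckets);
}
```

In contrast to a key of `src_ip ^ ip_id`, an attacker who varies `src_ip` and `ip_id` together can no longer cancel their contributions, and the seed keeps the resulting bucket unpredictable from outside.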

Reviewed by:	jhb
Security:	FreeBSD-SA-18:10.ip
Security:	CVE-2018-6923
2018-08-14 17:15:47 +00:00
Michael Tuexen
0f1346f7f4 Remove a set but not used warning showing up in usrsctp.
2018-08-14 08:32:33 +00:00
Andrey V. Elsukov
62484790e0 Restore ability to send ICMP and ICMPv6 redirects.
It was lost when tryforward appeared. Now ip[6]_tryforward will be enabled
only when sending redirects for the corresponding IP version is disabled via
sysctl. Otherwise the default forwarding function will be used.

PR:		221137
Submitted by:	mckay@
MFC after:	2 weeks
2018-08-14 07:54:14 +00:00
Michael Tuexen
839d21d62e Use the stcb instead of the asoc in state macros.
This is not a functional change. Just a preparation for upcoming
dtrace state change provider support.
2018-08-13 13:58:45 +00:00
Michael Tuexen
61a2188021 Use the macros consistently to modify the assoc state.
No functional change.
2018-08-13 11:56:21 +00:00
Michael Tuexen
812649d86f Add explicit cast to silence a warning for the userland stack.
Thanks to Felix Weinrank for providing the patch.
2018-08-12 14:05:15 +00:00
Devin Teske
ab9ed8a1bd Fix misspellings of transmitter/transmitted
Reviewed by:	emaste, bcr
Sponsored by:	Smule, Inc.
Differential Revision:	https://reviews.freebsd.org/D16025
2018-08-10 20:37:32 +00:00
Andrey V. Elsukov
16bbf600d9 Remove unneeded ipsec-related includes.
Reviewed by:	rrs
Differential Revision:	https://reviews.freebsd.org/D16637
2018-08-10 07:24:01 +00:00
Leandro Lupori
c8e2123b6a [ppc] Fix kernel panic when using BOOTP_NFSROOT
On PowerPC (and possibly other architectures) that don't use
EARLY_AP_STARTUP, the config task queue may be used uninitialized.
This was observed while trying to mount the root fs from NFS, as
reported here: https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=230168.

This patch has 2 main changes:
1- Perform a basic initialization of qgroup_config, similar to
what is done in taskqgroup_adjust, but simpler.
This makes qgroup_config ready to be used during NFS root mount.

2- When EARLY_AP_STARTUP is not used, call inm_init() and
in6m_init() right before SI_SUB_ROOT_CONF, because bootp needs
to send multicast packets to request an IP.

PR:		Bug 230168
Reported by:	sbruno
Reviewed by:	jhibbits, mmacy, sbruno
Approved by:	jhibbits
Differential Revision:	D16633
2018-08-09 14:04:51 +00:00
Randall Stewart
d18ea344e6 Fix a small bug in rack where it will
end up sending the FIN twice.
Sponsored by:	Netflix Inc.
Differential Revision:	https://reviews.freebsd.org/D16604
2018-08-08 13:36:49 +00:00
Jonathan T. Looney
95a914f631 Address concerns about CPU usage while doing TCP reassembly.
Currently, the per-queue limit is a function of the receive buffer
size and the MSS.  In certain cases (such as connections with large
receive buffers), the per-queue segment limit can be quite large.
Because we process segments as a linked list, large queues may not
perform acceptably.

The better long-term solution is to make the queue more efficient.
But, in the short-term, we can provide a way for a system
administrator to set the maximum queue size.

We set the default queue limit to 100.  This is an effort to balance
performance with a sane resource limit.  Depending on their
environment, goals, etc., an administrator may choose to modify this
limit in either direction.

Reviewed by:	jhb
Approved by:	so
Security:	FreeBSD-SA-18:08.tcp
Security:	CVE-2018-6922
2018-08-06 17:36:57 +00:00
Randall Stewart
936b2b64ae This fixes a bug in Rack where we were
not properly using the correct value for
Delayed Ack.

Sponsored by:	Netflix Inc.
Differential Revision: https://reviews.freebsd.org/D16579
2018-08-06 09:22:07 +00:00