6109 Commits

Author SHA1 Message Date
tuexen
5ec78c1aa0 Improve consistency, not functional change.
MFC after:		3 days
2019-03-01 15:57:55 +00:00
jhb
1498304f92 Various cleanups to the management of multiple TCP stacks.
- Use strlcpy() with sizeof() instead of strncpy().

- Simplify initialization of TCP functions structures.

  init_tcp_functions() was already called before the first call to
  register a stack.  Just inline the work in the SYSINIT and remove
  the racy helper variable.  Instead, KASSERT that the rw lock is
  initialized when registering a stack.

- Protect the default stack via a direct pointer comparison.

  The default stack uses the name "freebsd" instead of "default" so
  this protection wasn't working for the default stack anyway.

Reviewed by:	rrs
Sponsored by:	Netflix
Differential Revision:	https://reviews.freebsd.org/D19152
2019-02-27 20:24:23 +00:00
bz
0d49e52bdf Make arp code return (more) errors.
arprequest() is a void function and in case of error we simply
return without any feedback. In case of any local operation
or *if_output() failing no feedback is send up the stack for the
packet which triggered the arp request to be sent.
arpresolve_full() has three pre-canned possible errors returned
(if we have not yet sent enough arp requests or if we tried
often enough without success) otherwise "no error" is returned.

Make arprequest() an "internal" function arprequest_internal() which
does return a possible error to the caller. Preserve arprequest()
as a void wrapper function for external consumers.
In arpresolve_full() add an extra error checking. Use the
arprequest_internal() function and only return an error if non
of the three ones (mentioend above) are already set.

This will return possible errors all the way up the stack and
allows functions and programs to react on the send errors rather
than leaving them in the dark. Also they might get more detailed
feedback of why packets cannot be sent and they will receive it
quicker.

Reviewed by:		karels, hselasky
Differential Revision:	https://reviews.freebsd.org/D18904
2019-02-24 22:49:56 +00:00
glebius
13a1011c10 Support struct ip_mreqn as argument for IP_ADD_MEMBERSHIP. Legacy support
for struct ip_mreq remains in place.

The struct ip_mreqn is Linux extension to classic BSD multicast API. It
has extra field allowing to specify the interface index explicitly. In
Linux it used as argument for IP_MULTICAST_IF and IP_ADD_MEMBERSHIP.
FreeBSD kernel also declares this structure and supports it as argument
to IP_MULTICAST_IF since r170613. So, we have structure declared but
not fully supported, this confused third party application configure
scripts.

Code handling IP_ADD_MEMBERSHIP was mixed together with code for
IP_ADD_SOURCE_MEMBERSHIP.  Bringing legacy and new structure support
into the mess would made the "argument switcharoo" intolerable, so
code was separated into its own switch case clause.

MFC after:	3 months
Differential Revision:	https://reviews.freebsd.org/D19276
2019-02-23 06:03:18 +00:00
tuexen
e763f31429 The receive buffer autoscaling for TCP is based on a linear growth, which
is acceptable in the congestion avoidance phase, but not during slow start.
The MTU is is also not taken into account.
Use a method instead, which is based on exponential growth working also in
slow start and being independent from the MTU.

This is joint work with rrs@.

Reviewed by:		rrs@, Richard Scheffenegger
Sponsored by:		Netflix, Inc.
Differential Revision:	https://reviews.freebsd.org/D18375
2019-02-21 10:35:32 +00:00
tuexen
192c95e996 This patch addresses an issue brought up by bz@ in D18968:
When TCP_REASS_LOGGING is defined, a NULL pointer dereference would happen,
if user data was received during the TCP handshake and BB logging is used.

A KASSERT is also added to detect tcp_reass() calls with illegal parameter
combinations.

Reported by:		bz@
Reviewed by:		rrs@
MFC after:		3 days
Sponsored by:		Netflix, Inc.
Differential Revision:	https://reviews.freebsd.org/D19254
2019-02-21 09:34:47 +00:00
tuexen
87f2a8bca4 Reduce the TCP initial retransmission timeout from 3 seconds to
1 second as allowed by RFC 6298.

Reviewed by:		kbowling@, Richard Scheffenegger
Sponsored by:		Netflix, Inc.
Differential Revision:	https://reviews.freebsd.org/D18941
2019-02-20 18:03:43 +00:00
tuexen
796133921a Use exponential backoff for retransmitting SYN segments as specified
in the TCP RFCs.

Reviewed by:		rrs@, Richard Scheffenegger
Sponsored by:		Netflix, Inc.
Differential Revision:	https://reviews.freebsd.org/D18974
2019-02-20 17:56:38 +00:00
tuexen
0c4eb6ecc6 Fix a byte ordering issue for the advertised receiver window in ACK
segments sent in TIMEWAIT state, which I introduced in r336937.

MFC after:	3 days
Sponsored by:	Netflix, Inc.
2019-02-15 09:45:17 +00:00
ae
50a8601b2e In r335015 PCB destroing was made deferred using epoch_call().
But ipsec_delete_pcbpolicy() uses some VNET-virtualized variables,
and thus it needs VNET context, that is missing during gtaskqueue
executing. Use inp_vnet context to set curvnet in in_pcbfree_deferred().

PR:		235684
MFC after:	1 week
2019-02-13 15:46:05 +00:00
kp
8e2074c2e1 garp: Fix vnet related panic for gratuitous arp
Gratuitous ARP packets are sent from a timer, which means we don't have a vnet
context set. As a result we panic trying to send the packet.

Set the vnet context based on the interface associated with the interface
address.

To reproduce:
  sysctl net.link.ether.inet.garp_rexmit_count=2
  ifconfig vtnet1 10.0.0.1/24 up

PR:		235699
Reviewed by:	vangyzen@
MFC after:	1 week
2019-02-12 21:22:57 +00:00
tuexen
6903958c2a Improve input validation for raw IPv4 socket using the IP_HDRINCL
option.

This issue was found by running syzkaller on OpenBSD.
Greg Steuck made me aware that the problem might also exist on FreeBSD.

Reported by:		Greg Steuck
MFC after:		1 month
Differential Revision:	https://reviews.freebsd.org/D18834
2019-02-12 10:17:21 +00:00
tuexen
b80fcf68dd Fix a locking issue when reporing outbount messages.
MFC after:		3 days
2019-02-10 14:02:14 +00:00
tuexen
eee3d66791 Fix a locking issue in the IPPROTO_SCTP level SCTP_PEER_ADDR_THLDS socket
option. The problem affects only setsockopt with invalid parameters.

This issue was found by syzkaller.

MFC after:		3 days
2019-02-10 13:55:32 +00:00
tuexen
fb17e65b4c Fix a locking bug in the IPPROTO_SCTP level SCTP_EVENT socket option.
This occurs when call setsockopt() with invalid parameters.

This issue was found by syzkaller.

MFC after:		3 days
2019-02-10 10:42:16 +00:00
tuexen
b30530d5f6 Fix locking for IPPROTO_SCTP level SCTP_DEFAULT_PRINFO socket option.
This problem occurred when calling setsockopt() will invalid parameters.

This issue was found by running syzkaller.

MFC after:		3 days
2019-02-10 08:28:56 +00:00
tuexen
57ec47006e Ensure that when using the TCP CDG congestion control and setting the
sysctl variable net.inet.tcp.cc.cdg.smoothing_factor to 0, the smoothing
is disabled. Without this patch, a division by zero orrurs.

PR:			193762
Reviewed by:		lstewart@, rrs@
MFC after:		3 days
Sponsored by:		Netflix, Inc.
Differential Revision:	https://reviews.freebsd.org/D19071
2019-02-08 20:42:49 +00:00
tuexen
c6fb952de9 Only reduce the PMTU after the send call. The only way to increase it, is
via PMTUD.

This fixes an MTU issue reported by Timo Voelker.

MFC after:		3 days
2019-02-05 10:29:31 +00:00
tuexen
78654493eb Fix an off-by-one error in the input validation of the SCTP_RESET_STREAMS
socketoption.

This was found by running syzkaller.

MFC after:		3 days
2019-02-05 10:13:51 +00:00
imp
82650adfef Regularize the Netflix copyright
Use recent best practices for Copyright form at the top of
the license:
1. Remove all the All Rights Reserved clauses on our stuff. Where we
   piggybacked others, use a separate line to make things clear.
2. Use "Netflix, Inc." everywhere.
3. Use a single line for the copyright for grep friendliness.
4. Use date ranges in all places for our stuff.

Approved by: Netflix Legal (who gave me the form), adrian@ (pmc files)
2019-02-04 21:28:25 +00:00
tuexen
46bca47606 When handling SYN-ACK segments in the SYN-RCVD state, set tp->snd_wnd
consistently.

This inconsistency was observed when working on the bug reported in
PR 235256, although it does not fix the reported issue. The fix for
the PR will be a separate commit.

PR:			235256
Reviewed by:		rrs@, Richard Scheffenegger
MFC after:		3 days
Sponsored by:		Netflix, Inc.
Differential Revision:	https://reviews.freebsd.org/D19033
2019-02-01 12:33:00 +00:00
glebius
711fa71dfe Repair siftr(4): PFIL_IN and PFIL_OUT are defines of some value, relying
on them having particular values can break things.
2019-02-01 08:10:26 +00:00
glebius
9978a7d924 New pfil(9) KPI together with newborn pfil API and control utility.
The KPI have been reviewed and cleansed of features that were planned
back 20 years ago and never implemented.  The pfil(9) internals have
been made opaque to protocols with only returned types and function
declarations exposed. The KPI is made more strict, but at the same time
more extensible, as kernel uses same command structures that userland
ioctl uses.

In nutshell [KA]PI is about declaring filtering points, declaring
filters and linking and unlinking them together.

New [KA]PI makes it possible to reconfigure pfil(9) configuration:
change order of hooks, rehook filter from one filtering point to a
different one, disconnect a hook on output leaving it on input only,
prepend/append a filter to existing list of filters.

Now it possible for a single packet filter to provide multiple rulesets
that may be linked to different points. Think of per-interface ACLs in
Cisco or Juniper. None of existing packet filters yet support that,
however limited usage is already possible, e.g. default ruleset can
be moved to single interface, as soon as interface would pride their
filtering points.

Another future feature is possiblity to create pfil heads, that provide
not an mbuf pointer but just a memory pointer with length. That would
allow filtering at very early stages of a packet lifecycle, e.g. when
packet has just been received by a NIC and no mbuf was yet allocated.

Differential Revision:	https://reviews.freebsd.org/D18951
2019-01-31 23:01:03 +00:00
brooks
b686fa9ec4 Add a simple port filter to SIFTR.
SIFTR does not allow any kind of filtering, but captures every packet
processed by the TCP stack.
Often, only a specific session or service is of interest, and doing the
filtering in post-processing of the log adds to the overhead of SIFTR.

This adds a new sysctl net.inet.siftr.port_filter. When set to zero, all
packets get captured as previously. If set to any other value, only
packets where either the source or the destination ports match, are
captured in the log file.

Submitted by:	Richard Scheffenegger
Reviewed by:	Cheng Cui
Differential Revision:	https://reviews.freebsd.org/D18897
2019-01-30 17:44:30 +00:00
tuexen
7c894e3728 Fix the detection of ECN-setup SYN-ACK packets.
RFC 3168 defines an ECN-setup SYN-ACK packet as on with the ECE flags
set and the CWR flags not set. The code was only checking if ECE flag
is set. This patch adds the check to verify that the CWR flags is not
set.

Submitted by:		Richard Scheffenegger
Reviewed by:		tuexen@
MFC after:		1 week
Differential Revision:	https://reviews.freebsd.org/D18996
2019-01-28 12:45:31 +00:00
tuexen
98afe6eb13 Don't include two header files when not needed.
This allows the part of the rewrite of TCP reassembly in this
files to be MFCed to stable/11 with manual change.

MFC after:		3 days
Sponsored by:		Netflix, Inc.
2019-01-25 17:08:28 +00:00
tuexen
8e4ad3b84a Fix a bug in the restart window computation of TCP New Reno
When implementing support for IW10, an update in the computation
of the restart window used after an idle phase was missed. To
minimize code duplication, implement the logic in tcp_compute_initwnd()
and call it. This fixes a bug in NewReno, which was not aware of
IW10.

Submitted by:		Richard Scheffenegger
Reviewed by:		tuexen@
MFC after:		1 week
Differential Revision:	https://reviews.freebsd.org/D18940
2019-01-25 13:57:09 +00:00
tuexen
cbb1f242e8 Get the arithmetic right...
MFC after:		3 days
Sponsored by:		Netflix, Inc.
2019-01-24 16:47:18 +00:00
tuexen
347de7d146 Kill a trailing whitespace character...
MFC after:		3 days
Sponsored by:		Netflix, Inc.
2019-01-24 16:43:13 +00:00
tuexen
7521f0c094 Update a comment to reflect the current reality.
SYN-cache entries live for abaut 12 seconds, not 45, when default
setting are used.

MFC after:		1 week
Sponsored by:		Netflix, Inc.
2019-01-24 16:40:14 +00:00
markj
acda75d443 Style.
Reviewed by:	bz
MFC after:	3 days
Sponsored by:	The FreeBSD Foundation
2019-01-23 22:19:49 +00:00
markj
af69719726 Fix an LLE lookup race.
After the afdata read lock was converted to epoch(9), readers could
observe a linked LLE and block on the LLE while a thread was
unlinking the LLE.  The writer would then release the lock and schedule
the LLE for deferred free, allowing readers to continue and potentially
schedule the LLE timer.  By the point the timer fires, the structure is
freed, typically resulting in a crash in the callout subsystem.

Fix the problem by modifying the lookup path to check for the LLE_LINKED
flag upon acquiring the LLE lock.  If it's not set, the lookup fails.

PR:		234296
Reviewed by:	bz
Tested by:	sbruno, Victor <chernov_victor@list.ru>,
		Mike Andrews <mandrews@bit0.com>
MFC after:	3 days
Sponsored by:	The FreeBSD Foundation
Differential Revision:	https://reviews.freebsd.org/D18906
2019-01-23 22:18:23 +00:00
brooks
8ba0807050 Make SIFTR work again after r342125 (D18443).
Correct a logic error.

Only disable when already enabled or enable when disabled.

Submitted by:	Richard Scheffenegger
Reviewed by:	Cheng Cui
Obtained from:	Cheng Cui
MFC after:	3 days
Differential Revision:	https://reviews.freebsd.org/D18885
2019-01-18 21:46:38 +00:00
tuexen
eb82fa876e Limit the user-controllable amount of memory the kernel allocates
via IPPROTO_SCTP level socket options.

This issue was found by running syzkaller.

MFC after:	1 week
2019-01-16 11:33:47 +00:00
shurd
1b9893d7d7 Fix window update issue when scaling disabled
When the TCP window scale option is not used, and the window
opens up enough in one soreceive, a window update will not be sent.

For example, if recwin == 65535, so->so_rcv.sb_hiwat >= 262144, and
so->so_rcv.sb_hiwat <= 524272, the window update will never be sent.
This is because recwin and adv are clamped to TCP_MAXWIN << tp->rcv_scale,
and so will never be >= so->so_rcv.sb_hiwat / 4
or <= so->so_rcv.sb_hiwat / 8.

This patch ensures a window update is sent if the window opens by
TCP_MAXWIN << tp->rcv_scale, which should only happen when the window
size goes from zero to the max expressible.

This issue looks like it was introduced in r306769 when recwin was clamped
to TCP_MAXWIN << tp->rcv_scale.

MFC after:	1 week
Sponsored by:	Limelight Networks
Differential Revision:	https://reviews.freebsd.org/D18821
2019-01-15 17:40:19 +00:00
tuexen
87ee738236 Fix getsockopt() for IP_OPTIONS/IP_RETOPTS.
r336616 copies inp->inp_options using the m_dup() function.
However, this function expects an mbuf packet header at the beginning,
which is not true in this case.
Therefore, use m_copym() instead of m_dup().

This issue was found by syzkaller.
Reviewed by:		mmacy@
MFC after:		1 week
Sponsored by:		Netflix, Inc.
Differential Revision:	https://reviews.freebsd.org/D18753
2019-01-09 06:36:57 +00:00
glebius
6d8cc191f9 Mechanical cleanup of epoch(9) usage in network stack.
- Remove macros that covertly create epoch_tracker on thread stack. Such
  macros a quite unsafe, e.g. will produce a buggy code if same macro is
  used in embedded scopes. Explicitly declare epoch_tracker always.

- Unmask interface list IFNET_RLOCK_NOSLEEP(), interface address list
  IF_ADDR_RLOCK() and interface AF specific data IF_AFDATA_RLOCK() read
  locking macros to what they actually are - the net_epoch.
  Keeping them as is is very misleading. They all are named FOO_RLOCK(),
  while they no longer have lock semantics. Now they allow recursion and
  what's more important they now no longer guarantee protection against
  their companion WLOCK macros.
  Note: INP_HASH_RLOCK() has same problems, but not touched by this commit.

This is non functional mechanical change. The only functionally changed
functions are ni6_addrs() and ni6_store_addrs(), where we no longer enter
epoch recursively.

Discussed with:	jtl, gallatin
2019-01-09 01:11:19 +00:00
markj
85037a09f3 Support MSG_DONTWAIT in send*(2).
As it does for recv*(2), MSG_DONTWAIT indicates that the call should
not block, returning EAGAIN instead.  Linux and OpenBSD both implement
this, so the change makes porting easier, especially since we do not
return EINVAL or so when unrecognized flags are specified.

Submitted by:	Greg V <greg@unrelenting.technology>
Reviewed by:	tuexen
MFC after:	1 week
Differential Revision:	https://reviews.freebsd.org/D18728
2019-01-04 17:31:50 +00:00
tuexen
c5b096f7c9 Fix a regression in the TCP handling of received segments.
When receiving TCP segments the stack protects itself by limiting
the resources allocated for a TCP connections. This patch adds
an exception to these limitations for the TCP segement which is the next
expected in-sequence segment. Without this patch, TCP connections
may stall and finally fail in some cases of packet loss.

Reported by:		jhb@
Reviewed by:		jtl@, rrs@
MFC after:		3 days
Sponsored by:		Netflix, Inc.
Differential Revision:	https://reviews.freebsd.org/D18580
2018-12-20 16:05:30 +00:00
hiren
3dc78ca62f Revert r331567 CC Cubic: fix underflow for cubic_cwnd()
This change is causing TCP connections using cubic to hang. Need to dig more to
find exact cause and fix it.

Reported by:	tj at mrsk dot me, Matt Garber (via twitter)
Discussed with:	sbruno (previously), allanjude, cperciva
MFC after:	3 days
2018-12-15 17:01:16 +00:00
brooks
acaa8abf5c Fix bugs in plugable CC algorithm and siftr sysctls.
Use the sysctl_handle_int() handler to write out the old value and read
the new value into a temporary variable. Use the temporary variable
for any checks of values rather than using the CAST_PTR_INT() macro on
req->newptr. The prior usage read directly from userspace memory if the
sysctl() was called correctly. This is unsafe and doesn't work at all on
some architectures (at least i386.)

In some cases, the code could also be tricked into reading from kernel
memory and leaking limited information about the contents or crashing
the system. This was true for CDG, newreno, and siftr on all platforms
and true for i386 in all cases. The impact of this bug is largest in
VIMAGE jails which have been configured to allow writing to these
sysctls.

Per discussion with the security officer, we will not be issuing an
advisory for this issue as root access and a non-default config are
required to be impacted.

Reviewed by:	markj, bz
Discussed with:	gordon (security officer)
MFC after:	3 days
Security:	kernel information leak, local DoS (both require root)
Differential Revision:	https://reviews.freebsd.org/D18443
2018-12-15 15:06:22 +00:00
mjg
7e31d1de7e Remove unused argument to priv_check_cred.
Patch mostly generated with cocinnelle:

@@
expression E1,E2;
@@

- priv_check_cred(E1,E2,0)
+ priv_check_cred(E1,E2)

Sponsored by:	The FreeBSD Foundation
2018-12-11 19:32:16 +00:00
markj
a74ba1d431 Clamp the INPCB port hash tables to IPPORT_MAX + 1 chains.
Memory beyond that limit was previously unused, wasting roughly 1MB per
8GB of RAM.  Also retire INP_PCBLBGROUP_PORTHASH, which was identical to
INP_PCBPORTHASH.

Reviewed by:	glebius
MFC after:	2 weeks
Differential Revision:	https://reviews.freebsd.org/D17803
2018-12-05 17:06:00 +00:00
ae
0d01acf0ac Add ability to request listing and deleting only for dynamic states.
This can be useful, when net.inet.ip.fw.dyn_keep_states is enabled, but
after rules reloading some state must be deleted. Added new flag '-D'
for such purpose.

Retire '-e' flag, since there can not be expired states in the meaning
that this flag historically had.

Also add "verbose" mode for listing of dynamic states, it can be enabled
with '-v' flag and adds additional information to states list. This can
be useful for debugging.

Obtained from:	Yandex LLC
MFC after:	2 months
Sponsored by:	Yandex LLC
2018-12-04 16:12:43 +00:00
tuexen
2da835a4bb Limit option_len for the TCP_CCALGOOPT.
Limiting the length to 2048 bytes seems to be acceptable, since
the values used right now are using 8 bytes.

Reviewed by:		glebius, bz, rrs
MFC after:		3 days
Sponsored by:		Netflix, Inc.
Differential Revision:	https://reviews.freebsd.org/D18366
2018-11-30 10:50:07 +00:00
markj
5c563658ea Plug some networking sysctl leaks.
Various network protocol sysctl handlers were not zero-filling their
output buffers and thus would export uninitialized stack memory to
userland.  Fix a number of such handlers.

Reported by:	Thomas Barabosch, Fraunhofer FKIE
Reviewed by:	tuexen
MFC after:	3 days
Security:	kernel memory disclosure
Sponsored by:	The FreeBSD Foundation
Differential Revision:	https://reviews.freebsd.org/D18301
2018-11-22 20:49:41 +00:00
tuexen
a02b4525ca A TCP stack is required to check SEG.ACK first, when processing a
segment in the SYN-SENT state as stated in Section 3.9 of RFC 793,
page 66. Ensure this is also done by the TCP RACK stack.

Reviewed by:		rrs@
MFC after:		1 week
Sponsored by:		Netflix, Inc.
Differential Revision:	https://reviews.freebsd.org/D18034
2018-11-22 20:05:57 +00:00
tuexen
82210e189d Ensure that the TCP RACK stack honours the setting of the
net.inet.tcp.drop_synfin sysctl-variable.

Reviewed by:		rrs@
MFC after:		1 week
Sponsored by:		Netflix, Inc.
Differential Revision:	https://reviews.freebsd.org/D18033
2018-11-22 20:02:39 +00:00
tuexen
e4a2b60c79 Ensure that the default RTT stack can make an RTT measurement if
the TCP connection was initiated using the RACK stack, but the
peer does not support the TCP RACK extension.

This ensures that the TCP behaviour on the wire is the same if
the TCP connection is initated using the RACK stack or the default
stack.

Reviewed by:		rrs@
MFC after:		1 week
Sponsored by:		Netflix, Inc.
Differential Revision:	https://reviews.freebsd.org/D18032
2018-11-22 19:56:52 +00:00
tuexen
3ece71ca83 Ensure that TCP RST-segments announce consistently a receiver window of
zero. This was already done when sending them via tcp_respond().

Reviewed by:		rrs@
MFC after:		1 week
Sponsored by:		Netflix, Inc.
Differential Revision:	https://reviews.freebsd.org/D17949
2018-11-22 19:49:52 +00:00