specification for the comparisons made.
Thanks to lstewart@ for the suggestion.
MFC after: 4 weeks
Sponsored by: Netflix, Inc.
Differential Revision: https://reviews.freebsd.org/D17595
This change defines the RA "6" (IPv6-Only) flag which routers
may advertise, kernel logic to check if all routers on a link
have the flag set and accordingly update a per-interface flag.
If all routers agree that it is an IPv6-only link, ether_output_frame(),
based on the interface flag, will filter out all ETHERTYPE_IP/ARP
frames, drop them, and return EAFNOSUPPORT to upper layers.
The change also updates ndp to show the "6" flag, ifconfig to
display the IPV6_ONLY nd6 flag if set, and rtadvd to allow
announcing the flag.
Further changes to tcpdump (contrib code) are availble and will
be upstreamed.
Tested the code (slightly earlier version) with 2 FreeBSD
IPv6 routers, a FreeBSD laptop on ethernet as well as wifi,
and with Win10 and OSX clients (which did not fall over with
the "6" flag set but not understood).
We may also want to (a) implement and RX filter, and (b) over
time enahnce user space to, say, stop dhclient from running
when the interface flag is set. Also we might want to start
IPv6 before IPv4 in the future.
All the code is hidden under the EXPERIMENTAL option and not
compiled by default as the draft is a work-in-progress and
we cannot rely on the fact that IANA will assign the bits
as requested by the draft and hence they may change.
Dear 6man, you have running code.
Discussed with: Bob Hinden, Brian E Carpenter
This change is similar to r339646. The callback that checks for appearing
and disappearing of tunnel ingress address can be called during VNET
teardown. To prevent access to already freed memory, add check to the
callback and epoch_wait() call to be sure that callback has finished its
work.
MFC after: 20 days
- Add a blank line before a block comment to match other block comments
in the same function.
- Sort the prototype for sbsndptr_adv and fix whitespace between return
type and function name.
Reviewed by: gallatin, bz
Differential Revision: https://reviews.freebsd.org/D17474
Currently, icmp_error() function copies FIB number from original packet
into generated ICMP response but not mbuf_tags(9) chain.
This prevents us from easily matching ICMP responses corresponding
to tagged original packets by means of packet filter such as ipfw(8).
For example, ICMP "time-exceeded in-transit" packets usually generated
in response to traceroute probes lose tags attached to original packets.
This change adds new sysctl net.inet.icmp.error_keeptags
that defaults to 0 to avoid extra overhead when this feature not needed.
Set net.inet.icmp.error_keeptags=1 to make icmp_error() copy mbuf_tags
from original packet to generated ICMP response.
PR: 215874
MFC after: 1 month
* register handler for ingress address appearing/disappearing;
* add new srcaddr hash table for fast softc lookup by srcaddr;
* when srcaddr disappears, clear IFF_DRV_RUNNING flag from interface,
and set it otherwise;
MFC after: 1 month
Sponsored by: Yandex LLC
Differential Revision: https://reviews.freebsd.org/D17214
* register handler for ingress address appearing/disappearing;
* add new srcaddr hash table for fast softc lookup by srcaddr;
* when srcaddr disappears, clear IFF_DRV_RUNNING flag from interface,
and set it otherwise;
* remove the note about ingress address from BUGS section.
MFC after: 1 month
Sponsored by: Yandex LLC
Differential Revision: https://reviews.freebsd.org/D17134
appearing and disappearing on the host system.
Such handling is need, because tunneling interfaces must use addresses,
that are configured on the host as ingress addresses for tunnels.
Otherwise the system can send spoofed packets with source address, that
belongs to foreign host.
The KPI uses ifaddr_event_ext event to implement addresses tracking.
Tunneling interfaces register event handlers and then they are
notified by the kernel, when an address disappears or appears.
ifaddr_event_compat() handler from if.c replaced by srcaddr_change_event()
in the ip_encap.c
MFC after: 1 month
Sponsored by: Yandex LLC
Differential Revision: https://reviews.freebsd.org/D17134
that was added using "new rule format". And then, when the kernel
returns rule with this flag, ipfw(8) can correctly show it.
Reported by: lev
MFC after: 3 weeks
Sponsored by: Yandex LLC
Differential Revision: https://reviews.freebsd.org/D17373
handler receives the type of event IFADDR_EVENT_ADD/IFADDR_EVENT_DEL,
and the pointer to ifaddr. Also ifaddr_event now is implemented using
ifaddr_event_ext handler.
MFC after: 3 weeks
Sponsored by: Yandex LLC
Differential Revision: https://reviews.freebsd.org/D17100
code paths. Both are not consistent and the one on the syn cache code
does not conform to the relevant specifications (Page 69 of RFC 793
and Section 4.2 of RFC 5961).
This patch fixes this:
* The sequence numbers checks are fixed as specified on
page Page 69 RFC 793.
* The sysctl variable net.inet.tcp.insecure_rst is now honoured
and the behaviour as specified in Section 4.2 of RFC 5961.
Approved by: re (gjb@)
Reviewed by: bz@, glebius@, rrs@,
Differential Revision: https://reviews.freebsd.org/D17595
Sponsored by: Netflix, Inc.
to this change, the code sometimes used a temporary stack variable to hold
details of a TCP segment. r338102 stopped using the variable to hold
segments, but did not actually remove the variable.
Because the variable is no longer used, we can safely remove it.
Approved by: re (gjb)
an inp marked FREED after the epoch(9) changes.
Check once we hold the lock and skip the inp if it is the case.
Contrary to IPv6 the locking of the inp is outside the multicast
section and hence a single check seems to suffice.
PR: 232192
Reviewed by: mmacy, markj
Approved by: re (kib)
Differential Revision: https://reviews.freebsd.org/D17540
but leaving the variable assignment outside the block, where it is no longer
used. Move both the variable and the assignment one block further in.
This should result in no functional changes. It will however make upcoming
changes slightly easier to apply.
Reviewed by: markj, jtl, tuexen
Approved by: re (kib)
Differential Revision: https://reviews.freebsd.org/D17525
epoch section without exiting that epoch section. This is bad for two
reasons: the epoch section won't exit, and we will leave the epoch tracker
from the stack on the epoch list.
Fix the epoch leak by making sure we exit epoch sections before returning.
Reviewed by: ae, gallatin, mmacy
Approved by: re (gjb, kib)
Sponsored by: Netflix
Differential Revision: https://reviews.freebsd.org/D17450
locally generated SCTP packets sent over IPv4. This make
the behaviour consistent with IPv6.
Reviewed by: ae@, bz@, jtl@
Approved by: re (kib@)
MFC after: 1 week
Differential Revision: https://reviews.freebsd.org/D17406
When getting the number of bytes to checksum make sure to convert the UDP
length to host byte order when the entire header is not in the first mbuf.
Reviewed by: jtl, tuexen, ae
Approved by: re (gjb), jtl (mentor)
Differential Revision: https://reviews.freebsd.org/D17357
With the new route cache feature udp_notify() will modify the inp when it
needs to invalidate the route cache. Ensure that we hold a write lock on
the inp before calling the function to ensure that multiple threads don't
race while trying to invalidate the cache (which previously lead to a page
fault).
Differential Revision: https://reviews.freebsd.org/D17246
Reviewed by: sbruno, bz, karels
Sponsored by: Dell EMC Isilon
Approved by: re (gjb)
This removes two assignments for the flags field being done
twice and adds one, which was missing.
Thanks to Felix Weinrank for reporting the issue he found
by using fuzz testing of the userland stack.
Approved by: re (kib@)
MFC after: 1 week
INP_INFO_UNLOCK_ASSERT() in TCP-related code. For encapsulated traffic
it is possible, that the code is running in net_epoch_preempt section,
and INP_INFO_UNLOCK_ASSERT() is very strict assertion for such case.
PR: 231428
Reviewed by: mmacy, tuexen
Approved by: re (kib)
Differential Revision: https://reviews.freebsd.org/D17335
sctp_process_cmsgs_for_init() and sctp_findassociation_cmsgs()
similar to sctp_find_cmsg() to improve consistency and avoid
the signed/unsigned issues in sctp_process_cmsgs_for_init()
and sctp_findassociation_cmsgs().
Thanks to andrew@ for reporting the problem he found using
syzcaller.
Approved by: re (kib@)
MFC after: 1 week
sending UDP encapsulated SCTP packets.
This is consistent with the behaviour that when such packets are received,
the corresponding UDP stats counter (udps_ipackets) is incremented.
Thanks to Peter Lei for making me aware of this inconsistency.
Approved by: re (kib@)
MFC after: 1 week
syncache_respond(). There is no functional change. The
parameter became unused in r313330, but wasn't removed.
Approved by: re (kib@)
MFC after: 1 month
Sponsored by: Netflix, Inc.
It is currently unused and reserved for future use to keep KBI/KPI.
Also add several spare pointers to be able extend structure if it
will be needed.
Approved by: re (gjb)
* Fix a bug where the SYN handling during established state was
applied to a front state.
* Move a check for retransmission after the timer handling.
This was suppressing timer based retransmissions.
* Fix an off-by one byte in the sequence number of retransmissions.
* Apply fixes corresponding to
https://svnweb.freebsd.org/changeset/base/336934
Reviewed by: rrs@
Approved by: re (kib@)
MFC after: 1 month
Sponsored by: Netflix, Inc.
Differential Revision: https://reviews.freebsd.org/D16912
Lookups are protected by an epoch section, so the LB group linkage must
be a CK_LIST rather than a plain LIST. Furthermore, we were not
deferring LB group frees, so in_pcbremlbgrouphash() could race with
readers and cause a use-after-free.
Reviewed by: sbruno, Johannes Lundberg <johalun0@gmail.com>
Tested by: gallatin
Approved by: re (gjb)
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D17031
Reviewed by: bz, Johannes Lundberg <johalun0@gmail.com>
Approved by: re (kib)
Sponsored by: The FreeBSD Foundation
Differential revision: https://reviews.freebsd.org/D17065
for the rt and lle cache were added in r191129 (2009).
To my best knowledge they have never been used and route caching
has converted the inp_rt field from that commit to inp_route
rendering this field and these flags obsolete.
Convert the pointer into a spare pointer to not change the size of
the structure anymore (and to have a spare pointer) and mark the
two fields as unused.
Reviewed by: markj, karels
Approved by: re (gjb)
Differential Revision: https://reviews.freebsd.org/D17062
adding the missing include files and changing a the type of cpuid which
would otherwise cause a false comparison with NETISR_CPUID_NONE.
Reviewed by: rrs
Approved by: re (marius)
Differential Revision: https://reviews.freebsd.org/D16891
Otherwise the "depends_on provider" guard in sctp.d does not work as
intended.
Reported by: mjg
Reviewed by: tuexen
Approved by: re (gjb)
Differential Revision: https://reviews.freebsd.org/D17057
No functional change intended.
Reviewed by: bz, Johannes Lundberg <johalun0@gmail.com>
Approved by: re (rgrimes)
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D17030
fast forwarding path, as it already works for IPv6 and for both of them
on old slow path.
PR: 231143
Reviewed by: ae
Approved by: re (gjb)
MFC after: 1 week
Differential Revision: https://reviews.freebsd.org/D17039
use sizeof() or explicit #definesi instead. No functional change.
This was suggested by jmg@.
MFC after: 1 month
XMFC with: r338053
Sponsored by: Netflix, Inc.
SCTP. They are based on what is specified in the Solaris DTrace manual
for Solaris 11.4.
Reviewed by: 0mp, dteske, markj
Relnotes: yes
Differential Revision: https://reviews.freebsd.org/D16839
socket resulted in sending fragmented IPV6 packets.
This is fixes by reducing the MSS to the appropriate value. In addtion,
if the socket option is set before the handshake happens, announce this
MSS to the peer. This is not stricly required, but done since TCP
is conservative.
PR: 173444
Reviewed by: bz@, rrs@
MFC after: 1 month
Sponsored by: Netflix, Inc.
Differential Revision: https://reviews.freebsd.org/D16796
This was broken for IPv6 listening socket, which are not IPV6_ONLY,
and the accepted TCP connection was using IPv4.
Reviewed by: bz@, rrs@
MFC after: 1 month
Sponsored by: Netflix, Inc.
Differential Revision: https://reviews.freebsd.org/D16792
This is not a functional change but a preperation for the upcoming
DTrace support. It is necessary to change the state in one
logical operation, even if it involves clearing the sub state
SHUTDOWN_PENDING.
MFC after: 1 month
reassembly inbound tcp segments. The old algorithm just blindly
dropped in segments without coalescing. This meant that every
segment could take up greater and greater room on the linked list
of segments. This of course is now subject to a tighter limit (100)
of segments which in a high BDP situation will cause us to be a
lot more in-efficent as we drop segments beyond 100 entries that
we receive. What this restructure does is cause the reassembly
buffer to coalesce segments putting an emphasis on the two
common cases (which avoid walking the list of segments) i.e.
where we add to the back of the queue of segments and where we
add to the front. We also have the reassembly buffer supporting
a couple of debug options (black box logging as well as counters
for code coverage). These are compiled out by default but can
be added by uncommenting the defines.
Sponsored by: Netflix Inc.
Differential Revision: https://reviews.freebsd.org/D16626
The TCP client side or the TCP server side when not using SYN-cookies
used the uptime as the TCP timestamp value. This patch uses in all
cases an offset, which is the result of a keyed hash function taking
the source and destination addresses and port numbers into account.
The keyed hash function is the same a used for the initial TSN.
Reviewed by: rrs@
MFC after: 1 month
Sponsored by: Netflix, Inc.
Differential Revision: https://reviews.freebsd.org/D16636
toe_l2_resolve to fill up the complete vtag and not just the vid.
Reviewed by: kib@
MFC after: 1 week
Sponsored by: Chelsio Communications
Differential Revision: https://reviews.freebsd.org/D16752
This is actually several different bugs:
- The code is not designed to handle inpcb deletion after interface deletion
- add reference for inpcb membership
- The multicast address has to be removed from interface lists when the refcount
goes to zero OR when the interface goes away
- decouple list disconnect from refcount (v6 only for now)
- ifmultiaddr can exist past being on interface lists
- add flag for tracking whether or not it's enqueued
- deferring freeing moptions makes the incpb cleanup code simpler but opens the
door wider still to races
- call inp_gcmoptions synchronously after dropping the the inpcb lock
Fundamentally multicast needs a rewrite - but keep applying band-aids for now.
Tested by: kp
Reported by: novel, kp, lwhsu
In particular, try to ensure that no bucket will have a reassembly
queue larger than approximately 100 items. This limits the cost to
find the correct reassembly queue when processing an incoming
fragment.
Due to the low limits on each bucket's length, increase the size of
the hash table from 64 to 1024.
Reviewed by: jhb
Security: FreeBSD-SA-18:10.ip
Security: CVE-2018-6923
There is a hashing algorithm which should distribute IPv4 reassembly
queues across the available buckets in a relatively even way. However,
if there is a flaw in the hashing algorithm which allows a large number
of IPv4 fragment reassembly queues to end up in a single bucket, a per-
bucket limit could help mitigate the performance impact of this flaw.
Implement such a limit, with a default of twice the maximum number of
reassembly queues divided by the number of buckets. Recalculate the
limit any time the maximum number of reassembly queues changes.
However, allow the user to override the value using a sysctl
(net.inet.ip.maxfragbucketsize).
Reviewed by: jhb
Security: FreeBSD-SA-18:10.ip
Security: CVE-2018-6923
The IP reassembly fragment limit is based on the number of mbuf clusters,
which are a global resource. However, the limit is currently applied
on a per-VNET basis. Given enough VNETs (or given sufficient customization
of enough VNETs), it is possible that the sum of all the VNET limits
will exceed the number of mbuf clusters available in the system.
Given the fact that the fragment limit is intended (at least in part) to
regulate access to a global resource, the fragment limit should
be applied on a global basis.
VNET-specific limits can be adjusted by modifying the
net.inet.ip.maxfragpackets and net.inet.ip.maxfragsperpacket
sysctls.
To disable fragment reassembly globally, set net.inet.ip.maxfrags to 0.
To disable fragment reassembly for a particular VNET, set
net.inet.ip.maxfragpackets to 0.
Reviewed by: jhb
Security: FreeBSD-SA-18:10.ip
Security: CVE-2018-6923
Currently, IPv4 fragments are hashed into buckets based on a 32-bit
key which is calculated by (src_ip ^ ip_id) and combined with a random
seed. However, because an attacker can control the values of src_ip
and ip_id, it is possible to construct an attack which causes very
deep chains to form in a given bucket.
To ensure more uniform distribution (and lower predictability for
an attacker), calculate the hash based on a key which includes all
the fields we use to identify a reassembly queue (dst_ip, src_ip,
ip_id, and the ip protocol) as well as a random seed.
Reviewed by: jhb
Security: FreeBSD-SA-18:10.ip
Security: CVE-2018-6923
It was lost when tryforward appeared. Now ip[6]_tryforward will be enabled
only when sending redirects for corresponding IP version is disabled via
sysctl. Otherwise will be used default forwarding function.
PR: 221137
Submitted by: mckay@
MFC after: 2 weeks
On PowerPC (and possibly other architectures), that doesn't use
EARLY_AP_STARTUP, the config task queue may be used initialized.
This was observed while trying to mount the root fs from NFS, as
reported here: https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=230168.
This patch has 2 main changes:
1- Perform a basic initialization of qgroup_config, similar to
what is done in taskqgroup_adjust, but simpler.
This makes qgroup_config ready to be used during NFS root mount.
2- When EARLY_AP_STARTUP is not used, call inm_init() and
in6m_init() right before SI_SUB_ROOT_CONF, because bootp needs
to send multicast packages to request an IP.
PR: Bug 230168
Reported by: sbruno
Reviewed by: jhibbits, mmacy, sbruno
Approved by: jhibbits
Differential Revision: D16633
Currently, the per-queue limit is a function of the receive buffer
size and the MSS. In certain cases (such as connections with large
receive buffers), the per-queue segment limit can be quite large.
Because we process segments as a linked list, large queues may not
perform acceptably.
The better long-term solution is to make the queue more efficient.
But, in the short-term, we can provide a way for a system
administrator to set the maximum queue size.
We set the default queue limit to 100. This is an effort to balance
performance with a sane resource limit. Depending on their
environment, goals, etc., an administrator may choose to modify this
limit in either direction.
Reviewed by: jhb
Approved by: so
Security: FreeBSD-SA-18:08.tcp
Security: CVE-2018-6922
The dtrace provider for UDP-Lite is modeled after the UDP provider.
This fixes the bug that UDP-Lite packets were triggering the UDP
provider.
Thanks to dteske@ for providing the dwatch module.
Reviewed by: dteske@, markj@, rrs@
Relnotes: yes
Differential Revision: https://reviews.freebsd.org/D16377
r336940 introduced an "unused variable" warning on platforms which
support INET, but not INET6, like MALTA and MALTA64 as reported
by Mark Millard. Improve the #ifdefs to address this issue.
Sponsored by: Netflix, Inc.
TCP/IPv4 allows an implicit connection setup using sendto(), which
is used for TTCP and TCP fast open. This patch adds support for
TCP/IPv6.
While there, improve some tests for detecting multicast addresses,
which are mapped.
Reviewed by: bz@, kbowling@, rrs@
Sponsored by: Netflix, Inc.
Differential Revision: https://reviews.freebsd.org/D16458
When sending TCP segments from the timewait code path, a stored
value of the last sent window is used. Use the same code for
computing this in the timewait code path as in the main code
path used in tcp_output() to avoiv inconsistencies.
Reviewed by: rrs@
MFC after: 1 month
Sponsored by: Netflix, Inc.
Differential Revision: https://reviews.freebsd.org/D16503
The following issues are fixed:
* Whenever a TCP server with TCP fast open enabled, calls accept(),
recv(), send(), and close() before the TCP-ACK segment has been received,
the TCP connection is just dropped and the reception of the TCP-ACK
segment triggers the sending of a TCP-RST segment.
* Whenever a TCP server with TCP fast open enabled, calls accept(), recv(),
send(), send(), and close() before the TCP-ACK segment has been received,
the first byte provided in the second send call is not transferred.
* Whenever a TCP client with TCP fast open enabled calls sendto() followed
by close() the TCP connection is just dropped.
Reviewed by: jtl@, kbowling@, rrs@
Sponsored by: Netflix, Inc.
Differential Revision: https://reviews.freebsd.org/D16485
These missing probe are mostly in the syncache and timewait code.
Reviewed by: markj@, rrs@
MFC after: 1 month
Sponsored by: Netflix, Inc.
Differential Revision: https://reviews.freebsd.org/D16369
The timespecadd(3) family of macros were imported from NetBSD back in
r35029. However, they were initially guarded by #ifdef _KERNEL. In the
meantime, we have grown at least 28 syscalls that use timespecs in some
way, leading many programs both inside and outside of the base system to
redefine those macros. It's better just to make the definitions public.
Our kernel currently defines two-argument versions of timespecadd and
timespecsub. NetBSD, OpenBSD, and FreeDesktop.org's libbsd, however, define
three-argument versions. Solaris also defines a three-argument version, but
only in its kernel. This revision changes our definition to match the
common three-argument version.
Bump _FreeBSD_version due to the breaking KPI change.
Discussed with: cem, jilles, ian, bde
Differential Revision: https://reviews.freebsd.org/D14725
sending an invalid segment into the reassembly
queue. This would happen if you enabled the
data after close option.
Sponsored by: Netflix
Differential Revision: https://reviews.freebsd.org/D16453
When ABE was added (rS331214) to NewReno and leak fixed (rS333699) , it now has
a destructor (newreno_cb_destroy) for per connection state. Other congestion
controls may allocate and free cc_data on entry and exit, but the field is
never explicitly NULLed if moving back to NewReno which only internally
allocates stateful data (no entry contstructor) resulting in a situation where
newreno_cb_destory might be called on a junk pointer.
- NULL out cc_data in the framework after calling {cc}_cb_destroy
- free(9) checks for NULL so there is no need to perform not NULL checks
before calling free.
- Improve a comment about NewReno in tcp_ccalgounload
This is the result of a debugging session from Jason Wolfe, Jason Eggleston,
and mmacy@ and very helpful insight from lstewart@.
Submitted by: Kevin Bowling
Reviewed by: lstewart
Sponsored by: Limelight Networks
Differential Revision: https://reviews.freebsd.org/D16282
Fire UDP receive probes when a packet is received and there is no
endpoint consuming it. Fire the probe also if the TTL of the
received packet is smaller than the minimum required by the endpoint.
Clarify also in the man page, when the probe fires.
Reviewed by: dteske@, markj@, rrs@
Sponsored by: Netflix, Inc.
Differential Revision: https://reviews.freebsd.org/D16046
This deduplicates the code a bit, and also implicitly adds missing
callout_stop() to in[6]_lltable_delete_entry() functions.
PR: 209682, 225927
Submitted by: hselasky (previous version)
MFC after: 2 weeks
Differential Revision: https://reviews.freebsd.org/D4605