response.
Delete an unneeded rate limit for UDP under IPv6. Because ICMP6
messages have their own rate limit, it is unnecessary to apply a
second rate limit to UDP messages.
Reviewed by: glebius
MFC after: 2 weeks
Sponsored by: Netflix
Differential Revision: https://reviews.freebsd.org/D10387
considering cache line hits and misses. Put the lock and hash list
glue into the first cache line, put inp_refcount inp_flags inp_socket
into the second cache line.
o On allocation zero out entire structure except the lock and list entries,
including inp_route inp_lle inp_gencnt. When inp_route and inp_lle were
introduced, they were added below inp_zero_size, resulting on not being
cleared after free/alloc. This definitely was a source of bugs with route
caching. Could be that r315956 has just fixed one of them.
The inp_gencnt is reinitialized on every alloc, so it is safe to clear it.
This has been proved to improve TCP performance at Netflix.
Obtained from: rrs
Differential Revision: D10686
if it is called on a TCP socket
* with an IPv6 address and the socket is bound to an
IPv4-mapped IPv6 address.
* with an IPv4-mapped IPv6 address and the socket is bound to an
IPv6 address.
Thanks to Jonathan T. Leighton for reporting this issue.
Reviewed by: bz gnn
MFC after: 3 days
Differential Revision: https://reviews.freebsd.org/D9163
r290383 has changed how mbufs sent by divert socket are handled.
Previously they are always handled by slow path processing in ip_input().
Now ip_tryforward() is invoked from ip_input() before in_broadcast() check.
Since diverted packet lost all mbuf flags, it passes the broadcast check
in ip_tryforward() due to missing M_BCAST flag. In the result the broadcast
packet is forwarded to the wire instead of be consumed by network stack.
Add in_broadcast() check to the div_output() function. And restore the
M_BCAST flag if destination address is broadcast for the given network
interface.
PR: 209491
MFC after: 1 week
A long long time ago the register keyword told the compiler to store
the corresponding variable in a CPU register, but it is not relevant
for any compiler used in the FreeBSD world today.
ANSIfy related prototypes while here.
Reviewed by: cem, jhb
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D10193
function (they used to say UMA_ZONE_NOFREE), so flag parameter goes away.
The zone_fini parameter also goes away. Previously no protocols (except
divert) supplied zone_fini function, so inpcb locks were leaked with slabs.
This was okay while zones were allocated with UMA_ZONE_NOFREE flag, but now
this is a leak. Fix that by suppling inpcb_fini() function as fini method
for all inpcb zones.
compiled into the kernel
This ensures that .iss_asm (the number of ASM listeners) isn't incorrectly
decremented for MLD-layer source datagrams when inspecting im*s_st[1]
(the second state in the structure).
MFC after: 2 months
PR: 217509 [1]
Reported by: Coverity (Isilon)
Reviewed by: ae ("This patch looks correct to me." [1])
Submitted by: Miles Ohlrich <miles.ohlrich@isilon.com>
Sponsored by: Dell EMC Isilon
It has strong locking model, doesn't have any timers associated with
entries. The entries theirselves are referenced only from the tcpcb zone,
which itself is a normal zone, without the UMA_ZONE_NOFREE flag.
that chooses right alias_address for outgoing packets that already have
corresponding state in one of aliasing instances. This feature works just fine
for ICMP, UDP, TCP and SCTP packes but not for others. For example,
outgoing PPtP/GRE packets always get alias_address of latest configured
instance no matter whether such packets have corresponding state or not.
This change unbreaks translation of transit PPtP/GRE connections
for "nat global" case fixing a bug in static ProtoAliasOut() function
that ignores its "create" argument and performs translation
regardless of its value. This static function is called only
by LibAliasOutLocked() function and only for packers other than
ICMP, UDP, TCP and SCTP. LibAliasOutLocked() passes its "create"
argument unmodified.
We have only two consumers of LibAliasOutLocked() in the source tree
calling it with "create" unequal to 1: "ipfw nat global" code and similar
natd code having same problem. All other consumers of LibAliasOutLocked()
call it with create = 1 and the patch is "no-op" for such cases.
PR: 218968
Approved by: ae, vsevolod (mentor)
MFC after: 1 week
This patch allows the MTU stored in the hostcache to be used as an
initial value for SCTP paths. When an ICMP PTB message is received,
store the MTU in the hostcache.
MFC after: 1 week
validation of SEG.ACK as the first step. If the ACK is not acceptable,
a RST segment should be sent and the segment should be dropped.
Up to now, the segment was partially processed.
This patch moves the check for the SEG.ACK validation up to the front
as required.
Reviewed by: hiren, gnn
MFC after: 1 week
Sponsored by: Netflix, Inc.
Differential Revision: https://reviews.freebsd.org/D10424
wait for the next enqueue from the driver.
Reviewed by: gnn@, hselasky@, gallatin@
MFC after: 1 week
Sponsored by: Chelsio Communications
Differential Revision: https://reviews.freebsd.org/D10432
patm(4) devices.
Maintaining an address family and framework has real costs when we make
infrastructure improvements. In the case of NATM we support no devices
manufactured in the last 20 years and some will not even work in modern
motherboards (some newer devices that patm(4) could be updated to
support apparently exist, but we do not currently have support).
With this change, support remains for some netgraph modules that don't
require NATM support code. It is unclear if all these should remain,
though ng_atmllc certainly stands alone.
Note well: FreeBSD 11 supports NATM and will continue to do so until at
least September 30, 2021. Improvements to the code in FreeBSD 11 are
certainly welcome.
Reviewed by: philip
Approved by: harti
-(SYNCOOKIE_LIFETIME + 1) instead of INT64_MIN, since it is
good enough and works when time_t is int32 or int64.
This fixes the issue reported by cy@ on i386.
Reported by: cy
MFC after: 1 week
Sponsored by: Netflix, Inc.
overflows, syncookies are used.
This patch restricts the usage of syncookies in this case: accept
syncookies only if there was an overflow of the syncache recently.
This mitigates a problem reported in PR217637, where is syncookie was
accepted without any recent drops.
Thanks to glebius@ for suggesting an improvement.
PR: 217637
Reviewed by: gnn, glebius
MFC after: 1 week
Sponsored by: Netflix, Inc.
Differential Revision: https://reviews.freebsd.org/D10272
do for streaming sockets.
And do more cleanup in the sbappendaddr_locked_internal() to prevent
leak information from existing mbuf to the one, that will be possible
created later by netgraph.
Suggested by: glebius
Tested by: Irina Liakh <spell at itl ua>
MFC after: 1 week
When checksums of received IP and UDP header already checked, UDP uses
sbappendaddr_locked() to pass received data to the socket.
sbappendaddr_locked() uses given mbuf as is, and if NIC supports checksum
offloading, mbuf contains csum_data and csum_flags that were calculated
for already stripped headers. Some NICs support only limited checksums
offloading and do not use CSUM_PSEUDO_HDR flag, and csum_data contains
some value that UDP/TCP should use for pseudo header checksum calculation.
When L2TP is used for tunneling with mpd5, ng_ksocket receives mbuf with
filled csum_flags and csum_data, that were calculated for outer headers.
When L2TP header is stripped, a packet that was tunneled goes to the IP
layer and due to presence of csum_flags (without CSUM_PSEUDO_HDR) and
csum_data, the UDP/TCP checksum check fails for this packet.
Reported by: Irina Liakh <spell at itl ua>
Tested by: Irina Liakh <spell at itl ua>
MFC after: 1 week
for example not in SYN-SENT.
This patch adds code to check the sysctl variable in other states than
LISTEN.
Thanks to ae and gnn for providing comments.
Reviewed by: gnn
MFC after: 1 week
Sponsored by: Netflix, Inc.
Differential Revision: https://reviews.freebsd.org/D9894
Switched from using timestamps to RTT estimates when performing TCP receive
buffer auto resizing, as not all hosts support / enable TCP timestamps.
Disabled reset of receive buffer auto scaling when not in bulk receive mode,
which gives an extra 20% performance increase.
Also extracted auto resizing to a common method shared between standard and
fastpath modules.
With this AWS S3 downloads at ~17ms latency on a 1Gbps connection jump from
~3MB/s to ~100MB/s using the default settings.
Reviewed by: lstewart, gnn
MFC after: 2 weeks
Relnotes: Yes
Sponsored by: Multiplay
Differential Revision: https://reviews.freebsd.org/D9668
r304436 attempted to optimize the handling of incoming UDP packet by only
making an expensive call to in_broadcast() if the mbuf was marked as an
broadcast packet. Unfortunately, this cannot work in the case of point-to-
point L2 protocols like PPP, which have no notion of "broadcast". The
optimization has been disabled for several months now with no progress
towards fixing it, so it needs to go.
This opcode can be used to attach some data to external action opcode.
And unlike to O_EXTERNAL_INSTANCE opcode, this opcode does not require
creating of named instance to pass configuration arguments to external
action handler. The data is coming just next to O_EXTERNAL_ACTION opcode.
The userlevel part currenly supports formatting for opcode with ipfw_insn
size, by default it expects u16 numeric value in the arg1.
Obtained from: Yandex LLC
MFC after: 2 weeks
Sponsored by: Yandex LLC
If a jail has an explicitly assigned loopback address then allow it to be
used instead of remapping requests for the loopback adddress to the first
IPv4 address assigned to the jail.
This fixes issues where applications attempt to detect their bound port
where they requested a loopback address, which was available, but instead
the kernel remapped it to the jails first address.
A example of this is binding nginx to 127.0.0.1 and then running "service
nginx upgrade" which before this change would cause nginx to fail.
Also:
* Correct the description of prison_check_ip4_locked to match the code.
MFC after: 2 weeks
Relnotes: Yes
Sponsored by: Multiplay
tcp_output.c was using a route on the stack for IPv6, which does not
allow route caching or LLE/ndp caching. Switch to using the route
(v6 flavor) in the in_pcb, which was already present, which caches
both L3 and L2 lookups.
Reviewed by: gnn hiren
MFC after: 2 weeks
ip_forward, TCP/IPv6, and probably SCTP leaked references to L2 cache
entry because they used their own routes on the stack, not in_pcb routes.
The original model for route caching was callers that provided a route
structure to ip{,6}input() would keep the route, and this model was used
for L2 caching as well. Instead, change L2 caching to be done by default
only when using a route structure in the in_pcb; the pcb deallocation
code frees L2 as well as L3 cacches. A separate change will add route
caching to TCP/IPv6.
Another suggestion was to have the transport protocols indicate willingness
to use L2 caching, but this approach keeps the changes in the network
level
Reviewed by: ae gnn
MFC after: 2 weeks
Differential Revision: https://reviews.freebsd.org/D10059
This is a painful change, but it is needed. On the one hand, we avoid
modifying them, and this slows down some ideas, on the other hand we still
eventually modify them and tools like netstat(1) never work on next version of
FreeBSD. We maintain a ton of spares in them, and we already got some ifdef
hell at the end of tcpcb.
Details:
- Hide struct inpcb, struct tcpcb under _KERNEL || _WANT_FOO.
- Make struct xinpcb, struct xtcpcb pure API structures, not including
kernel structures inpcb and tcpcb inside. Export into these structures
the fields from inpcb and tcpcb that are known to be used, and put there
a ton of spare space.
- Make kernel and userland utilities compilable after these changes.
- Bump __FreeBSD_version.
Reviewed by: rrs, gnn
Differential Revision: D10018
inet_ntoa() and inet_ntoa_r() take the address in network
byte-order. When I removed those calls, I should have
replaced them with ntohl() to make the hex addresses slightly
less unreadable. Here they are.
See r315277 regarding classic blunders.
vangyzen: you're deep in "no good deed" territory, it seems
--badger
Reported by: ian
MFC after: 3 days
MFC when: I finally get it right
Sponsored by: Dell EMC
When I made the changes in r313821, I fell victim to one of the
classic blunders, the most famous of which is: never get involved
in a land war in Asia. But only slightly less well known is this:
Keep your brain turned on and engaged when making a tedious, sweeping,
mechanical change. KTR can correctly log the immediate integral values
passed to it, as well as constant strings, but not non-constant strings,
since they might change by the time ktrdump retrieves them.
Reported by: glebius
MFC after: 3 days
Sponsored by: Dell EMC
This was introduced on accident in r165243, when return sites were unified
to add a lock around LibAliasProxyRule().
PR: 217749
Submitted by: Svyatoslav <razmyslov at viva64.com>
Sponsored by: Viva64 (PVS-Studio)
Unfortunately they will have different integer value due to Linux value being already assigned in FreeBSD.
The patch is similar to IP_RECVDSTADDR but also provides the destination port value to the application.
This allows/improves implementation of transparent proxies on UDP sockets due to having the whole information on forwarded packets.
Reviewed by: adrian, aw
Approved by: ae (mentor)
Sponsored by: rsync.net
Differential Revision: D9235
Renumber cluase 4 to 3, per what everybody else did when BSD granted
them permission to remove clause 3. My insistance on keeping the same
numbering for legal reasons is too pedantic, so give up on that point.
Submitted by: Jan Schaumann <jschauma@stevens.edu>
Pull Request: https://github.com/freebsd/freebsd/pull/96
least 2 * MSS. However, if the receive buffer size is small, this might
be impossible. Add back a criterion to send a TCP window update if
the window can be increased by at least half of the receive buffer size.
This condition was removed in r242252. This patch simply brings it back.
PR: 211003
Reviewed by: gnn
MFC after: 1 week
Sponsored by: Netflix, Inc.
Differential Revision: https://reviews.freebsd.org/D9475
inet_ntoa() cannot be used safely in a multithreaded environment
because it uses a static local buffer. Remove it from the kernel.
Suggested by: glebius, emaste
Reviewed by: gnn
MFC after: never
Sponsored by: Dell EMC
Differential Revision: https://reviews.freebsd.org/D9625
inet_ntoa() cannot be used safely in a multithreaded environment
because it uses a static local buffer. Instead, use inet_ntoa_r()
with a buffer on the caller's stack.
Suggested by: glebius, emaste
Reviewed by: gnn
MFC after: 2 weeks
Sponsored by: Dell EMC
Differential Revision: https://reviews.freebsd.org/D9625
The inpcb structure has inp_sp pointer that is initialized by
ipsec_init_pcbpolicy() function. This pointer keeps strorage for IPsec
security policies associated with a specific socket.
An application can use IP_IPSEC_POLICY and IPV6_IPSEC_POLICY socket
options to configure these security policies. Then ip[6]_output()
uses inpcb pointer to specify that an outgoing packet is associated
with some socket. And IPSEC_OUTPUT() method can use a security policy
stored in the inp_sp. For inbound packet the protocol-specific input
routine uses IPSEC_CHECK_POLICY() method to check that a packet conforms
to inbound security policy configured in the inpcb.
SCTP protocol doesn't specify inpcb for ip[6]_output() when it sends
packets. Thus IPSEC_OUTPUT() method does not consider such packets as
associated with some socket and can not apply security policies
from inpcb, even if they are configured. Since IPSEC_CHECK_POLICY()
method is called from protocol-specific input routine, it can specify
inpcb pointer and associated with socket inbound policy will be
checked. But there are two problems:
1. Such check is asymmetric, becasue we can not apply security policy
from inpcb for outgoing packet.
2. IPSEC_CHECK_POLICY() expects that caller holds INPCB lock and
access to inp_sp is protected. But for SCTP this is not correct,
becasue SCTP uses own locks to protect inpcb.
To fix these problems remove IPsec related PCB code from SCTP.
This imply that IP_IPSEC_POLICY and IPV6_IPSEC_POLICY socket options
will be not applicable to SCTP sockets. To be able correctly check
inbound security policies for SCTP, mark its protocol header with
the PR_LASTHDR flag.
Reported by: tuexen
Reviewed by: tuexen
Differential Revision: https://reviews.freebsd.org/D9538
If the TCP stack has retransmitted more than 1/4 of the total
number of retransmits before a connection drop, it decides that
its current RTT estimate is hopelessly out of date and decides
to recalculate it from scratch starting with the next ACK.
Unfortunately, it implements this by zeroing out the current RTT
estimate. Drop this hack entirely, as it makes it significantly more
difficult to debug connection issues. Instead check for excessive
retransmits at the point where srtt is updated from an ACK being
received. If we've exceeded 1/4 of the maximum retransmits,
discard the previous srtt estimate and replace it with the latest
rtt measurement.
Differential Revision: https://reviews.freebsd.org/D9519
Reviewed by: gnn
Sponsored by: Dell EMC Isilon
Unfortunately they will have different integer value due to Linux value being already assigned in FreeBSD.
The patch is similar to IP_RECVDSTADDR but also provides the destination port value to the application.
This allows/improves implementation of transparent proxies on UDP sockets due to having the whole information on forwarded packets.
Sponsored-by: rsync.net
Differential Revision: D9235
Reviewed-by: adrian
If multiple threads emit a UDP log_in_vain message concurrently,
the IP addresses could be garbage due to concurrent usage of a
single string buffer inside inet_ntoa(). Use inet_ntoa_r() with
two stack buffers instead.
Reported by: Mark Martinec <Mark.Martinec+freebsd@ijs.si>
MFC after: 3 days
Relnotes: yes
Sponsored by: Dell EMC
Small summary
-------------
o Almost all IPsec releated code was moved into sys/netipsec.
o New kernel modules added: ipsec.ko and tcpmd5.ko. New kernel
option IPSEC_SUPPORT added. It enables support for loading
and unloading of ipsec.ko and tcpmd5.ko kernel modules.
o IPSEC_NAT_T option was removed. Now NAT-T support is enabled by
default. The UDP_ENCAP_ESPINUDP_NON_IKE encapsulation type
support was removed. Added TCP/UDP checksum handling for
inbound packets that were decapsulated by transport mode SAs.
setkey(8) modified to show run-time NAT-T configuration of SA.
o New network pseudo interface if_ipsec(4) added. For now it is
build as part of ipsec.ko module (or with IPSEC kernel).
It implements IPsec virtual tunnels to create route-based VPNs.
o The network stack now invokes IPsec functions using special
methods. The only one header file <netipsec/ipsec_support.h>
should be included to declare all the needed things to work
with IPsec.
o All IPsec protocols handlers (ESP/AH/IPCOMP protosw) were removed.
Now these protocols are handled directly via IPsec methods.
o TCP_SIGNATURE support was reworked to be more close to RFC.
o PF_KEY SADB was reworked:
- now all security associations stored in the single SPI namespace,
and all SAs MUST have unique SPI.
- several hash tables added to speed up lookups in SADB.
- SADB now uses rmlock to protect access, and concurrent threads
can do SA lookups in the same time.
- many PF_KEY message handlers were reworked to reflect changes
in SADB.
- SADB_UPDATE message was extended to support new PF_KEY headers:
SADB_X_EXT_NEW_ADDRESS_SRC and SADB_X_EXT_NEW_ADDRESS_DST. They
can be used by IKE daemon to change SA addresses.
o ipsecrequest and secpolicy structures were cardinally changed to
avoid locking protection for ipsecrequest. Now we support
only limited number (4) of bundled SAs, but they are supported
for both INET and INET6.
o INPCB security policy cache was introduced. Each PCB now caches
used security policies to avoid SP lookup for each packet.
o For inbound security policies added the mode, when the kernel does
check for full history of applied IPsec transforms.
o References counting rules for security policies and security
associations were changed. The proper SA locking added into xform
code.
o xform code was also changed. Now it is possible to unregister xforms.
tdb_xxx structures were changed and renamed to reflect changes in
SADB/SPDB, and changed rules for locking and refcounting.
Reviewed by: gnn, wblock
Obtained from: Yandex LLC
Relnotes: yes
Sponsored by: Yandex LLC
Differential Revision: https://reviews.freebsd.org/D9352
not being initialized, and the per-vnet fastopen context was only
being initialized for the default vnet.
PR: 216613
Reported by: Alex Deiter <alex dot deiter at gmail dot com>
MFC after: 1 week
space available for chunks. This unbreaks the handling of
ICMPV6 packets indicating "packet too big". It just worked
for IPv4 since we are overbooking for IPv4.
MFC after: 1 week
regardless of what the default stack for the system is set to.
With current/default behavior, after changing the default tcp stack, the
application needs to be restarted to pick up that change. Setting this new knob
net.inet.tcp.functions_inherit_listen_socket_stack to '0' would change that
behavior and make any new connection use the newly selected default tcp stack.
Reviewed by: rrs
MFC after: 2 weeks
Sponsored by: Limelight Networks
(intentionally) deleted first and then completely added again (so all the
events, announces and hooks are given a chance to run).
This cause an issue with CARP where the existing CARP data structure is
removed together with the last address for a given VHID, which will cause
a subsequent fail when the address is later re-added.
This change fixes this issue by adding a new flag to keep the CARP data
structure when an address is not being removed.
There was an additional issue with IPv6 CARP addresses, where the CARP data
structure would never be removed after a change and lead to VHIDs which
cannot be destroyed.
Reviewed by: glebius
Obtained from: pfSense
MFC after: 2 weeks
Sponsored by: Rubicon Communications, LLC (Netgate)
- Add RATELIMIT kernel configuration keyword which must be set to
enable the new functionality.
- Add support for hardware driven, Receive Side Scaling, RSS aware, rate
limited sendqueues and expose the functionality through the already
established SO_MAX_PACING_RATE setsockopt(). The API support rates in
the range from 1 to 4Gbytes/s which are suitable for regular TCP and
UDP streams. The setsockopt(2) manual page has been updated.
- Add rate limit function callback API to "struct ifnet" which supports
the following operations: if_snd_tag_alloc(), if_snd_tag_modify(),
if_snd_tag_query() and if_snd_tag_free().
- Add support to ifconfig to view, set and clear the IFCAP_TXRTLMT
flag, which tells if a network driver supports rate limiting or not.
- This patch also adds support for rate limiting through VLAN and LAGG
intermediate network devices.
- How rate limiting works:
1) The userspace application calls setsockopt() after accepting or
making a new connection to set the rate which is then stored in the
socket structure in the kernel. Later on when packets are transmitted
a check is made in the transmit path for rate changes. A rate change
implies a non-blocking ifp->if_snd_tag_alloc() call will be made to the
destination network interface, which then sets up a custom sendqueue
with the given rate limitation parameter. A "struct m_snd_tag" pointer is
returned which serves as a "snd_tag" hint in the m_pkthdr for the
subsequently transmitted mbufs.
2) When the network driver sees the "m->m_pkthdr.snd_tag" different
from NULL, it will move the packets into a designated rate limited sendqueue
given by the snd_tag pointer. It is up to the individual drivers how the rate
limited traffic will be rate limited.
3) Route changes are detected by the NIC drivers in the ifp->if_transmit()
routine when the ifnet pointer in the incoming snd_tag mismatches the
one of the network interface. The network adapter frees the mbuf and
returns EAGAIN which causes the ip_output() to release and clear the send
tag. Upon next ip_output() a new "snd_tag" will be tried allocated.
4) When the PCB is detached the custom sendqueue will be released by a
non-blocking ifp->if_snd_tag_free() call to the currently bound network
interface.
Reviewed by: wblock (manpages), adrian, gallatin, scottl (network)
Differential Revision: https://reviews.freebsd.org/D3687
Sponsored by: Mellanox Technologies
MFC after: 3 months
sources to return timestamps when SO_TIMESTAMP is enabled. Two additional
clock sources are:
o nanosecond resolution realtime clock (equivalent of CLOCK_REALTIME);
o nanosecond resolution monotonic clock (equivalent of CLOCK_MONOTONIC).
In addition to this, this option provides unified interface to get bintime
(equivalent of using SO_BINTIME), except it also supported with IPv6 where
SO_BINTIME has never been supported. The long term plan is to depreciate
SO_BINTIME and move everything to using SO_TS_CLOCK.
Idea for this enhancement has been briefly discussed on the Net session
during dev summit in Ottawa last June and the general input was positive.
This change is believed to benefit network benchmarks/profiling as well
as other scenarios where precise time of arrival measurement is necessary.
There are two regression test cases as part of this commit: one extends unix
domain test code (unix_cmsg) to test new SCM_XXX types and another one
implementis totally new test case which exchanges UDP packets between two
processes using both conventional methods (i.e. calling clock_gettime(2)
before recv(2) and after send(2)), as well as using setsockopt()+recv() in
receive path. The resulting delays are checked for sanity for all supported
clock types.
Reviewed by: adrian, gnn
Differential Revision: https://reviews.freebsd.org/D9171
header match when using a raw socket to send IPv4 packets and
providing the header. If they don't match, let send return -1
and set errno to EINVAL.
Before this patch is was only enforced that the length in the header
is not larger then the buffer length.
PR: 212283
Reviewed by: ae, gnn
MFC after: 1 week
Differential Revision: https://reviews.freebsd.org/D9161
and tw_so_options defined here which is supposed to be a copy of the
former (short vs u_short respectively).
Switch tw_so_options to be "signed short" to match the type of the field
it's inherited from.
dangerous. Those wanting data from an mbuf should use DTrace itself to get
the data.
PR: 203409
Reviewed by: hiren
MFC after: 1 week
Sponsored by: Limelight Networks
Differential Revision: https://reviews.freebsd.org/D9035
structs under the INET6 #ifdef. Similarly (even though it doesn't seem
to affect the build), conditionalize all IPv4 structs under the INET
#ifdef
This also unbreaks the LINT-NOINET6 tinderbox target on amd64; I have not
verified other MACHINE/TARGET pairs (e.g. armv6/arm).
MFC after: 2 weeks
X-MFC with: r310847
Pointyhat to: jpaetzel
Reported by: O. Hartmann <o.hartmann@walstatt.org>
If there is a loop in the network a CARP that is in MASTER state will see it's
own broadcasts, which will then cause it to assume BACKUP state. When it
assumes BACKUP it will stop sending advertisements. In that state it will no
longer see advertisements and will assume MASTER...
We can't catch all the cases where we are seeing our own CARP broadcast, but
we can catch the obvious case.
Submitted by: torek
Obtained from: FreeNAS
MFC after: 2 weeks
Sponsored by: iXsystems
In case of the empty queue tp->snd_holes and tcp_sackhole_insert()
failing due to memory shortage, tp->snd_holes will be empty.
This problem was hit when stress tests where performed by pho.
PR: 215513
Reported by: pho
Tested by: pho
Sponsored by: Netflix, Inc.
This can lead to change of mbuf pointer (packet filter could do m_pullup(),
NAT, etc). Also in case of change of destination address, tryforward can
decide that packet should be handled by local system. In this case modified
mbuf can be returned to the ip[6]_input(). To handle this correctly, check
M_FASTFWD_OURS flag after return from ip[6]_tryforward. And if it is present,
update variables that depend from mbuf pointer and skip another inbound
firewall processing.
No objection from: #network
MFC after: 3 weeks
Sponsored by: Yandex LLC
Differential Revision: https://reviews.freebsd.org/D8764
in6p_options to check that. That is incorrect as we carry ip options in
in6p_outputopts. Also, just checking for in6p_outputopts being NULL won't
suffice as we combine ip options and ip header fields both in that one field.
The commit fixes this by using ip6_optlen() which correctly calculates length
of only ip options for IPv6.
Reviewed by: ae, bz
MFC after: 3 weeks
Sponsored by: Limelight Networks
This fixes a bug where the wrong ppid was reported, if
* I-DATA was used on the first fragement was not received first
* DATA was used and different ppids where used.
Thanks to Julian Cordes for making me aware of the issue.
MFC after: 1 week
This made a couple of bugs visible in handling SSN wrap-arounds
when using DATA chunks. Now bulk transfer seems to work fine...
This fixes the issue reported in
https://github.com/sctplab/usrsctp/issues/111
MFC after: 1 week
The tools using to generate the sources has been updated and produces
different whitespaces. Commit this seperately to avoid intermixing
these with real code changes.
MFC after: 3 days
When a TCP segment with the FIN bit set was received in the CLOSED state,
a TCP RST-ACK-segment is sent. When computing SEG.ACK for this, the
FIN counts as one byte. This accounting was missing and is fixed by this
patch.
Reviewed by: hiren
MFC after: 1 month
Sponsored by: Netflix, Inc.
Differential Revision: https://svn.freebsd.org/base/head
many borken middle-boxes tend to do that. But during 3whs, in syncache_expand(),
we don't do that which causes us to send a RST to such a client. Relax this
constraint by only using tsecr to compare against timestamp that we sent when it
is not 0. As a result, we'd now accept the final ACK of 3whs with tsecr of 0.
Reviewed by: jtl, gnn
Sponsored by: Limelight Networks
Differential Revision: https://reviews.freebsd.org/D8552
This does not cover state changes from TIME-WAIT.
Reviewed by: gnn
MFC after: 3 weeks
Sponsored by: Netflix
Differential Revision: https://reviews.freebsd.org/D8443
either in the CLOSING or LAST-ACK state.
Reviewed by: hiren
MFC after: 3 weeks
Sponsored by: Netflix
Differential Revision: https://reviews.freebsd.org/D8371
introduced with r261242. The useful and expected soisconnected()
call is done in tcp_do_segment().
Has been found as part of unrelated PR:212920 investigation.
Improve slightly (~2%) the maximum number of TCP accept per second.
Tested by: kevin.bowling_kev009.com, jch
Approved by: gnn, hiren
MFC after: 1 week
Sponsored by: Verisign, Inc
Differential Revision: https://reviews.freebsd.org/D8072
loss event but not use or obay the recommendations i.e. values set by it in some
cases.
Here is an attempt to solve that confusion by following relevant RFCs/drafts.
Stack only sets congestion window/slow start threshold values when there is no
CC module availalbe to take that action. All CC modules are inspected and
updated when needed to take appropriate action on loss.
tcp_stacks/fastpath module has been updated to adapt these changes.
Note: Probably, the most significant change would be to not bring congestion
window down to 1MSS on a loss signaled by 3-duplicate acks and letting
respective CC decide that value.
In collaboration with: Matt Macy <mmacy at nextbsd dot org>
Discussed on: transport@ mailing list
Reviewed by: jtl
MFC after: 1 month
Sponsored by: Limelight Networks
Differential Revision: https://reviews.freebsd.org/D8225
In r304435, ip_output() was changed to use the result of the route
lookup to decide whether the outgoing packet was a broadcast or
not. This introduced a regression on interfaces where
IFF_BROADCAST was not set (e.g. point-to-point links), as the
algorithm could incorrectly treat the destination address as a
broadcast address, and ip_output() would subsequently drop the
packet as broadcasting on a non-IFF_BROADCAST interface is not
allowed.
Differential Revision: https://reviews.freebsd.org/D8303
Reviewed by: jtl
Reported by: ambrisko
MFC after: 2 weeks
X-MFC-With: r304435
Sponsored by: Dell EMC Isilon
handling. Ensure that:
* Protocol unreachable errors are handled by indicating ECONNREFUSED
to the TCP user for both IPv4 and IPv6. These were ignored for IPv6.
* Communication prohibited errors are handled by indicating ECONNREFUSED
to the TCP user for both IPv4 and IPv6. These were ignored for IPv6.
* Hop Limited exceeded errors are handled by indicating EHOSTUNREACH
to the TCP user for both IPv4 and IPv6.
For IPv6 the TCP connected was dropped but errno wasn't set.
Reviewed by: gallatin, rrs
MFC after: 1 month
Sponsored by: Netflix
Differential Revision: 7904
after having been dropped.
This fixes enforces in_pcbdrop() logic in tcp_input():
"in_pcbdrop() is used by TCP to mark an inpcb as unused and avoid future packet
delivery or event notification when a socket remains open but TCP has closed."
PR: 203175
Reported by: Palle Girgensohn, Slawa Olhovchenkov
Tested by: Slawa Olhovchenkov
Reviewed by: Slawa Olhovchenkov
Approved by: gnn, Slawa Olhovchenkov
Differential Revision: https://reviews.freebsd.org/D8211
MFC after: 1 week
Sponsored by: Verisign, inc
to at least 64.
This is still just a coverup to avoid kernel panic and not an actual fix.
PR: 213232
Reviewed by: glebius
MFC after: 1 week
Sponsored by: Limelight Networks
Differential Revision: https://reviews.freebsd.org/D8272
Also renamed some tfo labels and added/reworked comments for clarity.
Based on an initial patch from jtl.
PR: 213424
Reviewed by: jtl
MFC after: 1 week
Differential Revision: https://reviews.freebsd.org/D8235
compile when that option is configured. In tcp_destroy(), the error
variable is now only used in code enclosed in an '#ifdef TCP_HHOOK' block.
This broke the build for VNET images.
Enclose the error variable itself in an #ifdef block.
Submitted by: Shawn Webb <shawn.webb at hardenedbsd.org>
Reported by: Shawn Webb <shawn.webb at hardenedbsd.org>
PointyHat to: jtl
received on a TCP session that has entered the ESTABLISHED state. This
results in a lot of calls to reset the keepalive timer.
This patch changes the behavior so we set the keepalive timer for the
keepalive idle time (TP_KEEPIDLE). When the keepalive timer fires, it will
first check to see if the session has been idle for TP_KEEPIDLE ticks. If
not, it will reschedule the keepalive timer for the time the session will
have been idle for TP_KEEPIDLE ticks.
For a session with regular communication, the keepalive timer should fire
approximately once every TP_KEEPIDLE ticks. For sessions with irregular
communication, the keepalive timer might fire more often. But, the
disruption from a periodic keepalive timer should be less than the regular
cost of resetting the keepalive timer on every packet.
(FWIW, this change saved approximately 1.73% of the busy CPU cycles on a
particular test system with a heavy TCP output load. Of course, the
actual impact is very specific to the particular hardware and workload.)
Reviewed by: gallatin, rrs
MFC after: 2 weeks
Sponsored by: Netflix
Differential Revision: https://reviews.freebsd.org/D8243
the TCP_RFC7413 kernel option. This change removes those few instructions
from the packet processing path.
While not strictly necessary, for the sake of consistency, I applied the
new IS_FASTOPEN macro to all places in the packet processing path that
used the (t_flags & TF_FASTOPEN) check.
Reviewed by: hiren
Sponsored by: Netflix
Differential Revision: https://reviews.freebsd.org/D8219
TCPCB, it checks (so->so_options & SO_ACCEPTCONN) to determine whether or
not the socket is a listening socket. However, this causes the code to
access a different cacheline. If we first check if the socket is in the
LISTEN state, we can avoid accessing so->so_options when processing packets
received for ESTABLISHED sessions.
If INVARIANTS is defined, the code still needs to access both variables to
check that so->so_options is consistent with the state.
Reviewed by: gallatin
MFC after: 1 week
Sponsored by: Netflix
to add actions that run when a TCP frame is sent or received on a TCP
session in the ESTABLISHED state. In the base tree, this functionality is
only used for the h_ertt module, which is used by the cc_cdg, cc_chd, cc_hd,
and cc_vegas congestion control modules.
Presently, we incur overhead to check for hooks each time a TCP frame is
sent or received on an ESTABLISHED TCP session.
This change adds a new compile-time option (TCP_HHOOK) to determine whether
to include the hhook(9) framework for TCP. To retain backwards
compatibility, I added the TCP_HHOOK option to every configuration file that
already defined "options INET". (Therefore, this patch introduces no
functional change. In order to see a functional difference, you need to
compile a custom kernel without the TCP_HHOOK option.) This change will
allow users to easily exclude this functionality from their kernel, should
they wish to do so.
Note that any users who use a custom kernel configuration and use one of the
congestion control modules listed above will need to add the TCP_HHOOK
option to their kernel configuration.
Reviewed by: rrs, lstewart, hiren (previous version), sjg (makefiles only)
Sponsored by: Netflix
Differential Revision: https://reviews.freebsd.org/D8185
This change extends the nd6 lock to protect the ND prefix list as well
as the list of advertising routers associated with each prefix. To handle
cases where the nd6 lock must be dropped while iterating over either the
prefix or default router lists, a generation counter is used to track
modifications to the lists. Additionally, a new mutex is used to serialize
prefix on-link/off-link transitions. This mutex must be acquired before
the nd6 lock and is held while updating the routing table in
nd6_prefix_onlink() and nd6_prefix_offlink().
Reviewed by: ae, tuexen (SCTP bits)
Tested by: Jason Wolfe <jason@llnw.com>,
Larry Rosenman <ler@lerctr.org>
MFC after: 2 months
Differential Revision: https://reviews.freebsd.org/D8125
In the persist case, take the SYN and FIN flags into account when updating
the sequence space sent.
Reviewed by: gnn
MFC after: 2 weeks
Sponsored by: Juniper Networks, Netflix
Differential Revision: https://reviews.freebsd.org/D7075
Tested by: Limelight, Netflix
A single gratuitous ARP (GARP) is always transmitted when an IPv4
address is added to an interface, and that is usually sufficient.
However, in some circumstances, such as when a shared address is
passed between cluster nodes, this single GARP may occasionally be
dropped or lost. This can lead to neighbors on the network link
working with a stale ARP cache and sending packets destined for
that address to the node that previously owned the address, which
may not respond.
To avoid this situation, GARP retransmissions can be enabled by setting
the net.link.ether.inet.garp_rexmit_count sysctl to a value greater
than zero. The setting represents the maximum number of retransmissions.
The interval between retransmissions is calculated using an exponential
backoff algorithm, doubling each time, so the retransmission intervals
are: {1, 2, 4, 8, 16, ...} (seconds).
Due to the exponential backoff algorithm used for the interval
between GARP retransmissions, the maximum number of retransmissions
is limited to 16 for sanity. This limit corresponds to a maximum
interval between retransmissions of 2^16 seconds ~= 18 hours.
Increasing this limit is possible, but sending out GARPs spaced
days apart would be of little use.
Submitted by: David A. Bright <david.a.bright@dell.com>
MFC after: 1 month
Relnotes: yes
Sponsored by: Dell EMC
Differential Revision: https://reviews.freebsd.org/D7695
is NULL and the function jumps to the "release:" label.
For this case, the "inp" was write locked, but the code attempted to
read unlock it. This patch fixes the problem.
This case could occur for NFS over UDP mounts, where the server was
down for a few minutes under certain circumstances.
Reported by: bde
Tested by: bde
Reviewed by: gnn
MFC after: 2 weeks
during testing of network related changes where cached entries may pollute your
results, or during known congestion events where you don't want to unfairly
penalize hosts.
Prior to r232346 this would have meant you would break any connection with a sub
1500 MTU, as the hostcache was authoritative. All entries as they stand today
should simply be used to pre populate values for efficiency.
Submitted by: Jason Wolfe (j at nitrology dot com)
Reviewed by: rwatson, sbruno, rrs , bz (earlier version)
MFC after: 2 weeks
Sponsored by: Limelight Networks
Differential Revision: https://reviews.freebsd.org/D6198
Restructure code slightly to save ip_tos bits earlier. Fix the bug
where the ip_tos field is zeroed out before assigning to the iptos
variable. Restore the ip_tos and ip_ver fields only if they have
been zeroed during the pseudo-header checksum calculation.
Reviewed by: cem, gnn, hiren
MFC after: 2 weeks
Differential Revision: https://reviews.freebsd.org/D8053
As a side effect of r261242 when using accept_filter the
first call to soisconnected() is done earlier in tcp_input()
instead of tcp_do_segment() context. Restore the expected behaviour.
Note: This call to soisconnected() seems to be extraneous in all
cases (with or without accept_filter). Will be addressed in a
separate commit.
PR: 212920
Reported by: Alexey
Tested by: Alexey, jch
Sponsored by: Verisign, Inc.
MFC after: 1 week
While here, prefer if_addrhead (FreeBSD) to if_addrlist (BSD compat) naming
for the interface address list in sctp_bsd_addr.c
Reviewed by: tuexen
Differential Revision: https://reviews.freebsd.org/D8051
Descriptor returned by accept(2) should inherits capabilities rights from
the listening socket.
PR: 201052
Reviewed by: emaste, jonathan
Discussed with: many
Differential Revision: https://reviews.freebsd.org/D7724
There were two bugs:
* There was an accounting bug resulting in reporting a too small a_rwnd.
* There are a bug when abandoning messages in the reassembly queue.
MFC after: 4 weeks
* TCP_KEEPINIT
* TCP_KEEPINTVL
* TCP_KEEPIDLE
* TCP_KEEPCNT
always always report the values currently used when getsockopt()
is used. This wasn't the case when the sysctl-inherited default
values where used.
Ensure that the IPPROTO_TCP level socket option TCP_INFO has the
TCPI_OPT_ECN flag set in the tcpi_options field when ECN support
has been negotiated successfully.
Reviewed by: rrs, jtl, hiren
MFC after: 1 month
Differential Revision: 7833
warning:
sys/netinet/igmp.c:546:21: error: implicit conversion from 'int' to 'char' changes value from 148 to -108 [-Werror,-Wconstant-conversion]
p->ipopt_list[0] = IPOPT_RA; /* Router Alert Option */
~ ^~~~~~~~
sys/netinet/ip.h:153:19: note: expanded from macro 'IPOPT_RA'
#define IPOPT_RA 148 /* router alert */
^~~
This is because ipopt_list is an array of char, so IPOPT_RA is wrapped
to a negative value. It would be nice to change ipopt_list to an array
of u_char, but it changes the signature of the public struct ipoption,
so add an explicit cast to suppress the warning.
Reviewed by: imp
MFC after: 3 days
Differential Revision: https://reviews.freebsd.org/D7777
can be in after receiving a FIN.
FWIW, NetBSD has this change for quite some time.
This has been tested at Netflix and Limelight in production traffic.
Reported by: Sam Kumar <samkumar99 at gmail.com> on transport@
Reviewed by: rrs
MFC after: 4 weeks
Sponsored by: Limelight Networks
Differential Revision: https://reviews.freebsd.org/D7475
I-FORWARD-TSN chunk before any DATA or I-DATA chunk.
Thanks to Julian Cordes for finding this problem and prividing
packetdrill scripts to reporduce the issue.
MFC after: 3 days
tso_segsz pkthdr field during RX processing, and use the information in TCP for
more correct accounting and as a congestion control input. This is only a start,
and an audit of other uses for the data is left as future work.
Reviewed by: gallatin, rrs
Sponsored by: Netflix, Inc.
Differential Revision: https://reviews.freebsd.org/D7564
r304436 attempted to optimize the handling of incoming UDP packet by only
making an expensive call to in_broadcast() if the mbuf was marked as an
broadcast packet. Unfortunately, this cannot work in the case of point-to-
point L2 protocols like PPP, which have no notion of "broadcast".
Discussions on how to properly fix r304436 are ongoing, but in the meantime
disable the optimization to ensure that no existing network setups are broken.
Reported by: bms
First, keep a ref count on the stcb after looking it up, as
done in the other lookup cases.
Second, before looking again at sp, ensure that it is not
freed, because the assoc is about to be freed.
MFC after: 3 days
The default value of the tunable introduced in r304436 couldn't be
effectively overrided on VIMAGE kernels, because instead of being
accessed via the appropriate VNET() accessor macro, it was accessed
via the VNET_NAME() macro, which resolves to the (should-be) read-only
master template of initial values of per-VNET data. Hence, while the
value of udp_require_l2_bcast could be altered on per-VNET basis, the
code in udp_input() would ignore it as it would always read the default
value (one) from the VNET master template.
Silence from: rstone
in_broadcast() was iterating over the ifnet address list without
first taking an IF_ADDR_RLOCK. This could cause a panic if a
concurrent operation modified the list.
Reviewed by: bz
MFC after: 2 months
Sponsored by: EMC / Isilon Storage Division
Differential Revision: https://reviews.freebsd.org/D7227
in_broadcast() can be quite expensive, so skip calling it if the
incoming mbuf wasn't sent to a broadcast L2 address in the first
place.
Reviewed by: gnn
MFC after: 2 months
Sponsored by: EMC / Isilon Storage Division
Differential Revision: https://reviews.freebsd.org/D7309
For almost every packet that is transmitted through ip_output(),
a call to in_broadcast() was made to decide if the destination
IP was a broadcast address. in_broadcast() iterates over the
ifnet's address to find a source IP matching the subnet of the
destination IP, and then checks if the IP is a broadcast in that
subnet.
This is completely redundant as we have already performed the
route lookup, so the source IP is already known. Just use that
address to directly check whether the destination IP is a
broadcast address or not.
MFC after: 2 months
Sponsored By: EMC / Isilon Storage Division
Differential Revision: https://reviews.freebsd.org/D7266