sbavail() returns u_int and sendwin is a uint32_t. Therefore, min() (which
operates on two u_int values) is able to correctly calculate the minimum
of these two arguments.
Reported by: rrs
MFC after: 1 week
Sponsored by: Netflix
This allows them to be sent in a non truncated way and addresses a warning
given by newver versions of gcc.
Thanks to Anselm Jonas Scholl for reporting it and providing a patch.
o Separate fields of struct socket that belong to listening from
fields that belong to normal dataflow, and unionize them. This
shrinks the structure a bit.
- Take out selinfo's from the socket buffers into the socket. The
first reason is to support braindamaged scenario when a socket is
added to kevent(2) and then listen(2) is cast on it. The second
reason is that there is future plan to make socket buffers pluggable,
so that for a dataflow socket a socket buffer can be changed, and
in this case we also want to keep same selinfos through the lifetime
of a socket.
- Remove struct struct so_accf. Since now listening stuff no longer
affects struct socket size, just move its fields into listening part
of the union.
- Provide sol_upcall field and enforce that so_upcall_set() may be called
only on a dataflow socket, which has buffers, and for listening sockets
provide solisten_upcall_set().
o Remove ACCEPT_LOCK() global.
- Add a mutex to socket, to be used instead of socket buffer lock to lock
fields of struct socket that don't belong to a socket buffer.
- Allow to acquire two socket locks, but the first one must belong to a
listening socket.
- Make soref()/sorele() to use atomic(9). This allows in some situations
to do soref() without owning socket lock. There is place for improvement
here, it is possible to make sorele() also to lock optionally.
- Most protocols aren't touched by this change, except UNIX local sockets.
See below for more information.
o Reduce copy-and-paste in kernel modules that accept connections from
listening sockets: provide function solisten_dequeue(), and use it in
the following modules: ctl(4), iscsi(4), ng_btsocket(4), ng_ksocket(4),
infiniband, rpc.
o UNIX local sockets.
- Removal of ACCEPT_LOCK() global uncovered several races in the UNIX
local sockets. Most races exist around spawning a new socket, when we
are connecting to a local listening socket. To cover them, we need to
hold locks on both PCBs when spawning a third one. This means holding
them across sonewconn(). This creates a LOR between pcb locks and
unp_list_lock.
- To fix the new LOR, abandon the global unp_list_lock in favor of global
unp_link_lock. Indeed, separating these two locks didn't provide us any
extra parralelism in the UNIX sockets.
- Now call into uipc_attach() may happen with unp_link_lock hold if, we
are accepting, or without unp_link_lock in case if we are just creating
a socket.
- Another problem in UNIX sockets is that uipc_close() basicly did nothing
for a listening socket. The vnode remained opened for connections. This
is fixed by removing vnode in uipc_close(). Maybe the right way would be
to do it for all sockets (not only listening), simply move the vnode
teardown from uipc_detach() to uipc_close()?
Sponsored by: Netflix
Differential Revision: https://reviews.freebsd.org/D9770
stack modules.
It adds support for mangling symbols exported by a module by prepending
a string to them. (This avoids overlapping symbols in the kernel linker.)
It allows the use of a macro as the module name in the DECLARE_MACRO()
and MACRO_VERSION() macros.
It allows the code to register stack aliases (e.g. both a generic name
["default"] and version-specific name ["default_10_3p1"]).
With these changes, it is trivial to compile TCP stack modules with
the name defined in the Makefile and to load multiple versions of the
same stack simultaneously. This functionality can be used to enable
side-by-side testing of an old and new version of the same TCP stack.
It also could support upgrading the TCP stack without a reboot.
Reviewed by: gnn, sjg (makefiles only)
Sponsored by: Netflix
Differential Revision: https://reviews.freebsd.org/D11086
The ICMP6 packets might not be contained in a single mbuf. So don't
assume this. Keep the IPv4 and IPv6 code in sync and make explicit
that the syncache code only need the TCP sequence number, not the
complete TCP header.
MFC after: 3 days
Sponsored by: Netflix, Inc.
response.
Delete an unneeded rate limit for UDP under IPv6. Because ICMP6
messages have their own rate limit, it is unnecessary to apply a
second rate limit to UDP messages.
Reviewed by: glebius
MFC after: 2 weeks
Sponsored by: Netflix
Differential Revision: https://reviews.freebsd.org/D10387
considering cache line hits and misses. Put the lock and hash list
glue into the first cache line, put inp_refcount inp_flags inp_socket
into the second cache line.
o On allocation zero out entire structure except the lock and list entries,
including inp_route inp_lle inp_gencnt. When inp_route and inp_lle were
introduced, they were added below inp_zero_size, resulting on not being
cleared after free/alloc. This definitely was a source of bugs with route
caching. Could be that r315956 has just fixed one of them.
The inp_gencnt is reinitialized on every alloc, so it is safe to clear it.
This has been proved to improve TCP performance at Netflix.
Obtained from: rrs
Differential Revision: D10686
if it is called on a TCP socket
* with an IPv6 address and the socket is bound to an
IPv4-mapped IPv6 address.
* with an IPv4-mapped IPv6 address and the socket is bound to an
IPv6 address.
Thanks to Jonathan T. Leighton for reporting this issue.
Reviewed by: bz gnn
MFC after: 3 days
Differential Revision: https://reviews.freebsd.org/D9163
r290383 has changed how mbufs sent by divert socket are handled.
Previously they are always handled by slow path processing in ip_input().
Now ip_tryforward() is invoked from ip_input() before in_broadcast() check.
Since diverted packet lost all mbuf flags, it passes the broadcast check
in ip_tryforward() due to missing M_BCAST flag. In the result the broadcast
packet is forwarded to the wire instead of be consumed by network stack.
Add in_broadcast() check to the div_output() function. And restore the
M_BCAST flag if destination address is broadcast for the given network
interface.
PR: 209491
MFC after: 1 week
A long long time ago the register keyword told the compiler to store
the corresponding variable in a CPU register, but it is not relevant
for any compiler used in the FreeBSD world today.
ANSIfy related prototypes while here.
Reviewed by: cem, jhb
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D10193
function (they used to say UMA_ZONE_NOFREE), so flag parameter goes away.
The zone_fini parameter also goes away. Previously no protocols (except
divert) supplied zone_fini function, so inpcb locks were leaked with slabs.
This was okay while zones were allocated with UMA_ZONE_NOFREE flag, but now
this is a leak. Fix that by suppling inpcb_fini() function as fini method
for all inpcb zones.
compiled into the kernel
This ensures that .iss_asm (the number of ASM listeners) isn't incorrectly
decremented for MLD-layer source datagrams when inspecting im*s_st[1]
(the second state in the structure).
MFC after: 2 months
PR: 217509 [1]
Reported by: Coverity (Isilon)
Reviewed by: ae ("This patch looks correct to me." [1])
Submitted by: Miles Ohlrich <miles.ohlrich@isilon.com>
Sponsored by: Dell EMC Isilon
It has strong locking model, doesn't have any timers associated with
entries. The entries theirselves are referenced only from the tcpcb zone,
which itself is a normal zone, without the UMA_ZONE_NOFREE flag.
that chooses right alias_address for outgoing packets that already have
corresponding state in one of aliasing instances. This feature works just fine
for ICMP, UDP, TCP and SCTP packes but not for others. For example,
outgoing PPtP/GRE packets always get alias_address of latest configured
instance no matter whether such packets have corresponding state or not.
This change unbreaks translation of transit PPtP/GRE connections
for "nat global" case fixing a bug in static ProtoAliasOut() function
that ignores its "create" argument and performs translation
regardless of its value. This static function is called only
by LibAliasOutLocked() function and only for packers other than
ICMP, UDP, TCP and SCTP. LibAliasOutLocked() passes its "create"
argument unmodified.
We have only two consumers of LibAliasOutLocked() in the source tree
calling it with "create" unequal to 1: "ipfw nat global" code and similar
natd code having same problem. All other consumers of LibAliasOutLocked()
call it with create = 1 and the patch is "no-op" for such cases.
PR: 218968
Approved by: ae, vsevolod (mentor)
MFC after: 1 week
This patch allows the MTU stored in the hostcache to be used as an
initial value for SCTP paths. When an ICMP PTB message is received,
store the MTU in the hostcache.
MFC after: 1 week
validation of SEG.ACK as the first step. If the ACK is not acceptable,
a RST segment should be sent and the segment should be dropped.
Up to now, the segment was partially processed.
This patch moves the check for the SEG.ACK validation up to the front
as required.
Reviewed by: hiren, gnn
MFC after: 1 week
Sponsored by: Netflix, Inc.
Differential Revision: https://reviews.freebsd.org/D10424
wait for the next enqueue from the driver.
Reviewed by: gnn@, hselasky@, gallatin@
MFC after: 1 week
Sponsored by: Chelsio Communications
Differential Revision: https://reviews.freebsd.org/D10432
patm(4) devices.
Maintaining an address family and framework has real costs when we make
infrastructure improvements. In the case of NATM we support no devices
manufactured in the last 20 years and some will not even work in modern
motherboards (some newer devices that patm(4) could be updated to
support apparently exist, but we do not currently have support).
With this change, support remains for some netgraph modules that don't
require NATM support code. It is unclear if all these should remain,
though ng_atmllc certainly stands alone.
Note well: FreeBSD 11 supports NATM and will continue to do so until at
least September 30, 2021. Improvements to the code in FreeBSD 11 are
certainly welcome.
Reviewed by: philip
Approved by: harti
-(SYNCOOKIE_LIFETIME + 1) instead of INT64_MIN, since it is
good enough and works when time_t is int32 or int64.
This fixes the issue reported by cy@ on i386.
Reported by: cy
MFC after: 1 week
Sponsored by: Netflix, Inc.
overflows, syncookies are used.
This patch restricts the usage of syncookies in this case: accept
syncookies only if there was an overflow of the syncache recently.
This mitigates a problem reported in PR217637, where is syncookie was
accepted without any recent drops.
Thanks to glebius@ for suggesting an improvement.
PR: 217637
Reviewed by: gnn, glebius
MFC after: 1 week
Sponsored by: Netflix, Inc.
Differential Revision: https://reviews.freebsd.org/D10272
do for streaming sockets.
And do more cleanup in the sbappendaddr_locked_internal() to prevent
leak information from existing mbuf to the one, that will be possible
created later by netgraph.
Suggested by: glebius
Tested by: Irina Liakh <spell at itl ua>
MFC after: 1 week
When checksums of received IP and UDP header already checked, UDP uses
sbappendaddr_locked() to pass received data to the socket.
sbappendaddr_locked() uses given mbuf as is, and if NIC supports checksum
offloading, mbuf contains csum_data and csum_flags that were calculated
for already stripped headers. Some NICs support only limited checksums
offloading and do not use CSUM_PSEUDO_HDR flag, and csum_data contains
some value that UDP/TCP should use for pseudo header checksum calculation.
When L2TP is used for tunneling with mpd5, ng_ksocket receives mbuf with
filled csum_flags and csum_data, that were calculated for outer headers.
When L2TP header is stripped, a packet that was tunneled goes to the IP
layer and due to presence of csum_flags (without CSUM_PSEUDO_HDR) and
csum_data, the UDP/TCP checksum check fails for this packet.
Reported by: Irina Liakh <spell at itl ua>
Tested by: Irina Liakh <spell at itl ua>
MFC after: 1 week
for example not in SYN-SENT.
This patch adds code to check the sysctl variable in other states than
LISTEN.
Thanks to ae and gnn for providing comments.
Reviewed by: gnn
MFC after: 1 week
Sponsored by: Netflix, Inc.
Differential Revision: https://reviews.freebsd.org/D9894
Switched from using timestamps to RTT estimates when performing TCP receive
buffer auto resizing, as not all hosts support / enable TCP timestamps.
Disabled reset of receive buffer auto scaling when not in bulk receive mode,
which gives an extra 20% performance increase.
Also extracted auto resizing to a common method shared between standard and
fastpath modules.
With this AWS S3 downloads at ~17ms latency on a 1Gbps connection jump from
~3MB/s to ~100MB/s using the default settings.
Reviewed by: lstewart, gnn
MFC after: 2 weeks
Relnotes: Yes
Sponsored by: Multiplay
Differential Revision: https://reviews.freebsd.org/D9668
r304436 attempted to optimize the handling of incoming UDP packet by only
making an expensive call to in_broadcast() if the mbuf was marked as an
broadcast packet. Unfortunately, this cannot work in the case of point-to-
point L2 protocols like PPP, which have no notion of "broadcast". The
optimization has been disabled for several months now with no progress
towards fixing it, so it needs to go.
This opcode can be used to attach some data to external action opcode.
And unlike to O_EXTERNAL_INSTANCE opcode, this opcode does not require
creating of named instance to pass configuration arguments to external
action handler. The data is coming just next to O_EXTERNAL_ACTION opcode.
The userlevel part currenly supports formatting for opcode with ipfw_insn
size, by default it expects u16 numeric value in the arg1.
Obtained from: Yandex LLC
MFC after: 2 weeks
Sponsored by: Yandex LLC
If a jail has an explicitly assigned loopback address then allow it to be
used instead of remapping requests for the loopback adddress to the first
IPv4 address assigned to the jail.
This fixes issues where applications attempt to detect their bound port
where they requested a loopback address, which was available, but instead
the kernel remapped it to the jails first address.
A example of this is binding nginx to 127.0.0.1 and then running "service
nginx upgrade" which before this change would cause nginx to fail.
Also:
* Correct the description of prison_check_ip4_locked to match the code.
MFC after: 2 weeks
Relnotes: Yes
Sponsored by: Multiplay
tcp_output.c was using a route on the stack for IPv6, which does not
allow route caching or LLE/ndp caching. Switch to using the route
(v6 flavor) in the in_pcb, which was already present, which caches
both L3 and L2 lookups.
Reviewed by: gnn hiren
MFC after: 2 weeks
ip_forward, TCP/IPv6, and probably SCTP leaked references to L2 cache
entry because they used their own routes on the stack, not in_pcb routes.
The original model for route caching was callers that provided a route
structure to ip{,6}input() would keep the route, and this model was used
for L2 caching as well. Instead, change L2 caching to be done by default
only when using a route structure in the in_pcb; the pcb deallocation
code frees L2 as well as L3 cacches. A separate change will add route
caching to TCP/IPv6.
Another suggestion was to have the transport protocols indicate willingness
to use L2 caching, but this approach keeps the changes in the network
level
Reviewed by: ae gnn
MFC after: 2 weeks
Differential Revision: https://reviews.freebsd.org/D10059
and those below, will be ignored--
> Description of fields to fill in above: 76 columns --|
> PR: If and which Problem Report is related.
> Submitted by: If someone else sent in the change.
> Reported by: If someone else reported the issue.
> Reviewed by: If someone else reviewed your modification.
> Approved by: If you needed approval for this commit.
> Obtained from: If the change is from a third party.
> MFC after: N [day[s]|week[s]|month[s]]. Request a reminder email.
> MFH: Ports tree branch name. Request approval for merge.
> Relnotes: Set to 'yes' for mention in release notes.
> Security: Vulnerability reference (one per line) or description.
> Sponsored by: If the change was sponsored by an organization.
> Differential Revision: https://reviews.freebsd.org/D### (*full* phabric URL needed).
> Empty fields above will be automatically removed.
M netinet/in_pcb.c
M netinet/ip_output.c
M netinet6/ip6_output.c
This is a painful change, but it is needed. On the one hand, we avoid
modifying them, and this slows down some ideas, on the other hand we still
eventually modify them and tools like netstat(1) never work on next version of
FreeBSD. We maintain a ton of spares in them, and we already got some ifdef
hell at the end of tcpcb.
Details:
- Hide struct inpcb, struct tcpcb under _KERNEL || _WANT_FOO.
- Make struct xinpcb, struct xtcpcb pure API structures, not including
kernel structures inpcb and tcpcb inside. Export into these structures
the fields from inpcb and tcpcb that are known to be used, and put there
a ton of spare space.
- Make kernel and userland utilities compilable after these changes.
- Bump __FreeBSD_version.
Reviewed by: rrs, gnn
Differential Revision: D10018
inet_ntoa() and inet_ntoa_r() take the address in network
byte-order. When I removed those calls, I should have
replaced them with ntohl() to make the hex addresses slightly
less unreadable. Here they are.
See r315277 regarding classic blunders.
vangyzen: you're deep in "no good deed" territory, it seems
--badger
Reported by: ian
MFC after: 3 days
MFC when: I finally get it right
Sponsored by: Dell EMC
When I made the changes in r313821, I fell victim to one of the
classic blunders, the most famous of which is: never get involved
in a land war in Asia. But only slightly less well known is this:
Keep your brain turned on and engaged when making a tedious, sweeping,
mechanical change. KTR can correctly log the immediate integral values
passed to it, as well as constant strings, but not non-constant strings,
since they might change by the time ktrdump retrieves them.
Reported by: glebius
MFC after: 3 days
Sponsored by: Dell EMC
This was introduced on accident in r165243, when return sites were unified
to add a lock around LibAliasProxyRule().
PR: 217749
Submitted by: Svyatoslav <razmyslov at viva64.com>
Sponsored by: Viva64 (PVS-Studio)
Unfortunately they will have different integer value due to Linux value being already assigned in FreeBSD.
The patch is similar to IP_RECVDSTADDR but also provides the destination port value to the application.
This allows/improves implementation of transparent proxies on UDP sockets due to having the whole information on forwarded packets.
Reviewed by: adrian, aw
Approved by: ae (mentor)
Sponsored by: rsync.net
Differential Revision: D9235
Renumber cluase 4 to 3, per what everybody else did when BSD granted
them permission to remove clause 3. My insistance on keeping the same
numbering for legal reasons is too pedantic, so give up on that point.
Submitted by: Jan Schaumann <jschauma@stevens.edu>
Pull Request: https://github.com/freebsd/freebsd/pull/96
least 2 * MSS. However, if the receive buffer size is small, this might
be impossible. Add back a criterion to send a TCP window update if
the window can be increased by at least half of the receive buffer size.
This condition was removed in r242252. This patch simply brings it back.
PR: 211003
Reviewed by: gnn
MFC after: 1 week
Sponsored by: Netflix, Inc.
Differential Revision: https://reviews.freebsd.org/D9475
inet_ntoa() cannot be used safely in a multithreaded environment
because it uses a static local buffer. Remove it from the kernel.
Suggested by: glebius, emaste
Reviewed by: gnn
MFC after: never
Sponsored by: Dell EMC
Differential Revision: https://reviews.freebsd.org/D9625
inet_ntoa() cannot be used safely in a multithreaded environment
because it uses a static local buffer. Instead, use inet_ntoa_r()
with a buffer on the caller's stack.
Suggested by: glebius, emaste
Reviewed by: gnn
MFC after: 2 weeks
Sponsored by: Dell EMC
Differential Revision: https://reviews.freebsd.org/D9625
The inpcb structure has inp_sp pointer that is initialized by
ipsec_init_pcbpolicy() function. This pointer keeps strorage for IPsec
security policies associated with a specific socket.
An application can use IP_IPSEC_POLICY and IPV6_IPSEC_POLICY socket
options to configure these security policies. Then ip[6]_output()
uses inpcb pointer to specify that an outgoing packet is associated
with some socket. And IPSEC_OUTPUT() method can use a security policy
stored in the inp_sp. For inbound packet the protocol-specific input
routine uses IPSEC_CHECK_POLICY() method to check that a packet conforms
to inbound security policy configured in the inpcb.
SCTP protocol doesn't specify inpcb for ip[6]_output() when it sends
packets. Thus IPSEC_OUTPUT() method does not consider such packets as
associated with some socket and can not apply security policies
from inpcb, even if they are configured. Since IPSEC_CHECK_POLICY()
method is called from protocol-specific input routine, it can specify
inpcb pointer and associated with socket inbound policy will be
checked. But there are two problems:
1. Such check is asymmetric, becasue we can not apply security policy
from inpcb for outgoing packet.
2. IPSEC_CHECK_POLICY() expects that caller holds INPCB lock and
access to inp_sp is protected. But for SCTP this is not correct,
becasue SCTP uses own locks to protect inpcb.
To fix these problems remove IPsec related PCB code from SCTP.
This imply that IP_IPSEC_POLICY and IPV6_IPSEC_POLICY socket options
will be not applicable to SCTP sockets. To be able correctly check
inbound security policies for SCTP, mark its protocol header with
the PR_LASTHDR flag.
Reported by: tuexen
Reviewed by: tuexen
Differential Revision: https://reviews.freebsd.org/D9538
If the TCP stack has retransmitted more than 1/4 of the total
number of retransmits before a connection drop, it decides that
its current RTT estimate is hopelessly out of date and decides
to recalculate it from scratch starting with the next ACK.
Unfortunately, it implements this by zeroing out the current RTT
estimate. Drop this hack entirely, as it makes it significantly more
difficult to debug connection issues. Instead check for excessive
retransmits at the point where srtt is updated from an ACK being
received. If we've exceeded 1/4 of the maximum retransmits,
discard the previous srtt estimate and replace it with the latest
rtt measurement.
Differential Revision: https://reviews.freebsd.org/D9519
Reviewed by: gnn
Sponsored by: Dell EMC Isilon
Unfortunately they will have different integer value due to Linux value being already assigned in FreeBSD.
The patch is similar to IP_RECVDSTADDR but also provides the destination port value to the application.
This allows/improves implementation of transparent proxies on UDP sockets due to having the whole information on forwarded packets.
Sponsored-by: rsync.net
Differential Revision: D9235
Reviewed-by: adrian
If multiple threads emit a UDP log_in_vain message concurrently,
the IP addresses could be garbage due to concurrent usage of a
single string buffer inside inet_ntoa(). Use inet_ntoa_r() with
two stack buffers instead.
Reported by: Mark Martinec <Mark.Martinec+freebsd@ijs.si>
MFC after: 3 days
Relnotes: yes
Sponsored by: Dell EMC
Small summary
-------------
o Almost all IPsec releated code was moved into sys/netipsec.
o New kernel modules added: ipsec.ko and tcpmd5.ko. New kernel
option IPSEC_SUPPORT added. It enables support for loading
and unloading of ipsec.ko and tcpmd5.ko kernel modules.
o IPSEC_NAT_T option was removed. Now NAT-T support is enabled by
default. The UDP_ENCAP_ESPINUDP_NON_IKE encapsulation type
support was removed. Added TCP/UDP checksum handling for
inbound packets that were decapsulated by transport mode SAs.
setkey(8) modified to show run-time NAT-T configuration of SA.
o New network pseudo interface if_ipsec(4) added. For now it is
build as part of ipsec.ko module (or with IPSEC kernel).
It implements IPsec virtual tunnels to create route-based VPNs.
o The network stack now invokes IPsec functions using special
methods. The only one header file <netipsec/ipsec_support.h>
should be included to declare all the needed things to work
with IPsec.
o All IPsec protocols handlers (ESP/AH/IPCOMP protosw) were removed.
Now these protocols are handled directly via IPsec methods.
o TCP_SIGNATURE support was reworked to be more close to RFC.
o PF_KEY SADB was reworked:
- now all security associations stored in the single SPI namespace,
and all SAs MUST have unique SPI.
- several hash tables added to speed up lookups in SADB.
- SADB now uses rmlock to protect access, and concurrent threads
can do SA lookups in the same time.
- many PF_KEY message handlers were reworked to reflect changes
in SADB.
- SADB_UPDATE message was extended to support new PF_KEY headers:
SADB_X_EXT_NEW_ADDRESS_SRC and SADB_X_EXT_NEW_ADDRESS_DST. They
can be used by IKE daemon to change SA addresses.
o ipsecrequest and secpolicy structures were cardinally changed to
avoid locking protection for ipsecrequest. Now we support
only limited number (4) of bundled SAs, but they are supported
for both INET and INET6.
o INPCB security policy cache was introduced. Each PCB now caches
used security policies to avoid SP lookup for each packet.
o For inbound security policies added the mode, when the kernel does
check for full history of applied IPsec transforms.
o References counting rules for security policies and security
associations were changed. The proper SA locking added into xform
code.
o xform code was also changed. Now it is possible to unregister xforms.
tdb_xxx structures were changed and renamed to reflect changes in
SADB/SPDB, and changed rules for locking and refcounting.
Reviewed by: gnn, wblock
Obtained from: Yandex LLC
Relnotes: yes
Sponsored by: Yandex LLC
Differential Revision: https://reviews.freebsd.org/D9352
not being initialized, and the per-vnet fastopen context was only
being initialized for the default vnet.
PR: 216613
Reported by: Alex Deiter <alex dot deiter at gmail dot com>
MFC after: 1 week
space available for chunks. This unbreaks the handling of
ICMPV6 packets indicating "packet too big". It just worked
for IPv4 since we are overbooking for IPv4.
MFC after: 1 week
regardless of what the default stack for the system is set to.
With current/default behavior, after changing the default tcp stack, the
application needs to be restarted to pick up that change. Setting this new knob
net.inet.tcp.functions_inherit_listen_socket_stack to '0' would change that
behavior and make any new connection use the newly selected default tcp stack.
Reviewed by: rrs
MFC after: 2 weeks
Sponsored by: Limelight Networks
(intentionally) deleted first and then completely added again (so all the
events, announces and hooks are given a chance to run).
This cause an issue with CARP where the existing CARP data structure is
removed together with the last address for a given VHID, which will cause
a subsequent fail when the address is later re-added.
This change fixes this issue by adding a new flag to keep the CARP data
structure when an address is not being removed.
There was an additional issue with IPv6 CARP addresses, where the CARP data
structure would never be removed after a change and lead to VHIDs which
cannot be destroyed.
Reviewed by: glebius
Obtained from: pfSense
MFC after: 2 weeks
Sponsored by: Rubicon Communications, LLC (Netgate)
- Add RATELIMIT kernel configuration keyword which must be set to
enable the new functionality.
- Add support for hardware driven, Receive Side Scaling, RSS aware, rate
limited sendqueues and expose the functionality through the already
established SO_MAX_PACING_RATE setsockopt(). The API support rates in
the range from 1 to 4Gbytes/s which are suitable for regular TCP and
UDP streams. The setsockopt(2) manual page has been updated.
- Add rate limit function callback API to "struct ifnet" which supports
the following operations: if_snd_tag_alloc(), if_snd_tag_modify(),
if_snd_tag_query() and if_snd_tag_free().
- Add support to ifconfig to view, set and clear the IFCAP_TXRTLMT
flag, which tells if a network driver supports rate limiting or not.
- This patch also adds support for rate limiting through VLAN and LAGG
intermediate network devices.
- How rate limiting works:
1) The userspace application calls setsockopt() after accepting or
making a new connection to set the rate which is then stored in the
socket structure in the kernel. Later on when packets are transmitted
a check is made in the transmit path for rate changes. A rate change
implies a non-blocking ifp->if_snd_tag_alloc() call will be made to the
destination network interface, which then sets up a custom sendqueue
with the given rate limitation parameter. A "struct m_snd_tag" pointer is
returned which serves as a "snd_tag" hint in the m_pkthdr for the
subsequently transmitted mbufs.
2) When the network driver sees the "m->m_pkthdr.snd_tag" different
from NULL, it will move the packets into a designated rate limited sendqueue
given by the snd_tag pointer. It is up to the individual drivers how the rate
limited traffic will be rate limited.
3) Route changes are detected by the NIC drivers in the ifp->if_transmit()
routine when the ifnet pointer in the incoming snd_tag mismatches the
one of the network interface. The network adapter frees the mbuf and
returns EAGAIN which causes the ip_output() to release and clear the send
tag. Upon next ip_output() a new "snd_tag" will be tried allocated.
4) When the PCB is detached the custom sendqueue will be released by a
non-blocking ifp->if_snd_tag_free() call to the currently bound network
interface.
Reviewed by: wblock (manpages), adrian, gallatin, scottl (network)
Differential Revision: https://reviews.freebsd.org/D3687
Sponsored by: Mellanox Technologies
MFC after: 3 months
sources to return timestamps when SO_TIMESTAMP is enabled. Two additional
clock sources are:
o nanosecond resolution realtime clock (equivalent of CLOCK_REALTIME);
o nanosecond resolution monotonic clock (equivalent of CLOCK_MONOTONIC).
In addition to this, this option provides unified interface to get bintime
(equivalent of using SO_BINTIME), except it also supported with IPv6 where
SO_BINTIME has never been supported. The long term plan is to depreciate
SO_BINTIME and move everything to using SO_TS_CLOCK.
Idea for this enhancement has been briefly discussed on the Net session
during dev summit in Ottawa last June and the general input was positive.
This change is believed to benefit network benchmarks/profiling as well
as other scenarios where precise time of arrival measurement is necessary.
There are two regression test cases as part of this commit: one extends unix
domain test code (unix_cmsg) to test new SCM_XXX types and another one
implementis totally new test case which exchanges UDP packets between two
processes using both conventional methods (i.e. calling clock_gettime(2)
before recv(2) and after send(2)), as well as using setsockopt()+recv() in
receive path. The resulting delays are checked for sanity for all supported
clock types.
Reviewed by: adrian, gnn
Differential Revision: https://reviews.freebsd.org/D9171
header match when using a raw socket to send IPv4 packets and
providing the header. If they don't match, let send return -1
and set errno to EINVAL.
Before this patch is was only enforced that the length in the header
is not larger then the buffer length.
PR: 212283
Reviewed by: ae, gnn
MFC after: 1 week
Differential Revision: https://reviews.freebsd.org/D9161
and tw_so_options defined here which is supposed to be a copy of the
former (short vs u_short respectively).
Switch tw_so_options to be "signed short" to match the type of the field
it's inherited from.
dangerous. Those wanting data from an mbuf should use DTrace itself to get
the data.
PR: 203409
Reviewed by: hiren
MFC after: 1 week
Sponsored by: Limelight Networks
Differential Revision: https://reviews.freebsd.org/D9035
structs under the INET6 #ifdef. Similarly (even though it doesn't seem
to affect the build), conditionalize all IPv4 structs under the INET
#ifdef
This also unbreaks the LINT-NOINET6 tinderbox target on amd64; I have not
verified other MACHINE/TARGET pairs (e.g. armv6/arm).
MFC after: 2 weeks
X-MFC with: r310847
Pointyhat to: jpaetzel
Reported by: O. Hartmann <o.hartmann@walstatt.org>
If there is a loop in the network a CARP that is in MASTER state will see it's
own broadcasts, which will then cause it to assume BACKUP state. When it
assumes BACKUP it will stop sending advertisements. In that state it will no
longer see advertisements and will assume MASTER...
We can't catch all the cases where we are seeing our own CARP broadcast, but
we can catch the obvious case.
Submitted by: torek
Obtained from: FreeNAS
MFC after: 2 weeks
Sponsored by: iXsystems
In case of the empty queue tp->snd_holes and tcp_sackhole_insert()
failing due to memory shortage, tp->snd_holes will be empty.
This problem was hit when stress tests where performed by pho.
PR: 215513
Reported by: pho
Tested by: pho
Sponsored by: Netflix, Inc.
This can lead to change of mbuf pointer (packet filter could do m_pullup(),
NAT, etc). Also in case of change of destination address, tryforward can
decide that packet should be handled by local system. In this case modified
mbuf can be returned to the ip[6]_input(). To handle this correctly, check
M_FASTFWD_OURS flag after return from ip[6]_tryforward. And if it is present,
update variables that depend from mbuf pointer and skip another inbound
firewall processing.
No objection from: #network
MFC after: 3 weeks
Sponsored by: Yandex LLC
Differential Revision: https://reviews.freebsd.org/D8764
in6p_options to check that. That is incorrect as we carry ip options in
in6p_outputopts. Also, just checking for in6p_outputopts being NULL won't
suffice as we combine ip options and ip header fields both in that one field.
The commit fixes this by using ip6_optlen() which correctly calculates length
of only ip options for IPv6.
Reviewed by: ae, bz
MFC after: 3 weeks
Sponsored by: Limelight Networks
This fixes a bug where the wrong ppid was reported, if
* I-DATA was used on the first fragement was not received first
* DATA was used and different ppids where used.
Thanks to Julian Cordes for making me aware of the issue.
MFC after: 1 week
This made a couple of bugs visible in handling SSN wrap-arounds
when using DATA chunks. Now bulk transfer seems to work fine...
This fixes the issue reported in
https://github.com/sctplab/usrsctp/issues/111
MFC after: 1 week
The tools using to generate the sources has been updated and produces
different whitespaces. Commit this seperately to avoid intermixing
these with real code changes.
MFC after: 3 days
When a TCP segment with the FIN bit set was received in the CLOSED state,
a TCP RST-ACK-segment is sent. When computing SEG.ACK for this, the
FIN counts as one byte. This accounting was missing and is fixed by this
patch.
Reviewed by: hiren
MFC after: 1 month
Sponsored by: Netflix, Inc.
Differential Revision: https://svn.freebsd.org/base/head
many borken middle-boxes tend to do that. But during 3whs, in syncache_expand(),
we don't do that which causes us to send a RST to such a client. Relax this
constraint by only using tsecr to compare against timestamp that we sent when it
is not 0. As a result, we'd now accept the final ACK of 3whs with tsecr of 0.
Reviewed by: jtl, gnn
Sponsored by: Limelight Networks
Differential Revision: https://reviews.freebsd.org/D8552
This does not cover state changes from TIME-WAIT.
Reviewed by: gnn
MFC after: 3 weeks
Sponsored by: Netflix
Differential Revision: https://reviews.freebsd.org/D8443
either in the CLOSING or LAST-ACK state.
Reviewed by: hiren
MFC after: 3 weeks
Sponsored by: Netflix
Differential Revision: https://reviews.freebsd.org/D8371
introduced with r261242. The useful and expected soisconnected()
call is done in tcp_do_segment().
Has been found as part of unrelated PR:212920 investigation.
Improve slightly (~2%) the maximum number of TCP accept per second.
Tested by: kevin.bowling_kev009.com, jch
Approved by: gnn, hiren
MFC after: 1 week
Sponsored by: Verisign, Inc
Differential Revision: https://reviews.freebsd.org/D8072
loss event but not use or obay the recommendations i.e. values set by it in some
cases.
Here is an attempt to solve that confusion by following relevant RFCs/drafts.
Stack only sets congestion window/slow start threshold values when there is no
CC module availalbe to take that action. All CC modules are inspected and
updated when needed to take appropriate action on loss.
tcp_stacks/fastpath module has been updated to adapt these changes.
Note: Probably, the most significant change would be to not bring congestion
window down to 1MSS on a loss signaled by 3-duplicate acks and letting
respective CC decide that value.
In collaboration with: Matt Macy <mmacy at nextbsd dot org>
Discussed on: transport@ mailing list
Reviewed by: jtl
MFC after: 1 month
Sponsored by: Limelight Networks
Differential Revision: https://reviews.freebsd.org/D8225
In r304435, ip_output() was changed to use the result of the route
lookup to decide whether the outgoing packet was a broadcast or
not. This introduced a regression on interfaces where
IFF_BROADCAST was not set (e.g. point-to-point links), as the
algorithm could incorrectly treat the destination address as a
broadcast address, and ip_output() would subsequently drop the
packet as broadcasting on a non-IFF_BROADCAST interface is not
allowed.
Differential Revision: https://reviews.freebsd.org/D8303
Reviewed by: jtl
Reported by: ambrisko
MFC after: 2 weeks
X-MFC-With: r304435
Sponsored by: Dell EMC Isilon
handling. Ensure that:
* Protocol unreachable errors are handled by indicating ECONNREFUSED
to the TCP user for both IPv4 and IPv6. These were ignored for IPv6.
* Communication prohibited errors are handled by indicating ECONNREFUSED
to the TCP user for both IPv4 and IPv6. These were ignored for IPv6.
* Hop Limited exceeded errors are handled by indicating EHOSTUNREACH
to the TCP user for both IPv4 and IPv6.
For IPv6 the TCP connected was dropped but errno wasn't set.
Reviewed by: gallatin, rrs
MFC after: 1 month
Sponsored by: Netflix
Differential Revision: 7904
after having been dropped.
This fixes enforces in_pcbdrop() logic in tcp_input():
"in_pcbdrop() is used by TCP to mark an inpcb as unused and avoid future packet
delivery or event notification when a socket remains open but TCP has closed."
PR: 203175
Reported by: Palle Girgensohn, Slawa Olhovchenkov
Tested by: Slawa Olhovchenkov
Reviewed by: Slawa Olhovchenkov
Approved by: gnn, Slawa Olhovchenkov
Differential Revision: https://reviews.freebsd.org/D8211
MFC after: 1 week
Sponsored by: Verisign, inc
to at least 64.
This is still just a coverup to avoid kernel panic and not an actual fix.
PR: 213232
Reviewed by: glebius
MFC after: 1 week
Sponsored by: Limelight Networks
Differential Revision: https://reviews.freebsd.org/D8272
Also renamed some tfo labels and added/reworked comments for clarity.
Based on an initial patch from jtl.
PR: 213424
Reviewed by: jtl
MFC after: 1 week
Differential Revision: https://reviews.freebsd.org/D8235
compile when that option is configured. In tcp_destroy(), the error
variable is now only used in code enclosed in an '#ifdef TCP_HHOOK' block.
This broke the build for VNET images.
Enclose the error variable itself in an #ifdef block.
Submitted by: Shawn Webb <shawn.webb at hardenedbsd.org>
Reported by: Shawn Webb <shawn.webb at hardenedbsd.org>
PointyHat to: jtl
received on a TCP session that has entered the ESTABLISHED state. This
results in a lot of calls to reset the keepalive timer.
This patch changes the behavior so we set the keepalive timer for the
keepalive idle time (TP_KEEPIDLE). When the keepalive timer fires, it will
first check to see if the session has been idle for TP_KEEPIDLE ticks. If
not, it will reschedule the keepalive timer for the time the session will
have been idle for TP_KEEPIDLE ticks.
For a session with regular communication, the keepalive timer should fire
approximately once every TP_KEEPIDLE ticks. For sessions with irregular
communication, the keepalive timer might fire more often. But, the
disruption from a periodic keepalive timer should be less than the regular
cost of resetting the keepalive timer on every packet.
(FWIW, this change saved approximately 1.73% of the busy CPU cycles on a
particular test system with a heavy TCP output load. Of course, the
actual impact is very specific to the particular hardware and workload.)
Reviewed by: gallatin, rrs
MFC after: 2 weeks
Sponsored by: Netflix
Differential Revision: https://reviews.freebsd.org/D8243
the TCP_RFC7413 kernel option. This change removes those few instructions
from the packet processing path.
While not strictly necessary, for the sake of consistency, I applied the
new IS_FASTOPEN macro to all places in the packet processing path that
used the (t_flags & TF_FASTOPEN) check.
Reviewed by: hiren
Sponsored by: Netflix
Differential Revision: https://reviews.freebsd.org/D8219
TCPCB, it checks (so->so_options & SO_ACCEPTCONN) to determine whether or
not the socket is a listening socket. However, this causes the code to
access a different cacheline. If we first check if the socket is in the
LISTEN state, we can avoid accessing so->so_options when processing packets
received for ESTABLISHED sessions.
If INVARIANTS is defined, the code still needs to access both variables to
check that so->so_options is consistent with the state.
Reviewed by: gallatin
MFC after: 1 week
Sponsored by: Netflix
to add actions that run when a TCP frame is sent or received on a TCP
session in the ESTABLISHED state. In the base tree, this functionality is
only used for the h_ertt module, which is used by the cc_cdg, cc_chd, cc_hd,
and cc_vegas congestion control modules.
Presently, we incur overhead to check for hooks each time a TCP frame is
sent or received on an ESTABLISHED TCP session.
This change adds a new compile-time option (TCP_HHOOK) to determine whether
to include the hhook(9) framework for TCP. To retain backwards
compatibility, I added the TCP_HHOOK option to every configuration file that
already defined "options INET". (Therefore, this patch introduces no
functional change. In order to see a functional difference, you need to
compile a custom kernel without the TCP_HHOOK option.) This change will
allow users to easily exclude this functionality from their kernel, should
they wish to do so.
Note that any users who use a custom kernel configuration and use one of the
congestion control modules listed above will need to add the TCP_HHOOK
option to their kernel configuration.
Reviewed by: rrs, lstewart, hiren (previous version), sjg (makefiles only)
Sponsored by: Netflix
Differential Revision: https://reviews.freebsd.org/D8185
This change extends the nd6 lock to protect the ND prefix list as well
as the list of advertising routers associated with each prefix. To handle
cases where the nd6 lock must be dropped while iterating over either the
prefix or default router lists, a generation counter is used to track
modifications to the lists. Additionally, a new mutex is used to serialize
prefix on-link/off-link transitions. This mutex must be acquired before
the nd6 lock and is held while updating the routing table in
nd6_prefix_onlink() and nd6_prefix_offlink().
Reviewed by: ae, tuexen (SCTP bits)
Tested by: Jason Wolfe <jason@llnw.com>,
Larry Rosenman <ler@lerctr.org>
MFC after: 2 months
Differential Revision: https://reviews.freebsd.org/D8125
In the persist case, take the SYN and FIN flags into account when updating
the sequence space sent.
Reviewed by: gnn
MFC after: 2 weeks
Sponsored by: Juniper Networks, Netflix
Differential Revision: https://reviews.freebsd.org/D7075
Tested by: Limelight, Netflix
A single gratuitous ARP (GARP) is always transmitted when an IPv4
address is added to an interface, and that is usually sufficient.
However, in some circumstances, such as when a shared address is
passed between cluster nodes, this single GARP may occasionally be
dropped or lost. This can lead to neighbors on the network link
working with a stale ARP cache and sending packets destined for
that address to the node that previously owned the address, which
may not respond.
To avoid this situation, GARP retransmissions can be enabled by setting
the net.link.ether.inet.garp_rexmit_count sysctl to a value greater
than zero. The setting represents the maximum number of retransmissions.
The interval between retransmissions is calculated using an exponential
backoff algorithm, doubling each time, so the retransmission intervals
are: {1, 2, 4, 8, 16, ...} (seconds).
Due to the exponential backoff algorithm used for the interval
between GARP retransmissions, the maximum number of retransmissions
is limited to 16 for sanity. This limit corresponds to a maximum
interval between retransmissions of 2^16 seconds ~= 18 hours.
Increasing this limit is possible, but sending out GARPs spaced
days apart would be of little use.
Submitted by: David A. Bright <david.a.bright@dell.com>
MFC after: 1 month
Relnotes: yes
Sponsored by: Dell EMC
Differential Revision: https://reviews.freebsd.org/D7695
is NULL and the function jumps to the "release:" label.
For this case, the "inp" was write locked, but the code attempted to
read unlock it. This patch fixes the problem.
This case could occur for NFS over UDP mounts, where the server was
down for a few minutes under certain circumstances.
Reported by: bde
Tested by: bde
Reviewed by: gnn
MFC after: 2 weeks
during testing of network related changes where cached entries may pollute your
results, or during known congestion events where you don't want to unfairly
penalize hosts.
Prior to r232346 this would have meant you would break any connection with a sub
1500 MTU, as the hostcache was authoritative. All entries as they stand today
should simply be used to pre populate values for efficiency.
Submitted by: Jason Wolfe (j at nitrology dot com)
Reviewed by: rwatson, sbruno, rrs , bz (earlier version)
MFC after: 2 weeks
Sponsored by: Limelight Networks
Differential Revision: https://reviews.freebsd.org/D6198
Restructure code slightly to save ip_tos bits earlier. Fix the bug
where the ip_tos field is zeroed out before assigning to the iptos
variable. Restore the ip_tos and ip_ver fields only if they have
been zeroed during the pseudo-header checksum calculation.
Reviewed by: cem, gnn, hiren
MFC after: 2 weeks
Differential Revision: https://reviews.freebsd.org/D8053
As a side effect of r261242 when using accept_filter the
first call to soisconnected() is done earlier in tcp_input()
instead of tcp_do_segment() context. Restore the expected behaviour.
Note: This call to soisconnected() seems to be extraneous in all
cases (with or without accept_filter). Will be addressed in a
separate commit.
PR: 212920
Reported by: Alexey
Tested by: Alexey, jch
Sponsored by: Verisign, Inc.
MFC after: 1 week
While here, prefer if_addrhead (FreeBSD) to if_addrlist (BSD compat) naming
for the interface address list in sctp_bsd_addr.c
Reviewed by: tuexen
Differential Revision: https://reviews.freebsd.org/D8051
Descriptor returned by accept(2) should inherits capabilities rights from
the listening socket.
PR: 201052
Reviewed by: emaste, jonathan
Discussed with: many
Differential Revision: https://reviews.freebsd.org/D7724
There were two bugs:
* There was an accounting bug resulting in reporting a too small a_rwnd.
* There are a bug when abandoning messages in the reassembly queue.
MFC after: 4 weeks
* TCP_KEEPINIT
* TCP_KEEPINTVL
* TCP_KEEPIDLE
* TCP_KEEPCNT
always always report the values currently used when getsockopt()
is used. This wasn't the case when the sysctl-inherited default
values where used.
Ensure that the IPPROTO_TCP level socket option TCP_INFO has the
TCPI_OPT_ECN flag set in the tcpi_options field when ECN support
has been negotiated successfully.
Reviewed by: rrs, jtl, hiren
MFC after: 1 month
Differential Revision: 7833
warning:
sys/netinet/igmp.c:546:21: error: implicit conversion from 'int' to 'char' changes value from 148 to -108 [-Werror,-Wconstant-conversion]
p->ipopt_list[0] = IPOPT_RA; /* Router Alert Option */
~ ^~~~~~~~
sys/netinet/ip.h:153:19: note: expanded from macro 'IPOPT_RA'
#define IPOPT_RA 148 /* router alert */
^~~
This is because ipopt_list is an array of char, so IPOPT_RA is wrapped
to a negative value. It would be nice to change ipopt_list to an array
of u_char, but it changes the signature of the public struct ipoption,
so add an explicit cast to suppress the warning.
Reviewed by: imp
MFC after: 3 days
Differential Revision: https://reviews.freebsd.org/D7777
can be in after receiving a FIN.
FWIW, NetBSD has this change for quite some time.
This has been tested at Netflix and Limelight in production traffic.
Reported by: Sam Kumar <samkumar99 at gmail.com> on transport@
Reviewed by: rrs
MFC after: 4 weeks
Sponsored by: Limelight Networks
Differential Revision: https://reviews.freebsd.org/D7475
I-FORWARD-TSN chunk before any DATA or I-DATA chunk.
Thanks to Julian Cordes for finding this problem and prividing
packetdrill scripts to reporduce the issue.
MFC after: 3 days
tso_segsz pkthdr field during RX processing, and use the information in TCP for
more correct accounting and as a congestion control input. This is only a start,
and an audit of other uses for the data is left as future work.
Reviewed by: gallatin, rrs
Sponsored by: Netflix, Inc.
Differential Revision: https://reviews.freebsd.org/D7564
r304436 attempted to optimize the handling of incoming UDP packet by only
making an expensive call to in_broadcast() if the mbuf was marked as an
broadcast packet. Unfortunately, this cannot work in the case of point-to-
point L2 protocols like PPP, which have no notion of "broadcast".
Discussions on how to properly fix r304436 are ongoing, but in the meantime
disable the optimization to ensure that no existing network setups are broken.
Reported by: bms
First, keep a ref count on the stcb after looking it up, as
done in the other lookup cases.
Second, before looking again at sp, ensure that it is not
freed, because the assoc is about to be freed.
MFC after: 3 days
The default value of the tunable introduced in r304436 couldn't be
effectively overrided on VIMAGE kernels, because instead of being
accessed via the appropriate VNET() accessor macro, it was accessed
via the VNET_NAME() macro, which resolves to the (should-be) read-only
master template of initial values of per-VNET data. Hence, while the
value of udp_require_l2_bcast could be altered on per-VNET basis, the
code in udp_input() would ignore it as it would always read the default
value (one) from the VNET master template.
Silence from: rstone
in_broadcast() was iterating over the ifnet address list without
first taking an IF_ADDR_RLOCK. This could cause a panic if a
concurrent operation modified the list.
Reviewed by: bz
MFC after: 2 months
Sponsored by: EMC / Isilon Storage Division
Differential Revision: https://reviews.freebsd.org/D7227
in_broadcast() can be quite expensive, so skip calling it if the
incoming mbuf wasn't sent to a broadcast L2 address in the first
place.
Reviewed by: gnn
MFC after: 2 months
Sponsored by: EMC / Isilon Storage Division
Differential Revision: https://reviews.freebsd.org/D7309
For almost every packet that is transmitted through ip_output(),
a call to in_broadcast() was made to decide if the destination
IP was a broadcast address. in_broadcast() iterates over the
ifnet's address to find a source IP matching the subnet of the
destination IP, and then checks if the IP is a broadcast in that
subnet.
This is completely redundant as we have already performed the
route lookup, so the source IP is already known. Just use that
address to directly check whether the destination IP is a
broadcast address or not.
MFC after: 2 months
Sponsored By: EMC / Isilon Storage Division
Differential Revision: https://reviews.freebsd.org/D7266
alternate TCP stack in other then the closed state (pre-listen/connect).
The idea is that *if* that is supported by the alternate stack, it
is asked if its ok to switch. If it approves the "handoff" then we
allow the switch to happen. Also the fini() function now gets a flag
to tell if you are switching away *or* the tcb is destroyed. The
init() call into the alternate stack is moved to the end so the
tcb is more fully formed before the init transpires.
Sponsored by: Netflix Inc.
Differential Revision: D6790
The module works together with ipfw(4) and implemented as its external
action module.
Stateless NAT64 registers external action with name nat64stl. This
keyword should be used to create NAT64 instance and to address this
instance in rules. Stateless NAT64 uses two lookup tables with mapped
IPv4->IPv6 and IPv6->IPv4 addresses to perform translation.
A configuration of instance should looks like this:
1. Create lookup tables:
# ipfw table T46 create type addr valtype ipv6
# ipfw table T64 create type addr valtype ipv4
2. Fill T46 and T64 tables.
3. Add rule to allow neighbor solicitation and advertisement:
# ipfw add allow icmp6 from any to any icmp6types 135,136
4. Create NAT64 instance:
# ipfw nat64stl NAT create table4 T46 table6 T64
5. Add rules that matches the traffic:
# ipfw add nat64stl NAT ip from any to table(T46)
# ipfw add nat64stl NAT ip from table(T64) to 64:ff9b::/96
6. Configure DNS64 for IPv6 clients and add route to 64:ff9b::/96
via NAT64 host.
Stateful NAT64 registers external action with name nat64lsn. The only
one option required to create nat64lsn instance - prefix4. It defines
the pool of IPv4 addresses used for translation.
A configuration of instance should looks like this:
1. Add rule to allow neighbor solicitation and advertisement:
# ipfw add allow icmp6 from any to any icmp6types 135,136
2. Create NAT64 instance:
# ipfw nat64lsn NAT create prefix4 A.B.C.D/28
3. Add rules that matches the traffic:
# ipfw add nat64lsn NAT ip from any to A.B.C.D/28
# ipfw add nat64lsn NAT ip6 from any to 64:ff9b::/96
4. Configure DNS64 for IPv6 clients and add route to 64:ff9b::/96
via NAT64 host.
Obtained from: Yandex LLC
Relnotes: yes
Sponsored by: Yandex LLC
Differential Revision: https://reviews.freebsd.org/D6434
The current in_pcb.h includes route.h, which includes sockaddr structures.
Including <sys/socketvar.h> should require <sys/socket.h>; add it in
the appropriate place.
PR: 211385
Submitted by: Sergey Kandaurov and iron at mail.ua
Reviewed by: gnn
Approved by: gnn (mentor)
MFC after: 1 day
Now zero value of arg1 used to specify "tablearg", use the old "tablearg"
value for "nat global". Introduce new macro IP_FW_NAT44_GLOBAL to replace
hardcoded magic number to specify "nat global". Also replace 65535 magic
number with corresponding macro. Fix typo in comments.
PR: 211256
Tested by: Victor Chernov
MFC after: 3 days
unordered user messages using DATA chunks as invalid ones.
While there, ensure that error causes are provided when
sending ABORT chunks in case of reassembly problems detected.
Thanks to Taylor Brandstetter for making me aware of this problem.
MFC after: 3 days
_prison_check_ip4 renamed to prison_check_ip4_locked
Move IPv6-specific jail functions to new file netinet6/in6_jail.c
_prison_check_ip6 renamed to prison_check_ip6_locked
Add appropriate prototypes to sys/sys/jail.h
Adjust kern_jail.c to call prison_check_ip4_locked and
prison_check_ip6_locked accordingly.
Add netinet/in_jail.c and netinet6/in6_jail.c to the list of files that
need to be built when INET and INET6, respectively, are configured in the
kernel configuration file.
Reviewed by: jtl
Approved by: sjg (mentor)
Sponsored by: Juniper Networks, Inc.
Differential Revision: https://reviews.freebsd.org/D6799
last SID/SSN pair wasn't filled in.
Thanks to Julian Cordes for providing a packetdrill script
triggering the issue and making me aware of the bug.
MFC after: 3 days
This keeps the segments/ACK/FIN delivery order.
Before this patch, it was observed: if A sent FIN immediately after
an ACK, B would deliver FIN first to the TCP stack, then the ACK.
This out-of-order delivery causes one unnecessary ACK sent from B.
Reviewed by: gallatin, hps
Obtained from: rrs, gallatin
Sponsored by: Netflix (rrs, gallatin), Microsoft (sephe)
Differential Revision: https://reviews.freebsd.org/D7415
- Re-write tcp_ctlinput6() to closely mimic the IPv4 tcp_ctlinput()
- Now that tcp_ctlinput6() updates t_maxseg, we can allow ip6_output()
to send TCP packets without looking at the tcp host cache for every
single transmit.
- Make the icmp6 code mimic the IPv4 code & avoid returning
PRC_HOSTDEAD because it is so expensive.
Without these changes in place, every TCP6 pmtu discovery or host
unreachable ICMP resulted in a call to in6_pcbnotify() which walks the
tcbinfo table with the write lock held. Because the tcbinfo table is
shared between IPv4 and IPv6, this causes huge scalabilty issues on
servers with lots of (~100K) TCP connections, to the point where even
a small percent of IPv6 traffic had a disproportionate impact on
overall throughput.
Reviewed by: bz, rrs, ae (all earlier versions), lstewart (in Netflix's tree)
Sponsored by: Netflix
Differential Revision: https://reviews.freebsd.org/D7272
calling in_pcbnotifyall().
This avoids lock contention on tcbinfo due to in_pcbnotifyall()
holding the tcbinfo write lock while walking all connections.
Reviewed by: rrs, karels
MFC after: 2 weeks
Sponsored by: Netflix, Inc.
Differential Revision: https://reviews.freebsd.org/D7251
code in this file was written by Robert N. M. Waston.
Move cr_can* prototypes from sys/systm.h to sys/proc.h
Reported by: rwatson
Reviewed by: rwatson
Approved by: sjg (mentor)
Differential Revision: https://reviews.freebsd.org/D7345
- Move cr_canseeinpcb to sys/netinet/in_prot.c in order to separate the
INET and INET6-specific code from the rest of the prot code (It is only
used by the network stack, so it makes sense for it to live with the
other network stack code.)
- Move cr_canseeinpcb prototype from sys/systm.h to netinet/in_systm.h
- Rename cr_seeotheruids to cr_canseeotheruids and cr_seeothergids to
cr_canseeothergids, make them non-static, and add prototypes (so they
can be seen/called by in_prot.c functions.)
- Remove sw_csum variable from ip6_forward in ip6_forward.c, as it is an
unused variable.
Reviewed by: gnn, jtl
Approved by: sjg (mentor)
Sponsored by: Juniper Networks, Inc.
Differential Revision: https://reviews.freebsd.org/D2901
r301217 re-added per-connection L2 caching from a previous change,
but it omitted caching in the fast path. Add it.
Reviewed By: gallatin
Approved by: gnn (mentor)
Differential Revision: https://reviews.freebsd.org/D7239
The keep-state, limit and check-state now will have additional argument
flowname. This flowname will be assigned to dynamic rule by keep-state
or limit opcode. And then can be matched by check-state opcode or
O_PROBE_STATE internal opcode. To reduce possible breakage and to maximize
compatibility with old rulesets default flowname introduced.
It will be assigned to the rules when user has omitted state name in
keep-state and check-state opcodes. Also if name is ambiguous (can be
evaluated as rule opcode) it will be replaced to default.
Reviewed by: julian
Obtained from: Yandex LLC
MFC after: 1 month
Relnotes: yes
Sponsored by: Yandex LLC
Differential Revision: https://reviews.freebsd.org/D6674
as defined in RFC 6296. The module works together with ipfw(4) and
implemented as its external action module. When it is loaded, it registers
as eaction and can be used in rules. The usage pattern is similar to
ipfw_nat(4). All matched by rule traffic goes to the NPT module.
Reviewed by: hrs
Obtained from: Yandex LLC
MFC after: 1 month
Relnotes: yes
Sponsored by: Yandex LLC
Differential Revision: https://reviews.freebsd.org/D6420
debugging TCP connections. This commit provides a mechanism to free those
mbufs when the system is under memory pressure.
Because this will result in lost debugging information, the behavior is
controllable by a sysctl. The default setting is to free the mbufs.
Reviewed by: gnn
Approved by: re (gjb)
Differential Revision: https://reviews.freebsd.org/D6931
Input from: novice_techie.com
mp_maxid or CPU_FOREACH() as appropriate. This fixes a number of places in
the kernel that assumed CPU IDs are dense in [0, mp_ncpus) and would try,
for example, to run tasks on CPUs that did not exist or to allocate too
few buffers on systems with sparse CPU IDs in which there are holes in the
range and mp_maxid > mp_ncpus. Such circumstances generally occur on
systems with SMT, but on which SMT is disabled. This patch restores system
operation at least on POWER8 systems configured in this way.
There are a number of other places in the kernel with potential problems
in these situations, but where sparse CPU IDs are not currently known
to occur, mostly in the ARM machine-dependent code. These will be fixed
in a follow-up commit after the stable/11 branch.
PR: kern/210106
Reviewed by: jhb
Approved by: re (glebius)
for SCTP DATA and I-DATA chunks.
* For fragmented user messages, set the I-Bit only on the last
fragment.
* When using explicit EOR mode, set the I-Bit on the last
fragment, whenever SCTP_SACK_IMMEDIATELY was set in snd_flags
for any of the send() calls.
Approved by: re (hrs)
MFC after: 1 week
for messages which have been put on the send queue:
* Do not report any DATA or I-DATA chunk padding.
* Correctly deal with the I-DATA chunk header instead of the DATA
chunk header when the I-DATA extension is used.
Approved by: re (kib)
MFC after: 1 week
use. Update comments regarding the spare fields in struct inpcb.
Bump __FreeBSD_version for the changes to the size of the structures.
Reviewed by: gnn@
Approved by: re@ (gjb@)
Sponsored by: Chelsio Communications
this code in the userland stack, it could result in a loop. This happened on iOS.
However, I was not able to reproduce this when using the code in the kernel.
Thanks to Eugen-Andrei Gavriloaie for reporting the issue and proving detailed
information to find the root of the problem.
Approved by: re (gjb)
MFC after: 1 week
That way timers can finish cleanly and we do not gamble with a DELAY().
Reviewed by: gnn, jtl
Approved by: re (gjb)
Obtained from: projects/vnet
MFC after: 2 weeks
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D6923
Timewait code does a proper cleanup after itself.
Reviewed by: gnn
Approved by: re (gjb)
Obtained from: projects/vnet
MFC after: 2 weeks
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D6922
than removing the network interfaces first. This change is rather larger
and convoluted as the ordering requirements cannot be separated.
Move the pfil(9) framework to SI_SUB_PROTO_PFIL, move Firewalls and
related modules to their own SI_SUB_PROTO_FIREWALL.
Move initialization of "physical" interfaces to SI_SUB_DRIVERS,
move virtual (cloned) interfaces to SI_SUB_PSEUDO.
Move Multicast to SI_SUB_PROTO_MC.
Re-work parts of multicast initialisation and teardown, not taking the
huge amount of memory into account if used as a module yet.
For interface teardown we try to do as many of them as we can on
SI_SUB_INIT_IF, but for some this makes no sense, e.g., when tunnelling
over a higher layer protocol such as IP. In that case the interface
has to go along (or before) the higher layer protocol is shutdown.
Kernel hhooks need to go last on teardown as they may be used at various
higher layers and we cannot remove them before we cleaned up the higher
layers.
For interface teardown there are multiple paths:
(a) a cloned interface is destroyed (inside a VIMAGE or in the base system),
(b) any interface is moved from a virtual network stack to a different
network stack ("vmove"), or (c) a virtual network stack is being shut down.
All code paths go through if_detach_internal() where we, depending on the
vmove flag or the vnet state, make a decision on how much to shut down;
in case we are destroying a VNET the individual protocol layers will
cleanup their own parts thus we cannot do so again for each interface as
we end up with, e.g., double-frees, destroying locks twice or acquiring
already destroyed locks.
When calling into protocol cleanups we equally have to tell them
whether they need to detach upper layer protocols ("ulp") or not
(e.g., in6_ifdetach()).
Provide or enahnce helper functions to do proper cleanup at a protocol
rather than at an interface level.
Approved by: re (hrs)
Obtained from: projects/vnet
Reviewed by: gnn, jhb
Sponsored by: The FreeBSD Foundation
MFC after: 2 weeks
Differential Revision: https://reviews.freebsd.org/D6747
generally cleanup code might still acquire it thus try to be
consistent destroying locks late.
Obtained from: projects/vnet
MFC after: 2 weeks
Sponsored by: The FreeBSD Foundation
Add accessor functions to toggle the state per VNET.
The base system (vnet0) will always enable itself with the normal
registration. We will share the registered protocol handlers in all
VNETs minimising duplication and management.
Upon disabling netisr processing for a VNET drain the netisr queue from
packets for that VNET.
Update netisr consumers to (de)register on a per-VNET start/teardown using
VNET_SYS(UN)INIT functionality.
The change should be transparent for non-VIMAGE kernels.
Reviewed by: gnn (, hiren)
Obtained from: projects/vnet
MFC after: 2 weeks
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D6691
Replacing the bubble sort with insertion sort gives an 80% reduction
in runtime on average, with randomized keys, for small partitions.
If the keys are pre-sorted, insertion sort runs in linear time, and
even if the keys are reversed, insertion sort is faster than bubble
sort, although not by much.
Update comment describing "tcp_lro_sort()" while at it.
Differential Revision: https://reviews.freebsd.org/D6619
Sponsored by: Mellanox Technologies
Tested by: Netflix
Suggested by: Pieter de Goeje <pieter@degoeje.nl>
Reviewed by: ed, gallatin, gnn, transport
but removed due to other changes in the system. Restore the llentry pointer
to the "struct route", and use it to cache the L2 lookup (ARP or ND6) as
appropriate.
Submitted by: Mike Karels
Differential Revision: https://reviews.freebsd.org/D6262
specific order. VNET_SYSUNINITs however are doing exactly that.
Thus remove the VIMAGE conditional field from the domain(9) protosw
structure and replace it with VNET_SYSUNINITs.
This also allows us to change some order and to make the teardown functions
file local static.
Also convert divert(4) as it uses the same mechanism ip(4) and ip6(4) use
internally.
Slightly reshuffle the SI_SUB_* fields in kernel.h and add a new ones, e.g.,
for pfil consumers (firewalls), partially for this commit and for others
to come.
Reviewed by: gnn, tuexen (sctp), jhb (kernel.h)
Obtained from: projects/vnet
MFC after: 2 weeks
X-MFC: do not remove pr_destroy
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D6652
This is required to signal connetion setup on non-blocking sockets
via becoming writable. This still allows for implicit connection
setup.
MFC after: 1 week
If the connection was persistent and receiving-only, several (12)
sporadic device insufficient buffers would cause the connection be
dropped prematurely:
Upon ENOBUFS in tcp_output() for an ACK, retransmission timer is
started. No one will stop this retransmission timer for receiving-
only connection, so the retransmission timer promises to expire and
t_rxtshift is promised to be increased. And t_rxtshift will not be
reset to 0, since no RTT measurement will be done for receiving-only
connection. If this receiving-only connection lived long enough
(e.g. >350sec, given the RTO starts from 200ms), and it suffered 12
sporadic device insufficient buffers, i.e. t_rxtshift >= 12, this
receiving-only connection would be dropped prematurely by the
retransmission timer.
We now assert that for data segments, SYNs or FINs either rexmit or
persist timer was wired upon ENOBUFS. And don't set rexmit timer
for other cases, i.e. ENOBUFS upon ACKs.
Discussed with: lstewart, hiren, jtl, Mike Karels
MFC after: 3 weeks
Sponsored by: Microsoft OSTC
Differential Revision: https://reviews.freebsd.org/D5872
Centre for Advanced Internet Architectures
Implementing AQM in FreeBSD
* Overview <http://caia.swin.edu.au/freebsd/aqm/index.html>
* Articles, Papers and Presentations
<http://caia.swin.edu.au/freebsd/aqm/papers.html>
* Patches and Tools <http://caia.swin.edu.au/freebsd/aqm/downloads.html>
Overview
Recent years have seen a resurgence of interest in better managing
the depth of bottleneck queues in routers, switches and other places
that get congested. Solutions include transport protocol enhancements
at the end-hosts (such as delay-based or hybrid congestion control
schemes) and active queue management (AQM) schemes applied within
bottleneck queues.
The notion of AQM has been around since at least the late 1990s
(e.g. RFC 2309). In recent years the proliferation of oversized
buffers in all sorts of network devices (aka bufferbloat) has
stimulated keen community interest in four new AQM schemes -- CoDel,
FQ-CoDel, PIE and FQ-PIE.
The IETF AQM working group is looking to document these schemes,
and independent implementations are a corner-stone of the IETF's
process for confirming the clarity of publicly available protocol
descriptions. While significant development work on all three schemes
has occured in the Linux kernel, there is very little in FreeBSD.
Project Goals
This project began in late 2015, and aims to design and implement
functionally-correct versions of CoDel, FQ-CoDel, PIE and FQ_PIE
in FreeBSD (with code BSD-licensed as much as practical). We have
chosen to do this as extensions to FreeBSD's ipfw/dummynet firewall
and traffic shaper. Implementation of these AQM schemes in FreeBSD
will:
* Demonstrate whether the publicly available documentation is
sufficient to enable independent, functionally equivalent implementations
* Provide a broader suite of AQM options for sections the networking
community that rely on FreeBSD platforms
Program Members:
* Rasool Al Saadi (developer)
* Grenville Armitage (project lead)
Acknowledgements:
This project has been made possible in part by a gift from the
Comcast Innovation Fund.
Submitted by: Rasool Al-Saadi <ralsaadi@swin.edu.au>
X-No objection: core
MFC after: 2 weeks
Differential Revision: https://reviews.freebsd.org/D6388
Not all mbufs passed up from device drivers are M_WRITABLE(). In
particular, the Chelsio T4/T5 driver uses a feature called "buffer packing"
to receive multiple frames in a single receive buffer. The mbufs for
these frames all share the same external storage so are treated as
read-only by the rest of the stack when multiple frames are in flight.
Previously tcp_respond() would blindly overwrite read-only mbufs when
INVARIANTS was disabled or panic with an assertion failure if INVARIANTS
was enabled. Note that the new case is a bit of a mix of the two other
cases in tcp_respond(). The TCP and IP headers must be copied explicitly
into the new mbuf instead of being inherited (similar to the m == NULL
case), but the addresses and ports must be swapped in the reply (similar
to the m != NULL case).
Reviewed by: glebius
"qsort()".
The kernel's "qsort()" routine can in worst case spend O(N*N) amount of
comparisons before the input array is sorted. It can also recurse a
significant amount of times using up the kernel's interrupt thread
stack.
The custom sorting routine takes advantage of that the sorting key is
only 64 bits. Based on set and cleared bits in the sorting key it
partitions the array until it is sorted. This process has a recursion
limit of 64 times, due to the number of set and cleared bits which can
occur. Compiled with -O2 the sorting routine was measured to use
64-bytes of stack. Multiplying this by 64 gives a maximum stack
consumption of 4096 bytes for AMD64. The same applies to the execution
time, that the array to be sorted will not be traversed more than 64
times.
When serving roughly 80Gb/s with 80K TCP connections, the old method
consisting of "qsort()" and "tcp_lro_mbuf_compare_header()" used 1.4%
CPU, while the new "tcp_lro_sort()" used 1.1% for LRO related sorting
as measured by Intel Vtune. The testing was done using a sysctl to
toggle between "qsort()" and "tcp_lro_sort()".
Differential Revision: https://reviews.freebsd.org/D6472
Sponsored by: Mellanox Technologies
Tested by: Netflix
Reviewed by: gallatin, rrs, sephe, transport
* include the SCTP common header, if possible
* include the first 8 bytes of the INIT chunk, if possible
This provides the necesary information for the receiver of the ICMP
packet to process it.
MFC after: 1 week
control to a three way setting.
0 - Totally disable ECN. (no change)
1 - Enable ECN if incoming connections request it. Outgoing
connections will request ECN. (no change from present != 0 setting)
2 - Enable ECN if incoming connections request it. Outgoing
conections will not request ECN.
Change the default value of net.inet.tcp.ecn.enable from 0 to 2.
Linux version 2.4.20 and newer, Solaris, and Mac OS X 10.5 and newer have
similar capabilities. The actual values above match Linux, and the default
matches the current Linux default.
Reviewed by: eadler
MFC after: 1 month
MFH: yes
Sponsored by: https://reviews.freebsd.org/D6386
structures in the add of a new tcp-stack that came in late to me
via email after the last commit. It also makes it so that a new
stack may optionally get a callback during a retransmit
timeout. This allows the new stack to clear specific state (think
sack scoreboards or other such structures).
Sponsored by: Netflix Inc.
Differential Revision: http://reviews.freebsd.org/D6303
objects with the same name in different sets.
Add optional manage_sets() callback to objects rewriting framework.
It is intended to implement handler for moving and swapping named
object's sets. Add ipfw_obj_manage_sets() function that implements
generic sets handler. Use new callback to implement sets support for
lookup tables.
External actions objects are global and they don't support sets.
Modify eaction_findbyname() to reflect this.
ipfw(8) now may fail to move rules or sets, because some named objects
in target set may have conflicting names.
Note that ipfw_obj_ntlv type was changed, but since lookup tables
actually didn't support sets, this change is harmless.
Obtained from: Yandex LLC
Sponsored by: Yandex LLC
There was the requirement that two structures are in sync,
which is not valid anymore. Therefore don't rely on this
in the code anymore.
Thanks to Radek Malcic for reporting the issue. He found this
when using the userland stack.
MFC after: 1 week