but not for IPv6. The current checks in nd6_nbr.c along with the
old version will result in ifa being NULL and subsequently the
packet will be dropped. This prevented NS/NA, from working and
with that IPv6.
Now return the ifa from the carp lookup function in two cases:
1) if the address matches, is a carp address, and we are MASTER
(as before),
2) if the address matches but it is not a carp address at all (new).
Reported by: Peter Wemm (new Y! FreeBSD cluster, eating our own dogfood)
Tested on: New Y! FreeBSD cluster machines
Reviewed by: glebius
tcp_mtudisc(), which in its turn may call tcp_output(). Under certain
conditions (must admit they are very special) an infinite recursion can
happen.
To avoid recursion we can pass struct route to ip_output() and obtain
correct mtu. This allows us not to use tcp_mtudisc() but call tcp_mss_update()
directly.
PR: kern/155585
Submitted by: Andrey Zonov <andrey zonov.org> (original version of patch)
it skips FLOWTABLE lookup. However, the non-NULL ro has dual meaning
here: it may be supplied to provide route, and it may be supplied to
store and return to caller the route that ip_output()/ip6_output()
finds. In the latter case skipping FLOWTABLE lookup is pessimisation.
The difference between struct route filled by FLOWTABLE and filled
by rtalloc() family is that the former doesn't hold a reference on
its rtentry. Reference is hold by flow entry, and it is about to
be released in future. Thus, route filled by FLOWTABLE shouldn't
be passed to RTFREE() macro.
- Introduce new flag for struct route/route_in6, that marks route
not holding a reference on rtentry.
- Introduce new macro RO_RTFREE() that cleans up a struct route
depending on its kind.
- All callers to ip_output()/ip6_output() that do supply non-NULL
but empty route should use RO_RTFREE() to free results of
lookup.
- ip_output()/ip6_output() now do FLOWTABLE lookup always when
ro->ro_rt == NULL.
Tested by: tuexen (SCTP part)
- Add a note to the ipfw(8) man page about the rules no longer being
case sensitive.
- Fix some typos in the man page.
PR: docs/164772
Reviewed by: bz
Approved by: gabor (doc mentor, src committer)
MFC after: 2 weeks
- Stateful TCP offload drivers for Terminator 3 and 4 (T3 and T4) ASICs.
These are available as t3_tom and t4_tom modules that augment cxgb(4)
and cxgbe(4) respectively. The cxgb/cxgbe drivers continue to work as
usual with or without these extra features.
- iWARP driver for Terminator 3 ASIC (kernel verbs). T4 iWARP in the
works and will follow soon.
Build-tested with make universe.
30s overview
============
What interfaces support TCP offload? Look for TOE4 and/or TOE6 in the
capabilities of an interface:
# ifconfig -m | grep TOE
Enable/disable TCP offload on an interface (just like any other ifnet
capability):
# ifconfig cxgbe0 toe
# ifconfig cxgbe0 -toe
Which connections are offloaded? Look for toe4 and/or toe6 in the
output of netstat and sockstat:
# netstat -np tcp | grep toe
# sockstat -46c | grep toe
Reviewed by: bz, gnn
Sponsored by: Chelsio communications.
MFC after: ~3 months (after 9.1, and after ensuring MFC is feasible)
packets a cmsg of type IP_RECVTOS which contains the TOS byte.
Much like IP_RECVTTL does for TTL. This allows to implement a
protocol on top of UDP and implementing ECN.
MFC after: 3 days
headers for TSO but also for generic checksum offloading. Ideally we
would only have one common function shared amongst all drivers, and
perhaps when updating them for IPv6 we should introduce that.
Eventually we should provide the meta information along with mbufs to
avoid (re-)parsing entirely.
To not break IPv6 (checksums and offload) and to be able to MFC the
changes without risking to hurt 3rd party drivers, duplicate the v4
framework, as other OSes have done as well.
Introduce interface capability flags for TX/RX checksum offload with
IPv6, to allow independent toggling (where possible). Add CSUM_*_IPV6
flags for UDP/TCP over IPv6, and reserve further for SCTP, and IPv6
fragmentation. Define CSUM_DELAY_DATA_IPV6 as we do for legacy IP and
add an alias for CSUM_DATA_VALID_IPV6.
This pretty much brings IPv6 handling in line with IPv4.
TSO is still handled in a different way and not via if_hwassist.
Update ifconfig to allow (un)setting of the new capability flags.
Update loopback to announce the new capabilities and if_hwassist flags.
Individual driver updates will have to follow, as will SCTP.
Reported by: gallatin, dim, ..
Reviewed by: gallatin (glanced at?)
MFC after: 3 days
X-MFC with: r235961,235959,235958
There are currently no in-tree consumers; I'm adding it now for use by
vendor code. This matches the change OpenBSD made while implementing
MPLS in gif(4).
Properly protect the inp read access when handling the control code.
In the past this was expensive but given the rlock it's not so much
anymore.
Spotted while: optimizing udp6
Discussed with: rwatson (a few months ago)
Sponsored by: The FreeBSD Foundation
Sponsored by: iXsystems
Reviewed by: gnn (as part of the whole)
MFC After: 3 days
queue the packet for LRO and tell the driver to directly pass it on.
This avoids re-assembly and later re-fragmentation problems when
forwarding.
It's not the best solution but the simplest and most effective for
the moment.
Should have been done: ages ago
Discussed with and by: many
MFC after: 3 days
Add code to handle pre-checked TCP checksums as indicated by mbuf
flags to save the entire computation for validation if not needed.
In the IPv6 TCP output path only compute the pseudo-header checksum,
set the checksum offset in the mbuf field along the appropriate flag
as done in IPv4.
In tcp_respond() just initialize the IPv6 payload length to 0 as
ip6_output() will properly set it.
Sponsored by: The FreeBSD Foundation
Sponsored by: iXsystems
Reviewed by: gnn (as part of the whole)
MFC After: 3 days
Factor out the tcp_hc_getmtu() call. As the comments say it
applies to both v4 and v6, so only write it once making it easier
to read the protocol family specifc code.
Sponsored by: The FreeBSD Foundation
Sponsored by: iXsystems
Reviewed by: gnn (as part of the whole)
MFC After: 3 days
Significantly update tcp_lro for mostly two things:
1) introduce basic support for IPv6 without extension headers.
2) try hard to also get the incremental checksum updates right,
especially also in the IPv4 case for the IP and TCP header.
Move variables around for better locality, factor things out into
functions, allow checksum updates to be compiled out, ...
Leave a few comments on further things to look at in the future,
though that is not the full list.
Update drivers with appropriate #includes as needed for IPv6 data
type in LRO.
Sponsored by: The FreeBSD Foundation
Sponsored by: iXsystems
Reviewed by: gnn (as part of the whole)
MFC After: 3 days
Cleaner solution (e.g. adding another header) should be done here.
Original log:
Move several enums and structures required for L2 filtering from ip_fw_private.h to ip_fw.h.
Remove ipfw/ip_fw_private.h header from non-ipfw code.
Requested by: luigi
Approved by: kib(mentor)
actually in it. This happens when SCTP receives an unknown chunk, which
requires the sending of an ERROR chunk, and there is no final padding but
the chunk is not 4-byte aligned.
Reported by yueting via rwatson@
MFC after: 3 days
proposed MTU value from it and update the TCP host cache. Then
tcp_mss_update() is called on the corresponding tcpcb. It finds the
just allocated entry in the TCP host cache and updates MSS on the
tcpcb. And then we do a fast retransmit of what we have in the tcp
send buffer.
This sequence gets broken if the TCP host cache is exausted. In this
case allocation fails, and later called tcp_mss_update() finds nothing
in cache. The fast retransmit is done with not reduced MSS and is
immidiately replied by remote host with new ICMP datagrams and the
cycle repeats. This ping-pong can go up to wirespeed.
To fix this:
- tcp_mss_update() gets new parameter - mtuoffer, that is like
offer, but needs to have min_protoh subtracted.
- tcp_mtudisc() as notification method renamed to tcp_mtudisc_notify().
- tcp_mtudisc() now accepts not a useless error argument, but proposed
MTU value, that is passed to tcp_mss_update() as mtuoffer.
Reported by: az
Reported by: Andrey Zonov <andrey zonov.org>
Reviewed by: andre (previous version of patch)
we look at count of addresses on a particular vhid, we
should account number of addresses on cif.
To achieve this we need to run carp_attach() and
carp_detach() under appropriate cif lock.
this commit is not enough to enable CARP operation on
if_bridge(4), because the latter doesn't handle or even
initialize its ifp->if_link_state.
Reported by: Alexander Lunev <sol289 gmail.com>
SCTP will only do IPv4 UDP checksum calculation as defined by the host
policy. When tunneling SCTP always calculates the inner checksum already
so not doing the outer UDP can save cycles.
While here virtualize the variable.
Requested by: tuexen
MFC after: 2 weeks
net.inet.ip.fw.tables_max is now read-write.
- Bump IPFW_TABLES_MAX to 65535
Default number of tables is still 128
- Remove IPFW_TABLES_MAX from ipfw(8) code.
Sponsored by Yandex LLC
Approved by: kib(mentor)
MFC after: 2 weeks
add in the netgraph interface to the list of
acceptable interfaces. A todo at the next
IETF code blitz, though is we need to review
why we screen interfaces, there was a reason ;-).
PR: 165210
MFC after: 1 week
- Add support for IPv6 and interface extended tables
- Make number of tables to be loader tunable in range 0..65534.
- Use IP_FW3 opcode for all new extended table cmds
No ABI changes are introduced. Old userland will see valid tables for
IPv4 tables and no entries otherwise. Flush works for any table.
IP_FW3 socket option is used to encapsulate all new opcodes:
/* IP_FW3 header/opcodes */
typedef struct _ip_fw3_opheader {
uint16_t opcode; /* Operation opcode */
uint16_t reserved[3]; /* Align to 64-bit boundary */
} ip_fw3_opheader;
New opcodes added:
IP_FW_TABLE_XADD, IP_FW_TABLE_XDEL, IP_FW_TABLE_XGETSIZE, IP_FW_TABLE_XLIST
ipfw(8) table argument parsing behavior is changed:
'ipfw table 999 add host' now assumes 'host' to be interface name instead of
hostname.
New tunable:
net.inet.ip.fw.tables_max controls number of table supported by ipfw in given
VNET instance. 128 is still the default value.
New syntax:
ipfw add skipto tablearg ip from any to any via table(42) in
ipfw add skipto tablearg ip from any to any via table(4242) out
This is a bit hackish, special interface name '\1' is used to signal interface
table number is passed in p.glob field.
Sponsored by Yandex LLC
Reviewed by: ae
Approved by: ae (mentor)
MFC after: 4 weeks
before vnet_mroute_init(), since vnet_mroute_init() depends on mfchashsize
tunable to be set, and that is done in in ip_mroute_modevent().
Apparently I broke that ordering with r208744 almost 2 years ago...
PR: kern/162201
Submitted by: Stevan Markovic (mcafee.com)
MFC after: 3 days
at which the lle_tbl pointer points to freed memory and the llt_free pointer is no longer
valid.
Move the free pointer in to the llentry itself and update the initalization sites.
MFC after: 2 weeks
hz >> 1000 and thus getting outside the timestamp clock frequenceny of
1ms < x < 1s per tick as mandated by RFC1323, leading to connection
resets on idle connections.
Always use a granularity of 1ms using getmicrouptime() making all but
relevant callouts independent of hz.
Use getmicrouptime(), not getmicrotime() as the latter may make a jump
possibly breaking TCP nfsroot mounts having our timestamps move forward
for more than 24.8 days in a second without having been idle for that
long.
PR: kern/61404
Reviewed by: jhb, mav, rrs
Discussed with: silby, lstewart
Sponsored by: Sandvine Incorporated (originally in 2011)
MFC after: 6 weeks
to a deadlock of an association when an IPv6 socket was used to
communcate with IPv4 and an ICMPv4 fragmentation needed message
was received.
While there, simplify the code a bit.
MFC after: 3 days.
TCP_KEEPCNT, that allow to control initial timeout, idle time, idle
re-send interval and idle send count on a per-socket basis.
Reviewed by: andre, bz, lstewart
in r179783 as (ab)using the concept of VRFs for this had not worked.
At this point SCTP in FreeBSD does not support multi-FIB, neither for
IPv4 nor for IPv6.
Discussed with: rrs
Sponsored by: Cisco Systems, Inc.
the original IPv4 implementation from r178888:
- Use RT_DEFAULT_FIB in the IPv4 implementation where noticed.
- Use rt*fib() KPI with explicit RT_DEFAULT_FIB where applicable in
the NFS code.
- Use the new in6_rt* KPI in TCP, gif(4), and the IPv6 network stack
where applicable.
- Split in6_rtqtimo() and in6_mtutimo() as done in IPv4 and equally
prevent multiple initializations of callouts in in6_inithead().
- Use wrapper functions where needed to preserve the current KPI to
ease MFCs. Use BURN_BRIDGES to indicate expected future cleanup.
- Fix (related) comments (both technical or style).
- Convert to rtinit() where applicable and only use custom loops where
currently not possible otherwise.
- Multicast group, most neighbor discovery address actions and faith(4)
are locked to the default FIB. Individual IPv6 addresses will only
appear in the default FIB, however redirect information and prefixes
of connected subnets are automatically propagated to all FIBs by
default (mimicking IPv4 behavior as closely as possible).
Sponsored by: Cisco Systems, Inc.
to cleanup routes from a single ifa.
o Implement carp_addroute()/carp_delroute() via above functions.
o Call carp_ifa_delroute() in the carp_detach() to avoid
junk routes left in routing table, in case if user
removes an address in a MASTER state. [1]
Reported by: az [1]
comments to longer, also refining strange ones.
Properly use #ifdef rather than #if defined() where possible. Four
#if defined(PCBGROUP) occurances (netinet and netinet6) were ignored to
avoid conflicts with eventually upcoming changes for RSS.
Reported by: bde (most)
Reviewed by: bde
MFC after: 3 days
o Make the pfsync.ko actually usable. Before this change loading it
didn't register protosw, so was a nop. However, a module /boot/kernel
did confused users.
o Rewrite the way we are joining multicast group:
- Move multicast initialization/destruction to separate functions.
- Don't allocate memory if we aren't going to join a multicast group.
- Use modern API for joining/leaving multicast group.
- Now the utterly wrong pfsync_ifdetach() isn't needed.
o Move module initialization from SYSINIT(9) to moduledata_t method.
o Refuse to unload module, unless asked forcibly.
o Improve a bit some FreeBSD porting code:
- Use separate malloc type.
- Simplify swi sheduling.
This change is probably wrong from VIMAGE viewpoint, however pfsync
wasn't VIMAGE-correct before this change, too.
Glanced at by: bz
in the ARP datagram generated by arprequest(). If caller doesn't
supply the address, then it is either picked from CARP or hardware
address of the interface is taken.
While here, make several minor fixes:
- Hold IF_ADDR_RLOCK(ifp) while traversing address list.
- Remove not true comment.
- Access internet address and mask via in_ifaddr fields,
rather than ifaddr.
If set to 1, no ABORT is sent back in response to an incoming
INIT. If set to 2, no ABORT is sent back in response to
an out of the blue packet. If set to 0 (the default), ABORTs
are sent.
Discussed with rrs@.
MFC after: 1 month.
than or equal to rcv_adv and fix tcp_twstart() to handle this case by
assuming the last window was zero rather than a negative value.
The code in tcp_input() already safely handled this case. It can happen
due to delayed ACKs along with a remote sender that sends data beyond
the window we previously advertised. If we have room in our socket buffer
for the extra data beyond the advertised window, we will accept it.
However, if the ACK for that segment is delayed, then we will not
effectively fixup rcv_adv to account for that extra data until the
next segment arrives and forces out an ACK. When that next segment
arrives, rcv_nxt will be beyond rcv_adv.
Tested by: pjd
MFC after: 1 week
missing interface address list locking and grab a reference on the
matching interface address after dropping the lock while it is used to
avoid a potential use after free.
Reviewed by: bz
MFC after: 1 week
reference on a group in the leaving state while iterating over the loop.
Instead, use the same approach used in igmp_ifdetach() and mld_ifdetach()
of placing the groups to free on pending release list and then releasing
the references after dropping the IF_ADDR_LOCK. This closes an ugly race
where the code was dropping the lock in the middle of iterating over the
list. It also fixes some additional potential use-after-free bugs since
the cancellation routine also applied other changes to the group after
dropping the reference. Now those changes are performed before the
reference is dropped and the group is potentially freed.
Prodded to fix by: glebius
Reviewed by: bz
MFC after: 1 week
asychronous task. This avoids tearing down multicast state including
sending IGMP leave messages and reprogramming MAC filters while holding
the per-protocol global pcbinfo lock that is used in the receive path of
packet processing.
Reviewed by: rwatson
MFC after: 1 month
7.x, 8.x and 9.x with pf(4) imports: pfsync(4) should suppress CARP
preemption, while it is running its bulk update.
However, reimplement the feature in more elegant manner, that is
partially inspired by newer OpenBSD:
- Rename term "suppression" to "demotion", to match with OpenBSD.
- Keep a global demotion factor, that can be raised by several
conditions, for now these are:
- interface goes down
- carp(4) has problems with ip_output() or ip6_output()
- pfsync performs bulk update
- Unlike in OpenBSD the demotion factor isn't a counter, but
is actual value added to advskew. The adjustment values for
particular error conditions are also configurable, and their
defaults are maximum advskew value, so a single failure bumps
demotion to maximum. This is for POLA compatibility, and should
satisfy most users.
- Demotion factor is a writable sysctl, so user can do
foot shooting, if he desires to.
from scratch, copying needed functionality from the old implemenation
on demand, with a thorough review of all code. The main change is that
interface layer has been removed from the CARP. Now redundant addresses
are configured exactly on the interfaces, they run on.
The CARP configuration itself is, as before, configured and read via
SIOCSVH/SIOCGVH ioctls. A new prefix created with SIOCAIFADDR or
SIOCAIFADDR_IN6 may now be configured to a particular virtual host id,
which makes the prefix redundant.
ifconfig(8) semantics has been changed too: now one doesn't need
to clone carpXX interface, he/she should directly configure a vhid
on a Ethernet interface.
To supply vhid data from the kernel to an application the getifaddrs(8)
function had been changed to pass ifam_data with each address. [1]
The new implementation definitely closes all PRs related to carp(4)
being an interface, and may close several others. It also allows
to run a single redundant IP per interface.
Big thanks to Bjoern Zeeb for his help with inet6 part of patch, for
idea on using ifam_data and for several rounds of reviewing!
PR: kern/117000, kern/126945, kern/126714, kern/120130, kern/117448
Reviewed by: bz
Submitted by: bz [1]
Fix a bug where the parameter length of a supported address types
parameter is set to a wrong value if the kernel is built with
with either INET or INET6, but not both.
MFC after: 3 days.
backup stack queue entry when the zone is exhausted, otherwise we leak a zone
allocation each time we plug a hole in the reassembly queue.
Reported by: many on freebsd-stable@ (thread: "TCP Reassembly Issues")
Tested by: many on freebsd-stable@ (thread: "TCP Reassembly Issues")
Reviewed by: bz (very brief sanity check)
MFC after: 3 days