(UTC) rather than the archaic (GMT) in comments. Except where the
comments are making fun of people doing this (and pedants who insist
on the new terms).
ipsec_getpolicybyaddr()
ipsec4_checkpolicy()
ip_ipsec_output()
ip6_ipsec_output()
The only flag used here was IP_FORWARDING.
Obtained from: Yandex LLC
Sponsored by: Yandex LLC
and make its prototype similar to ipsec6_process_packet.
The flags argument isn't used here, tunalready is always zero.
Obtained from: Yandex LLC
Sponsored by: Yandex LLC
Remove check for presence PACKET_TAG_IPSEC_IN_DONE mbuf tag from
ip_ipsec_fwd(). PACKET_TAG_IPSEC_IN_DONE tag means that packet is
already handled by IPSEC code. This means that before IPSEC processing
it was destined to our address and security policy was checked in
the ip_ipsec_input(). After IPSEC processing packet has new IP
addresses and destination address isn't our own. So, anyway we can't
check security policy from the mbuf tag, because it corresponds
to different addresses.
We should check security policy that corresponds to packet
attributes in both cases - when it has a mbuf tag and when it has not.
Obtained from: Yandex LLC
Sponsored by: Yandex LLC
security policy. The changed block of code in ip*_ipsec_input() is
called when packet has ESP/AH header. Presence of
PACKET_TAG_IPSEC_IN_DONE mbuf tag in the same time means that
packet was already handled by IPSEC and reinjected in the netisr,
and it has another ESP/AH headers (encrypted twice?).
Since it was already processed by IPSEC code, the AH/ESP headers
was already stripped (and probably outer IP header was stripped too)
and security policy from the tdb_ident was applied to those headers.
It is incorrect to apply this security policy to current headers.
Also make ip_ipsec_input() prototype similar to ip6_ipsec_input().
Obtained from: Yandex LLC
Sponsored by: Yandex LLC
PACKET_TAG_IPSEC_OUT_CRYPTO_NEEDED mbuf tags. They aren't used in FreeBSD.
Instead check presence of PACKET_TAG_IPSEC_OUT_DONE mbuf tag. If it
is found, bypass security policy lookup as described in the comment.
PACKET_TAG_IPSEC_OUT_DONE tag added to mbuf when IPSEC code finishes
ESP/AH processing. Since it was already finished, this means the security
policy placed in the tdb_ident was already checked. And there is no reason
to check it again here.
Obtained from: Yandex LLC
Sponsored by: Yandex LLC
from the FreeBSD network code. The flag is still kept around in the
"sys/mbuf.h" header file, but does no longer have any users. Instead
the "m_pkthdr.rsstype" field in the mbuf structure is now used to
decide the meaning of the "m_pkthdr.flowid" field. To modify the
"m_pkthdr.rsstype" field please use the existing "M_HASHTYPE_XXX"
macros as defined in the "sys/mbuf.h" header file.
This patch introduces new behaviour in the transmit direction.
Previously network drivers checked if "M_FLOWID" was set in "m_flags"
before using the "m_pkthdr.flowid" field. This check has now now been
replaced by checking if "M_HASHTYPE_GET(m)" is different from
"M_HASHTYPE_NONE". In the future more hashtypes will be added, for
example hashtypes for hardware dedicated flows.
"M_HASHTYPE_OPAQUE" indicates that the "m_pkthdr.flowid" value is
valid and has no particular type. This change removes the need for an
"if" statement in TCP transmit code checking for the presence of a
valid flowid value. The "if" statement mentioned above is now a direct
variable assignment which is then later checked by the respective
network drivers like before.
Additional notes:
- The SCTP code changes will be committed as a separate patch.
- Removal of the "M_FLOWID" flag will also be done separately.
- The FreeBSD version has been bumped.
MFC after: 1 month
Sponsored by: Mellanox Technologies
- Provide pru_ready function for TCP.
- Don't call tcp_output() from tcp_usr_send() if no ready data was put
into the socket buffer.
- In case of dropped connection don't try to m_freem() not ready data.
Sponsored by: Nginx, Inc.
Sponsored by: Netflix
sending not ready data:
o Add new flag to pru_send() flags - PRUS_NOTREADY.
o Add new protocol method pru_ready().
Sponsored by: Nginx, Inc.
Sponsored by: Netflix
o Introduce a notion of "not ready" mbufs in socket buffers. These
mbufs are now being populated by some I/O in background and are
referenced outside. This forces following implications:
- An mbuf which is "not ready" can't be taken out of the buffer.
- An mbuf that is behind a "not ready" in the queue neither.
- If sockbet buffer is flushed, then "not ready" mbufs shouln't be
freed.
o In struct sockbuf the sb_cc field is split into sb_ccc and sb_acc.
The sb_ccc stands for ""claimed character count", or "committed
character count". And the sb_acc is "available character count".
Consumers of socket buffer API shouldn't already access them directly,
but use sbused() and sbavail() respectively.
o Not ready mbufs are marked with M_NOTREADY, and ready but blocked ones
with M_BLOCKED.
o New field sb_fnrdy points to the first not ready mbuf, to avoid linear
search.
o New function sbready() is provided to activate certain amount of mbufs
in a socket buffer.
A special note on SCTP:
SCTP has its own sockbufs. Unfortunately, FreeBSD stack doesn't yet
allow protocol specific sockbufs. Thus, SCTP does some hacks to make
itself compatible with FreeBSD: it manages sockbufs on its own, but keeps
sb_cc updated to inform the stack of amount of data in them. The new
notion of "not ready" data isn't supported by SCTP. Instead, only a
mechanical substitute is done: s/sb_cc/sb_ccc/.
A proper solution would be to take away struct sockbuf from struct
socket and allow protocols to implement their own socket buffers, like
SCTP already does. This was discussed with rrs@.
Sponsored by: Netflix
Sponsored by: Nginx, Inc.
Update route MTU in case of ifnet MTU change.
Add new RTF_FIXEDMTU to track explicitly specified MTU.
Old behavior:
ifconfig em0 mtu 1500->9000 -> all routes traversing em0 do not change MTU.
User has to manually update all routes.
ifconfig em0 mtu 9000->1500 -> all routes traversing em0 do not change MTU.
However, if ip[6]_output finds route with rt_mtu > interface mtu, rt_mtu
gets updated.
New behavior:
ifconfig em0 mtu 1500->9000 -> all interface routes in all fibs gets updated
with new MTU unless RTF_FIXEDMTU flag set on them.
ifconfig em0 mtu 9000->1500 -> all routes in all fibs gets updated with new
MTU unless RTF_FIXEDMTU flag set on them AND rt_mtu is less than ifp mtu.
route add ... -mtu XXX automatically sets RTF_FIXEDMTU flag.
route change .. -mtu 0 automatically removes RTF_FIXEDMTU flag.
PR: 194238
MFC after: 1 month
CR: D1125
sb_cc member of struct sockbuf to a couple of inline functions:
sbavail() and sbused()
Right now they are equal, but once notion of "not ready socket buffer data",
will be checked in, they are going to be different.
Sponsored by: Netflix
Sponsored by: Nginx, Inc.
Initially in_matrote() in_clsroute() in their current state was introduced by
r4105 20 years ago. Instead of deleting inactive routes immediately, we kept them
in route table, setting RTPRF_OURS flag and some expire time. After that, either
GC came or RTPRF_OURS got removed on first-packet. It was a good solution
in that days (and probably another decade after that) to keep TCP metrics.
However, after moving metrics to TCP hostcache in r122922, most of in_rmx
functionality became unused. It might had been used for flushing icmp-originated
routes before rte mutexes/refcounting, but I'm not sure about that.
So it looks like this is nearly impossible to make GC do its work nowadays:
in_rtkill() ignores non-RTPRF_OURS routes.
route can only become RTPRF_OURS after dropping last reference via rtfree()
which calls in_clsroute(), which, it turn, ignores UP and non-RTF_DYNAMIC routes.
Dynamic routes can still be installed via received redirect, but they
have default lifetime (no specific rt_expire) and no one has another trie walker
to call RTFREE() on them.
So, the changelist:
* remove custom rnh_match / rnh_close matching function.
* remove all GC functions
* partially revert r256695 (proto3 is no more used inside kernel,
it is not possible to use rt_expire from user point of view, proto3 support
is not complete)
* Finish r241884 (similar to this commit) and remove remaining IPv6 parts
MFC after: 1 month
have chosen different (and more traditional) stateless/statuful
NAT64 as translation mechanism. Last non-trivial commits to both
faith(4) and faithd(8) happened more than 12 years ago, so I assume
it is time to drop RFC3142 in FreeBSD.
No objections from: net@
It isn't safe to keep unreferenced ifaddrs. Use in6ifa_ifwithaddr() to
determine ifaddr corresponding to destination address. Since currently
we keep addresses with embedded scope zone, in6ifa_ifwithaddr is called
with zero zoneid and marked with XXX.
Also remove route and lle lookups from ip6_input. Use in6ifa_ifwithaddr()
instead.
Sponsored by: Yandex LLC
Split it into two modules: if_gre(4) for GRE encapsulation and
if_me(4) for minimal encapsulation within IP.
gre(4) changes:
* convert to if_transmit;
* rework locking: protect access to softc with rmlock,
protect from concurrent ioctls with sx lock;
* correct interface accounting for outgoing datagramms (count only payload size);
* implement generic support for using IPv6 as delivery header;
* make implementation conform to the RFC 2784 and partially to RFC 2890;
* add support for GRE checksums - calculate for outgoing datagramms and check
for inconming datagramms;
* add support for sending sequence number in GRE header;
* remove support of cached routes. This fixes problem, when gre(4) doesn't
work at system startup. But this also removes support for having tunnels with
the same addresses for inner and outer header.
* deprecate support for various GREXXX ioctls, that doesn't used in FreeBSD.
Use our standard ioctls for tunnels.
me(4):
* implementation conform to RFC 2004;
* use if_transmit;
* use the same locking model as gre(4);
PR: 164475
Differential Revision: D1023
No objections from: net@
Relnotes: yes
Sponsored by: Yandex LLC
Some virtual if drivers has (ab)used ifa ifa_rtrequest hook to enforce
route MTU to be not bigger that interface MTU. While ifa_rtrequest hooking
might be an option in some situation, it is not feasible to do MTU checks
there: generic (or per-domain) routing code is perfectly capable of doing
this.
We currrently have 3 places where MTU is altered:
1) route addition.
In this case domain overrides radix _addroute callback (in[6]_addroute)
and all necessary checks/fixes are/can be done there.
2) route change (especially, GW change).
In this case, there are no explicit per-domain calls, but one can
override rte by setting ifa_rtrequest hook to domain handler
(inet6 does this).
3) ifconfig ifaceX mtu YYYY
In this case, we have no callbacks, but ip[6]_output performes runtime
checks and decreases rt_mtu if necessary.
Generally, the goals are to be able to handle all MTU changes in
control plane, not in runtime part, and properly deal with increased
interface MTU.
This commit changes the following:
* removes hooks setting MTU from drivers side
* adds proper per-doman MTU checks for case 1)
* adds generic MTU check for case 2)
* The latter is done by using new dom_ifmtu callback since
if_mtu denotes L3 interface MTU, e.g. maximum trasmitted _packet_ size.
However, IPv6 mtu might be different from if_mtu one (e.g. default 1280)
for some cases, so we need an abstract way to know maximum MTU size
for given interface and domain.
* moves rt_setmetrics() before MTU/ifa_rtrequest hooks since it copies
user-supplied data which must be checked.
* removes RT_LOCK_ASSERT() from other ifa_rtrequest hooks to be able to
use this functions on new non-inserted rte.
More changes will follow soon.
MFC after: 1 month
Sponsored by: Yandex LLC
According to IANA RPC uaddr registry, there are no AFs
except IPv4 and IPv6, so it's not worth being too abstract here.
Remove ne_rtable[AF_MAX+1] and use explicit per-AF radix tries.
Use own initialization without relying on domattach code.
While I admit that this was one of the rare places in kernel
networking code which really was capable of doing multi-AF
without any AF-depended code, it is not possible anymore to
rely on dom* code.
While here, change terrifying "Invalid radix node head, rn:" message,
to different non-understandable "netcred already exists for given addr/mask",
but less terrifying. Since we know that rn_addaddr() returns NULL if
the same record already exists, we should provide more friendly error.
MFC after: 1 month
The check was recommened in the draft-ietf-ngtrans-mech-05.txt. But it isn't
clear, should it compare the source with all direct broadcast addresses in the
system or not.
RFC 4213 says it is enough to verify that the source address is the address
of the encapsulator, as configured on the decapsulator. And this verification
can be extended by administrator with any other forms of IPv4 ingress filtering.
Discussed with: glebius, melifaro
Sponsored by: Yandex LLC
directly accessed. Although this will work on some platforms, it can
throw an exception if the pointer is invalid and then panic the kernel.
Add a missing SYSCTL_IN() of "SCTP_BASE_STATS" structure.
MFC after: 3 days
Sponsored by: Mellanox Technologies
The kernel tracks syscall users so that modules can safely unregister them.
But if the module is not unloadable or was compiled into the kernel, there is
no need to do this.
Achieve this by adding SY_THR_STATIC_KLD macro which expands to SY_THR_STATIC
during kernel build and 0 otherwise.
Reviewed by: kib (previous version)
MFC after: 2 weeks
that sctp_med_chunk_output() always initialized the reason_code
instead of relying on the caller.
The variable is only used for debugging purpose.
This issue was reported by Peter Bostroem from Google.
MFC after: 3 days
- Wrong integer type was specified.
- Wrong or missing "access" specifier. The "access" specifier
sometimes included the SYSCTL type, which it should not, except for
procedural SYSCTL nodes.
- Logical OR where binary OR was expected.
- Properly assert the "access" argument passed to all SYSCTL macros,
using the CTASSERT macro. This applies to both static- and dynamically
created SYSCTLs.
- Properly assert the the data type for both static and dynamic
SYSCTLs. In the case of static SYSCTLs we only assert that the data
pointed to by the SYSCTL data pointer has the correct size, hence
there is no easy way to assert types in the C language outside a
C-function.
- Rewrote some code which doesn't pass a constant "access" specifier
when creating dynamic SYSCTL nodes, which is now a requirement.
- Updated "EXAMPLES" section in SYSCTL manual page.
MFC after: 3 days
Sponsored by: Mellanox Technologies
sent incoming stream reset request was responded with failed
or denied.
Thanks to Peter Bostroem from Google for reporting the issue.
MFC after: 3 days
o convert to if_transmit;
o use rmlock to protect access to gif_softc;
o use sx lock to protect from concurrent ioctls;
o remove a lot of unneeded and duplicated code;
o remove cached route support (it won't work with concurrent io);
o style fixes.
Reviewed by: melifaro
Obtained from: Yandex LLC
MFC after: 1 month
Sponsored by: Yandex LLC
received any RST packet. Do not set error to ECONNRESET in this case.
Differential Revision: https://reviews.freebsd.org/D879
Reviewed by: rpaulo, adrian
Approved by: jhb (mentor)
Sponsored by: Verisign, Inc.
data in an mbuf, use M_WRITABLE() instead of a direct test of M_EXT;
the latter both unnecessarily exposes mbuf-allocator internals in the
protocol stack and is also insufficient to catch all cases of
non-writability.
(NB: m_pullup() does not actually guarantee that a writable mbuf is
returned, so further refinement of all of these code paths continues to
be required.)
Reviewed by: bz
MFC after: 3 days
Sponsored by: EMC / Isilon Storage Division
Differential Revision: https://reviews.freebsd.org/D900
These are needed for the forthcoming vxlan implementation. The context
pointer means we do not have to use a spare pointer field in the inpcb,
and the source address is required to populate vxlan's forwarding table.
While I highly doubt there is an out of tree consumer of the UDP
tunneling callback, this change may be a difficult to eventually MFC.
Phabricator: https://reviews.freebsd.org/D383
Reviewed by: gnn
syscalls themselves are tightly coupled with the network stack and
therefore should not be in the generic socket code.
The following four syscalls have been marked as NOSTD so they can be
dynamically registered in sctp_syscalls_init() function:
sys_sctp_peeloff
sys_sctp_generic_sendmsg
sys_sctp_generic_sendmsg_iov
sys_sctp_generic_recvmsg
The syscalls are also set up to be dynamically registered when COMPAT32
option is configured.
As a side effect of moving the SCTP syscalls, getsock_cap needs to be
made available outside of the uipc_syscalls.c source file. A proper
prototype has been added to the sys/socketvar.h header file.
API tests from the SCTP reference implementation have been run to ensure
compatibility. (http://code.google.com/p/sctp-refimpl/source/checkout)
Submitted by: Steve Kiernan <stevek@juniper.net>
Reviewed by: tuexen, rrs
Obtained from: Juniper Networks, Inc.
This partitular case is the only path where the mbuf could be NULL.
udp_append() checked for a NULL mbuf only after invoking the tunneling
callback. Our only in tree tunneling callback - SCTP - assumed a non
NULL mbuf, and it is a bit odd to make the callbacks responsible for
checking this condition.
This also reduces the differences between the IPv4 and IPv6 code.
MFC after: 1 month
outer header, consider it as new packet and clear the protocols flags.
This fixes problems when IPSEC traffic goes through various tunnels and router
doesn't send ICMP/ICMPv6 errors.
PR: 174602
Obtained from: Yandex LLC
MFC after: 2 weeks
Sponsored by: Yandex LLC
consistent with the amount of data provided in the SCTP_RESET_STREAMS
socket option.
Thanks to Peter Bostroem from Google for drawing my attention to
this part of the code.
from xnu sources. If we encounter a network where ICMP is blocked
the Needs Frag indicator may not propagate back to us. Attempt to
downshift the mss once to a preconfigured value.
Default this feature to off for now while we do not have a full PLPMTUD
implementation in our stack.
Adds the following new sysctl's for control:
net.inet.tcp.pmtud_blackhole_detection -- turns on/off this feature
net.inet.tcp.pmtud_blackhole_mss -- mss to try for ipv4
net.inet.tcp.v6pmtud_blackhole_mss -- mss to try for ipv6
Adds the following new sysctl's for monitoring:
-- Number of times the code was activated to attempt a mss downshift
net.inet.tcp.pmtud_blackhole_activated
-- Number of times the blackhole mss was used in an attempt to downshift
net.inet.tcp.pmtud_blackhole_min_activated
-- Number of times that we failed to connect after we downshifted the mss
net.inet.tcp.pmtud_blackhole_failed
Phabricator: https://reviews.freebsd.org/D506
Reviewed by: rpaulo bz
MFC after: 2 weeks
Relnotes: yes
Sponsored by: Limelight Networks
'if'/'else' case: it matches the simple 'else' case that follows.
This reduces awareness of external-storage mechanics outside of the
mbuf allocator.
Reviewed by: bz
MFC after: 3 days
Sponsored by: EMC / Isilon Storage Division
Differential Revision: https://reviews.freebsd.org/D900
that this means full checksum coverage for received packets.
If an application is willing to accept packets with partial
coverage, it is expected to use the socekt option and provice
the minimum coverage it accepts.
Reviewed by: kevlo
MFC after: 3 days
- tcp_get_sav() - SADB key lookup
- tcp_signature_do_compute() - actual computation
* Fix TCP signature case for listening socket:
do not assume EVERY connection coming to socket
with TCP_SIGNATURE set to be md5 signed regardless
of SADB key existance for particular address. This
fixes the case for routing software having _some_
BGP sessions secured by md5.
* Simplify TCP_SIGNATURE handling in tcp_input()
MFC after: 2 weeks
The current TSO limitation feature only takes the total number of
bytes in an mbuf chain into account and does not limit by the number
of mbufs in a chain. Some kinds of hardware is limited by two
factors. One is the fragment length and the second is the fragment
count. Both of these limits need to be taken into account when doing
TSO. Else some kinds of hardware might have to drop completely valid
mbuf chains because they cannot loaded into the given hardware's DMA
engine. The new way of doing TSO limitation has been made backwards
compatible as input from other FreeBSD developers and will use
defaults for values not set.
Reviewed by: adrian, rmacklem
Sponsored by: Mellanox Technologies
MFC after: 1 week
fibs. Use the mbuf's or the socket's fib instead of RT_ALL_FIBS. Fixes PR
187553. Also fixes netperf's UDP_STREAM test on a nondefault fib.
sys/netinet/ip_output.c
In ip_output, lookup the source address using the mbuf's fib instead
of RT_ALL_FIBS.
sys/netinet/in_pcb.c
in in_pcbladdr, lookup the source address using the socket's fib,
because we don't seem to have the mbuf fib. They should be the same,
though.
tests/sys/net/fibs_test.sh
Clear the expected failure on udp_dontroute.
PR: 187553
CR: https://reviews.freebsd.org/D772
MFC after: 3 weeks
Sponsored by: Spectra Logic
This fixes a bug which resulted in a warning on the userland
stack, when compiled on Windows.
Thanks to Peter Kasting from Google for reporting the issue and
provinding a potential fix.
MFC after: 3 days
towards blind SYN/RST spoofed attack.
Originally our stack used in-window checks for incoming SYN/RST
as proposed by RFC793. Later, circa 2003 the RST attack was
mitigated using the technique described in P. Watson
"Slipping in the window" paper [1].
After that, the checks were only relaxed for the sake of
compatibility with some buggy TCP stacks. First, r192912
introduced the vulnerability, just fixed by aforementioned SA.
Second, r167310 had slightly relaxed the default RST checks,
instead of utilizing net.inet.tcp.insecure_rst sysctl.
In 2010 a new technique for mitigation of these attacks was
proposed in RFC5961 [2]. The idea is to send a "challenge ACK"
packet to the peer, to verify that packet arrived isn't spoofed.
If peer receives challenge ACK it should regenerate its RST or
SYN with correct sequence number. This should not only protect
against attacks, but also improve communication with broken
stacks, so authors of reverted r167310 and r192912 won't be
disappointed.
[1] http://bandwidthco.com/whitepapers/netforensics/tcpip/TCP Reset Attacks.pdf
[2] http://www.rfc-editor.org/rfc/rfc5961.txt
Changes made:
o Revert r167310.
o Implement "challenge ACK" protection as specificed in RFC5961
against RST attack. On by default.
- Carefully preserve r138098, which handles empty window edge
case, not described by the RFC.
- Update net.inet.tcp.insecure_rst description.
o Implement "challenge ACK" protection as specificed in RFC5961
against SYN attack. On by default.
- Provide net.inet.tcp.insecure_syn sysctl, to turn off
RFC5961 protection.
The changes were tested at Netflix. The tested box didn't show
any anomalies compared to control box, except slightly increased
number of TCP connection in LAST_ACK state.
Reviewed by: rrs
Sponsored by: Netflix
Sponsored by: Nginx, Inc.
The current TSO limitation feature only takes the total number of
bytes in an mbuf chain into account and does not limit by the number
of mbufs in a chain. Some kinds of hardware is limited by two
factors. One is the fragment length and the second is the fragment
count. Both of these limits need to be taken into account when doing
TSO. Else some kinds of hardware might have to drop completely valid
mbuf chains because they cannot loaded into the given hardware's DMA
engine. The new way of doing TSO limitation has been made backwards
compatible as input from other FreeBSD developers and will use
defaults for values not set.
MFC after: 1 week
Sponsored by: Mellanox Technologies
ifa_ifwithdstaddr. For the sake of backwards compatibility, the new
arguments were added to new functions named ifa_ifwithnet_fib and
ifa_ifwithdstaddr_fib, while the old functions became wrappers around the
new ones that passed RT_ALL_FIBS for the fib argument. However, the
backwards compatibility is not desired for FreeBSD 11, because there are
numerous other incompatible changes to the ifnet(9) API. We therefore
decided to remove it from head but leave it in place for stable/9 and
stable/10. In addition, this commit adds the fib argument to
ifa_ifwithbroadaddr for consistency's sake.
sys/sys/param.h
Increment __FreeBSD_version
sys/net/if.c
sys/net/if_var.h
sys/net/route.c
Add fibnum argument to ifa_ifwithbroadaddr, and remove the _fib
versions of ifa_ifwithdstaddr, ifa_ifwithnet, and ifa_ifwithroute.
sys/net/route.c
sys/net/rtsock.c
sys/netinet/in_pcb.c
sys/netinet/ip_options.c
sys/netinet/ip_output.c
sys/netinet6/nd6.c
Fixup calls of modified functions.
share/man/man9/ifnet.9
Document changed API.
CR: https://reviews.freebsd.org/D458
MFC after: Never
Sponsored by: Spectra Logic
A non-global IPv6 address can be used in more than one zone of the same
scope. This zone index is used to identify to which zone a non-global
address belongs.
Also we can have many foreign hosts with equal non-global addresses,
but from different zones. So, they can have different metrics in the
host cache.
Obtained from: Yandex LLC
Sponsored by: Yandex LLC
with no RSS hash.
When doing RSS:
* Create a new IPv4 netisr which expects the frames to have been verified;
it just directly dispatches to the IPv4 input path.
* Once IPv4 reassembly is done, re-calculate the RSS hash with the new
IP and L3 header; then reinject it as appropriate.
* Update the IPv4 netisr to be a CPU affinity netisr with the RSS hash
function (rss_soft_m2cpuid) - this will do a software hash if the
hardware doesn't provide one.
NICs that don't implement hardware RSS hashing will now benefit from RSS
distribution - it'll inject into the correct destination netisr.
Note: the netisr distribution doesn't work out of the box - netisr doesn't
query RSS for how many CPUs and the affinity setup. Yes, netisr likely
shouldn't really be doing CPU stuff anymore and should be "some kind of
'thing' that is a workqueue that may or may not have any CPU affinity";
that's for a later commit.
Differential Revision: https://reviews.freebsd.org/D527
Reviewed by: grehan
and egress.
* rss_mbuf_software_hash_v4 - look at the IPv4 mbuf to fetch the IPv4 details
+ direction to calculate a hash.
* rss_proto_software_hash_v4 - hash the given source/destination IPv4 address,
port and direction.
* rss_soft_m2cpuid - map the given mbuf to an RSS CPU ("bucket" for now)
These functions are intended to be used by the stack to support
the following:
* Not all NICs do RSS hashing, so we should support some way of doing
a hash in software;
* The NIC / driver may not hash frames the way we want (eg UDP 4-tuple
hashing when the stack is only doing 2-tuple hashing for UDP); so we
may need to re-hash frames;
* .. same with IPv4 fragments - they will need to be re-hashed after
reassembly;
* .. and same with things like IP tunneling and such;
* The transmit path for things like UDP, RAW and ICMP don't currently
have any RSS information attached to them - so they'll need an
RSS calculation performed before transmit.
TODO:
* Counters! Everywhere!
* Add a debug mode that software hashes received frames and compares them
to the hardware hash provided by the hardware to ensure they match.
The IPv6 part of this is missing - I'm going to do some re-juggling of
where various parts of the RSS framework live before I add the IPv6
code (read: the IPv6 code is going to go into netinet6/in6_rss.[ch],
rather than living here.)
Note: This API is still fluid. Please keep that in mind.
Differential Revision: https://reviews.freebsd.org/D527
Reviewed by: grehan
information as part of recvmsg().
This is primarily used for debugging/verification of the various
processing paths in the IP, PCB and driver layers.
Unfortunately the current implementation of the control message path
results in a ~10% or so drop in UDP frame throughput when it's used.
Differential Revision: https://reviews.freebsd.org/D527
Reviewed by: grehan
overriding an existing flowid/flowtype field in the outbound mbuf with
the inp_flowid/inp_flowtype details.
The upcoming RSS UDP support calculates a valid RSS value for outbound
mbufs and since it may change per send, it doesn't cache it in the inpcb.
So overriding it here would be wrong.
Differential Revision: https://reviews.freebsd.org/D527
Reviewed by: grehan
Kernel changes:
* Split kernel/userland nat structures eliminating IPFW_INTERNAL hack.
* Add IP_FW_NAT44_* codes resemblin old ones.
* Assume that instances can be named (no kernel support currently).
* Use both UH+WLOCK locks for all configuration changes.
* Provide full ABI support for old sockopts.
Userland changes:
* Use IP_FW_NAT44_* codes for nat operations.
* Remove undocumented ability to show ranges of nat "log" entries.
eliminiates some warnings when building in userland.
Thanks to Patrick Laimbock for reporting this issue.
Remove also some unnecessary casts.
There should be no functional change.
MFC after: 1 week
packets targeting a listening socket. Permit to reduce TCP input
processing starvation in context of high SYN load (e.g. short-lived TCP
connections or SYN flood).
Submitted by: Julien Charbon <jcharbon@verisign.com>
Reviewed by: adrian, hiren, jhb, Mike Bentkofsky
try to collapse adjacent pieces using m_catpkt(). In best case
scenario it copies data and frees mbufs, making mbuf exhaustion
attack harder.
Suggested by: Jonathan Looney <jonlooney gmail.com>
Security: Hardens against remote mbuf exhaustion attack.
Sponsored by: Netflix
Sponsored by: Nginx, Inc.
packets at all. Swapping byte order on SOCK_RAW was actually a bug, an
artifact from the BSD network stack, that used to convert a packet to
native byte order once it is received by kernel.
Other operating systems didn't follow this, and later other BSD
descendants fixed this, leaving us alone with the bug. Now it is
clear that we should fix the bug.
In collaboration with: Olivier Cochard-Labbé <olivier cochard.me>
See also: https://wiki.freebsd.org/SOCK_RAW
Sponsored by: Nginx, Inc.
This is the last major change in given branch.
Kernel changes:
* Use 64-bytes structures to hold multi-value variables.
* Use shared array to hold values from all tables (assume
each table algo is capable of holding 32-byte variables).
* Add some placeholders to support per-table value arrays in future.
* Use simple eventhandler-style API to ease the process of adding new
table items. Currently table addition may required multiple UH drops/
acquires which is quite tricky due to atomic table modificatio/swap
support, shared array resize, etc. Deal with it by calling special
notifier capable of rolling back state before actually performing
swap/resize operations. Original operation then restarts itself after
acquiring UH lock.
* Bump all objhash users default values to at least 64
* Fix custom hashing inside objhash.
Userland changes:
* Add support for dumping shared value array via "vlist" internal cmd.
* Some small print/fill_flags dixes to support u32 values.
* valtype is now bitmask of
<skipto|pipe|fib|nat|dscp|tag|divert|netgraph|limit|ipv4|ipv6>.
New values can hold distinct values for each of this types.
* Provide special "legacy" type which assumes all values are the same.
* More helpers/docs following..
Some examples:
3:41 [1] zfscurr0# ipfw table mimimi create valtype skipto,limit,ipv4,ipv6
3:41 [1] zfscurr0# ipfw table mimimi info
+++ table(mimimi), set(0) +++
kindex: 2, type: addr
references: 0, valtype: skipto,limit,ipv4,ipv6
algorithm: addr:radix
items: 0, size: 296
3:42 [1] zfscurr0# ipfw table mimimi add 10.0.0.5 3000,10,10.0.0.1,2a02:978:2::1
added: 10.0.0.5/32 3000,10,10.0.0.1,2a02:978:2::1
3:42 [1] zfscurr0# ipfw table mimimi list
+++ table(mimimi), set(0) +++
10.0.0.5/32 3000,0,10.0.0.1,2a02:978:2::1
is found, the first usable address is returned for legacy ioctls like
SIOCGIFBRDADDR, SIOCGIFDSTADDR, SIOCGIFNETMASK and SIOCGIFADDR.
While there also fix a subtle issue that a caller from a jail asking for
INADDR_ANY may get the first IP of the host that do not belong to the jail.
Submitted by: glebius
Differential Revision: https://reviews.freebsd.org/D667
socket options. This includes managing the correspoing stat counters.
Add the SCTP_DETAILED_STR_STATS kernel option to control per policy
counters on every stream. The default is off and only an aggregated
counter is available. This is sufficient for the RTCWeb usecase.
MFC after: 1 week
Most of the tablearg-supported opcodes does not accept 0 as valid value:
O_TAG, O_TAGGED, O_PIPE, O_QUEUE, O_DIVERT, O_TEE, O_SKIPTO, O_CALLRET,
O_NETGRAPH, O_NGTEE, O_NAT treats 0 as invalid input.
The rest are O_SETDSCP and O_SETFIB.
'Fix' them by adding high-order bit (0x8000) set for non-tablearg values.
Do translation in kernel for old clients (import_rule0 / export_rule0),
teach current ipfw(8) binary to add/remove given bit.
This change does not affect handling SETDSCP values, but limit
O_SETFIB values to 32767 instead of 65k. Since currently we have either
old (16) or new (2^32) max fibs, this should not be a big deal:
we're definitely OK for former and have to add another opcode to deal
with latter, regardless of tablearg value.
* Since there seems to be lack of consensus on strict value typing,
remove non-default value types. Use userland-only "value format type"
to print values.
Kernel changes:
* Add IP_FW_XMODIFY to permit table run-time modifications.
Currently we support changing limit and value format type.
Userland changes:
* Support IP_FW_XMODIFY opcode.
* Support specifying value format type (ftype) in tablble create/modify req
* Fine-print value type/value format type.
* Implement proper checks for switching between global and set-aware tables
* Split IP_FW_DEL mess into the following opcodes:
* IP_FW_XDEL (del rules matching pattern)
* IP_FW_XMOVE (move rules matching pattern to another set)
* IP_FW_SET_SWAP (swap between 2 sets)
* IP_FW_SET_MOVE (move one set to another one)
* IP_FW_SET_ENABLE (enable/disable sets)
* Add IP_FW_XZERO / IP_FW_XRESETLOG to finish IP_FW3 migration.
* Use unified ipfw_range_tlv as range description for all of the above.
* Check dynamic states IFF there was non-zero number of deleted dyn rules,
* Del relevant dynamic states with singe traversal instead of per-rule one.
Userland changes:
* Switch ipfw(8) to use new opcodes.
Kernel changes:
* Add opcode IP_FW_TABLE_XSWAP
* Add support for swapping 2 tables with the same type/ftype/vtype.
* Make skipto cache init after ipfw locks init.
Userland changes:
* Add "table X swap Y" command.
NRSACK extension. The default will still be off, since it
it not an RFC (yet).
Changing the sysctl name will be in a separate commit.
MFC after: 1 week
option for controlling ECN on future associations and get the
status on current associations.
A simialar pattern will be used for controlling SCTP extensions in
upcoming commits.
The rss_key[] array in netinet/in_rss.c has the bytes in incorrect
order. This results in the RSS test vectors in the Microsft RSS spec
and Intel NIC specs giving incorrect results, and making it difficult
to verify correct hash operation when RSS functionality is added to
new NICs.
CR: https://phabric.freebsd.org/D516
Reviewed by: adrian
Kernel changes:
* Add TEI_FLAGS_DONTADD entry flag to indicate that insert is not possible
* Support given flag in all algorithms
* Add "limit" field to ipfw_xtable_info
* Add actual limiting code into add_table_entry()
Userland changes:
* Add "limit" option as "create" table sub-option. Limit modification
is currently impossible.
* Print human-readable errors in table enry addition/deletion code.
* Add "flow:hash" algorithm
Kernel changes:
* Add O_IP_FLOW_LOOKUP opcode to support "flow" lookups
* Add IPFW_TABLE_FLOW table type
* Add "struct tflow_entry" as strage for 6-tuple flows
* Add "flow:hash" algorithm. Basically it is auto-growing chained hash table.
Additionally, we store mask of fields we need to compare in each instance/
* Increase ipfw_obj_tentry size by adding struct tflow_entry
* Add per-algorithm stat (ifpw_ta_tinfo) to ipfw_xtable_info
* Increase algoname length: 32 -> 64 (algo options passed there as string)
* Assume every table type can be customized by flags, use u8 to store "tflags" field.
* Simplify ipfw_find_table_entry() by providing @tentry directly to algo callback.
* Fix bug in cidr:chash resize procedure.
Userland changes:
* add "flow table(NAME)" syntax to support n-tuple checking tables.
* make fill_flags() separate function to ease working with _s_x arrays
* change "table info" output to reflect longer "type" fields
Syntax:
ipfw table fl2 create type flow:[src-ip][,proto][,src-port][,dst-ip][dst-port] [algo flow:hash]
Examples:
0:02 [2] zfscurr0# ipfw table fl2 create type flow:src-ip,proto,dst-port algo flow:hash
0:02 [2] zfscurr0# ipfw table fl2 info
+++ table(fl2), set(0) +++
kindex: 0, type: flow:src-ip,proto,dst-port
valtype: number, references: 0
algorithm: flow:hash
items: 0, size: 280
0:02 [2] zfscurr0# ipfw table fl2 add 2a02:6b8::333,tcp,443 45000
0:02 [2] zfscurr0# ipfw table fl2 add 10.0.0.92,tcp,80 22000
0:02 [2] zfscurr0# ipfw table fl2 list
+++ table(fl2), set(0) +++
2a02:6b8::333,6,443 45000
10.0.0.92,6,80 22000
0:02 [2] zfscurr0# ipfw add 200 count tcp from me to 78.46.89.105 80 flow 'table(fl2)'
00200 count tcp from me to 78.46.89.105 dst-port 80 flow table(fl2)
0:03 [2] zfscurr0# ipfw show
00200 0 0 count tcp from me to 78.46.89.105 dst-port 80 flow table(fl2)
65535 617 59416 allow ip from any to any
0:03 [2] zfscurr0# telnet -s 10.0.0.92 78.46.89.105 80
Trying 78.46.89.105...
..
0:04 [2] zfscurr0# ipfw show
00200 5 272 count tcp from me to 78.46.89.105 dst-port 80 flow table(fl2)
65535 682 66733 allow ip from any to any
Previously there was a race condition between the address addition
and associating it with the CARP which resulted in the interface
MAC, instead of the CARP MAC, being used for a brief amount of time.
This caused "is using my IP address" warnings as well as data being
sent to the wrong machine due to incorrect ARP entries being recorded
by other devices on the network.
Kernel changes:
* s/IPFW_TABLE_U32/IPFW_TABLE_NUMBER/
* Force "lookup <port|uid|gid|jid>" to be IPFW_TABLE_NUMBER
* Support "lookup" method for number tables
* Add number:array algorihm (i32 as key, auto-growing).
Userland changes:
* Support named tables in "lookup <tag> Table"
* Fix handling of "table(NAME,val)" case
* Support printing "number" table data.
* Rewrite interface tables to use interface indexes
Kernel changes:
* Add generic interface tracking API:
- ipfw_iface_ref (must call unlocked, performs lazy init if needed, allocates
state & bumps ref)
- ipfw_iface_add_ntfy(UH_WLOCK+WLOCK, links comsumer & runs its callback to
update ifindex)
- ipfw_iface_del_ntfy(UH_WLOCK+WLOCK, unlinks consumer)
- ipfw_iface_unref(unlocked, drops reference)
Additionally, consumer callbacks are called in interface withdrawal/departure.
* Rewrite interface tables to use iface tracking API. Currently tables are
implemented the following way:
runtime data is stored as sorted array of {ifidx, val} for existing interfaces
full data is stored inside namedobj instance (chained hashed table).
* Add IP_FW_XIFLIST opcode to dump status of tracked interfaces
* Pass @chain ptr to most non-locked algorithm callbacks:
(prepare_add, prepare_del, flush_entry ..). This may be needed for better
interaction of given algorithm an other ipfw subsystems
* Add optional "change_ti" algorithm handler to permit updating of
cached table_info pointer (happens in case of table_max resize)
* Fix small bug in ipfw_list_tables()
* Add badd (insert into sorted array) and bdel (remove from sorted array) funcs
Userland changes:
* Add "iflist" cmd to print status of currently tracked interface
* Add stringnum_cmp for better interface/table names sorting
so it really should not be under "optional inet". The fact that uipc_accf.c
lives under kern/ lends some weight to making it a "standard" file.
Moving kern/uipc_accf.c from "optional inet" to "standard" eliminates the
need for #ifdef INET in kern/uipc_socket.c.
Also, this meant the net.inet.accf.unloadable sysctl needed to move, as
net.inet does not exist without networking compiled in (as it lives in
netinet/in_proto.c.) The new sysctl has been named net.accf.unloadable.
In order to support existing accept filter sysctls, the net.inet.accf node
has been added netinet/in_proto.c.
Submitted by: Steve Kiernan <stevek@juniper.net>
Obtained from: Juniper Networks, Inc.
PF_LINK, and multicast/broadcast flag should always be dropped because
the outer protocol uses unicast even when the inner address is not for
unicast. It had been broken since r236951 when gif_output() started to
use IFQ_HANDOFF().
by the stack.
Right now the stack isn't really setup for RSS with 4-tuple UDP hashing
for either IPv4 and IPv6.
The specifics:
* The UDP init path udp_init() and udplite_init() specify the hash as
2-tuple, so the PCBGROUPS code only tries a 2-tuple check;
* The PCBGROUPS and RSS code doesn't know about the UDP hash types
just yet, so they're never treated as valid hashes.
* For correctness, 4-tuple can't be enabled in the general case because
UDP datagrams can be more fragmented than IP datagrams may be.
Strictly speaking, TCP datagrams may also be fragmented and this could
cause issues with PCBGROUPS/RSS until the IP defragment path grows some
code to re-calculate the RSS hash.
I'll follow this commit up with awareness of the UDP 4-tuple for those
who wish to configure it, but for now it'll stay disabled.
No drivers (yet) know to use this function when RSS is enabled.
markedly better distribution of IPv6 address/ports than the previous key.
The previous key would hash large swaths of the port space for a given
source/destination IP address to the same low handful of bits, effectively
mapping them to the same queue. This made testing very .. special.
a source address was selected and cached, but it was not
stored that is was cached. This resulted in selecting
different source addresses for the INIT-ACK and COOKIE-ACK
when possible.
Thanks to Niu Zhixiong for reporting the issue.
MFC after: 1 week
awareness.
* Introduce IP_BINDMULTI - indicating that it's okay to bind multiple
sockets on the same bind details.
Although the PCB code has been taught about this (see below) this patch
doesn't introduce the rest of the PCB changes necessary to distribute
lookups among multiple PCB entries in the global wildcard table.
* Introduce IP_RSS_LISTEN_BUCKET - placing an listen socket into the
given RSS bucket (and thus a single PCBGROUP hash.)
* Modify the PCB add path to be aware of IP_BINDMULTI:
+ Only allow further PCB entries to be added if the owner credentials
and IP_BINDMULTI has been specified. Ie, only allow further
IP_BINDMULTI sockets to appear if the first bind() was IP_BINDMULTI.
* Teach the PCBGROUP code about IP_RSS_LISTE_BUCKET marked PCB entries.
Instead of using the wildcard logic and hashing, these sockets are
simply placed into the PCBGROUP and _not_ in the wildcard hash.
* When doing a PCBGROUP lookup, also do a wildcard match as well.
This allows for an RSS bucket PCB entry to appear in a PCBGROUP
rather than having to exist in the wildcard list.
Tested:
* TCP IPv4 server testing with igb(4)
* TCP IPv4 server testing with ix(4)
TODO:
* The pcbgroup lookup code duplicated the wildcard and wildcard-PCB
logic. This could be refactored into a single function.
* This doesn't yet work for IPv6 (The PCBGROUP code in netinet6/ doesn't
yet know about this); nor does it yet fully work for UDP.
* Switch kernel to use per-cpu counters for rules.
* Keep ABI/API.
Kernel changes:
* Each rules is now exported as TLV with optional extenable
counter block (ip_fW_bcounter for base one) and
ip_fw_rule for rule&cmd data.
* Counters needs to be explicitly requested by IPFW_CFG_GET_COUNTERS flag.
* Separate counters from rules in kernel and clean up ip_fw a bit.
* Pack each rule in IPFW_TLV_RULE_ENT tlv to ease parsing.
* Introduce versioning in container TLV (may be needed in future).
* Fix ipfw_cfg_lheader broken u64 alignment.
Userland changes:
* Use set_mask from cfg header when requesting config
* Fix incorrect read accouting in ipfw_show_config()
* Use IPFW_RULE_NOOPT flag instead of playing with _pad
* Fix "ipfw -d list": do not print counters for dynamic states
* Some small fixes
Kernel changes:
* Change dump format for dynamic states:
each state is now stored inside ipfw_obj_dyntlv
last dynamic state is indicated by IPFW_DF_LAST flag
* Do not perform sooptcopyout() for !SOPT_GET requests.
Userland changes:
* Introduce foreach_state() function handler to ease work
with different states passed by ipfw_dump_config().
* Bump table dump format preserving old ABI.
Kernel size:
* Add IP_FW_TABLE_XFIND to handle "lookup" request from userland.
* Add ta_find_tentry() algorithm callbacks/handlers to support lookups.
* Fully switch to ipfw_obj_tentry for various table dumps:
algorithms are now required to support the latest (ipfw_obj_tentry) entry
dump format, the rest is handled by generic dump code.
IP_FW_TABLE_XLIST opcode version bumped (0 -> 1).
* Eliminate legacy ta_dump_entry algo handler:
dump_table_entry() converts data from current to legacy format.
Userland side:
* Add "lookup" table parameter.
* Change the way table type is guessed: call table_get_info() first,
and check value for IPv4/IPv6 type IFF table does not exist.
* Fix table_get_list(): do more tries if supplied buffer is not enough.
* Sparate table_show_entry() from table_show_list().