In case the first fragmented part (off=0) arrives we check for the
maximum packet size for each fragmented part we already queued with the
addition of the unfragmentable part from the first one.
For one we do not have to enter the loop at all if this is the first
fragmented part to arrive, and we can skip the check.
Should we encounter an error case we send an ICMPv6 message for any
fragment exceeding the maximum length limit. While dequeueing the
original packet and freeing it, statistics were not updated and leaked
both the reassembly queue count for the fragment and the global
fragment count. Found by code inspection and confirmed by tightening
test cases checking more statistical and system counters.
While here properly wrap a line.
MFC after: 3 weeks
Sponsored by: Netflix
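The size check described in this commit can be sketched in userland C. This is only an illustration under stated assumptions, not the kernel code: struct frag_sketch, first_oversized(), and the field names are hypothetical; the 65535-byte maximum comes from the 16-bit IPv6 Payload Length field.

```c
#include <stddef.h>
#include <stdint.h>

/* Maximum IPv6 payload length (16-bit Payload Length field). */
#define SKETCH_IPV6_MAXPACKET 65535u

/* Hypothetical stand-in for an already queued fragment. */
struct frag_sketch {
	uint32_t off;	/* offset within the fragmentable part */
	uint32_t len;	/* length of this fragment's payload */
};

/*
 * Return the index of the first queued fragment that exceeds the
 * maximum once the unfragmentable part length is known, or -1 if
 * there is none.  With n == 0 (the first fragment is also the first
 * to arrive) the loop body is never entered.
 */
static int
first_oversized(const struct frag_sketch *q, size_t n, uint32_t unfrglen)
{
	for (size_t i = 0; i < n; i++) {
		if (unfrglen + q[i].off + q[i].len > SKETCH_IPV6_MAXPACKET)
			return ((int)i);
	}
	return (-1);
}
```

In the real code path an offending fragment would additionally trigger the ICMPv6 error and the counter updates mentioned above; the sketch leaves that out.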
When we are checking for the maximum reassembled packet size of the
fragmentable part and run into the error case (packet too big),
we are leaking the packet queue entry if this was a first fragment
to arrive.
Properly cleanup, removing the queue entry from the bucket, decrementing
counters, and freeing the memory.
MFC after: 3 weeks
Sponsored by: Netflix
Per the specification the upper layer header needs to be within the first
fragment. The check was not done so far and there is an open review for
related work, so just leave a note as to where to put it.
Move the extraction of frag offset up to this as it is needed to determine
whether this is a first fragment or not.
MFC after: 3 weeks
Sponsored by: Netflix
Check whether we are accepting more fragments (based on global limits)
before doing expensive operations of calculating the hash and taking the
bucket lock. This slightly increases a "race" between check time and
incrementing counters (which is already there) possibly allowing a few
more fragments than the maximum limits. However, when under attack,
we would rather save this CPU time for other packets/work.
MFC after: 3 weeks
Sponsored by: Netflix
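The ordering described in this commit (cheap limit check first, hashing and locking later) might look roughly like the following userland sketch. The counter, the limit, and both function names are hypothetical, and C11 atomics stand in for the kernel's own primitives.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical global counter and limit; names are illustrative only. */
static atomic_uint sketch_nfrags;
static unsigned int sketch_maxfrags = 4;

/*
 * Cheap, lock-free admission check done before any hashing or locking.
 * The check and the later increment are not one atomic step, so a few
 * fragments beyond the limit may slip through under concurrency.
 */
static bool
sketch_accepting(void)
{
	return (atomic_load(&sketch_nfrags) < sketch_maxfrags);
}

/* Accounting that happens later, after the expensive work. */
static void
sketch_account(void)
{
	atomic_fetch_add(&sketch_nfrags, 1);
}
```

A caller would return early when sketch_accepting() is false and only then compute the hash and take the bucket lock.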
Rather than walking the mbuf chain manually use m_last() which does
exactly that for us.
Defer initializing srcifp for longer as there are multiple exit paths
out of the function which do not need it set. Initialize before taking
the lock though.
Rename the mtx lock to match the type better.
MFC after: 3 weeks
Sponsored by: Netflix
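For readers unfamiliar with m_last(), the manual chain walk it replaces amounts to the following. The structure is a hypothetical userland stand-in for struct mbuf with only the chain link, and chain_last() is an illustrative name.

```c
#include <stddef.h>

/* Hypothetical stand-in for struct mbuf; only the chain link matters. */
struct buf_sketch {
	struct buf_sketch *next;
};

/* Return the last element of a non-empty chain. */
static struct buf_sketch *
chain_last(struct buf_sketch *b)
{
	while (b->next != NULL)
		b = b->next;
	return (b);
}
```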
The IP6_REASS_MBUF() macro did some pointer gymnastics to end up with the
same type as it gets in [*(cast **)&]. Spelling it out instead saves all
this and makes the code more readable and less obfuscated directly using
the structure field.
MFC after: 3 weeks
Sponsored by: Netflix
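The [*(cast **)&] pattern mentioned in this commit can be reproduced in a few lines. The structure and macro below are hypothetical stand-ins (the field name and the exact macro body are assumptions); the point is only that taking the address, casting, and dereferencing yields the very same object as the plain field access.

```c
#include <stddef.h>

struct mbuf_sketch {			/* stand-in for struct mbuf */
	int dummy;
};

struct ip6asfrag_sketch {
	struct mbuf_sketch *ip6af_m;	/* field already has the wanted type */
};

/* Shape of such a macro: take the address, cast it, dereference again. */
#define SKETCH_REASS_MBUF(af) (*(struct mbuf_sketch **)&((af)->ip6af_m))
```

Both SKETCH_REASS_MBUF(af) and af->ip6af_m can be read and assigned to interchangeably, which is why spelling out the field is the more readable choice.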
Add some ASCII relation of how the bits plug together. The terminology
difference of "fragmented packets" and "fragment packets" is subtle.
While here clear up more whitespace and comments.
No functional change.
MFC after: 3 weeks
Sponsored by: Netflix
Remove the KAME custom circular queue for fragments and fragmented packets
and replace them with a standard TAILQ.
This makes the code a lot more understandable and maintainable and removes
further hand-rolled code from the tree using a standard interface instead.
Hide the still public structures under #ifdef _KERNEL as there is no
use for them in user space.
The naming is a bit confusing now as struct ip6q and the ip6q[] buckets
array are not the same anymore; sadly struct ip6q is also used by the
MAC framework and we cannot rename it.
Submitted by: jtl (initially)
MFC after: 3 weeks
Sponsored by: Netflix
Differential Revision: https://reviews.freebsd.org/D16847 (jtl's original)
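As an illustration of the standard interface, here is a minimal sketch of a sorted insert using the TAILQ macros from <sys/queue.h>. The entry structure and function name are hypothetical; the real fragment structures carry far more state.

```c
#include <stddef.h>
#include <sys/queue.h>

/* Hypothetical fragment entry keyed by its offset. */
struct frag_entry {
	int			off;
	TAILQ_ENTRY(frag_entry)	link;
};
TAILQ_HEAD(frag_head, frag_entry);

/* Insert keeping the list sorted by offset, using only standard macros. */
static void
frag_insert_sorted(struct frag_head *h, struct frag_entry *n)
{
	struct frag_entry *e;

	TAILQ_FOREACH(e, h, link) {
		if (e->off > n->off) {
			TAILQ_INSERT_BEFORE(e, n, link);
			return;
		}
	}
	TAILQ_INSERT_TAIL(h, n, link);
}
```

The head must be set up with TAILQ_INIT() before the first insert. No hand-written prev/next pointer manipulation is needed.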
When shutting down a VNET we did not cleanup the fragmentation hashes.
This has multiple problems: (1) we leak memory and (2) we leak the
global counters, which might eventually lead to a problem on a system
starting and stopping a lot of vnets and dealing with a lot of IPv6
fragments: the counters/limits would be exhausted and processing
would no longer take place.
Unfortunately we do not have a useable variable to indicate when
per-VNET initialization of frag6 has happened (or when destroy happened)
so introduce a boolean to flag this. This is needed here as well as
it was in r353635 for ip_reass.c in order to avoid tripping over the
already destroyed locks if interfaces go away after the frag6 destroy.
While splitting things up convert the TRY_LOCK to a LOCK operation in
now frag6_drain_one(). The try-lock was derived from a manual hand-rolled
implementation and carried forward all the time. We no longer can afford
not to get the lock as that would mean we would continue to leak memory.
Assert that all the buckets are empty before destroying the lock to
ensure long-term stability of a clean shutdown.
Reported by: hselasky
Reviewed by: hselasky
MFC after: 3 weeks
Sponsored by: Netflix
Differential Revision: https://reviews.freebsd.org/D22054
Add a read-only sysctl exporting the global number of fragments
(base system and all vnets). This is helpful to (a) know how many
fragments are currently being processed, (b) if there are possible
leaks, (c) if vnet teardown is not working correctly, and lastly
(d) it can be used as part of test suites to ensure (a) to (c).
MFC after: 3 weeks
Sponsored by: Netflix
partial fragmented packets before a network interface is detached.
When sending IPv4 or IPv6 fragmented packets and a fragment is lost
before the network device is freed, the mbuf making up the fragment
will remain in the temporary hashed fragment list and cause a panic
when it times out due to accessing a freed network interface
structure.
1) Make sure the m_pkthdr.rcvif always points to a valid network
interface. Else the rcvif field should be set to NULL.
2) Use the rcvif of the last received fragment as m_pkthdr.rcvif for
the fully defragged packet, instead of the first received fragment.
Panic backtrace for IPv6:
panic()
icmp6_reflect() # tries to access rcvif->if_afdata[AF_INET6]->xxx
icmp6_error()
frag6_freef()
frag6_slowtimo()
pfslowtimo()
softclock_call_cc()
softclock()
ithread_loop()
Reviewed by: bz
Differential Revision: https://reviews.freebsd.org/D19622
MFC after: 1 week
Sponsored by: Mellanox Technologies
Move ip6asfrag and the accompanying IP6_REASS_MBUF macro from
ip6_var.h into frag6.c as they are not used outside frag6.c.
Sadly struct ip6q is all over the mac framework so we have to
leave it public.
This reduces the public KPI space.
MFC after: 3 months
X-MFC: possibly MFC the #define only to stable branches
Sponsored by: Netflix
Consistently put () around return values.
Do not assign variables at the time of variable declaration.
Sort variables. Rename ia to ia6, remove/reuse some variables used only
once or twice for temporary calculations.
No functional changes intended.
MFC after: 3 months
Sponsored by: Netflix
Cleanup some comments (start with upper case, ends in punctuation,
use width and do not consume vertical space). Update comments to
RFC8200. Some whitespace changes.
No functional changes.
MFC after: 3 months
Sponsored by: Netflix
The hash buckets array is called ip6q. The data structure ip6q is a
description of a different object, the one the array holds these days
(since r337776). To clear some of this confusion, rename the array
to ip6qb.
When iterating over all buckets or addressing them directly, we
use at least the variables i, hash, and bucket. To keep the
terminology consistent use the variable name "bucket" and always
make it an uint32_t and not sometimes an int.
No functional behaviour changes intended.
MFC after: 3 months
Sponsored by: Netflix
Re-order functions within the file in preparation for an upcoming
code simplification.
No functional changes.
MFC after: 3 months
Sponsored by: Netflix
Bring back systm.h after r350532 and banish errno.h, time.h, and
machine/atomic.h.
Reported by: bde (Thank you!)
Pointyhat to: bz
MFC after: 12 weeks
X-MFC: with r350532
Sponsored by: Netflix
Removing the prototype from the header and making the function static
in r350533 makes architectures using gcc complain "function declaration
isn't a prototype". Add the missing void given the function has no
arguments.
Reported by: the CI machinery
Pointyhat to: bz
MFC after: 3 months
X-MFC with: r350533
Sponsored by: Netflix
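The difference the compiler complains about can be shown in a few lines. The function and counter names are made up for illustration. In C before C23, an empty parameter list in a declaration leaves the parameters unspecified, whereas (void) makes it a prototype.

```c
/* Old-style declaration: "()" says nothing about the parameters. */
/* static void sketch_init(); */

static int sketch_calls;

/* Prototype: "(void)" states that the function takes no arguments. */
static void sketch_init(void);

static void
sketch_init(void)
{
	sketch_calls++;
}
```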
Rename M_FTABLE to M_FRAG6 as the old name sounds very much like the former
"flowtable" rather than anything to do with fragments and reassembly.
While here, let malloc( , .. | M_ZERO) do the zeroing rather than calling
bzero() ourselves.
MFC after: 3 months
Sponsored by: Netflix
Remove all the #if 0 and #if notyet blocks of dead code which have been
there for at least 18 years from what I can see.
No functional changes.
MFC after: 3 months
Sponsored by: Netflix
Move the sysctls and the related variables only used in frag6.c
into the file and out of in6_proto.c. That way everything belonging
together is in one place.
Sort the variables into global and per-vnet scopes and make
them static. No longer export the (helper) function
frag6_set_bucketsize() now also file-local only.
Should be no functional changes, only reduced public KPI/KBI surface.
MFC after: 3 months
Sponsored by: Netflix
Sort includes and remove duplicate kernel.h as well as the unneeded
systm.h.
Hide the mac framework include behind #ifdef MAC.
MFC after: 3 months
Sponsored by: Netflix
fragmented packets.
When sending IPv4 and IPv6 fragmented packets and a fragment is lost,
the mbuf making up the fragment will remain in the temporary hashed
fragment list for a while. If the network interface departs before the
so-called slow timeout clears the packet, the fragment causes a panic
when the timeout kicks in due to accessing a freed network interface
structure.
Make sure that when a network device is departing, all hashed IPv4 and
IPv6 fragments belonging to it get freed.
Backtrace:
panic()
icmp6_reflect()
hlim = ND_IFINFO(m->m_pkthdr.rcvif)->chlim;
^^^^ rcvif->if_afdata[AF_INET6] is NULL.
icmp6_error()
frag6_freef()
frag6_slowtimo()
pfslowtimo()
softclock_call_cc()
softclock()
ithread_loop()
Differential Revision: https://reviews.freebsd.org/D19622
Reviewed by: bz (network), adrian
MFC after: 1 week
Sponsored by: Mellanox Technologies
Add a stat counter to track ipv6 atomic fragments. Atomic fragments can be
generated in response to invalid path MTU values, but are also a potential
attack vector and considered harmful (see RFC6946 and RFC8021).
While here add tracking of the atomic fragment counter to netstat and systat.
Reviewed by: tuexen, jtl, bz
Approved by: jtl (mentor), bz (mentor)
Event: Aberdeen hackathon 2019
Differential Revision: https://reviews.freebsd.org/D17511
When dropping a fragment queue, account for the number of fragments in the
queue. This improves accounting between the number of fragments received and
the number of fragments dropped.
Reviewed by: jtl, bz, transport
Approved by: jtl (mentor), bz (mentor)
Differential Revision: https://review.freebsd.org/D17521
r337776 started hashing the fragments into buckets for faster lookup.
The hashkey is larger than intended. This results in random stack data being
included in the hashed data, which in turn means that fragments of the same
packet might end up in different buckets, causing the reassembly to fail.
Set the correct size for hashkey.
PR: 231045
Approved by: re (kib)
MFC after: 3 days
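Why the key size matters can be sketched as follows. The key array is sized in 32-bit words to hold exactly the source address, the destination address, and the 32-bit fragment identification, so every hashed byte is initialized. All names are hypothetical, and FNV-1a is used here as a simple stand-in for the kernel's Jenkins hash.

```c
#include <stdint.h>
#include <string.h>

#define SKETCH_ADDR_LEN	16	/* size of an IPv6 address in bytes */

/* Key = source address + destination address + 32-bit fragment ident. */
#define SKETCH_KEY_WORDS \
	((2 * SKETCH_ADDR_LEN + sizeof(uint32_t)) / sizeof(uint32_t))

/* FNV-1a over the key bytes; NOT the hash the kernel uses. */
static uint32_t
sketch_hash(const uint32_t *key, size_t nwords)
{
	const unsigned char *p = (const unsigned char *)key;
	uint32_t h = 2166136261u;

	for (size_t i = 0; i < nwords * sizeof(uint32_t); i++) {
		h ^= p[i];
		h *= 16777619u;
	}
	return (h);
}

static uint32_t
sketch_frag_hash(const uint8_t src[SKETCH_ADDR_LEN],
    const uint8_t dst[SKETCH_ADDR_LEN], uint32_t ident)
{
	uint32_t key[SKETCH_KEY_WORDS];	/* exactly as large as the data */

	memcpy(&key[0], src, SKETCH_ADDR_LEN);
	memcpy(&key[4], dst, SKETCH_ADDR_LEN);
	key[8] = ident;
	return (sketch_hash(key, SKETCH_KEY_WORDS));
}
```

Had the array been declared larger than the data copied into it and hashed in full, the trailing words would contain whatever was on the stack, and two fragments of the same packet could hash differently.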
Currently, the limits are quite high. On machines with millions of
mbuf clusters, the reassembly queue limits can also run into
the millions. Lower these values.
Also, try to ensure that no bucket will have a reassembly
queue larger than approximately 100 items. This limits the cost to
find the correct reassembly queue when processing an incoming
fragment.
Due to the low limits on each bucket's length, increase the size of
the hash table from 64 to 1024.
Reviewed by: jhb
Security: FreeBSD-SA-18:10.ip
Security: CVE-2018-6923
Currently, we process IPv6 fragments with 0 bytes of payload, add them
to the reassembly queue, and do not recognize them as duplicating or
overlapping with adjacent 0-byte fragments. An attacker can exploit this
to create long fragment queues.
There is no legitimate reason for a fragment with no payload. However,
because IPv6 packets with an empty payload are acceptable, allow an
"atomic" fragment with no payload.
Reviewed by: jhb
Security: FreeBSD-SA-18:10.ip
Security: CVE-2018-6923
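The acceptance rule from this commit can be written as a small predicate. This is a sketch with hypothetical names; all values are assumed to be already converted to host byte order and the offset to be already masked.

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * Decide whether a fragment may be accepted, given its offset within
 * the fragmentable part, its "more fragments" flag, and its payload
 * length.
 */
static bool
sketch_frag_acceptable(uint16_t off, bool more_frags, uint16_t paylen)
{
	bool atomic = (off == 0 && !more_frags);

	if (atomic)
		return (true);		/* whole packet; empty payload is fine */
	return (paylen != 0);		/* real fragments must carry data */
}
```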
There is a hashing algorithm which should distribute IPv6 reassembly
queues across the available buckets in a relatively even way. However,
if there is a flaw in the hashing algorithm which allows a large number
of IPv6 fragment reassembly queues to end up in a single bucket, a per-
bucket limit could help mitigate the performance impact of this flaw.
Implement such a limit, with a default of twice the maximum number of
reassembly queues divided by the number of buckets. Recalculate the
limit any time the maximum number of reassembly queues changes.
However, allow the user to override the value using a sysctl
(net.inet6.ip6.maxfragbucketsize).
Reviewed by: jhb
Security: FreeBSD-SA-18:10.ip
Security: CVE-2018-6923
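The default described in this commit is simple arithmetic; a sketch follows. The function name is hypothetical, and clamping the result to at least 1 is an assumption made here so that small limits do not round down to zero.

```c
#include <stdint.h>

/*
 * Default per-bucket limit: twice the even share of the global
 * maximum, but at least 1 (the lower bound is an assumption).
 * nbuckets must be greater than zero.
 */
static unsigned int
sketch_bucket_limit(unsigned int maxqueues, unsigned int nbuckets)
{
	uint64_t limit = ((uint64_t)2 * maxqueues) / nbuckets;

	return (limit > 0 ? (unsigned int)limit : 1);
}
```

The function would need to be called again whenever the maximum number of reassembly queues changes, matching the recalculation described above.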
The IPv4 fragment reassembly code supports a limit on the number of
fragments per packet. The default limit is currently 17 fragments.
Among other things, this limit serves to limit the number of fragments
the code must parse when trying to reassemble a packet.
Add a limit to the IPv6 reassembly code. By default, limit a packet
to 65 fragments (64 on the queue, plus one final fragment to complete
the packet). This allows an average fragment size of 1,008 bytes, which
should be sufficient to hold a fragment. (Recall that the IPv6 minimum
MTU is 1280 bytes. Therefore, this configuration allows a full-size
IPv6 packet to be fragmented on a link with the minimum MTU and still
carry approximately 272 bytes of headers before the fragmented portion
of the packet.)
Users can adjust this limit using the net.inet6.ip6.maxfragsperpacket
sysctl.
Reviewed by: jhb
Security: FreeBSD-SA-18:10.ip
Security: CVE-2018-6923
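The numbers quoted in this commit can be rechecked with integer arithmetic. The constants mirror the values stated in the text; the macro and function names are made up, and the division rounds down.

```c
#define SKETCH_MAX_PAYLOAD	65535	/* largest IPv6 payload length */
#define SKETCH_MAX_FRAGS	65	/* default fragments per packet */
#define SKETCH_MIN_MTU		1280	/* IPv6 minimum link MTU */

/* Average fragment size for an evenly split payload, rounded down. */
static int
sketch_avg_frag_size(void)
{
	return (SKETCH_MAX_PAYLOAD / SKETCH_MAX_FRAGS);
}

/* Room left in a minimum-MTU packet for headers before the fragment data. */
static int
sketch_header_room(void)
{
	return (SKETCH_MIN_MTU - sketch_avg_frag_size());
}
```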
The IPv6 reassembly fragment limit is based on the number of mbuf clusters,
which are a global resource. However, the limit is currently applied
on a per-VNET basis. Given enough VNETs (or given sufficient customization
on enough VNETs), it is possible that the sum of all the VNET fragment
limits will exceed the number of mbuf clusters available in the system.
Given the fact that the fragment limits are intended (at least in part) to
regulate access to a global resource, the IPv6 fragment limit should
be applied on a global basis.
Note that it is still possible to disable fragmentation for a particular
VNET by setting the net.inet6.ip6.maxfragpackets sysctl to 0 for that
VNET. In addition, it is now possible to disable fragmentation globally
by setting the net.inet6.ip6.maxfrags sysctl to 0.
Reviewed by: jhb
Security: FreeBSD-SA-18:10.ip
Security: CVE-2018-6923
Currently, all IPv6 fragment reassembly queues are kept in a flat
linked list. This has a number of implications. Two significant
implications are: all reassembly operations share a common lock,
and it is possible for the linked list to grow quite large.
Improve IPv6 reassembly performance by hashing fragments into buckets,
each of which has its own lock. Calculate the hash key using a Jenkins
hash with a random seed.
Reviewed by: jhb
Security: FreeBSD-SA-18:10.ip
Security: CVE-2018-6923
Instead of returning a pointer to the previous header, return its offset.
In frag6_input() use m_copyback() and the determined offset to store the next
header instead of accessing it by pointer and assuming that the memory
is contiguous.
In rip6_input() use the offset returned by ip6_get_prevhdr() instead of
calculating it with pointer arithmetic, because the IP header can belong
to another mbuf in the chain.
Reported by: Maxime Villard <max at m00nbsd dot net>
Reviewed by: kp
MFC after: 1 week
Differential Revision: https://reviews.freebsd.org/D14158
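Why an offset is safer than a pointer can be shown with a userland stand-in for an mbuf chain. The structure and function are hypothetical. Unlike m_copyback(), this sketch stores a single byte and never extends the chain.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for an mbuf chain: data split over segments. */
struct seg_sketch {
	struct seg_sketch *next;
	uint8_t *data;
	size_t len;
};

/*
 * Store one byte at a logical offset in the chain, walking segments
 * instead of assuming contiguous memory.  Returns 0 on success, -1 if
 * the offset lies beyond the chain.
 */
static int
sketch_store_byte(struct seg_sketch *s, size_t off, uint8_t val)
{
	for (; s != NULL; s = s->next) {
		if (off < s->len) {
			s->data[off] = val;
			return (0);
		}
		off -= s->len;
	}
	return (-1);
}
```

A pointer computed relative to the first segment's data would point outside that segment as soon as the target byte lives in a later one; the offset-based walk has no such problem.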
the first mbuf of the reassembled datagram should have a pkthdr.
This was discovered with cxgbe(4) + IPSEC + ping with payload more than
interface MTU. cxgbe can generate !M_WRITEABLE mbufs and this results
in m_unshare being called on the reassembled datagram, and it complains:
panic: m_unshare: m0 0xfffff80020f82600, m 0xfffff8005d054100 has M_PKTHDR
PR: 224922
Reviewed by: ae@
MFC after: 1 week
Sponsored by: Chelsio Communications
Differential Revision: https://reviews.freebsd.org/D14009
that had the IPv6 fragmentation header:
o Neighbor Solicitation
o Neighbor Advertisement
o Router Solicitation
o Router Advertisement
o Redirect
Introduce M_FRAGMENTED mbuf flag, and set it after IPv6 fragment reassembly
is completed. Then check the presence of this flag in corresponding ND6
handling routines.
PR: 224247
MFC after: 2 weeks
Mainly focus on files that use BSD 3-Clause license.
The Software Package Data Exchange (SPDX) group provides a specification
to make it easier for automated tools to detect and summarize well known
open source licenses. We are gradually adopting the specification, noting
that the tags are considered only advisory and do not, in any way,
supersede or replace the license texts.
Special thanks to Wind River for providing access to "The Duke of
Highlander" tool: an older (2014) run over FreeBSD tree was useful as a
starting point.
and csum_flags using information from all fragments. This fixes
dropping of reassembled packets due to wrong checksum when the IPv6
checksum offloading is enabled on a network card.
Obtained from: Yandex LLC
MFC after: 1 week
Sponsored by: Yandex LLC
This is required for fragments and encapsulated data (eg tunneling) to be redistributed
to the RSS bucket based on the eventual IPv6 header and protocol (TCP, UDP, etc) header.
* Add an mbuf tag with the state of IPv6 options parsing before the frame is queued
into the direct dispatch handler;
* Continue processing and complete the frame reception in the correct RSS bucket /
netisr context.
Testing results are in the phabricator review.
Differential Revision: https://reviews.freebsd.org/D3563
Submitted by: Tiwei Bie <btw@mail.ustc.edu.cn>
It isn't safe to keep unreferenced ifaddrs. Use in6ifa_ifwithaddr() to
determine ifaddr corresponding to destination address. Since currently
we keep addresses with embedded scope zone, in6ifa_ifwithaddr is called
with zero zoneid and marked with XXX.
Also remove route and lle lookups from ip6_input. Use in6ifa_ifwithaddr()
instead.
Sponsored by: Yandex LLC
to this event, adding if_var.h to files that do need it. Also, include
all includes that now are included due to implicit pollution via if_var.h
Sponsored by: Netflix
Sponsored by: Nginx, Inc.