When we're at our vnode limit, getnewvnode will call into the vnode LRU
cache to free up vnodes. If the vnode we try to recycle is a ZFS vnode we
end up, eventually, in zfs_rmnode. If the ZFS vnode we're recycling
represents something with extended attributes, zfs_rmnode will call
zfs_zget which will attempt to allocate another vnode. If the next vnode we
try to recycle is also a ZFS vnode representing something with extended
attributes we can recurse further. This ends up being unbounded and can end
up overflowing the stack.
In order to avoid this, restructure zfs_rmnode to simply add the extended
attribute directory's object ID to the unlinked set, thus not requiring the
allocation of a vnode. We then schedule a task that calls zfs_unlinked_drain
which will do the work of properly marking the vnodes for unlinking.
zfs_unlinked_drain is also called on mount so these will be cleaned up
there.
Reviewed by: avg, mav
Sponsored by: iXsystems, Inc.
Differential Revision: https://reviews.freebsd.org/D15342
Rack includes the following features:
- A different SACK processing scheme (the old sack structures are not used).
- RACK (Recent acknowledgment) where counting dup-acks is no longer done
instead time is used to knwo when to retransmit. (see the I-D)
- TLP (Tail Loss Probe) where we will probe for tail-losses to attempt
to try not to take a retransmit time-out. (see the I-D)
- Burst mitigation using TCPHTPS
- PRR (partial rate reduction) see the RFC.
Once built into your kernel, you can select this stack by either
socket option with the name of the stack is "rack" or by setting
the global sysctl so the default is rack.
Note that any connection that does not support SACK will be kicked
back to the "default" base FreeBSD stack (currently known as "default").
To build this into your kernel you will need to enable in your
kernel:
makeoptions WITH_EXTRA_TCP_STACKS=1
options TCPHPTS
Sponsored by: Netflix Inc.
Differential Revision: https://reviews.freebsd.org/D15525
pagetables.
physmap[] can be inconsistent with the physical memory limit due to
buggy bios, or to the hw.physmem tunable. Since bootstrap pagetables
are initialized by accesses through the DMAP, we must ensure that DMAP
really cover the selected pages. This is only relevant when machine
has less than 4G RAM and buggy BIOS, which is the combination on Acer
Chromebook 720.
The call to mp_bootaddress() is moved later to have Maxmem initialized.
An alternative could be to always cover 4G for DMAP, but this change
seems to be simpler.
Reported and tested by: grembo
Reviewed by: royger
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
Differential revision: https://reviews.freebsd.org/D15675
Fix the behavior of ofw_fdt_getprop() and ofw_fdt_getprop() functions to match
the documentation as the non-fdt code.
Submitted by: Luis Pires <lffpires@ruabrasil.org>
Reviewed by: manu, jhibbits
Approved by: jhibbits (mentor)
Differential Revision: https://reviews.freebsd.org/D15680
Expected NMI-s are those than are either generated by the software (such
as a CPU sending NMI to other CPU) or generated by the hardware after
the software configured it to do so (such as NMI-s on PMC events).
Some unexpected NMI-s can be caused by hardware failures and it is
possible to inquire the hardware about them (somewhat like MCA but much
more primitive) using an EISA mechanism. In some cases the origin of
the NMI can remain truly unknown.
This commit should not change any functionality. It just reorganizes
the code, so that it is easier to extend with new checks for the origin
of the NMI. Also, it frees the code that has nothing to do with ISA
from DEV_ISA.
MFC after: 3 weeks
On PowerNV systems, the rootfs is passed through kexec, which loads the rootfs
into memory and set two fdt entries to describe where the file is located in
the memory;
I need to pass this memory region to the md device as a mfs_root, but, current
md driver does not support two things:
* Just getting a pointer from an external (bootloader) memory. If I need to
workaround it, I would need to declare a static array and memcopy from this
external memory to this static variable.
* The size of the image. The usage of mfs_root_end, which is not a pointer,
seems to be not possible for this prestaged scenario.
This patch simply adds a new way to load mfs_root from memory.
Differential Revision: https://reviews.freebsd.org/D15625
Approved by: kib, jhibbits (mentor)
ixl(4) (when it switches over to using iflib) devices need the TCP header
length in order to do TCP checksum offload.
Reviewed by: gallatin@, shurd@
MFC after: 1 week
Sponsored by: Intel Corporation
Differential Revision: https://reviews.freebsd.org/D15558
controller that tries to handle early invocations of the controller,
in other words, invocations before the expected end of the interval.
However, there were some calculation errors in this early invocation
case. Notably, if an early invocation occurred while the error was
negative, the derivative term was off by a large amount. One visible
effect of this error was that processes were being killed by the
virtual memory system's OOM killer when in fact there was plentiful
free memory.
Correct a couple minor errors in the sysctl descriptions, and apply
some style fixes.
Reviewed by: jeff, markj
- add '-j' options to filter to enable converting native pmc
log format to json lines format to enable the use of scripts
and external tooling
% pmc filter -j pmc.log pmc.jsonl
- Record the tsc value in sampling interrupts as opposed to
recording nanotime when the sample is copied to a global log
in hardclock - potentially many milliseconds later.
- At initialize record the tsc_freq and the time of day to give
us an offset for translating the tsc values in callchain records
Silently dicard SCTP chunks which have been requested to be
authenticated but are received unauthenticated no matter if support
for SCTP authentication has been negotiated. This improves compliance
with RFC 4895.
When the application uses the SCTP_AUTH_CHUNK socket option to
request a chunk to be received in an authenticated way, enable
the SCTP authentication extension for the end-point. This improves
compliance with RFC 6458.
Discussed with: Peter Lei
MFC after: 3 days
While at it rename hlist_add_after() into hlist_add_behind().
Submitted by: Johannes Lundberg <johalun0@gmail.com>
MFC after: 1 week
Sponsored by: Mellanox Technologies
Sponsored by: Limelight Networks
This patch adds a new socket option, SO_REUSEPORT_LB, which allow multiple
programs or threads to bind to the same port and incoming connections will be
load balanced using a hash function.
Most of the code was copied from a similar patch for DragonflyBSD.
However, in DragonflyBSD, load balancing is a global on/off setting and can not
be set per socket. This patch allows for simultaneous use of both the current
SO_REUSEPORT and the new SO_REUSEPORT_LB options on the same system.
Required changes to structures:
Globally change so_options from 16 to 32 bit value to allow for more options.
Add hashtable in pcbinfo to hold all SO_REUSEPORT_LB sockets.
Limitations:
As DragonflyBSD, a load balance group is limited to 256 pcbs (256 programs or
threads sharing the same socket).
This is a substantially different contribution as compared to its original
incarnation at svn r332894 and reverted at svn r332967. Thanks to rwatson@
for the substantive feedback that is included in this commit.
Submitted by: Johannes Lundberg <johalun0@gmail.com>
Obtained from: DragonflyBSD
Relnotes: Yes
Sponsored by: Limelight Networks
Differential Revision: https://reviews.freebsd.org/D11003
the LinuxKPI. Add a comment saying in which Linux version this change was made.
Submitted by: Johannes Lundberg <johalun0@gmail.com>
MFC after: 1 week
Sponsored by: Mellanox Technologies
Sponsored by: Limelight Networks
Use m_copyback() function to write checksum when it isn't located
in the first mbuf of the chain. Handmade analog doesn't handle the
case when parts of checksum are located in different mbufs.
Also in case when mbuf is too short, m_copyback() will allocate new
mbuf in the chain instead of making out of bounds write.
Also wrap long line and remove now useless KASSERTs.
X-MFC after: r334705
This is needed to avoid a race between the VNASSERT() below, and another
thread updating the VI_FREE flag, on weakly-ordered architectures.
On a 72-thread POWER9, without this barrier a 'make -j72 buildworld' would
panic on the assert regularly.
It may be possible to use a weaker barrier, and I'll investigate that once
all stability issues are worked out on POWER9.
The length of the IP payload is normally equal to the UDP length, UDP Options
(draft-ietf-tsvwg-udp-options-02) suggests using the difference between IP
length and UDP length to create space for trailing data.
Correct checksum length calculation to use the UDP length rather than the IP
length when not offloading UDP checksums.
Approved by: jtl (mentor)
Differential Revision: https://reviews.freebsd.org/D15222
of needed interface when many gif interfaces are present.
Remove rmlock from gif_softc, use epoch(9) and CK_LIST instead.
Move more AF-related code into AF-related locations.
Use hash table to speedup lookup of needed softc. Interfaces
with GIF_IGNORE_SOURCE flag are stored in plain CK_LIST.
Sysctl net.link.gif.parallel_tunnels is removed. The removal was planed
16 years ago, and actually it could work only for outbound direction.
Each protocol, that can be handled by if_gif(4) interface is registered
by separate encap handler, this helps avoid invoking the handler
for unrelated protocols (GRE, PIM, etc.).
This change allows dramatically improve performance when many gif(4)
interfaces are used.
Sponsored by: Yandex LLC
Currently it has several disadvantages:
- it uses single mutex to protect internal structures. It is used by
data- and control- path, thus there are no parallelism at all.
- it uses single list to keep encap handlers for both INET and INET6
families.
- struct encaptab keeps unneeded information (src, dst, masks, protosw),
that isn't used by code in the source tree.
- matches are prioritized and when many tunneling interfaces are
registered, encapcheck handler of each interface is invoked for each
packet. The search takes O(n) for n interfaces. All this work is done
with exclusive lock held.
What this patch includes:
- the datapath is converted to be lockless using epoch(9) KPI.
- struct encaptab now linked using CK_LIST.
- all unused fields removed from struct encaptab. Several new fields
addedr: min_length is the minimum packet length, that encapsulation
handler expects to see; exact_match is maximum number of bits, that
can return an encapsulation handler, when it wants to consume a packet.
- IPv6 and IPv4 handlers are stored in separate lists;
- added new "encap_lookup_t" method, that will be used later. It is
targeted to speedup lookup of needed interface, when gif(4)/gre(4) have
many interfaces.
- the need to use protosw structure is eliminated. The only pr_input
method was used from this structure, so I don't see the need to keep
using it.
- encap_input_t method changed to avoid using mbuf tags to store softc
pointer. Now it is passed directly trough encap_input_t method.
encap_getarg() funtions is removed.
- all sockaddr structures and code that uses them removed. We don't have
any code in the tree that uses them. All consumers use encap_attach_func()
method, that relies on invoking of encapcheck() to determine the needed
handler.
- introduced struct encap_config, it contains parameters of encap handler
that is going to be registered by encap_attach() function.
- encap handlers are stored in lists ordered by exact_match value, thus
handlers that need more bits to match will be checked first, and if
encapcheck method returns exact_match value, the search will be stopped.
- all current consumers changed to use new KPI.
Reviewed by: mmacy
Sponsored by: Yandex LLC
Differential Revision: https://reviews.freebsd.org/D15617
Coverity complains about:
if (((flags) & M_WAITOK) || _malloc_item != NULL)
saying:
The expression
1 /* (2 | 0x100) & 2 */ || _malloc_item != NULL
is suspicious because it performs a Boolean operation
on a constant other than 0 or 1.
Although the code is correct, add "!= 0" to make it slightly
more legible and to silence hundreds(?) of Coverity warnings.
Reported by: Coverity
Discussed with: mjg
Sponsored by: Dell EMC