time through the mbuf chain during copy and TSO limiting.
It is used by both Rack and now the FreeBSD stack.
Sponsored by: Netflix Inc
Differential Revision: https://reviews.freebsd.org/D15937
Post r335356 it is possible to have an inpcb on the hash lists that is
partially torn down. Validate before using. Also as a side effect of this
change the lock ordering issue between hash lock and inpcb no longer exists
allowing some simplification.
Reported by: pho@
we started playing with the VNET sets. This
way we have verified the INP settings before
we go to the trouble of de-referencing it.
Reviewed by: and suggested by lstewart
Sponsored by: Netflix Inc.
- Convert inpcbinfo info & hash locks to epoch for read and mutex for write
- Garbage collect code that handled INP_INFO_TRY_RLOCK failures as
INP_INFO_RLOCK which can no longer fail
When running 64 netperfs sending minimal sized packets on a 2x8x2 reduces
unhalted core cycles samples in rwlock rlock/runlock in udp_send from 51% to
3%.
Overall packet throughput rate limited by CPU affinity and NIC driver design
choices.
On the receiver unhalted core cycles samples in in_pcblookup_hash went from
13% to to 1.6%
Tested by LLNW and pho@
Reviewed by: jtl
Sponsored by: Limelight Networks
Differential Revision: https://reviews.freebsd.org/D15686
Using of rwlock with multiqueue NICs for IP forwarding on high pps
produces high lock contention and inefficient. Rmlock fits better for
such workloads.
Reviewed by: melifaro, olivier
Obtained from: Yandex LLC
Sponsored by: Yandex LLC
Differential Revision: https://reviews.freebsd.org/D15789
enabled use an updated timestamp instead of reusing the one used in
the initial TCP SYN-ACK segment.
This patch ensures that an updated timestamp is used when sending the
SYN-ACK from the syncache code. It was already done if the
SYN-ACK was retransmitted from the generic code.
This makes the behaviour consistent and also conformant with
the TCP specification.
Reviewed by: jtl@, Jason Eggleston
MFC after: 1 month
Sponsored by: Neflix, Inc.
Differential Revision: https://reviews.freebsd.org/D15634
It is better to try allocate a big mbuf, than just silently drop a big
packet. A better solution could be reworking of libalias modules to be
able use m_copydata()/m_copyback() instead of requiring the single
contiguous buffer.
PR: 229006
MFC after: 1 week
Rack with respect to its handling of TCP Fast Open. Several
fixes all related to TFO are included in this commit:
1) Handling of non-TFO retransmissions
2) Building the proper send-map when we are doing TFO
3) Dealing with the ack that comes back that includes the
SYN and data.
It appears that with this commit TFO now works :-)
Thanks Larry for all your help!!
Sponsored by: Netflix Inc.
Differential Revision: https://reviews.freebsd.org/D15758
of needed interface when many gre interfaces are present.
Remove rmlock from gre_softc, use epoch(9) and CK_LIST instead.
Move more AF-related code into AF-related locations. Use hash table to
speedup lookup of needed softc.
When hash table lookups are not serialized with in_pcbfree it will be
possible for callers to find an inpcb that has been marked free. We
need to check for this and return NULL.
without this and running vnets with a TCP stack that uses
some of the features is a recipe for panic (without this commit).
Reported by: Larry Rosenman
Sponsored by: Netflix Inc.
Differential Revision: https://reviews.freebsd.org/D15757
Deferring the actual free of the inpcb until after a grace
period has elapsed will allow us to convert the inpcbinfo
info and hash read locks to epoch.
Reviewed by: gallatin, jtl
Sponsored by: Limelight Networks
Differential Revision: https://reviews.freebsd.org/D15510
time dependency.
At present, RACK requires the TCPHPTS option to run. However, because
modules can be moved from machine to machine, this dependency is really
best assessed at load time rather than at build time.
Reviewed by: rrs
Sponsored by: Netflix
Differential Revision: https://reviews.freebsd.org/D15756
Rack includes the following features:
- A different SACK processing scheme (the old sack structures are not used).
- RACK (Recent acknowledgment) where counting dup-acks is no longer done
instead time is used to knwo when to retransmit. (see the I-D)
- TLP (Tail Loss Probe) where we will probe for tail-losses to attempt
to try not to take a retransmit time-out. (see the I-D)
- Burst mitigation using TCPHTPS
- PRR (partial rate reduction) see the RFC.
Once built into your kernel, you can select this stack by either
socket option with the name of the stack is "rack" or by setting
the global sysctl so the default is rack.
Note that any connection that does not support SACK will be kicked
back to the "default" base FreeBSD stack (currently known as "default").
To build this into your kernel you will need to enable in your
kernel:
makeoptions WITH_EXTRA_TCP_STACKS=1
options TCPHPTS
Sponsored by: Netflix Inc.
Differential Revision: https://reviews.freebsd.org/D15525
Silently dicard SCTP chunks which have been requested to be
authenticated but are received unauthenticated no matter if support
for SCTP authentication has been negotiated. This improves compliance
with RFC 4895.
When the application uses the SCTP_AUTH_CHUNK socket option to
request a chunk to be received in an authenticated way, enable
the SCTP authentication extension for the end-point. This improves
compliance with RFC 6458.
Discussed with: Peter Lei
MFC after: 3 days
This patch adds a new socket option, SO_REUSEPORT_LB, which allow multiple
programs or threads to bind to the same port and incoming connections will be
load balanced using a hash function.
Most of the code was copied from a similar patch for DragonflyBSD.
However, in DragonflyBSD, load balancing is a global on/off setting and can not
be set per socket. This patch allows for simultaneous use of both the current
SO_REUSEPORT and the new SO_REUSEPORT_LB options on the same system.
Required changes to structures:
Globally change so_options from 16 to 32 bit value to allow for more options.
Add hashtable in pcbinfo to hold all SO_REUSEPORT_LB sockets.
Limitations:
As DragonflyBSD, a load balance group is limited to 256 pcbs (256 programs or
threads sharing the same socket).
This is a substantially different contribution as compared to its original
incarnation at svn r332894 and reverted at svn r332967. Thanks to rwatson@
for the substantive feedback that is included in this commit.
Submitted by: Johannes Lundberg <johalun0@gmail.com>
Obtained from: DragonflyBSD
Relnotes: Yes
Sponsored by: Limelight Networks
Differential Revision: https://reviews.freebsd.org/D11003
Use m_copyback() function to write checksum when it isn't located
in the first mbuf of the chain. Handmade analog doesn't handle the
case when parts of checksum are located in different mbufs.
Also in case when mbuf is too short, m_copyback() will allocate new
mbuf in the chain instead of making out of bounds write.
Also wrap long line and remove now useless KASSERTs.
X-MFC after: r334705
The length of the IP payload is normally equal to the UDP length, UDP Options
(draft-ietf-tsvwg-udp-options-02) suggests using the difference between IP
length and UDP length to create space for trailing data.
Correct checksum length calculation to use the UDP length rather than the IP
length when not offloading UDP checksums.
Approved by: jtl (mentor)
Differential Revision: https://reviews.freebsd.org/D15222
of needed interface when many gif interfaces are present.
Remove rmlock from gif_softc, use epoch(9) and CK_LIST instead.
Move more AF-related code into AF-related locations.
Use hash table to speedup lookup of needed softc. Interfaces
with GIF_IGNORE_SOURCE flag are stored in plain CK_LIST.
Sysctl net.link.gif.parallel_tunnels is removed. The removal was planed
16 years ago, and actually it could work only for outbound direction.
Each protocol, that can be handled by if_gif(4) interface is registered
by separate encap handler, this helps avoid invoking the handler
for unrelated protocols (GRE, PIM, etc.).
This change allows dramatically improve performance when many gif(4)
interfaces are used.
Sponsored by: Yandex LLC
Currently it has several disadvantages:
- it uses single mutex to protect internal structures. It is used by
data- and control- path, thus there are no parallelism at all.
- it uses single list to keep encap handlers for both INET and INET6
families.
- struct encaptab keeps unneeded information (src, dst, masks, protosw),
that isn't used by code in the source tree.
- matches are prioritized and when many tunneling interfaces are
registered, encapcheck handler of each interface is invoked for each
packet. The search takes O(n) for n interfaces. All this work is done
with exclusive lock held.
What this patch includes:
- the datapath is converted to be lockless using epoch(9) KPI.
- struct encaptab now linked using CK_LIST.
- all unused fields removed from struct encaptab. Several new fields
addedr: min_length is the minimum packet length, that encapsulation
handler expects to see; exact_match is maximum number of bits, that
can return an encapsulation handler, when it wants to consume a packet.
- IPv6 and IPv4 handlers are stored in separate lists;
- added new "encap_lookup_t" method, that will be used later. It is
targeted to speedup lookup of needed interface, when gif(4)/gre(4) have
many interfaces.
- the need to use protosw structure is eliminated. The only pr_input
method was used from this structure, so I don't see the need to keep
using it.
- encap_input_t method changed to avoid using mbuf tags to store softc
pointer. Now it is passed directly trough encap_input_t method.
encap_getarg() funtions is removed.
- all sockaddr structures and code that uses them removed. We don't have
any code in the tree that uses them. All consumers use encap_attach_func()
method, that relies on invoking of encapcheck() to determine the needed
handler.
- introduced struct encap_config, it contains parameters of encap handler
that is going to be registered by encap_attach() function.
- encap handlers are stored in lists ordered by exact_match value, thus
handlers that need more bits to match will be checked first, and if
encapcheck method returns exact_match value, the search will be stopped.
- all current consumers changed to use new KPI.
Reviewed by: mmacy
Sponsored by: Yandex LLC
Differential Revision: https://reviews.freebsd.org/D15617
Plenty of allocation sites pass M_ZERO and sizes which are small and known
at compilation time. Handling them internally in malloc loses this information
and results in avoidable calls to memset.
Instead, let the compiler take the advantage of it whenever possible.
Discussed with: jeff
without a RANDOM parameter but with a CHUNKS or HMAC-ALGO parameter.
Please note that sending this combination violates the specification.
Thnanks to Ronald E. Crane for reporting the issue for the userland
stack.
MFC after: 3 days
Use the same logic to handle the SYN-ACK retransmission when sent from
the syn cache code as when sent from the main code.
MFC after: 3 days
Sponsored by: Netflix, Inc.
If the sysctl variable is set to a value larger than TCP_MAXRXTSHIFT+1,
the array tcp_syn_backoff[] is accessed out of bounds.
Discussed with: jtl@
MFC after: 3 days
Sponsored by: Netflix, Inc.
Fields owned by the generic code were being left uninitialized,
causing problems in clear_dumper() if an error occurred.
Coverity CID: 1391200
X-MFC with: r333283