This patch adds a new socket option, SO_REUSEPORT_LB, which allow multiple
programs or threads to bind to the same port and incoming connections will be
load balanced using a hash function.
Most of the code was copied from a similar patch for DragonflyBSD.
However, in DragonflyBSD, load balancing is a global on/off setting and can not
be set per socket. This patch allows for simultaneous use of both the current
SO_REUSEPORT and the new SO_REUSEPORT_LB options on the same system.
Required changes to structures
Globally change so_options from 16 to 32 bit value to allow for more options.
Add hashtable in pcbinfo to hold all SO_REUSEPORT_LB sockets.
Limitations
As DragonflyBSD, a load balance group is limited to 256 pcbs
(256 programs or threads sharing the same socket).
Submitted by: Johannes Lundberg <johanlun0@gmail.com>
Sponsored by: Limelight Networks
Differential Revision: https://reviews.freebsd.org/D11003
Mainly focus on files that use BSD 3-Clause license.
The Software Package Data Exchange (SPDX) group provides a specification
to make it easier for automated tools to detect and summarize well known
opensource licenses. We are gradually adopting the specification, noting
that the tags are considered only advisory and do not, in any way,
superceed or replace the license texts.
Special thanks to Wind River for providing access to "The Duke of
Highlander" tool: an older (2014) run over FreeBSD tree was useful as a
starting point.
function (they used to say UMA_ZONE_NOFREE), so flag parameter goes away.
The zone_fini parameter also goes away. Previously no protocols (except
divert) supplied zone_fini function, so inpcb locks were leaked with slabs.
This was okay while zones were allocated with UMA_ZONE_NOFREE flag, but now
this is a leak. Fix that by suppling inpcb_fini() function as fini method
for all inpcb zones.
do for streaming sockets.
And do more cleanup in the sbappendaddr_locked_internal() to prevent
leak information from existing mbuf to the one, that will be possible
created later by netgraph.
Suggested by: glebius
Tested by: Irina Liakh <spell at itl ua>
MFC after: 1 week
When checksums of received IP and UDP header already checked, UDP uses
sbappendaddr_locked() to pass received data to the socket.
sbappendaddr_locked() uses given mbuf as is, and if NIC supports checksum
offloading, mbuf contains csum_data and csum_flags that were calculated
for already stripped headers. Some NICs support only limited checksums
offloading and do not use CSUM_PSEUDO_HDR flag, and csum_data contains
some value that UDP/TCP should use for pseudo header checksum calculation.
When L2TP is used for tunneling with mpd5, ng_ksocket receives mbuf with
filled csum_flags and csum_data, that were calculated for outer headers.
When L2TP header is stripped, a packet that was tunneled goes to the IP
layer and due to presence of csum_flags (without CSUM_PSEUDO_HDR) and
csum_data, the UDP/TCP checksum check fails for this packet.
Reported by: Irina Liakh <spell at itl ua>
Tested by: Irina Liakh <spell at itl ua>
MFC after: 1 week
r304436 attempted to optimize the handling of incoming UDP packet by only
making an expensive call to in_broadcast() if the mbuf was marked as an
broadcast packet. Unfortunately, this cannot work in the case of point-to-
point L2 protocols like PPP, which have no notion of "broadcast". The
optimization has been disabled for several months now with no progress
towards fixing it, so it needs to go.
This is a painful change, but it is needed. On the one hand, we avoid
modifying them, and this slows down some ideas, on the other hand we still
eventually modify them and tools like netstat(1) never work on next version of
FreeBSD. We maintain a ton of spares in them, and we already got some ifdef
hell at the end of tcpcb.
Details:
- Hide struct inpcb, struct tcpcb under _KERNEL || _WANT_FOO.
- Make struct xinpcb, struct xtcpcb pure API structures, not including
kernel structures inpcb and tcpcb inside. Export into these structures
the fields from inpcb and tcpcb that are known to be used, and put there
a ton of spare space.
- Make kernel and userland utilities compilable after these changes.
- Bump __FreeBSD_version.
Reviewed by: rrs, gnn
Differential Revision: D10018
Unfortunately they will have different integer value due to Linux value being already assigned in FreeBSD.
The patch is similar to IP_RECVDSTADDR but also provides the destination port value to the application.
This allows/improves implementation of transparent proxies on UDP sockets due to having the whole information on forwarded packets.
Reviewed by: adrian, aw
Approved by: ae (mentor)
Sponsored by: rsync.net
Differential Revision: D9235
Renumber cluase 4 to 3, per what everybody else did when BSD granted
them permission to remove clause 3. My insistance on keeping the same
numbering for legal reasons is too pedantic, so give up on that point.
Submitted by: Jan Schaumann <jschauma@stevens.edu>
Pull Request: https://github.com/freebsd/freebsd/pull/96
Unfortunately they will have different integer value due to Linux value being already assigned in FreeBSD.
The patch is similar to IP_RECVDSTADDR but also provides the destination port value to the application.
This allows/improves implementation of transparent proxies on UDP sockets due to having the whole information on forwarded packets.
Sponsored-by: rsync.net
Differential Revision: D9235
Reviewed-by: adrian
If multiple threads emit a UDP log_in_vain message concurrently,
the IP addresses could be garbage due to concurrent usage of a
single string buffer inside inet_ntoa(). Use inet_ntoa_r() with
two stack buffers instead.
Reported by: Mark Martinec <Mark.Martinec+freebsd@ijs.si>
MFC after: 3 days
Relnotes: yes
Sponsored by: Dell EMC
Small summary
-------------
o Almost all IPsec releated code was moved into sys/netipsec.
o New kernel modules added: ipsec.ko and tcpmd5.ko. New kernel
option IPSEC_SUPPORT added. It enables support for loading
and unloading of ipsec.ko and tcpmd5.ko kernel modules.
o IPSEC_NAT_T option was removed. Now NAT-T support is enabled by
default. The UDP_ENCAP_ESPINUDP_NON_IKE encapsulation type
support was removed. Added TCP/UDP checksum handling for
inbound packets that were decapsulated by transport mode SAs.
setkey(8) modified to show run-time NAT-T configuration of SA.
o New network pseudo interface if_ipsec(4) added. For now it is
build as part of ipsec.ko module (or with IPSEC kernel).
It implements IPsec virtual tunnels to create route-based VPNs.
o The network stack now invokes IPsec functions using special
methods. The only one header file <netipsec/ipsec_support.h>
should be included to declare all the needed things to work
with IPsec.
o All IPsec protocols handlers (ESP/AH/IPCOMP protosw) were removed.
Now these protocols are handled directly via IPsec methods.
o TCP_SIGNATURE support was reworked to be more close to RFC.
o PF_KEY SADB was reworked:
- now all security associations stored in the single SPI namespace,
and all SAs MUST have unique SPI.
- several hash tables added to speed up lookups in SADB.
- SADB now uses rmlock to protect access, and concurrent threads
can do SA lookups in the same time.
- many PF_KEY message handlers were reworked to reflect changes
in SADB.
- SADB_UPDATE message was extended to support new PF_KEY headers:
SADB_X_EXT_NEW_ADDRESS_SRC and SADB_X_EXT_NEW_ADDRESS_DST. They
can be used by IKE daemon to change SA addresses.
o ipsecrequest and secpolicy structures were cardinally changed to
avoid locking protection for ipsecrequest. Now we support
only limited number (4) of bundled SAs, but they are supported
for both INET and INET6.
o INPCB security policy cache was introduced. Each PCB now caches
used security policies to avoid SP lookup for each packet.
o For inbound security policies added the mode, when the kernel does
check for full history of applied IPsec transforms.
o References counting rules for security policies and security
associations were changed. The proper SA locking added into xform
code.
o xform code was also changed. Now it is possible to unregister xforms.
tdb_xxx structures were changed and renamed to reflect changes in
SADB/SPDB, and changed rules for locking and refcounting.
Reviewed by: gnn, wblock
Obtained from: Yandex LLC
Relnotes: yes
Sponsored by: Yandex LLC
Differential Revision: https://reviews.freebsd.org/D9352
is NULL and the function jumps to the "release:" label.
For this case, the "inp" was write locked, but the code attempted to
read unlock it. This patch fixes the problem.
This case could occur for NFS over UDP mounts, where the server was
down for a few minutes under certain circumstances.
Reported by: bde
Tested by: bde
Reviewed by: gnn
MFC after: 2 weeks
r304436 attempted to optimize the handling of incoming UDP packet by only
making an expensive call to in_broadcast() if the mbuf was marked as an
broadcast packet. Unfortunately, this cannot work in the case of point-to-
point L2 protocols like PPP, which have no notion of "broadcast".
Discussions on how to properly fix r304436 are ongoing, but in the meantime
disable the optimization to ensure that no existing network setups are broken.
Reported by: bms
The default value of the tunable introduced in r304436 couldn't be
effectively overrided on VIMAGE kernels, because instead of being
accessed via the appropriate VNET() accessor macro, it was accessed
via the VNET_NAME() macro, which resolves to the (should-be) read-only
master template of initial values of per-VNET data. Hence, while the
value of udp_require_l2_bcast could be altered on per-VNET basis, the
code in udp_input() would ignore it as it would always read the default
value (one) from the VNET master template.
Silence from: rstone
in_broadcast() can be quite expensive, so skip calling it if the
incoming mbuf wasn't sent to a broadcast L2 address in the first
place.
Reviewed by: gnn
MFC after: 2 months
Sponsored by: EMC / Isilon Storage Division
Differential Revision: https://reviews.freebsd.org/D7309
specific order. VNET_SYSUNINITs however are doing exactly that.
Thus remove the VIMAGE conditional field from the domain(9) protosw
structure and replace it with VNET_SYSUNINITs.
This also allows us to change some order and to make the teardown functions
file local static.
Also convert divert(4) as it uses the same mechanism ip(4) and ip6(4) use
internally.
Slightly reshuffle the SI_SUB_* fields in kernel.h and add a new ones, e.g.,
for pfil consumers (firewalls), partially for this commit and for others
to come.
Reviewed by: gnn, tuexen (sctp), jhb (kernel.h)
Obtained from: projects/vnet
MFC after: 2 weeks
X-MFC: do not remove pr_destroy
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D6652
interested in having tunneled UDP and finding out about the
ICMP (tested by Michael Tuexen with SCTP.. soon to be using
this feature).
Differential Revision: http://reviews.freebsd.org/D5875
- properly V_irtualise variable access unbreaking VIMAGE kernels.
- remove the volatile from the function return type to make architecture
using gcc happy [-Wreturn-type]
"type qualifiers ignored on function return type"
I am not entirely happy with this solution putting the u_int there
but it will do for now.
route caching for TCP, with some improvements. In particular, invalidate
the route cache if a new route is added, which might be a better match.
The cache is automatically invalidated if the old route is deleted.
Submitted by: Mike Karels
Reviewed by: gnn
Differential Revision: https://reviews.freebsd.org/D4306
do not do what one would expect by name. Prefix them with "udp_"
to at least obviously limit the scope.
This is a non-functional change.
Reviewed by: gnn, rwatson
MFC after: 2 weeks
Differential Revision: https://reviews.freebsd.org/D3505
lock on the INP before calling the tunnel protocol, else a LOR
may occur (it does with SCTP for sure). Instead we must acquire a
ref count and release the lock, taking care to allow for the case
where the UDP socket has gone away and *not* unlocking since the
refcnt decrement on the inp will do the unlock in that case.
Reviewed by: tuexen
MFC after: 3 weeks
bits.
The motivation here is to eventually teach netisr and potentially
other networking subsystems a bit more about how RSS work queues / buckets
are configured so things have a hope of auto-configuring in the future.
* net/rss_config.[ch] takes care of the generic bits for doing
configuration, hash function selection, etc;
* topelitz.[ch] is now in net/ rather than netinet/;
* (and would be in libkern if it didn't directly include RSS_KEYSIZE;
that's a later thing to fix up.)
* netinet/in_rss.[ch] now just contains the IPv4 specific methods;
* and netinet/in6_rss.[ch] now just contains the IPv6 specific methods.
This should have no functional impact on anyone currently using
the RSS support.
Differential Revision: D1383
Reviewed by: gnn, jfv (intel driver bits)
from the FreeBSD network code. The flag is still kept around in the
"sys/mbuf.h" header file, but does no longer have any users. Instead
the "m_pkthdr.rsstype" field in the mbuf structure is now used to
decide the meaning of the "m_pkthdr.flowid" field. To modify the
"m_pkthdr.rsstype" field please use the existing "M_HASHTYPE_XXX"
macros as defined in the "sys/mbuf.h" header file.
This patch introduces new behaviour in the transmit direction.
Previously network drivers checked if "M_FLOWID" was set in "m_flags"
before using the "m_pkthdr.flowid" field. This check has now now been
replaced by checking if "M_HASHTYPE_GET(m)" is different from
"M_HASHTYPE_NONE". In the future more hashtypes will be added, for
example hashtypes for hardware dedicated flows.
"M_HASHTYPE_OPAQUE" indicates that the "m_pkthdr.flowid" value is
valid and has no particular type. This change removes the need for an
"if" statement in TCP transmit code checking for the presence of a
valid flowid value. The "if" statement mentioned above is now a direct
variable assignment which is then later checked by the respective
network drivers like before.
Additional notes:
- The SCTP code changes will be committed as a separate patch.
- Removal of the "M_FLOWID" flag will also be done separately.
- The FreeBSD version has been bumped.
MFC after: 1 month
Sponsored by: Mellanox Technologies
These are needed for the forthcoming vxlan implementation. The context
pointer means we do not have to use a spare pointer field in the inpcb,
and the source address is required to populate vxlan's forwarding table.
While I highly doubt there is an out of tree consumer of the UDP
tunneling callback, this change may be a difficult to eventually MFC.
Phabricator: https://reviews.freebsd.org/D383
Reviewed by: gnn
This partitular case is the only path where the mbuf could be NULL.
udp_append() checked for a NULL mbuf only after invoking the tunneling
callback. Our only in tree tunneling callback - SCTP - assumed a non
NULL mbuf, and it is a bit odd to make the callbacks responsible for
checking this condition.
This also reduces the differences between the IPv4 and IPv6 code.
MFC after: 1 month
that this means full checksum coverage for received packets.
If an application is willing to accept packets with partial
coverage, it is expected to use the socekt option and provice
the minimum coverage it accepts.
Reviewed by: kevlo
MFC after: 3 days
information as part of recvmsg().
This is primarily used for debugging/verification of the various
processing paths in the IP, PCB and driver layers.
Unfortunately the current implementation of the control message path
results in a ~10% or so drop in UDP frame throughput when it's used.
Differential Revision: https://reviews.freebsd.org/D527
Reviewed by: grehan