Small summary
-------------
o Almost all IPsec releated code was moved into sys/netipsec.
o New kernel modules added: ipsec.ko and tcpmd5.ko. New kernel
option IPSEC_SUPPORT added. It enables support for loading
and unloading of ipsec.ko and tcpmd5.ko kernel modules.
o IPSEC_NAT_T option was removed. Now NAT-T support is enabled by
default. The UDP_ENCAP_ESPINUDP_NON_IKE encapsulation type
support was removed. Added TCP/UDP checksum handling for
inbound packets that were decapsulated by transport mode SAs.
setkey(8) modified to show run-time NAT-T configuration of SA.
o New network pseudo interface if_ipsec(4) added. For now it is
build as part of ipsec.ko module (or with IPSEC kernel).
It implements IPsec virtual tunnels to create route-based VPNs.
o The network stack now invokes IPsec functions using special
methods. The only one header file <netipsec/ipsec_support.h>
should be included to declare all the needed things to work
with IPsec.
o All IPsec protocols handlers (ESP/AH/IPCOMP protosw) were removed.
Now these protocols are handled directly via IPsec methods.
o TCP_SIGNATURE support was reworked to be more close to RFC.
o PF_KEY SADB was reworked:
- now all security associations stored in the single SPI namespace,
and all SAs MUST have unique SPI.
- several hash tables added to speed up lookups in SADB.
- SADB now uses rmlock to protect access, and concurrent threads
can do SA lookups in the same time.
- many PF_KEY message handlers were reworked to reflect changes
in SADB.
- SADB_UPDATE message was extended to support new PF_KEY headers:
SADB_X_EXT_NEW_ADDRESS_SRC and SADB_X_EXT_NEW_ADDRESS_DST. They
can be used by IKE daemon to change SA addresses.
o ipsecrequest and secpolicy structures were cardinally changed to
avoid locking protection for ipsecrequest. Now we support
only limited number (4) of bundled SAs, but they are supported
for both INET and INET6.
o INPCB security policy cache was introduced. Each PCB now caches
used security policies to avoid SP lookup for each packet.
o For inbound security policies added the mode, when the kernel does
check for full history of applied IPsec transforms.
o References counting rules for security policies and security
associations were changed. The proper SA locking added into xform
code.
o xform code was also changed. Now it is possible to unregister xforms.
tdb_xxx structures were changed and renamed to reflect changes in
SADB/SPDB, and changed rules for locking and refcounting.
Reviewed by: gnn, wblock
Obtained from: Yandex LLC
Relnotes: yes
Sponsored by: Yandex LLC
Differential Revision: https://reviews.freebsd.org/D9352
The switch to get_pcpu() in MI code seems to cause hangs on MIPS.
Back out until we can get a better idea of what's happening there.
Reported by: kan, lidl
structure:
if_gethwtsomax(), if_sethwtsomax() - if_hw_tsomax
if_gethwtsomaxsegcount(), if_sethwtsomaxsegcount() - if_hw_tsomaxsegcount
if_gethwtsomaxsegsize(), if_sethwtsomaxsegsize() - if_hw_tsomaxsegsize
Update em and vnic drivers which had already been coverted to use accessor
functions for the other ifnet structure members.
Reviewed by: erj
Approved by: sjg (mentor)
Obtained from: Juniper Networks, Inc.
Differential Revision: https://reviews.freebsd.org/D8544
This would enqueue an event to send the gratuitous arp on a dying lagg
interface without any physical ports attached to it.
Apart from that, the taskqueue_drain() on lagg_clone_destroy() runs too
late, when the ifp data structure is already freed. Fix that too.
Obtained from: pfSense
MFC after: 2 weeks
Sponsored by: Rubicon Communications, LLC (Netgate)
convention for interfaces, because only one stf(4) interface can exist
in the system.
This disallow the use of unit numbers different than 0, however, it is
possible to create the clone without specify the unit number (wildcard).
In the wildcard case we must update the interface name before return.
This fix an infinite recursion in pf code that keeps track of network
interfaces and groups:
1 - a group for the cloned type of the interface is added (stf in this
case);
2 - the system will now try to add an interface named stf (instead of
stf0) to stf group;
3 - when pfi_kif_attach() tries to search for an already existing 'stf'
interface, the 'stf' group is returned and thus the group is added
as an interface of itself;
This will now cause a crash at the first attempt to traverse the groups
which the stf interface belongs (which loops over itself).
Obtained from: pfSense
MFC after: 2 weeks
Sponsored by: Rubicon Communications, LLC (Netgate)
This interface type ("a parent interface of wlanX") is not used since
r287197
Reviewed by: adrian, glebius
Differential Revision: https://reviews.freebsd.org/D9308
Thank glebius for pointing this out:
"The network stuff shall not be added to sys/eventhandler.h"
Reviewed by: David_A_Bright_DELL.com, sephe, glebius
Approved by: sephe (mentor)
MFC after: 2 weeks
Sponsored by: Microsoft
Differential Revision: https://reviews.freebsd.org/D9345
We found routing performance dropped significantly when configuring
FreeBSD as a router, we are applying the following changes in order to
resolve those issues and hopefully perform better.
- don't prefetch the flags array, we usually don't need it
- prefetch the next cache line of each of the software descriptor arrays as
well as the first cache line of each of the next four packets' mbufs and
clusters
- reduce max copy size to 63 bytes
- convert rx soft descriptors from array of structures to a structure of arrays
- update copyrights
Submitted by: Matt Macy <mmacy@nextbsd.org>
This calls ioctl() handlers for the different interfaces in the bridge.
These handlers expect to get called in an ioctl context where it's safe
for them to sleep. We may not sleep with the bridge lock held.
However, we still need to protect the interface list, to ensure it
doesn't get changed while we iterate over it.
Use BRIDGE_XLOCK(), which prevents bridge members from being removed.
Adding bridge members is safe, because it uses LIST_INSERT_HEAD().
This caused panics when adding xen interfaces to a bridge.
PR: 216304
Reviewed by: ae
MFC after: 1 week
Sponsored by: RootBSD
Differential Revision: https://reviews.freebsd.org/D9290
(intentionally) deleted first and then completely added again (so all the
events, announces and hooks are given a chance to run).
This cause an issue with CARP where the existing CARP data structure is
removed together with the last address for a given VHID, which will cause
a subsequent fail when the address is later re-added.
This change fixes this issue by adding a new flag to keep the CARP data
structure when an address is not being removed.
There was an additional issue with IPv6 CARP addresses, where the CARP data
structure would never be removed after a change and lead to VHIDs which
cannot be destroyed.
Reviewed by: glebius
Obtained from: pfSense
MFC after: 2 weeks
Sponsored by: Rubicon Communications, LLC (Netgate)
success and a good value. Only then try to use it and set the MSIX_ENABLE
bit.
With the current em(4) driver we have observed failures in this case in a
specific environment when pci_find_cap() would not return the assumed
value, which meant we ended up writing to PCI register 2 (PCI_DEVICE_ID)
which is read-only.
PR: 216456
Submitted by: bz
Add internal tracking of smp startup status to reliably figure out
what methods are to be used to get gtaskqueue up and running.
e1000:
Calculating this pointer gives undefined behaviour when (last == -1)
(it is before the buffer). The pointer is always followed. Panics
occurred when it points to an unmapped page. Otherwise, the pointed-to
garbage tends to not have the E1000_TXD_STAT_DD bit set in it, so in the
broken case the loop was usually null and the function just returned, and
this was acidentally correct.
Submitted by: bde
Reported by: Matt Macy <mmacy@nextbsd.org>
Add internal tracking of smp startup status to reliably figure out
what methods are to be used to get gtaskqueue up and running.
e1000:
Calculating this pointer gives undefined behaviour when (last == -1)
(it is before the buffer). The pointer is always followed. Panics
occurred when it points to an unmapped page. Otherwise, the pointed-to
garbage tends to not have the E1000_TXD_STAT_DD bit set in it, so in the
broken case the loop was usually null and the function just returned, and
this was acidentally correct.
Submitted by: bde
Reviewed by: Matt Macy <mmacy@nextbsd.org>
Hyper-V's NIC SR-IOV implementation needs a Hyper-V synthetic NIC and
a VF NIC to work together, mainly to support seamless live migration.
When the VF device becomes UP (or DOWN), the synthetic NIC driver needs
to switch the data path from the synthetic NIC to the VF (or the opposite).
So the synthetic NIC driver needs to know when a VF device is becoming
UP or DOWN and hence the patch is made.
Reviewed by: sephe
Approved by: sephe (mentor)
MFC after: 2 weeks
Sponsored by: Microsoft
Differential Revision: https://reviews.freebsd.org/D8963
Variables "fast" and "active" are both constant in lacp_port_create(), but
comments mispleadingly suggest that "fast" can be changed via ioctl. The
constant values control the value of "lp->lp_state", so it too is constant,
and the code for assigning different value to it is essentially dead.
Remove both "fast" and "active", and set "lp->lp_state" unconditionally;
that gets rid of the dead code and misleading comments.
CID: 1305692
CID: 1305734
Reported by: asomers
Reviewed by: asomers
MFC after: 1 week
Sponsored by: Panasas
Differential Revision: https://reviews.freebsd.org/D9302
When ixgbe receives an interrupt indicating that a new optical module
may have been inserted, it discards all of its current media types
by calling ifmedia_removeall() and then creates a new set of media
types for the supported media on the new module. However,
ifmedia_removeall() was maintaining a pointer to whatever the
current media type was before the call to ifmedia_removealL().
The result of this was that any attempt to read the current media
type of the interface (e.g. via ifconfig) would return potentially
garbage data from free memory (or if one were particularly unlucky
on an architecture that does not malloc() from a direct map, page
fault the kernel).
Fix this by NULL'ing out the current media field in if_media.c,
and have ixgbe update the current media type after recreating
them.
Submitted by: Matt Joras <matt.joras AT gmail DOT com>
Reviewed by: sbruno, erj
MFC after: 1 week
Sponsored by: Dell EMC Isilon
Differential Revision: https://reviews.freebsd.org/D9164
- Add RATELIMIT kernel configuration keyword which must be set to
enable the new functionality.
- Add support for hardware driven, Receive Side Scaling, RSS aware, rate
limited sendqueues and expose the functionality through the already
established SO_MAX_PACING_RATE setsockopt(). The API support rates in
the range from 1 to 4Gbytes/s which are suitable for regular TCP and
UDP streams. The setsockopt(2) manual page has been updated.
- Add rate limit function callback API to "struct ifnet" which supports
the following operations: if_snd_tag_alloc(), if_snd_tag_modify(),
if_snd_tag_query() and if_snd_tag_free().
- Add support to ifconfig to view, set and clear the IFCAP_TXRTLMT
flag, which tells if a network driver supports rate limiting or not.
- This patch also adds support for rate limiting through VLAN and LAGG
intermediate network devices.
- How rate limiting works:
1) The userspace application calls setsockopt() after accepting or
making a new connection to set the rate which is then stored in the
socket structure in the kernel. Later on when packets are transmitted
a check is made in the transmit path for rate changes. A rate change
implies a non-blocking ifp->if_snd_tag_alloc() call will be made to the
destination network interface, which then sets up a custom sendqueue
with the given rate limitation parameter. A "struct m_snd_tag" pointer is
returned which serves as a "snd_tag" hint in the m_pkthdr for the
subsequently transmitted mbufs.
2) When the network driver sees the "m->m_pkthdr.snd_tag" different
from NULL, it will move the packets into a designated rate limited sendqueue
given by the snd_tag pointer. It is up to the individual drivers how the rate
limited traffic will be rate limited.
3) Route changes are detected by the NIC drivers in the ifp->if_transmit()
routine when the ifnet pointer in the incoming snd_tag mismatches the
one of the network interface. The network adapter frees the mbuf and
returns EAGAIN which causes the ip_output() to release and clear the send
tag. Upon next ip_output() a new "snd_tag" will be tried allocated.
4) When the PCB is detached the custom sendqueue will be released by a
non-blocking ifp->if_snd_tag_free() call to the currently bound network
interface.
Reviewed by: wblock (manpages), adrian, gallatin, scottl (network)
Differential Revision: https://reviews.freebsd.org/D3687
Sponsored by: Mellanox Technologies
MFC after: 3 months
sys/net/iflib.c:
Add ctx to filter_info and don't skpi interrupt early on unless we're on an
SMP system
sys/kern/subr_gtaskqueue.c:
Skip smp check if we're running UP
Submitted by: Matt Macy <mmacy@nextbsd.org>
Reported by: emaste bde
This ensures the interface is initialized by the interface driver
before it can be used by the rest of the system.
Reviewed by: jhb, karels, gnn
MFC after: 3 week
Sponsored by: Microsoft
Differential Revision: https://reviews.freebsd.org/D8905
- iflib - add checksum in place support (mmacy)
- iflib - initialize IP for TSO (going to be needed for e1000) (mmacy)
- iflib - move isc_txrx from shared context to softc context (mmacy)
- iflib - Normalize checks in TXQ drainage. (shurd)
- iflib - Fix queue capping checks (mmacy)
- iflib - Fix invalid assert, em can need 2 sentinels (mmacy)
- iflib - let the driver determine what capabilities are set and what
tx csum flags are used (mmacy)
- add INVARIANTS debugging hooks to gtaskqueue enqueue (mmacy)
- update bnxt(4) to support the changes to iflib (shurd)
Some other various, sundry updates. Slightly more verbose changelog:
Submitted by: mmacy@nextbsd.org
Reviewed by: shurd
mFC after:
Sponsored by: LimeLight Networks and Dell EMC Isilon
If you run "ifconfig lagg0 destroy" and "ifconfig lagg0" at the same time a
page fault may result. The first process will destroy ifp->if_lagg in
lagg_clone_destroy (called by if_clone_destroy). Then the second process
will observe that ifp->if_lagg is NULL at the top of lagg_port_ioctl and
goto fallback: where it will promptly dereference ifp->if_lagg anyway.
The solution is to repeat the NULL check for ifp->if_lagg
MFC after: 4 weeks
Sponsored by: Spectra Logic Corp
Differential Revision: https://reviews.freebsd.org/D8512
This is a work in progress and some of this stuff may change;
but hopefully I'm laying down enough stuff and space in fields
to allow it to grow without another major recompile.
We'll see!
* Add a net80211 PHY type for VHT 2G and VHT 5G.
Note - yes, VHT is supposed to be for 5GHZ, however some vendors
(*cough* most of them) support some subset of VHT rate support
in 2GHz. No - not 80MHz wide channels, but at least some MCS8-9
support, maybe some beamforming, and maybe some longer A-MPDU
aggregates. I don't want to even think about MU-MIMO on 2GHz.
* Add an ifmedia placeholder type for VHT rates.
* Add channel flags for VHT, VHT20/40U/40D/80/80+80/160
* Add channel macros for the above
* Add ieee80211_channel fields for the VHT information and flags,
along with some padding (so this struct definitely grows.)
* Add a phy type flag for VHT - 'v'
* Bump the number of channels to a much higher amount - until we get
something like the linux mac80211 chanctx abstraction (where the
stack provides a current channel configuration via callbacks,
versus the driver ever checking ic->ic_curchan or similar) we'll
have to populate VHT+HT combinations.
Eg, there'll likely be a full set of duplicate VHT20/40 channels to match
HT channels. There will also be a full set of duplicate VHT80 channels -
note that for VHT80, its assumed you're doing VHT40 as a base, so we
don't need a duplicate of VHT80 + 20MHz only primary channels, only
a duplicate of all the VHT40 combinations.
I don't want to think about VHT80+80 or VHT160 for now - and I won't,
as the current device I'm doing 11ac bringup on (QCA9880) only does
VHT80.
I'll likely revisit the channel configuration and scanning related
stuff after I get VHT20/40 up.
* Add vht flags and the basic MCS rate setup to ieee80211com, ieee80211vap
and ieee80211_node in preparation for 11ac configuration.
There is zero code that uses this right now.
* Whilst here, add some more placeholders in case I need to extend
out things by some uint32_t flag sized fields. Hopefully I won't!
What I haven't yet done:
* any of the code that uses this
* any of the beamforming related fields
* any of the MU-MIMO fields required for STA/AP operation
* any of the IE fields in beacon frame / probe request/response handling
and the calculations required for shifting beacon contents around
when the TIM grows/shrinks
This will require a full rebuild of net80211 related programs -
ifconfig, hostapd, wpa_supplicant.
Since the previous algorithm, based on bit shifting, does not scale
with large replay windows, the algorithm used here is based on
RFC 6479: IPsec Anti-Replay Algorithm without Bit Shifting.
The replay window will be fast to be updated, but will cost as many bits
in RAM as its size.
The previous implementation did not provide a lock on the replay window,
which may lead to replay issues.
Reviewed by: ae
Obtained from: emeric.poupon@stormshield.eu
Sponsored by: Stormshield
Differential Revision: https://reviews.freebsd.org/D8468
- reset gen on down
- initialize admin task statically
- drain mp_ring on down
- don't drop context lock on stop
- reset error stats on down
- fix typo in min_latency sysctl
- return ENOBUFS from if_transmit if the driver isn't running or the link is down
Submitted by: mmacy@nextbsd.org
Reviewed by: shurd
MFC after: 2 days
Sponsored by: Isilon and Limelight Networks
Differential Revision: https://reviews.freebsd.org/D8558
Calling into an ifnet implementation with the if_addr_lock already
held can cause a LOR and potentially a deadlock, as ifnet
implementations typically can take the if_addr_lock after their
own locks during configuration. Refactor a sysctl handler that
was violating this to read if_counter data in a temporary buffer
before the if_addr_lock is taken, and then copying the data
in its final location later, when the if_addr_lock is held.
PR: 194109
Reported by: Jean-Sebastien Pedron
MFC after: 2 weeks
Differential Revision: https://reviews.freebsd.org/D8498
Reviewed by: sbruno
- use PCI_VENDOR and PCI_DEVICE ids from a publicly allocated range
(thanks to RedHat)
- export memory pool information through PCI registers
- improve mechanism for configuring passthrough on different hypervisors
Code is from Vincenzo Maffione as a follow up to his GSOC work.
Currently the network change is simulated by link status changes.
MFC after: 1 week
Sponsored by: Microsoft
Differential Revision: https://reviews.freebsd.org/D8295
These two ALU instructions first appeared on Linux. Then, libpcap adopted
and made them available since 1.6.2. Now more platforms including NetBSD
have them in kernel. So do we.
--이 줄 이하는 자동으로 제거됩니다--
The hashtype on an outgoing mbuf reflects the correct hash on the
transmit side of the connection. If this hash persists on loopback,
the receiving RSS/PCBGROUP code will use it to look up the pcbgroup
for the transmit side, which will often not match the pcbgroup for the
receive side of the connection. This leads to TCP connections
hanging, and dropping the SYN/ACK packet. This is essentially
the same as having a hardware network card generate mbufs with an
incorrect RSS hash.
There are a number of places which can set the hash on transmit,
so the simplest fix is to simply clear the hash at loopback time.
Clearing the hash allows a new, correct hash to be calculated in
software on the receive side.
Reviewed by: jtl
Discussed with: adrian
Sponsored by: Netflix
fix build on 32 bit platforms
simplify logic in netmap_virt.h
The commands (in net/netmap.h) to configure communication with the
hypervisor may be revised soon.
At the moment they are unused so this will not be a change of API.