messages before accessing message fields that may not be present,
removing dead/duplicate/misleading code along the way.
Document the message format for each routing socket message in
route.h.
Fix a bug in usr.bin/netstat introduced in r287351 that resulted in
pointer computation with essentially random 16-bit offsets and
dereferencing of the results.
Reviewed by: ae
MFC after: 1 month
Differential Revision: https://reviews.freebsd.org/D10330
if_vlan(4) interfaces inherit IPv4 checksum offloading flags from the
parent when VLAN_HWCSUM and VLAN_HWTAGGING flags are present on the
parent interface. Do the same for IPv6 checksum offloading flags.
Reported by: Harry Schmalzbauer
Reviewed by: np, gnn
MFC after: 1 week
Differential Revision: https://reviews.freebsd.org/D10356
administrator.
This restores the behavior that was prior to r274246.
No objection from: #network
MFC after: 2 weeks
Differential Revision: https://reviews.freebsd.org/D10215
so that we can use it in iflib to detect pause frames.
The igb(4) driver definitely used to use this in its old timer function and
I see no reason to restrict it to that driver only.
Sponsored by: Limelight Networks
Make PFIL's lock global and use it for this purpose.
This reduces the number of locks needed to acquire for each packet.
Obtained from: Yandex LLC
MFC after: 2 weeks
Sponsored by: Yandex LLC
No objection from: #network
Differential Revision: https://reviews.freebsd.org/D10154
Prevent possible races in the pf_unload() / pf_purge_thread() shutdown
code. Lock the pf_purge_thread() with the new pf_end_lock to prevent
these races.
Use a shared/exclusive lock, as we need to also acquire another sx lock
(VNET_LIST_RLOCK). It's fine for both pf_purge_thread() and pf_unload()
to sleep,
Pointed out by: eri, glebius, jhb
Differential Revision: https://reviews.freebsd.org/D10026
- unconditionally enable BUS_DMA on non-x86 architectures
- speed up rxd zeroing via customized function
- support out of order updates to rxd's
- add prefetching to hardware descriptor rings
- only prefetch on 10G or faster hardware
- add seperate tx queue intr function
- preliminary rework of NETMAP interfaces, WIP
Submitted by: Matt Macy <mmacy@nextbsd.org>
Sponsored by: Limelight Networks
Currently are defined three scopes: global, ifnet, and pcb.
Generic security policies that IKE daemon can add via PF_KEY interface
or an administrator creates with setkey(8) utility have GLOBAL scope.
Such policies can be applied by the kernel to outgoing packets and checked
agains inbound packets after IPsec processing.
Security policies created by if_ipsec(4) interfaces have IFNET scope.
Such policies are applied to packets that are passed through if_ipsec(4)
interface.
And security policies created by application using setsockopt()
IP_IPSEC_POLICY option have PCB scope. Such policies are applied to
packets related to specific socket. Currently there is no way to list
PCB policies via setkey(8) utility.
Modify setkey(8) and libipsec(3) to be able distinguish the scope of
security policies in the `setkey -DP` listing. Add two optional flags:
'-t' to list only policies related to virtual *tunneling* interfaces,
i.e. policies with IFNET scope, and '-g' to list only policies with GLOBAL
scope. By default policies from all scopes are listed.
To implement this PF_KEY's sadb_x_policy structure was modified.
sadb_x_policy_reserved field is used to pass the policy scope from the
kernel to userland. SADB_SPDDUMP message extended to support filtering
by scope: sadb_msg_satype field is used to specify bit mask of requested
scopes.
For IFNET policies the sadb_x_policy_priority field of struct sadb_x_policy
is used to pass if_ipsec's interface if_index to the userland. For GLOBAL
policies sadb_x_policy_priority is used only to manage order of security
policies in the SPDB. For IFNET policies it is not used, so it can be used
to keep if_index.
After this change the output of `setkey -DP` now looks like:
# setkey -DPt
0.0.0.0/0[any] 0.0.0.0/0[any] any
in ipsec
esp/tunnel/87.250.242.144-87.250.242.145/unique:145
spid=7 seq=3 pid=58025 scope=ifnet ifname=ipsec0
refcnt=1
# setkey -DPg
::/0 ::/0 icmp6 135,0
out none
spid=5 seq=1 pid=872 scope=global
refcnt=1
No objection from: #network
Obtained from: Yandex LLC
MFC after: 2 weeks
Sponsored by: Yandex LLC
Differential Revision: https://reviews.freebsd.org/D9805
that this shouldn't go in. I was unaware when I merged the pull
request. I don't wish to upset the status quo, so backout per
project practice.
Pull Request: https://github.com/freebsd/freebsd/pull/92
Noted by: hrs@
Renumber cluase 4 to 3, per what everybody else did when BSD granted
them permission to remove clause 3. My insistance on keeping the same
numbering for legal reasons is too pedantic, so give up on that point.
Submitted by: Jan Schaumann <jschauma@stevens.edu>
Pull Request: https://github.com/freebsd/freebsd/pull/96
In particular, don't set the synchronized bit for the peer unless it truly
appears to be synchronized to us. Also, don't set our own synchronized bit
unless we have actually seen a remote system.
Prior to this change, we were seeing some strange behavior, such as:
1. We send an advertisement with the Activity, Aggregation, and Default
flags, followed by an advertisement with the Activity, Aggregation,
Synchronization, and Default flags. However, we hadn't seen an
advertisement from another peer and were still advertising the default
(NULL) peer. A closer examination of the in-kernel data structures (using
kgdb) showed that the system had added the default (NULL) peer as a valid
aggregator for the segment.
2. We were receiving an advertisement from a peer that included the
default (NULL) peer instead of including our system information. However,
we responded with an advertisement that included the Synchronization flag
for both our system and the peer. (Since the peer's advertisement did not
include our system information, we shouldn't add the synchronization bit
for the peer.)
This patch corrects those two items.
Reviewed by: smh
MFC after: 2 weeks
Sponsored by: Netflix
Differential Revision: https://reviews.freebsd.org/D9485
Implement get_pcpu() for amd64/sparc64/mips/powerpc, and use it to
replace pcpu_find(curcpu) in MI code.
Reviewed by: andreast, kan, lidl
Tested by: lidl(mips, sparc64), andreast(powerpc)
Differential Revision: https://reviews.freebsd.org/D9587
Small summary
-------------
o Almost all IPsec releated code was moved into sys/netipsec.
o New kernel modules added: ipsec.ko and tcpmd5.ko. New kernel
option IPSEC_SUPPORT added. It enables support for loading
and unloading of ipsec.ko and tcpmd5.ko kernel modules.
o IPSEC_NAT_T option was removed. Now NAT-T support is enabled by
default. The UDP_ENCAP_ESPINUDP_NON_IKE encapsulation type
support was removed. Added TCP/UDP checksum handling for
inbound packets that were decapsulated by transport mode SAs.
setkey(8) modified to show run-time NAT-T configuration of SA.
o New network pseudo interface if_ipsec(4) added. For now it is
build as part of ipsec.ko module (or with IPSEC kernel).
It implements IPsec virtual tunnels to create route-based VPNs.
o The network stack now invokes IPsec functions using special
methods. The only one header file <netipsec/ipsec_support.h>
should be included to declare all the needed things to work
with IPsec.
o All IPsec protocols handlers (ESP/AH/IPCOMP protosw) were removed.
Now these protocols are handled directly via IPsec methods.
o TCP_SIGNATURE support was reworked to be more close to RFC.
o PF_KEY SADB was reworked:
- now all security associations stored in the single SPI namespace,
and all SAs MUST have unique SPI.
- several hash tables added to speed up lookups in SADB.
- SADB now uses rmlock to protect access, and concurrent threads
can do SA lookups in the same time.
- many PF_KEY message handlers were reworked to reflect changes
in SADB.
- SADB_UPDATE message was extended to support new PF_KEY headers:
SADB_X_EXT_NEW_ADDRESS_SRC and SADB_X_EXT_NEW_ADDRESS_DST. They
can be used by IKE daemon to change SA addresses.
o ipsecrequest and secpolicy structures were cardinally changed to
avoid locking protection for ipsecrequest. Now we support
only limited number (4) of bundled SAs, but they are supported
for both INET and INET6.
o INPCB security policy cache was introduced. Each PCB now caches
used security policies to avoid SP lookup for each packet.
o For inbound security policies added the mode, when the kernel does
check for full history of applied IPsec transforms.
o References counting rules for security policies and security
associations were changed. The proper SA locking added into xform
code.
o xform code was also changed. Now it is possible to unregister xforms.
tdb_xxx structures were changed and renamed to reflect changes in
SADB/SPDB, and changed rules for locking and refcounting.
Reviewed by: gnn, wblock
Obtained from: Yandex LLC
Relnotes: yes
Sponsored by: Yandex LLC
Differential Revision: https://reviews.freebsd.org/D9352
The switch to get_pcpu() in MI code seems to cause hangs on MIPS.
Back out until we can get a better idea of what's happening there.
Reported by: kan, lidl
structure:
if_gethwtsomax(), if_sethwtsomax() - if_hw_tsomax
if_gethwtsomaxsegcount(), if_sethwtsomaxsegcount() - if_hw_tsomaxsegcount
if_gethwtsomaxsegsize(), if_sethwtsomaxsegsize() - if_hw_tsomaxsegsize
Update em and vnic drivers which had already been coverted to use accessor
functions for the other ifnet structure members.
Reviewed by: erj
Approved by: sjg (mentor)
Obtained from: Juniper Networks, Inc.
Differential Revision: https://reviews.freebsd.org/D8544
This would enqueue an event to send the gratuitous arp on a dying lagg
interface without any physical ports attached to it.
Apart from that, the taskqueue_drain() on lagg_clone_destroy() runs too
late, when the ifp data structure is already freed. Fix that too.
Obtained from: pfSense
MFC after: 2 weeks
Sponsored by: Rubicon Communications, LLC (Netgate)
convention for interfaces, because only one stf(4) interface can exist
in the system.
This disallow the use of unit numbers different than 0, however, it is
possible to create the clone without specify the unit number (wildcard).
In the wildcard case we must update the interface name before return.
This fix an infinite recursion in pf code that keeps track of network
interfaces and groups:
1 - a group for the cloned type of the interface is added (stf in this
case);
2 - the system will now try to add an interface named stf (instead of
stf0) to stf group;
3 - when pfi_kif_attach() tries to search for an already existing 'stf'
interface, the 'stf' group is returned and thus the group is added
as an interface of itself;
This will now cause a crash at the first attempt to traverse the groups
which the stf interface belongs (which loops over itself).
Obtained from: pfSense
MFC after: 2 weeks
Sponsored by: Rubicon Communications, LLC (Netgate)
This interface type ("a parent interface of wlanX") is not used since
r287197
Reviewed by: adrian, glebius
Differential Revision: https://reviews.freebsd.org/D9308
Thank glebius for pointing this out:
"The network stuff shall not be added to sys/eventhandler.h"
Reviewed by: David_A_Bright_DELL.com, sephe, glebius
Approved by: sephe (mentor)
MFC after: 2 weeks
Sponsored by: Microsoft
Differential Revision: https://reviews.freebsd.org/D9345
We found routing performance dropped significantly when configuring
FreeBSD as a router, we are applying the following changes in order to
resolve those issues and hopefully perform better.
- don't prefetch the flags array, we usually don't need it
- prefetch the next cache line of each of the software descriptor arrays as
well as the first cache line of each of the next four packets' mbufs and
clusters
- reduce max copy size to 63 bytes
- convert rx soft descriptors from array of structures to a structure of arrays
- update copyrights
Submitted by: Matt Macy <mmacy@nextbsd.org>
This calls ioctl() handlers for the different interfaces in the bridge.
These handlers expect to get called in an ioctl context where it's safe
for them to sleep. We may not sleep with the bridge lock held.
However, we still need to protect the interface list, to ensure it
doesn't get changed while we iterate over it.
Use BRIDGE_XLOCK(), which prevents bridge members from being removed.
Adding bridge members is safe, because it uses LIST_INSERT_HEAD().
This caused panics when adding xen interfaces to a bridge.
PR: 216304
Reviewed by: ae
MFC after: 1 week
Sponsored by: RootBSD
Differential Revision: https://reviews.freebsd.org/D9290
(intentionally) deleted first and then completely added again (so all the
events, announces and hooks are given a chance to run).
This cause an issue with CARP where the existing CARP data structure is
removed together with the last address for a given VHID, which will cause
a subsequent fail when the address is later re-added.
This change fixes this issue by adding a new flag to keep the CARP data
structure when an address is not being removed.
There was an additional issue with IPv6 CARP addresses, where the CARP data
structure would never be removed after a change and lead to VHIDs which
cannot be destroyed.
Reviewed by: glebius
Obtained from: pfSense
MFC after: 2 weeks
Sponsored by: Rubicon Communications, LLC (Netgate)
success and a good value. Only then try to use it and set the MSIX_ENABLE
bit.
With the current em(4) driver we have observed failures in this case in a
specific environment when pci_find_cap() would not return the assumed
value, which meant we ended up writing to PCI register 2 (PCI_DEVICE_ID)
which is read-only.
PR: 216456
Submitted by: bz
Add internal tracking of smp startup status to reliably figure out
what methods are to be used to get gtaskqueue up and running.
e1000:
Calculating this pointer gives undefined behaviour when (last == -1)
(it is before the buffer). The pointer is always followed. Panics
occurred when it points to an unmapped page. Otherwise, the pointed-to
garbage tends to not have the E1000_TXD_STAT_DD bit set in it, so in the
broken case the loop was usually null and the function just returned, and
this was acidentally correct.
Submitted by: bde
Reported by: Matt Macy <mmacy@nextbsd.org>
Add internal tracking of smp startup status to reliably figure out
what methods are to be used to get gtaskqueue up and running.
e1000:
Calculating this pointer gives undefined behaviour when (last == -1)
(it is before the buffer). The pointer is always followed. Panics
occurred when it points to an unmapped page. Otherwise, the pointed-to
garbage tends to not have the E1000_TXD_STAT_DD bit set in it, so in the
broken case the loop was usually null and the function just returned, and
this was acidentally correct.
Submitted by: bde
Reviewed by: Matt Macy <mmacy@nextbsd.org>
Hyper-V's NIC SR-IOV implementation needs a Hyper-V synthetic NIC and
a VF NIC to work together, mainly to support seamless live migration.
When the VF device becomes UP (or DOWN), the synthetic NIC driver needs
to switch the data path from the synthetic NIC to the VF (or the opposite).
So the synthetic NIC driver needs to know when a VF device is becoming
UP or DOWN and hence the patch is made.
Reviewed by: sephe
Approved by: sephe (mentor)
MFC after: 2 weeks
Sponsored by: Microsoft
Differential Revision: https://reviews.freebsd.org/D8963
Variables "fast" and "active" are both constant in lacp_port_create(), but
comments mispleadingly suggest that "fast" can be changed via ioctl. The
constant values control the value of "lp->lp_state", so it too is constant,
and the code for assigning different value to it is essentially dead.
Remove both "fast" and "active", and set "lp->lp_state" unconditionally;
that gets rid of the dead code and misleading comments.
CID: 1305692
CID: 1305734
Reported by: asomers
Reviewed by: asomers
MFC after: 1 week
Sponsored by: Panasas
Differential Revision: https://reviews.freebsd.org/D9302
When ixgbe receives an interrupt indicating that a new optical module
may have been inserted, it discards all of its current media types
by calling ifmedia_removeall() and then creates a new set of media
types for the supported media on the new module. However,
ifmedia_removeall() was maintaining a pointer to whatever the
current media type was before the call to ifmedia_removealL().
The result of this was that any attempt to read the current media
type of the interface (e.g. via ifconfig) would return potentially
garbage data from free memory (or if one were particularly unlucky
on an architecture that does not malloc() from a direct map, page
fault the kernel).
Fix this by NULL'ing out the current media field in if_media.c,
and have ixgbe update the current media type after recreating
them.
Submitted by: Matt Joras <matt.joras AT gmail DOT com>
Reviewed by: sbruno, erj
MFC after: 1 week
Sponsored by: Dell EMC Isilon
Differential Revision: https://reviews.freebsd.org/D9164
- Add RATELIMIT kernel configuration keyword which must be set to
enable the new functionality.
- Add support for hardware driven, Receive Side Scaling, RSS aware, rate
limited sendqueues and expose the functionality through the already
established SO_MAX_PACING_RATE setsockopt(). The API support rates in
the range from 1 to 4Gbytes/s which are suitable for regular TCP and
UDP streams. The setsockopt(2) manual page has been updated.
- Add rate limit function callback API to "struct ifnet" which supports
the following operations: if_snd_tag_alloc(), if_snd_tag_modify(),
if_snd_tag_query() and if_snd_tag_free().
- Add support to ifconfig to view, set and clear the IFCAP_TXRTLMT
flag, which tells if a network driver supports rate limiting or not.
- This patch also adds support for rate limiting through VLAN and LAGG
intermediate network devices.
- How rate limiting works:
1) The userspace application calls setsockopt() after accepting or
making a new connection to set the rate which is then stored in the
socket structure in the kernel. Later on when packets are transmitted
a check is made in the transmit path for rate changes. A rate change
implies a non-blocking ifp->if_snd_tag_alloc() call will be made to the
destination network interface, which then sets up a custom sendqueue
with the given rate limitation parameter. A "struct m_snd_tag" pointer is
returned which serves as a "snd_tag" hint in the m_pkthdr for the
subsequently transmitted mbufs.
2) When the network driver sees the "m->m_pkthdr.snd_tag" different
from NULL, it will move the packets into a designated rate limited sendqueue
given by the snd_tag pointer. It is up to the individual drivers how the rate
limited traffic will be rate limited.
3) Route changes are detected by the NIC drivers in the ifp->if_transmit()
routine when the ifnet pointer in the incoming snd_tag mismatches the
one of the network interface. The network adapter frees the mbuf and
returns EAGAIN which causes the ip_output() to release and clear the send
tag. Upon next ip_output() a new "snd_tag" will be tried allocated.
4) When the PCB is detached the custom sendqueue will be released by a
non-blocking ifp->if_snd_tag_free() call to the currently bound network
interface.
Reviewed by: wblock (manpages), adrian, gallatin, scottl (network)
Differential Revision: https://reviews.freebsd.org/D3687
Sponsored by: Mellanox Technologies
MFC after: 3 months
sys/net/iflib.c:
Add ctx to filter_info and don't skpi interrupt early on unless we're on an
SMP system
sys/kern/subr_gtaskqueue.c:
Skip smp check if we're running UP
Submitted by: Matt Macy <mmacy@nextbsd.org>
Reported by: emaste bde
This ensures the interface is initialized by the interface driver
before it can be used by the rest of the system.
Reviewed by: jhb, karels, gnn
MFC after: 3 week
Sponsored by: Microsoft
Differential Revision: https://reviews.freebsd.org/D8905
- iflib - add checksum in place support (mmacy)
- iflib - initialize IP for TSO (going to be needed for e1000) (mmacy)
- iflib - move isc_txrx from shared context to softc context (mmacy)
- iflib - Normalize checks in TXQ drainage. (shurd)
- iflib - Fix queue capping checks (mmacy)
- iflib - Fix invalid assert, em can need 2 sentinels (mmacy)
- iflib - let the driver determine what capabilities are set and what
tx csum flags are used (mmacy)
- add INVARIANTS debugging hooks to gtaskqueue enqueue (mmacy)
- update bnxt(4) to support the changes to iflib (shurd)
Some other various, sundry updates. Slightly more verbose changelog:
Submitted by: mmacy@nextbsd.org
Reviewed by: shurd
mFC after:
Sponsored by: LimeLight Networks and Dell EMC Isilon
If you run "ifconfig lagg0 destroy" and "ifconfig lagg0" at the same time a
page fault may result. The first process will destroy ifp->if_lagg in
lagg_clone_destroy (called by if_clone_destroy). Then the second process
will observe that ifp->if_lagg is NULL at the top of lagg_port_ioctl and
goto fallback: where it will promptly dereference ifp->if_lagg anyway.
The solution is to repeat the NULL check for ifp->if_lagg
MFC after: 4 weeks
Sponsored by: Spectra Logic Corp
Differential Revision: https://reviews.freebsd.org/D8512
This is a work in progress and some of this stuff may change;
but hopefully I'm laying down enough stuff and space in fields
to allow it to grow without another major recompile.
We'll see!
* Add a net80211 PHY type for VHT 2G and VHT 5G.
Note - yes, VHT is supposed to be for 5GHZ, however some vendors
(*cough* most of them) support some subset of VHT rate support
in 2GHz. No - not 80MHz wide channels, but at least some MCS8-9
support, maybe some beamforming, and maybe some longer A-MPDU
aggregates. I don't want to even think about MU-MIMO on 2GHz.
* Add an ifmedia placeholder type for VHT rates.
* Add channel flags for VHT, VHT20/40U/40D/80/80+80/160
* Add channel macros for the above
* Add ieee80211_channel fields for the VHT information and flags,
along with some padding (so this struct definitely grows.)
* Add a phy type flag for VHT - 'v'
* Bump the number of channels to a much higher amount - until we get
something like the linux mac80211 chanctx abstraction (where the
stack provides a current channel configuration via callbacks,
versus the driver ever checking ic->ic_curchan or similar) we'll
have to populate VHT+HT combinations.
Eg, there'll likely be a full set of duplicate VHT20/40 channels to match
HT channels. There will also be a full set of duplicate VHT80 channels -
note that for VHT80, its assumed you're doing VHT40 as a base, so we
don't need a duplicate of VHT80 + 20MHz only primary channels, only
a duplicate of all the VHT40 combinations.
I don't want to think about VHT80+80 or VHT160 for now - and I won't,
as the current device I'm doing 11ac bringup on (QCA9880) only does
VHT80.
I'll likely revisit the channel configuration and scanning related
stuff after I get VHT20/40 up.
* Add vht flags and the basic MCS rate setup to ieee80211com, ieee80211vap
and ieee80211_node in preparation for 11ac configuration.
There is zero code that uses this right now.
* Whilst here, add some more placeholders in case I need to extend
out things by some uint32_t flag sized fields. Hopefully I won't!
What I haven't yet done:
* any of the code that uses this
* any of the beamforming related fields
* any of the MU-MIMO fields required for STA/AP operation
* any of the IE fields in beacon frame / probe request/response handling
and the calculations required for shifting beacon contents around
when the TIM grows/shrinks
This will require a full rebuild of net80211 related programs -
ifconfig, hostapd, wpa_supplicant.
Since the previous algorithm, based on bit shifting, does not scale
with large replay windows, the algorithm used here is based on
RFC 6479: IPsec Anti-Replay Algorithm without Bit Shifting.
The replay window will be fast to be updated, but will cost as many bits
in RAM as its size.
The previous implementation did not provide a lock on the replay window,
which may lead to replay issues.
Reviewed by: ae
Obtained from: emeric.poupon@stormshield.eu
Sponsored by: Stormshield
Differential Revision: https://reviews.freebsd.org/D8468
- reset gen on down
- initialize admin task statically
- drain mp_ring on down
- don't drop context lock on stop
- reset error stats on down
- fix typo in min_latency sysctl
- return ENOBUFS from if_transmit if the driver isn't running or the link is down
Submitted by: mmacy@nextbsd.org
Reviewed by: shurd
MFC after: 2 days
Sponsored by: Isilon and Limelight Networks
Differential Revision: https://reviews.freebsd.org/D8558
Calling into an ifnet implementation with the if_addr_lock already
held can cause a LOR and potentially a deadlock, as ifnet
implementations typically can take the if_addr_lock after their
own locks during configuration. Refactor a sysctl handler that
was violating this to read if_counter data in a temporary buffer
before the if_addr_lock is taken, and then copying the data
in its final location later, when the if_addr_lock is held.
PR: 194109
Reported by: Jean-Sebastien Pedron
MFC after: 2 weeks
Differential Revision: https://reviews.freebsd.org/D8498
Reviewed by: sbruno
- use PCI_VENDOR and PCI_DEVICE ids from a publicly allocated range
(thanks to RedHat)
- export memory pool information through PCI registers
- improve mechanism for configuring passthrough on different hypervisors
Code is from Vincenzo Maffione as a follow up to his GSOC work.
Currently the network change is simulated by link status changes.
MFC after: 1 week
Sponsored by: Microsoft
Differential Revision: https://reviews.freebsd.org/D8295
These two ALU instructions first appeared on Linux. Then, libpcap adopted
and made them available since 1.6.2. Now more platforms including NetBSD
have them in kernel. So do we.
--이 줄 이하는 자동으로 제거됩니다--
The hashtype on an outgoing mbuf reflects the correct hash on the
transmit side of the connection. If this hash persists on loopback,
the receiving RSS/PCBGROUP code will use it to look up the pcbgroup
for the transmit side, which will often not match the pcbgroup for the
receive side of the connection. This leads to TCP connections
hanging, and dropping the SYN/ACK packet. This is essentially
the same as having a hardware network card generate mbufs with an
incorrect RSS hash.
There are a number of places which can set the hash on transmit,
so the simplest fix is to simply clear the hash at loopback time.
Clearing the hash allows a new, correct hash to be calculated in
software on the receive side.
Reviewed by: jtl
Discussed with: adrian
Sponsored by: Netflix
fix build on 32 bit platforms
simplify logic in netmap_virt.h
The commands (in net/netmap.h) to configure communication with the
hypervisor may be revised soon.
At the moment they are unused so this will not be a change of API.
This commit, long overdue, contains contributions in the last 2 years
from Stefano Garzarella, Giuseppe Lettieri, Vincenzo Maffione, including:
+ fixes on monitor ports
+ the 'ptnet' virtual device driver, and ptnetmap backend, for
high speed virtual passthrough on VMs (bhyve fixes in an upcoming commit)
+ improved emulated netmap mode
+ more robust error handling
+ removal of stale code
+ various fixes to code and documentation (some mixup between RX and TX
parameters, and private and public variables)
We also include an additional tool, nmreplay, which is functionally
equivalent to tcpreplay but operating on netmap ports.
So that everyone in this task have consistent view of link state.
Reviewed by: ae
MFC after: 1 week
Sponsored by: Microsoft
Differential Revision: https://reviews.freebsd.org/D8214
The _correct_ way to identify the supported checksum offloading and
TSO parameters is to query OID_TCP_OFFLOAD_HARDWARE_CAPABILITIES.
MFC after: 1 week
Sponsored by: Microsoft
Differential Revision: https://reviews.freebsd.org/D8088
Since if_addrlist is used only for ipfilter(4), add a macro if_addrlist
in ip_compat.h.
Reviewed by: cy
Differential Revision: https://reviews.freebsd.org/D8059
While here, prefer if_addrhead (FreeBSD) to if_addrlist (BSD compat) naming
for the interface address list in sctp_bsd_addr.c
Reviewed by: tuexen
Differential Revision: https://reviews.freebsd.org/D8051
Fragmented UDP and ICMP packets were corrupted if a firewall with reassembling
feature (like pf'scrub) is enabled on the bridge. This patch fixes corrupted
packet problem and the panic (triggered easly with low RAM) as explain in PR
185633.
bridge_pfil and bridge_fragment relationship:
bridge_pfil() receive (IN direction) packets and sent it to the firewall The
firewall can be configured for reassembling fragmented packet (like pf'scrubing)
in one mbuf chain when bridge_pfil() need to send this reassembled packet to the
outgoing interface, it needs to re-fragment it by using bridge_fragment()
bridge_fragment() had to split this mbuf (using ip_fragment) first then
had to M_PREPEND each packet in the mbuf chain for adding Ethernet
header.
But M_PREPEND can sometime create a new mbuf on the begining of the mbuf chain,
then the "main" pointer of this mbuf chain should be updated and this case is
tottaly forgotten. The original bridge_fragment code (Revision 158140,
2006 April 29) came from OpenBSD, and the call to bridge_enqueue was
embedded. But on FreeBSD, bridge_enqueue() is done after bridge_fragment(),
then the original OpenBSD code can't work as-it of FreeBSD.
PR: 185633
Submitted by: Olivier Cochard-Labbé
Differential Revision: https://reviews.freebsd.org/D7780
They are defined by NDIS spec, so the NDIS prefix.
Reviewed by: hps
MFC after: 1 week
Sponsored by: Microsoft
Differential Revision: https://reviews.freebsd.org/D7717
Actually all OIDs defined in net/rndis.h are standard NDIS OIDs.
While I'm here, use the verbose macro name as in NDIS spec.
MFC after: 1 week
Sponsored by: Microsoft
Differential Revision: https://reviews.freebsd.org/D7679
And use new RNDIS set to configure NDIS offloading parameters.
MFC after: 1 week
Sponsored by: Microsoft
Differential Revision: https://reviews.freebsd.org/D7641
And switch MAC address query to use new RNDIS query function.
MFC after: 1 week
Sponsored by: Microsoft
Differential Revision: https://reviews.freebsd.org/D7639
While I'm here, sort the RNDIS status in ascending order.
MFC after: 1 week
Sponsored by: Microsoft
Differential Revision: https://reviews.freebsd.org/D7594
So that Hyper-V can leverage them instead of rolling its own definition.
Discussed with: hps
Reviewed by: hps
MFC after: 1 week
Sponsored by: Microsoft
Differential Revision: https://reviews.freebsd.org/D7592
Use netisr_get_cpuid() in netisr_select_cpuid() to limit cpuid value
returned by protocol to be sure that it is not greather than nws_count.
PR: 211836
Reviewed by: adrian
MFC after: 3 days
- Move group task queue into kern/subr_gtaskqueue.c
- Change intr_enable to return an int so it can be detected if it's not
implemented
- Allow different TX/RX queues per set to be different sizes
- Don't split up TX mbufs before transmit
- Allow a completion queue for TX as well as RX
- Pass the RX budget to isc_rxd_available() to allow an earlier return
and avoid multiple calls
Submitted by: shurd
Reviewed by: gallatin
Approved by: scottl
Differential Revision: https://reviews.freebsd.org/D7393
turn them into a shared definition.
Set M_MCAST/M_BCAST appropriately upon packet reception in net80211, just
before they are delivered up to the ethernet stack.
Submitted by: rstone
The "rtentry" zone does not use UMA_ZONE_ZINIT, so it is invalid to assume the
mutex's memory will be zero. Without MTX_NEW, garbage backing memory may
trigger the "re-initializing a mutex" assertion.
PR: 200991
Submitted by: Chang-Hsien Tsai <luke.tw AT gmail.com>
and getboottimebin(9) KPI. Change consumers of boottime to use the
KPI. The variables were renamed to avoid shadowing issues with local
variables of the same name.
Issue is that boottime* should be adjusted from tc_windup(), which
requires them to be members of the timehands structure. As a
preparation, this commit only introduces the interface.
Some uses of boottime were found doubtful, e.g. NLM uses boottime to
identify the system boot instance. Arguably the identity should not
change on the leap second adjustment, but the commit is about the
timekeeping code and the consumers were kept bug-to-bug compatible.
Tested by: pho (as part of the bigger patch)
Reviewed by: jhb (same)
Discussed with: bde
Sponsored by: The FreeBSD Foundation
MFC after: 1 month
X-Differential revision: https://reviews.freebsd.org/D7302
It looks like these sysctls were copy-pasted from netmap. Most were changed
from 'ixl_' prefix to 'iflib_', but this one was missed.
Fix the "can't re-use a leaf (ixl_rx_miss_bufs)!" warning.
Reported by: dim@ and others
Sponsored by: EMC / Isilon Storage Division
mp_maxid or CPU_FOREACH() as appropriate. This fixes a number of places in
the kernel that assumed CPU IDs are dense in [0, mp_ncpus) and would try,
for example, to run tasks on CPUs that did not exist or to allocate too
few buffers on systems with sparse CPU IDs in which there are holes in the
range and mp_maxid > mp_ncpus. Such circumstances generally occur on
systems with SMT, but on which SMT is disabled. This patch restores system
operation at least on POWER8 systems configured in this way.
There are a number of other places in the kernel with potential problems
in these situations, but where sparse CPU IDs are not currently known
to occur, mostly in the ARM machine-dependent code. These will be fixed
in a follow-up commit after the stable/11 branch.
PR: kern/210106
Reviewed by: jhb
Approved by: re (glebius)
will cal if_free() in case of conflict, error, ..
if_free() however sets the VNET instance from the ifp->if_vnet which
was not yet initialized but would only in if_attach(). Fix this by
setting the curvnet from where we allocate the interface in if_alloc().
if_attach() will later overwrite this as needed. We do not set the home_vnet
early on as we only want to prevent the if_free() panic but not change any
of the other housekeeping, e.g., triggered through ifioctl()s.
Reviewed by: brooks
Approved by: re (gjb)
MFC after: 2 weeks
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D7010
proper virtualisation, teardown, avoiding use-after-free, race conditions,
no longer creating a thread per VNET (which could easily be a couple of
thousand threads), gracefully ignoring global events (e.g., eventhandlers)
on teardown, clearing various globally cached pointers and checking
them before use.
Reviewed by: kp
Approved by: re (gjb)
Sponsored by: The FreeBSD Foundation
MFC after: 2 weeks
Differential Revision: https://reviews.freebsd.org/D6924
use. Update comments regarding the spare fields in struct inpcb.
Bump __FreeBSD_version for the changes to the size of the structures.
Reviewed by: gnn@
Approved by: re@ (gjb@)
Sponsored by: Chelsio Communications
some fields to match the order in the struct. Especially needed
if_pf_kif to do pf(4) VNET debugging.
Approved by: re (marius)
Obtained from: projects/vnet
MFC after: 1 week
Sponsored by: The FreeBSD Foundation
without VIMAGE support would dereference a NULL point unconditionally
leading to a panic. Wrap the entire VIMAGE related code with #ifdefs
rather than just the decision making part to save an extra bit of
resources.
Reported by: np
Sponsored by: The FreeBSD Foundation
MFC After: 13 days
Approved by: re (marius)
than removing the network interfaces first. This change is rather larger
and convoluted as the ordering requirements cannot be separated.
Move the pfil(9) framework to SI_SUB_PROTO_PFIL, move Firewalls and
related modules to their own SI_SUB_PROTO_FIREWALL.
Move initialization of "physical" interfaces to SI_SUB_DRIVERS,
move virtual (cloned) interfaces to SI_SUB_PSEUDO.
Move Multicast to SI_SUB_PROTO_MC.
Re-work parts of multicast initialisation and teardown, not taking the
huge amount of memory into account if used as a module yet.
For interface teardown we try to do as many of them as we can on
SI_SUB_INIT_IF, but for some this makes no sense, e.g., when tunnelling
over a higher layer protocol such as IP. In that case the interface
has to go along (or before) the higher layer protocol is shutdown.
Kernel hhooks need to go last on teardown as they may be used at various
higher layers and we cannot remove them before we cleaned up the higher
layers.
For interface teardown there are multiple paths:
(a) a cloned interface is destroyed (inside a VIMAGE or in the base system),
(b) any interface is moved from a virtual network stack to a different
network stack ("vmove"), or (c) a virtual network stack is being shut down.
All code paths go through if_detach_internal() where we, depending on the
vmove flag or the vnet state, make a decision on how much to shut down;
in case we are destroying a VNET the individual protocol layers will
cleanup their own parts thus we cannot do so again for each interface as
we end up with, e.g., double-frees, destroying locks twice or acquiring
already destroyed locks.
When calling into protocol cleanups we equally have to tell them
whether they need to detach upper layer protocols ("ulp") or not
(e.g., in6_ifdetach()).
Provide or enahnce helper functions to do proper cleanup at a protocol
rather than at an interface level.
Approved by: re (hrs)
Obtained from: projects/vnet
Reviewed by: gnn, jhb
Sponsored by: The FreeBSD Foundation
MFC after: 2 weeks
Differential Revision: https://reviews.freebsd.org/D6747
Adopt the OpenBSD syntax for setting and filtering on VLAN PCP values. This
introduces two new keywords: 'set prio' to set the PCP value, and 'prio' to
filter on it.
Reviewed by: allanjude, araujo
Approved by: re (gjb)
Obtained from: OpenBSD (mostly)
Differential Revision: https://reviews.freebsd.org/D6786
Due to an accidental mismatch between allocation and release in the slow path
of iflib_if_transmit, if a caller passed 9-16 mbufs to the routine, the mbuf
array would be leaked.
Fix the mismatch by removing the magic numbers in favor of nitems() on the
stack array. According to mmacy, this leak is unlikely.
Reported by: Coverity
Discussed with: mmacy
CID: 1356040
Sponsored by: EMC / Isilon Storage Division
Support for compression has been available from July 2007 but it
was never imported due to concerns with patents once held by
STAC/HiFn. The issues have clearly been resolved so bring it
in now.
Special thanks to Brett Glass for preserving the code and
pointing documentation for the expiration case.
Obtained from: mav (through Brett Glass)
Relnotes: yes
MFC after: 2 weeks
Differential Revision: https://reviews.freebsd.org/D6739
to NULL to avoid it being mis-treated on a possible re-attach but also
to get a clean NULL pointer derefence in case of errors due to
unexpected race conditions elsewhere in the code, e.g., callouts.
Obtained from: projects/vnet
MFC after: 2 weeks
Sponsored by: The FreeBSD Foundation
due to called functions (as in other parts of the stack, leave a comment).
Put around a lock the removal of the ifa from the list however to
reduce the possible race with other places.
Obtained from: projects/vnet
MFC after: 2 weeks
Sponsored by: The FreeBSD Foundation
family as an argument as well.
This will be used to cleanup individual protocols during VNET teardown.
Obtained from: projects/vnet
Sponsored by: The FreeBSD Foundation
which refers to IEEE 802.1p class of service and maps to the frame
priority level.
Values in order of priority are: 1 (Background (lowest)),
0 (Best effort (default)), 2 (Excellent effort),
3 (Critical applications), 4 (Video, < 100ms latency),
5 (Video, < 10ms latency), 6 (Internetwork control) and
7 (Network control (highest)).
Example of usage:
root# ifconfig em0.1 create
root# ifconfig em0.1 vlanpcp 3
Note:
The review D801 includes the pf(4) part, but as discussed with kristof,
we won't commit the pf(4) bits for now.
The credits of the original code is from rwatson.
Differential Revision: https://reviews.freebsd.org/D801
Reviewed by: gnn, adrian, loos
Discussed with: rwatson, glebius, kristof
Tested by: many including Matthew Grooms <mgrooms__shrew.net>
Obtained from: pfSense
Relnotes: Yes
Add accessor functions to toggle the state per VNET.
The base system (vnet0) will always enable itself with the normal
registration. We will share the registered protocol handlers in all
VNETs minimising duplication and management.
Upon disabling netisr processing for a VNET drain the netisr queue from
packets for that VNET.
Update netisr consumers to (de)register on a per-VNET start/teardown using
VNET_SYS(UN)INIT functionality.
The change should be transparent for non-VIMAGE kernels.
Reviewed by: gnn (, hiren)
Obtained from: projects/vnet
MFC after: 2 weeks
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D6691
but removed due to other changes in the system. Restore the llentry pointer
to the "struct route", and use it to cache the L2 lookup (ARP or ND6) as
appropriate.
Submitted by: Mike Karels
Differential Revision: https://reviews.freebsd.org/D6262
to use TRYLOCK rather than just acquire the lock, so just do that.
Reviewed by: markj
Obtained from: projects/vnet
MFC after: 2 weeks
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D6578
changing the type on the mtu field in struct tuninfo from short to
unsigned short.
This is used, for example, by packetdrill to test with MTUs up to the
maximum value.
Differential Revision: 6452
function in vnet.c move it to if.c where it logically belongs and put
it under a VNET_SYSUNINIT() call.
To not change the current behaviour make sure it runs first thing
during teardown. In the future this will allow us more flexibility
on changing the order on when we want to get rid of interfaces.
Stop exporting if_vmove() and make it file static.
Reviewed by: gnn
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D6438
This is set to the SI_SUB_* value before executing any VNET_SYSINIT
or VNET_SYSUNINT. While good for debugging especially VNET teardown
problems having a chance to know at which level during teardown we are,
it will also be used to identify to detcted a "stable state"
(as in fully up and running) later on.
Obtained from: projects/vnet
Sponsored by: The FreeBSD Foundation
"iflib is a library to eliminate the need for frequently duplicated device
independent logic propagated (poorly) across many network drivers."
Participation is purely optional. The IFLIB kernel config option is
provided for drivers that want to transition between legacy and iflib
modes of operation. ixl and ixgbe driver conversions will be committed
shortly. We hope to see participation from the Broadcom and maybe
Chelsio drivers in the near future.
Submitted by: mmacy@nextbsd.org
Reviewed by: gallatin
Differential Revision: D5211
Currently, Application Processors (non-boot CPUs) are started by
MD code at SI_SUB_CPU, but they are kept waiting in a "pen" until
SI_SUB_SMP at which point they are released to run kernel threads.
SI_SUB_SMP is one of the last SYSINIT levels, so APs don't enter
the scheduler and start running threads until fairly late in the
boot.
This change moves SI_SUB_SMP up to just before software interrupt
threads are created allowing the APs to start executing kernel
threads much sooner (before any devices are probed). This allows
several initialization routines that need to perform initialization
on all CPUs to now perform that initialization in one step rather
than having to defer the AP initialization to a second SYSINIT run
at SI_SUB_SMP. It also permits all CPUs to be available for
handling interrupts before any devices are probed.
This last feature fixes a problem on with interrupt vector exhaustion.
Specifically, in the old model all device interrupts were routed
onto the boot CPU during boot. Later after the APs were released at
SI_SUB_SMP, interrupts were redistributed across all CPUs.
However, several drivers for multiqueue hardware allocate N interrupts
per CPU in the system. In a system with many CPUs, just a few drivers
doing this could exhaust the available pool of interrupt vectors on
the boot CPU as each driver was allocating N * mp_ncpu vectors on the
boot CPU. Now, drivers will allocate interrupts on their desired CPUs
during boot meaning that only N interrupts are allocated from the boot
CPU instead of N * mp_ncpu.
Some other bits of code can also be simplified as smp_started is
now true much earlier and will now always be true for these bits of
code. This removes the need to treat the single-CPU boot environment
as a special case.
As a transition aid, the new behavior is available under a new kernel
option (EARLY_AP_STARTUP). This will allow the option to be turned off
if need be during initial testing. I plan to enable this on x86 by
default in a followup commit in the next few days and to have all
platforms moved over before 11.0. Once the transition is complete,
the option will be removed along with the !EARLY_AP_STARTUP code.
These changes have only been tested on x86. Other platform maintainers
are encouraged to port their architectures over as well. The main
things to check for are any uses of smp_started in MD code that can be
simplified and SI_SUB_SMP SYSINITs in MD code that can be removed in
the EARLY_AP_STARTUP case (e.g. the interrupt shuffling).
PR: kern/199321
Reviewed by: markj, gnn, kib
Sponsored by: Netflix
Two new functions are provided, bit_ffs_at() and bit_ffc_at(), which allow
for efficient searching of set or cleared bits starting from any bit offset
within the bit string.
Performance is improved by operating on longs instead of bytes and using
ffsl() for searches within a long. ffsl() is a compiler builtin in both
clang and gcc for most architectures, converting what was a brute force
while loop search into a couple of instructions.
All of the bitstring(3) API continues to be contained in the header file.
Some of the functions are large enough that perhaps they should be uninlined
and moved to a library, but that is beyond the scope of this commit.
sys/sys/bitstring.h:
Convert the majority of the existing bit string implementation from
macros to inline functions.
Properly protect the implementation from inadvertant macro expansion
when included in a user's program by prefixing all private
macros/functions and local variables with '_'.
Add bit_ffs_at() and bit_ffc_at(). Implement bit_ffs() and
bit_ffc() in terms of their "at" counterparts.
Provide a kernel implementation of bit_alloc(), making the full API
usable in the kernel.
Improve code documenation.
share/man/man3/bitstring.3:
Add pre-exisiting API bit_ffc() to the synopsis.
Document new APIs.
Document the initialization state of the bit strings
allocated/declared by bit_alloc() and bit_decl().
Correct documentation for bitstr_size(). The original code comments
indicate the size is in bytes, not "elements of bitstr_t". The new
implementation follows this lead. Only hastd assumed "elements"
rather than bytes and it has been corrected.
etc/mtree/BSD.tests.dist:
tests/sys/Makefile:
tests/sys/sys/Makefile:
tests/sys/sys/bitstring.c:
Add tests for all existing and new functionality.
include/bitstring.h
Include all headers needed by sys/bitstring.h
lib/libbluetooth/bluetooth.h:
usr.sbin/bluetooth/hccontrol/le.c:
Include bitstring.h instead of sys/bitstring.h.
sbin/hastd/activemap.c:
Correct usage of bitstr_size().
sys/dev/xen/blkback/blkback.c
Use new bit_alloc.
sys/kern/subr_unit.c:
Remove hard-coded assumption that sizeof(bitstr_t) is 1. Get rid of
unrb.busy, which caches the number of bits set in unrb.map. When
INVARIANTS are disabled, nothing needs to know that information.
callapse_unr can be adapted to use bit_ffs and bit_ffc instead.
Eliminating unrb.busy saves memory, simplifies the code, and
provides a slight speedup when INVARIANTS are disabled.
sys/net/flowtable.c:
Use the new kernel implementation of bit-alloc, instead of hacking
the old libc-dependent macro.
sys/sys/param.h
Update __FreeBSD_version to indicate availability of new API
Submitted by: gibbs, asomers
Reviewed by: gibbs, ngie
MFC after: 4 weeks
Sponsored by: Spectra Logic Corp
Differential Revision: https://reviews.freebsd.org/D6004
interested in having tunneled UDP and finding out about the
ICMP (tested by Michael Tuexen with SCTP.. soon to be using
this feature).
Differential Revision: http://reviews.freebsd.org/D5875
It seems rn_dupedkey may be NULL, because of the NULL check inside the loop.
(Also, the rt gets assigned from rn_dupedkey and NULL checked at top of loop.)
However, the for-loop update condition happens before the top-of-loop check and
dereferences 'rt' unconditionally.
Instead, NULL-check before dereferencing.
If rn_dupedkey cannot in fact be NULL, or something else protects this, feel
free to revert this and add an ASSERT of some kind instead.
This was introduced in r191080 (2009) and moved around slightly in r293657.
Reported by: Coverity
CID: 1348482
Sponsored by: EMC / Isilon Storage Division
rounddown2 tends to produce longer lines than the original code
and when the code has a high indentation level it was not really
advantageous to do the replacement.
This tries to strike a balance between readability using the macros
and flexibility of having the expressions, so not everything is
converted.
R_Zalloc is essentially a malloc(M_NOWAIT) wrapper. It is possible that 'rnh'
failed to allocate, but 'rmh' succeeds. In that case, we bail out of
rn_inithead() but previously did not free 'rmh'.
Introduced in r287073 (projects/routing) / MFP r294706.
Reported by: Coverity
CID: 1350258
Sponsored by: EMC / Isilon Storage Division
'lst' is allocated with 'n1' members. 'n' indexes 'lst'. So 'n == n1' is an
invalid 'lst' index. This is a follow-up to r296009.
Reported by: Coverity
CID: 1352743
Sponsored by: EMC / Isilon Storage Division
handler notifying about interface departure and one of the consumers will
detach if_bpf.
There is no way for us to re-attach this easily as the DLT and hdrlen are
only given on interface creation.
Add a function to allow us to query the DLT and hdrlen from a current
BPF attachment and after if_attach_internal() manually re-add the if_bpf
attachment using these values.
Found by panics triggered by nd6 packets running past BPF_MTAP() with no
proper if_bpf pointer on the interface.
Also add a basic DDB show function to investigate the if_bpf attachment
of an interface.
Reviewed by: gnn
MFC after: 2 weeks
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D5896
- properly V_irtualise variable access unbreaking VIMAGE kernels.
- remove the volatile from the function return type to make architecture
using gcc happy [-Wreturn-type]
"type qualifiers ignored on function return type"
I am not entirely happy with this solution putting the u_int there
but it will do for now.
route caching for TCP, with some improvements. In particular, invalidate
the route cache if a new route is added, which might be a better match.
The cache is automatically invalidated if the old route is deleted.
Submitted by: Mike Karels
Reviewed by: gnn
Differential Revision: https://reviews.freebsd.org/D4306
Unlike buf_ring_peek, it only supports single consumer mode, and it
clears the cons_head if DEBUG_BUFRING/INVARIANTS is defined.
The normal use case of drbr_peek for network drivers is:
m = drbr_peek(br);
err = hw_spec_encap(&m); /* could m_defrag/m_collapse */
(*)
if (err) {
if (m == NULL)
drbr_advance(br);
else
drbr_putback(br, m);
/* break the loop */
}
drbr_advance(br);
The race is:
If hw_spec_encap() m_defrag or m_collapse the mbuf, i.e. the old mbuf
was freed, or like the Hyper-V's network driver, that transmission-
done does not even require the TX lock; then on the other CPU at the
(*) time, the freed mbuf could be recycled and being drbr_enqueue even
before the current CPU had the chance to call drbr_{advance,putback}.
This triggers a panic in drbr_enqueue duplicated element check, if
DEBUG_BUFRING/INVARIANTS is defined.
Use buf_ring_peek_clear_sc() in drbr_peek() to fix the above race.
This change is a NO-OP, if neither DEBUG_BUFRING nor INVARIANTS are
defined.
MFC after: 1 week
Sponsored by: Microsoft OSTC
Differential Revision: https://reviews.freebsd.org/D5416
Copy the data into temprorary malloced buffer and drop the lock for
copyout.
Reported, reviewed and tested by: cem
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
Fix a panic that occurs when a vnet interface is unavailable at the time the
vnet jail referencing said interface is stopped.
Sponsored by: FIS Global, Inc.
back and harmize the use cases among RIB, IPFW, PF yet but it's also not
the scope of this work. Prevents instant panics on teardown and frees
the FIB bits again.
Sponsored by: The FreeBSD Foundation
There are number of radix consumers in kernel land (pf,ipfw,nfs,route)
with different requirements. In fact, first 3 don't have _any_ requirements
and first 2 does not use radix locking. On the other hand, routing
structure do have these requirements (rnh_gen, multipath, custom
to-be-added control plane functions, different locking).
Additionally, radix should not known anything about its consumers internals.
So, radix code now uses tiny 'struct radix_head' structure along with
internal 'struct radix_mask_head' instead of 'struct radix_node_head'.
Existing consumers still uses the same 'struct radix_node_head' with
slight modifications: they need to pass pointer to (embedded)
'struct radix_head' to all radix callbacks.
Routing code now uses new 'struct rib_head' with different locking macro:
RADIX_NODE_HEAD prefix was renamed to RIB_ (which stands for routing
information base).
New net/route_var.h header was added to hold routing subsystem internal
data. 'struct rib_head' was placed there. 'struct rtentry' will also
be moved there soon.
sent using roundrobin protocol and set a better granularity and distribution
among the interfaces. Tuning the number of packages sent by interface can
increase throughput and reduce unordered packets as well as reduce SACK.
Example of usage:
# ifconfig bge0 up
# ifconfig bge1 up
# ifconfig lagg0 create
# ifconfig lagg0 laggproto roundrobin laggport bge0 laggport bge1 \
192.168.1.1 netmask 255.255.255.0
# ifconfig lagg0 rr_limit 500
Reviewed by: thompsa, glebius, adrian (old patch)
Approved by: bapt (mentor)
Relnotes: Yes
Differential Revision: https://reviews.freebsd.org/D540
easier. Note: this is currently not in a usable state as certain
teardown parts are not called and the DOMAIN rework is missing.
More to come soon and find its way to head.
Obtained from: P4 //depot/user/bz/vimage/...
Sponsored by: The FreeBSD Foundation
Move actual rte selection process from rtalloc_mpath_fib()
to the rt_path_selectrte() function. Add public
rt_mpath_select() to use in fibX_lookup_ functions.
The only piece of information that is required is rt_flags subset.
In particular, if_loop() requires RTF_REJECT and RTF_BLACKHOLE flags
to check if this particular mbuf needs to be dropped (and what
error should be returned).
Note that if_loop() will always return EHOSTUNREACH for "reject" routes
regardless of RTF_HOST flag existence. This is due to upcoming routing
changes where RTF_HOST value won't be available as lookup result.
All other functions require RTF_GATEWAY flag to check if they need
to return EHOSTUNREACH instead of EHOSTDOWN error.
There are 11 places where non-zero 'struct route' is passed to if_output().
For most of the callers (forwarding, bpf, arp) does not care about exact
error value. In fact, the only place where this result is propagated
is ip_output(). (ip6_output() passes NULL route to nd6_output_ifp()).
Given that, add 3 new 'struct route' flags (RT_REJECT, RT_BLACKHOLE and
RT_IS_GW) and inline function (rt_update_ro_flags()) to copy necessary
rte flags to ro_flags. Call this function in ip_output() after looking up/
verifying rte.
Reviewed by: ae
Such handler should pass different set of variables, instead
of directly providing 2 locked route entries.
Given that it hasn't been really used since at least 2012, remove
current code.
Will re-add it after finishing most major routing-related changes.
Discussed with: np
Last consumer using RTF_RNH_LOCKED flag was eliminated in r291643.
Restrict passing RTF_RNH_LOCKED to rtrequest1_fib() and do better
locking for RTM_ADD / RTM_DELETE cases.
entries data in unified format.
There are control plane functions that require information other than
just next-hop data (e.g. individual rtentry fields like flags or
prefix/mask). Given that the goal is to avoid rte reference/refcounting,
re-use rt_addrinfo structure to store most rte fields. If caller wants
to retrieve key/mask or gateway (which are sockaddrs and are allocated
separately), it needs to provide sufficient-sized sockaddrs structures
w/ ther pointers saved in passed rt_addrinfo.
Convert:
* lltable new records checks (in_lltable_rtcheck(),
nd6_is_new_addr_neighbor().
* rtsock pre-add/change route check.
* IPv6 NS ND-proxy check (RADIX_MPATH code was eliminated because
1) we don't support RTF_ANNOUNCE ND-proxy for networks and there should
not be multiple host routes for such hosts 2) if we have multiple
routes we should inspect them (which is not done). 3) the entire idea
of abusing KRT as storage for ND proxy seems odd. Userland programs
should be used for that purpose).
Add ro_mtu field to 'struct route' to be able to pass lookup MTU back to
the caller.
Currently, ip6_getpmtu() has 2 totally different use cases:
1) control plane (IPV6_PATHMTU req), where we just need to calculate MTU
and return it, w/o any reusability.
2) Actual ip6_output() data path where we (nearly) always use the provided
route lookup data. If this data is not 'valid' we need to perform another
lookup and save the result (which cannot be re-used by ip6_output()).
Given that, handle 1) by calling separate function doing rte lookup itself.
Resulting MTU is calculated by (newly-added) ip6_calcmtu() used by both
ip6_getpmtu_ctl() and ip6_getpmtu().
For 2) instead of storing ref'ed rte, store mtu (the only needed data
from the lookup result) inside newly-added ro_mtu field.
'struct route' was shrinked by 8(or 4 bytes) in r292978. Grow it again
by 4 bytes. New ro_mtu field will be used in other places like
ip/tcp_output (EMSGSIZE handling from output routines).
Reviewed by: ae
Add if_requestencap() interface method which is capable of calculating
various link headers for given interface. Right now there is support
for INET/INET6/ARP llheader calculation (IFENCAP_LL type request).
Other types are planned to support more complex calculation
(L2 multipath lagg nexthops, tunnel encap nexthops, etc..).
Reshape 'struct route' to be able to pass additional data (with is length)
to prepend to mbuf.
These two changes permits routing code to pass pre-calculated nexthop data
(like L2 header for route w/gateway) down to the stack eliminating the
need for other lookups. It also brings us closer to more complex scenarios
like transparently handling MPLS nexthops and tunnel interfaces.
Last, but not least, it removes layering violation introduced by flowtable
code (ro_lle) and simplifies handling of existing if_output consumers.
ARP/ND changes:
Make arp/ndp stack pre-calculate link header upon installing/updating lle
record. Interface link address change are handled by re-calculating
headers for all lles based on if_lladdr event. After these changes,
arpresolve()/nd6_resolve() returns full pre-calculated header for
supported interfaces thus simplifying if_output().
Move these lookups to separate ether_resolve_addr() function which ether
returs error or fully-prepared link header. Add <arp|nd6_>resolve_addr()
compat versions to return link addresses instead of pre-calculated data.
BPF changes:
Raw bpf writes occupied _two_ cases: AF_UNSPEC and pseudo_AF_HDRCMPLT.
Despite the naming, both of there have ther header "complete". The only
difference is that interface source mac has to be filled by OS for
AF_UNSPEC (controlled via BIOCGHDRCMPLT). This logic has to stay inside
BPF and not pollute if_output() routines. Convert BPF to pass prepend data
via new 'struct route' mechanism. Note that it does not change
non-optimized if_output(): ro_prepend handling is purely optional.
Side note: hackish pseudo_AF_HDRCMPLT is supported for ethernet and FDDI.
It is not needed for ethernet anymore. The only remaining FDDI user is
dev/pdq mostly untouched since 2007. FDDI support was eliminated from
OpenBSD in 2013 (sys/net/if_fddisubr.c rev 1.65).
Flowtable changes:
Flowtable violates layering by saving (and not correctly managing)
rtes/lles. Instead of passing lle pointer, pass pointer to pre-calculated
header data from that lle.
Differential Revision: https://reviews.freebsd.org/D4102
implemented to lower the compiler warnings.
It fix the case of unused-but-set-variable spotted by gcc4.9.
Reviewed by: ngie, ae
Approved by: bapt (mentor)
Differential Revision: https://reviews.freebsd.org/D4720
epair(4), we may hit if_detach_internal() without holding a lock and by
the time we aquire it the interface might be gone.
We should not panic() in this case as it is our fault for not holding
the lock all the way. It is not ideal to return silently without error
to user space, but other callers will all ignore the return values so
do not change the entire KPI for little benefit for now.
The ifp will be dealt with one way or another still.
Sponsored by: The FreeBSD Foundation
MFC after: 2 weeks
Reviewed by: gnn
Differential Revision: https://reviews.freebsd.org/D4529
creation will print extra lines on the console. We are generally not
interested in this (repeated) information for each VNET. Thus only
print it for the default VNET. Virtual interfaces on the base system
will remain printing information, but e.g. each loopback in each vnet
will no longer cause a "bpf attached" line.
Sponsored by: The FreeBSD Foundation
MFC after: 2 weeks
Reviewed by: gnn
Differential Revision: https://reviews.freebsd.org/D4531
initialization.
Mfp4 @180384,180385:
There is no need for a dedicated SYSINIT here. The
list can be initialized statically.
Sponsored by: CK Software GmbH
Sponsored by: The FreeBSD Foundation
MFC after: 2 weeks
Reviewed by: gnn
Differential Revision: https://reviews.freebsd.org/D4528
Before the change, things like lle state were queried via
SIOCGNBRINFO_IN6 by ndp(8) for _each_ lle entry in dump.
This ioctl was added in 1999, probably to avoid touching rtsock code.
This change maps SIOCGNBRINFO_IN6 data to standard rtsock dump the
following way:
expire (already) maps to rtm_rmx.rmx_expire
isrouter -> rtm_flags & RTF_GATEWAY
asked -> rtm_rmx.rmx_pksent
state -> rtm_rmx.rmx_state (maps to rmx_weight via define)
Reviewed by: ae
When using lagg failover mode neither Gratuitous ARP (IPv4) or Unsolicited
Neighbour Advertisements (IPv6) are sent to notify other nodes that the
address may have moved.
This results is slow failover, dropped packets and network outages for the
lagg interface when the primary link goes down.
We now use the new if_link_state_change_cond with the force param set to
allow lagg to force through link state changes and hence fire a
ifnet_link_event which are now monitored by rip and nd6.
Upon receiving these events each protocol trigger the relevant
notifications:
* inet4 => Gratuitous ARP
* inet6 => Unsolicited Neighbour Announce
This also fixes the carp IPv6 NA's that stopped working after r251584 which
added the ipv6_route__llma route.
The new behavour can be controlled using the sysctls:
* net.link.ether.inet.arp_on_link
* net.inet6.icmp6.nd6_on_link
Also removed unused param from lagg_port_state and added descriptions for the
sysctls while here.
PR: 156226
MFC after: 1 month
Sponsored by: Multiplay
Differential Revision: https://reviews.freebsd.org/D4111
Before r291643, adding new interface prefix had the following logic:
try_add:
EEXIST && (PINNED) {
try_del(w/o PINNED flag)
if (OK)
try_add(PINNED)
}
In r291643, deletion was performed w/ PINNED flag held which leaded
to new interface prefixes (like ::1) overriding older ones.
Fix this by requesting deletion w/o RTF_PINNED.
PR: kern/205285
Submitted by: Fabian Keil <fk at fabiankeil.de>
LLE structure is mostly unchanged during its lifecycle: there are only 2
things relevant for fast path lookup code:
1) link-level address change. Since r286722, these updates are performed
under AFDATA WLOCK.
2) Some sort of feedback indicating that this particular entry is used so
we send NS to perform reachability verification instead of expiring entry.
The only signal that is needed from fast path is something like binary
yes/no.
The latter is solved by the following changes:
Special r_skip_req (introduced in D3688) value is used for fast path feedback.
It is read lockless by fast path, but updated under req_mutex mutex. If this
field is non-zero, then fast path will acquire lock and set it back to 0.
After transitioning to STALE state, callout timer is armed to run each
V_nd6_delay seconds to make sure that if packet was transmitted at the start
of given interval, we would be able to switch to PROBE state in V_nd6_delay
seconds as user expects.
(in STALE state) timer is rescheduled until original V_nd6_gctimer expires
keeping lle in STALE state (remaining timer value stored in lle_remtime).
(in STALE state) timer is rescheduled if packet was transmitted less that
V_nd6_delay seconds ago to make sure we transition to PROBE state exactly
after V_n6_delay seconds.
As a result, all packets towards lle in REACHABLE/STALE/PROBE states are handled
by fast path without acquiring lle read lock.
Differential Revision: https://reviews.freebsd.org/D3780
Vast majority of rtalloc(9) users require only basic info from
route table (e.g. "does the rtentry interface match with the interface
I have?". "what is the MTU?", "Give me the IPv4 source address to use",
etc..).
Instead of hand-rolling lookups, checking if rtentry is up, valid,
dealing with IPv6 mtu, finding "address" ifp (almost never done right),
provide easy-to-use API hiding all the complexity and returning the
needed info into small on-stack structure.
This change also helps hiding route subsystem internals (locking, direct
rtentry accesses).
Additionaly, using this API improves lookup performance since rtentry is not
locked.
(This is safe, since all the rtentry changes happens under both radix WLOCK
and rtentry WLOCK).
Sponsored by: Yandex LLC
LLE structure is mostly unchanged during its lifecycle.
To be more specific, there are 2 things relevant for fast path
lookup code:
1) link-level address change. Since r286722, these updates are performed
under AFDATA WLOCK.
2) Some sort of feedback indicating that this particular entry is used so
we re-send arp request to perform reachability verification instead of
expiring entry. The only signal that is needed from fast path is something
like binary yes/no.
The latter is solved by the following changes:
1) introduce special r_skip_req field which is read lockless by fast path,
but updated under (new) req_mutex mutex. If this field is non-zero, then
fast path will acquire lock and set it back to 0.
2) introduce simple state machine: incomplete->reachable<->verify->deleted.
Before that we implicitely had incomplete->reachable->deleted state machine,
with V_arpt_keep between "reachable" and "deleted". Verification was performed
in runtime 5 seconds before V_arpt_keep expire.
This is changed to "change state to verify 5 seconds before V_arpt_keep,
set r_skip_req to non-zero value and check it every second". If the value
is zero - then send arp verification probe.
These changes do not introduce any signifficant control plane overhead:
typically lle callout timer would fire 1 time more each V_arpt_keep (1200s)
for used lles and up to arp_maxtries (5) for dead lles.
As a result, all packets towards "reachable" lle are handled by fast path without
acquiring lle read lock.
Additional "req_mutex" is needed because callout / arpresolve_slow() or eventhandler
might keep LLE lock for signifficant amount of time, which might not be feasible
for fast path locking (e.g. having rmlock as ether AFDATA or lltable own lock).
Differential Revision: https://reviews.freebsd.org/D3688
by filter function instead of picking into routing table details in
each consumer.
Remove now-unused rt_expunge() (eliminating last external RTF_RNH_LOCKED
user).
This simplifies future nexthops/mulitipath changes and rtrequest1_fib()
locking refactoring.
Actual changes:
Add "rt_chain" field to permit rte grouping while doing batched delete
from routing table (thus growing rte 200->208 on amd64).
Add "rti_filter" / "rti_filterdata" / "rti_spare" fields to rt_addrinfo
to pass filter function to various routing subsystems in standard way.
Convert all rt_expunge() customers to new rt_addinfo-based api and eliminate
rt_expunge().
Use hhook(9) framework to achieve ability of loading and unloading
if_enc(4) kernel module. INET and INET6 code on initialization registers
two helper hooks points in the kernel. if_enc(4) module uses these helper
hook points and registers its hooks. IPSEC code uses these hhook points
to call helper hooks implemented in if_enc(4).
new return codes of -1 were mistakenly being considered "true". Callout_stop
now returns -1 to indicate the callout had either already completed or
was not running and 0 to indicate it could not be stopped. Also update
the manual page to make it more consistent no non-zero in the callout_stop
or callout_reset descriptions.
MFC after: 1 Month with associated callout change.
Add net.link.lagg.lacp.default_strict_mode which defines
the default value for LACP strict compliance for created
lagg devices.
Also:
* Add lacp_strict option to ifconfig(8).
* Fix lagg(4) creation examples.
* Minor style(9) fix.
MFC after: 1 week
sysctl and will always be on. The former split between default and
fast forwarding is removed by this commit while preserving the ability
to use all network stack features.
Differential Revision: https://reviews.freebsd.org/D4042
Reviewed by: ae, melifaro, olivier, rwatson
MFC after: 1 month
Sponsored by: Rubicon Communications (Netgate)