When epoch(9) was introduced to network stack, it was basically
dropped in place of existing locking, which was mutexes and
rwlocks. For the sake of performance mutex covered areas were
as small as possible, so became epoch covered areas.
However, epoch doesn't introduce any contention, it just delays
memory reclaim. So, there is no point to minimise epoch covered
areas in sense of performance. Meanwhile entering/exiting epoch
also has non-zero CPU usage, so doing this less often is a win.
Not the least is also code maintainability. In the new paradigm
we can assume that at any stage of processing a packet, we are
inside network epoch. This makes coding both input and output
path way easier.
On output path we already enter epoch quite early - in the
ip_output(), in the ip6_output().
This patch does the same for the input path. All ISR processing,
network related callouts, other ways of packet injection to the
network stack shall be performed in net_epoch. Any leaf function
that walks network configuration now asserts epoch.
Tricky part is configuration code paths - ioctls, sysctls. They
also call into leaf functions, so some need to be changed.
This patch would introduce more epoch recursions (see EPOCH_TRACE)
than we had before. They will be cleaned up separately, as several
of them aren't trivial. Note, that unlike a lock recursion the
epoch recursion is safe and just wastes a bit of resources.
Reviewed by: gallatin, hselasky, cy, adrian, kristof
Differential Revision: https://reviews.freebsd.org/D19111
When the software send queue gets filled up, callbacks to
if_transmit will stop. Make sure the transmit callback
routine checks the send queue and outputs any remaining
mbufs. Else the remaining mbufs may simply sit in the
output queue blocking the transmit path.
MFC after: 3 days
Sponsored by: Mellanox Technologies
This was enumerated with exhaustive search for sys/eventhandler.h includes,
cross-referenced against EVENTHANDLER_* usage with the comm(1) utility. Manual
checking was performed to avoid redundant includes in some drivers where a
common os_bsd.h (for example) included sys/eventhandler.h indirectly, but it is
possible some of these are redundant with driver-specific headers in ways I
didn't notice.
(These CUs did not show up as missing eventhandler.h in tinderbox.)
X-MFC-With: r347984
ib_req_notify_cq may return negative value which will indicate a
failure. In the case of uncorrectable error, we will end up in an
endless loop. Fix that, by going to another loop with poll_more
only if there is anything left to poll.
Submitted by: slavash@
MFC after: 3 days
Sponsored by: Mellanox Technologies
- Remove macros that covertly create epoch_tracker on thread stack. Such
macros a quite unsafe, e.g. will produce a buggy code if same macro is
used in embedded scopes. Explicitly declare epoch_tracker always.
- Unmask interface list IFNET_RLOCK_NOSLEEP(), interface address list
IF_ADDR_RLOCK() and interface AF specific data IF_AFDATA_RLOCK() read
locking macros to what they actually are - the net_epoch.
Keeping them as is is very misleading. They all are named FOO_RLOCK(),
while they no longer have lock semantics. Now they allow recursion and
what's more important they now no longer guarantee protection against
their companion WLOCK macros.
Note: INP_HASH_RLOCK() has same problems, but not touched by this commit.
This is non functional mechanical change. The only functionally changed
functions are ni6_addrs() and ni6_store_addrs(), where we no longer enter
epoch recursively.
Discussed with: jtl, gallatin
Modify QP can fail and it can be acceptable, like when moving from RST to
ERR state, all the rest are not acceptable and a message to the log
should be printed.
The current code prints on all failures and many messages like:
"Failed to modify QP to ERROR state" appear, even when supported by the
state machine of the QP object.
Linux commit:
5dc78ad1904db597bdb4427f3ead437aae86f54c
Submitted by: hselasky@
Approved by: hselasky (mentor)
MFC after: 1 week
Sponsored by: Mellanox Technologies
When a packet needs fragmentation, it might generate more than 3 fragments.
With the queue length 3, all fragments are generated faster than the
queue is drained, which effectively drops fourth and later fragments on
the floor.
Submitted by: kib@
Approved by: hselasky (mentor)
MFC after: 1 week
Sponsored by: Mellanox Technologies
When changing the MTU of ibX network interfaces, check that the MTU was really
changed before requesting an update of the multicast rules. Else we might go
into an infinite loop joining and leaving ibX multicast groups towards the
opensm master interface.
Submitted by: hselasky@
Approved by: hselasky (mentor)
MFC after: 1 week
Sponsored by: Mellanox Technologies
It is not enough to set ifnet->if_mtu to change the interface MTU.
System saves the MTU for route in the radix tree, and route cache keeps
the interface MTU as well. Since addition of the multicast group causes
recalculation of MTU, even bringing the interface up changes MTU from
4042 to 1500, which makes the system configuration inconsistent. Worse,
ip_output() prefers route MTU over interface MTU, so large packets are
not fragmented and dropped on floor.
Fix it for ipoib(4) using the same approach (or hack) as was applied
for it_tun/if_tap in r339012. Thanks to bz@ for giving the hint.
Submitted by: kib@
Approved by: hselasky (mentor)
MFC after: 1 week
Sponsored by: Mellanox Technologies
Also fix the validate_ipv4_net_dev() and validate_ipv6_net_dev() functions
which had source and destination addresses swapped, and didn't set the
scope ID for IPv6 link-local addresses.
This allows applications like krping to work using IPoIB devices.
MFC after: 3 days
Approved by: re (gjb)
Sponsored by: Mellanox Technologies
The original implementation used a reference to ifr_data and a cast to
do the equivalent of accessing ifr_addr. This was copied multiple
times since 1996.
Approved by: kib
MFC after: 1 week
Sponsored by: DARPA, AFRL
Differential Revision: https://reviews.freebsd.org/D14873
Mainly focus on files that use BSD 2-Clause license, however the tool I
was using misidentified many licenses so this was mostly a manual - error
prone - task.
The Software Package Data Exchange (SPDX) group provides a specification
to make it easier for automated tools to detect and summarize well known
opensource licenses. We are gradually adopting the specification, noting
that the tags are considered only advisory and do not, in any way,
superceed or replace the license texts.
No functional change intended.
Set the IPOIB_FLAG_INITIALIZED on dev_open and clear it on dev_stop to
avoid a race between ipoib load and the underlying device driver.
The device module must dispatch the IB_EVENT_PORT_ACTIVE event before ipoib
module is loaded. Otherwise, the flush will fail since no one set the
IPOIB_FLAG_INITIALIZED.
Submitted by: Slava Shwartsman <slavash@mellanox.com>
Sponsored by: Mellanox Technologies
MFC after: 1 week
This patch currently supports:
- ibcore as a kernel module only
- krping as a kernel module only
- ipoib as a kernel module only
Sponsored by: Mellanox Technologies
In pathological situations where the master subnet manager becomes
unresponsive for an extended period, we may otherwise end up queuing all
of the system's mbufs while waiting for a response to a path record lookup.
This addresses the same issue as commit 1e85b806f9 in Linux.
Reviewed by: cem, ngie
MFC after: 2 weeks
Sponsored by: EMC / Isilon Storage Division
tunable SYSCTL's. Linux module parameters are associated with the
module they belong to. FreeBSD does not share this concept of a parent
module. Instead add macros which define the prefix to use for the
module parameters in the LinuxKPI consumers.
While at it convert all "bool" LinuxKPI module parameters to "byte"
type, because we don't have a "bool" type of SYSCTL in FreeBSD.
Sponsored by: Mellanox Technologies
MFC after: 1 week
This fixes a kernel panic when using IPoIB with VIMAGE and infiniband.
PR: 208957
Sponsored by: Mellanox Technologies
Tested by: Justin Clift <justin@postgresql.org>
MFC after: 1 week
ipoib module.
The bpfdetach() function is trying to turn off promiscious mode on the
network interface it is attached to while holding a mutex. The fix
consists of ignoring any further calls to the ipoib_ioctl() function
when the network interface is going to be detached. The ipoib_ioctl()
function might sleep.
Sponsored by: Mellanox Technologies
MFC after: 1 week
The only piece of information that is required is rt_flags subset.
In particular, if_loop() requires RTF_REJECT and RTF_BLACKHOLE flags
to check if this particular mbuf needs to be dropped (and what
error should be returned).
Note that if_loop() will always return EHOSTUNREACH for "reject" routes
regardless of RTF_HOST flag existence. This is due to upcoming routing
changes where RTF_HOST value won't be available as lookup result.
All other functions require RTF_GATEWAY flag to check if they need
to return EHOSTUNREACH instead of EHOSTDOWN error.
There are 11 places where non-zero 'struct route' is passed to if_output().
For most of the callers (forwarding, bpf, arp) does not care about exact
error value. In fact, the only place where this result is propagated
is ip_output(). (ip6_output() passes NULL route to nd6_output_ifp()).
Given that, add 3 new 'struct route' flags (RT_REJECT, RT_BLACKHOLE and
RT_IS_GW) and inline function (rt_update_ro_flags()) to copy necessary
rte flags to ro_flags. Call this function in ip_output() after looking up/
verifying rte.
Reviewed by: ae
Add if_requestencap() interface method which is capable of calculating
various link headers for given interface. Right now there is support
for INET/INET6/ARP llheader calculation (IFENCAP_LL type request).
Other types are planned to support more complex calculation
(L2 multipath lagg nexthops, tunnel encap nexthops, etc..).
Reshape 'struct route' to be able to pass additional data (with is length)
to prepend to mbuf.
These two changes permits routing code to pass pre-calculated nexthop data
(like L2 header for route w/gateway) down to the stack eliminating the
need for other lookups. It also brings us closer to more complex scenarios
like transparently handling MPLS nexthops and tunnel interfaces.
Last, but not least, it removes layering violation introduced by flowtable
code (ro_lle) and simplifies handling of existing if_output consumers.
ARP/ND changes:
Make arp/ndp stack pre-calculate link header upon installing/updating lle
record. Interface link address change are handled by re-calculating
headers for all lles based on if_lladdr event. After these changes,
arpresolve()/nd6_resolve() returns full pre-calculated header for
supported interfaces thus simplifying if_output().
Move these lookups to separate ether_resolve_addr() function which ether
returs error or fully-prepared link header. Add <arp|nd6_>resolve_addr()
compat versions to return link addresses instead of pre-calculated data.
BPF changes:
Raw bpf writes occupied _two_ cases: AF_UNSPEC and pseudo_AF_HDRCMPLT.
Despite the naming, both of there have ther header "complete". The only
difference is that interface source mac has to be filled by OS for
AF_UNSPEC (controlled via BIOCGHDRCMPLT). This logic has to stay inside
BPF and not pollute if_output() routines. Convert BPF to pass prepend data
via new 'struct route' mechanism. Note that it does not change
non-optimized if_output(): ro_prepend handling is purely optional.
Side note: hackish pseudo_AF_HDRCMPLT is supported for ethernet and FDDI.
It is not needed for ethernet anymore. The only remaining FDDI user is
dev/pdq mostly untouched since 2007. FDDI support was eliminated from
OpenBSD in 2013 (sys/net/if_fddisubr.c rev 1.65).
Flowtable changes:
Flowtable violates layering by saving (and not correctly managing)
rtes/lles. Instead of passing lle pointer, pass pointer to pre-calculated
header data from that lle.
Differential Revision: https://reviews.freebsd.org/D4102
Problem description:
How do we currently perform layer 2 resolution and header imposition:
For IPv4 we have the following chain:
ip_output() -> (ether|atm|whatever)_output() -> arpresolve()
Lookup is done in proper place (link-layer output routine) and it is possible
to provide cached lle data.
For IPv6 situation is more complex:
ip6_output() -> nd6_output() -> nd6_output_ifp() -> (whatever)_output() ->
nd6_storelladdr()
We have ip6_ouput() which calls nd6_output() instead of link output routine.
nd6_output() does the following:
* checks if lle exists, creates it if needed (similar to arpresolve())
* performes lle state transitions (similar to arpresolve())
* calls nd6_output_ifp() which pushes packets to link output routine along
with running SeND/MAC hooks regardless of lle state
(e.g. works as run-hooks placeholder).
After that, iface output routine like ether_output() calls nd6_storelladdr()
which performs lle lookup once again.
As a result, we perform lookup twice for each outgoing packet for most types
of interfaces. We also need to maintain runtime-checked table of 'nd6-free'
interfaces (see nd6_need_cache()).
Fix this behavior by eliminating first ND lookup. To be more specific:
* make all nd6_output() consumers use nd6_output_ifp() instead
* rename nd6_output[_slow]() to nd6_resolve_[slow]()
* convert nd6_resolve() and nd6_resolve_slow() to arpresolve() semantics,
e.g. copy L2 address to buffer instead of pushing packet towards lower
layers
* Make all nd6_storelladdr() users use nd6_resolve()
* eliminate nd6_storelladdr()
The resulting callchain is the following:
ip6_output() -> nd6_output_ifp() -> (whatever)_output() -> nd6_resolve()
Error handling:
Currently sending packet to non-existing la results in ip6_<output|forward>
-> nd6_output() -> nd6_output _lle() which returns 0.
In new scenario packet is propagated to <ether|whatever>_output() ->
nd6_resolve() which will return EWOULDBLOCK, and that result
will be converted to 0.
(And EWOULDBLOCK is actually used by IB/TOE code).
Sponsored by: Yandex LLC
Differential Revision: https://reviews.freebsd.org/D1469
Highlights:
- Multiple verbs API updates
- Support for RoCE, RDMA over ethernet
All hardware drivers depending on the common infiniband stack has been
updated aswell.
Discussed with: np @
Sponsored by: Mellanox Technologies
MFC after: 1 month
"MODULE_VERSION" macro definition. Remove the redefinition of the
"MODULE_VERSION" macro from the Linux kernel compatibility API.
MFC after: 1 month
Reported by: np@
Sponsored by: Mellanox Technologies
by dumbbell@ to be able to compile this layer as a dependency module.
Clean up some Makefiles and remove the no longer used OFED define.
Currently only i386 and amd64 targets are supported.
MFC after: 1 month
Sponsored by: Mellanox Technologies
Mellanox hardware driver(s):
- Properly name an inclusion guard
- Fix compile warnings regarding unsigned enums
- Add two new sysctl nodes
- Remove all empty linux header files
- Make an error printout more verbose
- Use "mod_delayed_work()" instead of
cancelling and starting a timeout.
- Implement more Linux scatterlist
functions.
MFC after: 3 days
Sponsored by: Mellanox Technologies
hardware driver update from Mellanox Technologies.
- Remove empty files from the OFED Linux Emulation layer.
- Fix compile warnings related to printf() and the "%lld" and "%llx"
format specifiers.
- Add some missing 2-clause BSD copyrights.
- Add "Mellanox Technologies, Ltd." to list of copyright holders.
- Add some new compatibility files.
- Fix order of uninit in the mlx4ib module to avoid crash at unload
using the new module_exit_order() function.
MFC after: 1 week
Sponsored by: Mellanox Technologies
interface, in the r241616 a crutch was provided. It didn't work well, and
finally we decided that it is time to break ABI and simply make if_baudrate
a 64-bit value. Meanwhile, the entire struct if_data was reviewed.
o Remove the if_baudrate_pf crutch.
o Make all fields of struct if_data fixed machine independent size. The
notion of data (packet counters, etc) are by no means MD. And it is a
bug that on amd64 we've got a 64-bit counters, while on i386 32-bit,
which at modern speeds overflow within a second.
This also removes quite a lot of COMPAT_FREEBSD32 code.
o Give 16 bit for the ifi_datalen field. This field was provided to
make future changes to if_data less ABI breaking. Unfortunately the
8 bit size of it had effectively limited sizeof if_data to 256 bytes.
o Give 32 bits to ifi_mtu and ifi_metric.
o Give 64 bits to the rest of fields, since they are counters.
__FreeBSD_version bumped.
Discussed with: emax
Sponsored by: Netflix
Sponsored by: Nginx, Inc.
callback providers. link_init_sdl() function can be used to
fill most of the parameters. Use caller stack instead of
allocation / freing memory for each request. Do not drop support
for extra-long (probably non-existing) link-layer protocols by
introducing link_alloc_sdl() (used by if_resolvemulti() callback)
and link_free_sdl() (used by caller).
Since this change breaks KBI, MFC requires slightly different approach
(link_init_sdl() auto-allocating buffer if necessary to handle cases
with unmodified if_resolvemulti() callers).
MFC after: 2 weeks
to this event, adding if_var.h to files that do need it. Also, include
all includes that now are included due to implicit pollution via if_var.h
Sponsored by: Netflix
Sponsored by: Nginx, Inc.