LLE structure is mostly unchanged during its lifecycle.
To be more specific, there are 2 things relevant for fast path
lookup code:
1) link-level address change. Since r286722, these updates are performed
under AFDATA WLOCK.
2) Some sort of feedback indicating that this particular entry is used so
we re-send arp request to perform reachability verification instead of
expiring entry. The only signal that is needed from fast path is something
like binary yes/no.
The latter is solved by the following changes:
1) introduce special r_skip_req field which is read lockless by fast path,
but updated under (new) req_mutex mutex. If this field is non-zero, then
fast path will acquire lock and set it back to 0.
2) introduce simple state machine: incomplete->reachable<->verify->deleted.
Before that we implicitely had incomplete->reachable->deleted state machine,
with V_arpt_keep between "reachable" and "deleted". Verification was performed
in runtime 5 seconds before V_arpt_keep expire.
This is changed to "change state to verify 5 seconds before V_arpt_keep,
set r_skip_req to non-zero value and check it every second". If the value
is zero - then send arp verification probe.
These changes do not introduce any signifficant control plane overhead:
typically lle callout timer would fire 1 time more each V_arpt_keep (1200s)
for used lles and up to arp_maxtries (5) for dead lles.
As a result, all packets towards "reachable" lle are handled by fast path without
acquiring lle read lock.
Additional "req_mutex" is needed because callout / arpresolve_slow() or eventhandler
might keep LLE lock for signifficant amount of time, which might not be feasible
for fast path locking (e.g. having rmlock as ether AFDATA or lltable own lock).
Differential Revision: https://reviews.freebsd.org/D3688
Rename arp_ifinit2() into arp_announce_ifaddr().
Eliminate zeroing ifa_rtrequest: it was used for calling arp_rtrequest()
which was responsible for handling route cloning requests. It became
obsolete since r186119 (L2/L3 split).
unconditionally, we end up with an mbuf chain of two mbufs, which
later in in_arpreply() is rewritten from ARP request to ARP reply
and is sent out. Looks like igb(4) (at least mine, and at least
at my network) fails on such mbuf chain, so ARP reply doesn't go
out wire. Thus, make the m_pullup() call conditional, as it is
everywhere. Of course, the bug in igb(?) should be investigated,
but better first fix the head. And unconditional m_pullup() was
suboptimal, anyway.
without holding afdata wlock
* convert per-af delete_address callback to global lltable_delete_entry() and
more low-level "delete this lle" per-af callback
* fix some bugs/inconsistencies in IPv4/IPv6 ifscrub procedures
Sponsored by: Yandex LLC
Differential Revision: https://reviews.freebsd.org/D3573
Before that, the logic besides lle_create() was the following:
return existing if found, create if not. This behaviour was error-prone
since we had to deal with 'sudden' static<>dynamic lle changes.
This commit fixes bunch of different issues like:
- refcount leak when lle is converted to static.
Simple check case:
console 1:
while true;
do for i in `arp -an|awk '$4~/incomp/{print$2}'|tr -d '()'`;
do arp -s $i 00:22:44:66:88:00 ; arp -d $i;
done;
done
console 2:
ping -f any-dead-host-in-L2
console 3:
# watch for memory consumption:
vmstat -m | awk '$1~/lltable/{print$2}'
- possible problems in arptimer() / nd6_timer() when dropping/reacquiring
lock.
New logic explicitly handles use-or-create cases in every lla_create
user. Basically, most of the changes are purely mechanical. However,
we explicitly avoid using existing lle's for interface/static LLE records.
* While here, call lle_event handlers on all real table lle change.
* Create lltable_free_entry() calling existing per-lltable
lle_free_t callback for entry deletion
This change isolates the most common case (e.g. successful lookup)
from more complicates scenarios. It also (tries to) make code
more simple by avoiding retry: cycle.
The actual goal is to prepare code to the upcoming change that will
allow LL address retrieval without acquiring LLE lock at all.
Reviewed by: ae
Differential Revision: https://reviews.freebsd.org/D3383
separate bunch of functions. The goal is to isolate actual lle
updates to permit more fine-grained locking.
Do all lle link-level update under AFDATA wlock.
Sponsored by: Yandex LLC
This permits us having all (not fully true yet) all the info
needed in lookup process in first 64 bytes of 'struct llentry'.
struct llentry layout:
BEFORE:
[rwlock .. state .. state .. MAC ] (lle+1) [sockaddr_in[6]]
AFTER
[ in[6]_addr MAC .. state .. rwlock ]
Currently, address part of struct llentry has only 16 bytes for the key.
However, lltable does not restrict any custom lltable consumers with long
keys use the previous approach (store key at (lle+1)).
Sponsored by: Yandex LLC
* Move lle creation/deletion from lla_lookup to separate functions:
lla_lookup(LLE_CREATE) -> lla_create
lla_lookup(LLE_DELETE) -> lla_delete
lla_create now returns with LLE_EXCLUSIVE lock for lle.
* Provide typedefs for new/existing lltable callbacks.
Reviewed by: ae
Both are used to protect access to IP addresses lists and they can be
acquired for reading several times per packet. To reduce lock contention
it is better to use rmlock here.
Reviewed by: gnn (previous version)
Obtained from: Yandex LLC
Sponsored by: Yandex LLC
Differential Revision: https://reviews.freebsd.org/D3149
to be transmitted but the arp cache entry expired, which triggers an arp request
to be sent, the bpf code might want to sleep but crash the system due
to a non sleep lock held from the arp entry not released properly.
Release the lock before calling the arp request code to solve the issue
as is done on all the other code paths.
PR: 200323
Approved by: ae, gnn(mentor)
MFC after: 1 week
Sponsored by: Netgate
Differential Revision: https://reviews.freebsd.org/D2828
and arp were being used. They basically would pass in the
mutex to the callout_init. Because they used this method
to the callout system, it was possible to "stop" the callout.
When flushing the table and you stopped the running callout, the
callout_stop code would return 1 indicating that it was going
to stop the callout (that was about to run on the callout_wheel blocked
by the function calling the stop). Now when 1 was returned, it would
lower the reference count one extra time for the stopped timer, then
a few lines later delete the memory. Of course the callout_wheel was
stuck in the lock code and would then crash since it was accessing
freed memory. By using callout_init(c, 1) we always get a 0 back
and the reference counting bug does not rear its head. We do have
to make a few adjustments to the callouts themselves though to make
sure it does the proper thing if rescheduled as well as gets the lock.
Commented upon by hiren and sbruno
See Phabricator D1777 for more details.
Commented upon by hiren and sbruno
Reviewed by: adrian, jhb and bz
Sponsored by: Netflix Inc.
handle it in arc_output() instead of nd6_storelladdr().
* Remove IFT_ARCNET check from arpresolve() since arc_output() does not
use arpresolve() to handle broadcast/multicast. This check was there
since r84931. It looks like it was not used since r89099 (initial
import of Arcnet support where multicast is handled separately).
* Remove IFT_IEEE1394 case from nd6_storelladdr() since firewire_output()
calles nd6_storelladdr() for unicast addresses only.
* Remove IFT_ARCNET case from nd6_storelladdr() since arc_output() now
handles multicast by itself.
As a result, we have the following pattern: all non-ethernet-style
media have their own multicast map handling inside their appropriate
routines. On the other hand, arpresolve() (and nd6_storelladdr()) which
meant to be 'generic' ones de-facto handles ethernet-only multicast maps.
MFC after: 3 weeks
the knowledge of mbuf layout, and in particular constants such as M_EXT,
MLEN, MHLEN, and so on, in mbuf consumers by unifying various alignment
utility functions (M_ALIGN(), MH_ALIGN(), MEXT_ALIGN() in a single
M_ALIGN() macro, implemented by a now-inlined m_align() function:
- Move m_align() from uipc_mbuf.c to mbuf.h; mark as __inline.
- Reimplement M_ALIGN(), MH_ALIGN(), and MEXT_ALIGN() using m_align().
- Update consumers around the tree to simply use M_ALIGN().
This change eliminates a number of cases where mbuf consumers must be aware
of whether or not mbufs returned by the allocator use external storage, but
also assumptions about the size of the returned mbuf. This will make it
easier to introduce changes in how we use external storage, as well as
features such as variable-size mbufs.
Differential Revision: https://reviews.freebsd.org/D1436
Reviewed by: glebius, trasz, gnn, bz
Sponsored by: EMC / Isilon Storage Division
restricted to a single FIB in a multifib system.
Restricting an interface's routes to the FIB to which it is assigned (by
setting net.add_addr_allfibs=0) causes ARP updates to fail with "arpresolve:
can't allocate llinfo for x.x.x.x". This is due to the ARP update code hard
coding it's lookup for existing routing entries to FIB 0.
sys/netinet/in.c:
When dealing with RTM_ADD (add route) requests for an interface, use
the interface's assigned FIB instead of the default (FIB 0).
sys/netinet/if_ether.c:
In arpresolve(), enhance error message generated when an
lla_lookup() fails so that the interface causing the error is
visible in logs.
tests/sys/netinet/fibs_test.sh
Clear ATF expected error.
PR: kern/167947
Submitted by: Nikolay Denev <ndenev@gmail.com> (previous version)
Reviewed by: melifaro
MFC after: 3 weeks
Sponsored by: Spectra Logic Corporation
Thus we can use IF_AFDATA_RLOCK() instead of IF_AFDATA_LOCK() when doing
lla_lookup() without LLE_CREATE flag.
Reviewed by: glebius, adrian
MFC after: 1 week
Sponsored by: Yandex LLC
default from the very beginning. It was placed in wrong namespace
net.link.ether, originally it had been at another wrong namespace. It was
incorrectly documented at incorrect manual page arp(8). Since new-ARP commit,
the tunable have been consulted only on route addition, and ignored on route
deletion. Behaviour of a system with tunable turned off is not fully correct,
and has no advantages comparing to normal behavior.
to this event, adding if_var.h to files that do need it. Also, include
all includes that now are included due to implicit pollution via if_var.h
Sponsored by: Netflix
Sponsored by: Nginx, Inc.
entering llentry_free(), and in case if we lose the race, we should simply
perform LLE_FREE_LOCKED(). Otherwise, if the race is lost by the thread
performing arptimer(), it will remove two references from the lle instead
of one.
Reported by: Ian FREISLICH <ianf clue.co.za>
with multicast bit set. FreeBSD refuses to install such
entries since 9.0, and this broke installations running
Microsoft NLB, which are violating standards.
Tested by: Tarasov Oleg <oleg_tarasov sg-tea.com>
llentry_free() and arptimer():
o Use callout_init_rw() for lle timeout, this allows us safely
disestablish them.
- This allows us to simplify the arptimer() and make it
race safe.
o Consistently use ifp->if_afdata_lock to lock access to
linked lists in the lle hashes.
o Introduce new lle flag LLE_LINKED, which marks an entry that
is attached to the hash.
- Use LLE_LINKED to avoid double unlinking via consequent
calls to llentry_free().
- Mark lle with LLE_DELETED via |= operation istead of =,
so that other flags won't be lost.
o Make LLE_ADDREF(), LLE_REMREF() and LLE_FREE_LOCKED() more
consistent and provide more informative KASSERTs.
The patch is a collaborative work of all submitters and myself.
PR: kern/165863
Submitted by: Andrey Zonov <andrey zonov.org>
Submitted by: Ryan Stone <rysto32 gmail.com>
Submitted by: Eric van Gyzen <eric_van_gyzen dell.com>
- Stateful TCP offload drivers for Terminator 3 and 4 (T3 and T4) ASICs.
These are available as t3_tom and t4_tom modules that augment cxgb(4)
and cxgbe(4) respectively. The cxgb/cxgbe drivers continue to work as
usual with or without these extra features.
- iWARP driver for Terminator 3 ASIC (kernel verbs). T4 iWARP in the
works and will follow soon.
Build-tested with make universe.
30s overview
============
What interfaces support TCP offload? Look for TOE4 and/or TOE6 in the
capabilities of an interface:
# ifconfig -m | grep TOE
Enable/disable TCP offload on an interface (just like any other ifnet
capability):
# ifconfig cxgbe0 toe
# ifconfig cxgbe0 -toe
Which connections are offloaded? Look for toe4 and/or toe6 in the
output of netstat and sockstat:
# netstat -np tcp | grep toe
# sockstat -46c | grep toe
Reviewed by: bz, gnn
Sponsored by: Chelsio communications.
MFC after: ~3 months (after 9.1, and after ensuring MFC is feasible)
comments to longer, also refining strange ones.
Properly use #ifdef rather than #if defined() where possible. Four
#if defined(PCBGROUP) occurances (netinet and netinet6) were ignored to
avoid conflicts with eventually upcoming changes for RSS.
Reported by: bde (most)
Reviewed by: bde
MFC after: 3 days
in the ARP datagram generated by arprequest(). If caller doesn't
supply the address, then it is either picked from CARP or hardware
address of the interface is taken.
While here, make several minor fixes:
- Hold IF_ADDR_RLOCK(ifp) while traversing address list.
- Remove not true comment.
- Access internet address and mask via in_ifaddr fields,
rather than ifaddr.