Use RTM_PINNED flag to mark route as immutable.
Forbid deleting immutable routes without special rtrequest1_fib() flag.
Adding interface address with prefix already in route table is handled
by atomically deleting old prefix and adding interface one.
Discussed with: andre, eri
MFC after: 3 weeks
length packets, which was actually harmless.
Note that peers with different version of head/ may grow this
counter, but it is harmless - all pfsync data is processed.
Reported & tested by: Anton Yuzhaninov <citrin citrin.ru>
Sponsored by: Nginx, Inc
of helper functions:
- carp_master() - boolean function which is true if an address
is in the MASTER state.
- ifa_preferred() - boolean function that compares two addresses,
and is aware of CARP.
Utilize ifa_preferred() in ifa_ifwithnet().
The previous version of patch also changed source address selection
logic in jails using carp_master(), but we failed to negotiate this part
with Bjoern. May be we will approach this problem again later.
Reported & tested by: Anton Yuzhaninov <citrin citrin.ru>
Sponsored by: Nginx, Inc
of the newer drivers. The basic problem was
that the driver was pulling the mbuf off the
drbr ring and then when sending with xmit(), encounting
a full transmit ring. Thus the lower layer
xmit() function would return an error, and the
drivers would then append the data back on to the ring.
For TCP this is a horrible scenario sure to bring
on a fast-retransmit.
The fix is to use drbr_peek() to pull the data pointer
but not remove it from the ring. If it fails then
we either call the new drbr_putback or drbr_advance
method. Advance moves it forward (we do this sometimes
when the xmit() function frees the mbuf). When
we succeed we always call advance. The
putback will always copy the mbuf back to the top
of the ring. Note that the putback *cannot* be used
with a drbr_dequeue() only with drbr_peek(). We most
of the time, in putback, would not need to copy it
back since most likey the mbuf is still the same, but
sometimes xmit() functions will change the mbuf via
a pullup or other call. So the optimial case for
the single consumer is to always copy it back. If
we ever do a multiple_consumer (for lagg?) we
will need a test and atomic in the put back possibly
a seperate putback_mc() in the ring buf.
Reviewed by: jhb@freebsd.org, jlv@freebsd.org
Since ARP and routing are separated, "proxy only" entries
don't have any meaning, thus we don't need additional field
in sockaddr to pass SIN_PROXY flag.
New kernel is binary compatible with old tools, since sizes
of sockaddr_inarp and sockaddr_in match, and sa_family are
filled with same value.
The structure declaration is left for compatibility with
third party software, but in tree code no longer use it.
Reviewed by: ru, andre, net@
lle_event replaced arp_update_event after the ARP rewrite and ended up
in if_ether.h simply because arp_update_event used to be there too.
IPv6 neighbor discovery is going to grow lle_event support and this is a
good time to move it to if_llatbl.h.
The two in-tree consumers of this event - OFED and toecore - are not
affected.
Reviewed by: bz@
in from kernel-pppd which is long gone) so that ZFS and DTRACE play nice.
This is a horrible hack to get freefall to compile, and is in dire need
of reconciliation. This antique zlib-1.04 code needs to go away.
entering llentry_free(), and in case if we lose the race, we should simply
perform LLE_FREE_LOCKED(). Otherwise, if the race is lost by the thread
performing arptimer(), it will remove two references from the lle instead
of one.
Reported by: Ian FREISLICH <ianf clue.co.za>
cause kernel panics.
Add a flag to the bpf descriptor to indicate whether the hold buffer
is in use. In bpfread(), set the "hold buffer in use" flag before
dropping the descriptor lock during the call to bpf_uiomove().
Everywhere else the hold buffer is used or changed, wait while
the hold buffer is in use by bpfread(). Add a KASSERT in bpfread()
after re-acquiring the descriptor lock to assist uncovering any
additional hold buffer races.
- Deembed scope id in L3 address in in6_lltable_dump().
- Simplify scope id recovery in rtsock routines.
- Remove embedded scope id handling in ndp(8) and route(8) completely.
- Check V_deembed_scopeid before checking if sa_family == AF_INET6.
- Fix scope id handing in route(8)[2] and ifconfig(8).
Reported by: rpaulo[1], Mateusz Guzik[1], peter[2]
for bridge interface.
- If we found a collision we can break the loop - only one collision is
possible and one is exactly enough to need to renegerate.
Obtained from: WHEEL Systems
MFC after: 1 week
userland via routing socket or sysctl. This eliminates the following
KAME-specific sin6_scope_id handling routine from each userland utility:
sin6.sin6_scope_id = ntohs(*(u_int16_t *)&sin6.sin6_addr.s6_addr[2]);
This behavior can be controlled by net.inet6.ip6.deembed_scopeid. This is
set to 1 by default (sin6_scope_id will be filled in the kernel).
Reviewed by: bz
Instead, add protocol specific mbuf flags M_IP_NEXTHOP and
M_IP6_NEXTHOP. Use them to indicate that the mbuf's chain
contains the PACKET_TAG_IPFORWARD tag. And do a tag lookup
only when this flag is set.
Suggested by: andre
on checksums directly from mbuf flags. This simplifies code.
o Clear CSUM_IP from the mbuf in ip_fragment() if we did checksums in
hardware. Some driver may not announce CSUM_IP in theur if_hwassist,
although try to do checksums if CSUM_IP set on mbuf. Example is em(4).
o While here, consistently use CSUM_IP instead of its alias CSUM_DELAY_IP.
After this change CSUM_DELAY_IP vanishes from the stack.
Submitted by: Sebastian Kuzminsky <seb lineratesystems.com>
on the related functionality in the runtime via the sysctl variable
net.pfil.forward. It is turned off by default.
Sponsored by: Yandex LLC
Discussed with: net@
MFC after: 2 weeks
in network byte order. Any host byte order processing is
done in local variables and host byte order values are
never[1] written to a packet.
After this change a packet processed by the stack isn't
modified at all[2] except for TTL.
After this change a network stack hacker doesn't need to
scratch his head trying to figure out what is the byte order
at the given place in the stack.
[1] One exception still remains. The raw sockets convert host
byte order before pass a packet to an application. Probably
this would remain for ages for compatibility.
[2] The ip_input() still subtructs header len from ip->ip_len,
but this is planned to be fixed soon.
Reviewed by: luigi, Maxim Dounin <mdounin mdounin.ru>
Tested by: ray, Olivier Cochard-Labbe <olivier cochard.me>
filters (ipfw and PF) use the same ruleset with the same lock for both
AF_INET and AF_INET6 there is no need in more fine-grade locking.
However, it is possible to request personal lock by specifying
PFIL_FLAG_PRIVATE_LOCK flag in pfil_head structure (see pfil.9 for
more details).
Export PFIL lock via rw_lock(9)/rm_lock(9)-like API permitting pfil consumers
to use this lock instead of own lock. This help reducing locks on main
traffic path.
pfil_assert() is currently not implemented due to absense of rm_assert().
Waiting for some kind of r234648 to be merged in HEAD.
This change is part of bigger patch reducing routing locking.
Sponsored by: Yandex LLC
Reviewed by: glebius, ae
OK'd by: silence on net@
MFC after: 3 weeks
around the problem where high speed interfaces (such as ixgbe(4))
are not able to report real ifi_baudrate. bascially, take a spare
byte from struct if_data and use it to store ifi_baudrate power
factor. in other words,
real ifi_baudrate = ifi_baudrate * 10 ^ ifi_baudrate power factor
this should be backwards compatible with old binaries. use ixgbe(4)
as an example on how drivers would set ifi_baudrate power factor
Discussed with: kib, scottl, glebius
MFC after: 1 week
now use function calls:
if_clone_simple()
if_clone_advanced()
to initialize a cloner, instead of macros that initialize if_clone
structure.
Discussed with: brooks, bz, 1 year ago
- All packets in NETISR_IP queue are in net byte order.
- ip_input() is entered in net byte order and converts packet
to host byte order right _after_ processing pfil(9) hooks.
- ip_output() is entered in host byte order and converts packet
to net byte order right _before_ processing pfil(9) hooks.
- ip_fragment() accepts and emits packet in net byte order.
- ip_forward(), ip_mloopback() use host byte order (untouched actually).
- ip_fastforward() no longer modifies packet at all (except ip_ttl).
- Swapping of byte order there and back removed from the following modules:
pf(4), ipfw(4), enc(4), if_bridge(4).
- Swapping of byte order added to ipfilter(4), based on __FreeBSD_version
- __FreeBSD_version bumped.
- pfil(9) manual page updated.
Reviewed by: ray, luigi, eri, melifaro
Tested by: glebius (LE), ray (BE)
The base netmap pointer and offsets involved are provided by the kernel
side of the netmap interface and will have appropriate alignment.
Sponsored by: ADARA Networks
MFC After: 2 weeks
the CAM "enc" peripheral (part of ses(4)). Previously the two modules
used the same name, so only one was included in a linked kernel causing
enc0 to not be created if you added IPSEC to GENERIC. The new module
name follows the pattern of other network interfaces (e.g. "if_loop").
MFC after: 1 week
tree used it incorrectly, which lead to inaccurate overrated
if_obytes accounting. The drbr(9) used to update ifnet stats on
drbr_enqueue(), which is not accurate since enqueuing doesn't
imply successful processing by driver. Dequeuing neither mean
that. Most drivers also called drbr_stats_update() which did
accounting again, leading to doubled if_obytes statistics. And
in case of severe transmitting, when a packet could be several
times enqueued and dequeued it could have been accounted several
times.
o Thus, make drbr(9) API thinner. Now drbr(9) merely chooses between
ALTQ queueing or buf_ring(9) queueing.
- It doesn't touch the buf_ring stats any more.
- It doesn't touch ifnet stats anymore.
- drbr_stats_update() no longer exists.
o buf_ring(9) handles its stats itself:
- It handles br_drops itself.
- br_prod_bytes stats are dropped. Rationale: no one ever
reads them but update of a common counter on every packet
negatively affects performance due to excessive cache
invalidation.
- buf_ring_enqueue_bytes() reduced to buf_ring_enqueue(), since
we no longer account bytes.
o Drivers handle their stats theirselves: if_obytes, if_omcasts.
o mlx4(4), igb(4), em(4), vxge(4), oce(4) and ixv(4) no longer
use drbr_stats_update(), and update ifnet stats theirselves.
o bxe(4) was the most correct driver, it didn't call
drbr_stats_update(), thus it was the only driver accurate under
moderate load. Now it also maintains stats itself.
o ixgbe(4) had already taken stats from hardware, so just
- drop software stats updating.
- take multicast packet count from hardware as well.
o mxge(4) just no longer needs NO_SLOW_STATS define.
o cxgb(4), cxgbe(4) need no change, since they obtain stats
from hardware.
Reviewed by: jfv, gnn
The requirement (implied by the KASSERT in tap_destroy) that the tap is
closed isn't valid; destroy_dev will block in devdrn while other threads
are in d_* functions.
Note: if_tun had the same issue, addressed in SVN revisions r186391,
r186483 and r186497. The use of the condvar there appears to be
redundant with the functionality provided by destroy_dev.
Sponsored by: ADARA Networks
Reviewed by: dwhite
MFC after: 2 weeks
1) Ruleset parser uses a global variable for anchor stack.
2) When processing a wildcard anchor, matching anchors are marked.
To fix the first one:
o Allocate anchor processing stack on stack. To make this allocation
as small as possible, following measures taken:
- Maximum stack size reduced from 64 to 32.
- The struct pf_anchor_stackframe trimmed by one pointer - parent.
We can always obtain the parent via the rule pointer.
- When pf_test_rule() calls pf_get_translation(), the former lends
its stack to the latter, to avoid recursive allocation 32 entries.
The second one appeared more tricky. The code, that marks anchors was
added in OpenBSD rev. 1.516 of pf.c. According to commit log, the idea
is to enable the "quick" keyword on an anchor rule. The feature isn't
documented anywhere. The most obscure part of the 1.516 was that code
examines the "match" mark on a just processed child, which couldn't be
put here by current frame. Since this wasn't documented even in the
commit message and functionality of this is not clear to me, I decided
to drop this examination for now. The rest of 1.516 is redone in a
thread safe manner - the mark isn't put on the anchor itself, but on
current stack frame. To avoid growing stack frame, we utilize LSB
from the rule pointer, relying on kernel malloc(9) returning pointer
aligned addresses.
Discussed with: dhartmei
reside, and move there ipfw(4) and pf(4).
o Move most modified parts of pf out of contrib.
Actual movements:
sys/contrib/pf/net/*.c -> sys/netpfil/pf/
sys/contrib/pf/net/*.h -> sys/net/
contrib/pf/pfctl/*.c -> sbin/pfctl
contrib/pf/pfctl/*.h -> sbin/pfctl
contrib/pf/pfctl/pfctl.8 -> sbin/pfctl
contrib/pf/pfctl/*.4 -> share/man/man4
contrib/pf/pfctl/*.5 -> share/man/man5
sys/netinet/ipfw -> sys/netpfil/ipfw
The arguable movement is pf/net/*.h -> sys/net. There are
future plans to refactor pf includes, so I decided not to
break things twice.
Not modified bits of pf left in contrib: authpf, ftp-proxy,
tftp-proxy, pflogd.
The ipfw(4) movement is planned to be merged to stable/9,
to make head and stable match.
Discussed with: bz, luigi
Merge ether_ipfw_chk() and part of bridge_pfil() into
unified ipfw_check_frame() function called by PFIL.
This change was suggested by rwatson? @ DevSummit.
Remove ipfw headers from ether/bridge code since they are unneeded now.
Note this thange introduce some (temporary) performance penalty since
PFIL read lock has to be acquired for every link-level packet.
MFC after: 3 weeks
- Provide missing function that can do hashing of arbitrary sized buffer.
- Refetch lookup3.c and do only minimal edits to it, so that diff between
our jenkins_hash.c and lookup3.c is minimal.
- Add declarations for jenkins_hash(), jenkins_hash32() to sys/hash.h.
- Document these functions in hash(9)
Obtained from: http://burtleburtle.net/bob/c/lookup3.c
- In the !wildcard case, return ENOSPC instead of confusing EEXIST
in case if ifc->ifc_maxunit reached.
- Fix unit leak, that I've introduced in previous revision.
Submitted by: Daan Vreeken <Daan vitsch.nl>
if_delmulti() when clearing the configuration for a subinterface when
the parent interface is being detached. The current code was still
triggering an assertion in if_delmulti() due to the parent interface being
partially detached. Fix this by not calling if_delmulti() at all if the
parent interface is being detached. Warn if if_delmulti() fails when the
parent is not being detached (but similar to 208212, still proceed with
tearing down the vlan state).
Tested by: ae@
MFC after: 1 month
to pull vm_param.h was removed. Other big dependency of vm_page.h on
vm_param.h are PA_LOCK* definitions, which are only needed for
in-kernel code, because modules use KBI-safe functions to lock the
pages.
Stop including vm_param.h into vm_page.h. Include vm_param.h
explicitely for the kernel code which needs it.
Suggested and reviewed by: alc
MFC after: 2 weeks
llentry_free() and arptimer():
o Use callout_init_rw() for lle timeout, this allows us safely
disestablish them.
- This allows us to simplify the arptimer() and make it
race safe.
o Consistently use ifp->if_afdata_lock to lock access to
linked lists in the lle hashes.
o Introduce new lle flag LLE_LINKED, which marks an entry that
is attached to the hash.
- Use LLE_LINKED to avoid double unlinking via consequent
calls to llentry_free().
- Mark lle with LLE_DELETED via |= operation istead of =,
so that other flags won't be lost.
o Make LLE_ADDREF(), LLE_REMREF() and LLE_FREE_LOCKED() more
consistent and provide more informative KASSERTs.
The patch is a collaborative work of all submitters and myself.
PR: kern/165863
Submitted by: Andrey Zonov <andrey zonov.org>
Submitted by: Ryan Stone <rysto32 gmail.com>
Submitted by: Eric van Gyzen <eric_van_gyzen dell.com>
checking. This allows the FreeBSD 9.1 release process to move forward.
Work around the problem that loopback connections to local addresses
not on loopback interfaces and not on interfaces w/ IPv6 checksum offloading
enabled would not work.
A proper fix to allow us to disable the "checksum offload" on loopback
for testing, measurements, ... as we allow for IPv4 needs to put in
place later.
Reported by: tuexen, Matthew Seaman (m.seaman infracaninophile.co.uk)
Reported by: Mike Andrews (mandrews bit0.com), kib, ...
PR: kern/170070
MFC after: 1 day
X-MFC after: re approval
This behavior is recommended by RFC 4213 clause 3.2.
Sometimes fragmentation is the least evil.
For example, some Linux IPVS kernels forwards
ICMPv6 checksums to real servers incorrectly.
Reviewed by: hrs(previous version)
Approved by: kib(mentor)
MFC after: 1 week
If an error occurs when transmitting one mbuf in a chain of fragments,
free the subsequent fragments instead of leaking them.
Sponsored by: ADARA Networks
switch to its vnet before calling ether_ifdetach(). Otherwise if the
second half resides in a different vnet, if_detach() silently fails
leaving a stale pointer in V_ifnet list, and the system crashes trying
to access this pointer later.
Another solution could be not to allow to destroy epair unless both
ends are in the home vnet.
Discussed with: bz
Tested by: delphij
it skips FLOWTABLE lookup. However, the non-NULL ro has dual meaning
here: it may be supplied to provide route, and it may be supplied to
store and return to caller the route that ip_output()/ip6_output()
finds. In the latter case skipping FLOWTABLE lookup is pessimisation.
The difference between struct route filled by FLOWTABLE and filled
by rtalloc() family is that the former doesn't hold a reference on
its rtentry. Reference is hold by flow entry, and it is about to
be released in future. Thus, route filled by FLOWTABLE shouldn't
be passed to RTFREE() macro.
- Introduce new flag for struct route/route_in6, that marks route
not holding a reference on rtentry.
- Introduce new macro RO_RTFREE() that cleans up a struct route
depending on its kind.
- All callers to ip_output()/ip6_output() that do supply non-NULL
but empty route should use RO_RTFREE() to free results of
lookup.
- ip_output()/ip6_output() now do FLOWTABLE lookup always when
ro->ro_rt == NULL.
Tested by: tuexen (SCTP part)
across in_gif_output() and in6_gif_output() anyway, and once it is held
across those it might as well be held for the entire loop. This simplifies
the code and removes the need for the custom IFF_GIF_WANTED flag (which
belonged in the softc and not as an IFF_* flag anyway).
Tested by: Vincent Hoffman vince unsane co uk
- Stateful TCP offload drivers for Terminator 3 and 4 (T3 and T4) ASICs.
These are available as t3_tom and t4_tom modules that augment cxgb(4)
and cxgbe(4) respectively. The cxgb/cxgbe drivers continue to work as
usual with or without these extra features.
- iWARP driver for Terminator 3 ASIC (kernel verbs). T4 iWARP in the
works and will follow soon.
Build-tested with make universe.
30s overview
============
What interfaces support TCP offload? Look for TOE4 and/or TOE6 in the
capabilities of an interface:
# ifconfig -m | grep TOE
Enable/disable TCP offload on an interface (just like any other ifnet
capability):
# ifconfig cxgbe0 toe
# ifconfig cxgbe0 -toe
Which connections are offloaded? Look for toe4 and/or toe6 in the
output of netstat and sockstat:
# netstat -np tcp | grep toe
# sockstat -46c | grep toe
Reviewed by: bz, gnn
Sponsored by: Chelsio communications.
MFC after: ~3 months (after 9.1, and after ensuring MFC is feasible)
There is no mac_addr in the mbuf for BSD.. cheat like
we are supposed to and use the csum field since our friend
the gif tunnel itself will never use offload.