revision 1.122
date: 2009/05/13 01:01:34; author: dlg; state: Exp; lines: +6 -4
only keep track of the number of updates on tcp connections. state sync on
all the other protocols is simply pushing the timeouts along, which has a
resolution of 1 second, so it isn't going to be hurt by pfsync taking up
to a second to send it over.
still keep track of updates on tcp though; their windows need constant
attention.
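a minimal sketch of the idea, with pf/pfsync field names assumed rather
than quoted from the diff:

    /* only tcp states count queued updates; everything else just
     * rides along on its timeout, which has 1 second resolution */
    if (st->key[PF_SK_WIRE]->proto == IPPROTO_TCP &&
        ++st->sync_updates >= sc->sc_maxupdates)
            pfsync_sendout();   /* tcp windows need prompt sync */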
revision 1.120
date: 2009/04/04 13:09:29; author: dlg; state: Exp; lines: +5 -5
use time_uptime instead of time_second internally. time_uptime isn't
affected by adjusting the clock.
revision 1.175
date: 2011/11/25 12:52:10; author: dlg; state: Exp; lines: +3 -3
use time_uptime to set state creation values as time_second can be
skewed at runtime by things like date(1) and ntpd. time_uptime is
monotonic and therefore more useful to compare against.
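both changes amount to stamping states with the monotonic clock; a sketch
(field and timeout names illustrative):

    /* time_second can jump when date(1) or ntpd(8) steps the clock;
     * time_uptime is monotonic, so ages computed from it stay sane */
    st->creation = time_uptime;         /* was: time_second */

    /* later, when checking expiry: */
    if (time_uptime - st->creation > timeout)
        pf_unlink_state(st);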
revision 1.118
date: 2009/03/23 06:19:59; author: dlg; state: Exp; lines: +8 -6
wait an appropriate amount of time before giving up on a bulk update,
rather than giving up after a hardcoded 5 seconds (which is generally much
too short an interval for a bulk update).
pointed out by david@, eyeballed by mcbride@
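a sketch in the spirit of the commit, deriving the deadline from how many
states fit into one pfsync packet (constants and names assumed, not quoted
from the diff):

    timeout_add(&sc->sc_bulkfail_tmo, 4 * hz +
        pf_pool_limits[PF_LIMIT_STATES].limit /
        ((sc->sc_if.if_mtu - PFSYNC_MINPKT) /
        sizeof(struct pfsync_state)));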
revision 1.171
date: 2011/10/31 22:02:52; author: mikeb; state: Exp; lines: +2 -1
Don't forget to cancel bulk update failure timeout when destroying an
interface. Problem report and fix from Erik Lax, thanks!
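the fix presumably amounts to one line in the destroy path (softc field
name assumed):

    /* pfsync_clone_destroy(): the pending failure timeout must not
     * outlive the interface */
    timeout_del(&sc->sc_bulkfail_tmo);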
Start a brief note of revisions merged from OpenBSD.
7.x, 8.x and 9.x with pf(4) imports: pfsync(4) should suppress CARP
preemption while it is running its bulk update.
However, reimplement the feature in a more elegant manner, partially
inspired by newer OpenBSD:
- Rename the term "suppression" to "demotion", to match OpenBSD.
- Keep a global demotion factor that can be raised by several
conditions; for now these are:
- interface goes down
- carp(4) has problems with ip_output() or ip6_output()
- pfsync performs bulk update
- Unlike in OpenBSD, the demotion factor isn't a counter, but an
actual value added to advskew. The adjustment values for
particular error conditions are also configurable, and their
defaults are the maximum advskew value, so a single failure bumps
demotion to the maximum. This is for POLA compatibility and should
satisfy most users.
- The demotion factor is a writable sysctl, so the user can shoot
themselves in the foot if they so desire. (A sketch of the scheme
follows below.)
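a minimal sketch of the demotion scheme described above; all names here
are illustrative, not the committed interface:

    static int carp_demotion;   /* global factor, writable sysctl */

    /* raised when an interface goes down, carp's ip_output()/
     * ip6_output() fails, or pfsync runs a bulk update */
    static void
    carp_demote_adj(int adj)
    {
            carp_demotion += adj;
    }

    /* the factor feeds straight into advskew; with the default
     * adjustment equal to the maximum advskew, a single failure
     * demotes the node completely */
    static int
    carp_effective_advskew(struct carp_softc *sc)
    {
            int skew = sc->sc_advskew + carp_demotion;

            return (skew > CARP_MAXSKEW ? CARP_MAXSKEW : skew);
    }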
- Instead of scheduling the next run of pfsync_bulk_update(),
pfsync_bulk_fail() was scheduled.
This led to an instant 100% state leak after the first bulk update
request.
- After the above fix, it appeared that pfsync_bulk_update() lacked
locking. To fix this, the sc_bulk_tmo callout was converted to an
mtx one (see the sketch after this list). Eventually, all pf/pfsync
callouts should be converted to the mtx version, since it isn't
possible to stop or drain a non-mtx callout without risk of a race.
- Add a comment that callout_stop() in pfsync_clone_destroy() lacks
locking. Since pfsync0 can't be destroyed (yet), let it be here.
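a sketch of the conversion, with the lock name assumed:

    /* an mtx callout is bound to a lock, so callout_stop() can
     * synchronize against the running handler instead of racing it */
    callout_init_mtx(&sc->sc_bulk_tmo, &pf_mtx, 0);

    /* the handler then runs with pf_mtx held, and rescheduling
     * itself from within pfsync_bulk_update() stays race-free */
    callout_reset(&sc->sc_bulk_tmo, 1 * hz, pfsync_bulk_update, sc);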
The root of the problem is the re-locking at the end of pfsync_sendout().
Several functions call pfsync_sendout() holding pointers
to pf data on their stacks, and these functions expect this data to be
consistent.
To fix this, the following approach was taken:
- pfsync_sendout() doesn't call ip_output() directly, but
enqueues the mbuf on sc->sc_ifp's interface queue, which is
currently unused. Then the pfsync netisr is scheduled. PF_LOCK
isn't dropped in pfsync_sendout().
- The netisr runs through the queue and ip_output()s the packets
on it.
Apart from fixing the race, this also decouples the stack, fixing
potential issues that may happen when sending pfsync(4)
packets on the input path.
Reviewed by: eri (a quick review)
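a compressed sketch of the deferral; the softc layout, swi cookie name,
and helper names are assumed:

    /* called from pfsync_sendout() with PF_LOCK still held */
    static void
    pfsync_defer_output(struct pfsync_softc *sc, struct mbuf *m)
    {
            IF_ENQUEUE(&sc->sc_ifp->if_snd, m); /* queue otherwise unused */
            swi_sched(pfsync_swi_cookie, 0);    /* wake the pfsync netisr */
    }

    /* the handler drains the queue later, with no pf data on its stack */
    static void
    pfsyncintr(void *arg)
    {
            struct pfsync_softc *sc = arg;
            struct mbuf *m;

            for (;;) {
                    IF_DEQUEUE(&sc->sc_ifp->if_snd, m);
                    if (m == NULL)
                            break;
                    ip_output(m, NULL, NULL, IP_RAWOUTPUT, NULL, NULL);
            }
    }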
A number of packets can be queued on sc while we are in ip_output(), and then
we wipe the accumulated sc_len. On the next pfsync_sendout() that would lead to
writing beyond our mbuf cluster.
to document where we expect to be called with a lock held, to
more easily catch unnoticed code paths.
This does not necessarily improve locking in pfsync; it just tries
to avoid the panics reported.
PR: kern/159390, kern/158873
Submitted by: pluknet (at least something that partly resembles
my patch ignoring other cleanup, which I only saw
too late on the 2nd PR)
MFC After: 3 days
with virtualization it is not helpful but complicates things.
Current state of the art is to not virtualize these kinds of locks -
inp_group/hash/info/.. are all not virtualized either.
MFC after: 3 days
pfsync also depends on pf being initialized already, so pf goes at
FIRST and the interfaces go at ANY.
Then the (VNET_)SYSINIT startups for pf stay at SI_SUB_PROTO_BEGIN
and for pfsync we move to the later SI_SUB_PROTO_IF.
This is not ideal either, but at least an order that should work for
the moment and can be refined with the VIMAGE merge, once that
actually works with more than one network stack.
MFC after: 3 days
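a sketch of the resulting ordering (function names assumed):

    /* pf must be fully initialized before any interface attaches */
    VNET_SYSINIT(pf, SI_SUB_PROTO_BEGIN, SI_ORDER_FIRST, pf_load, NULL);
    VNET_SYSINIT(pfsync, SI_SUB_PROTO_IF, SI_ORDER_ANY, pfsync_init, NULL);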
and never remove state.
This fixes the problem some people are seeing where state is removed when pf
is loaded as a module, but not when it is compiled into the kernel.
Reported by: many on freebsd-pf
Tested by: flo
MFC after: 3 days
hash install, etc. For now, these arguments are unused, but as we add
RSS support, we will want to use hashes extracted from mbufs, rather than
manually calculated hashes of header fields, due to the expense of the
software version of Toeplitz (and similar hashes).
Add notes that it would be nice to be able to pass mbufs into lookup
routines in pf(4), optimising firewall lookup in the same way, but the
code structure there doesn't facilitate that currently.
(In principle there is no reason this couldn't be MFCed -- the change
extends rather than modifies the KBI. However, it won't be useful without
other previous possibly less MFCable changes.)
Reviewed by: bz
Sponsored by: Juniper Networks, Inc.
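the shape this suggests, assuming the mbuf-taking variant is named
in_pcblookup_mbuf() and is used on the TCP input path:

    /* an mbuf-aware lookup can reuse the NIC's RSS hash carried in
     * the mbuf instead of re-running Toeplitz in software */
    inp = in_pcblookup_mbuf(&V_tcbinfo, ip->ip_src, th->th_sport,
        ip->ip_dst, th->th_dport,
        INPLOOKUP_WILDCARD | INPLOOKUP_WLOCKPCB, ifp, m);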
to not only compile but load as well for testing with IPv6-only kernels.
For the moment we ignore the csum change in pf_ioctl.c given the
pending update to pf45.
Reported by: dru
Sponsored by: The FreeBSD Foundation
Sponsored by: iXsystems
MFC after: 20 days
- The existing ipi_lock continues to protect the global inpcb list and
inpcb counter. This lock is now relegated to a small number of
allocation and free operations, and occasional operations that walk
all connections (including, awkwardly, certain UDP multicast receive
operations -- something to revisit).
- A new ipi_hash_lock protects the two inpcbinfo hash tables for
looking up connections and bound sockets, manipulated using new
INP_HASH_*() macros. This lock, combined with inpcb locks, protects
the 4-tuple address space.
Unlike the current ipi_lock, ipi_hash_lock follows the individual inpcb
connection locks, so may be acquired while manipulating a connection on
which a lock is already held, avoiding the need to acquire the inpcbinfo
lock preemptively when a binding change might later be required. As a
result, however, lookup operations necessarily go through a reference
acquire while holding the lookup lock, later acquiring an inpcb lock --
if required.
A new function in_pcblookup() looks up connections, and accepts flags
indicating how to return the inpcb. Due to lock order changes, callers
no longer need acquire locks before performing a lookup: the lookup
routine will acquire the ipi_hash_lock as needed. In the future, it will
also be able to use alternative lookup and locking strategies
transparently to callers, such as pcbgroup lookup. The new lookup flags,
supplementing the existing INPLOOKUP_WILDCARD flag, are:
INPLOOKUP_RLOCKPCB - Acquire a read lock on the returned inpcb
INPLOOKUP_WLOCKPCB - Acquire a write lock on the returned inpcb
Callers must pass exactly one of these flags (for the time being).
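For example, a caller that only needs to read the connection might now
look like this (a sketch, not a quote from the tree):

    struct inpcb *inp;

    /* in_pcblookup() takes ipi_hash_lock itself; the flag asks it
     * to return the inpcb read-locked */
    inp = in_pcblookup(&V_tcbinfo, faddr, fport, laddr, lport,
        INPLOOKUP_WILDCARD | INPLOOKUP_RLOCKPCB, ifp);
    if (inp != NULL) {
            /* ... inspect the connection ... */
            INP_RUNLOCK(inp);
    }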
Some notes:
- All protocols are updated to work within the new regime; especially,
TCP, UDPv4, and UDPv6. pcbinfo ipi_lock acquisitions are largely
eliminated, and global hash lock hold times are dramatically reduced
compared to previous locking.
- The TCP syncache still relies on the pcbinfo lock, something that we
may want to revisit.
- Support for reverting to the FreeBSD 7.x locking strategy in TCP input
is no longer available -- hash lookup locks are now held only very
briefly during inpcb lookup, rather than for potentially extended
periods. However, the pcbinfo ipi_lock will still be acquired if a
connection state might change such that a connection is added or
removed.
- Raw IP sockets continue to use the pcbinfo ipi_lock for protection,
due to maintaining their own hash tables.
- The interface in6_pcblookup_hash_locked() is maintained, which allows
callers to acquire hash locks and perform one or more lookups atomically
with 4-tuple allocation: this is required only for TCPv6, as there is no
in6_pcbconnect_setup(), which there should be.
- UDPv6 locking remains significantly more conservative than UDPv4
locking, which relates to source address selection. This needs
attention, as it likely significantly reduces parallelism in this code
for multithreaded socket use (such as in BIND).
- In the UDPv4 and UDPv6 multicast cases, we need to revisit locking
somewhat, as they relied on ipi_lock to stabilise 4-tuple matches, which
is no longer sufficient. A second check once the inpcb lock is held
should do the trick, keeping the general case from requiring the inpcb
lock for every inpcb visited.
- This work reminds us that we need to revisit locking of the v4/v6 flags,
which may be accessed lock-free both before and after this change.
- Right now, a single lock name is used for the pcbhash lock -- this is
undesirable, and probably another argument is required to take care of
this (or a char array name field in the pcbinfo?).
This is not an MFC candidate for 8.x due to its impact on lookup and
locking semantics. It's possible some of these issues could be worked
around with compatibility wrappers, if necessary.
Reviewed by: bz
Sponsored by: Juniper Networks, Inc.
for interfaces with TSO enabled, otherwise one would see an extra
ICMP unreach, frag needed per matching packet on lo0.
This syncs pf code to ip_output.c r162084.
PR: kern/144311
Submitted by: yongari via mlaier
Reviewed by: eri
Tested by: kib
MFC after: 8 days
Server Return mode, where not all packets would be visible to the load
balancer or gateway.
This commit should be reverted when we merge future pf versions. The
benefit it would provide is that this version does not break any existing
public interface and thus won't be a problem if we want to MFC it to
earlier FreeBSD releases.
Discussed with: mlaier
Obtained from: OpenBSD
Sponsored by: iXsystems, Inc.
MFC after: 1 month
restart the scan after acquiring the lock the hard way. Otherwise we might
end up with a dead reference.
Reported by: pfsense
Reviewed by: eri
Initial patch by: eri
Tested by: pfsense
Approved by: re (kib)
all pertinent statistics for the subsystem. These structures are
sometimes "borrowed" by kernel modules that require a place to store
statistics for similar events.
Add KPI accessor functions for statistics structures referenced by kernel
modules so that they no longer encode certain specifics of how the data
structures are named and stored. This change is intended to make it
easier to move to per-CPU network stats following 8.0-RELEASE.
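For instance, a module that previously bumped ipstat fields directly
would now go through an accessor, perhaps along these lines (assuming
the kmod_*()/KMOD_*() naming):

    KMOD_IPSTAT_INC(ips_total);     /* was: ipstat.ips_total++; */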
The following modules are affected by this change:
if_bridge
if_cxgb
if_gif
ip_mroute
ipdivert
pf
In practice, most of these statistics consumers should, in fact, maintain
their own statistics data structures rather than borrowing structures
from the base network stack. However, that change is too aggressive for
this point in the release cycle.
Reviewed by: bz
Approved by: re (kib)
vnet.h, we now use jails (rather than vimages) as the abstraction
for virtualization management, and what remained was specific to
virtual network stacks. Minor cleanups are done in the process,
and comments updated to reflect these changes.
Reviewed by: bz
Approved by: re (vimage blanket)
(DPCPU), as suggested by Peter Wemm, and implement a new per-virtual
network stack memory allocator. Modify vnet to use the allocator
instead of monolithic global container structures (vinet, ...). This
change solves many binary compatibility problems associated with
VIMAGE, and restores ELF symbols for virtualized global variables.
Each virtualized global variable exists as a "reference copy", and also
once per virtual network stack. Virtualized global variables are
tagged at compile-time, placing them in a special linker set, which is
loaded into a contiguous region of kernel memory. Virtualized global
variables in the base kernel are linked as normal, but those in modules
are copied and relocated to a reserved portion of the kernel's vnet
region with the help of the kernel linker.
Virtualized global variables exist in per-vnet memory set up when the
network stack instance is created, and are initialized statically from
the reference copy. Run-time access occurs via an accessor macro, which
converts from the current vnet and requested symbol to a per-vnet
address. When "options VIMAGE" is not compiled into the kernel, normal
global ELF symbols will be used instead and indirection is avoided.
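The resulting idiom for a virtualized global (variable name
hypothetical):

    VNET_DEFINE(int, foo_enable);   /* reference copy + per-vnet copies */
    #define V_foo_enable    VNET(foo_enable)

    static int
    foo_enabled(void)
    {
            /* under "options VIMAGE" this resolves via curvnet;
             * otherwise it is a plain global access */
            return (V_foo_enable);
    }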
This change restores static initialization for network stack global
variables, restores support for non-global symbols and types, eliminates
the need for many subsystem constructors, eliminates large per-subsystem
structures that caused many binary compatibility issues both for
monitoring applications (netstat) and kernel modules, removes the
per-function INIT_VNET_*() macros throughout the stack, eliminates the
need for vnet_symmap ksym(2) munging, and eliminates duplicate
definitions of virtualized globals under VIMAGE_GLOBALS.
Bump __FreeBSD_version and update UPDATING.
Portions submitted by: bz
Reviewed by: bz, zec
Discussed with: gnn, jamie, jeff, jhb, julian, sam
Suggested by: peter
Approved by: re (kensmith)
the ROUTETABLES kernel option, so there is no need to include opt_route.h
anymore in all consumers of vnet.h, nor to depend on it for module
builds.
Remove the hidden include in flowtable.h as well and leave the two
explicit #includes in ip_input.c and ip_output.c.
and used in a large number of files, but also because an increasing number
of incorrect uses of MAC calls were sneaking in due to copy-and-paste of
MAC-aware code without the associated opt_mac.h include.
Discussed with: pjd
macros: ICMPSTAT_ADD(), ICMPSTAT_INC(), ICMP6STAT_ADD(), and
ICMP6STAT_INC(), rather than directly manipulating the fields
of these structures across the kernel. This will make it
easier to change the implementation of these statistics,
such as using per-CPU versions of the data structures.
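Concretely, counter updates move from direct field access to the
macros, e.g.:

    ICMPSTAT_INC(icps_tooshort);    /* was: icmpstat.icps_tooshort++; */
    ICMP6STAT_ADD(icp6s_badlen, 1); /* was: icmp6stat.icp6s_badlen += 1; */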
In one case, icmp6stat members are manipulated indirectly, by
icmp6_errcount(), and this will require further work to fix
for per-CPU stats.
MFC after: 3 days