The original code from KAME did not take care of address
aliases or multiple ip addresses that have the same
prefix.
Reviewed by: rwatson, gnn, sam, kmacy, julian
(ECMP) for both IPv4 and IPv6. Previously, multipath route insertion
is disallowed. For example,
route add -net 192.103.54.0/24 10.9.44.1
route add -net 192.103.54.0/24 10.9.44.2
The second route insertion will trigger an error message of
"add net 192.103.54.0/24: gateway 10.2.5.2: route already in table"
Multiple default routes can also be inserted. Here is the netstat
output:
default 10.2.5.1 UGS 0 3074 bge0 =>
default 10.2.5.2 UGS 0 0 bge0
When multipath routes exist, the "route delete" command requires
a specific gateway to be specified or else an error message would
be displayed. For example,
route delete default
would fail and trigger the following error message:
"route: writing to routing socket: No such process"
"delete net default: not in table"
On the other hand,
route delete default 10.2.5.2
would be successful: "delete net default: gateway 10.2.5.2"
One does not have to specify a gateway if there is only a single
route for a particular destination.
I need to perform more testings on address aliases and multiple
interfaces that have the same IP prefixes. This patch as it
stands today is not yet ready for prime time. Therefore, the ECMP
code fragments are fully guarded by the RADIX_MPATH macro.
Include the "options RADIX_MPATH" in the kernel configuration
to enable this feature.
Reviewed by: robert, sam, gnn, julian, kmacy
buffer kernel descriptors, which is used to allow the buffer
currently in the BPF "store" position to be assigned to userspace
when it fills, even if userspace hasn't acknowledged the buffer
in the "hold" position yet. To implement this, notify the buffer
model when a buffer becomes full, and check that the store buffer
is writable, not just for it being full, before trying to append
new packet data. Shared memory buffers will be assigned to
userspace at most once per fill, be it in the store or in the
hold position.
This removes the restriction that at most one shared memory can
by owned by userspace, reducing the chances that userspace will
need to call select() after acknowledging one buffer in order to
wait for the next buffer when under high load. This more fully
realizes the goal of zero system calls in order to process a
high-speed packet stream from BPF.
Update bpf.4 to reflect that both buffers may be owned by userspace
at once; caution against assuming this.
from clearing the IFF_NEEDSGIANT flag on Giant-locked interfaces.
In particular, wpa_supplicant was doing this on USB interfaces,
causing panics when Giant-locked code was then called without Giant.
Submitted by: Alexey Popov
Reviewed by: rwatson
MFC after: 3 days
zero-copy to the store buffer position on the BPF descriptor,
and the 'b' buffer as the free buffer in order to fill them in
the order documented in bpf(4).
MFC after: 4 months
Suggested by: csjp
Removed dead code that assumed that M_TRYWAIT can return NULL; it's not true
since the advent of MBUMA.
Reviewed by: arch
There are ongoing disputes as to whether we want to switch to directly using
UMA flags M_WAITOK/M_NOWAIT for mbuf(9) allocation.
bpf_canfreebuf() in order to avoid potentially calling a non-inlinable
but trivial function in zero-copy buffer mode for every packet
received when we couldn't free the buffer anyway.
MFC after: 4 months
overhead of packet capture by allowing a user process to directly "loan"
buffer memory to the kernel rather than using read(2) to explicitly copy
data from kernel address space.
The user process will issue new BPF ioctls to set the shared memory
buffer mode and provide pointers to buffers and their size. The kernel
then wires and maps the pages into kernel address space using sf_buf(9),
which on supporting architectures will use the direct map region. The
current "buffered" access mode remains the default, and support for
zero-copy buffers must, for the time being, be explicitly enabled using
a sysctl for the kernel to accept requests to use it.
The kernel and user process synchronize use of the buffers with atomic
operations, avoiding the need for system calls under load; the user
process may use select()/poll()/kqueue() to manage blocking while
waiting for network data if the user process is able to consume data
faster than the kernel generates it. Patchs to libpcap are available
to allow libpcap applications to transparently take advantage of this
support. Detailed information on the new API may be found in bpf(4),
including specific atomic operations and memory barriers required to
synchronize buffer use safely.
These changes modify the base BPF implementation to (roughly) abstrac
the current buffer model, allowing the new shared memory model to be
added, and add new monitoring statistics for netstat to print. The
implementation, with the exception of some monitoring hanges that break
the netstat monitoring ABI for BPF, will be MFC'd.
Zerocopy bpf buffers are still considered experimental are disabled
by default. To experiment with this new facility, adjust the
net.bpf.zerocopy_enable sysctl variable to 1.
Changes to libpcap will be made available as a patch for the time being,
and further refinements to the implementation are expected.
Sponsored by: Seccuris Inc.
In collaboration with: rwatson
Tested by: pwood, gallatin
MFC after: 4 months [1]
[1] Certain portions will probably not be MFCed, specifically things
that can break the monitoring ABI.
This one line change makes the following code found in many ethernet device drivers
(at least em, igb, ixgbe, and cxgb) gratuitous
case SIOCSIFADDR:
if (ifa->ifa_addr->sa_family == AF_INET) {
/*
* XXX
* Since resetting hardware takes a very long time
* and results in link renegotiation we only
* initialize the hardware only when it is absolutely
* required.
*/
ifp->if_flags |= IFF_UP;
if (!(ifp->if_drv_flags & IFF_DRV_RUNNING)) {
EM_CORE_LOCK(adapter);
em_init_locked(adapter);
EM_CORE_UNLOCK(adapter);
}
arp_ifinit(ifp, ifa);
} else
error = ether_ioctl(ifp, command, data);
break;
this means that it no longer grabs the lagg rwlock. Use two port table arrays
which list the active ports for Tx and switch between them with an atomic op.
Now the lagg rwlock is only exclusively locked for management (ioctls) and
queuing of lacp control frames isnt needed.
after each SYSINIT() macro invocation. This makes a number of
lightweight C parsers much happier with the FreeBSD kernel
source, including cflow's prcc and lxr.
MFC after: 1 month
Discussed with: imp, rink
for all network interfaces, not just ethernet-like ones.
Upgrade it to a louder WARNING and be explicit that the flag is obsolete.
Support for IFF_NEEDSGIANT will be removed in a few months (see arch@ for
details) and will not appear in 8.0.
Upgrade if_watchdog to a WARNING.
- Set M_BCAST|M_MCAST for incoming frames
- Send the frame to a local interface if the bridge returns the mbuf
Submitted by: Eugene Grosbein
Tested by: Boris Kochergin
specified in Table 7-10 in their destination address field shall not be relayed
by the Bridge. Add a check in bridge_forward() to adhere to this.
PR: kern/119744
functions. It is easily triggered by running routed, and, I expect, by
running any other daemon that uses routing sockets.
Reviewed by: net@
MFC after: 1 week
(dummynet), ipsec_filter() would return the empty error code and the ipsec code
would continue to forward/deference the null mbuf.
Found by: m0n0wall
Reviewed by: bz
MFC after: 3 days
link has been marked discarding by Spanning Tree. This would cause the bridge
to see duplicate packets to itself even if STP has correctly calculated the
topology and blocked redundant links.
Reported by: trasz
Tested by: trasz
MFC after: 3 days
administratively down (!IFF_UP)
- Use the same parameters to lagg_link_active() to get the backup port as in
the output path, this didnt actually matter in practice as sc_primary is
always the first on the port list.
MFC after: 3 days
bpf will see inner and outer headers or just inner or outer
headers for incoming and outgoing IPsec packets.
This is useful in bpf to not have over long lines for debugging
or selcting packets based on the inner headers.
It also properly defines the behavior of what the firewalls see.
Last but not least it gives you if_enc(4) for IPv6 as well.
[ As some auxiliary state was not available in the later
input path we save it in the tdbi. That way tcpdump can give a
consistent view of either of (authentic,confidential) for both
before and after states. ]
Discussed with: thompsa (2007-04-25, basic idea of unifying paths)
Reviewed by: thompsa, gnn
This has the benefit that rmlocks have proper support for reader recursion
(in contrast to rwlock(9) which could potential lead to writer stravation).
It also means a significant performance gain, eventhough only visible in
microbenchmarks at the moment.
Discussed on: -arch, -net
as up if at least one of its ports also has a link up. This fixes using
carp+lagg together and any other system that relies on linkstate events.
PR: kern/113956
MFC after: 3 days
2) Alter packet flow inside dummynet: allow certain packets to bypass
dummynet scheduler. Benefits are:
- lower latency: if packet flow does not exceed pipe bandwidth, packets
will not be (up to tick) delayed (due to dummynet's scheduler granularity).
- lower overhead: if packet avoids dummynet scheduler it shouldn't reenter ip
stack later. Such packets can be fastforwarded.
- recursion (which can lead to kernel stack exhaution) eliminated. This fix
long existed panic, which can be triggered this way:
kldload dummynet
sysctl net.inet.ip.fw.one_pass=0
ipfw pipe 1 config bw 0
for i in `jot 30`; do ipfw add 1 pipe 1 icmp from any to any; done
ping -c 1 localhost
3) Three new sysctl nodes are added:
net.inet.ip.dummynet.io_pkt - packets passed to dummynet
net.inet.ip.dummynet.io_pkt_fast - packets avoided dummynet scheduler
net.inet.ip.dummynet.io_pkt_drop - packets dropped by dummynet
P.S. Above comments are true only for layer 3 packets. Layer 2 packet flow
is not changed yet.
MFC after: 3 month
interface. Once the limit is reached packets with unknown source addresses are
dropped until an existing host cache entry expires or is removed. Useful to
use with the STICKY cache option.
Sponsored by: miniSuperHappyDevHouse NZ
a private softc list is needed neither for tracking clones in general
nor for destroying all clones before the module unload -- if_clone
takes care of all that. (Note that some other interface drivers do
need a softc list to be able to scan it for their private purposes.)
from Mac OS X Leopard--rationalize naming for entry points to
the following general forms:
mac_<object>_<method/action>
mac_<object>_check_<method/action>
The previous naming scheme was inconsistent and mostly
reversed from the new scheme. Also, make object types more
consistent and remove spaces from object types that contain
multiple parts ("posix_sem" -> "posixsem") to make mechanical
parsing easier. Introduce a new "netinet" object type for
certain IPv4/IPv6-related methods. Also simplify, slightly,
some entry point names.
All MAC policy modules will need to be recompiled, and modules
not updates as part of this commit will need to be modified to
conform to the new KPI.
Sponsored by: SPARTA (original patches against Mac OS X)
Obtained from: TrustedBSD Project, Apple Computer
Specifically, if two threads were doing concurrent lookups and the existing
gateway was marked down, the the first thread would drop a reference on the
gateway route and then unlock the "root" route while it tried to allocate
a new route. The second thread could then also drop a reference on the
same gateway route resulting in a reference underflow. Fix this by
clearing the gateway route pointer after dropping the reference count but
before dropping the lock. Secondly, in this same case, the second thread
would overwrite the gateway route pointer w/o free'ing a reference to the
route installed by the first thread. In practice this would probably just
fix a lost reference that would result in a route never being freed.
This fixes panics observed in rt_check() and rtexpunge().
MFC after: 1 week
PR: kern/112490
Insight from: mehuljv at yahoo.com
Reviewed by: ru (found the "not-setting it to NULL" part)
Tested by: several
queue so the output network card must support the same tagging mechanism as
how the frame was input (prepended Ethernet header tag or stripped HW mflag).
Now the vlan Ethernet header is _always_ stripped in ether_input and the mbuf
flagged, only only network cards with VLAN_HWTAGGING enabled would properly
re-tag any outgoing vlan frames.
If the outgoing interface does not support hardware tagging then readd the vlan
header to the front of the frame. Move the common vlan encapsulation in to
ether_vlanencap().
Reported by: Erik Osterholm, Jon Otterholm
MFC after: 1 week
This fixes the process portion of the bpf(4) stats if the peer forks
into the background after it's opened the descriptor. This bug
results in the following behavior for netstat -B:
# netstat -B
Pid Netif Flags Recv Drop Match Sblen Hblen Command
netstat: kern.proc.pid failed: No such process
78023 em0 p--s-- 2237404 43119 2237404 13986 0 ??????
MFC after: 1 week
1. The locking was changed to shared but roundrobin mode still updated a
pointer in the softc with the next tx interface to use. This will panic
under high load. Change this to an atomically incremented sequence number in
order to choose the tx port in round robin.
2. IFQ_HANDOFF will free the mbuf if the queue is full, this will then be freed
again by lagg_start() and panic. Reorganised the error handling and freeing
to fix this.
MFC after: 3 days
route and once they are done with it, call rtfree(). rtfree() should
only be used when we are certain we hold the last reference to the
route. This bug results in console messages like the following:
rtfree: 0xc40f7000 has 1 refs
This patch switches the rtfree() to use RTFREE_LOCKED() instead,
which should handle the reference counting on the route better.
Approved by: re@ (gnn)
Reviewed by: bms
Reported by: many via net@ and current@
Tested by: many
matches the BPF registers (which are the only thing that is assigned
to/from BPF memory). This is a pedantic change that shouldn't change
any behaviour.
PR: 115931
Submitted by: Matthew Luckie <mjl@luckie.org.nz>
Approved by: re (bmah)
MFC after: 3 weeks
flags, the absense of these flags causes problems in other areas such as
bridging which expect them to be correct.
At the moment only Ethernet DLTs are checked.
Reviewed by: bms, csjp, sam
Approved by: re (bmah)
active in failover mode rather than all interfaces with a link. This makes it
clear if the master interface is in use or one of the backup links.
Found by: Writing the Handbook section
Approved by: re (kensmith)
previously conditionally acquired Giant based on debug.mpsafenet. As that
has now been removed, they are no longer required. Removing them
significantly simplifies error-handling in the socket layer, eliminated
quite a bit of unwinding of locking in error cases.
While here clean up the now unneeded opt_net.h, which previously was used
for the NET_WITH_GIANT kernel option. Clean up some related gotos for
consistency.
Reviewed by: bz, csjp
Tested by: kris
Approved by: re (kensmith)
- If the path cost is calculated when the link is down, set a pending flag so
it is calculated again when it comes back up.
- To not use 00:00:00:00:00:00 as the bridge id, all interfaces are scanned and
the lowest number wins. All zeros is too low.
Approved by: re (rwatson)
communicate with another private port.
All unicast/broadcast/multicast layer2 traffic is blocked so it works much the
same way as using firewall rules but scales better and is generally easier as
firewall packages usually do not allow ARP blocking.
An example usage would be having a number of customers on separate vlans
bridged with a server network. All the vlans are marked private, they can all
communicate with the server network unhindered, but can not exchange any
traffic whatsoever with each other.
Approved by: re (rwatson)
ports to the lagg interface.
- Use the MTU from the first interface as the lagg MTU, all extra interfaces
must be the same.
This fixes using a lagg interface for a vlan or enabling jumbo frames, etc.
Approved by: re (kensmith)
MFC After: 3 days
framework for non-MPSAFE network protocols:
- Remove debug_mpsafenet variable, sysctl, and tunable.
- Remove NET_NEEDS_GIANT() and associate SYSINITSs used by it to force
debug.mpsafenet=0 if non-MPSAFE protocols are compiled into the kernel.
- Remove logic to automatically flag interrupt handlers as non-MPSAFE if
debug.mpsafenet is set for an INTR_TYPE_NET handler.
- Remove logic to automatically flag netisr handlers as non-MPSAFE if
debug.mpsafenet is set.
- Remove references in a few subsystems, including NFS and Cronyx drivers,
which keyed off debug_mpsafenet to determine various aspects of their own
locking behavior.
- Convert NET_LOCK_GIANT(), NET_UNLOCK_GIANT(), and NET_ASSERT_GIANT into
no-op's, as their entire behavior was determined by the value in
debug_mpsafenet.
- Alias NET_CALLOUT_MPSAFE to CALLOUT_MPSAFE.
Many remaining references to NET_.*_GIANT() and NET_CALLOUT_MPSAFE are still
present in subsystems, and will be removed in followup commits.
Reviewed by: bz, jhb
Approved by: re (kensmith)
will intialize the the header length and re-initialize the mbuf pointer
to reference the mbuf that is allocated after moving user supplied packet
data in.
ioctl routines if we are running with !mpsafenet
- Change un-conditional Giant acquisition around ifpromisc
to occur only if we are running with !mpsafenet
With these locking bits in place, we can now remove the Giant
requirement from BPF, so drop the D_NEEDGIANT device flag.
This change removes Giant acquisitions around BPF device
handlers (read, write, ioctl etc).
MFC after: 1 month
Discussed with: rwatson
bridged, previously legitimate traffic was not passed as the bridge could not
tell that it was on a different Ethernet segment.
All non-tagged traffic is treated as vlan1 as per IEEE 802.1Q-2003
tunnels, and was not MPSAFE. The code can be easily restored in the
event that someone with an IPX over IP tunnel configuration can work
with me to test patches.
This removes one of five remaining consumers of NET_NEEDS_GIANT.
Approved by: re (kensmith)
o major overhaul of the way channels are handled: channels are now
fully enumerated and uniquely identify the operating characteristics;
these changes are visible to user applications which require changes
o make scanning support independent of the state machine to enable
background scanning and roaming
o move scanning support into loadable modules based on the operating
mode to enable different policies and reduce the memory footprint
on systems w/ constrained resources
o add background scanning in station mode (no support for adhoc/ibss
mode yet)
o significantly speedup sta mode scanning with a variety of techniques
o add roaming support when background scanning is supported; for now
we use a simple algorithm to trigger a roam: we threshold the rssi
and tx rate, if either drops too low we try to roam to a new ap
o add tx fragmentation support
o add first cut at 802.11n support: this code works with forthcoming
drivers but is incomplete; it's included now to establish a baseline
for other drivers to be developed and for user applications
o adjust max_linkhdr et. al. to reflect 802.11 requirements; this eliminates
prepending mbufs for traffic generated locally
o add support for Atheros protocol extensions; mainly the fast frames
encapsulation (note this can be used with any card that can tx+rx
large frames correctly)
o add sta support for ap's that beacon both WPA1+2 support
o change all data types from bsd-style to posix-style
o propagate noise floor data from drivers to net80211 and on to user apps
o correct various issues in the sta mode state machine related to handling
authentication and association failures
o enable the addition of sta mode power save support for drivers that need
net80211 support (not in this commit)
o remove old WI compatibility ioctls (wicontrol is officially dead)
o change the data structures returned for get sta info and get scan
results so future additions will not break user apps
o fixed tx rate is now maintained internally as an ieee rate and not an
index into the rate set; this needs to be extended to deal with
multi-mode operation
o add extended channel specifications to radiotap to enable 11n sniffing
Drivers:
o ath: add support for bg scanning, tx fragmentation, fast frames,
dynamic turbo (lightly tested), 11n (sniffing only and needs
new hal)
o awi: compile tested only
o ndis: lightly tested
o ipw: lightly tested
o iwi: add support for bg scanning (well tested but may have some
rough edges)
o ral, ural, rum: add suppoort for bg scanning, calibrate rssi data
o wi: lightly tested
This work is based on contributions by Atheros, kmacy, sephe, thompsa,
mlaier, kevlo, and others. Much of the scanning work was supported by
Atheros. The 11n work was supported by Marvell.
the value of ph_nhooks to zero, not the address. This removes
extranious calls to pfil_run_hooks (and an rw lock) from the
network stack's critical path when no pfil hooks are active.
Reviewed by: csjp
Sponsored by: Myricom Inc.