Now that the pseudo-interface cloner has an internal list of instances,
there is no need to create a softc. The softc only contains a pointer to
the ifp, which means there is no valid reason to keep it. While there,
remove the corresponding malloc-pool.
Approved by: philip (mentor)
unsynchronized. While races were extremely rare, we've now had a
couple of reports of panics in environments involving large numbers of
IPSEC tunnels being added very quickly on an active system.
- Add accessor functions ifnet_byindex(), ifaddr_byindex(),
ifdev_byindex() to replace existing accessor macros. These functions
now acquire the ifnet lock before derefencing the table.
- Add IFNET_WLOCK_ASSERT().
- Add static accessor functions ifnet_setbyindex(), ifdev_setbyindex(),
which set values in the table either asserting of acquiring the ifnet
lock.
- Use accessor functions throughout if.c to modify and read
ifindex_table.
- Rework ifnet attach/detach to lock around ifindex_table modification.
Note that these changes simply close races around use of ifindex_table,
and make no attempt to solve the probem of disappearing ifnets. Further
refinement of this work, including with respect to ifindex_table
resizing, is still required.
In a future change, the ifnet lock should be converted from a mutex to an
rwlock in order to reduce contention.
Reviewed and tested by: brooks
Except for the case where we use the cloner library (clone_create() and
friends), there is no reason to enforce a unique device minor number
policy. There are various drivers in the source tree that allocate unr
pools and such to provide minor numbers, without using them themselves.
Because we still need to support unique device minor numbers for the
cloner library, introduce a new flag called D_NEEDMINOR. All cdevsw's
that are used in combination with the cloner library should be marked
with this flag to make the cloning work.
This means drivers can now freely use si_drv0 to store their own flags
and state, making it effectively the same as si_drv1 and si_drv2. We
still keep the minor() and dev2unit() routines around to make drivers
happy.
The NTFS code also used the minor number in its hash table. We should
not do this anymore. If the si_drv0 field would be changed, it would no
longer end up in the same list.
Approved by: philip (mentor)
- verified that the ifp->if_snd.ifq_mtx was initalized for
all attached interfaces. This was pointless because it was
initalized for all interfaces in if_attach() so I've removed it.
- Checked that ifp->if_snd.ifq_maxlen is initalized and set it to
ifqmaxlen if unset. This makes more sense in if_attach() so
I moved it there.
- The first call of if_slowtimo(). Delete if_check() and call
if_slowtimo() directly from the SYSINIT().
This particular implementation is designed to be fully backwards compatible
and to be MFC-able to 7.x (and 6.x)
Currently the only protocol that can make use of the multiple tables is IPv4
Similar functionality exists in OpenBSD and Linux.
From my notes:
-----
One thing where FreeBSD has been falling behind, and which by chance I
have some time to work on is "policy based routing", which allows
different
packet streams to be routed by more than just the destination address.
Constraints:
------------
I want to make some form of this available in the 6.x tree
(and by extension 7.x) , but FreeBSD in general needs it so I might as
well do it in -current and back port the portions I need.
One of the ways that this can be done is to have the ability to
instantiate multiple kernel routing tables (which I will now
refer to as "Forwarding Information Bases" or "FIBs" for political
correctness reasons). Which FIB a particular packet uses to make
the next hop decision can be decided by a number of mechanisms.
The policies these mechanisms implement are the "Policies" referred
to in "Policy based routing".
One of the constraints I have if I try to back port this work to
6.x is that it must be implemented as a EXTENSION to the existing
ABIs in 6.x so that third party applications do not need to be
recompiled in timespan of the branch.
This first version will not have some of the bells and whistles that
will come with later versions. It will, for example, be limited to 16
tables in the first commit.
Implementation method, Compatible version. (part 1)
-------------------------------
For this reason I have implemented a "sufficient subset" of a
multiple routing table solution in Perforce, and back-ported it
to 6.x. (also in Perforce though not always caught up with what I
have done in -current/P4). The subset allows a number of FIBs
to be defined at compile time (8 is sufficient for my purposes in 6.x)
and implements the changes needed to allow IPV4 to use them. I have not
done the changes for ipv6 simply because I do not need it, and I do not
have enough knowledge of ipv6 (e.g. neighbor discovery) needed to do it.
Other protocol families are left untouched and should there be
users with proprietary protocol families, they should continue to work
and be oblivious to the existence of the extra FIBs.
To understand how this is done, one must know that the current FIB
code starts everything off with a single dimensional array of
pointers to FIB head structures (One per protocol family), each of
which in turn points to the trie of routes available to that family.
The basic change in the ABI compatible version of the change is to
extent that array to be a 2 dimensional array, so that
instead of protocol family X looking at rt_tables[X] for the
table it needs, it looks at rt_tables[Y][X] when for all
protocol families except ipv4 Y is always 0.
Code that is unaware of the change always just sees the first row
of the table, which of course looks just like the one dimensional
array that existed before.
The entry points rtrequest(), rtalloc(), rtalloc1(), rtalloc_ign()
are all maintained, but refer only to the first row of the array,
so that existing callers in proprietary protocols can continue to
do the "right thing".
Some new entry points are added, for the exclusive use of ipv4 code
called in_rtrequest(), in_rtalloc(), in_rtalloc1() and in_rtalloc_ign(),
which have an extra argument which refers the code to the correct row.
In addition, there are some new entry points (currently called
rtalloc_fib() and friends) that check the Address family being
looked up and call either rtalloc() (and friends) if the protocol
is not IPv4 forcing the action to row 0 or to the appropriate row
if it IS IPv4 (and that info is available). These are for calling
from code that is not specific to any particular protocol. The way
these are implemented would change in the non ABI preserving code
to be added later.
One feature of the first version of the code is that for ipv4,
the interface routes show up automatically on all the FIBs, so
that no matter what FIB you select you always have the basic
direct attached hosts available to you. (rtinit() does this
automatically).
You CAN delete an interface route from one FIB should you want
to but by default it's there. ARP information is also available
in each FIB. It's assumed that the same machine would have the
same MAC address, regardless of which FIB you are using to get
to it.
This brings us as to how the correct FIB is selected for an outgoing
IPV4 packet.
Firstly, all packets have a FIB associated with them. if nothing
has been done to change it, it will be FIB 0. The FIB is changed
in the following ways.
Packets fall into one of a number of classes.
1/ locally generated packets, coming from a socket/PCB.
Such packets select a FIB from a number associated with the
socket/PCB. This in turn is inherited from the process,
but can be changed by a socket option. The process in turn
inherits it on fork. I have written a utility call setfib
that acts a bit like nice..
setfib -3 ping target.example.com # will use fib 3 for ping.
It is an obvious extension to make it a property of a jail
but I have not done so. It can be achieved by combining the setfib and
jail commands.
2/ packets received on an interface for forwarding.
By default these packets would use table 0,
(or possibly a number settable in a sysctl(not yet)).
but prior to routing the firewall can inspect them (see below).
(possibly in the future you may be able to associate a FIB
with packets received on an interface.. An ifconfig arg, but not yet.)
3/ packets inspected by a packet classifier, which can arbitrarily
associate a fib with it on a packet by packet basis.
A fib assigned to a packet by a packet classifier
(such as ipfw) would over-ride a fib associated by
a more default source. (such as cases 1 or 2).
4/ a tcp listen socket associated with a fib will generate
accept sockets that are associated with that same fib.
5/ Packets generated in response to some other packet (e.g. reset
or icmp packets). These should use the FIB associated with the
packet being reponded to.
6/ Packets generated during encapsulation.
gif, tun and other tunnel interfaces will encapsulate using the FIB
that was in effect withthe proces that set up the tunnel.
thus setfib 1 ifconfig gif0 [tunnel instructions]
will set the fib for the tunnel to use to be fib 1.
Routing messages would be associated with their
process, and thus select one FIB or another.
messages from the kernel would be associated with the fib they
refer to and would only be received by a routing socket associated
with that fib. (not yet implemented)
In addition Netstat has been edited to be able to cope with the
fact that the array is now 2 dimensional. (It looks in system
memory using libkvm (!)). Old versions of netstat see only the first FIB.
In addition two sysctls are added to give:
a) the number of FIBs compiled in (active)
b) the default FIB of the calling process.
Early testing experience:
-------------------------
Basically our (IronPort's) appliance does this functionality already
using ipfw fwd but that method has some drawbacks.
For example,
It can't fully simulate a routing table because it can't influence the
socket's choice of local address when a connect() is done.
Testing during the generating of these changes has been
remarkably smooth so far. Multiple tables have co-existed
with no notable side effects, and packets have been routes
accordingly.
ipfw has grown 2 new keywords:
setfib N ip from anay to any
count ip from any to any fib N
In pf there seems to be a requirement to be able to give symbolic names to the
fibs but I do not have that capacity. I am not sure if it is required.
SCTP has interestingly enough built in support for this, called VRFs
in Cisco parlance. it will be interesting to see how that handles it
when it suddenly actually does something.
Where to next:
--------------------
After committing the ABI compatible version and MFCing it, I'd
like to proceed in a forward direction in -current. this will
result in some roto-tilling in the routing code.
Firstly: the current code's idea of having a separate tree per
protocol family, all of the same format, and pointed to by the
1 dimensional array is a bit silly. Especially when one considers that
there is code that makes assumptions about every protocol having the
same internal structures there. Some protocols don't WANT that
sort of structure. (for example the whole idea of a netmask is foreign
to appletalk). This needs to be made opaque to the external code.
My suggested first change is to add routing method pointers to the
'domain' structure, along with information pointing the data.
instead of having an array of pointers to uniform structures,
there would be an array pointing to the 'domain' structures
for each protocol address domain (protocol family),
and the methods this reached would be called. The methods would have
an argument that gives FIB number, but the protocol would be free
to ignore it.
When the ABI can be changed it raises the possibilty of the
addition of a fib entry into the "struct route". Currently,
the structure contains the sockaddr of the desination, and the resulting
fib entry. To make this work fully, one could add a fib number
so that given an address and a fib, one can find the third element, the
fib entry.
Interaction with the ARP layer/ LL layer would need to be
revisited as well. Qing Li has been working on this already.
This work was sponsored by Ironport Systems/Cisco
Reviewed by: several including rwatson, bz and mlair (parts each)
Obtained from: Ironport systems/Cisco
to profile outoing packets for a number of mbuf chain
related parameters
e.g. number of mbufs, wasted space.
probably will do with further work later.
Reviewed by: various
we're certain the allocation will entierly succeed. This fixes a leak in a
fairly unlikely case.
Reported by: vijay singh <vijjus at rocketmail dot com>
MFC after: 1 week
The original code from KAME did not take care of address
aliases or multiple ip addresses that have the same
prefix.
Reviewed by: rwatson, gnn, sam, kmacy, julian
(ECMP) for both IPv4 and IPv6. Previously, multipath route insertion
is disallowed. For example,
route add -net 192.103.54.0/24 10.9.44.1
route add -net 192.103.54.0/24 10.9.44.2
The second route insertion will trigger an error message of
"add net 192.103.54.0/24: gateway 10.2.5.2: route already in table"
Multiple default routes can also be inserted. Here is the netstat
output:
default 10.2.5.1 UGS 0 3074 bge0 =>
default 10.2.5.2 UGS 0 0 bge0
When multipath routes exist, the "route delete" command requires
a specific gateway to be specified or else an error message would
be displayed. For example,
route delete default
would fail and trigger the following error message:
"route: writing to routing socket: No such process"
"delete net default: not in table"
On the other hand,
route delete default 10.2.5.2
would be successful: "delete net default: gateway 10.2.5.2"
One does not have to specify a gateway if there is only a single
route for a particular destination.
I need to perform more testings on address aliases and multiple
interfaces that have the same IP prefixes. This patch as it
stands today is not yet ready for prime time. Therefore, the ECMP
code fragments are fully guarded by the RADIX_MPATH macro.
Include the "options RADIX_MPATH" in the kernel configuration
to enable this feature.
Reviewed by: robert, sam, gnn, julian, kmacy
buffer kernel descriptors, which is used to allow the buffer
currently in the BPF "store" position to be assigned to userspace
when it fills, even if userspace hasn't acknowledged the buffer
in the "hold" position yet. To implement this, notify the buffer
model when a buffer becomes full, and check that the store buffer
is writable, not just for it being full, before trying to append
new packet data. Shared memory buffers will be assigned to
userspace at most once per fill, be it in the store or in the
hold position.
This removes the restriction that at most one shared memory can
by owned by userspace, reducing the chances that userspace will
need to call select() after acknowledging one buffer in order to
wait for the next buffer when under high load. This more fully
realizes the goal of zero system calls in order to process a
high-speed packet stream from BPF.
Update bpf.4 to reflect that both buffers may be owned by userspace
at once; caution against assuming this.
from clearing the IFF_NEEDSGIANT flag on Giant-locked interfaces.
In particular, wpa_supplicant was doing this on USB interfaces,
causing panics when Giant-locked code was then called without Giant.
Submitted by: Alexey Popov
Reviewed by: rwatson
MFC after: 3 days
zero-copy to the store buffer position on the BPF descriptor,
and the 'b' buffer as the free buffer in order to fill them in
the order documented in bpf(4).
MFC after: 4 months
Suggested by: csjp
Removed dead code that assumed that M_TRYWAIT can return NULL; it's not true
since the advent of MBUMA.
Reviewed by: arch
There are ongoing disputes as to whether we want to switch to directly using
UMA flags M_WAITOK/M_NOWAIT for mbuf(9) allocation.
bpf_canfreebuf() in order to avoid potentially calling a non-inlinable
but trivial function in zero-copy buffer mode for every packet
received when we couldn't free the buffer anyway.
MFC after: 4 months
overhead of packet capture by allowing a user process to directly "loan"
buffer memory to the kernel rather than using read(2) to explicitly copy
data from kernel address space.
The user process will issue new BPF ioctls to set the shared memory
buffer mode and provide pointers to buffers and their size. The kernel
then wires and maps the pages into kernel address space using sf_buf(9),
which on supporting architectures will use the direct map region. The
current "buffered" access mode remains the default, and support for
zero-copy buffers must, for the time being, be explicitly enabled using
a sysctl for the kernel to accept requests to use it.
The kernel and user process synchronize use of the buffers with atomic
operations, avoiding the need for system calls under load; the user
process may use select()/poll()/kqueue() to manage blocking while
waiting for network data if the user process is able to consume data
faster than the kernel generates it. Patchs to libpcap are available
to allow libpcap applications to transparently take advantage of this
support. Detailed information on the new API may be found in bpf(4),
including specific atomic operations and memory barriers required to
synchronize buffer use safely.
These changes modify the base BPF implementation to (roughly) abstrac
the current buffer model, allowing the new shared memory model to be
added, and add new monitoring statistics for netstat to print. The
implementation, with the exception of some monitoring hanges that break
the netstat monitoring ABI for BPF, will be MFC'd.
Zerocopy bpf buffers are still considered experimental are disabled
by default. To experiment with this new facility, adjust the
net.bpf.zerocopy_enable sysctl variable to 1.
Changes to libpcap will be made available as a patch for the time being,
and further refinements to the implementation are expected.
Sponsored by: Seccuris Inc.
In collaboration with: rwatson
Tested by: pwood, gallatin
MFC after: 4 months [1]
[1] Certain portions will probably not be MFCed, specifically things
that can break the monitoring ABI.
This one line change makes the following code found in many ethernet device drivers
(at least em, igb, ixgbe, and cxgb) gratuitous
case SIOCSIFADDR:
if (ifa->ifa_addr->sa_family == AF_INET) {
/*
* XXX
* Since resetting hardware takes a very long time
* and results in link renegotiation we only
* initialize the hardware only when it is absolutely
* required.
*/
ifp->if_flags |= IFF_UP;
if (!(ifp->if_drv_flags & IFF_DRV_RUNNING)) {
EM_CORE_LOCK(adapter);
em_init_locked(adapter);
EM_CORE_UNLOCK(adapter);
}
arp_ifinit(ifp, ifa);
} else
error = ether_ioctl(ifp, command, data);
break;
this means that it no longer grabs the lagg rwlock. Use two port table arrays
which list the active ports for Tx and switch between them with an atomic op.
Now the lagg rwlock is only exclusively locked for management (ioctls) and
queuing of lacp control frames isnt needed.
after each SYSINIT() macro invocation. This makes a number of
lightweight C parsers much happier with the FreeBSD kernel
source, including cflow's prcc and lxr.
MFC after: 1 month
Discussed with: imp, rink
for all network interfaces, not just ethernet-like ones.
Upgrade it to a louder WARNING and be explicit that the flag is obsolete.
Support for IFF_NEEDSGIANT will be removed in a few months (see arch@ for
details) and will not appear in 8.0.
Upgrade if_watchdog to a WARNING.
- Set M_BCAST|M_MCAST for incoming frames
- Send the frame to a local interface if the bridge returns the mbuf
Submitted by: Eugene Grosbein
Tested by: Boris Kochergin
specified in Table 7-10 in their destination address field shall not be relayed
by the Bridge. Add a check in bridge_forward() to adhere to this.
PR: kern/119744
functions. It is easily triggered by running routed, and, I expect, by
running any other daemon that uses routing sockets.
Reviewed by: net@
MFC after: 1 week
(dummynet), ipsec_filter() would return the empty error code and the ipsec code
would continue to forward/deference the null mbuf.
Found by: m0n0wall
Reviewed by: bz
MFC after: 3 days
link has been marked discarding by Spanning Tree. This would cause the bridge
to see duplicate packets to itself even if STP has correctly calculated the
topology and blocked redundant links.
Reported by: trasz
Tested by: trasz
MFC after: 3 days
administratively down (!IFF_UP)
- Use the same parameters to lagg_link_active() to get the backup port as in
the output path, this didnt actually matter in practice as sc_primary is
always the first on the port list.
MFC after: 3 days
bpf will see inner and outer headers or just inner or outer
headers for incoming and outgoing IPsec packets.
This is useful in bpf to not have over long lines for debugging
or selcting packets based on the inner headers.
It also properly defines the behavior of what the firewalls see.
Last but not least it gives you if_enc(4) for IPv6 as well.
[ As some auxiliary state was not available in the later
input path we save it in the tdbi. That way tcpdump can give a
consistent view of either of (authentic,confidential) for both
before and after states. ]
Discussed with: thompsa (2007-04-25, basic idea of unifying paths)
Reviewed by: thompsa, gnn
This has the benefit that rmlocks have proper support for reader recursion
(in contrast to rwlock(9) which could potential lead to writer stravation).
It also means a significant performance gain, eventhough only visible in
microbenchmarks at the moment.
Discussed on: -arch, -net
as up if at least one of its ports also has a link up. This fixes using
carp+lagg together and any other system that relies on linkstate events.
PR: kern/113956
MFC after: 3 days
2) Alter packet flow inside dummynet: allow certain packets to bypass
dummynet scheduler. Benefits are:
- lower latency: if packet flow does not exceed pipe bandwidth, packets
will not be (up to tick) delayed (due to dummynet's scheduler granularity).
- lower overhead: if packet avoids dummynet scheduler it shouldn't reenter ip
stack later. Such packets can be fastforwarded.
- recursion (which can lead to kernel stack exhaution) eliminated. This fix
long existed panic, which can be triggered this way:
kldload dummynet
sysctl net.inet.ip.fw.one_pass=0
ipfw pipe 1 config bw 0
for i in `jot 30`; do ipfw add 1 pipe 1 icmp from any to any; done
ping -c 1 localhost
3) Three new sysctl nodes are added:
net.inet.ip.dummynet.io_pkt - packets passed to dummynet
net.inet.ip.dummynet.io_pkt_fast - packets avoided dummynet scheduler
net.inet.ip.dummynet.io_pkt_drop - packets dropped by dummynet
P.S. Above comments are true only for layer 3 packets. Layer 2 packet flow
is not changed yet.
MFC after: 3 month
interface. Once the limit is reached packets with unknown source addresses are
dropped until an existing host cache entry expires or is removed. Useful to
use with the STICKY cache option.
Sponsored by: miniSuperHappyDevHouse NZ
a private softc list is needed neither for tracking clones in general
nor for destroying all clones before the module unload -- if_clone
takes care of all that. (Note that some other interface drivers do
need a softc list to be able to scan it for their private purposes.)
from Mac OS X Leopard--rationalize naming for entry points to
the following general forms:
mac_<object>_<method/action>
mac_<object>_check_<method/action>
The previous naming scheme was inconsistent and mostly
reversed from the new scheme. Also, make object types more
consistent and remove spaces from object types that contain
multiple parts ("posix_sem" -> "posixsem") to make mechanical
parsing easier. Introduce a new "netinet" object type for
certain IPv4/IPv6-related methods. Also simplify, slightly,
some entry point names.
All MAC policy modules will need to be recompiled, and modules
not updates as part of this commit will need to be modified to
conform to the new KPI.
Sponsored by: SPARTA (original patches against Mac OS X)
Obtained from: TrustedBSD Project, Apple Computer
Specifically, if two threads were doing concurrent lookups and the existing
gateway was marked down, the the first thread would drop a reference on the
gateway route and then unlock the "root" route while it tried to allocate
a new route. The second thread could then also drop a reference on the
same gateway route resulting in a reference underflow. Fix this by
clearing the gateway route pointer after dropping the reference count but
before dropping the lock. Secondly, in this same case, the second thread
would overwrite the gateway route pointer w/o free'ing a reference to the
route installed by the first thread. In practice this would probably just
fix a lost reference that would result in a route never being freed.
This fixes panics observed in rt_check() and rtexpunge().
MFC after: 1 week
PR: kern/112490
Insight from: mehuljv at yahoo.com
Reviewed by: ru (found the "not-setting it to NULL" part)
Tested by: several
queue so the output network card must support the same tagging mechanism as
how the frame was input (prepended Ethernet header tag or stripped HW mflag).
Now the vlan Ethernet header is _always_ stripped in ether_input and the mbuf
flagged, only only network cards with VLAN_HWTAGGING enabled would properly
re-tag any outgoing vlan frames.
If the outgoing interface does not support hardware tagging then readd the vlan
header to the front of the frame. Move the common vlan encapsulation in to
ether_vlanencap().
Reported by: Erik Osterholm, Jon Otterholm
MFC after: 1 week
This fixes the process portion of the bpf(4) stats if the peer forks
into the background after it's opened the descriptor. This bug
results in the following behavior for netstat -B:
# netstat -B
Pid Netif Flags Recv Drop Match Sblen Hblen Command
netstat: kern.proc.pid failed: No such process
78023 em0 p--s-- 2237404 43119 2237404 13986 0 ??????
MFC after: 1 week
1. The locking was changed to shared but roundrobin mode still updated a
pointer in the softc with the next tx interface to use. This will panic
under high load. Change this to an atomically incremented sequence number in
order to choose the tx port in round robin.
2. IFQ_HANDOFF will free the mbuf if the queue is full, this will then be freed
again by lagg_start() and panic. Reorganised the error handling and freeing
to fix this.
MFC after: 3 days
route and once they are done with it, call rtfree(). rtfree() should
only be used when we are certain we hold the last reference to the
route. This bug results in console messages like the following:
rtfree: 0xc40f7000 has 1 refs
This patch switches the rtfree() to use RTFREE_LOCKED() instead,
which should handle the reference counting on the route better.
Approved by: re@ (gnn)
Reviewed by: bms
Reported by: many via net@ and current@
Tested by: many
matches the BPF registers (which are the only thing that is assigned
to/from BPF memory). This is a pedantic change that shouldn't change
any behaviour.
PR: 115931
Submitted by: Matthew Luckie <mjl@luckie.org.nz>
Approved by: re (bmah)
MFC after: 3 weeks
flags, the absense of these flags causes problems in other areas such as
bridging which expect them to be correct.
At the moment only Ethernet DLTs are checked.
Reviewed by: bms, csjp, sam
Approved by: re (bmah)
active in failover mode rather than all interfaces with a link. This makes it
clear if the master interface is in use or one of the backup links.
Found by: Writing the Handbook section
Approved by: re (kensmith)
previously conditionally acquired Giant based on debug.mpsafenet. As that
has now been removed, they are no longer required. Removing them
significantly simplifies error-handling in the socket layer, eliminated
quite a bit of unwinding of locking in error cases.
While here clean up the now unneeded opt_net.h, which previously was used
for the NET_WITH_GIANT kernel option. Clean up some related gotos for
consistency.
Reviewed by: bz, csjp
Tested by: kris
Approved by: re (kensmith)
- If the path cost is calculated when the link is down, set a pending flag so
it is calculated again when it comes back up.
- To not use 00:00:00:00:00:00 as the bridge id, all interfaces are scanned and
the lowest number wins. All zeros is too low.
Approved by: re (rwatson)
communicate with another private port.
All unicast/broadcast/multicast layer2 traffic is blocked so it works much the
same way as using firewall rules but scales better and is generally easier as
firewall packages usually do not allow ARP blocking.
An example usage would be having a number of customers on separate vlans
bridged with a server network. All the vlans are marked private, they can all
communicate with the server network unhindered, but can not exchange any
traffic whatsoever with each other.
Approved by: re (rwatson)
ports to the lagg interface.
- Use the MTU from the first interface as the lagg MTU, all extra interfaces
must be the same.
This fixes using a lagg interface for a vlan or enabling jumbo frames, etc.
Approved by: re (kensmith)
MFC After: 3 days
framework for non-MPSAFE network protocols:
- Remove debug_mpsafenet variable, sysctl, and tunable.
- Remove NET_NEEDS_GIANT() and associate SYSINITSs used by it to force
debug.mpsafenet=0 if non-MPSAFE protocols are compiled into the kernel.
- Remove logic to automatically flag interrupt handlers as non-MPSAFE if
debug.mpsafenet is set for an INTR_TYPE_NET handler.
- Remove logic to automatically flag netisr handlers as non-MPSAFE if
debug.mpsafenet is set.
- Remove references in a few subsystems, including NFS and Cronyx drivers,
which keyed off debug_mpsafenet to determine various aspects of their own
locking behavior.
- Convert NET_LOCK_GIANT(), NET_UNLOCK_GIANT(), and NET_ASSERT_GIANT into
no-op's, as their entire behavior was determined by the value in
debug_mpsafenet.
- Alias NET_CALLOUT_MPSAFE to CALLOUT_MPSAFE.
Many remaining references to NET_.*_GIANT() and NET_CALLOUT_MPSAFE are still
present in subsystems, and will be removed in followup commits.
Reviewed by: bz, jhb
Approved by: re (kensmith)
will intialize the the header length and re-initialize the mbuf pointer
to reference the mbuf that is allocated after moving user supplied packet
data in.
ioctl routines if we are running with !mpsafenet
- Change un-conditional Giant acquisition around ifpromisc
to occur only if we are running with !mpsafenet
With these locking bits in place, we can now remove the Giant
requirement from BPF, so drop the D_NEEDGIANT device flag.
This change removes Giant acquisitions around BPF device
handlers (read, write, ioctl etc).
MFC after: 1 month
Discussed with: rwatson
bridged, previously legitimate traffic was not passed as the bridge could not
tell that it was on a different Ethernet segment.
All non-tagged traffic is treated as vlan1 as per IEEE 802.1Q-2003
tunnels, and was not MPSAFE. The code can be easily restored in the
event that someone with an IPX over IP tunnel configuration can work
with me to test patches.
This removes one of five remaining consumers of NET_NEEDS_GIANT.
Approved by: re (kensmith)
o major overhaul of the way channels are handled: channels are now
fully enumerated and uniquely identify the operating characteristics;
these changes are visible to user applications which require changes
o make scanning support independent of the state machine to enable
background scanning and roaming
o move scanning support into loadable modules based on the operating
mode to enable different policies and reduce the memory footprint
on systems w/ constrained resources
o add background scanning in station mode (no support for adhoc/ibss
mode yet)
o significantly speedup sta mode scanning with a variety of techniques
o add roaming support when background scanning is supported; for now
we use a simple algorithm to trigger a roam: we threshold the rssi
and tx rate, if either drops too low we try to roam to a new ap
o add tx fragmentation support
o add first cut at 802.11n support: this code works with forthcoming
drivers but is incomplete; it's included now to establish a baseline
for other drivers to be developed and for user applications
o adjust max_linkhdr et. al. to reflect 802.11 requirements; this eliminates
prepending mbufs for traffic generated locally
o add support for Atheros protocol extensions; mainly the fast frames
encapsulation (note this can be used with any card that can tx+rx
large frames correctly)
o add sta support for ap's that beacon both WPA1+2 support
o change all data types from bsd-style to posix-style
o propagate noise floor data from drivers to net80211 and on to user apps
o correct various issues in the sta mode state machine related to handling
authentication and association failures
o enable the addition of sta mode power save support for drivers that need
net80211 support (not in this commit)
o remove old WI compatibility ioctls (wicontrol is officially dead)
o change the data structures returned for get sta info and get scan
results so future additions will not break user apps
o fixed tx rate is now maintained internally as an ieee rate and not an
index into the rate set; this needs to be extended to deal with
multi-mode operation
o add extended channel specifications to radiotap to enable 11n sniffing
Drivers:
o ath: add support for bg scanning, tx fragmentation, fast frames,
dynamic turbo (lightly tested), 11n (sniffing only and needs
new hal)
o awi: compile tested only
o ndis: lightly tested
o ipw: lightly tested
o iwi: add support for bg scanning (well tested but may have some
rough edges)
o ral, ural, rum: add suppoort for bg scanning, calibrate rssi data
o wi: lightly tested
This work is based on contributions by Atheros, kmacy, sephe, thompsa,
mlaier, kevlo, and others. Much of the scanning work was supported by
Atheros. The 11n work was supported by Marvell.
the value of ph_nhooks to zero, not the address. This removes
extranious calls to pfil_run_hooks (and an rw lock) from the
network stack's critical path when no pfil hooks are active.
Reviewed by: csjp
Sponsored by: Myricom Inc.
which support a 2.5Gbps mode over fiber using next page extensions during
autonegotiation. Typically only found in blade systems which also include
a Broadcom 2.5Gbps capable switch.
MFC after: 2 weeks
as to the type of the command argument: int -> u_long.
These types have different widths in the 64-bit world.
Add a note to UPDATING because the change breaks KBI
on 64-bit platforms.
Discussed on: -net, -current
Reviewed by: bms, ru
- In rt_check() remove the senderr() macro and the "bad" label. They
used to simplify code, but now aren't.
- Remove extra RT_LOCK_ASSERT() in rt_setgate(). The RT_REMREF macro
does this.
- In rtfree() convert panics to KASSERTs.
- Strict the routing API: rtfree() should be called only in a case
when we are completely sure we've got the last reference on the
rtentry. In all other cases RTFREE_LOCKED() macro should be used.
If the reference isn't the last one spit out a warning printf.
Correct the only(?) case for this in rt_check().
- Fix typos in comments.
of each port and any further packets are blocked, when the all the marker frames
have been returned to us from the remote network device then we can be sure
that all interface queues are empty.
This is needed when a port is added or removed from the aggregation since it
will affect the hash based distribution, if the queues are not empty then a
packet from an existing connection may be placed on a different interface and
arrive out of order. This was previously achieved by suppressing transmission for
1 second, now that there is an active feedback this timeout as been increased
to 3 seconds and used as a fallback.
The name trunk is misused as the networking term trunk means carrying multiple
VLANs over a single connection. The IEEE standard for link aggregation (802.3
section 3) does not talk about 'trunk' at all while it is used throughout IEEE
802.1Q in describing vlans.
The lagg(4) driver provides link aggregation, failover and fault tolerance.
Discussed on: current@
tolerance. This driver allows aggregation of multiple network interfaces as
one virtual interface using a number of different protocols/algorithms.
failover - Sends traffic through the secondary port if the master becomes
inactive.
fec - Supports Cisco Fast EtherChannel.
lacp - Supports the IEEE 802.3ad Link Aggregation Control Protocol
(LACP) and the Marker Protocol.
loadbalance - Static loadbalancing using an outgoing hash.
roundrobin - Distributes outgoing traffic using a round-robin scheduler
through all active ports.
This code was obtained from OpenBSD and this also includes 802.3ad LACP support
from agr(4) in NetBSD.
doesn't need to be first in softc now. (It was the whole
ifnet structure itself that needed to be first in the good
old days.) Fix the respective comment accordingly.
Add xrefs to ifnet(9) in some other comments while I'm here.
Pointed out by: thompsa
imitating an Ethernet device, so vlan(4) and if_bridge(4) can be
attached to it for testing and benchmarking purposes. Its source
can be an introduction to the anatomy of a network interface driver
due to its simplicity as well as to a bunch of comments in it.
(The rest of needed changes were in my previous commit, which got
interrupted in the middle. Alas, CVS commits are not atomic.)
sequence. First, if rt_ifa is going to be changed, then call
ifa_rtrequest(RTM_DELETE). Second, if gateway is going to be changed,
then call rt_setgate(). Third, change rt_ifa.
With this change we are able to change a link level route to a
gateway one, that wasn't possible before:
# ifconfig em0 192.168.22.1/24
# arp -s 192.168.22.99 00:11:22:33:44:55
# route change 192.168.22.99 192.168.22.199
# ping 192.168.22.99
db>
Reported by: avatar
structures. Detect when ifnet instances are detached from the network
stack and perform appropriate cleanup to prevent memory leaks.
This has been implemented in such a way as to be backwards ABI compatible.
Kernel consumers are changed to use if_delmulti_ifma(); in_delmulti()
is unable to detect interface removal by design, as it performs searches
on structures which are removed with the interface.
With this architectural change, the panics FreeBSD users have experienced
with carp and pfsync should be resolved.
Obtained from: p4 branch bms_netdev
Reviewed by: andre
Sponsored by: Garance A Drosehn
Idea from: NetBSD
MFC after: 1 month
Main points of this change:
* Drop frames immediately if the interface is not marked IFF_UP.
* Always trim off the frame checksum if present.
* Always use M_VLANTAG in preference to passing 802.1Q frames
to consumers.
* Use __func__ consistently for KASSERT().
* Use the M_PROMISC flag to detect situations where ether_input()
may reenter itself on the same call graph with the same mbuf which
was promiscuously received on behalf of subsystems such as
netgraph, carp, and vlan.
* 802.1P frames (that is, VLAN frames with an ID of 0) will now be
passed to layer 3 input paths.
* Deal with the special case for CARP in a sane way.
This is a significant rewrite of code on the critical path. Please report
any issues to me if they arise. Frames will now only pass through dummynet
if M_PROMISC is cleared, to avoid problems with re-entry.
The handling of CARP needs to be revisited architecturally. The M_PROMISC
flag may potentially be demoted to a link-layer flag only as it is in
NetBSD, where the idea originated.
Discussed on: net
Idea from: NetBSD
Reviewed by: yar
MFC after: 1 month
in case of multiple interfaces with the same MAC in the same bridge.
This commit do not solve the entire problem. Only case where packet
arrived from such interface.
PR: kern/109815
MFC after: 7 days
Submitted by: Eygene Ryabinkin and rik@
Discussed with: bms@, thompsa@, yar@
This can help to spot bugs (which it did for me,)
and let people know which mode the vlan module is
actually using if they suspect it isn't picking its
options from the main kernel config file.
- ifv_list member of struct ifvlan is unneeded in array mode,
it's used only in hash mode to resolve hash collisions.
- We don't need the list of trunks at all. (The initial reason for
having it was to be able to destroy all trunks in the MOD_UNLOAD
handler, but a trunk is not to be destroyed forcibly -- it will
go away when all vlan interfaces on it have been deleted.
Note that if_clone_detach() called first of all under MOD_UNLOAD
will delete all vlan interfaces and thus make all trunks go away
quietly.)
- It's enough to use a single [S]LIST_FIRST() in a typical list
destruction loop.
Add macro EVL_APPLY_VLID() which may be used to apply an 802.1q VLAN ID
to the M_VLANTAG field in an mbuf packet header non-destructively.
This will be used by net80211 to begin with.
Add macro EVL_APPLY_PRI() which may be used to apply an 802.1p priority
class to the M_VLANTAG field in an mbuf packet header non-destructively.
Add other macros for manipulating tags and the CFI bit.
Submitted by: Boris Kovalenko (EVL_CFIOFTAG(), EVL_MAKETAG())
- BIOCGDIRECTION and BIOCSDIRECTION get or set the setting determining
whether incoming, outgoing, or all packets on the interface should be
returned by BPF. Set to BPF_D_IN to see only incoming packets on the
interface. Set to BPF_D_INOUT to see packets originating locally and
remotely on the interface. Set to BPF_D_OUT to see only outgoing
packets on the interface. This setting is initialized to BPF_D_INOUT
by default. BIOCGSEESENT and BIOCSSEESENT are obsoleted by these but
kept for backward compatibility.
- BIOCFEEDBACK sets packet feedback mode. This allows injected packets
to be fed back as input to the interface when output via the interface is
successful. When BPF_D_INOUT direction is set, injected outgoing packet
is not returned by BPF to avoid duplication. This flag is initialized to
zero by default.
Note that libpcap has been modified to support BPF_D_OUT direction for
pcap_setdirection(3) and PCAP_D_OUT direction is functional now.
Reviewed by: rwatson
incoming packets have had their 802.1Q tags processed by the
hardware, resulting in them being stripped from the packets, and
placed on the mbuf. This fixes the processing of 802.1Q tags when
hardware offload of 802.1Q tags is enabled.
a link-layer multicast group membership.
Such memberships are needed in order to support protocols such as
IS-IS without putting the interface into PROMISC or ALLMULTI modes.
sa_equal() is not OK for comparing sockaddr_dl as it has deeper structure
than a simple byte array, so add sa_dl_equal() and use that instead.
Reviewed by: rwatson
Verified with: /usr/sbin/mtest
Bug found by: Jouke Witteveen
MFC after: 2 weeks
that of the tun instance even for the !AF_INET case, and properly
remove configured addresses by calling if_purgeaddrs().
Maintain the TUN_DSTADDR behaviour for compatibility with the OS/390
emulator.
MFC after: 3 weeks
PR: 100080
Reviewed by: bz
Make devfs cloning a sysctl/tunable which defaults to on.
If devfs cloning is enabled, only the super-user may create
tun(4)/tap(4)/vmnet(4) instances. Devfs cloning is still enabled by
default; it may be disabled from the loader or via sysctl with
"net.link.tap.devfs_cloning" and "net.link.tun.devfs_cloning".
Disabling its use affects potentially all tun(4)/tap(4) consumers
including OpenSSH, OpenVPN and VMware.
PR: 105228 (potentially also 90413, 105570)
Submitted by: Landon Fuller
Tested by: Andrej Tobola
Approved by: core (rwatson)
MFC after: 4 weeks
of a tap(4) instance, if IFF_PROMISC is not set.
In tap(4), we should emulate the effect IFF_PROMISC would have on
hardware, otherwise we risk introducing layer 2 loops if tap(4) is
used with bridges. This means not even bpf(4) gets to see them.
This patch has been tested in a variety of situations. Multicast and
broadcast frames are correctly allowed through. I have observed this
behaviour causing problems with multiple QEMU instances hosted on
the same FreeBSD machine.
The checks in in ether_demux() [if_ethersubr.c, rev 1.222, line 638]
are insufficient to prevent this bug from occurring, as ifp->if_vlantrunk
will always be NULL for the non-vlan case.
MFC after: 3 weeks
PR: 86429
Submitted by: Pieter de Boer (with changes)
not used in any of our code. Also remove explicit padding variable that
kept the bpf_d structure the same size before and after the change in
select implementation, since binary compatibility is not required for this
data structure on 7-CURRENT.