Commit Graph

2844 Commits

Author SHA1 Message Date
dim
fda4020a88 Apply the STATIC_VNET_DEFINE and STATIC_DPCPU_DEFINE macros throughout
the tree.
2010-11-14 20:38:11 +00:00
dim
7dff36caf7 Add macros to define static instances of VNET_DEFINE and DPCPU_DEFINE. 2010-11-14 20:23:02 +00:00
marius
278d761d73 o Flesh out the generic IEEE 802.3 annex 31B full duplex flow control
support in mii(4):
  - Merge generic flow control advertisement (which can be enabled by
    passing by MIIF_DOPAUSE to mii_attach(9)) and parsing support from
    NetBSD into mii_physubr.c and ukphy_subr.c. Unlike as in NetBSD,
    IFM_FLOW isn't implemented as a global option via the "don't care
    mask" but instead as a media specific option this. This has the
    following advantages:
    o allows flow control advertisement with autonegotiation to be
      turned on and off via ifconfig(8) with the default typically
      being off (though MIIF_FORCEPAUSE has been added causing flow
      control to be always advertised, allowing to easily MFC this
      changes for drivers that previously used home-grown support for
      flow control that behaved that way without breaking POLA)
    o allows to deal with PHY drivers where flow control advertisement
      with manual selection doesn't work or at least isn't implemented,
      like it's the case with brgphy(4), e1000phy(4) and ip1000phy(4),
      by setting MIIF_NOMANPAUSE
    o the available combinations of media options are readily available
      from the `ifconfig -m` output
  - Add IFM_FLOW to IFM_SHARED_OPTION_DESCRIPTIONS and IFM_ETH_RXPAUSE
    and IFM_ETH_TXPAUSE to IFM_SUBTYPE_ETHERNET_OPTION_DESCRIPTIONS so
    these are understood by ifconfig(8).
o Make the master/slave support in mii(4) actually usable:
  - Change IFM_ETH_MASTER from being implemented as a global option via
    the "don't care mask" to a media specific one as it actually is only
    applicable to IFM_1000_T to date.
  - Let mii_phy_setmedia() set GTCR_MAN_MS in IFM_1000_T slave mode to
    actually configure manually selected slave mode (like we also do in
    the PHY specific implementations).
  - Add IFM_ETH_MASTER to IFM_SUBTYPE_ETHERNET_OPTION_DESCRIPTIONS so it
    is understood by ifconfig(8).
o Switch bge(4), bce(4), msk(4), nfe(4) and stge(4) along with brgphy(4),
  e1000phy(4) and ip1000phy(4) to use the generic flow control support
  instead of home-grown solutions via IFM_FLAGs. This includes changing
  these PHY drivers and smcphy(4) to no longer unconditionally advertise
  support for flow control but only if the selected media has IFM_FLOW
  set (or MIIF_FORCEPAUSE is set) and implemented for these media variants,
  i.e. typically only for copper.
o Switch brgphy(4), ciphy(4), e1000phy(4) and ip1000phy(4) to report and
  set IFM_1000_T master mode via IFM_ETH_MASTER instead of via IFF_LINK0
  and some IFM_FLAGn.
o Switch brgphy(4) to add at least the the supported copper media based on
  the contents of the BMSR via mii_phy_add_media() instead of hardcoding
  them. The latter approach seems to have developed historically, besides
  causing unnecessary code duplication it was also undesirable because
  brgphy_mii_phy_auto() already based the capability advertisement on the
  contents of the BMSR though.
o Let brgphy(4) set IFM_1000_T master mode on all supported PHY and not
  just BCM5701. Apparently this was a misinterpretation of a workaround
  in the Linux tg3 driver; BCM5701 seem to require RGPHY_1000CTL_MSE and
  BRGPHY_1000CTL_MSC to be set when configuring autonegotiation but
  this doesn't mean we can't set these as well on other PHYs for manual
  media selection.
o Let ukphy_status() report IFM_1000_T master mode via IFM_ETH_MASTER so
  IFM_1000_T master mode support now is generally available with all PHY
  drivers.
o Don't let e1000phy(4) set master/slave bits for IFM_1000_SX as it's
  not applicable there.

Reviewed by:	yongari (plus additional testing)
Obtained from:	NetBSD (partially), OpenBSD (partially)
MFC after:	2 weeks
2010-11-14 13:26:10 +00:00
kib
1889bb0afa Use 'z' modifier for size_t printing. 2010-11-13 11:11:51 +00:00
dim
7b0aabca30 Similar to r212647, remove the workaround in sys/net/vnet.h for an ld
bug (incorrect placement of __start_SECNAME in some cases) that was
fixed in r210245.

There is already an UPDATING entry about needing a recent ld.

MFC after:	1 month
2010-11-12 22:59:50 +00:00
gnn
c3225b5eaa Add a queue to hold packets while we await an ARP reply.
When a fast machine first brings up some non TCP networking program
it is quite possible that we will drop packets due to the fact that
only one packet can be held per ARP entry.  This leads to packets
being missed when a program starts or restarts if the ARP data is
not currently in the ARP cache.

This code adds a new sysctl, net.link.ether.inet.maxhold, which defines
a system wide maximum number of packets to be held in each ARP entry.
Up to maxhold packets are queued until an ARP reply is received or
the ARP times out.  The default setting is the old value of 1
which has been part of the BSD networking code since time
immemorial.

Expose the time we hold an incomplete ARP entry by adding
the sysctl net.link.ether.inet.wait, which defaults to 20
seconds, the value used when the new ARP code was added..

Reviewed by:	bz, rpaulo
MFC after: 3 weeks
2010-11-12 22:03:02 +00:00
dim
2c7257916f Use the same treatment as in linker_set.h for the __start and __stop
symbols of the set_vnet and set_pcpu sections, so those symbols will
always be emitted in kernel modules, if they use vnet.h or pcpu.h.

Also, for pcpu.h, make the __(start|stop)_set_pcpu declarations, and
associated macros invisible to userland, to prevent it picking up these
symbols.

Reviewed by:	kib
2010-11-11 19:18:52 +00:00
rpaulo
2631ae0f3d Sync DLTs with the latest pcap version. 2010-10-29 18:41:09 +00:00
bz
491af1942e Factor out DDB commands from r204145, r204279 into if_debug.c for further
enhancements (1).  Switch to a standard 2-clause BSD license for this (2).

Unfortunately we have to un-static the ifindex_table for this but do not
publicly export it.

Suggested by:	rwatson (1) a while back.
Approved by:	thompsa (2) for the change from r204279.
MFC after:	6 days
2010-10-25 08:30:19 +00:00
pluknet
8950ed8036 Reshuffle SIOCGIFCONF32 handler from r155224.
- move all the chunks into one file, which allows to hide SIOCGIFCONF32
  global definition as well.
- replace __amd64__ with proper COMPAT_FREEBSD32 around.
- handle 32bit capacity before going into the handler itself instead of
  doing internal 32bit specific changes within it (e.g. as it's done for
  SIOCGDEFIFACE32_IN6).
- use explicitely sized types for ABI compat.

Approved by:	kib (mentor)
MFC after:	2 weeks
2010-10-21 16:20:48 +00:00
bz
753b9262ae Close a race acquiring the IF_ADDR_LOCK() for each entry while iterating
over all interfaces to make sure the address will neither change nor be
freed while we are working on it.

PR:		kern/146250
Submitted by:	Mikolaj Golub (to.my.trociny gmail.com)
MFC after:	1 week
2010-10-16 19:25:27 +00:00
bz
e7e5079137 lltable_drain() has never been used so far, thus #if 0 it for now.
While touching it add the missing locking to the now disabled code
for the time when we'll resurrect it.

MFC after:	3 days
2010-10-16 18:42:09 +00:00
bz
054f4fff23 Only hide the ifa and not the tp under #ifdef INET as the tp is needed
for locking evenwhen there is no INET.

MFC after:	3 days
2010-10-01 15:14:14 +00:00
jhb
b8914d3479 - Expand scope of tun/tap softc locks to cover more softc fields and
driver-maintained ifnet fields (such as if_drv_flags).
- Use soft locks as the mutex that protects each interface's knote list
  rather than using the global knote list lock.  Also, use the softc
  for kn_hook instead of the cdev.
- Use mtx_sleep() instead of tsleep() when blocking in the read routines.
  This fixes a lost wakeup race.
- Remove D_NEEDGIANT now that the cdevsw routines use the softc lock
  where locking is needed.
- Lock IFQ when calculating the result for FIONREAD in tap(4).  tun(4)
  already did this.
- Remove remaining spl calls.

Submitted by:	Marcin Cieslak  saper of saper|info (3)
MFC after:	2 weeks
2010-09-22 21:02:43 +00:00
jkim
4b1a6e6702 Fix a typo in a comment.
Submitted by:	afiveg
2010-09-16 18:37:33 +00:00
mdf
ab3a8b533a Replace sbuf_overflowed() with sbuf_error(), which returns any error
code associated with overflow or with the drain function.  While this
function is not expected to be used often, it produces more information
in the form of an errno that sbuf_overflowed() did.
2010-09-10 16:42:16 +00:00
bz
06977c291c MFp4 CH=183259:
No reason to use if_free_type() as we don't change our type.
  Just if_free() is fine.

MFC after:	3 days
2010-09-02 16:11:12 +00:00
emaste
a9a1b47f1d Add a sysctl knob to accept input packets on any link in a failover lagg. 2010-09-01 16:53:38 +00:00
bz
70761a9ea6 MFp4 CH=182972:
Add explicit linkstate UP/DOWN for the epair.  This is needed by carp(4)
and other things to work.

MFC after:	5 days
2010-08-27 23:22:58 +00:00
rpaulo
ea11ba6788 Add an extra comment to the SDT probes definition. This allows us to get
use '-' in probe names, matching the probe names in Solaris.[1]

Add userland SDT probes definitions to sys/sdt.h.

Sponsored by:	The FreeBSD Foundation
Discussed with:	rwaston [1]
2010-08-22 11:18:57 +00:00
zec
e1e5264fc5 When moving an ethernet ifnet from one vnet to another, destroy the
associated ng_ether netgraph node in the current vnet, and create a
new one in the target vnet.

Reviewed by:	julian
MFC after:	3 days
2010-08-13 18:17:32 +00:00
will
d548943ae9 Unbreak LINT by moving all carp hooks to net/if.c / netinet/ip_carp.h, with
the appropriate ifdefs.

Reviewed by:	bz
Approved by:	ken (mentor)
2010-08-11 20:18:19 +00:00
will
aa4e762c4a Allow carp(4) to be loaded as a kernel module. Follow precedent set by
bridge(4), lagg(4) etc. and make use of function pointers and
pf_proto_register() to hook carp into the network stack.

Currently, because of the uncertainty about whether the unload path is free
of race condition panics, unloads are disallowed by default.  Compiling with
CARPMOD_CAN_UNLOAD in CFLAGS removes this anti foot shooting measure.

This commit requires IP6PROTOSPACER, introduced in r211115.

Reviewed by:	bz, simon
Approved by:	ken (mentor)
MFC after:	2 weeks
2010-08-11 00:51:50 +00:00
jhb
4ab164bfd3 Adjust the interface type in the link layer socket address for vlan(4)
interfaces to be a vlan (IFT_L2VLAN) rather than an Ethernet interface
(IFT_ETHER).  The code already fixed if_type in the ifnet causing some
places to report the interface as a vlan (e.g. arp -a output) and other
places to report the interface as Ethernet (getifaddrs(3)).  Now they
should all report IFT_L2VLAN.

Reviewed by:	brooks
MFC after:	1 month
2010-08-06 15:15:26 +00:00
kib
c5df0ad3b8 Properly set ifi_datalen for compat32 struct if_data32.
PR:	kern/149240
Submitted by:	Stef Walter <stef memberwebs com>
MFC after:	1 weeks
2010-08-03 15:40:42 +00:00
glebius
b89adb17c9 Don't check malloc(M_WAITOK) result. 2010-07-27 11:56:49 +00:00
bz
0078d05705 Return NULL rather than 0 for a pointer.
MFC after:	3 days
2010-07-27 11:54:01 +00:00
glebius
43f917d899 When installing a new ARP entry via 'arp -S', lla_lookup() will
either find an existing entry, or allocate a new one. In the latter
case an entry would have flags, that were supplied as argument to
lla_lookup(). In case of an existing entry, flags aren't modified.

This lead to losing LLE_PUB and/or LLE_PROXY flags.

We should apply these flags either in lla_rt_output() or in the
in.c:in_lltable_lookup(). It seems to me that lla_rt_output() is
a more correct choice.

PR:		kern/148784, kern/146539
Silence from:	qingli, 5 days
2010-07-27 10:05:27 +00:00
jkim
9fdab3dd15 Fix an obvious typo from r1.1. We were acquiring an exclusive writer lock
regardless of the given flags.

MFC after:	3 days
2010-07-22 18:44:40 +00:00
luigi
bbbbe1a022 whitespace cleanup 2010-07-15 14:41:59 +00:00
luigi
9f2bcd601a small portability fix to build on linux/windows 2010-07-15 14:41:06 +00:00
jkim
14f08fd627 Implement flexible BPF timestamping framework.
- Allow setting format, resolution and accuracy of BPF time stamps per
listener.  Previously, we were only able to use microtime(9).  Now we can
set various resolutions and accuracies with ioctl(2) BIOCSTSTAMP command.
Similarly, we can get the current resolution and accuracy with BIOCGTSTAMP
command.  Document all supported options in bpf(4) and their uses.

- Introduce new time stamp 'struct bpf_ts' and header 'struct bpf_xhdr'.
The new time stamp has both 64-bit second and fractional parts.  bpf_xhdr
has this time stamp instead of 'struct timeval' for bh_tstamp.  The new
structures let us use bh_tstamp of same size on both 32-bit and 64-bit
platforms without adding additional shims for 32-bit binaries.  On 64-bit
platforms, size of BPF header does not change compared to bpf_hdr as its
members are already all 64-bit long.  On 32-bit platforms, the size may
increase by 8 bytes.  For backward compatibility, struct bpf_hdr with
struct timeval is still the default header unless new time stamp format is
explicitly requested.  However, the behaviour may change in the future and
all relevant code is wrapped around "#ifdef BURN_BRIDGES" for now.

- Add experimental support for tagging mbufs with time stamps from a lower
layer, e.g., device driver.  Currently, mbuf_tags(9) is used to tag mbufs.
The time stamps must be uptime in 'struct bintime' format as binuptime(9)
and getbinuptime(9) do.

Reviewed by:	net@
2010-06-15 19:28:44 +00:00
jhb
9b74a62d73 Update several places that iterate over CPUs to use CPU_FOREACH(). 2010-06-11 18:46:34 +00:00
zec
66c3a596d7 Provide a macro for registering a virtualized sysctl handler for
VNET opaque data.

MFC after:	30 days
2010-06-02 15:29:21 +00:00
qingli
f6ab4a6810 This patch fixes the problem where proxy ARP entries cannot be added
over the if_ng interface.

MFC after:	3 days
2010-05-25 20:42:35 +00:00
jhb
7df2d528cf Ignore failures from removing multicast addresses from the parent (trunk)
interface when tearing down a vlan interface.  If a trunk interface is
detached, all of its multicast addresses are removed before the ifnet
departure eventhandlers are invoked.  This means that all of the multicast
addresses are removed before the vlan interfaces are removed which causes
the if_delmulti() calls in the vlan teardown to fail.

In the VLAN_ARRAY case, this left vlan interfaces referencing a no longer
valid parent interface.  In the !VLAN_ARRAY case, the eventhandler gets
stuck in an infinite loop retrying vlan_unconfig_locked() forever.  In
general the callers of vlan_unconfig_locked() do not expect nor handle
failure, so I believe it is safer to ignore the errors and tear down as
much of the vlan state as possible.

Silence from:	net@
MFC after:	4 days
2010-05-17 19:36:56 +00:00
kmacy
a68dd336d5 allocate ipv6 flows from the ipv6 flow zone
reported by: rrs@

MFC after:	3 days
2010-05-16 21:48:39 +00:00
bz
c9d1ca826b Fix an issue with the dynamic pcpu/vnet data allocators.
We cannot expect that modspace is the last entry in the linker
set and thus that modspace + possible extra space up to PAGE_SIZE
would be contiguous.  For the moment do not support more than
*_MODMIN space and ignore the extra space (*).

(*) We know how to get it back but it'll need testing.

Discussed with:	jeff, rwatson (briefly)
Reviewed by:	jeff
Sponsored by:	The FreeBSD Foundation
Sponsored by:	CK Software GmbH
MFC after:	4 days
2010-05-14 21:11:58 +00:00
kmacy
5e73b28c4d workaround bug with ipv6 where a flow can have a null rtentry 2010-05-12 04:51:20 +00:00
alc
e6c77ecaea Remove page queues locking from all sf_buf_mext()-like functions. The page
lock now suffices.

Fix a couple nearby style violations.
2010-05-06 17:43:41 +00:00
alc
c9aaa1e2a2 Add page locking to the vm_page_cow* functions.
Push down the acquisition and release of the page queues lock into
vm_page_wire().

Reviewed by:	kib
2010-05-04 15:55:41 +00:00
sobomax
213eac1f2c Add new tunable 'net.link.ifqmaxlen' to set default send interface
queue length. The default value for this parameter is 50, which is
quite low for many of today's uses and the only way to modify this
parameter right now is to edit if_var.h file. Also add read-only
sysctl with the same name, so that it's possible to retrieve the
current value.

MFC after:	1 month
2010-05-03 07:32:50 +00:00
alc
387e15c45a This is the first step in transitioning responsibility for synchronizing
access to the page's wire_count from the page queues lock to the page lock.

Submitted by:	kmacy
2010-05-03 05:41:50 +00:00
kmacy
1dc1263413 On Alan's advice, rather than do a wholesale conversion on a single
architecture from page queue lock to a hashed array of page locks
(based on a patch by Jeff Roberson), I've implemented page lock
support in the MI code and have only moved vm_page's hold_count
out from under page queue mutex to page lock. This changes
pmap_extract_and_hold on all pmaps.

Supported by: Bitgravity Inc.

Discussed with: alc, jeffr, and kib
2010-04-30 00:46:43 +00:00
bz
0a90ef1728 MFP4: @176978-176982, 176984, 176990-176994, 177441
"Whitspace" churn after the VIMAGE/VNET whirls.

Remove the need for some "init" functions within the network
stack, like pim6_init(), icmp_init() or significantly shorten
others like ip6_init() and nd6_init(), using static initialization
again where possible and formerly missed.

Move (most) variables back to the place they used to be before the
container structs and VIMAGE_GLOABLS (before r185088) and try to
reduce the diff to stable/7 and earlier as good as possible,
to help out-of-tree consumers to update from 6.x or 7.x to 8 or 9.

This also removes some header file pollution for putatively
static global variables.

Revert VIMAGE specific changes in ipfilter::ip_auth.c, that are
no longer needed.

Reviewed by:	jhb
Discussed with:	rwatson
Sponsored by:	The FreeBSD Foundation
Sponsored by:	CK Software GmbH
MFC after:	6 days
2010-04-29 11:52:42 +00:00
kmacy
e04522d342 need to initialize the lock before it is used
MFC after:	3 days
2010-04-27 23:48:50 +00:00
bz
7fcd42cfee MFP4: @177254
Add missing CURVNET_RESTORE() calls for multiple code paths, to stop
leaking the currently cached vnet into callers and to the process.

Sponsored by:	The FreeBSD Foundation
Sponsored by:	CK Software GmbH
MFC after:	4 days
2010-04-27 15:16:54 +00:00
kib
2861e20e63 Provide compat32 shims for bpf(4), except zero-copy facilities.
bd_compat32 field of struct bpf_d is kept unconditionally to not
impose the requirement of including "opt_compat.h" on all numerous
users of bpfdesc.h.

Submitted by:	jhb (version for 6.x)
Reviewed and tested by:	emaste
MFC after:	2 weeks
2010-04-25 16:43:41 +00:00
kib
1bf8e84d75 Provide 32bit compat shims for sysctl net.route NET_RT_IFLIST.
This allows getifaddrs(3) to work for compat32 binaries.

Submitted by:	jhb (6.x version)
Reviewed by:	emaste
Tested by:	emaste and <pluknet gmail com>
MFC after:	2 weeks
2010-04-25 16:42:47 +00:00
julian
e5e811b1e9 Move two copies of the same definition to a common include file.
MFC after: 3 weeks
2010-04-14 23:06:07 +00:00
delphij
58be1647f8 When an underlying ioctl(2) handler returns an error, our ioctl(2)
interface considers that it hits a fatal error, and will not copyout
the request structure back for _IOW and _IOWR ioctls, keeping them
untouched.

The previous implementation of the SIOCGIFDESCR ioctl intends to
feed the buffer length back to userland.  However, if we return
an error, the feedback would be defeated and ifconfig(8) would
trap into an infinite loop.

This commit changes SIOCGIFDESCR to set buffer field to NULL to
indicate the previous ENAMETOOLONG case.

Reported by:	bschmidt
MFC after:	2 weeks
2010-04-14 22:02:19 +00:00
bz
778f026aa2 Take a reference to make sure that the interface cannot go away during
if_clone_destroy() in case parallel threads try to.

PR:		kern/116837
Submitted by:	Mikolaj Golub (to.my.trociny gmail.com)
MFC after:	10 days
2010-04-11 18:47:38 +00:00
bz
35f1e4bd73 Check that the interface is on the list of cloned interfaces before trying
to remove it to avoid panics in case of two threads trying to remove it in
parallel.

PR:		kern/116837
Submitted by:	Takahiro Kurosawa (takahiro.kurosawa gmail.com) (orig version)
MFC after:	10 days
2010-04-11 18:41:31 +00:00
bz
d7a91dc6bf Plug reference leaks in the link-layer code ("new-arp") that previously
prevented the link-layer entry from being freed.

In both in.c and in6.c (though that code path seems to be basically dead)
plug a reference leak in case of a pending callout being drained.

In if_ether.c consistently add a reference before resetting the callout
and in case we canceled a pending one remove the reference for that.
In the final case in arptimer, before freeing the expired entry, remove
the reference again and explicitly call callout_stop() to clear the active
flag.

In nd6.c:nd6_free() we are only ever called from the callout function and
thus need to remove the reference there as well before calling into
llentry_free().

In if_llatbl.c when freeing entire tables make sure that in case we cancel
a pending callout to remove the reference as well.

Reviewed by:		qingli (earlier version)
MFC after:		10 days
Problem observed, patch tested by: simon on ipv6gw.f.o,
			Christian Kratzer (ck cksoft.de),
			Evgenii Davidov (dado korolev-net.ru)
PR:			kern/144564
Configurations still affected:	with options FLOWTABLE
2010-04-11 16:04:08 +00:00
bz
d5b3b3f9d9 In if_detach_internal() we cannot hold the af_data lock over the
dom_ifdetach() calls as they might sleep for callout_drain().
Do as we do in if_attachdomain1() [r121470] and handle
if_afdata_initialized earlier and call dom_ifdetach() unlocked.

Discussed with:	rwatson
MFC after:	10 days
2010-04-11 11:51:44 +00:00
bz
674a87c918 In if_detach_internal() only try to do the detach run if if_attachdomain1()
has actually succeeded to initialize and attach.  There is a theoretical
possibility to drop out early in if_attachdomain1() leaving the array
uninitialized if we cannot get the lock.

Discussed with:	rwatson
MFC after:	10 days
2010-04-11 11:49:24 +00:00
jkim
2b1849a3fc Check the pointer to JIT binary filter before its de-allocation.
Submitted by:	Alexander Sack (asack at niksun dot com)
MFC after:	3 days
2010-03-29 20:24:03 +00:00
rpaulo
e14131f581 Add MCS to the list of media types.
Sponsored by:	iXsystems, inc.
2010-03-23 13:15:11 +00:00
kmacy
01cb21605b - boot-time size the ipv4 flowtable and the maximum number of flows
- increase flow cleaning frequency and decrease flow caching time
  when near the flow limit
- stop allocating new flows when within 3% of maxflows don't start
  allocating again until below 12.5%

MFC after:	7 days
2010-03-22 23:04:12 +00:00
emaste
41b3806743 Avoid holding the VLAN_LOCK() over the parent interface SIOCGIFMEDIA
ioctl call, as it may sleep.

Reviewed by:	rwatson
2010-03-21 15:00:33 +00:00
bz
102a1f8933 Split eventhandler_register() into an internal part and a wrapper function
that provides the allocated and setup eventhandler entry.

Add a new wrapper for VIMAGE that allocates extra space to hold the
callback function and argument in addition to an extra wrapper function.
While the wrapper function goes as normal callback function the
argument points to the extra space allocated holding the original func
and arg that the wrapper function can then call.

Provide an iterator function for the virtual network stack (vnet) that
will call the callback function for each network stack.

Provide a new set of macros for VNET that in the non-VIMAGE case will
just call eventhandler_register() while in the VIMAGE case it will use
vimage_eventhandler_register() passing in the extra iterator function
but will only register once rather than per-vnet.
We need a special macro in case we are interested in the tag returned
as we must check for curvnet and can neither simply assign the
return value, nor not change it in the non-vnet0 case without that.

Sponsored by:	ISPsystem
Discussed with:	jhb
Reviewed by:	zec (earlier version), jhb
MFC after:	1 month
2010-03-19 19:51:03 +00:00
bz
abd6b4ebb7 Add ddb support to the "new" link layer code ("new-arp"):
- show all lltables [1] (optional flag to also show the llentries as well)
 - show lltable <struct lltable *>
 - show llentry <struct llentry *>

MFC after:	6 days
2010-03-18 09:09:59 +00:00
qingli
4ff4954e4e Verify interface up status using its link state only
if the interface has such capability. The interface
capability flag indicates whether such capability
exists. This approach is much more backward compatible.
Physical device driver changes will be part of another
commit.

Also updated the ifconfig utility to show the LINKSTATE
capability if present.

Reviewed by:	rwatson, imp, juli
MFC after:	3 days
2010-03-16 17:59:12 +00:00
mlaier
d62719cf37 Fix a small bug in drbr_dequeue_cond spotted while preparing MFC of r203834.
MFC after:	3 days
2010-03-15 21:15:03 +00:00
kmacy
a5e0110227 flowtable_get_hashkey is only used by a DDB function - move under #ifdef DDB
pointed out by jkim@
2010-03-12 19:58:51 +00:00
jkim
5e802872af Fix a style(9) nit. 2010-03-12 19:42:42 +00:00
kmacy
965730af4f re-update copyright to 2010
pointed out by danfe@
2010-03-12 19:26:45 +00:00
jkim
df5e72589a Tidy up callout for select(2) and read timeout.
- Add a missing callout_drain(9) before the descriptor deallocation.[1]
- Prefer callout_init_mtx(9) over callout_init(9) and let the callout
subsystem handle the mutex for callout function.

PR:		kern/144453
Submitted by:	Alexander Sack (asack at niksun dot com)[1]
MFC after:	1 week
2010-03-12 19:14:58 +00:00
qingli
5c999930df The flow-table module retrieves the destination and source
address as well as the transport protocol port information
from the outbound packets. The routing code is generic and
compares every byte in the given sockaddr object. Therefore
the temporary sockaddr objects must be cleared due to padding
bytes. In addition, the port information must be stripped
or the route search will either fail or return the incorrect
route entry.

Unit testing is done using OpenVPN over the if_tun interface.

MFC after:	7 days
2010-03-12 10:24:58 +00:00
kmacy
0315e29550 fix stats reporting sysctl 2010-03-12 06:31:19 +00:00
kmacy
128542c758 - restructure flowtable to support ipv6
- add a name argument to flowtable_alloc for printing with ddb commands
- extend ddb commands to print destination address or 4-tuples
- don't parse ports in ulp header if FL_HASH_ALL is not passed
- add kern_flowtable_insert to enable more generic use of flowtable
  (e.g. system calls for adding entries)
- don't hash loopback addresses
- cleanup whitespace
- keep statistics per-cpu for per-cpu flowtables to avoid cache line contention
- add sysctls to accumulate stats and report aggregate

MFC after:	7 days
2010-03-12 05:03:26 +00:00
qingli
cde322640a The if_tap interface is of IFT_ETHERNET type, but it
does not set or update the if_link_state variable.
As such RT_LINK_IS_UP() fails for the if_tap interface.

Also, the RT_LINK_IS_UP() needs to bypass all loopback
interfaces because loopback interfaces are considered
up logically as long as the system is running.

This patch fixes the above issues by setting and updating
the if_link_state variable when the tap interface is
opened or closed respectively. Similary approach is
already done in the if_tun device.

MFC after:	3 days
2010-03-11 17:56:46 +00:00
qingli
93013817b0 One of the advantages of enabling ECMP (a.k.a RADIX_MPATH) is to
allow for connection load balancing across interfaces. Currently
the address alias handling method is colliding with the ECMP code.
For example, when two interfaces are configured on the same prefix,
only one prefix route is installed. So connection load balancing
among the available interfaces is not possible.

The other advantage of ECMP is for failover. The issue with the
current code, is that the interface link-state is not reflected
in the route entry. For example, if there are two interfaces on
the same prefix, the cable on one interface is unplugged, new and
existing connections should switch over to the other interface.
This is not done today and packets go into a black hole.

Also, there is a small bug in the kernel where deleting ECMP routes
in the userland will always return an error even though the command
is successfully executed.

MFC after:	5 days
2010-03-09 01:11:45 +00:00
delphij
fe5f1f57b8 Remove the check for IFF_DRV_OACTIVE right before adding a port into lagg
interface.  The check itself seems to be coming from OpenBSD but does not
seem to be useful for our code.

Discussed with:	thomasa
MFC after:	1 month
2010-03-09 00:52:16 +00:00
bz
07f7a52d59 Not only flush the ipfw tables when unloading ipfw or tearing
down a virtual netowrk stack, but also free the Radix Node Head.

Sponsored by:	ISPsystem
Reviewed by:	julian
MFC after:	5 days
2010-03-07 15:37:58 +00:00
bz
8ef34a3986 Introduce a function rn_detachhead() that will free the
radix table root nodes.  This is only needed (and available)
in the virtualization case to free the resources when tearing
down a virtual network stack.

Sponsored by:	ISPsystem
Reviewed by:	julian, zec
MFC after:	5 days
2010-03-06 21:27:26 +00:00
bz
3338a0e11e Rework reference counting in case we queue into the netisr,
or overflow the netisr queue and fall back to the interface
queue so that we can garuantee that the ifnet pointer stays
valid.   Formerly we ended up with reference counts <= 0 in
case the netisr had returned ENOBUFS.  The idea is to track
any packet in the netisr queue and only change the refount
on edge operations for the fallback interface queue. This
also avoids problems in case the if_snd.ifq_len lies to us.

Also rework refount assertions to make sure they trigger if
we go below 1. Formerly a negative refence count did not
trigger the assert as the refcount variable is u_int.

Sponsored by:	ISPsystem
MFC after:	5 days
2010-03-06 21:22:28 +00:00
luigi
5ceeac4aa8 Bring in the most recent version of ipfw and dummynet, developed
and tested over the past two months in the ipfw3-head branch.  This
also happens to be the same code available in the Linux and Windows
ports of ipfw and dummynet.

The major enhancement is a completely restructured version of
dummynet, with support for different packet scheduling algorithms
(loadable at runtime), faster queue/pipe lookup, and a much cleaner
internal architecture and kernel/userland ABI which simplifies
future extensions.

In addition to the existing schedulers (FIFO and WF2Q+), we include
a Deficit Round Robin (DRR or RR for brevity) scheduler, and a new,
very fast version of WF2Q+ called QFQ.

Some test code is also present (in sys/netinet/ipfw/test) that
lets you build and test schedulers in userland.

Also, we have added a compatibility layer that understands requests
from the RELENG_7 and RELENG_8 versions of the /sbin/ipfw binaries,
and replies correctly (at least, it does its best; sometimes you
just cannot tell who sent the request and how to answer).
The compatibility layer should make it possible to MFC this code in a
relatively short time.

Some minor glitches (e.g. handling of ipfw set enable/disable,
and a workaround for a bug in RELENG_7's /sbin/ipfw) will be
fixed with separate commits.

CREDITS:
This work has been partly supported by the ONELAB2 project, and
mostly developed by Riccardo Panicucci and myself.
The code for the qfq scheduler is mostly from Fabio Checconi,
and Marta Carbone and Francesco Magno have helped with testing,
debugging and some bug fixes.
2010-03-02 17:40:48 +00:00
luigi
1a096d21d2 remove unnecessary casts leftover from a bogus fix to a previous bug 2010-03-02 16:24:16 +00:00
alfred
f34ce3dd38 Merge projects/enhanced_coredumps (r204346) into HEAD:
Enhanced process coredump routines.

  This brings in the following features:
  1) Limit number of cores per process via the %I coredump formatter.
  Example:
    if corefilename is set to %N.%I.core AND num_cores = 3, then
    if a process "rpd" cores, then the corefile will be named
    "rpd.0.core", however if it cores again, then the kernel will
    generate "rpd.1.core" until we hit the limit of "num_cores".

    this is useful to get several corefiles, but also prevent filling
    the machine with corefiles.

  2) Encode machine hostname in core dump name via %H.

  3) Compress coredumps, useful for embedded platforms with limited space.
    A sysctl kern.compress_user_cores is made available if turned on.

    To enable compressed coredumps, the following config options need to be set:
    options COMPRESS_USER_CORES
    device zlib   # brings in the zlib requirements.
    device gzio   # brings in the kernel vnode gzip output module.

  4) Eventhandlers are fired to indicate coredumps in progress.

  5) The imgact sv_coredump routine has grown a flag to pass in more
  state, currently this is used only for passing a flag down to compress
  the coredump or not.

  Note that the gzio facility can be used for generic output of gzip'd
  streams via vnodes.

Obtained from: Juniper Networks
Reviewed by: kan
2010-03-02 06:58:58 +00:00
joel
bb682915c9 The NetBSD Foundation has granted permission to remove clause 3 and 4 from
their software.

Obtained from:	NetBSD
2010-03-01 17:05:46 +00:00
rwatson
384e695fac Whitespace tweak.
MFC after:	3 days
2010-03-01 00:43:05 +00:00
rwatson
0b098aa759 Changes to support crashdump analysis of netisr:
- Rename the netisr protocol registration array, 'np' to 'netisr_proto',
  in order to reduce the chances of symbol name collisions.  It remains
  statically defined, but it will be looked up by netstat(1).

- Move certain internal structure definitions from netisr.c to
  netisr_internal.h so that netstat(1) can find them.  They remain
  private, and should not be used for any other purpose (for example,
  they should not be used by kernel modules, which must instead use the
  public interfaces in netisr.h).

- Store a kernel-compiled version of NETISR_MAXPROT in the global variable
  netisr_maxprot, and export via a sysctl, so that it is available for use
  by netstat(1).  This is especially important for crashdump
  interpretation, where the size of the workstream structure is determined
  by the maximum number of protocols compiled into the kernel.

MFC after:	1 week
Sponsored by:	Juniper Networks
2010-03-01 00:42:36 +00:00
kib
f93a8fa5df In both if_tun and if_tap:
Do not do additional dev_ref() on the newly created interface in the
if_clone create method [1]. This reference is not needed and never
removed, causing struct cdevpriv leakage. Remove the setting of
SI_CHEAPCLONE flag as well, since it is unused.

For dev_clone handlers, create cdevs with the call make_dev_credf(MAKEDEV_REF)
instead of calling make_dev() and then dev_ref(), to avoid a race.

Call drain_dev_clone_events() at the module unload time after dev_clone
handler is deinstalled.

Submitted by:	Mikolaj Golub <to.my.trociny gmail com> [1]
MFC after:	1 week
2010-02-28 16:25:49 +00:00
rwatson
e0805e87a3 Fix edge cases in several KASSERTs: use <= rather than < when testing that
counters have not gone about MAXCPU or NETISR_MAXPROT.  These problems
caused panics on UP kernels with INVARIANTS when using sysctl -a, but
would also have caused problems for 32-core boxes or if the netisr
protocol vector was fully populated.

Reported by:	nwhitehorn, Neel Natu <neelnatu@gmail.com>
MFC after:	4 days
2010-02-25 09:51:14 +00:00
bz
44a0d7a588 Use the DB_SHOW_ALL_COMMAND() macro to register the formerly 'show ifnets'
in the db_show_all_table as 'show all ifnets' and with that follow the
convention for showing complete lists.

Submitted by:	thompsa
MFC after:	3 days
2010-02-24 15:54:24 +00:00
rwatson
79e4f88402 Fix constant assignment for netisr protocol information sysctl.
MFC after:	1 week
Spotted by:	bz
2010-02-22 16:16:16 +00:00
rwatson
ac0cbafef7 Export netisr configuration and statistics to userspace via sysctl(9).
MFC after:	1 week
Sponsored by:	Juniper Networks
2010-02-22 15:03:16 +00:00
rwatson
e1faba5621 ifconfig(8) expects interface fooX to be supported by the module if_foo,
and will try to load it if it's not present.  To better meet these
expectations, change the module name for the loopback interface from
'loop' to 'if_lo'.  The loopback interface is always compiled into the
base kernel, so there are no resulting changes in kld files, etc.

Discussed with:	brooks (ages ago)
MFC after:	1 week
2010-02-21 15:25:47 +00:00
yongari
55d2bfd121 Add __FBSDID.
Reviewed by:	sam
2010-02-21 00:07:45 +00:00
yongari
453676a091 Add TSO support on VLANs. Intentionally separated IFCAP_VLAN_HWTSO
from IFCAP_VLAN_HWTAGGING. I think some hardwares may be able to
TSO over VLAN without VLAN hardware tagging.
Driver changes and userland support will follow.

Reviewed by:	thompsa
2010-02-20 22:47:20 +00:00
bz
27d5a92985 Start to implement ifnet DDB support:
- 'show ifnets' prints a list of ifnet *s per virtual network stack,
- 'show ifnet <struct ifnet *>' prints fields matching the given ifp.

We do not yet print the complete set of fields and might want to
factor this out to an extra if_debug.c file in case this grows
a lot[1]. We may also want to grow 'show ifnet <if_xname>' support[1].

Sponsored by:	ISPsystem
Suggested by:	rwatson [1]
Reviewed by:	rwatson
MFC after:	5 days
2010-02-20 22:09:48 +00:00
bz
77e8f746fc Enhance a panic string to contain more useful debugging information.
Sponsored by:	ISPsystem
Reviewed by:	rwatson
MFC after:	5 days
2010-02-20 21:43:36 +00:00
jkim
30767ce70b Return partially filled buffer for non-blocking read(2)
in non-immediate mode.

PR:		kern/143855
2010-02-20 00:19:21 +00:00
pjd
12d96a648c Mark various sysctls also as tunables.
Reviewed by:	rwatson
MFC after:	1 week
2010-02-15 09:19:07 +00:00
mlaier
b056acb864 Fix drbr and altq interaction:
- introduce drbr_needs_enqueue that returns whether the interface/br needs
   an enqueue operation: returns true if altq is enabled or there are
   already packets in the ring (as we need to maintain packet order)
 - update all drbr consumers
 - fix drbr_flush
 - avoid using the driver queue (IFQ_DRV_*) in the altq case as the
   multiqueue consumer does not provide enough protection, serialize altq
   interaction with the main queue lock
 - make drbr_dequeue_cond work with altq

Discussed with:		kmacy, yongari, jfv
MFC after:		4 weeks
2010-02-13 16:04:58 +00:00
bz
7b5cc4e51b Add DDB support for printing vnet_sysinit and vnet_sysuninit
ordered call lists. Try to lookup function/symbol names and print
those in addition to the pointers, along with the constants for
subsystem and order.
This is useful for debugging vnet teardown ordering issues.

Make it possible to call the actual printing frunction from normal
code at runtime, ie. from vnet_sysuninit(), if DDB support is there.

Sponsored by:	ISPsystem
MFC After:	8 days
2010-02-09 22:39:34 +00:00
bz
eb28e347bf Add an SDT provider for "vnet"s along with probes for vnet_alloc
and vnet_destroy.
Use the line number rather than NULL as dummy argument.

Note: the fbt provider does not reliably provide :return probes
(depending on optimization levels used at compile time) making
it unusable for scripts to generate complete call-traces with
well defined boundaries over allocations or destructions of
virtual network stacks.

Sponsored by:	ISPsystem
MFC After:	8 days
2010-02-09 22:15:59 +00:00
eri
3c38fdad1e Propagate the vlan eventis to the underlying interfaces/members so they can do initialization of hw related features.
PR:	kern/141646
Reviewed by:	thompsa
Approved by:	thompsa(co-mentor)
MFC after:	2 weeks
2010-02-06 13:49:35 +00:00
zec
393450ca7b Instead of spamming the console on each curvnet recursion event, print
out each such call graph only once, along with a stack backtrace.  This
should make kernels built with VNET_DEBUG reasonably usable again in
busy / production environments.

Introduce a new DDB command "show vnetrcrs" which dumps the whole log
of distinctive curvnet recursion events.  This might be useful when
recursion reports get burried / lost too deep in the message buffer.
In the later case stack backtraces are not available.

Reviewed by:	bz
MFC after:	3 days
2010-02-04 07:55:42 +00:00
hrs
0f402c2331 - Check if_type of "addm <interface>" before setting the
interface's MTU to the if_bridge(4) interface.  This fixes a
  bug that MTU value of "addm <interface>" is used even when it
  is invalid for the if_bridge(4) member:

  # ifconfig bridge0 create
  # ifconfig bridge0
  bridge0: flags=8802<BROADCAST,SIMPLEX,MULTICAST> metric 0 mtu 1500
  ...
  # ifconfig bridge0 addm lo0
  ifconfig: BRDGADD lo0: Invalid argument
  # ifconfig bridge0
  bridge0: flags=8802<BROADCAST,SIMPLEX,MULTICAST> metric 0 mtu 16384
  ...

- Do not ignore MTU value of an interface even when if_type == IFT_GIF.
  This fixes MTU mismatch when an if_bridge(4) interface has a
  gif(4) interface and no other interface as the member, and it
  is directly used for L2 communication with EtherIP tunneling
  enabled.

- Implement SIOCSIFMTU ioctl.  Changing the MTU is allowed only
  when all members have the same MTU value.
2010-01-31 08:16:37 +00:00
delphij
d9a0cd0982 Revised revision 199201 (add interface description capability as inspired
by OpenBSD), based on comments from many, including rwatson, jhb, brooks
and others.

Sponsored by:	iXsystems, Inc.
MFC after:	1 month
2010-01-27 00:30:07 +00:00
syrinx
40d92428fb While flushing the multicast filter of an interface, do not zero the relevant
ifmultiaddr structures' reference to the parent interface, unless the parent
interface is really detaching. While here, program only link layer multicast
filters to a wlan's hardware parent interface.

PR:		kern/142391, kern/142392
Reviewed by:	sam, rpaolo, bms
MFC after:	1 week
2010-01-24 16:17:58 +00:00
thompsa
d0093d69aa Do not hold the lock over if_setlladdr() as it calls into the interface driver
init routine.
2010-01-19 04:29:42 +00:00
thompsa
5056e27c2d Declare a new EVENTHANDLER called iflladdr_event which signals that the L2
address on an interface has changed. This lets stacked interfaces such as
vlan(4) detect that their lower interface has changed and adjust things in
order to keep working. Previously this situation broke at least vlan(4) and
lagg(4) configurations.

The EVENTHANDLER_INVOKE call was not placed within if_setlladdr() due to the
risk of a loop.

PR:		kern/142927
Submitted by:	Nikolay Denev
2010-01-18 20:34:00 +00:00
bz
8458b2914e Correct a typo.
MFC after:	5 days
2010-01-10 12:03:53 +00:00
trasz
2ced506c4b Stop GCC from complaining about lagg_port_checkstacking() being unused. 2010-01-08 16:44:33 +00:00
mbr
7450f52a57 Remove extraneous semicolons, no functional changes.
Submitted by:	Marc Balmer <marc@msys.ch>
MFC after:	1 week
2010-01-07 21:01:37 +00:00
luigi
f1fcae96ad put ip_var before ip_fw_private.h as this will be needed in
the near future
2010-01-07 10:27:52 +00:00
luigi
40024ff7c3 Various cleanup done in ipfw3-head branch including:
- use a uniform mtag format for all packets that exit and re-enter
  the firewall in the middle of a rulechain. On reentry, all tags
  containing reinject info are renamed to MTAG_IPFW_RULE so the
  processing is simpler.

- make ipfw and dummynet use ip_len and ip_off in network format
  everywhere. Conversion is done only once instead of tracking
  the format in every place.

- use a macro FREE_PKT to dispose of mbufs. This eases portability.

On passing i also removed a few typos, staticise or localise variables,
remove useless declarations and other minor things.

Overall the code shrinks a bit and is hopefully more readable.

I have tested functionality for all but ng_ipfw and if_bridge/if_ethersubr.
For ng_ipfw i am actually waiting for feedback from glebius@ because
we might have some small changes to make.
For if_bridge and if_ethersubr feedback would be welcome
(there are still some redundant parts in these two modules that
I would like to remove, but first i need to check functionality).
2010-01-04 19:01:22 +00:00
jhb
26cb0488b3 Use stricter checking to match possible vlan clones by not allowing extra
garbage characters around or within the tag.

Reviewed by:	brooks
MFC after:	3 days
2009-12-31 20:44:38 +00:00
brooks
a5cc24440b The devices that supported EVFILT_NETDEV kqueue filters were removed in
r195175.  Remove all definitions, documentation, and usage.

fifo_misc.c:
	Remove all kqueue tests as fifo_io.c performs all those that
	would have remained.

Reviewed by:	rwatson
MFC after:	3 weeks
X-MFC note:	don't change vlan_link_state() function signature
2009-12-31 20:29:58 +00:00
qingli
c44d3e680a Remove a deleted comment line that was brought back by
my previous commit.

MFC after:	5 days
2009-12-31 01:09:16 +00:00
qingli
ed965a92bc The proxy arp entries could not be added into the system over the
IFF_POINTOPOINT link types. The reason was due to the routing
entry returned from the kernel covering the remote end is of an
interface type that does not support ARP. This patch fixes this
problem by providing a hint to the kernel routing code, which
indicates the prefix route instead of the PPP host route should
be returned to the caller. Since a host route to the local end
point is also added into the routing table, and there could be
multiple such instantiations due to multiple PPP links can be
created with the same local end IP address, this patch also fixes
the loopback route installation failure problem observed prior to
this patch. The reference count of loopback route to local end would
be either incremented or decremented. The first instantiation would
create the entry and the last removal would delete the route entry.

MFC after:	5 days
2009-12-30 21:35:34 +00:00
jhb
3ce93dcb7c Change vlan interfaces to cope more usefully with the parent interface being
renamed.  Previously the vlan interfaces would lose their configuration as if
the parent interface had been physically removed.  Now vlan interfaces ignore
rename events.
- Add a new ifnet flag (IFF_RENAMING) that is set while an ifnet is being
  renamed.  This flag can be checked in ifnet departure/arrival event
  handlers to treat rename events differently.
- Change the ifnet departure event handler in the if_vlan(4) driver to
  ignore departure events due to a trunk interface being renamed.

Reviewed by:	brooks, rwatson
MFC after:	1 week
2009-12-29 13:35:18 +00:00
luigi
483862a5a2 bring in several cleanups tested in ipfw3-head branch, namely:
r201011
- move most of ng_ipfw.h into ip_fw_private.h, as this code is
  ipfw-specific. This removes a dependency on ng_ipfw.h from some files.

- move many equivalent definitions of direction (IN, OUT) for
  reinjected packets into ip_fw_private.h

- document the structure of the packet tags used for dummynet
  and netgraph;

r201049
- merge some common code to attach/detach hooks into
  a single function.

r201055
- remove some duplicated code in ip_fw_pfil. The input
  and output processing uses almost exactly the same code so
  there is no need to use two separate hooks.
  ip_fw_pfil.o goes from 2096 to 1382 bytes of .text

r201057 (see the svn log for full details)
- macros to make the conversion of ip_len and ip_off
  between host and network format more explicit

r201113 (the remaining parts)
- readability fixes -- put braces around some large for() blocks,
  localize variables so the compiler does not think they are uninitialized,
  do not insist on precise allocation size if we have more than we need.

r201119
- when doing a lookup, keys must be in big endian format because
  this is what the radix code expects (this fixes a bug in the
  recently-introduced 'lookup' option)

No ABI changes in this commit.

MFC after:	1 week
2009-12-28 10:47:04 +00:00
rwatson
c7656477f0 When warning about possible netisr configuration problems during boot,
report using "netisr_init" rather than "netisr2", which was the development
name for the project.

MFC after:	3 days
2009-12-23 12:33:59 +00:00
rwatson
0b860a6956 Refine netisr.c comments a bit. 2009-12-23 12:31:27 +00:00
luigi
2043aec456 merge code from ipfw3-head to reduce contention on the ipfw lock
and remove all O(N) sequences from kernel critical sections in ipfw.

In detail:

 1. introduce a IPFW_UH_LOCK to arbitrate requests from
     the upper half of the kernel. Some things, such as 'ipfw show',
     can be done holding this lock in read mode, whereas insert and
     delete require IPFW_UH_WLOCK.

  2. introduce a mapping structure to keep rules together. This replaces
     the 'next' chain currently used in ipfw rules. At the moment
     the map is a simple array (sorted by rule number and then rule_id),
     so we can find a rule quickly instead of having to scan the list.
     This reduces many expensive lookups from O(N) to O(log N).

  3. when an expensive operation (such as insert or delete) is done
     by userland, we grab IPFW_UH_WLOCK, create a new copy of the map
     without blocking the bottom half of the kernel, then acquire
     IPFW_WLOCK and quickly update pointers to the map and related info.
     After dropping IPFW_LOCK we can then continue the cleanup protected
     by IPFW_UH_LOCK. So userland still costs O(N) but the kernel side
     is only blocked for O(1).

  4. do not pass pointers to rules through dummynet, netgraph, divert etc,
     but rather pass a <slot, chain_id, rulenum, rule_id> tuple.
     We validate the slot index (in the array of #2) with chain_id,
     and if successful do a O(1) dereference; otherwise, we can find
     the rule in O(log N) through <rulenum, rule_id>

All the above does not change the userland/kernel ABI, though there
are some disgusting casts between pointers and uint32_t

Operation costs now are as follows:

  Function				Old	Now	  Planned
-------------------------------------------------------------------
  + skipto X, non cached		O(N)	O(log N)
  + skipto X, cached			O(1)	O(1)
XXX dynamic rule lookup			O(1)	O(log N)  O(1)
  + skipto tablearg			O(N)	O(1)
  + reinject, non cached		O(N)	O(log N)
  + reinject, cached			O(1)	O(1)
  + kernel blocked during setsockopt()	O(N)	O(1)
-------------------------------------------------------------------

The only (very small) regression is on dynamic rule lookup and this will
be fixed in a day or two, without changing the userland/kernel ABI

Supported by: Valeria Paoli
MFC after:	1 month
2009-12-22 19:01:47 +00:00
jhb
70c567a753 Remove commented out prototype for ifinit(). This prototype has been
commented out since 1.1 and has not been present in <sys/systm.h> since at
least 1.1 of that file.  It is also not needed in FreeBSD due to SYSINIT().
2009-12-21 20:09:19 +00:00
luigi
c4e6c7a490 Start splitting ip_fw2.c and ip_fw.h into smaller components.
At this time we pull out from ip_fw2.c the logging functions, and
support for dynamic rules, and move kernel-only stuff into
netinet/ipfw/ip_fw_private.h

No ABI change involved in this commit, unless I made some mistake.
ip_fw.h has changed, though not in the userland-visible part.

Files touched by this commit:

conf/files
	now references the two new source files

netinet/ip_fw.h
	remove kernel-only definitions gone into netinet/ipfw/ip_fw_private.h.

netinet/ipfw/ip_fw_private.h
	new file with kernel-specific ipfw definitions

netinet/ipfw/ip_fw_log.c
	ipfw_log and related functions

netinet/ipfw/ip_fw_dynamic.c
	code related to dynamic rules

netinet/ipfw/ip_fw2.c
	removed the pieces that goes in the new files

netinet/ipfw/ip_fw_nat.c
	minor rearrangement to remove LOOKUP_NAT from the
	main headers. This require a new function pointer.

A bunch of other kernel files that included netinet/ip_fw.h now
require netinet/ipfw/ip_fw_private.h as well.
Not 100% sure i caught all of them.

MFC after:	1 month
2009-12-15 16:15:14 +00:00
luigi
003a092f8a Move the scan for max_keylen into route.c::route_init(),
and make max_keylen an argument for rn_init().
This removes an unnecessary dependency on domain.h from radix.c

MFC after:	7 days
2009-12-14 20:12:51 +00:00
bz
932cbdbe4d Throughout the network stack we have a few places of
if (jailed(cred))
left.  If you are running with a vnet (virtual network stack) those will
return true and defer you to classic IP-jails handling and thus things
will be "denied" or returned with an error.

Work around this problem by introducing another "jailed()" function,
jailed_without_vnet(), that also takes vnets into account, and permits
the calls, should the jail from the given cred have its own virtual
network stack.

We cannot change the classic jailed() call to do that,  as it is used
outside the network stack as well.

Discussed with:	julian, zec, jamie, rwatson (back in Sept)
MFC after:	5 days
2009-12-13 13:57:32 +00:00
luigi
29b041fe97 Make the code buildable in userland so it is easier to test it:
this requires a small reordering of headers and a few #defines to
map functions not available in userland.

Remove a useless #ifndef block at the beginning of the file.

Introduce (temporarily) rn_init2(), see the comment in the code
for the proper long term change.

No ABI or functional change.

MFC after:	7 days
2009-12-12 15:49:28 +00:00
luigi
54e485de2b No functional changes (who dares to touch this code!) but:
- cast the result of LEN() to int as this is the main usage.
- use LEN() in one place where it was forgotten.
- Document the use of a static variable in rw mode.

More small changes to follow.

MFC after:	7 days
2009-12-10 10:34:30 +00:00
jhb
1f48c677b5 Remove if_timer/if_watchdog now that they are no longer used. The space
used by if_timer is reserved for expanding if_index to an int in the
future.

Reviewed by:	rwatson, brooks
2009-11-30 21:25:57 +00:00
jkim
2d93b6e424 General style cleanup, no functional change. 2009-11-20 21:12:40 +00:00
jkim
052bc52af7 - Allocate scratch memory on stack instead of pre-allocating it with
the filter as we do from bpf_filter()[1].
- Revert experimental use of contigmalloc(9)/contigfree(9).  It has no
performance benefit over malloc(9)/free(9)[2].

Requested by:	rwatson[1]
Pointed out by:	rwatson, jhb, alc[2]
2009-11-20 18:49:20 +00:00
jkim
8e75a7257b - Change internal function bpf_jit_compile() to return allocated size of
the generated binary and remove page size limitation for userland.
- Use contigmalloc(9)/contigfree(9) instead of malloc(9)/free(9) to make
sure the generated binary aligns properly and make it physically contiguous.
2009-11-18 23:40:19 +00:00
jkim
efc247aeb3 - Make BPF JIT compiler working again in userland. We are limiting size of
generated native binary to page size for now.
- Update copyright date and fix some style nits.
2009-11-18 19:26:17 +00:00
tuexen
7b17948ff8 Fix a LOR showing up with sctp_bsd_addr(): Do not hold a rt lock
when calling rt_newaddrmsg().

Reviewed by: qingli
Approved by: rrs (mentor)
MFC after: 1 month
2009-11-17 12:57:10 +00:00
delphij
8fed657163 Revert revision 199201 for now as it has introduced a kernel vulnerability
and requires more polishing.
2009-11-12 19:02:10 +00:00
delphij
13a19ef806 Add interface description capability as inspired by OpenBSD.
MFC after:	3 months
2009-11-11 21:30:58 +00:00
jhb
892fe839c4 Take a step towards removing if_watchdog/if_timer. Don't explicitly set
if_watchdog/if_timer to NULL/0 when initializing an ifnet.  if_alloc()
sets those members to NULL/0 already.
2009-11-06 14:55:01 +00:00
rwatson
07e1c4bd14 Remove unneeded blank line from bpf_drvinit().
MFC after:	3 days
2009-10-23 17:26:29 +00:00
brueffer
245a0bbb51 Check pointer for NULL before dereferencing it, not after.
PR:		138390
Submitted by:	Patroklos Argyroudis <argp@census-labs.com>
MFC after:	1 week
2009-10-22 06:17:04 +00:00
qingli
e4c3421f52 Verify "smp_started" is true before calling
sched_bind() and sched_unbind().

Reviewed by:	kmacy
MFC after:	3 days
2009-10-22 00:32:01 +00:00
qingli
88bb68ef0f The flow-table function flowtable_route_flush() may be called
during system initialization time. Since the flow-table is
designed to maintain per CPU flow cache, the existing code
did not check whether "smp_started" is true before calling
sched_bind() and sched_unbind(), which triggers a page fault.

Reviewed by:	jeff
MFC after:	immediately
2009-10-20 21:27:03 +00:00
rwatson
d42d258f85 Clean up comments, white space, and style in pfil.c (especially new VNET
bits).

MFC after:	3 days (not VNET bits)
2009-10-19 15:19:14 +00:00
rwatson
dc93fde3ad Remove unused pfil_flags field in packet_filter_hook.
MFC after:	3 days
2009-10-18 22:54:09 +00:00
rwatson
c564f117b3 Sort function prototypes in pfil.h, clean up white space, and better
align fields for printing.

MFC after:	3 days
2009-10-18 22:43:28 +00:00
rwatson
066984a67b Line-wrap pfil.c so that it prints more nicely.
MFC after:	3 days
2009-10-18 11:27:34 +00:00
bz
b0448eafb0 Unbreak the VIMAGE build with IPSEC, broken with r197952 by
virtualizing the pfil hooks.
For consistency add the V_ to virtualize the pfil hooks in here as well.

MFC after:	55 days
X-MFC after:	julian MFCed r197952.
2009-10-14 11:55:55 +00:00
julian
79c1f884ef Virtualize the pfil hooks so that different jails may chose different
packet filters. ALso allows ipfw to be enabled on on ejail and disabled
on another. In 8.0 it's a global setting.

Sitting aroung in tree waiting to commit for: 2 months
MFC after:	2 months
2009-10-11 05:59:43 +00:00
bz
1e3cae3b31 Put #ifdef INET around parts of the FLOWTABLE code, to unbreak
nooptions INET kernel builds.

MFC after:	3 days
X-MFC:		with r197687
2009-10-03 10:56:03 +00:00
qingli
42eac0e4cd The flow-table associates TCP/UDP flows and IP destinations with
specific routes. When the routing table changes, for example,
when a new route with a more specific prefix is inserted into the
routing table, the flow-table is not updated to reflect that change.
As such existing connections cannot take advantage of the new path.
In some cases the path is broken. This patch will update the affected
flow-table entries when a more specific route is added. The route
entry is properly marked when a route is deleted from the table.
In this case, when the flow-table performs a search, the stale
entry is updated automatically. Therefore this patch is not
necessary for route deletion.

Submitted by:	simon, phk
Reviewed by:	bz, kmacy
MFC after:	3 days
2009-10-01 20:32:29 +00:00
qingli
85d603eeff A wrong variable is used when setting up the interface
address route, which broke source address selection in
some code paths.

Submitted by:	noted by bz
Reviewed by:	hrs
MFC after:	immediately
2009-09-20 17:22:19 +00:00
zec
58bb001f4b Style fix - break too long a line in two.
Spotted by:	bz
MFC after:	3 days
2009-09-18 09:03:23 +00:00
zec
02353f41db V_irtualize the lltables list, making ARP and ND reasonably
usable again with options VIMAGE kernels.

Submitted by:	bz (the original version, probably identical to this one)
Reviewed by:	many @ DevSummit Cambridge
MFC after:	3 days
2009-09-17 14:52:15 +00:00
qingli
3a82e44273 Self pointing routes are installed for configured interface addresses
and address aliases. After an interface is brought down and brought
back up again, those self pointing routes disappeared. This patch
ensures after an interface is brought back up, the loopback routes
are reinstalled properly.

Reviewed by:	bz
MFC after:	immediately
2009-09-15 19:18:34 +00:00
rwatson
53eaed07bc Use C99 initialization for struct filterops.
Obtained from:	Mac OS X
Sponsored by:	Apple Inc.
MFC after:	3 weeks
2009-09-12 20:03:45 +00:00
emaste
e750f5d891 Compare pointer with NULL, not 0. 2009-09-09 03:36:43 +00:00
np
ba75578c03 Add arp_update_event. This replaces route_arp_update_event, which
has not worked since the arp-v2 rewrite.

The event handler will be called with the llentry write-locked and
can examine la_flags to determine whether the entry is being added
or removed.

Reviewed by:	gnn, kmacy
Approved by:	gnn (mentor)
MFC after:	1 month
2009-09-08 21:17:17 +00:00
qingli
5b9cf14b54 The addresses that are assigned to the loopback interface
should be part of the kernel routing table.

Reviewed by:	bz
MFC after:	immediately
2009-09-05 20:24:37 +00:00
qingli
0cca60c70d This patch fixes the following issues:
- Interface link-local address is not reachable within the
  node that owns the interface, this is due to the mismatch
  in address scope as the result of the installed interface
  address loopback route. Therefore for each interface
  address loopback route, the rt_gateway field (of AF_LINK
  type) will be used to track which interface a given
  address belongs to. This will aid the address source to
  use the proper interface for address scope/zone validation.
- The loopback address is not reachable. The root cause is
  the same as the above.
- Empty nd6 entries are created for the IPv6 loopback addresses
  only for validation reason. Doing so will eliminate as much
  of the special case (loopback addresses) handling code
  as possible, however, these empty nd6 entries should not
  be returned to the userland applications such as the
  "ndp" command.
Since both of the above issues contain common files, these
files are committed together.

Reviewed by:	bz
MFC after:	immediately
2009-09-05 16:43:16 +00:00
gnn
60eb51e7a3 Add ARP statistics to the kernel and netstat.
New counters now exist for:
requests sent
replies sent
requests received
replies received
packets received
total packets dropped due to no ARP entry
entrys timed out
Duplicate IPs seen

The new statistics are seen in the netstat command
when it is given the -s command line switch.

MFC after:	2 weeks
In collaboration with: bz
2009-09-03 21:10:57 +00:00
qingli
04fc72216c As part of r196609, a call to "rtalloc" did not take the fib into
account. So call the appropriate "rtalloc_ign_fib()" instead of
calling "rtalloc_ign()".

Reviewed by:i	pointed out by bz
MFC after:	immediately
2009-08-31 00:14:37 +00:00
zec
9940277b0b Introduce a separate sx lock for protecting lists of vnet sysinit
and sysuninit handlers.

Previously, sx_vnet, which is a lock designated for protecting
the vnet list, was (ab)used for protecting vnet sysinit / sysuninit
handler lists as well.  Holding exclusively the sx_vnet lock while
invoking sysinit and / or sysuninit handlers turned out to be
problematic, since some of the handlers may attempt to wake up
another thread and wait for it to walk over the vnet list, hence
acquire a shared lock on sx_vnet, which in turn leads to a deadlock.
Protecting vnet sysinit / sysuninit lists with a separate lock
mitigates this issue, which was first observed with
flowtable_flush() / flowtable_cleaner() in sys/net/flowtable.c.

Reviewed by:	rwatson, jhb
MFC after:	3 days
2009-08-28 22:30:55 +00:00
qingli
f92dfc8485 In ip_output(), the flow-table module must not try to cache L2/L3
information for interface of IFF_POINTOPOINT or IFF_LOOPBACK type.
Since the L2 information (rt_lle) is invalid for these interface
types, accidental caching attempt will trigger panic when the invalid
rt_lle reference is accessed.

When installing a new route, or when updating an existing route, the
user supplied gateway address may be an interface address (this is
particularly true for point-to-point interface related modules such
as ppp, if_tun, if_gif). Currently the routing command handler always
set the RTF_GATEWAY flag if the gateway address is given as part of the
command paramters. Therefore the gateway address must be verified against
interface addresses or else the route would be treated as an indirect
route, thus making that route unusable.

Reviewed by:	kmacy, julia, rwatson
Verified by:	marcus
MFC after:	3 days
2009-08-28 07:01:09 +00:00
rwatson
cffae4081c Add IFNET_HOLD reserved pointer value for the ifindex ifnet array,
which allows an index to be reserved for an ifnet without making
the ifnet available for management operations.  Use this in if_alloc()
while the ifnet lock is released between initial index allocation and
completion of ifnet initialization.

Add ifindex_free() to centralize the implementation of releasing an
ifindex value.  Use in if_free() and if_vmove(), as well as when
releasing a held index in if_alloc().

Reviewed by:	bz
MFC after:	3 days
2009-08-26 11:13:10 +00:00
rwatson
8184e58b3a Break out allocation of new ifindex values from if_alloc() and if_vmove(),
and centralize in a single function ifindex_alloc().  Assert the
IFNET_WLOCK, and add missing IFNET_WLOCK in if_alloc().  This does not
close all known races in this code.

Reviewed by:	bz
MFC after:	3 days
2009-08-25 20:21:16 +00:00
rwatson
544dfa0789 Use locks specific to the lltable code, rather than borrow the ifnet
list/index locks, to protect link layer address tables.  This avoids
lock order issues during interface teardown, but maintains the bug that
sysctl copy routines may be called while a non-sleepable lock is held.

Reviewed by:	bz, kmacy
MFC after:	3 days
2009-08-25 09:52:38 +00:00
jfv
91fcf01735 When bridging LRO is causing a problem, the believe
that it would work as long as all interfaces have TSO
seems to be false, until the matter gets sorted out
just disable LRO completely.
2009-08-24 21:04:51 +00:00
rwatson
260dfcf9e9 Make if_grow static -- it's not used outside of if.c, and with the
internals destined to change, it's better if it remains that way.

MFC after:	3 days
2009-08-24 12:52:05 +00:00
zec
47445e571b When moving ifnets from one vnet to another, and the ifnet
has ifaddresses of AF_LINK type which thus have an embedded
if_index "backpointer", we must update that if_index backpointer
to reflect the new if_index that our ifnet just got assigned.

This change affects only options VIMAGE builds.

Submitted by:	bz
Reviewed by:	bz
Approved by:	re (rwatson), julian (mentor)
2009-08-24 10:14:09 +00:00
rwatson
f0ee7a159b Rather than using IFNET_RLOCK() when iterating over (and modifying) the
ifnet list during if_ef load, directly acquire the ifnet_sxlock
exclusively.  That way when if_alloc() recurses the lock, it's a write
recursion rather than a read->write recursion.

This code structure is arguably a bug, so add a comment indicating that
this is the case.  Post-8.0, we should fix this, but this commit
resolves panic-on-load for if_ef.

Discussed with:	bz, julian
Reported by:	phk
MFC after:	3 days
2009-08-23 21:00:21 +00:00
rwatson
ef8d755d4d Rework global locks for interface list and index management, correcting
several critical bugs, including race conditions and lock order issues:

Replace the single rwlock, ifnet_lock, with two locks, an rwlock and an
sxlock.  Either can be held to stablize the lists and indexes, but both
are required to write.  This allows the list to be held stable in both
network interrupt contexts and sleepable user threads across sleeping
memory allocations or device driver interactions.  As before, writes to
the interface list must occur from sleepable contexts.

Reviewed by:	bz, julian
MFC after:	3 days
2009-08-23 20:40:19 +00:00
julian
009dd32306 Don't allow access to the internals until it has all been set up.
Specifically, not until the per-vnet parts have been set up.

Submitted by:	kmacy@
Reviewed by:	julian@, zec@
Approved by:	re(rwatson)
MFC after:	immediately
2009-08-21 09:22:32 +00:00
kmacy
e10b9149e4 This change fixes a comment and addresses a complaint by kib@ by
moving a frequently executed flowtable syslog statement from being
conditional on bootverbose to conditional on a per-vnet flowtable
sysctl.

Approved by:	re@
2009-08-19 20:13:09 +00:00
kmacy
bbd03fa206 - change the interface to flowtable_lookup so that we don't rely on
the mbuf for obtaining the fib index
 - check that a cached flow corresponds to the same fib index as the
   packet for which we are doing the lookup
 - at interface detach time flush any flows referencing stale rtentrys
   associated with the interface that is going away (fixes reported
   panics)
 - reduce the time between cleans in case the cleaner is running at
   the time the eventhandler is called and the wakeup is missed less
   time will elapse before the eventhandler returns
 - separate per-vnet initialization from global initialization
   (pointed out by jeli@)

Reviewed by:	sam@
Approved by:	re@
2009-08-18 20:28:58 +00:00
kmacy
2836450c4c fix netboot issue by disabling flowtable lookups until initialization has been run
Reviewed by:	rwatson@
Approved by:	re@
2009-08-17 19:09:28 +00:00
rwatson
e905082a5a Remove unused if_rawoutput() macro; it has been unused since at least
FreeBSD 2.

Approved by:	re (kib)
2009-08-15 22:26:26 +00:00
zec
51ca260850 Appease VNET_DEBUG - in if_vmove we temporarily switch i.e.
recurse from one vnet to another which is OK, so no need
to flood the console with warnings here.

Approved by:	re (rwatson), julian (mentor)
2009-08-14 22:46:45 +00:00
zec
b464f1dafc Make VNET_DEBUG a standalone compile-time option, i.e. decouple it from
INVARIANTS.

Reviewed by:	bz
Approved by:	re (rwatson), julian (mentor)
2009-08-14 22:41:39 +00:00
bz
5307a46b8b Make it possible to change the vnet sysctl variables on jails
with their own virtual network stack. Jails only inheriting a
network stack cannot change anything that cannot be changed from
within a prison.

Reviewed by:	rwatson, zec
Approved by:	re (kib)
2009-08-13 10:26:34 +00:00
bz
b6a41509df Put multiple instructions into a block when iterating; unbreaks
NET_RT_DUMP, which otherwise only returned information of AF_MAX.
This was broken in r193232 (save your time - my bug, my fix).

PR:		kern/137700
Reported by:	Larry Baird (lab gta.com)
Tested by:	Larry Baird (lab gta.com)
Reviewed by:	zec, lstewart, qing
Approved by:	re (kib)
2009-08-13 09:29:52 +00:00
jkim
cda13d6d86 Always embed pointer to BPF JIT function in BPF descriptor
to avoid inconsistency when opt_bpf.h is not included.

Reviewed by:	rwatson
Approved by:	re (rwatson)
2009-08-12 17:28:53 +00:00
bz
a5f0cffcb4 Update DDB show vnet command to print all used and available information.
Reviewed by:	rwatson, zec
Approved by:	re
2009-08-12 12:00:21 +00:00
bz
d74deee9a1 Put minimum alignment on the dpcpu and vnet section so that ld
when adding the __start_ symbol knows the expected section alignment
and can place the __start_ symbol correctly.

These sections will not support symbols with super-cache line alignment
requirements.

For full details, see posting to freebsd-current, 2009-08-10,
Message-ID: <20090810133111.C93661@maildrop.int.zabbadoz.net>.

Debugging and testing patches by:
		Kamigishi Rei (spambox haruhiism.net),
		np, lstewart, jhb, kib, rwatson
Tested by:	Kamigishi Rei, lstewart
Reviewed by:	kib
Approved by:	re
2009-08-12 10:26:03 +00:00
rwatson
5c6699ad3d Many network stack subsystems use a single global data structure to hold
all pertinent statatistics for the subsystem.  These structures are
sometimes "borrowed" by kernel modules that require a place to store
statistics for similar events.

Add KPI accessor functions for statistics structures referenced by kernel
modules so that they no longer encode certain specifics of how the data
structures are named and stored.  This change is intended to make it
easier to move to per-CPU network stats following 8.0-RELEASE.

The following modules are affected by this change:

      if_bridge
      if_cxgb
      if_gif
      ip_mroute
      ipdivert
      pf

In practice, most of these statistics consumers should, in fact, maintain
their own statistics data structures rather than borrowing structures
from the base network stack.  However, that change is too agressive for
this point in the release cycle.

Reviewed by:	bz
Approved by:	re (kib)
2009-08-02 19:43:32 +00:00
rwatson
2eafab39fe The colour was red as shall be the letters of this warning to people upon
boot if the experimental VIMAGE feature was compiled into the kernel.

Submitted by:	bz
Reviewed by:	zec
Approved by:	re (vimage blanket)
2009-08-01 22:22:45 +00:00
rwatson
3b267ddfaa Minor style tweaks.
Approved by:	re (vimage blanket)
2009-08-01 21:58:32 +00:00
rwatson
648ff24430 Make the vnet alloc/destroy paths a bit easier to followg by merging
vnet_data_init/vnet_data_destroy into vnet_alloc/vnet_destroy.

Reviewed by:	bz, zec
Approved by:	re (vimage blanket)
2009-08-01 21:54:15 +00:00
rwatson
e3917e768d Remove vnet_foreach() utility function, which previously allowed
vnet.c to iterate virtual network stacks without being aware of
the implementation details previously hidden in kern_vimage.c.
Now they are in the same file, so remove this added complexity.

Reviewed by:	bz
Approved by:	re (vimage blanket)
2009-08-01 20:24:45 +00:00
rwatson
fb9ffed650 Merge the remainder of kern_vimage.c and vimage.h into vnet.c and
vnet.h, we now use jails (rather than vimages) as the abstraction
for virtualization management, and what remained was specific to
virtual network stacks.  Minor cleanups are done in the process,
and comments updated to reflect these changes.

Reviewed by:	bz
Approved by:	re (vimage blanket)
2009-08-01 19:26:27 +00:00
rwatson
466a4af8b2 Reorder and recomment vnet.c and vnet.h on the basis that they are no longer
solely about the virtual network stack memory allocator.

Approved by:	re (vimage blanket)
2009-07-30 12:41:19 +00:00
rwatson
c387c55113 Revise header comments for vnet.h as we now implement VNET_SYSINIT, not
just VNET_DEFINE in vnet.h.

Approved by:	re (vimage blanket)
2009-07-28 22:17:34 +00:00
qingli
4092d532fe The new flow table caches both the routing table entry as well as the
L2 information. For an indirect route the cached L2 entry contains the
MAC address of the gateway. Typically the default route is used to
transmit multicast packets when explicit multicast routes are not
available. The ether_output() function bypasses L2 resolution function
if it verifies the L2 cache is valid, because the cached L2 address
(a unicast MAC address) is copied into the packets as the destination
MAC address. This validation, however, does not apply to broadcast and
multicast packets because the destination MAC address is mapped
according to a standard method instead.

Submitted by:	Xin Li
Reviewed by:	bz
Approved by:	re
2009-07-28 17:16:54 +00:00
qingli
8c1899d934 This patch does the following:
- Allow loopback route to be installed for address assigned to
      interface of IFF_POINTOPOINT type.
    - Install loopback route for an IPv4 interface addreess when the
      "useloopback" sysctl variable is enabled. Similarly, install
      loopback route for an IPv6 interface address when the sysctl variable
      "nd6_useloopback" is enabled. Deleting loopback routes for interface
      addresses is unconditional in case these sysctl variables were
      disabled after an interface address has been assigned.

Reviewed by:	bz
Approved by:	re
2009-07-27 17:08:06 +00:00
bz
83f1495433 Update epair(4) to the new netisr implementation and polish
things a bit:
- use dpcpu data to track the ifps with packets queued up,
- per-cpu locking and driver flags
- along with .nh_drainedcpu and NETISR_POLICY_CPU.
- Put the mbufs in flight reference count, preventing interfaces
  from going away, under INVARIANTS as this is a general problem
  of the stack and should be solved in if.c/netisr but still good
  to verify the internal queuing logic.
- Permit changing the MTU to virtually everythinkg like we do for loopback.

Hook epair(4) up to the build.

Approved by:	re (kib)
2009-07-26 12:20:07 +00:00
bz
3aec900b26 Make the in-kernel logic for the SIOCSIFVNET, SIOCSIFRVNET ioctls
(ifconfig ifN (-)vnet <jname|jid>) work correctly.

Move vi_if_move to if.c and split it up into two functions(*),
one for each ioctl.

In the reclaim case, correctly set the vnet before calling if_vmove.

Instead of silently allowing a move of an interface from the current
vnet to the current vnet, return an error. (*)

There is some duplicate interface name checking before actually moving
the interface between network stacks without locking and thus race
prone. Ideally if_vmove will correctly and automagically handle these
in the future.

Suggested by:	rwatson (*)
Approved by:	re (kib)
2009-07-26 11:29:26 +00:00
rwatson
b3be1c6e3b Introduce and use a sysinit-based initialization scheme for virtual
network stacks, VNET_SYSINIT:

- Add VNET_SYSINIT and VNET_SYSUNINIT macros to declare events that will
  occur each time a network stack is instantiated and destroyed.  In the
  !VIMAGE case, these are simply mapped into regular SYSINIT/SYSUNINIT.
  For the VIMAGE case, we instead use SYSINIT's to track their order and
  properties on registration, using them for each vnet when created/
  destroyed, or immediately on module load for already-started vnets.
- Remove vnet_modinfo mechanism that existed to serve this purpose
  previously, as well as its dependency scheme: we now just use the
  SYSINIT ordering scheme.
- Implement VNET_DOMAIN_SET() to allow protocol domains to declare that
  they want init functions to be called for each virtual network stack
  rather than just once at boot, compiling down to DOMAIN_SET() in the
  non-VIMAGE case.
- Walk all virtualized kernel subsystems and make use of these instead
  of modinfo or DOMAIN_SET() for init/uninit events.  In some cases,
  convert modular components from using modevent to using sysinit (where
  appropriate).  In some cases, do minor rejuggling of SYSINIT ordering
  to make room for or better manage events.

Portions submitted by:	jhb (VNET_SYSINIT), bz (cleanup)
Discussed with:		jhb, bz, julian, zec
Reviewed by:		bz
Approved by:		re (VIMAGE blanket)
2009-07-23 20:46:49 +00:00
bz
1f4b104d4d sysctl_msec_to_ticks is used with both virtualized and
non-vrtiualized sysctls so we cannot used one common function.

Add a macro to convert the arg1 in the virtualized case to
vnet.h to not expose the maths to all over the code.

Add a wrapper for the single virtualized call, properly handling
arg1 and call the default implementation from there.

Convert the two over places to use the new macro.

Reviewed by:	rwatson
Approved by:	re (kib)
2009-07-21 21:58:55 +00:00
rwatson
d32e9e9901 Garbage collect vnet module registrations that have neither constructors
nor destructors, as there's no actual work to do.

In most cases, the constructors weren't needed because of the existing
protocol initialization functions run by net_init_domain() as part of
VNET_MOD_NET, or they were eliminated when support for static
initialization of virtualized globals was added.

Garbage collect dependency references to modules without constructors or
destructors, notably VNET_MOD_INET and VNET_MOD_INET6.

Reviewed by:	bz
Approved by:	re (vimage blanket)
2009-07-20 13:55:33 +00:00
rwatson
fb3be5ae64 Add macros VNET_SETNAME and VNET_SYMPREFIX, and expose to userspace if
_WANT_VNET is defined.  This way we don't need separate definitions in
libkvm.

Reviewed by:	bz
Approved by:	re (vimage blanket)
2009-07-20 07:50:50 +00:00
rwatson
80ed051e0c Normalize field naming for struct vnet, fix two debugging printfs that
print them.

Reviewed by:	bz
Approved by:	re (kensmith, kib)
2009-07-19 17:40:45 +00:00
rwatson
6955067932 Reimplement and/or implement vnet list locking by replacing a mostly
unused custom mutex/condvar-based sleep locks with two locks: an
rwlock (for non-sleeping use) and sxlock (for sleeping use).  Either
acquired for read is sufficient to stabilize the vnet list, but both
must be acquired for write to modify the list.

Replace previous no-op read locking macros, used in various places
in the stack, with actual locking to prevent race conditions.  Callers
must declare when they may perform unbounded sleeps or not when
selecting how to lock.

Refactor vnet sysinits so that the vnet list and locks are initialized
before kernel modules are linked, as the kernel linker will use them
for modules loaded by the boot loader.

Update various consumers of these KPIs based on whether they may sleep
or not.

Reviewed by:	bz
Approved by:	re (kib)
2009-07-19 14:20:53 +00:00
jamie
9f81cbd9ec Remove the interim vimage containers, struct vimage and struct procg,
and the ioctl-based interface that supported them.

Approved by:	re (kib), bz (mentor)
2009-07-17 14:48:21 +00:00
rwatson
88f8de4d40 Remove unused VNET_SET() and related macros; only VNET_GET() is
ever actually used.  Rename VNET_GET() to VNET() to shorten
variable references.

Discussed with:	bz, julian
Reviewed by:	bz
Approved by:	re (kensmith, kib)
2009-07-16 21:13:04 +00:00
rwatson
bc565b5b97 Add missing license line for vnet.h, correct white space nit.
Approved by:	re (kensmith) (implicit)
2009-07-15 00:56:15 +00:00